# Run Claude Code Locally: Complete Guide to ANTHROPIC_BASE_URL with Ollama & More
Claude Code has quickly become a go-to AI coding assistant for developers, but it has one significant dependency: the cloud. Every request hits Anthropic's API, which means you need an internet connection, you're sharing code with a third party, and you're paying per token.
What if you could run Claude Code entirely on your local machine using open-source LLMs? Thanks to the `ANTHROPIC_BASE_URL` environment variable, you can point Claude Code at local model servers like Ollama, LM Studio, llama.cpp, or vLLM—effectively turning your workstation into a private AI coding assistant.
Here's everything you need to know to get started.
## Understanding ANTHROPIC_BASE_URL: The Gateway to Local LLMs
The `ANTHROPIC_BASE_URL` environment variable overrides the default Anthropic API endpoint. Instead of sending requests to `https://api.anthropic.com`, Claude Code sends them to any server you name, provided that server implements an Anthropic-compatible API (the Messages API format).
This is where local LLM servers come in. Recent versions of tools like Ollama, LM Studio, llama.cpp, and vLLM offer Anthropic-compatible endpoints, meaning they accept the request format Claude Code sends and return responses in the expected structure. Support is new and varies by release, so check each tool's documentation for the version you have installed.
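To make the redirect concrete, here is a minimal sketch of the kind of request an Anthropic-style client sends. It assumes the server exposes the Messages route at `/v1/messages` under the base URL; the helper names `messages_url` and `probe_messages_endpoint` are hypothetical, and the probe is only defined here, not run.

```bash
# Build the Messages endpoint URL from a base URL, dropping any trailing slash.
messages_url() {
  local base="${1%/}"
  printf '%s/v1/messages\n' "$base"
}

# Send one minimal Messages-style request (hypothetical helper; call it by hand).
# Usage: probe_messages_endpoint http://localhost:11434 llama3.1:70b
probe_messages_endpoint() {
  local base="$1" model="$2"
  curl -sS "$(messages_url "$base")" \
    -H "content-type: application/json" \
    -H "x-api-key: ${ANTHROPIC_API_KEY:-local}" \
    -H "anthropic-version: 2023-06-01" \
    -d "{\"model\": \"$model\", \"max_tokens\": 64, \"messages\": [{\"role\": \"user\", \"content\": \"Say hello\"}]}"
}

# Show where requests for a given base URL would land.
messages_url "http://localhost:11434/"
```

If the probe returns a JSON message rather than a 404, the server is likely speaking the format Claude Code expects.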
The key requirement: your local model needs at least a 32K context window, and the server must be configured to actually provide it, since some servers default to a much smaller window. Claude Code's agentic workflows (reading files, searching code, making edits) require substantial context to hold conversation state and file contents. Models with smaller context windows will fail or produce degraded results.
**Popular local models that meet the 32K requirement:**
- Llama 3.1 70B (128K context)
- Qwen 2.5 32B/72B (32K-128K context)
- DeepSeek Coder V2 (128K context)
- Mixtral 8x7B (32K context)
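For a feel of how quickly 32K tokens fills up, the sketch below estimates token counts from file sizes. It assumes roughly 4 bytes of source text per token, which is a common rule of thumb rather than a real tokenizer; the helper names are hypothetical.

```bash
# Rough token estimate: ~4 bytes of text per token (rule of thumb, not a tokenizer).
estimate_tokens() {
  local bytes="$1"
  echo $(( bytes / 4 ))
}

# Total size in bytes of the files passed as arguments.
total_bytes() {
  cat -- "$@" | wc -c | tr -d ' '
}

# 128 KiB of source text is roughly an entire 32K-token window.
estimate_tokens 131072   # prints 32768
```

For example, `estimate_tokens "$(total_bytes src/*.py)"` gives a ballpark for how much of the window a set of files would occupy before any conversation history is added.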
## Setting Up Local LLM Backends
### Option 1: Ollama (Easiest)
Ollama is the simplest path to running local models. It handles model downloads and serving, and recent releases include Anthropic API compatibility.
```bash
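# Install Ollama (Linux script; on macOS, use the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a capable model
ollama pull llama3.1:70b
# Start Ollama (runs on localhost:11434 by default).
# OLLAMA_CONTEXT_LENGTH raises the context window, which otherwise defaults
# to a value well below 32K.
OLLAMA_CONTEXT_LENGTH=32768 ollama serve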
```
Then configure Claude Code:
```bash
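# Use the server root: Claude Code appends /v1/messages to the base URL itself
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_API_KEY=ollama # Any non-empty string works
# Run Claude Code (the CLI command is `claude`) against the model you pulled
claude --model llama3.1:70b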
```
### Option 2: LM Studio (Best for GUI Users)
LM Studio provides a polished desktop interface for running local models with zero command-line knowledge required.
1. Download LM Studio from [lmstudio.ai](https://lmstudio.ai)
2. Search and download a model (e.g., "Llama 3.1 70B")
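3. Open the local server view (labeled "Local Server" or "Developer", depending on your version)
4. If your version shows an Anthropic API compatibility setting, enable it; check the LM Studio documentation for whether your release serves Anthropic-style requests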
5. Click "Start Server" (default port: 1234)
Configure Claude Code:
```bash
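# Use the server root: Claude Code appends /v1/messages to the base URL itself
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_API_KEY=lm-studio
# The CLI command is `claude`; add --model with the identifier LM Studio
# shows for your loaded model if the default name is rejected
claude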
```
### Option 3: llama.cpp (Most Control)
For maximum performance tuning, llama.cpp gives you low-level control over quantization, batch sizes, and hardware acceleration.
```bash
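# Clone and build llama.cpp (current releases build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Download a GGUF model
wget https://huggingface.co/.../model.gguf
# Start the server (the binary is now named llama-server).
# Assumption: your build serves Anthropic-style requests; check the server
# README for your version, since this support is recent.
./build/bin/llama-server -m model.gguf --port 8080 --ctx-size 32768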
```
Configure Claude Code:
```bash
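# Use the server root: Claude Code appends /v1/messages to the base URL itself
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=llama-cpp
# The CLI command is `claude`
claude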
```
### Option 4: vLLM (Production-Grade Serving)
vLLM is optimized for high-throughput model serving with advanced features like continuous batching and PagedAttention.
```bash
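# Install vLLM
pip install vllm
# Serve a model (listens on port 8000 by default).
# Assumption: your vLLM version serves Anthropic-style requests; check its
# release notes, since older versions only expose the OpenAI-style API.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --api-key vllm-key \
  --max-model-len 32768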
```
Configure Claude Code:
```bash
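# Use the server root: Claude Code appends /v1/messages to the base URL itself
export ANTHROPIC_BASE_URL=http://localhost:8000
# vLLM checks --api-key as a bearer token, which ANTHROPIC_AUTH_TOKEN sends
export ANTHROPIC_AUTH_TOKEN=vllm-key
# The CLI command is `claude`; the model name must match the served model
claude --model meta-llama/Llama-3.1-70B-Instruct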
```
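The four configurations differ mainly in port and key, so switching between them can be wrapped in a small helper. This is a sketch only: `local_base_url` and `use_local_backend` are hypothetical names, the ports are the defaults used in this guide, and the base URL is given without a path on the assumption that the client appends the route itself.

```bash
# Map a backend name to the default local address used in this guide.
local_base_url() {
  case "$1" in
    ollama)   echo "http://localhost:11434" ;;
    lmstudio) echo "http://localhost:1234" ;;
    llamacpp) echo "http://localhost:8080" ;;
    vllm)     echo "http://localhost:8000" ;;
    *)        echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Export the variables for the chosen backend (hypothetical helper).
# The second argument is an optional key; any placeholder works for servers
# that do not check it.
use_local_backend() {
  local url
  url="$(local_base_url "$1")" || return 1
  export ANTHROPIC_BASE_URL="$url"
  export ANTHROPIC_API_KEY="${2:-local}"
}

use_local_backend ollama && echo "$ANTHROPIC_BASE_URL"
```

Dropping these functions into your shell profile lets you run `use_local_backend lmstudio` before launching Claude Code, and `unset ANTHROPIC_BASE_URL ANTHROPIC_API_KEY` returns you to the cloud default.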
## Performance Considerations and Trade-offs
### Hardware Requirements
Running capable models locally is resource-intensive:
- **70B parameter models**: 40GB+ VRAM at 4-bit quantization (an 80GB A100 or H100, or multiple consumer GPUs)
- **32B parameter models**: 20GB+ VRAM (RTX 4090, A6000)
- **7-13B parameter models**: 8-16GB VRAM (RTX 3080, 4070 Ti)
Quantization helps: a Q4 quantized 70B model can run on ~40GB VRAM instead of the roughly 140GB its weights need at 16-bit precision.
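The arithmetic behind those figures is simple enough to sketch: weight memory is roughly parameter count times bits per weight, divided by 8 bits per byte. The helper name `weights_gb` is hypothetical, and the result covers weights only, not the KV cache or runtime buffers.

```bash
# Approximate weight memory in GB: billions of parameters * bits per weight / 8.
weights_gb() {
  local params_billion="$1" bits="$2"
  echo $(( params_billion * bits / 8 ))
}

weights_gb 70 16   # prints 140 (16-bit weights)
weights_gb 70 4    # prints 35  (4-bit quantization)
weights_gb 32 4    # prints 16
```

The gap between 35GB of weights and the ~40GB quoted above is the headroom needed for context (the KV cache grows with context length) and other runtime overhead.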
### Speed vs. Quality
Local models are typically slower than Anthropic's API. Rough figures, which vary widely with hardware and quantization:
- **Cloud Claude Sonnet**: ~50-100 tokens/second
- **Local 70B (well-equipped GPU)**: ~10-30 tokens/second
- **Local 32B**: ~20-50 tokens/second
Smaller models respond faster but produce lower-quality code. The sweet spot for most developers is a well-quantized 32B-70B parameter model.
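To translate throughput into waiting time, divide the response length by the generation rate. The sketch below uses a hypothetical `generation_seconds` helper with integer arithmetic and ignores prompt processing time, which can be significant for long contexts.

```bash
# Seconds to generate a response: tokens / tokens-per-second (integer division).
generation_seconds() {
  local tokens="$1" tokens_per_second="$2"
  echo $(( tokens / tokens_per_second ))
}

generation_seconds 1000 20    # prints 50 (local 70B, mid-range figure)
generation_seconds 1000 100   # prints 10 (upper end of the cloud figure)
```

A 1,000-token edit that takes about 10 seconds in the cloud could take closer to a minute locally, which adds up over an agentic session with many turns.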
### When to Use Local LLMs
**Use local when you need:**
- Complete privacy (proprietary code, regulated industries)
- Offline functionality (air-gapped environments, travel)
- Cost control (heavy usage, experimentation)
- Custom fine-tuned models
**Stick with cloud when you need:**
- Maximum code quality (Anthropic's models are still state-of-the-art)
- Faster responses (cloud infrastructure wins on speed)
- No hardware investment (API is pay-as-you-go)
## The Takeaway
The `ANTHROPIC_BASE_URL` trick democratizes AI coding assistants. You're no longer locked into cloud providers—you can run Claude Code entirely on your terms, with your hardware, using open-source models.
Start with Ollama for simplicity, graduate to llama.cpp for performance tuning, or deploy vLLM for production serving. Just make sure your model has at least 32K context, and you'll have a capable local coding assistant.
The era of truly private, offline AI development tools has arrived. Time to put that GPU to work.