How to Run Llama 3 Locally: Complete Guide
Running large language models on your own hardware has never been more accessible. Meta's Llama 3 family — from the nimble 8B parameter variant to the powerful 70B and 405B models — can all be run locally with the right setup. This guide covers everything you need to know to run Llama 3 locally, from hardware requirements to optimization techniques.
Whether you're concerned about data privacy, want to eliminate API costs, or need offline AI capabilities, running Llama 3 on your own machine gives you full control over your AI stack.
Before we dive in, you can explore all Llama 3 variants on LLM Trust to see detailed specs and find the right model for your hardware.
Prerequisites: What You Need
Hardware Requirements
The hardware you need depends heavily on which Llama 3 model you plan to run:
Llama 3.2 1B & 3B (Lightweight)
- RAM: 4GB minimum (8GB recommended)
- GPU: Optional — runs efficiently on CPU
- Storage: 2-8GB
- Best For: Mobile devices, embedded systems, quick prototyping
Llama 3.2 11B Vision
- RAM: 16GB minimum (32GB recommended)
- GPU: 8GB+ VRAM (e.g., RTX 3070, RTX 4060)
- Storage: 20GB
- Best For: Image understanding tasks, multimodal applications
Llama 3.3 70B
- RAM: 64GB minimum (128GB recommended for full precision)
- GPU: 48GB+ VRAM for full precision, or 24GB with 4-bit quantization
- Storage: 140GB
- Best For: Production workloads, complex reasoning, code generation
Llama 3.1 405B
- RAM: 256GB+ system RAM
- GPU: Multiple A100/H100 GPUs, or high-RAM CPU inference
- Storage: 750GB+
- Best For: Research, maximum capability requirements
Software Requirements
- Operating System: Linux (recommended), macOS, or Windows with WSL2
- Python: 3.10 or later
- Git: For cloning repositories
- CUDA Toolkit: 12.x (for NVIDIA GPU acceleration)
- Package Manager: pip, conda, or your system package manager
Verify Your Hardware
Before proceeding, check your available resources:
# Check RAM
free -h
# Check GPU and VRAM (NVIDIA)
nvidia-smi
# Check disk space
df -h
# Check CPU info
lscpu | grep "Model name"
Method 1: Ollama (Easiest Setup)
Ollama is the fastest way to get Llama 3 running locally. It handles model downloading, quantization, and serving in a single tool.
Installation
macOS:
curl -fsSL https://ollama.com/install.sh | sh
Or download the macOS app from ollama.com.
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com.
Download and Run Llama 3
Once Ollama is installed, running Llama 3 is as simple as:
# Llama 3.2 3B (lightweight, great for testing)
ollama run llama3.2:3b
# Llama 3.3 70B (requires significant hardware)
ollama run llama3.3:70b
# Llama 3.2 11B Vision (multimodal)
ollama run llama3.2-vision:11b
Ollama automatically downloads the model on first run. Subsequent runs use the cached model.
Available Llama 3 Models in Ollama
| Model | Size | Download Command |
|---|---|---|
| Llama 3.2 1B | ~1.3GB | ollama run llama3.2:1b |
| Llama 3.2 3B | ~2.0GB | ollama run llama3.2:3b |
| Llama 3.2 11B Vision | ~7.5GB | ollama run llama3.2-vision:11b |
| Llama 3.2 90B Vision | ~52GB | ollama run llama3.2-vision:90b |
| Llama 3.3 70B | ~40GB | ollama run llama3.3:70b |
| Llama 3.1 8B | ~4.7GB | ollama run llama3.1:8b |
| Llama 3.1 70B | ~40GB | ollama run llama3.1:70b |
| Llama 3.1 405B | ~231GB | ollama run llama3.1:405b |
Using the Ollama API
Ollama exposes a local API on port 11434:
# Chat completion
curl http://localhost:11434/api/chat -d '{
"model": "llama3.3:70b",
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms."}
]
}'
# Simple generation
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Write a Python function to calculate fibonacci numbers"
}'
Python Integration
import requests
def chat_with_llama(prompt, model="llama3.3:70b"):
response = requests.post(
"http://localhost:11434/api/chat",
json={
"model": model,
"messages": [{"role": "user", "content": prompt}],
"stream": False
}
)
return response.json()["message"]["content"]
print(chat_with_llama("What are the main benefits of running LLMs locally?"))
Method 2: llama.cpp (Maximum Control)
llama.cpp provides the most control over inference, including advanced quantization options and CPU-optimized execution.
Installation
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build with CUDA support (if you have an NVIDIA GPU)
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release
# Or build for CPU-only
cmake -B build
cmake --build build --config Release
Download Llama 3 Weights
You'll need the model weights in GGUF format. You can download pre-converted weights from Hugging Face:
# Install huggingface-cli
pip install huggingface_hub
# Download Llama 3.3 70B in Q4_K_M quantization (good quality/size balance)
huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF \
--include "Llama-3.3-70B-Instruct-Q4_K_M.gguf" \
--local-dir ./models/
Quantization Options
Quantization reduces model size at the cost of some quality. Common options:
| Quantization | Bits/Weight | Size (70B) | Quality |
|---|---|---|---|
| Q8_0 | 8 | ~70GB | Near-perfect |
| Q6_K | 6 | ~55GB | Excellent |
| Q5_K_M | 5 | ~47GB | Very good |
| Q4_K_M | 4 | ~40GB | Good (recommended) |
| Q3_K_M | 3 | ~32GB | Acceptable |
| Q2_K | 2 | ~25GB | Degraded |
Running Inference
# Basic chat mode
./build/bin/llama-cli -m ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--interactive \
--ctx-size 4096 \
--n-gpu-layers 99
# With specific parameters
./build/bin/llama-server \
-m ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 8192 \
--n-gpu-layers 99 \
--threads 8
Key Parameters Explained
--n-gpu-layers: Number of layers to offload to GPU (99 = all layers)--ctx-size: Context window size (max tokens for input + output)--threads: CPU threads for inference--batch-size: Prompt processing batch size (larger = faster prompt ingestion)
Method 3: Hugging Face Transformers
For integration with existing ML pipelines and research workflows:
Setup
pip install transformers torch accelerate bitsandbytes
Loading Llama 3
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "meta-llama/Llama-3.3-70B-Instruct"
# Load with 4-bit quantization for memory efficiency
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
load_in_4bit=True, # Requires bitsandbytes
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Generate text
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the difference between a CPU and GPU."}
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
top_p=0.9,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Method 4: vLLM (Production Serving)
For production deployments with high throughput:
pip install vllm
# Start the server
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
vLLM provides an OpenAI-compatible API at http://localhost:8000/v1.
Configuration and Optimization
Prompt Templates
Llama 3 uses a specific chat template. The correct format:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
What is machine learning?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Most tools (Ollama, Transformers) apply this template automatically.
Memory Optimization Techniques
1. Quantization — The most effective technique. Q4_K_M reduces memory by ~75% with minimal quality loss.
2. KV Cache Optimization:
# In llama.cpp
--cache-type-k q8_0 # Quantize KV cache to 8-bit
3. Context Length Management:
# Use smaller context if you don't need full 128K
--ctx-size 4096 # Instead of 128K
4. Flash Attention:
# In vLLM or Transformers, flash attention is typically enabled by default
# In llama.cpp
--flash-attn
Performance Benchmarks (llama.cpp, RTX 4090)
| Model | Quantization | Tokens/sec | VRAM Usage |
|---|---|---|---|
| Llama 3.2 3B | Q4_K_M | ~120 | ~2.5GB |
| Llama 3.1 8B | Q4_K_M | ~65 | ~5GB |
| Llama 3.3 70B | Q4_K_M | ~12 | ~40GB |
| Llama 3.3 70B | Q4_K_M (partial GPU) | ~6 | ~22GB |
Troubleshooting
Common Issues and Solutions
"CUDA out of memory"
# Reduce GPU layers or use more aggressive quantization
--n-gpu-layers 40 # Offload fewer layers to GPU
# Or use a smaller quantization
# Q4_K_M → Q3_K_M or Q2_K
"Model file not found"
- Ensure the path to your GGUF file is correct
- For Ollama: try
ollama pull llama3.3:70bexplicitly - For Hugging Face: verify you have access (Llama models require approval)
Slow inference on CPU
# Ensure you're using all CPU threads
--threads $(nproc)
# Use optimized quantization for CPU
# Q4_0 and Q5_0 are faster on CPU than K-quants
Garbage output
- Verify you're using the correct prompt template
- Check that the model file isn't corrupted (compare SHA256)
- Try a higher quantization level
Ollama not using GPU
# Check if Ollama detects your GPU
ollama ps
# Force GPU usage
CUDA_VISIBLE_DEVICES=0 ollama run llama3.3:70b
Platform-Specific Notes
macOS (Apple Silicon):
- Ollama and llama.cpp both support Metal acceleration
- M3 Max with 128GB RAM can run 70B models effectively
- Memory bandwidth is the bottleneck, not compute
Windows:
- WSL2 is recommended for most tools
- Direct Windows support varies by tool
- Ensure CUDA drivers are installed in WSL2 if using GPU
Linux:
- Best overall performance and compatibility
- Ensure NVIDIA drivers and CUDA are properly installed
- Consider
nvidia-persistencedfor consistent GPU performance
Complete Workflow Summary
Here's the recommended path for most users:
- Start with Ollama: Fastest way to test if Llama 3 works for your use case
- Benchmark: Measure quality and speed with your specific prompts
- Optimize: Adjust quantization, context length, and GPU offloading
- Scale Up: Move to vLLM or TGI for production serving
- Fine-tune: Use LoRA/QLoRA to customize for your domain
Conclusion
Running Llama 3 locally gives you complete control over your AI infrastructure. Whether you choose the simplicity of Ollama, the flexibility of llama.cpp, or the production-readiness of vLLM, you now have the tools to deploy powerful AI on your own terms.
The key is matching your hardware to the right model size and quantization level. Start small, benchmark thoroughly, and scale up as needed.
Want to compare Llama 3 against other models? Use our comparison tool to see how Llama 3 stacks up against GPT-4, Claude, Mistral, and more.
Looking for the right model for your setup? Browse our model catalog filtered by hardware requirements, or sign up for LLM Trust to get personalized recommendations based on your specific needs.