Best Small LLMs in 2026
Top 10 Models Under 10B Parameters
You don't need a datacenter to run powerful AI. These small language models deliver impressive capabilities on phones, laptops, and edge devices. We rank the top 10 by efficiency, benchmarks, and real-world on-device performance.
Where to Run Small LLMs
Mobile & Phone
Run AI inference on iOS/Android. Models: SmolLM2, Gemma 2B, Llama 3.2 3B. Frameworks: MLX, MLC LLM, MediaPipe.
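For a quick feel before you wire up a phone app, mlx-lm (the Python front end to Apple's MLX; on-phone deployment goes through MLX Swift, MLC LLM, or MediaPipe instead) is the shortest path. A minimal sketch, assuming an Apple Silicon Mac, `pip install mlx-lm`, and that the mlx-community 4-bit conversion named below is available:

```python
# Minimal mlx-lm sketch (Apple Silicon only). The model id is an assumed
# mlx-community conversion; substitute any MLX-format repo you prefer.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
reply = generate(model, tokenizer, prompt="Explain quantization in one sentence.", max_tokens=64)
print(reply)
```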
Laptop & Desktop
Consumer hardware with 8-24GB RAM. Models: Phi-3.5, Gemma 2 9B, Qwen 2.5 7B. Tools: Ollama, LM Studio, GPT4All.
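Ollama is the fastest on-ramp here: it serves models behind a local REST API on port 11434. A minimal sketch, assuming Ollama is running and the model has been pulled first (e.g. `ollama pull phi3.5`):

```python
# Query Ollama's local /api/generate endpoint with the standard library only.
import json
import urllib.request

payload = {
    "model": "phi3.5",                      # any tag you've pulled locally
    "prompt": "Summarize attention in two sentences.",
    "stream": False,                        # return one JSON object, not a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```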
Edge & IoT
Raspberry Pi, Jetson Nano, embedded systems. Models: TinyLlama, SmolLM2, Gemma 2B. Frameworks: llama.cpp, ExecuTorch.
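On a Pi or a Jetson, llama.cpp with a 4-bit GGUF file is the usual route. A minimal sketch via the llama-cpp-python bindings (`pip install llama-cpp-python`); the model path is a placeholder for whatever GGUF you download:

```python
# Load a quantized GGUF with llama.cpp's Python bindings and complete a prompt.
from llama_cpp import Llama

llm = Llama(
    model_path="models/smollm2-1.7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,                                             # context window to allocate
)
out = llm("Q: What is a Raspberry Pi?\nA:", max_tokens=48, stop=["Q:"])
print(out["choices"][0]["text"])
```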
Browser-Based
Run inference directly in the browser with WebGPU. Models: SmolLM2, Phi-3 Mini. Frameworks: WebLLM, Transformers.js.
Top 3 Small LLMs
#1: Phi-3.5 Mini (Microsoft)
Best efficiency per parameter. Runs on phones and laptops with surprising quality. 128K context in a 3.8B model.
MMLU 69.0% · Speed ~80 t/s · VRAM 8 GB
#2: Gemma 2 9B (Google)
Best output quality in the small category. Knowledge distillation from a larger teacher model delivers punch above its weight.
MMLU 71.3% · Speed ~45 t/s · VRAM 18 GB
#3: Llama 3.2 3B (Meta AI)
Smallest Llama with 128K context. Optimized for on-device use with strong general capability.
MMLU 63.4% · Speed ~90 t/s · VRAM 6 GB
Complete Top 10
| # | Model | Vendor | Params | MMLU | HumanEval | Memory (FP16) | Speed (4-bit) |
|---|---|---|---|---|---|---|---|
| #1 | Phi-3.5 Mini | Microsoft | 3.8B | 69.0% | 61.0% | 8 GB | ~80 t/s |
| #2 | Gemma 2 9B | Google | 9B | 71.3% | 40.2% | 18 GB | ~45 t/s |
| #3 | Llama 3.2 3B | Meta AI | 3B | 63.4% | 55.0% | 6 GB | ~90 t/s |
| #4 | Qwen 2.5 7B | Alibaba | 7B | 70.0% | 65.0% | 14 GB | ~50 t/s |
| #5 | DeepSeek Coder V2 Lite | DeepSeek | 16B (2.4B active) | 60.0% | 82.0% | 6 GB | ~70 t/s |
| #6 | Mistral 7B v0.3 | Mistral AI | 7B | 64.2% | 32.0% | 14 GB | ~55 t/s |
| #7 | SmolLM2 1.7B | HuggingFace | 1.7B | 50.0% | 30.0% | 3.5 GB | ~120 t/s |
| #8 | Gemma 2 2B | Google | 2B | 52.0% | 28.0% | 4 GB | ~100 t/s |
| #9 | TinyLlama 1.1B | TinyLlama | 1.1B | 35.0% | 18.0% | 2.2 GB | ~150 t/s |
| #10 | Phi-3 Mini (4K) | Microsoft | 3.8B | 68.8% | 58.5% | 8 GB | ~80 t/s |
Speeds were measured with 4-bit quantization on a consumer GPU (RTX 4090 or equivalent); memory figures are approximate FP16 weight requirements.
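Those memory figures are easy to sanity-check: FP16 stores 2 bytes per parameter, and 4-bit quantization roughly a quarter of that. A back-of-envelope helper (weights only; the KV cache and runtime overhead come on top, which is why real requirements land a bit higher):

```python
# Rough weight-only memory estimate: parameter count times bits per parameter.
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1024**3

for name, params in [("Phi-3.5 Mini", 3.8), ("Gemma 2 9B", 9.0), ("Llama 3.2 3B", 3.0)]:
    fp16, q4 = weight_memory_gb(params, 16), weight_memory_gb(params, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")
```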
Detailed Reviews
#1: Phi-3.5 Mini
Microsoft · 3.8B parameters · 128K context
Best efficiency per parameter. Runs on phones and laptops with surprising quality. 128K context in a 3.8B model.
MMLU 69.0% · HumanEval 61.0% · VRAM 8 GB · Speed ~80 t/s
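To reproduce the 4-bit setup from the table on your own GPU, here is a hedged sketch with Hugging Face transformers and bitsandbytes (assumes a CUDA GPU and `pip install transformers accelerate bitsandbytes`):

```python
# Load Phi-3.5 Mini in 4-bit via bitsandbytes and generate a short completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3.5-mini-instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
tok = AutoTokenizer.from_pretrained(model_id)

inputs = tok("The key advantage of small LLMs is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```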
#2: Gemma 2 9B
Google · 9B parameters · 8K context
Best output quality in the small category. Knowledge distillation from a larger teacher model delivers punch above its weight.
MMLU 71.3% · HumanEval 40.2% · VRAM 18 GB · Speed ~45 t/s
#3: Llama 3.2 3B
Meta AI · 3B parameters · 128K context
Smallest Llama with 128K context. Optimized for on-device use with strong general capability.
MMLU 63.4% · HumanEval 55.0% · VRAM 6 GB · Speed ~90 t/s
#4: Qwen 2.5 7B
Alibaba · 7B parameters · 128K context
Strong multilingual small model. Excellent support for Chinese and other Asian languages, with 128K context.
MMLU 70.0% · HumanEval 65.0% · VRAM 14 GB · Speed ~50 t/s
#5: DeepSeek Coder V2 Lite
DeepSeek · 16B (2.4B active) parameters · 128K context
MoE magic: 82% HumanEval with only 2.4B active parameters per token. The best small model for coding (see the toy router sketch below).
MMLU 60.0% · HumanEval 82.0% · VRAM 6 GB · Speed ~70 t/s
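To make "16B total, 2.4B active" concrete: an MoE layer routes each token to its top-k experts, so per-token compute scales with the experts actually selected rather than the full parameter count. A toy top-2 router in numpy (illustrative of the mechanism only, not DeepSeek's actual architecture):

```python
# Toy mixture-of-experts layer: only the top-k expert matrices touch each token.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2                       # hidden size, expert count, top-k
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router                          # router score per expert
    top = np.argsort(logits)[-k:]                # indices of the k best experts
    w = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the winners
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

print(moe_forward(rng.standard_normal(d)).shape)  # 2 of 8 experts did the work
```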
#6: Mistral 7B v0.3
Mistral AI · 7B parameters · 32K context
The OG efficient model and still a solid generalist. v0.1 popularized sliding-window attention; v0.3 trades it for a full 32K context window (see the mask sketch below).
MMLU 64.2% · HumanEval 32.0% · VRAM 14 GB · Speed ~55 t/s
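Sliding-window attention is easy to picture: each token attends only to the previous w positions instead of the whole prefix, turning the causal mask into a band and making attention cost linear in sequence length. A toy mask in numpy (a sketch of the technique Mistral v0.1 popularized, not Mistral's code):

```python
# Build a sliding-window causal mask: token i may attend to tokens (i - w, i].
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    i = np.arange(seq_len)[:, None]              # query positions (rows)
    j = np.arange(seq_len)[None, :]              # key positions (columns)
    return (j <= i) & (j > i - window)           # causal AND inside the window

print(sliding_window_mask(6, 3).astype(int))
```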
#7: SmolLM2 1.7B
HuggingFace · 1.7B parameters · 8K context
Runs on practically anything: phones, Raspberry Pi, even browsers. Excellent for resource-constrained devices.
MMLU 50.0% · HumanEval 30.0% · VRAM 3.5 GB · Speed ~120 t/s
#8: Gemma 2 2B
Google · 2B parameters · 8K context
Google's smallest. Good for simple tasks, classification, and mobile apps, with Google-quality training behind it.
MMLU 52.0% · HumanEval 28.0% · VRAM 4 GB · Speed ~100 t/s
#9: TinyLlama 1.1B
TinyLlama · 1.1B parameters · 2K context
The smallest useful LLM. Great for learning, experimentation, and very basic NLP tasks on any device.
MMLU 35.0% · HumanEval 18.0% · VRAM 2.2 GB · Speed ~150 t/s
#10: Phi-3 Mini (4K)
Microsoft · 3.8B parameters · 4K context
The original Phi-3 with a shorter context but the same efficiency. Perfect when you don't need 128K.
MMLU 68.8% · HumanEval 58.5% · VRAM 8 GB · Speed ~80 t/s
Best overall small model: Phi-3.5 Mini (3.8B) — unmatched efficiency with 128K context. Runs on a laptop and delivers surprising quality.
Best for coding on-device: DeepSeek Coder V2 Lite — 82% HumanEval with only 2.4B active parameters. MoE architecture makes it incredibly efficient.
Best for mobile phones: SmolLM2 1.7B — runs on literally anything. Perfect for mobile apps and very resource-constrained environments.
Compare with bigger models: Phi-3 vs Gemma 2 · Best Open-Source LLMs · Best Code LLMs