Best Small LLMs in 2026

Top 10 Models Under 10B Parameters

You don't need a datacenter to run powerful AI. These small language models deliver impressive capabilities on phones, laptops, and edge devices. We rank the top 10 by efficiency, benchmarks, and real-world on-device performance.

Where to Run Small LLMs

Mobile & Phone

Run AI inference on iOS/Android. Models: SmolLM2, Gemma 2B, Llama 3.2 3B. Frameworks: MLX, MLC LLM, MediaPipe.

Laptop & Desktop

Consumer hardware with 8-24GB RAM. Models: Phi-3.5, Gemma 2 9B, Qwen 2.5 7B. Tools: Ollama, LM Studio, GPT4All.

Edge & IoT

Raspberry Pi, Jetson Nano, embedded systems. Models: TinyLlama, SmolLM2, Gemma 2B. Frameworks: llama.cpp, ExecuTorch.

Browser-Based

Run inference directly in the browser with WebGPU. Models: SmolLM2, Phi-3 Mini. Frameworks: WebLLM, Transformers.js.

Top 3 Small LLMs

🥇 Phi-3.5 Mini (3.8B)

Best efficiency per parameter. Runs on phones and laptops with surprising quality. 128K context in a 3.8B model.

MMLU: 69.0% · Speed: ~80 t/s · VRAM: 8 GB

🥈 Gemma 2 9B

Best output quality in the small category. Knowledge distillation from Gemini delivers punch above its weight.

MMLU: 71.3% · Speed: ~45 t/s · VRAM: 18 GB

🥉 Llama 3.2 3B

Smallest Llama with 128K context. Optimized for on-device use with strong general capability.

MMLU: 63.4% · Speed: ~90 t/s · VRAM: 6 GB

Complete Top 10

| #  | Model                  | Maker       | Params            | MMLU  | HumanEval | Memory | Speed    |
|----|------------------------|-------------|-------------------|-------|-----------|--------|----------|
| 1  | Phi-3.5 Mini           | Microsoft   | 3.8B              | 69.0% | 61.0%     | 8 GB   | ~80 t/s  |
| 2  | Gemma 2 9B             | Google      | 9B                | 71.3% | 40.2%     | 18 GB  | ~45 t/s  |
| 3  | Llama 3.2 3B           | Meta AI     | 3B                | 63.4% | 55.0%     | 6 GB   | ~90 t/s  |
| 4  | Qwen 2.5 7B            | Alibaba     | 7B                | 70.0% | 65.0%     | 14 GB  | ~50 t/s  |
| 5  | DeepSeek Coder V2 Lite | DeepSeek    | 16B (2.4B active) | 60.0% | 82.0%     | 6 GB   | ~70 t/s  |
| 6  | Mistral 7B v0.3        | Mistral AI  | 7B                | 64.2% | 32.0%     | 14 GB  | ~55 t/s  |
| 7  | SmolLM2 1.7B           | HuggingFace | 1.7B              | 50.0% | 30.0%     | 3.5 GB | ~120 t/s |
| 8  | Gemma 2 2B             | Google      | 2B                | 52.0% | 28.0%     | 4 GB   | ~100 t/s |
| 9  | TinyLlama 1.1B         | TinyLlama   | 1.1B              | 35.0% | 18.0%     | 2.2 GB | ~150 t/s |
| 10 | Phi-3 Mini (4K)        | Microsoft   | 3.8B              | 68.8% | 58.5%     | 8 GB   | ~80 t/s  |

Speeds measured with 4-bit quantization on a consumer GPU (RTX 4090 or equivalent); memory figures are the FP16 weight requirement.
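Those memory figures follow directly from the parameter counts: FP16 stores 2 bytes per parameter. A quick sketch of the arithmetic (weight-only, decimal GB; a rough estimate rather than an exact requirement):

```python
# The table's "Memory" column is essentially parameters x 2 bytes (FP16).
# 4-bit quantization stores ~0.5 bytes per parameter; quantization scales
# and runtime overhead are ignored here, so treat these as lower bounds.

def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight-only memory in (decimal) gigabytes."""
    return round(params_billions * bytes_per_param, 1)

# FP16, matching the table:
print(weight_memory_gb(3.8, 2.0))   # Phi-3.5 Mini -> 7.6 (listed as 8 GB)
print(weight_memory_gb(9.0, 2.0))   # Gemma 2 9B   -> 18.0
# Same model after 4-bit quantization:
print(weight_memory_gb(3.8, 0.5))   # -> 1.9
```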

Detailed Reviews

#1 Phi-3.5 Mini

Microsoft · 3.8B parameters · 128K context

Best efficiency per parameter. Runs on phones and laptops with surprising quality. 128K context in a 3.8B model.

MMLU: 69.0% · HumanEval: 61.0% · VRAM: 8 GB · Speed: ~80 t/s

#2 Gemma 2 9B

Google · 9B parameters · 8K context

Best output quality in the small category. Knowledge distillation from Gemini delivers punch above its weight.

MMLU: 71.3% · HumanEval: 40.2% · VRAM: 18 GB · Speed: ~45 t/s

#3 Llama 3.2 3B

Meta AI · 3B parameters · 128K context

Smallest Llama with 128K context. Optimized for on-device use with strong general capability.

MMLU: 63.4% · HumanEval: 55.0% · VRAM: 6 GB · Speed: ~90 t/s

#4 Qwen 2.5 7B

Alibaba · 7B parameters · 128K context

Strong multilingual small model. Excellent Chinese/Asian language support with 128K context.

MMLU: 70.0% · HumanEval: 65.0% · VRAM: 14 GB · Speed: ~50 t/s

#5 DeepSeek Coder V2 Lite

DeepSeek · 16B (2.4B active) parameters · 128K context

MoE magic: 82% HumanEval with only 2.4B active parameters. Best small model for coding.

MMLU: 60.0% · HumanEval: 82.0% · VRAM: 6 GB · Speed: ~70 t/s

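What the "2.4B active" figure buys can be seen with quick arithmetic: a mixture-of-experts model routes each token through a small subset of its weights, so per-token compute tracks the active count, not the total. A rough sketch using the common ~2-FLOPs-per-active-parameter-per-token approximation (illustrative, not DeepSeek's published accounting):

```python
# DeepSeek Coder V2 Lite routes each token through ~2.4B of its 16B total
# parameters. The 2-FLOPs-per-active-parameter-per-token figure is a
# common rough approximation, not an exact cost model.

TOTAL_PARAMS_B = 16.0   # total parameters (billions)
ACTIVE_PARAMS_B = 2.4   # parameters used per token (billions)

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
flops_per_token_moe = 2 * ACTIVE_PARAMS_B * 1e9
flops_per_token_dense = 2 * TOTAL_PARAMS_B * 1e9   # hypothetical dense 16B

print(f"{active_fraction:.0%} of weights active per token")
print(f"{flops_per_token_dense / flops_per_token_moe:.1f}x less compute than a dense 16B")
```

That is why it decodes at ~70 t/s despite a 16B total parameter count.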
#6 Mistral 7B v0.3

Mistral AI · 7B parameters · 32K context

The OG efficient model. Still great for general tasks, with sliding-window attention for long context.

MMLU: 64.2% · HumanEval: 32.0% · VRAM: 14 GB · Speed: ~55 t/s

#7 SmolLM2 1.7B

HuggingFace · 1.7B parameters · 8K context

Runs on literally anything. Phones, Raspberry Pi, browsers. Incredible for resource-constrained devices.

MMLU: 50.0% · HumanEval: 30.0% · VRAM: 3.5 GB · Speed: ~120 t/s

#8 Gemma 2 2B

Google · 2B parameters · 8K context

Google's smallest. Good for simple tasks, classification, and mobile apps with Google-quality training.

MMLU: 52.0% · HumanEval: 28.0% · VRAM: 4 GB · Speed: ~100 t/s

#9 TinyLlama 1.1B

TinyLlama · 1.1B parameters · 2K context

Smallest useful LLM. Great for learning, experimentation, and very basic NLP tasks on any device.

MMLU: 35.0% · HumanEval: 18.0% · VRAM: 2.2 GB · Speed: ~150 t/s

#10 Phi-3 Mini (4K)

Microsoft · 3.8B parameters · 4K context

Original Phi-3 with shorter context but the same efficiency. Perfect when you don't need 128K.

MMLU: 68.8% · HumanEval: 58.5% · VRAM: 8 GB · Speed: ~80 t/s

Our Recommendation

Best overall small model: Phi-3.5 Mini (3.8B) — unmatched efficiency with 128K context. Runs on a laptop and delivers surprising quality.

Best for coding on-device: DeepSeek Coder V2 Lite — 82% HumanEval with only 2.4B active parameters. MoE architecture makes it incredibly efficient.

Best for mobile phones: SmolLM2 1.7B — runs on literally anything. Perfect for mobile apps and very resource-constrained environments.
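To turn the throughput figures into wall-clock feel, divide reply length by tokens per second. A quick sketch (the 400-token reply length is an illustrative assumption; throughputs are this article's 4-bit figures):

```python
# Translating tokens/second into generation time for a chat-style reply.
# Ignores prompt processing; the 400-token length is an assumption.

def seconds_for_reply(tokens: int, tokens_per_sec: float) -> float:
    """Pure generation time in seconds, rounded to one decimal."""
    return round(tokens / tokens_per_sec, 1)

REPLY_TOKENS = 400  # roughly a few paragraphs

print(seconds_for_reply(REPLY_TOKENS, 80))   # Phi-3.5 Mini -> 5.0 s
print(seconds_for_reply(REPLY_TOKENS, 45))   # Gemma 2 9B   -> 8.9 s
print(seconds_for_reply(REPLY_TOKENS, 150))  # TinyLlama    -> 2.7 s
```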

Compare with bigger models: Phi-3 vs Gemma 2 · Best Open-Source LLMs · Best Code LLMs


Last updated: March 12, 2026 · Benchmarks from official reports and community evaluations