Best Small LLMs in 2026
Top 10 Models Under 10B Parameters
You don't need a datacenter to run powerful AI. These small language models deliver impressive capabilities on phones, laptops, and edge devices. We rank the top 10 by efficiency, benchmarks, and real-world on-device performance.
Where to Run Small LLMs
Mobile & Phone
Run AI inference on iOS/Android. Models: SmolLM2, Gemma 2B, Llama 3.2 3B. Frameworks: MLX, MLC LLM, MediaPipe.
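For a quick feel before you wire up a phone app, mlx-lm (the Python front end to Apple's MLX; on-phone deployment goes through MLX Swift, MLC LLM, or MediaPipe instead) is the shortest path. A minimal sketch, assuming an Apple Silicon Mac, `pip install mlx-lm`, and that the mlx-community 4-bit conversion named below is available:

```python
# Minimal mlx-lm sketch (Apple Silicon only). The model id is an assumed
# mlx-community conversion; substitute any MLX-format repo you prefer.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
reply = generate(model, tokenizer, prompt="Explain quantization in one sentence.", max_tokens=64)
print(reply)
```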
Laptop & Desktop
Consumer hardware with 8-24GB RAM. Models: Phi-3.5, Gemma 2 9B, Qwen 2.5 7B. Tools: Ollama, LM Studio, GPT4All.
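Ollama is the fastest on-ramp here: it serves models behind a local REST API on port 11434. A minimal sketch, assuming Ollama is running and the model has been pulled first (e.g. `ollama pull phi3.5`):

```python
# Query Ollama's local /api/generate endpoint with the standard library only.
import json
import urllib.request

payload = {
    "model": "phi3.5",                      # any tag you've pulled locally
    "prompt": "Summarize attention in two sentences.",
    "stream": False,                        # return one JSON object, not a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```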
Edge & IoT
Raspberry Pi, Jetson Nano, embedded systems. Models: TinyLlama, SmolLM2, Gemma 2B. Frameworks: llama.cpp, ExecuTorch.
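On a Pi or a Jetson, llama.cpp with a 4-bit GGUF file is the usual route. A minimal sketch via the llama-cpp-python bindings (`pip install llama-cpp-python`); the model path is a placeholder for whatever GGUF you download:

```python
# Load a quantized GGUF with llama.cpp's Python bindings and complete a prompt.
from llama_cpp import Llama

llm = Llama(
    model_path="models/smollm2-1.7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,                                             # context window to allocate
)
out = llm("Q: What is a Raspberry Pi?\nA:", max_tokens=48, stop=["Q:"])
print(out["choices"][0]["text"])
```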
Browser-Based
Run inference directly in the browser with WebGPU. Models: SmolLM2, Phi-3 Mini. Frameworks: WebLLM, Transformers.js.
Top 3 Small LLMs
#1: Phi-3.5 Mini (Microsoft)
Best efficiency per parameter. Runs on phones and laptops with surprising quality. 128K context in a 3.8B model.
MMLU 69.0% · Speed ~80 t/s · VRAM 8 GB
#2: Gemma 2 9B (Google)
Best output quality in the small category. Knowledge distillation from a larger teacher model delivers punch above its weight.
MMLU 71.3% · Speed ~45 t/s · VRAM 18 GB
#3: Llama 3.2 3B (Meta AI)
Smallest Llama with 128K context. Optimized for on-device use with strong general capability.
MMLU 63.4% · Speed ~90 t/s · VRAM 6 GB
Complete Top 10
| # | Model | Vendor | Params | MMLU | HumanEval | Memory (FP16) | Speed (4-bit) |
|---|---|---|---|---|---|---|---|
| #1 | Phi-3.5 Mini | Microsoft | 3.8B | 69.0% | 61.0% | 8 GB | ~80 t/s |
| #2 | Gemma 2 9B | Google | 9B | 71.3% | 40.2% | 18 GB | ~45 t/s |
| #3 | Llama 3.2 3B | Meta AI | 3B | 63.4% | 55.0% | 6 GB | ~90 t/s |
| #4 | Qwen 2.5 7B | Alibaba | 7B | 70.0% | 65.0% | 14 GB | ~50 t/s |
| #5 | DeepSeek Coder V2 Lite | DeepSeek | 16B (2.4B active) | 60.0% | 82.0% | 6 GB | ~70 t/s |
| #6 | Mistral 7B v0.3 | Mistral AI | 7B | 64.2% | 32.0% | 14 GB | ~55 t/s |
| #7 | SmolLM2 1.7B | HuggingFace | 1.7B | 50.0% | 30.0% | 3.5 GB | ~120 t/s |
| #8 | Gemma 2 2B | Google | 2B | 52.0% | 28.0% | 4 GB | ~100 t/s |
| #9 | TinyLlama 1.1B | TinyLlama | 1.1B | 35.0% | 18.0% | 2.2 GB | ~150 t/s |
| #10 | Phi-3 Mini (4K) | Microsoft | 3.8B | 68.8% | 58.5% | 8 GB | ~80 t/s |
Speeds were measured with 4-bit quantization on a consumer GPU (RTX 4090 or equivalent); memory figures are approximate FP16 weight requirements.
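Those memory figures are easy to sanity-check: FP16 stores 2 bytes per parameter, and 4-bit quantization roughly a quarter of that. A back-of-envelope helper (weights only; the KV cache and runtime overhead come on top, which is why real requirements land a bit higher):

```python
# Rough weight-only memory estimate: parameter count times bits per parameter.
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1024**3

for name, params in [("Phi-3.5 Mini", 3.8), ("Gemma 2 9B", 9.0), ("Llama 3.2 3B", 3.0)]:
    fp16, q4 = weight_memory_gb(params, 16), weight_memory_gb(params, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")
```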
Detailed Reviews
#1: Phi-3.5 Mini
Microsoft · 3.8B parameters · 128K context
Best efficiency per parameter. Runs on phones and laptops with surprising quality. 128K context in a 3.8B model.
MMLU 69.0% · HumanEval 61.0% · VRAM 8 GB · Speed ~80 t/s
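To reproduce the 4-bit setup from the table on your own GPU, here is a hedged sketch with Hugging Face transformers and bitsandbytes (assumes a CUDA GPU and `pip install transformers accelerate bitsandbytes`):

```python
# Load Phi-3.5 Mini in 4-bit via bitsandbytes and generate a short completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3.5-mini-instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
tok = AutoTokenizer.from_pretrained(model_id)

inputs = tok("The key advantage of small LLMs is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```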
#2: Gemma 2 9B
Google · 9B parameters · 8K context
Best output quality in the small category. Knowledge distillation from a larger teacher model delivers punch above its weight.
MMLU 71.3% · HumanEval 40.2% · VRAM 18 GB · Speed ~45 t/s
#3: Llama 3.2 3B
Meta AI · 3B parameters · 128K context
Smallest Llama with 128K context. Optimized for on-device use with strong general capability.
MMLU 63.4% · HumanEval 55.0% · VRAM 6 GB · Speed ~90 t/s
#4: Qwen 2.5 7B
Alibaba · 7B parameters · 128K context
Strong multilingual small model. Excellent support for Chinese and other Asian languages, with 128K context.
MMLU 70.0% · HumanEval 65.0% · VRAM 14 GB · Speed ~50 t/s
#5: DeepSeek Coder V2 Lite
DeepSeek · 16B (2.4B active) parameters · 128K context
MoE magic: 82% HumanEval with only 2.4B active parameters per token. The best small model for coding (see the toy router sketch below).
MMLU 60.0% · HumanEval 82.0% · VRAM 6 GB · Speed ~70 t/s
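To make "16B total, 2.4B active" concrete: an MoE layer routes each token to its top-k experts, so per-token compute scales with the experts actually selected rather than the full parameter count. A toy top-2 router in numpy (illustrative of the mechanism only, not DeepSeek's actual architecture):

```python
# Toy mixture-of-experts layer: only the top-k expert matrices touch each token.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2                       # hidden size, expert count, top-k
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router                          # router score per expert
    top = np.argsort(logits)[-k:]                # indices of the k best experts
    w = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the winners
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

print(moe_forward(rng.standard_normal(d)).shape)  # 2 of 8 experts did the work
```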
#6: Mistral 7B v0.3
Mistral AI · 7B parameters · 32K context
The OG efficient model and still a solid generalist. v0.1 popularized sliding-window attention; v0.3 trades it for a full 32K context window (see the mask sketch below).
MMLU 64.2% · HumanEval 32.0% · VRAM 14 GB · Speed ~55 t/s
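Sliding-window attention is easy to picture: each token attends only to the previous w positions instead of the whole prefix, turning the causal mask into a band and making attention cost linear in sequence length. A toy mask in numpy (a sketch of the technique Mistral v0.1 popularized, not Mistral's code):

```python
# Build a sliding-window causal mask: token i may attend to tokens (i - w, i].
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    i = np.arange(seq_len)[:, None]              # query positions (rows)
    j = np.arange(seq_len)[None, :]              # key positions (columns)
    return (j <= i) & (j > i - window)           # causal AND inside the window

print(sliding_window_mask(6, 3).astype(int))
```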
#7: SmolLM2 1.7B
HuggingFace · 1.7B parameters · 8K context
Runs on practically anything: phones, Raspberry Pi, even browsers. Excellent for resource-constrained devices.
MMLU 50.0% · HumanEval 30.0% · VRAM 3.5 GB · Speed ~120 t/s
#8: Gemma 2 2B
Google · 2B parameters · 8K context
Google's smallest. Good for simple tasks, classification, and mobile apps, with Google-quality training behind it.
MMLU 52.0% · HumanEval 28.0% · VRAM 4 GB · Speed ~100 t/s
#9: TinyLlama 1.1B
TinyLlama · 1.1B parameters · 2K context
The smallest useful LLM. Great for learning, experimentation, and very basic NLP tasks on any device.
MMLU 35.0% · HumanEval 18.0% · VRAM 2.2 GB · Speed ~150 t/s
#10: Phi-3 Mini (4K)
Microsoft · 3.8B parameters · 4K context
The original Phi-3 with a shorter context but the same efficiency. Perfect when you don't need 128K.
MMLU 68.8% · HumanEval 58.5% · VRAM 8 GB · Speed ~80 t/s
Best overall small model: Phi-3.5 Mini (3.8B) — unmatched efficiency with 128K context. Runs on a laptop and delivers surprising quality.
Best for coding on-device: DeepSeek Coder V2 Lite — 82% HumanEval with only 2.4B active parameters. MoE architecture makes it incredibly efficient.
Best for mobile phones: SmolLM2 1.7B — runs on literally anything. Perfect for mobile apps and very resource-constrained environments.
Compare with bigger models: Phi-3 vs Gemma 2 · Best Open-Source LLMs · Best Code LLMs