vMLX auto-detects 50+ architectures from HuggingFace and runs them natively on Apple Silicon. Browse popular model families, check RAM requirements, and discover our in-house abliterated and REAP-pruned MLX models.
These are the most popular HuggingFace MLX models people run locally on Mac with vMLX. Every model listed below is auto-detected, auto-configured, and ready to run in one click.
Run DeepSeek locally on Mac. V3 is a 671B MoE general-purpose powerhouse; R1 is the reasoning variant with chain-of-thought. Both work with vMLX's reasoning parser and tool calling.
Run Llama locally on any Mac. Meta's open-weight family spans 1B to 405B parameters. Llama 4 Scout and Maverick bring MoE efficiency and up to 10M-token context.
Run Qwen locally on Mac. Dense models from 0.5B to 72B, plus MoE variants up to 397B. Qwen 3.5 adds vision-language (VL) capabilities. Best-in-class tool calling support.
Google's efficient open model family. Strong multilingual performance in 1B, 4B, 12B, and 27B sizes. Optimized for Apple Silicon via MLX quantization.
Mistral AI's fast and capable models. Mistral (7B, 22B dense) and Mixtral (8x7B, 8x22B MoE) deliver strong coding and instruction-following on Mac.
Microsoft's compact models punch above their weight. Phi-4 (14B) and Phi-3 mini (3.8B) are ideal for Macs with 8–16 GB RAM. Great for coding tasks.
A large MoE model with a small active-parameter count, designed for long-context generation and complex multi-turn conversations.
THUDM's fast reasoning model with native thinking support. Collapsible chain-of-thought blocks display inline in vMLX's UI.
StepFun's lightweight model built for speed. Responsive real-time generation makes it a strong choice for interactive local inference on Mac.
We publish abliterated (CRACK) and REAP-pruned MLX models on HuggingFace. Abliteration removes refusal behavior; REAP pruning removes redundant experts to cut memory and compute without sacrificing quality (a toy sketch of the pruning idea follows the model list below).
Abliterated vision-language model based on Qwen 3.5 VL 9B. CRACK removes alignment-imposed refusal while preserving instruction-following and visual reasoning. Available in both 8-bit and 4-bit MLX quantizations for flexible RAM usage.
REAP-pruned 397B Mixture-of-Experts with only 17B active parameters. REAP (Redundant Expert Ablation Pruning) removes low-impact experts to cut memory and compute requirements while maintaining benchmark performance. The largest REAP-pruned MLX model available.
Abliterated vision-language MoE model: 35B total parameters with only 3B active. Combines the efficiency of Mixture-of-Experts routing with CRACK abliteration for an uncensored VL experience that fits in modest RAM.
The largest vision-language model on MLX. A REAP-pruned 397B MoE with 17B active parameters and full multimodal support. Runs on 192 GB+ Macs with vMLX's 5-layer caching stack including VL-aware prefix caching, paged KV, q4/q8 quantized KV, batching, and disk cache.
The tiniest abliterated vision-language model in our lineup. At 4-bit quantization, it fits comfortably on 8 GB Macs while still delivering image understanding, OCR, and visual Q&A without refusal. Perfect entry point for VL on minimal hardware.
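For readers curious what expert pruning means structurally, here is a toy sketch: drop the lowest-scoring experts from an MoE layer's weight dict and keep the rest. The router-score heuristic and the shapes here are placeholder assumptions for illustration, not the actual REAP algorithm these models were pruned with.

```python
# Toy illustration of expert pruning (not vMLX's or REAP's actual method).
import numpy as np

def prune_experts(expert_weights: dict, router_scores: np.ndarray, keep: int) -> dict:
    """Keep the `keep` experts with the highest placeholder router score."""
    top = np.argsort(router_scores)[-keep:]          # indices of the strongest experts
    return {int(i): expert_weights[int(i)] for i in sorted(top)}

experts = {i: np.random.randn(16, 16) for i in range(8)}   # 8 toy expert weight matrices
scores = np.random.rand(8)                                  # placeholder: avg routing weight per expert
pruned = prune_experts(experts, scores, keep=4)
print("kept experts:", sorted(pruned))
```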
Apple Silicon's unified memory is shared between the OS, apps, and the model. The table below shows the largest MLX model size you can comfortably run at each RAM tier, with recommended examples.
| Unified RAM | Max Model Size (parameters) | Example Models |
|---|---|---|
| 8 GB | ~4B | Phi-3 mini 3.8B, Qwen3.5-VL-2B CRACK (4-bit), Llama 3.2 3B |
| 16 GB | ~20B | Qwen3.5-VL-9B CRACK (8-bit), Phi-4 14B, Mistral 7B, Gemma 3 12B |
| 32 GB | ~35B | Qwen3.5-VL-35B-A3B CRACK, Gemma 3 27B, Mistral Small 22B |
| 64 GB | ~70B | Llama 3.1 70B (4-bit), Qwen 2.5 72B (4-bit), DeepSeek R1 Distill 70B (4-bit) |
| 128 GB | ~100B+ | Qwen3.5-397B-A17B REAP (4-bit), DeepSeek V3 (4-bit), Llama 4 Scout |
| 192 GB+ | 397B MoE | Qwen3.5-VL-397B-A17B REAP, DeepSeek V3 (8-bit), Llama 4 Maverick |
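As a rough sanity check on these tiers, you can estimate a model's footprint from its parameter count and quantization bits. The snippet below is a back-of-envelope heuristic, not part of vMLX; the 25% overhead factor for KV cache, activations, and the OS is an assumption.

```python
# Back-of-envelope RAM estimate for a quantized model (not part of vMLX).
# Assumption: weights take roughly params * bits / 8 bytes, plus ~25%
# overhead for KV cache, activations, and the OS sharing unified memory.

def estimate_ram_gb(params_billions: float, bits: int = 4, overhead: float = 1.25) -> float:
    """Approximate unified-memory footprint in GB."""
    weight_gb = params_billions * 1e9 * bits / 8 / 1024**3
    return weight_gb * overhead

# Example: Phi-4 14B at 4-bit lands around 8 GB, so a 16 GB Mac is comfortable.
for name, params, bits in [("Phi-4 14B", 14, 4), ("Llama 3.1 70B", 70, 4)]:
    print(f"{name} @ {bits}-bit ~= {estimate_ram_gb(params, bits):.1f} GB")
```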
vMLX auto-detects the model architecture, tool call format, and reasoning format from HuggingFace config files. No manual configuration needed — just download and run.
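For a sense of where that detection starts, here is a minimal sketch that reads the `architectures` field from a repo's config.json via huggingface_hub. It illustrates the standard HuggingFace layout only; vMLX's real detection logic (including tool call and reasoning formats) is internal, and the repo id below is just an example.

```python
# Minimal sketch of config-based architecture detection (illustrative only).
import json
from huggingface_hub import hf_hub_download

def detect_architecture(repo_id: str) -> str:
    """Download config.json from a HuggingFace repo and return its architecture name."""
    config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
    with open(config_path) as f:
        config = json.load(f)
    # "architectures" is the standard config field, e.g. ["LlamaForCausalLM"].
    return config.get("architectures", ["unknown"])[0]

print(detect_architecture("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"))
```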
BatchMambaCache enables batched inference on state-space models; no other MLX engine supports this. Reasoning models' <think> blocks render as a collapsible reasoning UI, with enable_thinking and reasoning_effort API support.
Browse our abliterated and REAP-pruned models on HuggingFace, or download vMLX to run any MLX model on your Mac.
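If you want a feel for those reasoning controls before downloading, here is a rough sketch of passing enable_thinking and reasoning_effort in a chat request, assuming an OpenAI-compatible local endpoint; the URL, port, model id, and response shape are assumptions rather than documented vMLX behavior.

```python
# Hypothetical request sketch: the endpoint URL, port, model id, and response
# shape are assumptions. Only enable_thinking and reasoning_effort come from
# the vMLX feature list above.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # placeholder local endpoint
    json={
        "model": "deepseek-r1",                     # placeholder model id
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "enable_thinking": True,
        "reasoning_effort": "high",
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```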