Every feature, built for speed and privacy

vMLX is the most complete MLX inference engine for Mac. From a 5-layer caching stack to 20+ agentic coding tools, everything runs locally on Apple Silicon — no cloud, no telemetry, no compromises.

5-Layer Caching Stack

vMLX implements the deepest caching pipeline available for local MLX inference. Every layer works together to eliminate redundant computation, cut time-to-first-token, and keep memory usage under control — so you get faster responses on the same hardware.

Prefix Caching
System prompts, tool definitions, and conversation history are cached and reused across turns. Repeat tokens are never recomputed, dramatically reducing time-to-first-token on long contexts.
Paged KV Cache
Key-value cache memory is allocated in fixed-size pages instead of one contiguous block. This eliminates fragmentation and lets vMLX handle longer contexts without running out of memory.
KV Cache Quantization (q4/q8)
Quantize the KV cache to 4-bit or 8-bit precision on the fly. This reduces cache memory usage by up to 75%, letting you run larger models or longer conversations within the same RAM budget.
Continuous Batching
Multiple requests share compute efficiently through continuous batching. New prompts enter the pipeline without waiting for others to finish, maximizing GPU utilization across concurrent sessions.
Persistent Disk Cache
Computed caches are written to disk and restored automatically on relaunch. Cold starts become warm starts — your most-used models and system prompts load almost instantly even after a restart.
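To make the q4 savings concrete, here is back-of-envelope arithmetic for a hypothetical model. The dimensions below are illustrative assumptions, not vMLX internals, and the calculation ignores the small per-group scale overhead real quantization adds:

```python
# KV cache sizing for a hypothetical 32-layer model with grouped KV heads.
# All dimensions are made up for illustration; they are not vMLX defaults.
layers = 32
kv_heads = 8
head_dim = 128
seq_len = 32_768  # tokens held in context

def kv_bytes(bits_per_value: int) -> int:
    # Two tensors (K and V) per layer, one vector per token per KV head.
    values = layers * 2 * kv_heads * head_dim * seq_len
    return values * bits_per_value // 8

fp16 = kv_bytes(16)
q4 = kv_bytes(4)
print(f"fp16 cache: {fp16 / 2**30:.1f} GiB")   # 4.0 GiB
print(f"q4 cache:   {q4 / 2**30:.1f} GiB")     # 1.0 GiB
print(f"reduction:  {1 - q4 / fp16:.0%}")      # 75%
```

Going from 16-bit to 4-bit values is where the "up to 75%" figure comes from: the same cache shrinks from 4 GiB to 1 GiB in this example.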

Agentic Coding Tools

vMLX ships with 20+ built-in MCP tools that turn any local model into a coding agent. The model can read, write, and edit files on your machine, run shell commands, search the web, automate a browser, interact with git repositories, and more — all with your explicit approval before each action.

Because the tools run locally through vMLX's native runtime, there is no round-trip to a cloud server. Tool execution is near-instant and your code never leaves your machine.

File Read · File Write · File Edit · Shell Execution · Browser Automation · Web Search · Git Tools · Clipboard · Date/Time · Directory Listing · Process Management · and more
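From a client's perspective, a tool invocation arrives in standard OpenAI function-calling shape before you approve it. A sketch of what a file-read call might look like; the tool name and argument schema here are assumptions, not vMLX's documented interface:

```python
import json

# Hypothetical tool call a model might emit for a file-read tool.
# "file_read" and its "path" argument are illustrative names only.
tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {
        "name": "file_read",
        "arguments": json.dumps({"path": "src/main.swift"}),
    },
}

# The host app decodes the arguments and surfaces them for approval
# before anything touches the filesystem.
args = json.loads(tool_call["function"]["arguments"])
print(args["path"])
```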

Vision, Voice & Reasoning

vMLX supports vision-language (VL) models with full caching support — image tokens are cached just like text, so multi-turn vision conversations stay fast. Paste screenshots, drag in photos, or point the model at a URL and it processes everything locally.
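Because the API is OpenAI-compatible, a vision request is just a chat message with an image content part. A minimal payload sketch; the model name is a placeholder for any VL model you have downloaded:

```python
import base64

# Minimal vision request body in OpenAI chat format. In practice the
# bytes would come from a real screenshot or photo on disk.
image_b64 = base64.b64encode(b"<raw image bytes here>").decode()

payload = {
    "model": "mlx-community/Qwen2.5-VL-7B-Instruct-4bit",  # placeholder
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this screenshot?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
}
# POST this as JSON to the local chat completions endpoint.
```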

Voice chat with text-to-speech lets you have natural spoken conversations with any model. Reasoning blocks are fully supported with enable_thinking and reasoning_effort parameters, giving you transparent chain-of-thought output from models that support extended thinking.

VL Model Support
Run vision-language models locally with full image caching. Multi-turn vision conversations maintain cache across turns for fast follow-ups.
Voice Chat & TTS
Speak naturally with any model. Built-in text-to-speech reads responses aloud, turning vMLX into a hands-free coding and research assistant.
Reasoning Blocks
Models that support extended thinking output structured reasoning blocks. Control the depth with enable_thinking and reasoning_effort to balance speed versus thoroughness on every request.
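A request sketch using the reasoning controls named above. The parameter names come from the feature description; their exact placement in the request body (top level, as shown here) is an assumption to verify against the vMLX docs:

```python
# Chat request using the reasoning parameters. "local-model" is a
# placeholder for any thinking-capable model loaded in vMLX.
body = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "Plan the refactor step by step."}],
    "enable_thinking": True,       # emit a structured reasoning block
    "reasoning_effort": "high",    # trade speed for deeper thinking
}
# POST this as JSON to http://127.0.0.1:8000/v1/chat/completions.
```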

OpenAI-Compatible API

vMLX exposes a fully OpenAI-compatible REST API at http://127.0.0.1:8000 with 7 endpoints covering chat completions, completions, models, embeddings, and more. API key authentication is built in, so you can lock down access even on a shared network.

Any tool that speaks the OpenAI protocol works out of the box. Point Cursor, Continue, Aider, Open Interpreter, or your own scripts at vMLX and use local models as a drop-in replacement for GPT — with zero code changes and zero API costs.
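For a custom script, talking to vMLX looks exactly like talking to OpenAI with a local base URL. A stdlib-only sketch; the model name and API key are placeholders, and `/v1/chat/completions` is the standard OpenAI path, which an OpenAI-compatible server is assumed to expose:

```python
import json
import urllib.request

# Build a chat completion request against the local vMLX server.
req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps({
        "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",  # placeholder
        "messages": [{"role": "user", "content": "Say hello in one word."}],
    }).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_VMLX_KEY",  # if API key auth is enabled
    },
)

# Uncomment with vMLX running to get the model's reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same base URL and key are all that tools like Cursor or Aider need in their OpenAI provider settings.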

Chat Completions · Completions · Models · Embeddings · API Key Auth · Cursor · Continue · Aider

Model Support

vMLX auto-detects and runs 50+ model architectures out of the box, from mainstream Llama and Mistral families to specialized Mamba/SSM state-space models. Speculative decoding is available to accelerate generation by running a small draft model alongside the main model.
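The draft-and-verify loop behind speculative decoding can be sketched in a few lines. The two "models" below are arithmetic stand-ins, not MLX calls, and a real engine verifies the whole draft in a single batched forward pass rather than one call per token:

```python
def main_model(seq):
    # Stand-in for an expensive next-token call (greedy decoding).
    return (seq[-1] * 31 + 7) % 100

def draft_model(seq):
    # Cheap draft that agrees with the main model most of the time.
    return main_model(seq) if seq[-1] % 5 else 0

def speculative_step(seq, k=4):
    # 1. Draft model proposes k tokens autoregressively.
    draft, s = [], list(seq)
    for _ in range(k):
        t = draft_model(s)
        draft.append(t)
        s.append(t)
    # 2. Main model checks each proposal; accept the matching prefix.
    out = list(seq)
    for t in draft:
        expected = main_model(out)
        if t == expected:
            out.append(t)          # accepted: generated almost for free
        else:
            out.append(expected)   # first mismatch: keep the real token
            break
    else:
        out.append(main_model(out))  # all accepted: one bonus token
    return out

print(speculative_step([1]))  # [1, 38, 85, 42] — two drafts accepted
```

When the draft model agrees often, each verification pass yields several tokens for the price of one main-model step; when it disagrees, output is still identical to running the main model alone.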

On the tool-use side, 14 built-in tool call parsers handle every major function-calling format, and 4 reasoning parsers extract structured thinking from models that support it. A built-in HuggingFace browser lets you search, preview, and download any MLX model directly from the app.

50+ Architectures
Llama, Mistral, Qwen, Gemma, Phi, Command-R, DeepSeek, StarCoder, and many more — auto-detected from model metadata with zero configuration.
14 Tool Parsers
Every major function-calling format is supported. Models can invoke tools regardless of their native tool-call syntax.
Speculative Decoding
Pair a small draft model with your main model to accelerate token generation. The draft model proposes candidates that the main model verifies in parallel.
Mamba/SSM
First-class support for state-space models including Mamba architectures. These models offer constant-memory inference for extremely long contexts.
4 Reasoning Parsers
Structured reasoning extraction for models that support extended thinking, with transparent chain-of-thought display in the UI.
HuggingFace Browser
Search, preview model cards, and download any MLX model directly from the app. No terminal, no git-lfs, no manual file management.

Remote Endpoints

vMLX is not limited to local models. Connect to OpenAI, Anthropic, or any OpenAI-compatible API endpoint and use all of vMLX's agentic tools, multi-turn sessions, and UI features with cloud-hosted models. This gives you the best of both worlds: cloud intelligence with local tool execution.

Your API keys are stored locally and requests go directly to the provider — vMLX never proxies through its own servers. Switch between local and remote models mid-session, or run them side by side in different sessions.

Ready to try it?

Free, code-signed, and notarized. One download, no setup.

Download vMLX

© 2026 ShieldStack LLC