vMLX is the most complete MLX inference engine for Mac. From a 5-layer caching stack to 20+ agentic coding tools, everything runs locally on Apple Silicon — no cloud, no telemetry, no compromises.
vMLX implements the deepest caching pipeline available for local MLX inference. All five layers work together to eliminate redundant computation, cut time-to-first-token, and keep memory usage under control, so you get faster responses on the same hardware.
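The idea behind prefix caching, the foundation of a stack like this, can be sketched in a few lines. This is an illustrative toy, not vMLX's actual code: it reuses stored state for the longest previously seen token prefix, so only the new suffix needs a forward pass.

```python
# Toy prefix cache: map a token-id prefix to its cached state and
# look up the longest match for a new prompt.
from typing import Optional

class PrefixCache:
    def __init__(self) -> None:
        self._entries: dict[tuple[int, ...], object] = {}

    def store(self, tokens: list[int], state: object) -> None:
        self._entries[tuple(tokens)] = state

    def longest_prefix(self, tokens: list[int]) -> tuple[int, Optional[object]]:
        """Return (match_length, cached_state) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            state = self._entries.get(tuple(tokens[:end]))
            if state is not None:
                return end, state
        return 0, None

cache = PrefixCache()
cache.store([1, 2, 3], "kv-state-for-[1,2,3]")      # cached from an earlier turn
hit, state = cache.longest_prefix([1, 2, 3, 4, 5])  # new turn shares a prefix
print(hit, state)  # 3 kv-state-for-[1,2,3]: only tokens 4 and 5 need computing
```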
vMLX ships with 20+ built-in MCP tools that turn any local model into a coding agent. The model can read, write, and edit files on your machine, run shell commands, search the web, automate a browser, interact with git repositories, and more — all with your explicit approval before each action.
Because the tools run locally through vMLX's native runtime, there is no round-trip to a cloud server. Tool execution is near-instant and your code never leaves your machine.
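As a sketch of what an agentic round-trip looks like, the snippet below drives a tool call through the OpenAI-compatible endpoint described later on this page. The model name, API key, and the read_file tool definition are illustrative assumptions, and vMLX's built-in MCP tools may well be exposed to the model automatically rather than declared per request.

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_VMLX_API_KEY")

resp = client.chat.completions.create(
    model="local-model",  # placeholder for whatever model vMLX has loaded
    messages=[{"role": "user", "content": "Summarize src/main.py"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "read_file",  # hypothetical tool name for illustration
            "description": "Read a file from the local filesystem",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
)
# A tool-capable model replies with a structured call instead of prose.
print(resp.choices[0].message.tool_calls)
```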
vMLX supports vision-language (VL) models with full caching support — image tokens are cached just like text, so multi-turn vision conversations stay fast. Paste screenshots, drag in photos, or point the model at a URL and it processes everything locally.
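A minimal vision request might look like the following, assuming vMLX accepts the standard OpenAI image_url content part on its local endpoint; the model name is a placeholder for whatever VL model you have loaded.

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_VMLX_API_KEY")

# Encode a local screenshot as a base64 data URL.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="local-vl-model",  # placeholder for a loaded VL model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this screenshot show?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```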
Voice chat with text-to-speech lets you have natural spoken conversations with any model. Reasoning blocks are fully supported through the enable_thinking and reasoning_effort parameters, giving you transparent chain-of-thought output from models that support extended thinking and letting you balance speed against thoroughness on every request.
vMLX exposes a fully OpenAI-compatible REST API at http://127.0.0.1:8000 with 7 endpoints covering chat completions, completions, models, embeddings, and more. API key authentication is built in, so you can lock down access even on a shared network.
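For example, a raw request against the local endpoint might look like this. The /v1/chat/completions path follows the OpenAI convention, and placing enable_thinking and reasoning_effort at the top level of the body is an assumption based on the parameter names above.

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_VMLX_API_KEY"},  # built-in key auth
    json={
        "model": "local-model",  # placeholder for the loaded model
        "messages": [{"role": "user", "content": "Explain KV caching briefly."}],
        "enable_thinking": True,     # request chain-of-thought output
        "reasoning_effort": "high",  # trade speed for thoroughness
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```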
Any tool that speaks the OpenAI protocol works out of the box. Point Cursor, Continue, Aider, Open Interpreter, or your own scripts at vMLX and use local models as a drop-in replacement for GPT — with zero code changes and zero API costs.
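Concretely, a script written for the official OpenAI Python client needs only a different base_url; the model name below is a placeholder for whichever model vMLX has loaded.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # vMLX instead of api.openai.com
    api_key="YOUR_VMLX_API_KEY",
)
resp = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello from a local model!"}],
)
print(resp.choices[0].message.content)
```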
vMLX auto-detects and runs 50+ model architectures out of the box, from mainstream Llama and Mistral families to specialized Mamba/SSM state-space models. Speculative decoding is available to accelerate generation by running a small draft model alongside the main model.
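The core idea of speculative decoding can be illustrated with a toy sketch. This shows the general algorithm, not vMLX's implementation: a cheap draft model proposes a few tokens, the main model checks them, and the agreed prefix is kept while the first mismatch is corrected.

```python
import random

random.seed(0)
VOCAB = list(range(10))

def draft_next(context: list[int]) -> int:
    # Stand-in for a small, fast draft model.
    return random.choice(VOCAB)

def target_next(context: list[int]) -> int:
    # Deterministic stand-in for the main model's greedy choice.
    return (sum(context) + len(context)) % 10

def speculative_step(context: list[int], k: int = 4) -> list[int]:
    # 1. The draft model cheaply proposes k tokens.
    ctx = list(context)
    proposal = []
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)
    # 2. The main model checks each proposed position (in the real
    #    algorithm this is a single batched forward pass). The agreed
    #    prefix is kept; the first mismatch is replaced with the main
    #    model's token and the rest of the draft is discarded.
    ctx = list(context)
    accepted = []
    for token in proposal:
        want = target_next(ctx)
        if token == want:
            accepted.append(token)
            ctx.append(token)
        else:
            accepted.append(want)
            break
    return accepted

print(speculative_step([1, 2, 3]))  # up to k tokens per verification step
```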
On the tool-use side, 14 built-in tool call parsers handle every major function-calling format, and 4 reasoning parsers extract structured thinking from models that support it. A built-in HuggingFace browser lets you search, preview, and download any MLX model directly from the app.
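What a reasoning parser does can be shown with a small sketch: split the model's raw output into a structured thinking block and the final answer. The <think> tag used here is one common convention and an assumption; vMLX's 4 parsers presumably cover several such formats.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output."""
    m = THINK_RE.search(raw)
    if not m:
        return "", raw.strip()
    reasoning = m.group(1).strip()
    answer = (raw[:m.start()] + raw[m.end():]).strip()
    return reasoning, answer

raw = "<think>The user wants 2+2. That is 4.</think>The answer is 4."
print(split_reasoning(raw))
# ('The user wants 2+2. That is 4.', 'The answer is 4.')
```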
vMLX is not limited to local models. Connect to OpenAI, Anthropic, or any OpenAI-compatible API endpoint and use all of vMLX's agentic tools, multi-turn sessions, and UI features with cloud-hosted models. This gives you the best of both worlds: cloud intelligence with local tool execution.
Your API keys are stored locally and requests go directly to the provider — vMLX never proxies through its own servers. Switch between local and remote models mid-session, or run them side by side in different sessions.
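Running local and remote side by side boils down to two clients pointed at different endpoints, as in this sketch; the model names and keys are placeholders.

```python
from openai import OpenAI

local = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_VMLX_API_KEY")
remote = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # defaults to api.openai.com

for name, client, model in [
    ("local", local, "local-model"),
    ("remote", remote, "gpt-4o-mini"),
]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One-line summary of MLX?"}],
    )
    print(name, "->", resp.choices[0].message.content)
```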
© 2026 ShieldStack LLC