
Mistral Medium 3.5 Lands as a 128B Open-Weight Frontier Model — Here's What's New


The week that gave us a leaked $900B Anthropic round also gave us, almost in the same news cycle, a quieter but arguably more consequential drop: Mistral Medium 3.5 went public on April 29, 2026. A 128-billion-parameter dense model, a 256k-token context window, a single set of weights that handles chat, reasoning, and code — and crucially, open weights on Hugging Face under a Modified MIT license.

The Mistral Medium 3.5 release is the kind of launch that doesn't dominate Twitter for a day but quietly rearranges what a lot of teams build on for the next quarter. We've spent the weekend running it through our usual prompt suite, and there's enough here that's genuinely interesting to write up.

What actually shipped

Pulling from Mistral's own model card, the Hugging Face page, and the early coverage from The Decoder and GIGAZINE:

  • Mistral Medium 3.5 — a 128B dense model. Not a mixture of experts. All 128B parameters activate for every token.
  • 256k context window — four times what most of the open-weight ecosystem was shipping six months ago.
  • Configurable reasoning effort — a toggle between "instant reply" and "reasoning" modes, in the same model. No separate -thinking SKU.
  • Vision — image input is native, not a bolt-on adapter.
  • Multilingual by default — English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic among the explicitly supported languages.
  • Open weights, Modified MIT license, available on Hugging Face. Self-hostable on as little as four GPUs.
  • API pricing: $1.50/M input, $7.50/M output tokens via Mistral's hosted endpoint.

That last bullet — pricing roughly an order of magnitude below the top of OpenAI and Anthropic's frontier SKUs, while remaining self-hostable if you'd rather run the weights yourself — is the one we keep coming back to.
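To put numbers on that, here's a quick back-of-envelope in TypeScript. The Mistral prices are the published ones above; the frontier comparison point ($15/M input, $75/M output) is our own assumption, chosen only to illustrate the order-of-magnitude gap, not a quote of any vendor's actual price list.

```typescript
// Back-of-envelope monthly cost at a hypothetical workload of
// 500M input tokens and 50M output tokens per month.
// Mistral prices are from the published list above; the "frontier"
// prices are an assumed $15/M in, $75/M out, for illustration only.

const MILLION = 1_000_000;

interface Pricing {
  inputPerMTok: number;  // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

const mistralMedium35: Pricing = { inputPerMTok: 1.5, outputPerMTok: 7.5 };
const hypotheticalFrontier: Pricing = { inputPerMTok: 15, outputPerMTok: 75 };

function monthlyCost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / MILLION) * p.inputPerMTok +
         (outputTokens / MILLION) * p.outputPerMTok;
}

const inTok = 500 * MILLION;
const outTok = 50 * MILLION;

console.log(`Medium 3.5: $${monthlyCost(mistralMedium35, inTok, outTok)}`);       // $1125
console.log(`Frontier:   $${monthlyCost(hypotheticalFrontier, inTok, outTok)}`);  // $11250
```

Same workload, roughly a 10x spread, before you even price in the self-hosting option.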

Why "merged" matters

The headline architectural choice here is that Mistral Medium 3.5 is what they're calling a flagship merged model. Translation: chat, reasoning, and code are unified inside one set of weights, instead of being three separate SKUs that you route between.

In 2025 and early 2026, the prevailing pattern from most labs was to ship a base instruct model and a separate reasoning variant — o3 next to GPT-5, claude-opus-4-7 next to a thinking-mode counterpart, DeepSeek-V3 next to DeepSeek-R1. That works, but it creates an ugly orchestration problem: you have to know in advance which model to hit, and you pay full price on the reasoning model whether the prompt actually needed it or not.

A merged model with a reasoning_effort knob solves that on the prompt side. You hand the model a request, set the effort dial, and it either fires back instantly or thinks before answering — same weights, same context, same conversational state. Anthropic and OpenAI have been hinting at this direction; Mistral shipped it first in this size class with open weights.

Pro tip: if you're routing between a base and a thinking variant today, the migration path to a merged model is cleaner than you think. Treat reasoning_effort as a per-request parameter, default it to "low," and only raise it for prompts where your evals show it actually helps.
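Here's a minimal sketch of what that looks like in practice, written against Mistral's chat-completions endpoint. The model id, the reasoning_effort field name, and its accepted values are assumptions based on how the release describes the knob; check the official API reference before relying on them.

```typescript
// Minimal per-request effort routing against a merged model.
// Assumptions (not confirmed against official docs): the model id is
// "mistral-medium-3-5", reasoning effort is a top-level
// "reasoning_effort" field taking "low" | "high", and the response
// follows the usual chat-completions shape.

type Effort = "low" | "high";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

async function chat(messages: ChatMessage[], effort: Effort = "low"): Promise<string> {
  const res = await fetch("https://api.mistral.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.MISTRAL_API_KEY}`,
    },
    body: JSON.stringify({
      model: "mistral-medium-3-5",  // assumed model id
      reasoning_effort: effort,     // assumed parameter name and values
      messages,
    }),
  });
  if (!res.ok) throw new Error(`API error: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content; // assumed response shape
}

// Default to "low"; raise effort only where your evals show it helps.
// This regex is a placeholder heuristic — in production, drive it from
// eval results, not keyword matching.
const needsDeliberation = (prompt: string) =>
  /refactor|prove|multi-step|migration/i.test(prompt);

async function answer(prompt: string): Promise<string> {
  const effort: Effort = needsDeliberation(prompt) ? "high" : "low";
  return chat([{ role: "user", content: prompt }], effort);
}
```

The key property: both branches hit the same weights and the same conversational state, so there's no cross-model handoff to manage.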

What it's actually good at

We've kept our impressions deliberately narrow because a weekend isn't a real evaluation. But a few patterns held up across the prompts we tried:

Code is the strongest single capability. Mistral has been positioning Medium 3.5 as a "frontier coder" and the early benchmark numbers (the published HumanEval-class scores and the third-party BenchLM rankings) back that up. In our own quick checks — the same TypeScript refactor prompts we use to sanity-check every new model — it produced cleaner output than we expected for a 128B dense model and was visibly faster than the Claude Opus tier on equivalent tasks.

Long-context reasoning is genuinely usable. 256k isn't 1M, but it's enough that the model can hold a non-trivial codebase, a long PDF, or a full conversation history in one shot. Recall in the back third of the window degraded a bit (as it does for everyone), but coherence held.

The "reasoning effort" toggle is real, not theater. With effort set low, latency is on the order of a fast instruct model. Set high, the model produces visibly more deliberate output — chain-of-thought stays internal, but the answers are noticeably better on multi-step word problems and code with subtle correctness traps.

Multilingual is a step up from the previous Medium line. French and Spanish in particular felt closer to native than the older Mistral Small/Medium lineage.

What we haven't validated yet, and so won't claim: agentic tool-use over long horizons, deep RAG performance against the Anthropic and OpenAI flagships, or vision quality versus the dedicated multimodal SKUs. Treat those as open questions for now.

How the Mistral Medium 3.5 release compares

A few quick triangulation points so you can place it on your own map:

  • vs. DeepSeek V4 preview (which we covered yesterday): DeepSeek V4 has the bigger headline number — 1M context — and the more aggressive open-source posture (full MIT). Mistral is smaller in context but denser in parameters and arguably more polished out of the box. Different bets, both worth watching.
  • vs. Claude Opus 4.7 / GPT-5.5: Mistral is meaningfully cheaper per token and self-hostable. Opus and GPT-5.5 are still ahead on the very hardest agentic and coding benchmarks. The decision is "frontier capability at frontier price" vs "90% of the capability at 15% of the price plus the option to run it on your own GPUs."
  • vs. Llama 4 / GLM-class open weights: Mistral is denser than most MoE-style open releases, which means worse compute efficiency at training time but cleaner inference characteristics for teams that don't want to wrestle with expert routing.

What we're going to do with it

Practical near-term plans, in case it's useful as a template:

  1. Re-run our internal prompt eval suite against mistral-medium-3-5 and add it to the routing table next to the existing Claude and GPT lanes.
  2. Try a self-hosted deployment on a small 4xH100 node to get a real read on cost-per-token vs. the API. The Hugging Face page suggests this is realistic; we want to confirm.
  3. Migrate one of our internal coding agents that currently runs on a thinking-tier model to Medium 3.5 with reasoning_effort: high, and see if the cost savings hold without quality regressions.
  4. Stress-test the long context — load a 200k-token codebase and ask narrow questions that require the model to actually traverse the whole window, not just attend to the start and end (a repeatable version of this probe is sketched below).
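Item 4 turns into a repeatable probe with very little code. This sketch plants a known "needle" fact at a controlled depth inside roughly 200k tokens of filler and asks a narrow question that only the needle answers; it reuses the assumed chat() helper from earlier. The 4-characters-per-token heuristic is a rough approximation, not a tokenizer.

```typescript
// Long-context probe: plant a "needle" fact at a controlled depth in
// ~200k tokens of filler, then ask a narrow question that requires it.
// Reuses the assumed chat() helper from the router sketch above.
const FILLER_SENTENCE = "The build pipeline caches artifacts between stages. ";

function buildHaystack(needle: string, approxTokens: number, depth: number): string {
  const targetChars = approxTokens * 4; // rough ~4 chars/token heuristic
  const repeats = Math.ceil(targetChars / FILLER_SENTENCE.length);
  const filler = FILLER_SENTENCE.repeat(repeats);
  const cut = Math.floor(filler.length * depth); // depth in [0, 1]
  return filler.slice(0, cut) + `\n${needle}\n` + filler.slice(cut);
}

async function probe(depth: number): Promise<boolean> {
  const needle = "The deploy key for project HYDRA rotates every 41 days.";
  const doc = buildHaystack(needle, 200_000, depth);
  const reply = await chat(
    [{ role: "user", content: `${doc}\n\nHow often does the HYDRA deploy key rotate?` }],
    "low"
  );
  return reply.includes("41");
}

// Sweep depths to see where recall starts to degrade; the release
// notes (and our weekend testing) suggest the back third is weakest.
for (const depth of [0.1, 0.5, 0.9]) {
  console.log(`depth=${depth}: ${(await probe(depth)) ? "recalled" : "missed"}`);
}
```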

If you're a creator using Mistral through a wrapper rather than directly, the practical change is more modest — your wrapper will pick up the new model, the cost should drop, and your existing prompts should "just work" with maybe a small upward bump in quality. The bigger story is for the people building the wrappers.

The takeaway

The interesting thing about the Mistral Medium 3.5 release isn't any single benchmark number — it's the combination. Open weights, dense architecture, 256k context, merged reasoning, four-GPU self-hosting, and pricing that doesn't make your CFO wince. That stack of properties didn't exist six months ago, and now it does, under a Modified MIT license.

Frontier capability is no longer something only three companies can ship. That's the story we're going to remember from this week, even if the bigger headline number belongs to Anthropic.
