MiniMax M2.7: what it does and how it compares to Claude, ChatGPT, Gemini, Grok, and DeepSeek V4

Frontier coding and agentic performance at roughly a fraction of Opus-tier prices. See the benchmarks before you switch.

MiniMax M2.7 compared head-to-head with Claude, ChatGPT, Gemini, Grok, and DeepSeek V4, with a 1.20 dollar per million token price tag
MiniMax M2.7 lines up against the major frontier models on price and performance.

I ran into MiniMax M2.7 the same way a lot of people probably will, not from a launch announcement but from seeing it listed as a model option inside my Genspark Chat agent .

That raised the obvious questions.

What is this thing, what can it actually do, and is it good enough to pick over Claude, ChatGPT, Gemini, Grok, or DeepSeek V4?

This article answers all three, with the benchmarks and prices laid out side by side so you can decide before you switch.

Quick verdict

MiniMax M2.7 is the best price-to-performance pick for agentic coding right now, landing near-frontier scores at $1.20 per million output tokens [1]. Claude Opus stays the top-quality choice when you need the highest score on the hardest tasks. DeepSeek V4 is the strongest open challenger if you want a 1M context window [2]. M2.7’s catch is simple. Top closed models still edge it on the toughest engineering problems.

At a glanceย 

  • Best value: MiniMax M2.7. Near-frontier coding at $1.20 per million output tokens.
  • Best quality: Claude Opus. Highest scores on the hardest tasks, at a steep price.
  • Best open challenger: DeepSeek V4. Open weights with a 1M context window.

Full comparison table

The numbers below come from different sources, so read the footnotes before you trust any single cell. MiniMax M2.7 figures are vendor-reported by MiniMax [1][3].

Rival figures come from each lab’s release notes and third-party benchmark coverage, and the newest point releases move fast, so verify before you commit budget.

DimensionMiniMax M2.7Claude OpusGPT-5.xGemini 3 ProGrokDeepSeek V4
Total / active params230B / ~10B (MoE) [4]Not disclosedNot disclosedNot disclosedNot disclosedPro: 1.6T / 49B; Flash: 284B / 13B [2]
Coding benchmarkSWE-Pro 56.22% [3]SWE-bench Verified 80.8% (Opus 4.6) [5]SWE-bench Verified ~80% (GPT-5.2) [5]SWE-bench Verified 78% (3 Pro) [5]Public data thinSOTA open agentic coding (V4-Pro) [2]
Context windowLong context (M-series) [4]Large, not stated hereLarge, not stated hereLarge, not stated hereNot stated here1M standard [2]
Output price /M tokens$1.20 (highspeed $2.40) [1]~$75 (Opus 4.6) [5]~$60 (GPT-5.2) [5]~$20 (3 Pro) [5]Not stated hereLower than GPT-5.4 mini on 5/7 CAISI tests [6]
Open weightsYes [7]NoNoNoNoYes [2]
Best use caseCheap agentic coding at volumeHardest tasks, top qualityBroad general useMultimodal + contextGeneral chatOpen + huge context

That table is the fast answer. The rest of this article explains what sits behind each number, starting with what MiniMax M2.7 actually is.

What is MiniMax M2.7?

MiniMax M2.7 is a 230B-parameter MoE model with roughly 10B active parameters per forward pass, built by MiniMax for coding and agentic workflows [4].

MoE = mixture of experts, meaning the model holds 230B parameters total but fires only about 10B for any given token.

That sparsity is the whole cost story. You pay for the work the model does per pass, not for the parameters sitting idle, which is why M2.7 can charge $1.20 per million output tokens while Opus-tier models charge far more [1].

M2.7 is the latest step in the M-series. The line ran M2, then M2.1, then M2.5, and now M2.7.

Each release pushed coding scores higher.

M2.5 hit 80.2% on SWE-bench Verified in February 2026, within 0.6 points of Claude Opus 4.6 at the time [5].

M2.7 shifts the headline benchmark to SWE-Pro, a harder test, where it scores 56.22% [3].

MiniMax M-series timeline showing SWE-bench scores rising from M2 to M2.5, with M2.7 on the harder SWE-Pro benchmark
The M-series climbed SWE-bench Verified from 69.4% to 80.2%, then M2.7 moved to the harder SWE-Pro test at 56.22%.

MiniMax released M2.7 as open weights on Hugging Face, ModelScope, Ollama, and NVIDIA NIM [7]. You can download it and run it yourself. Whether that makes it “open source” is a separate question, and the answer is messier than most articles admit. More on that below.

That covers the what. The build process is where M2.7 gets genuinely unusual.

The self-evolution feature, explained without the hype

MiniMax calls M2.7 its first model to deeply participate in its own evolution [3].

Definition Box

Strip the marketing and here is what happened.

An internal version of M2.7 optimized a programming scaffold across 100-plus rounds. It analyzed failure trajectories, modified code, ran evaluations, and decided to keep or revert each change. That loop produced a 30% performance improvement [3].

Diagram of MiniMax M2.7 self-evolution loop, analyzing failures, editing code, evaluating, and reverting across 100 rounds for a 30 percent gain
M2.7 optimized its own training scaffold across 100-plus rounds for a 30% performance gain.

Let me be straight about what this is and is not. It is not the model rewriting its own weights or going rogue. It is automated trial-and-error on the surrounding agent harness, with humans setting the rewards. Useful? Yes. Sentient? No.

The result that backs the claim is MLE-Bench Lite, a set of 22 machine-learning competitions. M2.7 hit a 66.6% medal rate, second only to Opus-4.6 and GPT-5.4 on that test [3].

A self-improving training loop landing second to the two most expensive frontier models is a number worth pausing on, even if it comes from MiniMax’s own evaluation.

So how does that build process translate into work you can actually use? The benchmarks tell the rest.

What MiniMax M2.7 can do

1. Software engineering

M2.7 scores 56.22% on SWE-Pro, which MiniMax reports as matching GPT-5.3-Codex [3].

SWE-Pro is a harder subset than standard SWE-bench, testing architectural decisions alongside bug fixes. On broader engineering tests, M2.7 posts SWE Multilingual 76.5, Multi SWE-Bench 52.7, VIBE-Pro 55.6% (near Opus 4.6), Terminal Bench 2 57.0%, and NL2Repo 39.8% [3].

Bar chart of MiniMax M2.7 software engineering benchmark scores including SWE Multilingual 76.5 and SWE-Pro 56.22, all vendor-reported
M2.7 software engineering benchmarks as reported by MiniMax. VIBE-Pro sits near Opus 4.6.

The number that caught my attention sits outside the usual coding tests. MiniMax says it reduced live production incident recovery to under three minutes on multiple occasions using M2.7, because the model correlates monitoring metrics, runs trace analysis, and verifies root causes in databases [3].

That is SRE-level reasoning (SRE = site reliability engineering, the practice of keeping production systems running), not autocomplete.

Treat the three-minute figure as a vendor anecdote until independent users confirm it, but the capability category is real.

2. Agentic work and tool use

M2.7 supports Agent Teams, meaning multiple agents collaborate with stable role identity and autonomous decisions [3].

On 40-plus complex skills, each over 2,000 tokens, it holds a 97% skill adherence rate [3].

Skill adherence = how reliably the model follows a defined procedure without drifting.

On Toolathon it reached 46.3%, which MiniMax places in the global top tier, and on the MM Claw end-to-end benchmark it scored 62.7%, close to Sonnet 4.6 [3].

If you run multi-agent setups, that 97% adherence number matters more than any raw coding score, because agent reliability collapses fast when a model ignores its instructions mid-task.

3. Office and professional work

M2.7 scored an ELO of 1495 on GDPval-AA, the highest among open-weight models, surpassing GPT-5.3 on that test [3].

It handles Word, Excel, and PowerPoint with high-fidelity multi-round editing and produces editable deliverables rather than screenshots [3].

For anyone automating document workflows, multi-round editing is the hard part, since most models degrade after the second or third revision pass.

Speed and the highspeed variant

M2.7 ships in two API versions, standard M2.7 and M2.7-highspeed [3]. Same results, faster throughput on highspeed. Both include automatic cache support with no configuration. You pick highspeed when latency matters and standard when cost matters, since highspeed costs double on output [1].

Strong standalone numbers mean little without context. How does M2.7 stack against the models you already pay for?

MiniMax M2.7 vs Claude, ChatGPT, Gemini, Grok, and DeepSeek V4

MiniMax M2.7 vs Claude

Claude Opus wins on the hardest tasks. On VIBE-Pro, M2.7 lands at 55.6%, which MiniMax describes as nearly on par with Opus 4.6 but still under it.

On MM Claw, M2.7 matches Sonnet 4.6 at 62.7% [3].

The gap narrows against Sonnet and widens against Opus.

Price is where M2.7 flips the table. Opus 4.6 output ran around $75 per million tokens [5].

M2.7 runs $1.20 [1].

You give up some accuracy on the toughest problems and pay roughly a fraction of the cost.

For high-volume agentic coding, that math favors M2.7. For one-shot accuracy on a critical refactor, it favors Opus.

MiniMax M2.7 vs ChatGPT

M2.7 matches GPT-5.3-Codex on SWE-Pro at 56.22% and surpasses GPT-5.3 on GDPval-AA ELO [3]. GPT-5.2 output pricing sat near $60 per million tokens [5].

The pattern repeats. Comparable coding output, far lower price. If your ChatGPT bill comes mostly from agentic coding loops, M2.7 is the obvious test candidate.

MiniMax M2.7 vs Gemini 3 Pro

Gemini 3 Pro scored 78% on SWE-bench Verified at roughly $20 per million output tokens [5].

Gemini’s strength is multimodal breadth and Google’s ecosystem, not bottom-dollar coding cost.

M2.7 undercuts it on price and offers open weights, which Gemini does not. Choose Gemini for multimodal and integrated tooling, M2.7 for cheap, self-hostable coding.

MiniMax M2.7 vs Grok

Here is the honest gap. Public, directly comparable benchmark data for the newest Grok against M2.7 on these specific tests is thin in the sources I can verify, and Grok’s latest figures may sit past my January 2026 knowledge cutoff. I won’t fabricate a number to fill the cell.

What I can say is that Grok is closed, while M2.7 ships open weights, and M2.7 publishes detailed agentic and coding benchmarks that Grok’s public materials do not directly mirror. Check current Grok benchmarks yourself before you weigh it against M2.7.

MiniMax M2.7 vs DeepSeek V4

This is the real open-weights race. DeepSeek V4 shipped two models, V4-Pro at 1.6T total / 49B active and V4-Flash at 284B total / 13B active, both open weights with 1M context as the default [2].

DeepSeek claims open-source SOTA on agentic coding benchmarks and world knowledge trailing only Gemini-3.1-Pro [2].

Independent evaluation backs the cost story; NIST’s CAISI found DeepSeek V4 cheaper than GPT-5.4 mini on five of seven benchmarks [6].

Face-off Table

MiniMax M2.7 vs DeepSeek V4

SpecMiniMax M2.7DeepSeek V4
Total parameters230BPro: 1.6T; Flash: 284B
Active parameters per pass~10BPro: 49B; Flash: 13B
Context windowLong context (M-series)1M standard
Open weightsYesYes
Output price /M tokens$1.20 standardCheaper than GPT-5.4 mini on 5/7 CAISI tests
Leans towardCheap, fast, high-frequency agent loopsRaw scale, long context, world knowledge

So which open model do you pick?

M2.7 is leaner, firing only ~10B active parameters per pass, which keeps cost and latency low for high-frequency agent loops [4].

DeepSeek V4-Pro carries far more active parameters and a 1M context window, which suits long-document and large-codebase work [2].

M2.7 for cheap, fast, repetitive agentic tasks. DeepSeek V4 for raw scale and context length.

Quality is half the decision. Cost is the half that usually triggers the switch.

MiniMax M2.7 pricing and cost-per-task

Pricing Table

MiniMax M2.7 API pricing

VariantInput /MOutput /MCache read /MCache write /M
MiniMax M2.7$0.30$1.20$0.06$0.375
MiniMax M2.7-highspeed$0.60$2.40$0.06$0.375
Check live pricing

Standard M2.7 costs $0.30 per million input tokens and $1.20 per million output tokens, with cache reads at $0.06. M2.7-highspeed costs $0.60 input and $2.40 output, same cache read price [1]. Cache writes run $0.375 per million tokens on both [1].

Run the math on a single heavy coding task. MiniMax’s earlier M2.5 guide pegged a complex SWE-bench task at roughly 3.52 million tokens, costing about $8.45 in output at $2.40/M [5].

The same task on an Opus-tier model at ~$75/M output ran near $264, a difference of roughly 30x for near-identical benchmark results [5]. M2.7 sits at the same $1.20 standard output price as M2.5, so the cost advantage holds [1].

Bar chart comparing cost of one coding task, about 4 to 8 dollars on MiniMax M2.7 versus about 264 dollars on an Opus-tier model
The same coding task costs roughly 30x less on M2.7 than on an Opus-tier model.

What does that mean for you? If you run hundreds of agentic tasks a month, the price gap stops being a line item and starts deciding which model you can afford to use at all.

Is MiniMax M2.7 open source?

No, not in the strict sense, and the distinction matters. MiniMax published the model weights on Hugging Face and ModelScope, so you can download and run M2.7 locally [7].

Open weights and open source are different things, though. A Hacker News thread on the release made the point bluntly; the weights being public follows the spirit of open source, but it does not make the model open source by definition [8].

What can you actually do? You can download the weights, serve them with SGLang, vLLM, or Transformers, and run inference yourself [7].

What you cannot assume is unrestricted commercial freedom or access to training data and code. Read MiniMax’s actual license terms before you build a product on top of it. “Open weights” is a real benefit, but it is not a blank check.

What real users say

Early community reaction skews positive with a recurring asterisk. One user running a 9-agent OpenClaw setup across Discord and iMessage reported M2.7 as noticeably better than the previous version [9].

On r/LocalLLaMA, the prior M2 release drew praise as the best open-source model at the time, fast and genuinely capable, though some flagged it as expensive to self-host at scale [10].

Pull Quote
I’ve been using MiniMax 2.7 high speed for a 9-agent OpenClaw setup across Discord and iMessage. It’s noticeably better than the last version.
r/openclaw user

The consistent caveat is the open-source label. Several threads pushed back on calling the weights-only release “open source” [8].

So the rough community read is this. The model is fast and strong for agentic coding, the version-over-version gains are real, and the licensing framing deserves a closer look before you depend on it.

How can you access MiniMax M2.7

You have several routes, depending on whether you want a ready-made agent, raw API access, or a local install.

The easiest path is Genspark. You can access MiniMax M2.7 free and unlimited on any paid Genspark plan, with the model available directly inside Genspark Chat alongside other frontier options [11].

You select it from the model picker and start working, no separate MiniMax account or API key required.

For direct access, MiniMax offers three of its own channels. MiniMax Agent at agent.minimax.io gives you a ready-to-use platform with no setup [3].

The MiniMax API and Token Plan at platform.minimax.io provide OpenAI-compatible endpoints for both M2.7 and M2.7-highspeed, billed per token [1][3].

The Token Plan is the better fit if you run steady volume, since it bundles usage at a fixed rate [3].

If you want to run M2.7 yourself, download the open weights from Hugging Face or ModelScope and serve them with SGLang, vLLM, or Transformers [7].

Ollama and NVIDIA NIM also host the model for quick local or hosted deployment. For local runs, MiniMax recommends temperature 1.0, top_p 0.95, and top_k 40 [7].

Self-hosting a 230B MoE model needs serious GPU memory even at 10B active parameters, so test through Genspark or the API first before committing hardware.

The quickest way to try M2.7 right now, at no extra cost, is to open it inside Genspark on your existing paid plan and run your hardest task through it.

Should you switch to MiniMax M2.7?

Pick M2.7 if cost-per-task drives your decision and you run agentic coding at volume. The output price of $1.20 per million tokens against Opus-tier rates near $75 makes sustained agent loops affordable in a way frontier closed models do not [1][5].

Stay on Claude Opus if you need the absolute top score on the hardest engineering tasks, where M2.7 still trails on VIBE-Pro and the toughest subsets [3].

Weigh DeepSeek V4 instead if you need a 1M context window or broader world knowledge, since that is where V4 leads M2.7 [2].

The fastest way to decide is to run your own hardest task through the M2.7 API and your current model side by side, then compare the output quality against the bill. The benchmarks point you in a direction, but your specific workload settles it. Test it before you migrate anything in production.

Frequently asked questions

MiniMax M2.7 is not free to use through the API, costing $0.30 per million input and $1.20 per million output tokens [1]. The weights are downloadable at no cost from Hugging Face, but you pay for the hardware to run them yourself [7].

M2.7 holds 230B total parameters with roughly 10B active per forward pass, using a mixture-of-experts architecture [4]. That sparsity keeps inference cost and latency low relative to the total parameter count.

It depends on the task. M2.7 matches Sonnet 4.6 on MM Claw and nears Opus 4.6 on VIBE-Pro, but Opus still leads on the hardest problems [3]. M2.7 wins decisively on price.

Yes. Download the weights from Hugging Face or ModelScope and serve them with SGLang, vLLM, or Transformers [7]. You need substantial GPU memory even with 10B active parameters.

Both produce identical results, but highspeed runs faster and costs double on output, $2.40 versus $1.20 per million tokens [1][3]. Choose highspeed for latency-sensitive work, standard for cost-sensitive volume.

M2.7 added the self-evolution training loop, Agent Teams for multi-agent work, and stronger office editing, while shifting its headline coding benchmark to the harder SWE-Pro test at 56.22% [3].


Sources

Leave a Reply

Your email address will not be published. Required fields are marked *