I urge everyone to go read the original report and _then_ to read this analysis and make up their own mind. Step away from the clickbait, go read the original report.
> DeepSeek models cost more to use than comparable U.S. models
They compare DeepSeek v3.1 to GPT-5 mini. Those are very different sizes, which makes it a weird choice. I would have expected a comparison with GPT-5 High, which would likely have produced the opposite finding, given GPT-5 High's high cost and relatively similar results.
Granted, DeepSeek typically focuses on a single model at a time, unlike OpenAI's approach of offering a suite of models at varying costs. So there is no DeepSeek model similar to GPT-5 mini, unlike Alibaba, which has Qwen 30B A3B. Still, a weird choice.
Besides, DeepSeek has shown with 3.2 that it can cut prices in half through further fundamental research.
> CAISI chose GPT-5-mini as a comparator for V3.1 because it is in a similar performance class, allowing for a more meaningful comparison of end-to-end expenses.
TLDR for others:
* DeepSeek's cutting-edge models are still far behind
* At comparable performance, DeepSeek costs 35% more to run
* DeepSeek models are 12 times more susceptible to jailbreaking and malicious instructions
* DeepSeek models follow strict censorship
I guess none of these are a big deal to non-enterprise consumers.
Token price on 3.2 exp is <5% of what the US LLMs charge, and it's very close in benchmarks, which we know ChatGPT, Google, Grok, and Claude have explicitly gamed to inflate their capabilities.
Read a study called "The Leaderboard Illusion", which credibly alleged that Meta, Google, OpenAI, and Amazon got unfair treatment from LM Arena that distorted the benchmarks.
LM Arena gave them special access to test privately and let them benchmark over and over without ever showing the failed runs.
Meta got to privately test 27 variants of Llama 4 to optimize for high benchmark scores, and was then allowed to report only the highest, cherry-picked result (the selection effect this creates is sketched below).
Which makes sense, because in real-world applications Llama is widely recognized as markedly inferior to models that scored lower.
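The statistics of "test many variants privately, publish only the best" are worth spelling out: the maximum of many noisy scores is biased upward even when the underlying model never improves. A minimal simulation, using invented numbers rather than any real leaderboard data:

```python
import random

random.seed(0)

TRUE_SKILL = 0.70   # invented "real" win rate of the model
NOISE = 0.03        # invented run-to-run spread in leaderboard scores
N_VARIANTS = 27     # private submissions, per the allegation
TRIALS = 10_000     # Monte Carlo repetitions

total_best = 0.0
for _ in range(TRIALS):
    # Each private submission is a noisy estimate of the same model.
    scores = [random.gauss(TRUE_SKILL, NOISE) for _ in range(N_VARIANTS)]
    # Only the best run is published; the failures are never shown.
    total_best += max(scores)

print(f"mean published score: {total_best / TRIALS:.3f} vs. true skill {TRUE_SKILL}")
```

With these made-up parameters the published number lands roughly two noise standard deviations above the true skill, purely from picking the best of 27 runs.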
Which is one study that touches exactly one benchmark - and "credibly alleged" is being way too generous to it. The only case that came anywhere close to proven LMArena fraud is Meta and Llama 4, which is a nonentity now - nowhere near SOTA on anything, LMArena included.
Not that this makes LMArena a perfect benchmark. By now, everyone who wants to push LMArena ratings at any cost knows what the human evaluators there are weak to and what to aim for.
But your claim of "we know that ChatGPT, Google, Grok and Claude have explicitly gamed <benchmarks> to inflate their capabilities" still has no leg to stand on.
There are plenty of other cases, well beyond LMArena, where benchmark gains by the major US labs were shown to be attributable only to over-optimization for the specific benchmark, some in ways not explainable by mere contamination of the training corpus.
There are cases where merely rewording the questions or assigning different letters to the answers dropped models like Llama by 30% in the evaluations while others were unchanged.
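That kind of perturbation test is easy to sketch. Below is a minimal version, assuming a hypothetical `ask_model(prompt)` callable that returns the model's chosen letter; a model that reasons about the content should be invariant to which letter carries which option:

```python
import random
import string

def permutation_probe(question, choices, correct_text, ask_model, n_perms=8):
    """Re-ask one multiple-choice question with the options shuffled
    under fresh letters. A model reasoning about content keeps its
    accuracy; one that memorised "the answer to this one is C" does not.
    `ask_model(prompt) -> str` is a hypothetical callable returning
    the model's chosen letter."""
    hits = 0
    for _ in range(n_perms):
        shuffled = random.sample(choices, k=len(choices))
        letters = string.ascii_uppercase[:len(shuffled)]
        prompt = question + "\n" + "\n".join(
            f"{letter}. {option}" for letter, option in zip(letters, shuffled))
        picked = ask_model(prompt).strip().upper()[:1]
        # Map the returned letter back to option text before comparing,
        # so the score tracks content rather than letter position.
        if picked in letters and shuffled[letters.index(picked)] == correct_text:
            hits += 1
    return hits / n_perms
```

Averaging this probe over a whole benchmark and comparing against the unshuffled score gives exactly the kind of gap described above; a large drop points to label memorization rather than capability.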
Open-LLM-Leaderboard had to add rate limits because a "handful of labs" were running so many evals in a single day that they hogged the entire eval cluster.
* “Coding Benchmarks Are Already Contaminated” (Ortiz et al., 2025)
* “GSM-PLUS: A Re-translation Reveals Data Contamination” (Shi et al., ACL 2024)
* “Prompt-Tuning Can Add 30 Points to TruthfulQA” (Perez et al., 2023)
* “HellaSwag Can Be Gamed by a Linear Probe” (Rajpurohit & Berg-Kirkpatrick, EMNLP 2024)
* “Label Bias Explains MMLU Jumps” (Hassan et al., arXiv 2025)
* “HumanEval-Revival: A Re-typed Test for LLM Coding Ability” (Yang & Liu, ICML 2024 workshop)
* “Data Contamination or Over-fitting? Detecting MMLU Memorisation in Open LLMs” (IBM, 2024)
And yes, I relied on an LLM to summarize these instead of reading the full papers.
> I urge everyone to go read the original report and _then_ to read this analysis and make up their own mind. Step away from the clickbait, go read the original report.
Sadly, based on the responses I don’t think many people have read the report. Just read how the essay discusses “exfiltration”, for example, and then look at the three places it shows up in the NIST report. The content of the report and the portrayal by the essay are not the same. Alas, our truncated attention spans these days appear to mean a clickbaity web page will win the eye share over a 70-page technical report.
I don't think the majority of humans ever had the attention span to read and properly digest a paper like the NIST report and make up their own minds. Before social media, regular media would tell them what to think. 99.99% of the population isn't going to read that NIST report, no matter what decade we're talking about.
Because it isn't just that one report. Every single day we're trying to make our way in the world, and we do not have the capacity to read the source material on every subject that might be of interest. Humans rely on, and have always relied on, authority figures, media, or some other form of message aggregation to get their news of the world, and they form their opinions from that.
And for the record, this is in no way an endorsement of shallow takes, or of forming strong views on this subject or any other from them. I disagree with that as much as you do. I'm just saying this isn't a new phenomenon.