I'm sure this is a very impressive model, but gemini-3-pro-preview is failing *s...

WhitneyLand · 2025-11-18T16:21:04 1763482864

>>benchmarks are meaningless

No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.

>>my fairly basic python benchmark

I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.

NaomiLehman · 2025-11-18T17:06:12 1763485572

they are not meaningless, but when you work a lot with LLMs and know them VERY well, then a few varied, complex prompts tell you all you need to know about things like EQ, sycophancy, and creative writing.

I like to compare them using chathub using the same prompts

Gemini still calls me "the architect" in half of the prompts. It's very cringe.

mpalmer · 2025-11-19T00:34:17 1763512457

    Gemini still calls me "the architect" in half of the prompts. It's very cringe.

Can't say I've ever seen this in my own chats. Maybe it's something about your writing style?

NaomiLehman · 2025-11-19T10:13:16 1763547196

it absolutely does. and human employees don't call me "the architect." that's the point.

gregw2 · 2025-11-19T13:27:07 1763558827

I wonder if under the covers it uses your word choices to infer your Myers-Briggs personality type and you are INTJ so it calls you "The Architect"?? Crazy thought but conceivable...

NaomiLehman · 2025-11-20T20:59:49 1763672389

Possible :) GPT 5.1-thinking, Sonnet 4.5-thinking, and Grok 4.1 don't make up names like that.

sothatsit · 2025-11-19T01:06:20 1763514380

It’s very different to get a “vibe check” for a model than to get an actual robust idea of how it works and what it can or can’t do.

This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only for people to change their minds later when they’ve had more time to use it and learn its strengths and weaknesses. It takes time for people to really get to grips with a new model, not just a few prompt comparisons where luck and prompt selection will play a big role.

beepbooptheory · 2025-11-18T20:29:58 1763497798

I get that one can perhaps have an intuition about these things, but doesn't this seem like a somewhat flawed attitude to have all things considered? That is, saying something to the effect of "well I know its not too sycophantic, no measurement needed, I have some special prompts of my own and it passed with flying colors!" just sounds a little suspect on first pass, even if its not like totally unbelievable I guess.

dekhn · 2025-11-18T16:43:38 1763484218

Using a single custom benchmark as a metric seems pretty unreliable to me.

Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.

prodigycorp · 2025-11-18T19:00:19 1763492419

after taking a walk for a bit i decided you’re right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I’ve run.

This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.

While i still believe in the importance of a personalized suite of benchmarks, my python one needs to be down weighted or supplanted.

my bad to the google team for the cursory brush off.

chermi · 2025-11-18T20:18:39 1763497119

Walks are magical. But also this reads partially like you got sent to a reeducation camp lol.

nomel · 2025-11-18T21:32:53 1763501573

> This probably means my test is a little too niche.

> my python one needs to be down weighted or supplanted.

To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.

I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".

relaytheurgency · 2025-11-18T23:37:26 1763509046

I feel similarly. If you're working with some relatively niche APIs on services that don't get seen by the public, the AI isn't one-shotting anything. But I still find it helpful to generate some crap that I can then feel good about fixing.

agentcoops · 2025-11-19T13:06:53 1763557613

I definitely agree on the importance of personalized benchmarks for really feeling when, where and how much progress is occurring. The standard benchmarks are important, but it’s hard to really feel what a 5% improvement in X exam means beyond hype. I have a few projects across domains that I’ve been working on since ChatGPT 3 launched and I quickly give them a try on each new model release. Despite popular opinion, I could really tell a huge difference between GPT 4 and 5 , but nothing compared to the current delta between 5.1 and Gemini 3 Pro…

TLDR; I don’t think personal benchmarks should replace the official ones of course, but I think the former are invaluable for building your intuition about the rate of AI progress beyond hype.

lofaszvanitt · 2025-11-19T05:39:21 1763530761

No, do not share it. The bigger black hole these models are in, the better.

thefourthchime · 2025-11-18T16:28:14 1763483294

I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5.

bitexploder · 2025-11-18T19:10:26 1763493026

Something else to consider. I often have much better success with something like: Create a prompt that creates a specification for a pacman game in a single html page. Consider edge cases and key implementation details that result in bugs. <take prompt>, execute prompt. It will often yield a much better result than one generic prompt. Now that models are trained on how to generate prompts for themselves this is quite productive. You can also ask it to implement everything in stages and implement tests, and even evaluate its tests! I know that isn't quite the same as "Implement pacman on an HTML page" but still, with very minimal human effort you can get the intended result.

amelius · 2025-11-18T20:28:31 1763497711

I thought this kind of chaining was already part of these systems.

bitexploder · 2025-11-19T17:46:56 1763574416

It can be, but the more specific context you can give the better, especially on your initial prompting. If it is opaque to you who knows what it is doing. Dialing in the initial spec/prompt for 5 minutes is still important. Different LLMs and models will do better or worse on this and by being a human in the loop on this initial stuff my experience is much higher quality, which indicates to me, the LLM tries, but just doesn't always have enough info to implement your intentions in many cases yet.

Workaccount2 · 2025-11-18T18:27:34 1763490454

It made a working game for me (with a slightly expanded prompt), but the ghosts got trapped in the box after coming back from getting killed. A second prompt fixed it. The art and animation however was really impressive.

ofa0e · 2025-11-18T16:36:01 1763483761

Your benchmarks should not involve IP.

sowbug · 2025-11-18T17:13:41 1763486021

The only intellectual property here would be trademark. No copyright, no patent, no trade secret. Unless someone wants to market the test results as a genuine Pac-Man-branded product, or otherwise dilute that brand, there's nothing should-y about it.

bongodongobob · 2025-11-18T18:55:41 1763492141

It's not an ethics thing. It's a guardrails thing.

sowbug · 2025-11-18T19:37:32 1763494652

That's a valid point, though an average LLM would certainly understand the difference between trademark and other forms of IP. I was responding to the earlier comment, whose author later clarified that it represented an ethical stance ("stealing the hard work of some honest, human souls").

ComplexSystems · 2025-11-18T16:42:20 1763484140

Why? This seems like a reasonable task to benchmark on.

adastra22 · 2025-11-18T16:54:06 1763484846

Because you hit guard rails.

ofa0e · 2025-11-18T16:54:24 1763484864

Sure, reasonable to benchmark on if your goal is to find out which companies are the best at stealing the hard work of some honest, human souls.

scragz · 2025-11-18T17:03:33 1763485413

correction: pacman is not a human and has no soul.

WhyOhWhyQ · 2025-11-19T01:13:36 1763514816

Why do you have to willfully misinterpret the person you're replying to? There's truth in their comment.

sosodev · 2025-11-18T16:26:18 1763483178

How can you be sure that your benchmark is meaningful and well designed?

Is the only thing that prevents a benchmark from being meaningful publicity?

prodigycorp · 2025-11-18T16:32:46 1763483566

I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.

I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me.

gregsadetsky · 2025-11-18T16:49:34 1763484574

I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?

I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?

Thanks

[0] https://news.ycombinator.com/item?id=45968665

adastra22 · 2025-11-18T16:55:27 1763484927

> if it's not public, presumably LLMs would never get better at them.

Why? This is not obvious to me at all.

gregsadetsky · 2025-11-18T17:06:23 1763485583

You're correct of course - LLMs may get better at any task of course, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at the task. If the eval was actually picked up / used in the training loop, of course.

adastra22 · 2025-11-18T17:11:13 1763485873

That kind of “get better at” doesn’t generalize. It will regurgitate its training data, which now includes the exact answer being looked for. It will get better at answering that exact problem.

But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.

Problem solving ability is largely not from the pretraining data.

gregsadetsky · 2025-11-18T17:21:10 1763486470

Yeah, great point.

I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization)

benterix · 2025-11-18T16:16:07 1763482567

> This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.

Yeah I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 pro that was performing much better on the same tests several months ago vs. now.

Archer6621 · 2025-11-19T07:25:17 1763537117

I wonder whether it could be related to some kind of over-fitting, i.e. a prompting style that tends to work better with the older models, but performs worse with the newer ones.

adastra22 · 2025-11-18T16:57:27 1763485047

I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.

Iulioh · 2025-11-18T16:22:28 1763482948

A lot of newer models are geared towards efficency and if you add the fact that more efficent models are trained on the output of less efficent (but more accurate) models....

GPT4/3o might be the best we will ever have

ddalex · 2025-11-18T16:00:39 1763481639

I moved to using the model from python coding to golang coding and got incredible speedups in writing the correct version of the code

layer8 · 2025-11-18T19:04:48 1763492688

Is observed speed meaningful for a model preview? Isn’t it likely to go down once usage goes up?

mring33621 · 2025-11-18T16:10:02 1763482202

I agree that benchmarks are noise. I guess, if you're selling an LLM wrapper, you'd care, but as a happy chat end-user, I just like to ask a new model about random stuff that I'm working on. That helps me decide if I like it or not.

I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.

IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.

This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!

t0mas88 · 2025-11-18T21:55:59 1763502959

Google reports a lower score for Gemini 3 Pro on SWEBench than Claude Sonnet 4.5, which is comparing a top tier model with a smaller one. Very curious to see whether there will be an Opus 4.5 that does even better.

testartr · 2025-11-18T16:12:08 1763482328

and models are still pretty bad at playing tic-tac-toe, they can do it, but think way too much

it's easy to focus on what they can't do

big-and-small · 2025-11-18T17:05:00 1763485500

Everything is about context. When you just ask non-concrete task it's still have to parse your input and figure what is tic-tac-toe in this context and what exactly you expect it to do. This is why all "thinking".

Ask it to implement tic-tac-toe in Python for command line. Or even just bring your own tic-tac toe code.

Then make it imagine playing against you and it's gonna be fast and reliable.

testartr · 2025-11-18T22:52:08 1763506328

prompt was very concrete: draw a tic tac toe ASCII table and let's play. gemini 2.5 thought for pages particular moves

luckydata · 2025-11-18T16:53:34 1763484814

I'm dying to know what you're giving to it that's choking on. It's actually really impressive if that's the case.

nomel · 2025-11-18T22:03:39 1763503419

I find this hard to understand. I have AI completely choke on my code constantly. What are you doing where it performs so well? Web?

I constantly see failures in trivial vectors projections, broken bash scripts that don't properly quote variables (fail if space in filenames), and near completely inability to do relatively basic image processing tasks (if they don't rely on template matches).

I accidentally spent $50 on Gemeni 2.5 Pro last week, with Roo, trying to make a simple Mock interface for some lab equipment. The result: it asks permission to delete everything it did and start over...

mupuff1234 · 2025-11-18T15:56:58 1763481418

Could also just be rollout issues.

prodigycorp · 2025-11-18T15:58:38 1763481518

Could be. I'll reply to my comment later with pass/fail results of a re-run.

Rover222 · 2025-11-18T16:22:02 1763482922

curious if you tried grok 4.1 too

Filligree · 2025-11-18T15:53:10 1763481190

What's the benchmark?

ahmedfromtunis · 2025-11-18T15:59:49 1763481589

I don't think it would be a good idea to publish it on a prime source of training data.

Hammershaft · 2025-11-18T16:06:08 1763481968

He could post an encrypted version and post the key with it to avoid it being trained on?

benterix · 2025-11-18T16:14:12 1763482452

What makes you think it wouldn't end up in the training set anyway?

rs186 · 2025-11-18T18:08:39 1763489319

I wouldn't underestimate the intelligence of agentic AI, despite how stupid they are today.

stefs · 2025-11-18T23:01:33 1763506893

Every AI corp has people reading HN.

mlrtime · 2025-11-19T12:33:06 1763555586

This sounds like paranoia to me to be honest. Please tell me I'm wrong.

I could have easily come up with just the same claim, without seeing the benchmark, it doesn't exist.

Maybe if we weren't anonymous and your profile leads to credentials that you have experience in this field, otherwise I don't believe it without seeing/testing myself.

shawabawa3 · 2025-11-19T10:44:44 1763549084

but they've asked all the AI models this question. Whatever you tell an AI model is also in its training data

petters · 2025-11-18T15:58:36 1763481516

Good personal benchmarks should be kept secret :)

mlrtime · 2025-11-19T12:33:56 1763555636

pclmulqdq · 2025-11-19T13:47:52 1763560072

Avoiding contamination is very useful when you want an honest evaluation of something.

GuB-42 · 2025-11-19T12:48:54 1763556534

NIBBLES.BAS maybe [1]

If you make some assumptions about the species of the snake, it can count as a basic python benchmark ;)

[1] https://en.wikipedia.org/wiki/Nibbles_(video_game)

prodigycorp · 2025-11-18T16:01:03 1763481663

nice try!

ankit219 · 2025-11-18T18:42:59 1763491379

you already sent the prompt to gemini api - and they likely recorded it. So in a way they can access it anyway. Posting here or not would not matter in that aspect.

m00dy · 2025-11-18T15:51:21 1763481081

that's why everyone using AI for code should code in rust only.