
Noam Brown:

> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.

> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.

I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.

https://x.com/polynoamial/status/1946478249187377206



> it’s also more efficient [than o1 or o3] with its thinking.

"So under his saturate response, he never loses. For her to win, must make him unable at some even -> would need Q_{even-1}>even, i.e. some a_j> sqrt2. but we just showed always a_j<=c< sqrt2. So she can never cause his loss. So against this fixed response of his, she never wins (outcomes: may be infinite or she may lose by sum if she picks badly; but no win). So she does NOT have winning strategy at λ=c. So at equality, neither player has winning strategy."[1]

Why use lot word when few word do trick?

1. https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro...


That's a big leap from "answering test questions" to "contributing to scientific discovery".


Having spent tens of thousands of hours contributing to scientific discovery by reading dense papers for a single piece of information, reverse engineering code written by biologists, and tweaking graphics to meet journal requirements… I can say with certainty it’s already contributing by allowing scientists to spend time on science versus spending an afternoon figuring out which undocumented argument in an R package from 2008 changes chart labels.


This. Even if LLMs ultimately hit some hard ceiling as substantially-better-Googling-automatons, they would already accelerate all thought-based work across the board, and that’s the level they’re already at now (arguably they’re beyond that).

We’re already at the point where these tools are removing repetitive/predictable tasks from researchers (and everyone else), so clearly they’re already accelerating research.


Not sure how you get around the contamination problems. I use these every day and they are prone to making errors that are hard to perceive.

They are not reliable tools for any tasks that require accurate data.


That is not what they mean by contributing to scientific discovery.


Perhaps not, but from personal experience and from knowing what’s going on in labs right now, my point stands: AI is greatly contributing to research, even if it’s not doing the parts that most people think of when they think of science. A sufficiently advanced AI isn’t going to start churning out novel hypotheses and collecting offline data in the near term without first being able to secure funding to hire grad students, or whatever robots end up replacing them.


Yeah, that’s the dream, but same as with the bar exams, they are fine-tuning the models for specific tests. And the model has probably even been trained on previous versions of those tests.


What's the clear path to improved efficiency now that we've reached peak data?


> now that we've reached peak data?

A) that's not clear

B) now we have "reasoning" models that can be used to analyse the data, create n rollouts for each data piece, and "argue" for / against / neutral on every piece of data going into the model. Imagine taking every page of a "short story book" plus the 10 best "how to write" books, and doing n x n over them. Huge compute, but basically infinite data as well.

We went from "a bunch of data" to "even more data" to "basically everything we got" to "ok, maybe use a previous model to sort through everything we got and only keep quality data" to "ok, maybe we can augment some data with synthetic datasets from tools etc" to "RL goes brrr" to (point B from above) "let's mix the data with quality sources on best practices".
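A rough sketch of that "n rollouts, argue for/against/neutral" idea from point B (purely illustrative; reasoning_model is a stand-in for whatever model you have, not any real API):

    def augment(passage, guides, reasoning_model, n=4):
        # For each (passage, guideline) pair, sample n rollouts per stance.
        examples = []
        for guide in guides:  # e.g. pages from the "how to write" books
            for stance in ("for", "against", "neutral"):
                for _ in range(n):
                    prompt = (f"Critique the passage below, arguing {stance}, "
                              f"using this guideline:\n{guide}\n\nPassage:\n{passage}")
                    examples.append({"prompt": prompt,
                                     "completion": reasoning_model(prompt)})
        return examples  # synthetic training data squeezed out of existing text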


Are you basically saying synthetic data and having a bunch of models argue with each other to distill the most agreeable of their various outputs solves the issue of peak data?

Because from my vantage point, those have not given step changes in AI utility the way crunching tons of data did. They have only incrementally improved things


A) We are out of Internet-scale-for-free data. Of course, the companies deploying LLM-based systems at massive scale are ingesting a lot of human data from their users, which they are seeking to use to further improve their models.

B) Has learning through "self-play" (like with AlphaZero etc) been demonstrated to work for improving LLMs? What is the latest key research on this?


Certainly the models have orders of magnitude more data available to them than the smartest human being who ever lived does/did. So we can assume that if the goal is "merely" superhuman intelligence, data is not a problem.

It might be a constraint on the evolution of godlike intelligence, or AGI. But at that point we're so far out in bong-hit territory that it will be impossible to say who's right or wrong about what's coming.

> Has learning through "self-play" (like with AlphaZero etc) been demonstrated to work for improving LLMs?

My understanding (which might be incorrect) is that this amounts to RLHF without the HF part, and is basically how DeepSeek-R1 was trained. I recall reading about OpenAI being butthurt^H^H^H^H^H^H^H^H concerned that their API might have been abused by the Chinese to train their own model.
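To be concrete about the "without the HF" part: the reward comes from an automatic check rather than a human rating. A hedged sketch (names are made up, not any real training API):

    def verifiable_reward(completion, reference_answer):
        # Reward 1.0 if whatever follows "Answer:" matches the known-correct answer.
        answer = completion.split("Answer:")[-1].strip()
        return 1.0 if answer == reference_answer else 0.0

    def score_rollouts(rollouts, reference_answer):
        # Each sampled rollout gets a scalar reward; a PPO/GRPO-style update
        # then pushes the model toward the higher-reward rollouts.
        return [verifiable_reward(r, reference_answer) for r in rollouts]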


Superhuman capability within the tasks that are well represented in the dataset, yes. If one takes the view that intelligence is the ability to solve novel problems (ref F. Chollet), then the amount of data alone might not take us to superhuman intelligence. At least without new breakthroughs in the construction of models or systems.

R1 managed to replicate a model on the level of one they had access to. But as far as I know they did not improve on its predictive performance? They did improve on inference time, but that is another thing. The ability to replicate a model is well demonstrated and has been common practice for some years already; see teacher-student distillation.
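(For reference, the core of teacher-student distillation is just a loss that pulls the student's output distribution toward the teacher's; a minimal sketch, with illustrative hyperparameters:)

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions, then push the student toward the teacher.
        t = temperature
        student_logp = F.log_softmax(student_logits / t, dim=-1)
        teacher_p = F.softmax(teacher_logits / t, dim=-1)
        return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)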


The thing is, people claimed already a year or two ago that we'd reached peak data and progress would stall since there was no more high-quality human-written text available. Turns out they were wrong, and if anything progress accelerated.

The progress has come from all kinds of things. Better distillation of huge models to small ones. Tool use. Synthetic data (which is not leading to model collapse like theorized). Reinforcement learning.

I don't know exactly where the progress over the next year will be coming from, but it seems hard to believe that we'll just suddenly hit a wall on all of these methods at the same time and discover no new techniques. If progress had slowed down over the last year the wall being near would be a reasonable hypothesis, but it hasn't.


I'm loving it, can't wait to deploy this stuff locally. The mainframe will be replaced by commodity hardware, OpenAI will stare down the path of IBM unless they reinvent themselves.


> people claimed already a year or two ago that we'd reached peak data and progress would stall

The claim was that we've reached peak data (which, yes, we have) and that progress would have to come from new models or techniques. Everything you described has produced incremental changes, not step changes. Incremental changes are effectively stalled progress. Even this model has no proof and no release behind it.


There is also a huge realm of private/commercial data which has not been absorbed by LLMs yet. I think there is way more private/commercial data than public data.


We're so far from peak data that we've barely even scratched the surface, IMO.


What changed from this announcement?

> “We’ve achieved peak data and there’ll be no more,” OpenAI’s former chief scientist told a crowd of AI researchers.


I assume there was tool use in the fine tuning?


There wasn’t in the CoT for these problems.


> I think we’re close to AI substantially contributing to scientific discovery.

The new "Full Self-Driving next year"?


"AI" already contributes "substantially" to "scientific discovery". It's a very safe statement to make, whereas "full self-driving" has some concrete implications.


"AI" here means language models. Machine learning has been contributing to scientific discovery for ages, but this new wave of hype that marketing departments are calling "AI" are language models.


Well, I also think full self-driving contributes substantially to navigating the car on the street...


I know it’s a meme, but there actually are fully self-driving cars; they make thousands of trips every day in a couple of US cities.


The capitalization makes it a Tesla reference, which has notoriously been promising that as an un-managed consumer capability for years, while it is not yet launched even now.


> in a couple US cities

FWIW, when you get this reductive with your criterion there were technically self-driving cars in 2008 too.


To be a bit more specific: no, they were not routinely making thousands of taxi rides with paying customers every day in 2008.


We can go further. Automated trains have cars. Streetcars are automatable since the track is fixed.

And both of these reduce traffic


I thought FSD has to be at least level 4 to be called that.


As an aside, that is happening in China right now in commercial vehicles. I rode a robotaxi last month in Beijing, and those services are expanding throughout China. Really impressive.


We have Waymo and AlphaFold.


How is a claim, "clear evidence" to anything?


I read the GP's comment as "but [assuming this claim is correct], this is clear evidence to the contrary."


Most evidence you have about the world is claims from other people, not direct experiment. There seems to be a thought-terminating cliche here on HN, dismissing any claim from employees of large tech companies.

Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to. Noam Brown is a well known researcher in the field and I see no reason to doubt these claims other than a vague distrust of OpenAI or big tech employees generally which I reject.


> I judge people's trustworthiness individually and not solely by the organization they belong to

This is certainly a courageous viewpoint – I imagine this makes it very hard for you to engage in the modern world? Most of us are very bound by institutions we operate in!


Over the years my viewpoint has led to great success in predicting the direction and speed of development of many technologies, among other things. As a result, by objective metrics of professional and financial success, I have done very well. I think your imagination is misleading you.


> dismissing any claim from employees of large tech companies

Me: I have a way to turn lead into gold.

You: Show me!!!

Me: NO (and then I spend the rest of my life in poverty).

Cold Fusion (the physics, not the programming language) is the best example of why you "show your work". This is the Valley we're talking about. It's the thunderdome of technology companies. If you have a meaningful breakthrough you don't talk about it, you drop it on the public and flex.


I don't think this is a reasonable take. Some people/organizations send signals about things they're not ready to fully drop on the world. Others consider those signals in context (reputation of sender, prior probability of being true, reasons for sender to be honest vs. deceptive, etc).

When my wife tells me there's a pie in the oven and it's smelling particularly good, I don't demand evidence or disbelieve the existence of the pie. And I start to believe that it'll probably be a particularly good pie.

This is from OpenAI. Here they've not been so great with public communications in the past, and they have a big incentive in a crowded marketplace to exaggerate claims. On the other hand, it seems like a dumb thing to say unless they're really going to deliver that soon.


> Some people/organizations send signals about things they're not ready to fully drop on the world.

This is called marketing.

> When my wife tells me there's a pie in the oven and it's smelling particularly good, I don't demand evidence

Because you have evidence, it smells.

And if later you ask your wife "where is the pie?" and she says "I sprayed pie scent in the air, I was just signaling", how are you going to feel?

OpenAI spent its "fool us once" card already. Doing things this way does not earn back trust after failures to deliver (and they have done that more than once)... see the staff non-disparagement clauses, see the math benchmark fiasco, see open weights.


> This is called marketing.

Many signals are marketing, but the purpose of signals is not purely to develop markets. We all have to determine what we think will happen next and how others will act.

> Because you have evidence, it smells.

I think you read that differently than what I intended to write -- she claims it smells good.

> Open AI spent its "fool us once" card already.

> > This is from OpenAI. Here they've not been so great with public communications in the past, and they have a big incentive in a crowded marketplace to exaggerate claims.


A thought-terminating cliché? Not at all, certainly not when it comes to claims of technological or scientific breakthroughs. After all, that's partly why we have peer review and an emphasis on reproducibility. Until such a claim has been scrutinised by experts or reproduced by the community at large, it remains an unverified claim.

>> Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to.

That has nothing to do with anything I said. A claim can be false without being fraudulent; in fact, most false claims are probably not fraudulent, though still false.

Claims are also very often contested. See e.g. the various claims of quantum supremacy and the debate they have generated.

Science is a debate. If we believe everything anyone says automatically, then there is no debate.


They don't give a lot of details, but they give enough that it's pretty hard to say the claim is false but not fraudulent.

Some researchers got a breakthrough and decided to share it right then rather than months later, when there would be a viable product. It happens; researchers are human after all, and I'm generally glad to take a peek at the actual frontier rather than at what's behind by many months.

You can ignore such claims until then, and that's fair, but I think anything more than that is fairly uncharitable given the situation.


I'm not ignoring it. I'm waiting to see evidence of it. Is that uncharitable?


OpenAI have already shown us they aren’t trustworthy. Remember the FrontierMath debacle?


It's only a "debacle" if you already assume OpenAI isn't trustworthy, because they said they don't train on the test set. I hope you can see that presenting your belief that they lied about training on the test set as evidence of them being untrustworthy is a circular argument. You're assuming the thing you're trying to prove.

The one OpenAI "scandal" that I did agree with was the thing where they threatened to cancel people's vested equity if they didn't sign a non-disparagement agreement. They did apologize for that one and make changes. But it doesn't have a lot to do with their research claims.

I'm open to actual evidence that OpenAI's research claims are untrustworthy, but again, I also judge people individually, not just by the organization they belong to.


They funded the entire benchmark and didn’t disclose their involvement. They then proceeded to make use of the benchmark while pretending like they weren’t affiliated with EpochAI. That’s a huge omission and more than enough reason to distrust their claims.


IMO their involvement is only an issue if they gained an advantage on the benchmark by it. If they didn't train on the test set then their gained advantage is minimal and I don't see a big problem with it nor do I see an obligation to disclose. Especially since there is a hold-out set that OpenAI doesn't have access to, which can detect any malfeasance.


It's typically difficult to find direct evidence of bias. That is why rules for conflicts of interest and disclosure are strict in research and academia. Crucially, something is a conflict of interest if it could be perceived as one by someone external, so it doesn't matter whether you think you could judge fairly; what matters is whether someone else might doubt that you could.

Not disclosing a conflict of interest is generally considered a significant ethics violation, because it reduces trust in the general scientific/research system. Thus OpenAI has become untrustworthy in many people's view, irrespective of whether their involvement with the benchmark's creation affected their results.


There’s no way to figure out whether they gained an advantage. We have to trust their claims, which again, is an issue for me after finding out they already lied.


Lied about what? Your only claim so far is that they failed to disclose something that in my opinion didn't need to be disclosed.


I’d expect any serious company to disclose conflicts of interest.


Thing is, for example, all of classical physics can be derived from Newton's laws, Maxwell's equations and the laws of Thermodynamics, all of which can be written on a slip of paper.

A sufficiently brilliant and determined human can invent or explain everything armed only with this knowledge.
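(Concretely: in vacuum, taking the curl of Faraday's law and substituting Ampère's law already hands you the wave equation for light and its speed.)

    \nabla \times \mathbf{E} = -\partial_t \mathbf{B}, \quad
    \nabla \times \mathbf{B} = \mu_0 \varepsilon_0 \, \partial_t \mathbf{E}
    \;\Rightarrow\; \nabla^2 \mathbf{E} = \mu_0 \varepsilon_0 \, \partial_t^2 \mathbf{E},
    \quad c = 1/\sqrt{\mu_0 \varepsilon_0}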

There's no need to train him on a huge corpus of text, like they do with ChatGPT.

Not sure what this model's like, but I'm quite certain it's not trained on terabytes of Internet and book dumps, but rather is trained for abstract problem solving in some way, and is likely much smaller than these trillion parameter SOTA transformers, hence is much faster.


If you look at the history of physics I don't think it really worked like that. It took about three centuries from Newton to Maxwell because it's hard to just deduce everything from basic principles.


I think you misunderstand me; I'm not making some pie-in-the-sky statement about AI being able to discover the laws of nature in an afternoon. I'm just making the observation that if you know the basic equations and enough math (which is about multivariate calc), you can derive every single formula in your physics textbook (and most undergrads do as part of their education).

Since smart people can derive a lot of knowledge from a tiny set of axioms, smart AIs should be able to as well, which means you don't need to rely on a huge volume of curated information. Which means that ingesting the internet and training on a terabyte of text might not be how these newer models are trained, and since they don't need to learn that much raw information, they might be smaller and faster.


There's no evidence this model works like that. The "axioms" for counting the number of r's in a word are orders of magnitude simpler than classical physics', and yet it took a few years to get that right. It's always been about context, not derivation by logic.


First, false equivalence. The 'strawberry' problem was because LLMs don't operate on text character by character, but on subword tokens (turned into embedding vectors), which made it hard for them to manipulate the spelling of a word directly. That does not prevent them from properly doing math proofs.
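(The token point is easy to see for yourself; a quick sketch assuming the tiktoken package is installed, and the exact split depends on the tokenizer:)

    import tiktoken

    # The model sees subword tokens, not letters, which is why character-level
    # tasks like counting r's are awkward.
    enc = tiktoken.get_encoding("cl100k_base")
    print([enc.decode([t]) for t in enc.encode("strawberry")])
    # -> chunks of characters rather than individual letters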

Second, we know nothing about these models, how they work, or how they were trained, and indeed whether they can do these things or not. But a smart human could (by smart I mean someone who gets good grades at engineering school effortlessly, not Albert Einstein).


Right, humans are pretrained on terabytes of sensory data instead.


And the billions of years of evolution, and the language that you use to explain the task to him, and the schooling he needs to understand what you're saying, and... and and and?



