pertymcpert's comments

The reason towing affects EVs disproportionately more than ICE vehicles is the efficiency of EVs. It's unintuitive, but consider that an ICE car converts only about 30% of the fuel's chemical energy into useful power. That means for every liter of gasoline burned while driving, the energy in roughly 700 ml is lost as waste heat. A large share of that loss is effectively a fixed cost: it doesn't scale linearly with the power the car demands.

EVs are roughly 90% efficient at converting their stored energy to useful work. This is a good thing in general, but it also means that drag and other extra losses hurt their range much more. If 90% of the energy already goes into useful power, then anything that demands 50% more power cuts the range by roughly a third. With an ICE engine, by contrast, the high fixed losses mean that demanding 50% more power doesn't increase fuel consumption by 50%. Pair that with the higher energy density of gasoline and you've got a bad comparison for EVs.
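A rough back-of-envelope sketch of this argument, in Python. Every number here (fixed losses, efficiencies, power demands) is an illustrative assumption, not a measurement of any real vehicle; the point is only that a large fixed loss softens the relative hit from extra power demand.

  def input_power(useful_kw, fixed_loss_kw, marginal_eff):
      # Power drawn from the tank/battery: a fixed loss plus the useful
      # power scaled by how efficiently the marginal demand is converted.
      return fixed_loss_kw + useful_kw / marginal_eff

  def range_ratio(base_kw, towing_kw, fixed_loss_kw, marginal_eff):
      # Range scales inversely with energy drawn per unit distance,
      # i.e. with input power at a fixed cruising speed.
      return (input_power(base_kw, fixed_loss_kw, marginal_eff) /
              input_power(towing_kw, fixed_loss_kw, marginal_eff))

  # ICE: large fixed loss, modest marginal efficiency. 20 kW -> 30 kW is +50% power.
  print("ICE range while towing:", round(range_ratio(20, 30, 30, 0.40), 2))   # ~0.76
  # EV: tiny fixed loss, high efficiency. Same +50% power demand.
  print("EV range while towing: ", round(range_ratio(20, 30, 0.5, 0.92), 2))  # ~0.67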


Why would it not assume you meant the best Cambridge?

That’s not what is happening with AI companies, and you damn well know it.

People need to be responsible for things they put their name on. End of story. No AI company claims their models are perfect and don't hallucinate. But paper authors should at least verify every single character they submit.

>No AI company claims their models are perfect and don’t hallucinate

You can't have it both ways. Either AIs are worth billions BECAUSE they can run mostly unsupervised, or they are not. This is exactly like the AI driving system in Autopilot: sold as autonomous, but reality doesn't live up to it.


Yes, but they don’t. So clearly AI is a footgun. What are we doing about it?

Sounds like creativity and intelligence to me.

I think the key is that the LLM has no trouble mapping from one "embedding" of the language to another (the task they are best at!), and that appears extremely intelligent to us humans, but it certainly is not all there is to intelligence.

But just take a look at how LLMs struggle to handle dynamical, complex systems such as the one in the "vending machine" paper published some time ago. Those kinds of tasks, which we humans tend to think of as "less intelligent" than, say, converting human language to a C++ implementation, seem to have some kind of higher (or at least different) complexity than the embedding mapping done by LLMs. Maybe that's what we typically refer to as creativity? And if so, modern LLMs certainly struggle with it!

Quite sci-fi that we have created a "mind" so alien we struggle to even agree on the word to define what it's doing :)


Not GP, but my experience with Haiku-4.5 has been poor. It certainly doesn't feel like Sonnet 4.0-level performance. It looked at some Python test failures and went in a completely wrong direction, trying to address a surface-level detail rather than understanding the real cause of the problem. I tested with Sonnet 4.5 and it handled it fine, as an experienced human would.


Thanks!


The models are non-deterministic. You can't just assume that because it did better before, it really was better on average back then. And the variance is quite large.


No one talked about determinism. The first time it was able to do the task; the second time it was not. It's not that the implementation details changed.


This isn’t how you should be benchmarking models. You should give it the same task n times and see how often it succeeds and/or how long it takes to be successful (see also the 50% time horizon metric by METR).
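Something like the following sketch, where run_task() is just a hypothetical stand-in for however you invoke the model on the task and check the result:

  def success_rate(run_task, n=10):
      # Repeat the identical task n times and report the fraction of passes;
      # a single run tells you little about a non-deterministic model.
      passes = sum(1 for _ in range(n) if run_task())
      return passes / n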


I did not say that I only ran the prompt once per attempt. When I say that the second time it failed, I mean that I spent hours restarting, clearing context, giving hints, doing everything to help the model produce something that works.


You are really talking past others' points. Get a friend of yours to read what you are saying; it doesn't sound scientific in the slightest.


I never claimed this was a scientific study. It was an observation repeated over time. That is empirical in the plain meaning of the word.

Criticizing it for “not being scientific” is irrelevant; I didn’t present it as science. Are people only allowed to share experiences here if they come wrapped in a peer-reviewed paper?

If you want to debate the substance of the observation, happy to. But don’t rewrite what I said into a claim I never made.


I was pretty disappointed to learn that the METR metric isn't actually evaluating a model's ability to complete long-duration tasks; it's based on the estimated time a human would take on a given task. But it did explain my increasing bafflement at how the METR line keeps steadily going up despite my personal experience coding daily with LLMs, where they still frequently struggle to work independently for 10 minutes without veering off task after hitting a minor roadblock.

  On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours. This allows us to characterize the abilities of a given model by “the length (for humans) of tasks that the model can successfully complete with x% probability”.

  For each model, we can fit a logistic curve to predict model success probability using human task length. After fixing a success probability, we can then convert each model’s predicted success curve into a time duration, by looking at the length of task where the predicted success curve intersects with that probability.
[1] https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
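For what it's worth, here is a minimal sketch of the time-horizon calculation the quoted passage describes: fit a logistic curve of task success against log human completion time, then invert it at 50%. The data points below are made up purely to show the shape of the computation.

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Hypothetical results: human completion time per task (minutes) and
  # whether the model solved that task (1) or not (0).
  human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
  model_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

  # Fit success probability as a logistic function of log human time.
  X = np.log(human_minutes).reshape(-1, 1)
  clf = LogisticRegression().fit(X, model_success)

  # Invert the fitted curve: the human task length at which the predicted
  # success probability crosses 50% is the model's "50% time horizon".
  p = 0.5
  log_t = (np.log(p / (1 - p)) - clf.intercept_[0]) / clf.coef_[0, 0]
  print("50% time horizon ~", round(float(np.exp(log_t)), 1), "human-minutes")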


It makes perfect sense to use human times as a baseline, because otherwise the test would be biased towards models with slower inference.

If model A generates 10 tokens a second and model B generates 100 tokens a second, then using real LLM inference time puts A at a massive 10x advantage, all other things being equal.


But it doesn't evaluate the area in which I am most eager to see improvements in LLM agent performance: unattended complex tasks that require adapting to unexpected challenges, solving problems, and handling ambiguity over a long duration, without a human steering the agent back in the right direction before it hits a wall or starts causing damage.

If it takes me 8 hours to create a pleasant-looking to-do app and Gemini 3 can one-shot that in 5 minutes, that's certainly impressive, but it doesn't help me evaluate whether I could drop an agent into my complex, messy project and expect it to successfully implement a large feature that may require reading docs, installing a new NPM package, troubleshooting DB configuration, etc. for 30 minutes to an hour without going off the rails.

It's a legitimate benchmark, I'm not disputing that, but it unfortunately isn't measuring the area that could be a significant productivity multiplier in my day-to-day work. The METR time-horizon score is still susceptible to the same pernicious benchmaxxing, whereas I had previously hoped it was measuring something much closer to my real-world usage of LLM agents.

Improvements in long-duration, multi-turn unattended development would save me a lot of babysitting and frustrating back and forth with Claude Code/Codex, which currently saps some of the enjoyment out of agentic development for me and requires tedious upfront work setting up effective rules and guardrails to work around those deficits.


There are many, many tasks that a given LLM can successfully do 5% of the time.

Feeling lucky?


I find 4.5 a much better model FWIW.


Where do you draw the line? Apple Silicon as a high-powered replacement for Intel was, as a concept, entirely under Cook's tenure, from initial investigations to shipping product. By your logic, where would we stop the attribution?


Draw the line for Apple silicon? With Jobs. I'm not sure what was unclear about my previous post. Jobs introduced Apple silicon. Jobs began the SoC design for iPhones, and he began the high-performance CPU initiative with the purchase of PA Semi. That's my logic.

Putting their CPUs in laptops wasn't some incredible initiative from Cook either. It was basically an inevitability that mobile-class cores would eventually intercept high-end CPUs in performance once Dennard scaling ended, and it was widely predicted by many Apple watchers even before Apple's own core came out, and particularly after the first ones did.

Some thought it would be sooner, some later. If Intel hadn't shat the bed for a decade, and/or if the PA Semi team and the subsequent Apple CPU team had turned out to be in the Samsung or Annapurna tier, it might have taken many more years, or they might have switched over to an ARM Ltd core IP. But the trajectory for how things turned out was set in motion squarely by Jobs, who built up the CPU group and introduced the first high-performance Apple CPU silicon.


I think if you go back far enough you’ll see that SVE was actually an Apple invention :)

