More

seidleroni · 2025-12-03T16:51:41 1764780701

Is there any way to do this with the frontier LLM's?

red75prime · 2025-12-03T17:50:18 1764784218

Ask them to mark low confidence words.

akoboldfrying · 2025-12-03T21:31:15 1764797475

Do they actually have access to that info "in-band"? I would guess not. OTOH it should be straightforward for the LLM program to report this -- someone else commented that you can do this when running your own LLM locally, but I guess commercial providers have incentives not to make this info available.

red75prime · 2025-12-05T11:12:51 1764933171

Naturally, their "confidence" is represented as activations in layers close to output, so they might be able to use it. Research ([0], [1], [2], [3]) shows that results of prompting LLMs to express their confidence correlate with their accuracy. The models tend to be overconfident, but in my anecdotal experience the latest models are passably good at judging their own confidence.

[0] https://ieeexplore.ieee.org/abstract/document/10832237

[1] https://arxiv.org/abs/2412.14737

[2] https://arxiv.org/abs/2509.25532

[3] https://arxiv.org/abs/2510.10913

seidleroni · 2025-12-03T20:39:24 1764794364

interesting... I'll give that a shot

criemen · 2025-12-03T21:27:04 1764797224

It used to be that the answer was logprobs, but it seems that is no longer available.

seidleroni · 2025-12-03T16:13:42 1764778422

I don't believe that it just analyzes the transcription. I asked Gemini to look at the youtube video referenced on the site below and "build" something that duplicates that device. It did a pretty good approximation that it could not have done without going through the full video.

https://bitsnpieces.dev/posts/a-synth-for-my-daughter/

seidleroni · 2025-11-25T13:09:26 1764076166

The author actually discusses the results of the paper. He's not some rando but a Wharton Professor and when he is comparing the results to a grad student, it is with some authority.

"So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student. The idea was good, as were many elements of the execution, but there were also problems..."

seidleroni · 2025-11-05T13:44:45 1762350285

100%. I have known a couple of people that did some form of "medical tourism", mostly for expensive dental work. In both cases they did some form of tech contract work as a sole proprietorship and bought their own health insurance (not through a partner). The overlap of people who can save up thousands of dollars for treatment abroad and have poor health insurance is probably not too large.

ethbr1 · 2025-11-05T14:29:40 1762352980

I know it's popular with the US transgender community, looking for gender reassignment surgery.

Cost of travel to Thailand < savings on medical procedures (with equivalent or better outcomes)

seidleroni · 2025-10-29T11:53:07 1761738787

As someone who works with firmware, it is funny how different our definitions of "bare metal" is.

embedding-shape · 2025-10-29T11:58:54 1761739134

As someone who does material science, it's funny how our definition of "bare metal" is so different.

onionisafruit · 2025-10-29T12:54:18 1761742458

As someone who listens to loud rock and roll music …

amluto · 2025-10-29T16:06:27 1761753987

Ask an astronomer what a “metal” is.

Joeboy · 2025-10-29T13:02:45 1761742965

Wikipedia still thinks it means the thing I (and presumably you) do.

https://en.wikipedia.org/wiki/Bare_metal

Edit: For clarity, wikipedia does also have pages with other meanings of "bare metal", including "bare metal server". The above link is what you get if you just look up "bare metal".

I do aim to be some combination of clear, accurate and succinct, but I very often seem to end up in these HN pissing matches so I suppose I'm doing something wrong. Possibly the mistake is just commenting on HN in itself.

embedding-shape · 2025-10-29T13:43:59 1761745439

Seems there is a difference between "Bare Metal" and "Bare Machine".

I'm not sure what you did, but when you go to that Wikipedia article, it redirects to "Bare Machine", and the article contents is about "Bare Machine". Clicking the link you have sends you to https://en.wikipedia.org/wiki/Bare_machine

So it seems like you almost intentionally shared the article that redirects, instead of linking to the proper page?

Joeboy · 2025-10-29T13:56:04 1761746164

I indeed deliberately pasted a link that shows what happens when you try to go to the Wikipedia page for "bare metal".

embedding-shape · 2025-10-29T13:59:10 1761746350

Right, slightly misleading though, as https://en.wikipedia.org/wiki/Bare-metal_server is a separate page.

Joeboy · 2025-10-29T14:09:18 1761746958

Yes, but if you look up "bare metal" it goes to the page about actual bare metal (aka "bare machines" or whatever).

Can we stop this now? Please?

embedding-shape · 2025-10-29T14:11:38 1761747098

> Yes, but if you look up "bare metal" it goes to the page about actual bare metal (or bare machines or whatever).

Fix it then, if you think it's incorrect. Otherwise, link to https://en.wikipedia.org/wiki/Bare_metal_(disambiguation) like any normal and charitable commentator would do.

> Can we stop this now? Please?

Sure, feel free to stop at any point you want to.

Joeboy · 2025-10-29T16:52:49 1761756769

There is nothing that needs fixing? Both my link and yours give the same "primary" definition for "bare metal". Which is not unequivocally the correct definition, but it's the one I and the person I was replying to favour.

I thought my link made the point a bit better. I think maybe you've misunderstood something about how Wikipedia works, or about what I'm saying, or something. Which is OK, but maybe you could try to be a bit more polite about it? Or charitable, to use your own word?

Edit: In case this part isn't obvious, Wikipedia redirects are managed by Wikipedia editors, just like the rest of Wikipedia. Where the redirect goes is as much an indication of the collective will of Wikipedia editors as eg. a disambiguation page. I don't decide where a request for the "bare metal" page goes, that's Wikipedia.

Edit2: Unless you're suggesting I edited the redirect page? The redirect looks to have been created in 2013, and hasn't been changed since.

andrewl-hn · 2025-10-29T12:40:25 1761741625

In similar way I once worked on a financial system, where a COBOL-powered mainframe was referred to as "Backend", and all other systems around it written in C++, Java, .NET, etc. since early 80s - as "Frontend".

embedding-shape · 2025-10-29T13:45:55 1761745555

Had somewhat similar experience, the first "frontend" I worked on was a sort of proxy server that sat in front of a database basically, meant as a barrier for other applications to communicate via. At one point we called the client side web application "frontend-frontend" as it was the frontend for the frontend.

pgwhalen · 2025-10-29T12:44:21 1761741861

I don't work in firmware at all, but I'm working next to a team now migrating an application from VMs to K8S, and they refer to the VMs as "bare metal" which I find slightly cringeworthy - but hey, whatever language works to communicate an idea.

ghaff · 2025-10-29T17:19:51 1761758391

I'm not sure I've ever heard bare metal used to refer to virtualized instances. (There were debates around Type 1 and Type 2 (hosted) hypervisors at one point but haven't heard that come up in years.

seidleroni · 2025-09-10T13:40:10 1757511610

Having lost my mother to melanoma over 20 years ago, it is very encouraging to see the progress that has been made against this terrible disease. Very sorry for your loss.

seidleroni · 2025-08-18T17:10:31 1755537031

I'm not sure if this applies to all carriers but with many carriers if you are on wifi you can send/receive text messages and sometimes even make calls. Certainly you could use WhatsApp or similar in that case.

mikestew · 2025-08-18T18:43:22 1755542602

...with many carriers if you are on wifi you can send/receive text messages...

How would you propose that those messages get routed if there is no connection to either the internet nor a cell provider?

pavel_lishin · 2025-08-18T17:26:27 1755537987

Not if I'm in the middle of the woods with no connection to the outside world.

seidleroni · 2025-08-05T12:19:06 1754396346

I work in embedded systems, and the best advice I can offer is: resist the urge to speculate when problems arise. Stay quiet, grab an oscilloscope, and start probing the problem area. Objective measurements beat conjecture every time. It's hard to argue with scope captures that clearly show what's happening. As Jack Ganssle says, "One test is worth a thousand opinions."

jdhwosnhw · 2025-08-05T13:08:23 1754399303

I thoroughly disagree with this sentiment.

In my experience, the most helpful approach to performing RCA on complicated systems involves several hours, if not days, of hypothesizing and modeling prior to test(s). The hypothesis guides the tests, and without a fully formed conjecture you’re practically guaranteed to fit your hypothesis to the data ex post facto. Not to mention that in complex systems there is usually 10 benign things wrong for every 1 real issue you might find - without a clear hypothesis, its easy to go chasing down rabbit holes with your testing.

seidleroni · 2025-08-05T13:25:57 1754400357

That's a valid point. What I originally meant to convey is that when issues arise, people often assign blame and point fingers without any evidence, just based on their own feelings. It is important to gather objective evidence to support any claims. Sounds somewhat obvious but in my career I have found that people are very quick to blame issues on others when 15 minutes of testing would have gotten to the truth.

jdhwosnhw · 2025-08-05T15:30:17 1754407817

Very reasonable, I fully agree on that front

AnimalMuppet · 2025-08-06T12:56:06 1754484966

I think the GP is in a different world than you.

If you can grab an oscilloscope and gather meaningful data in 15 minutes, why would you spend several hours hypothesizing and modeling?

If you can't, then spending several hours or days modeling and hypothesizing is better than just guessing.

So I think that data beats informed opinions, but informed opinions beat pure guesses.

n_u · 2025-08-06T05:44:23 1754459063

I agree with both of you. I think it’s sort of a hybrid and a spectrum of how much you do of each first.

When you test part of the circuit with the scope, you are using prior knowledge to determine which tool to use and where to test. You don’t just take measurements blindly. You could test a totally different part of the system because there might be some crazy coupling but you don’t. In this system it seems like taking the measurement is really cheap and a quick analysis about what to measure is likely to give relevant results.

In a different system it could be that measurements are expensive and it’s easy to measure something irrelevant. So there it’s worth doing more analysis before measurements.

I think both cases fight what I’ve heard called intellectual laziness. It’s sometimes hard to make yourself be intellectually honest and do the proper unbiased analysis and measuring for RCA. It’s also really easy to sit around and conjecture compared to taking the time to measure. It’s really easy for your brain to say “oh it’s always caused by this thing cuz it’s junk” and move on because you want to be done with it. Is this really the cause? Could there be nothing else causing it? Would you investigate this more if other people’s lives depended on this?

I learned about this model of viewing RCA from people who work on safety critical systems. It takes a lot of energy and time to be thorough and your brain will use shortcuts and confirmation bias. I ask myself if I’m being lazy because I want a certain answer. Can I be more thorough? Is there a measurement I know will be annoying so I’m avoiding it?

MSFT_Edging · 2025-08-05T13:31:09 1754400669

Another disagreeing voice, but I try to employ problem speculation when some problem arises. I'm working in a cross-company, cross-team project, where everyone's input interacts in interesting ways. When we come across a weird problem, getting folks together to ask, "what does x issue sound like its caused by?". This gets people thinking about where certain functionality lives, the boundary points between the functionality, and a way to test the hypothesis.

It's helped a dozen times so far essentially playing 20 questions and being able to point to the exact problem and have it resolved quickly.

This is a semi-embedded system. FGPAs, SoCs, drivers, userspace, userspace drivers, etc. Lots of stuff to go wrong, speculation gives a place to start.

organsnyder · 2025-08-05T13:40:00 1754401200

Speculating is a great way to prioritizing what to investigate, but I've worked with many senior engineers (albeit not in embedded) that have made troubleshooting take longer because they disregarded potential causes based on pattern-matching against their past experiences.

razorfen · 2025-08-05T14:58:10 1754405890

This has become a personal debate for me recently, ever since I learned that there are several software luminaries who eschew debuggers (the equivalent of taking an oscilliscope probe to a piece of electronics).

I’ve always fallen on the side of debugging being about “isolate as narrowly as possible” and “don’t guess what’s happening when you can KNOW what’s happening”.

The arguments against this approach is that speculation and statically analyzing a system reinforces that system in your mind and makes you more effective overall in the long run, even if it may take longer to isolate a single defect.

I’ll stick with my debuggers, but I do agree that you can’t throw the baby out with the bathwater.

The modern extreme is asking Cursor’s AI agent “why is this broken?” I recently saw a relatively senior engineer joining a new company lean too heavily on Cursor to understand a company’s systems. They burned a lot of cycles getting poor answers. I think this is a far worse extreme.

organsnyder · 2025-08-05T16:34:57 1754411697

For me, it's about being aware of the entire stack, and deliberate about which possibilities I am downplaying.

At a previous company, I was assigned a task to fix requests that were timing out for certain users. We knew those users had more data than the standard deviation, so the team lead created a ticket that was something like "Optimize SQL queries for...". Turns out the issue was our XML transformation pipeline (I don't miss this stack at all) was configured to spool to disk for any messages over a certain size.

Since I started by benchmarking the query, I realized fairly quickly that the slowness wasn't in the database; since I was familiar with all layers of our stack, I knew where else to look.

Instrumentation is vital as well. If you can get metrics and error information without having to gather and correlate it manually, it's much easier to gain context quickly.

MSFT_Edging · 2025-08-05T15:34:58 1754408098

To me, it's the method for deciding where I put the oscilloscope/debugger.

Without the speculation, where do you know where to put your breakpoint? If you have a crash, cool, start at the stack trace. If you don't crash but something is wrong, you have a far broader scope.

The speculation makes you think about what could logically cause the issue. Sometimes you can skip the actual debugging and logic your way to the exact line without much wasted time.

sigbottle · 2025-08-05T17:21:30 1754414490

Its probably different depending on how much observability you have into the system.

Hardware, at least to me, seems impossible to debug from first principles, too many moving parts from phenomenon too tiny to see and from multiple different vendors.

Software is at a higher level of abstraction and you can associate a bug to some lines of code. Of course this means that you're writing way more code so the eventual complexity can grow to infinity by having like 4 different software systems have subtly different invariants that just causes a program to crash in a specific way.

Math proofs are the extreme end of this - technically all the words are there! Am i going to understand all of them? Maybe, but definitely not on the first pass.

Meh you can make the argument that if all the thoughts are in the abstract it becomes hard to debug again which is fair.

That doesn't mean any one is harder than the other and obviously between different problems in said disciplines you have different levels of observability. But yea idk

teiferer · 2025-08-05T12:37:25 1754397445

Implicit or explicit, you need a hypothesis to be able to start probing. Many issues can surely be found with an oscilloscope. Many others can't and an oscilloscope does not help in any way. It's experience that tells you which symptom indicates which class of issues this could be, so you use the right tool for debugging.

That's not to say that at some point you don't need to get your hands dirty. But it's equally important to balance that with thinking and theory building. It's whoever gets that balance right who will be most effective at debugging, not the one with the dirtiest hands.

cushychicken · 2025-08-05T23:01:25 1754434885

Love this.

The most dangerous words during debug are: “…but it should work this way!” This is a mantra I try hard to instill in all EEs I mentor.

“Should” isn’t worth a damn to me. You test your way out of hardware bugs - you don’t logic your way out.

fedeb95 · 2025-08-05T12:28:52 1754396932

this works also in general purpose corporate programming.

SkyPuncher · 2025-08-05T18:20:46 1754418046

Without speculation, what test do you decide to do?

Speculation is fine, but you need to ground it in reality.

seidleroni · 2025-08-04T15:59:37 1754323177

This is really great, well done! I have two young kids and was thinking about putting something like this together and I'm delighted that you beat me to it. Having a 6yo and 8yo, it would be great if there were some more basic games as well.

seidleroni · 2025-06-27T17:37:00 1751045820

As much as I love AI/LLM's and use them on a daily basis, this does a great job revealing the gap between current capabilities and what the massive hype machine would have us believe the systems are already capable of.

I wonder how long it will take frontier LLM's to be able to handle something like this with ease without it using a lot of "scaffolding".

roxolotl · 2025-06-27T18:42:18 1751049738

I don’t quite know why we would think they’d ever be able to without scaffolding. LLM are exactly what the name suggests, language models. So without scaffolding they can use to interact with the world with using language they are completely powerless.

poly2it · 2025-06-27T20:09:14 1751054954

Humans also use scaffolding to make better decisions. Imagine trying to run a profitable business over a longer period solely relying on memorised values.

samrus · 2025-06-28T01:54:12 1751075652

But the difference is who makes the scaffolding.

We dont need a more intelligent entity to give us those rules, like humans would give to the LLM. We learn and formalize those rules ourselves and communicate within each other. This makes it not scaffolding, since scaffolding is explicit instructions/restraints from outside the model. The "scaffolding" your saying humans are using is implicitly learnt by humans and then formalized and applied at instructions and restraints, and even then, human thay dont internalize/understand them dont do well in those tasks. So scaffolding really is running into the bitter lesson