I have seen pushback on this kind of behavior because "users don't like error codes" or other such nonsense. UX and Product like to pretend nothing will ever break, and when it does they want some funny little image, not useful output.
A good compromise is to log whenever a user would see the error code, and treat those events with very high priority.
We put the error code behind a kind of message/dialog that invites the user to contact us if the problem persists and then report that code.
It’s my long-standing wish to be able to link traces/errors automatically to callers when they call the helpdesk. We have all the required information. It’s just that the helpdesk actually has very little use for this level of detail, so all they can do is attach it to the ticket so that the actual application teams don’t have to search for it.
> I have seen pushback on this kind of behavior because "users don't like error codes" or other such nonsense […]
There are two dimensions to it: UX and security.
Displaying excessive technical information on an end-user interface will complicate support and likely reveal too much about the internal system design, making it vulnerable to external attacks.
The latter is particularly concerning for any design facing the public internet. A frequently recommended approach is exception shielding. It involves logging two messages upon encountering a problem: a nondescript user-facing message (potentially including a reference ID pinpointing the problem in space and time) and a detailed internal message with the problem’s details and context for L3 support / engineering.
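A minimal sketch of exception shielding in Python (names like `handle_request` and the logger name are illustrative, not from any particular framework): the detailed message with stack trace goes to the internal log, while the user only sees a nondescript message carrying a reference ID that support can quote back.

```python
import logging
import uuid

log = logging.getLogger("payments")  # hypothetical application logger

def handle_request(do_work):
    """Exception shielding: log the detailed internal message, return only
    a nondescript user-facing message carrying a reference ID."""
    try:
        return do_work()
    except Exception:
        # reference ID pinpointing the problem in space and time
        ref = uuid.uuid4().hex[:12]
        # detailed internal message (stack trace + context) for L3 support / engineering
        log.exception("request failed, ref=%s", ref)
        # nondescript user-facing message
        return {"error": "Something went wrong. If the problem persists, "
                         f"contact support and quote reference {ref}."}
```

Searching the internal logs for that reference ID then lands directly on the stack trace, without the user-facing side revealing any internal design.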
I used «powermetrics» bundled with macOS with «bandwidth» as one of the samplers (--samplers / -s set to «cpu_power,gpu_power,thermal,bandwidth»).
Unfortunately, Apple has taken out the «bandwidth» sampler from «powermetrics», and it is no longer possible to measure the memory bandwidth as easily.
> UX and Product like to pretend nothing will ever break, and when it does they want some funny little image, not useful output.
Just ignore them, or provide appeasement insofar as it doesn’t mess with your ability to maintain the system.
(cat picture or something)
Oh no, something went wrong.
Please don’t hesitate to reach out to our support: (details)
This code will better help us understand what happened: (request or trace ID)
Nah, that’s an easy problem to solve with UX copy. „Something went wrong. Try again or contact support. Your support request number is XXXX XXXX“ (a Base58 version of a UUID).
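A quick sketch of that Base58 ticket-number idea (`uuid_to_base58` is a hypothetical helper; the alphabet is the common Bitcoin-style one, which drops 0/O and I/l to keep codes unambiguous when read over the phone):

```python
import uuid

# Base58 alphabet: no 0/O or I/l, so support codes are hard to misread
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def uuid_to_base58(u: uuid.UUID) -> str:
    """Encode the UUID's 128-bit integer in Base58 (at most 22 characters)."""
    n = u.int
    digits = []
    while n:
        n, rem = divmod(n, 58)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)) or ALPHABET[0]

ticket = uuid_to_base58(uuid.uuid4())  # short code to show in the error dialog
```

The same code can double as the correlation ID logged with the detailed internal message.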
It’s a wild violation of SRP to suggest that. Separating concerns is way more efficient. The database can handle an audit trail and some key metrics much better, no special tools needed, and you can join the transaction log with domain tables as a bonus.
Are you assuming they're all stored identically? If so, that's not necessarily the case.
Once the logs have entered the ingestion endpoint, they can take the most optimal path for their use case. Metrics can be extracted and sent off to a time-series metric database, while logs can be multiplexed to different destinations, including stored raw in cheap archival storage, or matched to schemas, indexed, stored in purpose-built search engines like OpenSearch, and stored "cooked" in Apache Iceberg+Parquet tables for rapid querying with Spark, Trino, or other analytical engines.
Have you ever taken, say, VPC flow logs, saved them in Parquet format, and queried them with DuckDB? I just experimented with this the other day and it was mind-blowingly awesome--and fast. I, for one, am glad the days of writing parsers and report generators myself are over.
The problem statement in this article sounds weird. I thought that in 2025 everyone logs at least a thread ID and a context ID (user ID, request ID, etc.), and in a microservice architecture at least a transaction or saga ID. You don’t need structured logging, because grepping by this ID is sufficient for incident investigation. And for analytics and metrics, databases of events and requests make more sense.
In such scenarios it makes sense to give clients an opportunity to react to such conditions programmatically, so just logging is the wrong choice; if there’s a callback to the client, the client can decide whether to log it and how.
It's a nice idea but I've literally never seen it done, so I would be interested if you have examples of major libraries that do this. Abstractly it doesn't really seem to work to me in place of simple logs.
One test case here is that your library has existed for a decade and was fast, but Java removed a method that let you make it fast; you can still run slow without that API. The Java runtime has a flag that the end user can enable to turn it back on as a stopgap. How do you expect this to work in your model? Do you expect to have an onUnnecessarilySlow() callback already set up that all of your users have hooked up, which is never invoked for a decade, and then once it actually happens you start calling it and expect it to do anything sane in those systems?
The second example is all of the scenarios where you're some transitively used library for many users. That immediately makes any callback strategy not work, because the person who needs to know about the situation and could take action is the application owner rather than the people writing the library code which called you. It would require every library to offer these same callbacks and transitively propagate things, which would only work if it was a firm idiomatic pattern in some language ecosystem, and I don't believe it is in any language ecosystem.
>but Java removed a method that let you make it fast, but you can still run slow without that API
I’d like to see an example of that, because this is an extremely hypothetical scenario. I don’t think any library is so advanced as to anticipate such scenarios and write something to log. And of course Java specifically has a longer cycle of deprecation and removal. :)
As for your second example, let’s say library A is smart and can detect certain issues. Library B, which depends on it, is at a higher abstraction level, so it has enough business context to react to them. I don’t think it’s necessary to propagate the problem and leak implementation details in this scenario.
Protobuf is the example I had in mind. It uses sun.misc.Unsafe, which is being removed in upcoming Java releases, but it has a slow fallback path. If it can tell at runtime that it's only using the fallback path while the fast path is still available, it logs a warning, so the application owner can set a flag to turn the fast path back on if they want to.
Java Protobuf also logs a warning now if it can tell you are using gencode old enough that it's covered by a DoS CVE. They actually did a release that broke compatibility of the CVE-covered gencode, but restored it and print a warning in a newer release.
There's a lot here, to be honest these things always come back to investment cost and ROI compared to everything else that could be worked on.
Java 8 is still really popular, probably the most popular single version. It's not just servers in this context, but also Android, where Java 8 is the highest safe target; it's not clear what decade we'll be in when VarHandle would be safe to use there at all.
VarHandle was Java 9 but MemorySegment was Java 17. And the rest of FFM is only in 25 which is fully bleeding edge.
Protobuf may realistically try to move off of sun.misc.Unsafe without the performance regressions, and without adopting MemorySegment so as to avoid the versioning problem, but that takes significant and careful engineering time.
That said, it's always possible to have a waterfall of preferred implementations based on what's supported; it's just always an implementation/verification cost.
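A generic sketch of that waterfall idea in Python (this is not Protobuf's actual code; `copy_bytes` and the probe are invented stand-ins): probe once for the fastest supported implementation, fall back in order of preference, and record the choice only at DEBUG so the waterfall stays silent by default.

```python
import logging

log = logging.getLogger(__name__)

def _fast_path_available() -> bool:
    """Stand-in probe for a fast/privileged API that may not exist."""
    try:
        import array  # imagine probing for sun.misc.Unsafe or similar
        return hasattr(array, "array")
    except ImportError:
        return False

# select the implementation once, at import time
if _fast_path_available():
    def copy_bytes(src: bytes) -> bytes:
        log.debug("using fast copy implementation")
        return bytes(src)  # pretend this is the fast path
else:
    def copy_bytes(src: bytes) -> bytes:
        log.debug("using portable fallback implementation")
        return bytes(bytearray(src))  # slow but works everywhere
```

Both branches expose the same signature, so callers never see which rung of the waterfall they landed on.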
I’ve written code that followed this model, but it almost always just maps to logging anyway, and the rest of the time it’s narrow options presented in the callback. e.g. Retry vs wait vs abort.
It’s very rarely realistic that a client would code up meaningful paths for every possible failure mode in a library. These callbacks are usually reserved for expected conditions.
Yes, that’s the point. You log it until you encounter it for the first time; then you know more and can do something meaningful. E.g. let’s say you build an API client and the library offers a callback for HTTP 429. You don’t expect it to happen, so you just log the errors in a generic handler in client code, but then after some business logic change you hit 429 for the first time. If the library offers you control over what happens next, you can decide how exactly you will retry and what happens to your state between the attempts. If the library just logs and starts a retry cycle, you may get a performance hit that will be harder to fix.
Defining a callback for every situation where a library might encounter an unexpected condition and pointing them all at the logs seems like a massive waste of time.
I would much prefer a library have sane defaults, reasonable logging, and a way for me to plug in callbacks where needed. Writing On429 and a hundred other functions that just point to Logger.Log is not a good use of time.
This sub-thread in my understanding is about a special case (a non-error mode that client may want to avoid, in which case explicit callback makes sense), not about all possible unexpected errors. I’m not suggesting hooks as the best approach. And of course “on429” is the last thing I would think about when designing this. There are better ways.
If the statement is just that sometimes it’s appropriate to have callbacks, absolutely. A library that only logs in places where it really needs a callback is poorly designed.
I still don’t want to have to provide a 429 callback just to log, though. The library should log by default if the callback isn’t registered.
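That compromise is easy to sketch (a hypothetical client, not any real library's API): the callback is optional, and when it isn't registered the library falls back to a log line plus its default policy.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("apiclient")  # hypothetical library logger

class Client:
    """Hypothetical API client: on_429 is optional. If the caller registers
    it, the caller controls what happens next; otherwise the library logs
    and applies its default retry policy."""

    def __init__(self, on_429: Optional[Callable[[int], str]] = None):
        self._on_429 = on_429

    def _handle_429(self, attempt: int) -> str:
        if self._on_429 is not None:
            # caller decides: "retry", "wait", "abort", ...
            return self._on_429(attempt)
        # no callback registered: log and use the default backoff
        log.warning("rate limited (attempt %d); default backoff", attempt)
        return "retry"
```

Clients that never hit 429 write no extra code; the one that eventually does can plug in a real policy without waiting for a library change.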
You should never get out of bed on an error in the log. Logs are for retrospective analysis, health checks and metrics are for situational awareness, alerts are for waking people up.
They can log if the platform permits, i.e. when you can set TRACE and DEBUG to no-op, but of course it should be done reasonably. Having hooks is often overkill compared to this.
You may log that or count failures in some metric, but the correct answer is to have a health check on the third-party service and an alert when that service is down. Logs may help you understand the nature of the incident, but they are not the channel through which you are informed about such problems.
A different issue is when the third party broke the contract, so suddenly you get a lot of 4xx or 5xx responses, likely unrecoverable. Then you get ERROR-level messages in the log (because it’s an unexpected problem) and an alert when there’s a spike.
Libraries should not log at levels above DEBUG, period. If there’s something worth reporting at higher levels, pass this information to client code, either as an event, an exception, or an error code.
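Python's own logging guidance for libraries matches this: attach a `NullHandler` so the library stays silent unless the application configures logging, and surface real problems as exceptions rather than ERROR log lines. A sketch (the library name and `fetch` function are invented):

```python
import logging

# library module: NullHandler means "no output unless the app opts in"
log = logging.getLogger("mylib")
log.addHandler(logging.NullHandler())

class UpstreamUnavailable(Exception):
    """Raised to client code instead of logging at ERROR; the caller knows
    whether the problem is expected, recoverable, or severe."""

def fetch(resource: str) -> str:
    log.debug("fetching %s", resource)  # diagnostic detail stays at DEBUG
    if resource == "unreachable":
        raise UpstreamUnavailable(resource)  # report upward, don't log
    return f"contents of {resource}"
```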
From a code modularization point of view, there shouldn’t really be much of a difference between programs and libraries. A program is just a library with a different calling convention. I like to structure programs such that their actual functionality could be reused as a library in another program.
This is difficult to reconcile with libraries only logging on a debug level.
I see your point, but disagree on a practical level. Libraries are being used while you’re in “developer” mode, while programs are used in “user” mode (trying awkwardly to differentiate between _being_ a developer and currently developing code around that library).
Usually a program is being used by the user to accomplish something, and if logging is meaningful, then it's either in a CLI context or a server context. In both cases, errors are more often seen by people/users than by code. Therefore printing them to logs makes sense.
A lib, meanwhile, is being used by a program, so it has a better way to communicate problems to the caller (exceptions, error values: choose the poison of your language). But I almost never want a library to start logging shit, because it’s almost guaranteed not to follow the same conventions as I do elsewhere in my program. Return me the error and let me handle it.
It’s analogous to how Go has an implicit rule that a library should never let a panic escape the library. Internally, fine. But at the package boundary, you should catch panics and return them as errors. You don’t know if the caller wants the app to die because of an error in your lib!
The main difference is that the library is not aware of the context in which the code executes, so it cannot decide whether the problem is expected, recoverable, or severe.
And the program doesn’t know if the user is expecting failure, either. The library case is not actually much different.
It’s very reasonable that a logging framework should allow higher levels to adjust how logging at lower levels is recorded. But saying that libraries should only log debug is not. It’s very legitimate for a library to log “this looks like a problem to me”.
The same is true for programs that are being invoked. The program only knows relative to its own purpose, and the same is again true for libraries. I don’t see the difference, other than, as already mentioned, the mechanism of program vs. library invocation.
Consider a Smalltalk-like system, or something like TCL, that doesn’t distinguish between programs and libraries regarding invocation mechanism. How would you handle logging in that case?
Okay, but…most programs are written in Python or Rust or something, where invoking library functions is a lot safer, more ergonomic, more performant, and more common than spawning a subprocess and executing a program in it. Like you can’t really ignore the human expectations and conventions that are brought to bear when your code is run (the accommodation of which is arguably most of the purpose of programming languages).
When you publish a library, people are going to use it more liberally and in a wider range of contexts (which are therefore harder to predict, including whether a given violation requires human intervention).
The purpose of a program and of a library is different, and the intent of the authors of the code is usually clear enough to make the distinction in context. Small composable programs aren’t an interesting case here; they shouldn’t be verbose enough anyway to justify multiple logging levels (verbosity is probably just set to on/off using a command-line argument).
The mechanism of invocation is important. Most programs allow you to set the logging verbosity at invocation. Libraries may provide an interface to do so but their entry points tend to be more numerous.
I have a logging level I call "log lots" where it will log the first time with probability 1, but as the same line hits more often, it will log with lower and lower probability, bottoming out around 1/20000. Sort of a "log with probability proportional to the unlikeliness of the event". So if I get e.g. sporadic failures to some back end, I will see them all, but if it goes down hard I will see it is still down while also being able to read other log messages.
Practicality. It is excessive for client code to calibrate a library's logging level. It’s ok to do it in logging configuration, but having an entry there for every library is also excessive. It is reasonable to expect that dev/staging may have a base level of DEBUG and production a base level of INFO, so that a library following the convention requires no extra effort to prevent log spam in production. Yes, we have an entire logging industry around aggregation of terabytes of logs, with associated costs, but do you really need that? In other words, are we developers too lazy to adopt a sane logging policy, which actually requires minimum effort, and will we just burn the company's money for nothing?
A library might also be used in multiple places, maybe deep in a dependency stack, so the execution context (the high-level stack) matters more than which library got a failure.
So handling failures should stay in the hands of the developer calling the library and this should be a major constraint for API design.
Eh, as with anything there are always exceptions. I generally agree with WARN and ERROR, though I can imagine a few situations where it might be appropriate for a library to log at those levels. Especially for a warning, like a library might emit "WARN Foo not available; falling back to Bar" on initialization, or something like that. And I think a library is fine logging at INFO (and DEBUG) as much as it wants.
Ultimately, though, it's important to be using a featureful logging framework (all the better if there's a "standard" one for your language or framework), so the end user can enable/disable different levels for different modules (including for your library).
In server contexts this is usually unnecessary noise if it’s not change detection. Of course, a good logging framework will help you mute irrelevant messages, but as I said in another comment here, it’s a matter of practicality. A library shouldn’t create extra effort for its users to fine-tune logging output, so it must use reasonable defaults.
Normal users will be fine if they see two big squares side by side as an installation step: „with AI“ and „without AI“, where the former just installs and enables the plugin. Explicit choice is better than opt-out, and it’s not something people frequently change their mind about, so another switch can be buried in settings.