I suggest you make explicit the assumption that this website is specifically about English text. Otherwise the leaderboard is pretty meaningless, with extreme differences in performance across other scripts, and potentially even across languages such as Vietnamese or Czech, which use the Latin script but have lots of accents.
Hey! I'm the dev who made this :) I think you're right: the data will be biased towards English because the dataset we provide for people to use is in English. But you can also upload non-English docs in battle mode as well as in the playground!
LMArena splits their leaderboard by language; maybe you should consider doing the same thing.
I assume that to do that you'd need another model for language detection on the inputs and/or outputs, but a language detection model can be a lot cheaper than an OCR model or an LLM.
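To make the idea concrete, here is a rough sketch of bucketing battle results before building per-bucket leaderboards. It is not how the site or LMArena actually does it; the names `dominant_script` and `wins_by_script` and the `(text, winner)` tuple shape are made up for illustration. It also cheats: instead of a real language detection model it only looks at the dominant Unicode script, so German, Czech and Vietnamese would all land in the same `LATIN` bucket. Swapping in an actual language identifier would be the obvious next step.

```python
import unicodedata
from collections import Counter, defaultdict


def dominant_script(text: str) -> str:
    """Return a coarse script label (e.g. "LATIN") for the text.

    Heuristic, not real language detection: the first word of a
    character's Unicode name is used as a stand-in for its script.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")  # "" if the character is unnamed
            if name:
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def wins_by_script(battles):
    """Group wins per model by the dominant script of the transcribed text.

    `battles` is a hypothetical iterable of (text, winning_model) pairs.
    """
    table = defaultdict(Counter)
    for text, winner in battles:
        table[dominant_script(text)][winner] += 1
    return table
```

Each bucket in the returned table could then be ranked separately, which is roughly what a per-language leaderboard split would need.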
That's unfortunate, because I have a bunch of photos with handwritten German on the back that I need to transcribe, and seeing as I can't read German, I can't really do it by myself either.
I reckon performance on German will be similar to English; the only real difference is the umlauts, and those are very consistent. Not sure how it will do on the ß.