>Archive.today launches real browsers (not even headless) and tries to load lazy images, unroll folded content, login into accounts if prompted with login form, remove “subscribe our maillist” modals
There are some tricks that work for particular websites: for NYT it's enough to manually clear nytimes.com cookies, FT used to work after clicking through from Twitter/X, and so on. So I guess there is some set of heuristics.
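To make "set of heuristics" concrete, here is a minimal sketch of how such per-site rules could be written down as a lookup table. It is purely a guess at the shape: the table, `rule_for` and `plan_request` are hypothetical names, the two entries just restate the anecdotes above (and the FT one reportedly no longer works), and nothing here reflects what archive.today actually runs.

```python
from urllib.parse import urlsplit

# Hypothetical rule table. The domains and actions only mirror the anecdotes
# in this comment; they are not anything archive.today has published.
SITE_RULES = {
    "nytimes.com": {"clear_cookies": True},
    "ft.com": {"referer": "https://twitter.com/"},
}


def rule_for(url):
    """Return the rule whose domain matches the URL's host, or an empty rule."""
    host = (urlsplit(url).hostname or "").lower()
    for domain, rule in SITE_RULES.items():
        # Match the bare domain and any subdomain (e.g. www.nytimes.com).
        if host == domain or host.endswith("." + domain):
            return rule
    return {}


def plan_request(url, cookies):
    """Describe the headers and cookies a fetch would use under the rules."""
    rule = rule_for(url)
    headers = {}
    if "referer" in rule:
        headers["Referer"] = rule["referer"]
    # Drop all stored cookies when the rule asks for it, otherwise keep a copy.
    kept = {} if rule.get("clear_cookies") else dict(cookies)
    return {"headers": headers, "cookies": kept}
```

Calling `plan_request("https://www.nytimes.com/some-article", {"session": "abc"})` yields an empty cookie set, while an unlisted site passes through unchanged. A real crawler would presumably need far more than this (rendering, scrolling, modal removal), which a static table can't express.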
It seems that archive.is often has the full article for sites that are completely paywalled to every non-paying visitor: no cookie-driven freebies, nothing.
Publicly revealing everything they are doing would be a strategically bad idea, obviously.
It wouldn't be surprising if they actually pay for access to some of the sites.
>Before 2019 - PhantomJS, after - ordinary (not headless) Chromium/80 with few small patches.
https://blog.archive.today/post/618635148292964352/what-scra... (2020)
>Archive.today launches real browsers (not even headless) and tries to load lazy images, unroll folded content, login into accounts if prompted with login form, remove “subscribe our maillist” modals
https://blog.archive.today/post/642952252228812800/people-of...