
>What scraper or headless browser are you using? it works so well.

>Before 2019 - PhantomJS, after - ordinary (not headless) Chromium/80 with few small patches.

https://blog.archive.today/post/618635148292964352/what-scra... (2020)

>Archive.today launches real browsers (not even headless) and tries to load lazy images, unroll folded content, login into accounts if prompted with login form, remove “subscribe our maillist” modals

https://blog.archive.today/post/642952252228812800/people-of...
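For a sense of what "tries to load lazy images" can involve, here is a minimal sketch, not archive.today's actual code. It assumes the common pattern where a page puts the real image URL in an attribute such as data-src and swaps it into src on scroll. The attribute names in LAZY_ATTRS and the function name find_lazy_image_urls are my own illustrative choices; real sites vary, and a real browser would trigger the page's own scripts rather than parse the markup.

```python
from html.parser import HTMLParser

# Assumed attribute names for deferred images; purely illustrative.
LAZY_ATTRS = ("data-src", "data-lazy-src", "data-original")


class LazyImageFinder(HTMLParser):
    """Collects the deferred URL of every <img> that has a lazy attribute."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)  # HTMLParser lowercases attribute names
        for name in LAZY_ATTRS:
            value = attr_map.get(name)
            if value:  # skip missing or valueless attributes
                self.urls.append(value)
                break  # one URL per image is enough


def find_lazy_image_urls(html):
    """Return the URLs an archiver would still need to fetch."""
    finder = LazyImageFinder()
    finder.feed(html)
    finder.close()
    return finder.urls


if __name__ == "__main__":
    page = '<img src="placeholder.gif" data-src="/a.jpg"><img src="/b.jpg">'
    print(find_lazy_image_urls(page))
```

Images that already carry only a normal src are ignored, since the browser has loaded them anyway.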



I get that it convincingly simulates a human, but so do I (because I am a human), and I don't get through the paywall...


There are some tricks that work for particular websites. For example, for NYT it's enough to manually clear the nytimes.com cookies, FT used to work if you clicked through from Twitter/X, and so on. So I guess there is some set of per-site heuristics.


It seems that archive.is often has the full article for sites that are completely paywalled to every non-paying visitor: no cookie-driven freebies, nothing.

Publicly revealing everything they are doing would be a strategically bad idea, obviously.

It's not inconceivable that they actually pay for access to some of the sites; it wouldn't be surprising.



