
Do the major AI companies actually honor robots.txt? Even if some of their publicly known crawlers might do it, surely they have surreptitious campaigns where they do some hidden crawling, just like how they illegally pirate books, images and user data to train on.


My thought too: honoring robots.txt is just a convention. Nothing technically forces a crawler to follow it, and I don't think there's any automatic legal requirement either.

Maybe sites could add "you must honor policies set in robots.txt" to something like a terms of service but I have no idea if that would have enough teeth for a crawler to give up.
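
To make the "just a convention" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleAIBot` name, and the `may_fetch` helper are all made up for illustration. The check only happens because the crawler's author chose to call it; nothing on the server side enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler asks first; an impolite one simply skips this call.
    print(may_fetch("ExampleAIBot", "https://example.com/blog/post"))
    print(may_fetch("SomeOtherBot", "https://example.com/blog/post"))
```

A crawler that never calls something like `may_fetch` still gets the page, which is why any real enforcement has to come from blocking or from the courts.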


Cloudflare and their customers have been desperately trying for years to kill scrapers in court. This is all meaningless, but they are probably gearing up for another legal battle to define robots.txt as a legal contract. They're going to use this marketplace they're scamming people with to do it. They will fail.


I don't think terms of service are applicable anyway. Terms of service aren't a signed contract, since you may never see them or even know they exist. That holds whether you visit the site interactively or fetch a page programmatically.


There's a lack of clarity, but it seems likely to me that a majority of this traffic is actually people asking questions to the AI, and the AI going out and researching for answers. When the AI tools are being used like a web browser to do research, should they still be adhering to robots.txt, or is that only intended for search indexing?


Hard to tell, because minor crawlers mimic major companies to avoid getting banned.
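
Part of why it's hard to tell: the User-Agent header is just a string the client picks. A rough sketch with the standard `urllib.request` (the `ExampleAIBot/1.0` name and the `build_request` helper are hypothetical, and no request is actually sent):

```python
import urllib.request


def build_request(url: str, claimed_agent: str) -> urllib.request.Request:
    """Build (but do not send) a request that claims an arbitrary identity."""
    # The server only ever sees the string the client decides to put here.
    return urllib.request.Request(url, headers={"User-Agent": claimed_agent})


if __name__ == "__main__":
    req = build_request("https://example.com/", "ExampleAIBot/1.0")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

So the name in the logs alone doesn't prove who is crawling; a site would need some other signal, such as the source IP, to check the claim.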


Cloudflare, for all I hate their role as a gatekeeper these days, actually has the leverage to force the AI companies to bend.



