
The quality of the data matters a lot. Textbooks, scientific papers, etc. are substantially better for training smart and capable LLMs than random social chit-chat.

Google and Microsoft both have a lot of corporate info, including source repositories they can legally use.

Google has Google Books, Maps, and YouTube.

Microsoft has Azure, GitHub, LinkedIn, etc.

Facebook has… what? Instagram? Your crazy aunt screaming about her conspiracy of the week?



Not to mention bot spam. Google and Microsoft have a better idea of their data's provenance than Meta or Reddit.



