
LLMs are awful for this. I've got a project that's doing structured extraction and half the work is deduplication.

I didn't go down the LLM route for the cleanup, since you run into scale and context-window issues with larger datasets.

I got into semantic similarity networks for this use case. You can do efficient approximate nearest-neighbour matching with Annoy, set a similarity cutoff, and your isolated subgraphs (the connected components left after thresholding) are merge candidates.
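For anyone who wants the idea in code, here's a rough stdlib-only sketch. It is not how semnet does it. Big assumptions: the vectors are hand-made toy "embeddings" rather than output from a real model, and the neighbour search is a brute-force all-pairs loop standing in for an Annoy index. The function name `merge_candidates` and the 0.95 cutoff are made up for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_candidates(vectors, threshold=0.9):
    """Group item indices linked by similarity >= threshold (hypothetical helper)."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force all pairs; an ANN index such as Annoy would replace this at scale
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect connected components; singletons have nothing to merge with
    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) > 1)


# Toy data: made-up names and made-up 3-d vectors, for illustration only
names = ["Acme Corp", "ACME Corporation", "Globex", "Globex Inc.", "Initech"]
vectors = [
    [1.0, 0.1, 0.0],
    [0.98, 0.12, 0.01],
    [0.0, 1.0, 0.1],
    [0.02, 0.97, 0.12],
    [0.1, 0.0, 1.0],
]

clusters = merge_candidates(vectors, threshold=0.95)
for cluster in clusters:
    print([names[i] for i in cluster])
```

The cutoff is the knob that matters: too low and unrelated records get chained into one big component, too high and near-duplicates stay separate, so it's worth eyeballing a sample of components before merging anything.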

I wrapped up my code in a little library if you're into this sort of thing.

github.com/specialprocedures/semnet



Nice looking library! Might try it for one of my own projects.



