LLMs are awful for this. I've got a project that's doing structured extraction a...

LLMs are awful for this. I've got a project that's doing structured extraction and half the work is deduplication.

I didn't go down the route of LLMs for the clean up, as you're getting into scale and context issues with larger datasets.

I got into semantic similarity networks for this use case. You can do efficient pairwise matching with Annoy, set a cutoff threshold, and your isolated subgraphs are merger candidates.

I wrapped up my code in a little library if you're into this sort of thing.

github.com/specialprocedures/semnet