Show HN: A Vectorless LLM-Native Document Index Method (github.com/vectifyai)
14 points by mingtianzhang 77 days ago | 3 comments
The word "index" originally came from how humans retrieve info: book indexes and tables of contents that guide us to the right place in documents.

Computers later borrowed the term for data structures such as B-trees, hash tables, and, more recently, vector indexes. These are highly efficient for machines but abstract and unnatural: not something a human, or an LLM, can understand and use directly as a reasoning aid. This creates a gap between how indexes work for computers and how they should work for models that reason like humans.

PageIndex is a new step that "looks back to move forward". It revives the original, human-oriented idea of an index and adapts it for LLMs. Now the index itself (PageIndex) lives inside the LLM's context window: the model sees a hierarchical table-of-contents tree and reasons its way down to the right span, much like a person would retrieve information using a book's index.
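To make the idea concrete, here is a minimal sketch of a table-of-contents tree rendered as an indented outline that could be placed in a prompt. The `Node` schema, the `render_toc` helper, and the sample document are all hypothetical illustrations, not PageIndex's actual format or API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a table-of-contents tree (hypothetical schema)."""
    title: str
    start_page: int
    end_page: int
    children: list["Node"] = field(default_factory=list)


def render_toc(node: Node, depth: int = 0) -> str:
    """Render the tree as an indented outline, one line per node."""
    line = f"{'  ' * depth}- {node.title} (pp. {node.start_page}-{node.end_page})"
    return "\n".join([line] + [render_toc(c, depth + 1) for c in node.children])


# Made-up sample document, for illustration only.
doc = Node("Annual Report", 1, 40, [
    Node("Business Overview", 1, 10),
    Node("Financial Statements", 11, 40, [
        Node("Balance Sheet", 11, 20),
        Node("Cash Flow", 21, 40),
    ]),
])

toc_text = render_toc(doc)  # plain text the model can read and reason over
print(toc_text)
```

The point of the outline form is that the model reads the same kind of structure a person would scan in a book, then names the section (and page range) it wants to open.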

PageIndex MCP shows how this works in practice: it runs as an MCP server, exposing a document's structure directly to LLMs and agents. This means platforms like Claude, Cursor, or any MCP-enabled agent or LLM can navigate the index themselves and reason their way through documents, not with vectors or chunking, but in a human-like, reasoning-based way.



What happens when the TOC is too long? How does the index handle near misses? How do you disambiguate between close titles? What happens if the documents are not in a strict hierarchy?

Seems very situational.


Hi, thanks for your inspiring questions.

1. What happens when the TOC is too long? -- This is why we chose the tree structure. If the ToC is too long, it does a hierarchical search: it searches the parent-level nodes first, selects one node, and then searches that node's child nodes.

2. How does the index handle near misses, and how do you disambiguate between close titles? -- For each node, we generate a description or summary, so the model has more to go on than the title alone.

3. What if the documents are not in a strict hierarchy? -- The index just becomes a flat list structure, which you can still look through.
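The three answers above can be sketched roughly as follows: descend the tree one level at a time, choosing among siblings by title plus summary, with a flat document being the one-level case. Everything here is an assumption for illustration: `Node`, `descend`, and `keyword_chooser` are hypothetical names, and the word-overlap chooser is only a toy stand-in for the LLM's selection step, not how PageIndex actually picks a node.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    title: str
    summary: str  # short description used to tell similar titles apart
    children: list["Node"] = field(default_factory=list)


# A chooser sees the question and the sibling candidates and returns an index.
Chooser = Callable[[str, list[Node]], int]


def descend(root: Node, question: str, choose: Chooser) -> list[Node]:
    """Pick one child per level until a leaf is reached; return the path."""
    path, node = [root], root
    while node.children:
        node = node.children[choose(question, node.children)]
        path.append(node)
    return path


def keyword_chooser(question: str, candidates: list[Node]) -> int:
    """Toy stand-in for the LLM: score by word overlap with title + summary."""
    q = set(question.lower().split())
    scores = [len(q & set(f"{c.title} {c.summary}".lower().split()))
              for c in candidates]
    return scores.index(max(scores))


# Made-up tree with two identically titled siblings.
root = Node("Report", "Annual report", [
    Node("Risk Factors", "Market and credit risk discussion", [
        Node("Risk Factors", "Interest rate exposure and hedging"),
        Node("Risk Factors", "Litigation and regulatory matters"),
    ]),
    Node("Financials", "Audited statements and notes"),
])

path = descend(root, "what litigation risk is pending", keyword_chooser)
print(" > ".join(n.title for n in path), "|", path[-1].summary)
```

Because only one level of siblings is shown to the chooser at a time, a very long ToC never has to fit in a single prompt, and the summaries are what separate the two "Risk Factors" leaves. A document with no hierarchy is just a root whose children are all leaves, so the loop runs once.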

We also wrote up how it can be combined with a reasoning process, along with some comparisons to vector DBs; see https://vectifyai.notion.site/PageIndex-for-Reasoning-Based-....

We found our MCP service generally works well on financial, legal, textbook, and research-paper documents; see https://pageindex.ai/mcp for some examples.

We do agree that in some cases, like recommendation systems, you need semantic similarity and a vector DB, so I wouldn't recommend this approach there. Keen to learn about more cases that we haven't thought through!


thanks!



