How It Works
The Big Picture
A document goes in. A queryable knowledge store comes out.
Ingestible takes any document — PDF, DOCX, EPUB, HTML, audio, video, even a ZIP archive — and transforms it into a structured, searchable knowledge store that an AI can query efficiently. Instead of dumping an entire 500-page book (92,598 tokens) into a context window, your AI sends a search query and gets back exactly the passages it needs: ~1,317 tokens. That is a 99% reduction.
Document (any format)
↓
Parse → Clean → Structure → Chunk → Validate → Enrich → Embed & Index → Store
↓
Queryable knowledge store (three search indexes, hierarchical chunks, structured metadata)
The pipeline runs once per document. Every query after that is cheap — no LLM calls, just index lookups.
The Pipeline
Stage 1: Parse
What it does: Converts your document into clean, structured markdown — regardless of what format it started as.
Ingestible handles 25+ document formats through format-specific parsers, plus source code files via tree-sitter:
| Format | How it's parsed |
|---|---|
| PDF | IBM Docling for deep layout analysis (tables, headings, figures). Falls back to PyMuPDF if needed. Retries with OCR if text is sparse. |
| DOCX, PPTX, XLSX | Native Python parsers, converted to markdown |
| EPUB | Chapter-based splitting, each chapter becomes a page |
| HTML / Markdown / RST | Direct conversion to normalized markdown |
| Audio (MP3, WAV, FLAC, ...) | Local Whisper transcription with timestamps |
| Video (MP4, MKV, AVI, ...) | Audio extraction + transcription + keyframe capture |
| Images (PNG, JPG, TIFF, ...) | OCR + layout analysis via Docling |
| Email (EML, MSG) | Headers, body, and inline attachments |
| ZIP archives | Auto-detects Notion exports, Confluence exports, or generic file archives |
| Source code (Python, TS, Go, Rust, Java, C/C++, Ruby, Swift, ...) | tree-sitter AST parsing with function/class boundaries (requires ingestible[code]) |
What it produces: A ParsedDocument — title, page count, table of contents, and per-page markdown content with tables and figures preserved.
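The exact schema of ParsedDocument isn't spelled out here, but the shape described above can be sketched as a dataclass. Field and class names below are illustrative, not the library's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class TocEntry:
    title: str
    level: int
    page: int

@dataclass
class ParsedDocument:
    title: str
    page_count: int
    toc: list[TocEntry] = field(default_factory=list)
    # One markdown string per page, with tables and figures preserved
    pages: list[str] = field(default_factory=list)

doc = ParsedDocument(
    title="French Cooking",
    page_count=2,
    toc=[TocEntry("Sauces", level=1, page=1)],
    pages=["# Sauces\n\n| Sauce | Base |\n|---|---|\n| Bechamel | Roux |", "(page 2)"],
)
```

Every downstream stage (clean, structure, chunk) operates on this one normalized shape, which is what lets 25+ input formats share a single pipeline.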
Stage 2: Clean
What it does: Normalizes the parser output. Strips artifacts, fixes whitespace, removes conversion noise.
This is a small but important step. Parsers, especially PDF parsers, produce inconsistent output: stray line breaks inside paragraphs, duplicated headers, encoding artifacts. The cleaning stage irons these out so downstream stages work with consistent, high-quality text.
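The kinds of fixes described here can be sketched with a few regex passes. This is a toy version under simplifying assumptions (the real cleaning stage is more careful about markdown syntax):

```python
import re

def clean_markdown(text: str) -> str:
    # Re-join lines broken mid-sentence: a lone newline followed by a
    # lowercase letter is usually a wrapped paragraph, not a new block
    text = re.sub(r"(?<![.\n:])\n(?=[a-z])", " ", text)
    # Collapse runs of blank lines into a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop immediately repeated lines (e.g. duplicated running headers)
    out: list[str] = []
    for line in text.split("\n"):
        if not (out and line.strip() and line == out[-1]):
            out.append(line)
    return "\n".join(out).strip()
```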
Stage 3: Structure
What it does: Builds a hierarchy tree — giving the AI a map of the document.
The structuring engine looks for hierarchy signals in this order:
- Embedded TOC tables — Many PDFs contain a table of contents rendered as a markdown table. These are parsed with page number associations.
- Heading patterns — #/##/### markdown headings, numbered sections ("1.1 Title"), chapter patterns ("Chapter 3:").
- Page range heuristics — When heading levels are ambiguous (common in PDFs where visual hierarchy is lost), the engine infers structure from section length: 5+ pages is a chapter, 2-4 pages is a section, 1 page is a subsection.
What it produces: A DocumentStructure — a tree of nodes with titles, levels, page ranges, and content. This tree drives how the document gets chunked.
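The page-range heuristic is simple enough to sketch directly. A toy version (the real engine combines this with the TOC and heading signals above):

```python
def infer_level(page_span: int) -> int:
    """Guess a node's hierarchy level from how many pages it spans."""
    if page_span >= 5:
        return 1  # chapter
    if page_span >= 2:
        return 2  # section
    return 3      # subsection
```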
Stage 4: Chunk
What it does: Splits the document into a 4-level hierarchy of chunks, from broad overview down to individual passages.
This is where the token savings come from. Instead of one giant blob of text, the document becomes a navigable tree:
| Level | What it represents | Typical size |
|---|---|---|
| L0 | The whole document | ~500-800 tokens |
| L1 | A chapter | ~300-500 tokens |
| L2 | A section within a chapter | ~200-400 tokens |
| L3 | A single passage | ~250-500 tokens |
Five chunking strategies are available:
- Paragraph (default) — splits on natural paragraph boundaries
- Semantic — splits where the meaning shifts between sentences (uses embedding cosine distance)
- Recursive — recursive splitting with configurable overlap
- Docling — uses IBM Docling's native document segmentation (PDF only)
- Code — AST-based splitting via tree-sitter; chunks at function, class, and module boundaries. Requires `pip install ingestible[code]`. Auto-selected for source code files with the `code` profile.
Key rules the chunker follows:
- Never splits mid-sentence or mid-paragraph
- Tables and code blocks are kept atomic (never split across chunks)
- Small trailing chunks get merged into the previous passage
- Configurable overlap between adjacent passages for continuity
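The paragraph strategy plus the small-trailing-chunk rule can be sketched as a greedy splitter. Token counts are approximated as whitespace-separated words here, and the thresholds are illustrative defaults, not the library's:

```python
def chunk_paragraphs(text: str, max_tokens: int = 400, min_tokens: int = 50) -> list[str]:
    """Greedy paragraph-boundary chunking with small-trailing-chunk merging."""
    def n_tokens(s: str) -> int:
        return len(s.split())  # crude stand-in for a real tokenizer

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}".strip()
        if current and n_tokens(candidate) > max_tokens:
            chunks.append(current)   # close the chunk at a paragraph boundary
            current = para
        else:
            current = candidate
    if current:
        if chunks and n_tokens(current) < min_tokens:
            # Merge a too-small trailing chunk into the previous passage
            chunks[-1] = f"{chunks[-1]}\n\n{current}"
        else:
            chunks.append(current)
    return chunks
```

Because splits only ever happen between paragraphs, the "never mid-sentence" rule falls out for free.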
What it produces: A ChunkSet — the L0 overview, plus lists of L1, L2, and L3 chunks with parent-child relationships.
Stage 5: Validate
What it does: Catches problems before the expensive enrichment stage.
Validation checks chunk sizes, hierarchy integrity, and content coverage. If a chunk is too large, too small, or orphaned from the hierarchy, this stage flags it. Better to catch a structuring issue here than waste LLM credits enriching broken chunks.
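The three checks named above can be sketched as one pass over the chunk set. The dict shape and thresholds are illustrative, not the library's actual schema:

```python
def validate_chunks(chunks, min_tokens=20, max_tokens=800, known_parents=frozenset()):
    """Flag chunks that are too small, too large, or orphaned from the hierarchy.

    `chunks` is a list of dicts with `id`, `parent_id`, and `tokens` keys.
    Returns (chunk_id, problem) pairs for anything worth fixing before enrichment.
    """
    issues = []
    for chunk in chunks:
        if chunk["tokens"] < min_tokens:
            issues.append((chunk["id"], "too small"))
        elif chunk["tokens"] > max_tokens:
            issues.append((chunk["id"], "too large"))
        if chunk["parent_id"] not in known_parents:
            issues.append((chunk["id"], "orphaned"))
    return issues
```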
Stage 6: Enrich (optional)
What it does: A one-time LLM pass that generates metadata for dramatically better search. This is where retrieval quality jumps from good to excellent.
Enrichment runs bottom-up — L3 passages first, then L2 sections, then L1 chapters, then the L0 document overview. For each level, the LLM generates different metadata:
Per L3 passage:
- A 2-3 sentence summary
- Key concepts discussed
- 3-5 hypothetical questions this passage answers
- Keywords for keyword search
- Difficulty level and cross-references
Per L2 section: Section summary, key points, concepts introduced
Per L1 chapter: Chapter summary, main topics, learning objectives
Per L0 document: Domain, document type, target audience, executive summary, key findings
The hypothetical questions are particularly powerful. A passage about "nervous system overflow reactions" gets tagged with questions like "What happens during a meltdown?" — so when a user asks that exact question, vector search finds the passage even though the wording is completely different.
Enrichment is skippable (--skip-enrichment). The pipeline works without it — you just get fewer embeddings per passage and no concept index.
Enrichment cost (using gpt-4o-mini):
| Document size | Estimated cost | Time |
|---|---|---|
| 55 pages | ~$0.01 | ~1 min |
| 500 pages | ~$0.20 | ~5 min |
| 1,000 pages | ~$0.50 | ~10 min |
Stage 7: Embed & Index
What it does: Converts chunks into searchable indexes — three of them.
Vector index (ChromaDB): Each chunk is converted into a 1024-dimensional vector using E5-large-v2. With enrichment, each passage produces multiple embeddings: one for the raw content, one for the summary, and one for each hypothetical question. All point back to the same chunk. At query time, the best-matching embedding wins.
L0 digest embeddings (executive summary, key findings) are also indexed, so high-level questions about the document match directly.
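The fan-out and best-match-wins behavior can be sketched without a real embedding model. Here word overlap stands in for cosine similarity, and the dict shape is illustrative; the real index stores E5-large-v2 vectors in ChromaDB:

```python
def index_entries(chunk):
    """One chunk fans out into several searchable texts, all keyed to its id."""
    texts = [chunk["content"], chunk["summary"], *chunk["questions"]]
    return [(chunk["id"], text) for text in texts]

def search(entries, query):
    """Toy similarity: word overlap. The best-matching entry per chunk wins."""
    q = set(query.lower().split())
    best = {}
    for chunk_id, text in entries:
        score = len(q & set(text.lower().split())) / max(len(q), 1)
        if score > best.get(chunk_id, (0.0, None))[0]:
            best[chunk_id] = (score, text)
    return sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
```

Note that a chunk can rank first because one of its hypothetical questions matched, even when its raw content shares no words with the query.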
BM25 index: Classic keyword search. Scores passages by how often query terms appear, weighted by how rare those terms are across the whole document. Indexed over content plus any keywords extracted during enrichment.
Concept index: A simple mapping from concept names to chunk IDs. When a query mentions "meltdowns," the concept index immediately returns every chunk tagged with that concept.
What it produces: Three complementary search indexes, each with different strengths. They get fused at query time.
Stage 8: Store
What it does: Persists everything as a JSON file hierarchy.
data/documents/{doc_id}/
meta.json # document metadata + L0 chunk
structure.json # full hierarchy tree
chapters/
{chapter_id}/
meta.json # L1 chapter data
sections/
{section_id}/
meta.json # L2 section data
chunks/
{chunk_id}.json # L3 passage data
chroma/ # vector index
bm25_index.pkl # keyword index
concept_index.json # concept-to-chunk mapping
knowledge_graph.json # entity relationships (when enabled)
No database server required. Each document is self-contained in its own directory. Back it up, move it, delete it — it is just files.
The Chunk Hierarchy
Here is what the four levels look like for a concrete example: a 500-page book about cooking.
L0 — Document overview (~600 tokens) Title, author, executive summary ("A comprehensive guide to French cooking techniques covering knife skills, sauces, pastry, and plating"), and the full table of contents. This is the map. The AI reads this first and knows the entire structure of the book.
L1 — Chapter summaries (~400 tokens each) "Part II: The Five Mother Sauces" — a summary of the chapter covering bechamel, veloute, espagnole, hollandaise, and tomato sauce. Main topics, learning objectives, key takeaways. There might be 15-20 L1 chunks for a 500-page book.
L2 — Section summaries (~300 tokens each) "Chapter 7, Section 3: Hollandaise and Its Derivatives" — summary, key points (temperature control, emulsion science, fixing a broken sauce), concepts introduced. There might be 80-120 L2 chunks.
L3 — Passages (~350 tokens each) The actual text: "To rescue a broken hollandaise, remove the bowl from heat immediately. Add one tablespoon of cold water and whisk vigorously. The temperature shock helps the emulsion re-form..." There might be 600-800 L3 chunks.
When an AI asks "How do I fix a broken hollandaise?", it does not need to read all 500 pages. It gets:
- The L0 overview (sees that Chapter 7 covers sauces)
- Search results pointing to the specific L3 passage about rescuing broken hollandaise
- The full passage content
Total: ~1,500 tokens instead of ~90,000.
How Search Works
Every query runs three search systems in parallel, then merges results into one ranked list.
The Three Indexes
Vector search (weight: 0.4) finds passages by meaning. "How do I handle sensory overload?" matches a passage about "nervous system overflow reactions" because the embedding model understands they mean the same thing. With enrichment, each passage has 5-6 embeddings (content, summary, hypothetical questions), so there are more chances to match.
BM25 keyword search (weight: 0.6) finds passages by exact terms. If you search for "hollandaise" and only 3 passages out of 700 mention that word, those 3 passages get a very strong signal. BM25 catches exact matches that vector search might rank lower.
Concept index (small boost) does a direct lookup. If the query mentions a known concept, every chunk tagged with that concept gets a small score bump.
Reciprocal Rank Fusion (RRF)
The three indexes return separate ranked lists. RRF merges them without needing to normalize scores across systems:
score(chunk) = 0.4 / (60 + vector_rank) + 0.6 / (60 + bm25_rank) + concept_boost
The key property: chunks that multiple systems agree on get naturally boosted. A passage ranked #1 in vector search and #3 in BM25 scores higher than a passage ranked #1 in vector search alone. Cross-validation between systems is a strong relevance signal.
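The fusion formula above is a few lines of code. This sketch uses the weights and k=60 constant from the formula; the flat concept-boost value is an assumed placeholder:

```python
def rrf_fuse(vector_ranked, bm25_ranked, concept_hits=frozenset(),
             w_vector=0.4, w_bm25=0.6, k=60, concept_boost=0.01):
    """Reciprocal Rank Fusion over two ranked lists of chunk ids,
    plus a small flat boost for concept-index hits."""
    scores = {}
    for weight, ranked in ((w_vector, vector_ranked), (w_bm25, bm25_ranked)):
        for rank, chunk_id in enumerate(ranked, start=1):
            # Rank-based, so raw scores never need normalizing across systems
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    for chunk_id in concept_hits:
        if chunk_id in scores:
            scores[chunk_id] += concept_boost
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk ranked #1 by vector search and #2 by BM25 ("a" below) outranks BM25's own #1 ("c"), which is the cross-validation property in action.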
The Full Retrieval Flow
- Get the map — The AI reads the L0 overview (~500-800 tokens): title, executive summary, table of contents.
- Search — The AI sends a query. Hybrid search returns the top passages with scores, locations, and which search systems matched.
- Drill down — The AI requests the full content of the most promising passages.
Total context per query: ~1,000-2,000 tokens.
Search Enhancements
Four optional features push search quality further. All are off by default and can be enabled per-query or globally.
Auto-merge — When 3 or more L3 passages from the same L2 section all match a query, that is a signal the whole section is relevant. Auto-merge returns the parent L2 section instead of the individual fragments, giving the AI more coherent context. Enable with --auto-merge or INGEST_AUTO_MERGE_ENABLED=true.
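The auto-merge rule reduces to counting how many hits share a parent. A minimal sketch, assuming a simple L3-to-L2 parent mapping (names illustrative):

```python
from collections import Counter

def auto_merge(results, parent_of, threshold=3):
    """Replace groups of `threshold`+ sibling L3 hits with their L2 parent.

    `results` is a ranked list of L3 chunk ids; `parent_of` maps L3 id -> L2 id.
    """
    counts = Counter(parent_of[r] for r in results if r in parent_of)
    merged, seen = [], set()
    for r in results:
        parent = parent_of.get(r)
        if parent and counts[parent] >= threshold:
            if parent not in seen:   # emit the parent once, at the first hit's rank
                merged.append(parent)
                seen.add(parent)
        else:
            merged.append(r)
    return merged
```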
HyDE (Hypothetical Document Embeddings) — Instead of embedding the raw query, Ingestible first asks an LLM to generate a hypothetical answer paragraph, then embeds that. The hypothetical answer is closer in vector space to the actual answer than the question is. Useful for complex or abstract queries. Enable with --hyde or INGEST_HYDE_ENABLED=true.
Knowledge graph retrieval — When knowledge graph extraction is enabled during enrichment, Ingestible builds a graph of entity relationships (e.g., "Hollandaise — is_a — Mother Sauce", "Bechamel — uses — Roux"). At search time, the graph finds related chunks that share entities with the top results but might not match by text. Enable with --kg or INGEST_KG_RETRIEVAL_ENABLED=true.
Cross-encoder reranking — A second-pass reranker that scores each (query, passage) pair jointly for more accurate relevance estimates. Slower than the initial retrieval but more precise. The top candidates from RRF fusion are rescored, and the new scores are blended into the final ranking. Configure with INGEST_RERANKER_MODEL.
What Makes It Different
The core insight is token economics.
Most RAG systems retrieve chunks and stuff them into context. Ingestible takes a different approach: it builds a hierarchical knowledge store with three complementary search indexes, so the AI reads only what it needs.
The Numbers
| Document | Full context | Per query | Savings |
|---|---|---|---|
| 55-page research paper | 4,975 tokens | ~585 tokens | 88% |
| 513-page book | 92,598 tokens | ~1,317 tokens | 99% |
| 1,000-page technical manual | ~180,000 tokens | ~2,000 tokens | 99% |
The 1,000-page manual does not even fit in most context windows. Without hierarchical retrieval, the AI simply cannot use it. With Ingestible, it costs ~2,000 tokens per query — room for dozens of queries in a single conversation.
Why It Works
Nothing is lost. All content is stored and indexed. The savings come from the AI reading only what is relevant per query, not from discarding content.
The hierarchy is the map. The L0 overview tells the AI the full structure of the document in ~800 tokens. It knows where topics live before searching.
Semantic search finds meaning, not keywords. "How do I handle sensory overload?" matches "nervous system overflow reactions" because vector embeddings capture meaning.
Multiple systems cross-validate. A passage found by both vector search and keyword search is more likely relevant than one found by only one system. RRF fusion makes this automatic.
Enrichment pays for itself. A one-time LLM pass at ingestion (~$0.20 for a 500-page book) generates summaries, hypothetical questions, and concept tags that improve every subsequent query. No LLM calls at search time.