How Search Works
The Three Search Systems
Every query runs three search systems in parallel, then merges results into one ranked list.
1. Vector Search (weight: 0.4)
What it does: Finds passages that are semantically similar to the query, even if they share no keywords.
How:
- At ingestion, every L3 passage is converted to a 1024-dimensional vector by E5-large-v2
- The model understands meaning — "Reizüberflutung" and "Nervensystem überläuft" end up near each other in vector space
- At query time, the question is converted to the same kind of vector
- ChromaDB finds the closest passage vectors by cosine distance
With enrichment, each passage produces multiple embeddings:
- Content embedding (the raw text)
- Summary embedding ("Explains meltdowns as nervous system overload...")
- 3-5 hypothetical question embeddings ("What happens during a meltdown?")
L0 digest embeddings are also indexed:
- Executive summary embedding — matches high-level queries about the document
- Key findings embeddings — one per finding, matches specific result queries
All embeddings point back to the same chunk_id. At query time, the search runs against all embeddings and deduplicates by chunk — keeping the best match. This means a query like "Was tun bei sensorischer Überlastung?" might match a hypothetical question embedding almost exactly, even if the raw content uses different wording.
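The deduplication step can be sketched in a few lines. This is a minimal illustration, not the project's implementation; the tuple shape and field names are assumptions for the example:

```python
def dedupe_by_chunk(hits):
    """hits: iterable of (chunk_id, score, embedding_kind) tuples, where a
    higher score means a closer match. Keep only the best hit per chunk."""
    best = {}
    for chunk_id, score, kind in hits:
        if chunk_id not in best or score > best[chunk_id][0]:
            best[chunk_id] = (score, kind)
    return best

hits = [
    ("chunk-7", 0.71, "content"),
    ("chunk-7", 0.93, "question"),  # hypothetical-question embedding matched best
    ("chunk-2", 0.64, "summary"),
]
```

Here "chunk-7" survives with its question-embedding score of 0.93; the weaker content-embedding hit for the same chunk is discarded.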
Embedding configuration:
- Model: configurable via INGEST_EMBEDDING_MODEL (default: intfloat/e5-large-v2)
- Matryoshka truncation: INGEST_EMBEDDING_DIMENSIONS (empty = full dimensions)
- Binary quantization: INGEST_EMBEDDING_QUANTIZATION (none or binary)
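The two post-processing options can be illustrated in plain Python. This is a conceptual sketch of the transforms these settings enable, not the project's actual code:

```python
import math

def truncate_and_normalize(vec, dims):
    """Matryoshka-style truncation: keep the first `dims` components,
    then re-normalize so cosine similarity still works on unit vectors."""
    v = vec[:dims]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def binary_quantize(vec):
    """Binary quantization: 1 bit per dimension, the sign of each component."""
    return [1 if x > 0 else 0 for x in vec]
```

Both trade a little accuracy for much smaller index storage: truncation shrinks the dimension count, quantization shrinks each dimension to one bit.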
E5-large-v2 specifics:
- Requires prefix "passage: " for indexed documents
- Requires prefix "query: " for search queries
- normalize_embeddings=True for unit-length vectors (cosine similarity)
- Runs on MPS (Apple Silicon GPU) via sentence-transformers
2. BM25 Search (weight: 0.6)
What it does: Classic keyword/term-frequency matching. Catches exact term matches that vector search might rank lower.
How:
- BM25 (Best Match 25) scores each passage by:
  - How often query terms appear in the passage (term frequency)
  - How rare those terms are across all passages in the document (inverse document frequency)
  - Passage length normalization
- "Reizüberflutung" appearing in only 3 of 656 passages → very high signal when it does appear
- Indexed over content + keywords (keywords populated by enrichment)
- Language-aware tokenization, detected automatically per document
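The scoring described above can be sketched as a minimal Okapi BM25 implementation. The parameter defaults k1=1.5 and b=0.75 are the common textbook values, an assumption here; the project's tokenizer and parameters may differ:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Minimal Okapi BM25 over pre-tokenized passages (lists of terms)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency: in how many passages does each query term occur?
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            # rare terms get a high IDF, hence high signal
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["reizüberflutung", "hilfe"], ["hilfe", "alltag"], ["alltag", "plan"]]
scores = bm25_scores(["reizüberflutung"], docs)
```

Only the first passage contains the rare term, so only it scores above zero; this is the "3 of 656 passages" effect in miniature.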
When BM25 beats vector search:
- Exact technical terms, names, acronyms
- Rare domain-specific vocabulary
- When the query uses the exact same phrasing as the source
Alternative: SPLADE sparse retrieval
SPLADE replaces BM25 with learned sparse vectors. Instead of raw term frequency, a transformer model learns which terms are important and expands the vocabulary with related terms. Configure with INGEST_SPARSE_RETRIEVAL=splade and INGEST_SPLADE_MODEL (default: naver/splade-cocondenser-ensembledistil). Requires pip install transformers torch.
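Conceptually, a SPLADE score is a dot product over sparse learned term weights. A toy sketch with invented weights, purely for illustration of the data shape:

```python
def sparse_dot(query_vec, doc_vec):
    """SPLADE represents query and passage as sparse {term: weight} maps
    over the model vocabulary; relevance is their dot product."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# The model may expand "meltdown" with learned related terms like "overload",
# so a passage can match even without the literal query term.
query_vec = {"meltdown": 1.2, "overload": 0.4}
doc_vec = {"overload": 0.9, "sensory": 0.7}
```

Here the passage contains neither query word verbatim as typed by the user, yet still scores through the expanded "overload" term.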
3. Concept Index (small boost)
What it does: Direct concept → chunk_ids lookup for "tell me everything about X" queries.
How:
- At ingestion (with enrichment), the LLM extracts concepts from each passage:
["Meltdown", "sensorische Überlastung", "Reizreduktion"] - Stored as a JSON mapping:
{"meltdown": ["chunk-1", "chunk-7"], "reizreduktion": ["chunk-1", "chunk-12"]} - At query time, a substring match finds all chunks tagged with matching concepts
- Adds a small fixed score boost (0.005) to matched chunks
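A minimal sketch of the lookup and boost; the exact direction of the substring test is an assumption here:

```python
def concept_boost(query, concept_index, boost=0.005):
    """Substring-match the query against concept keys and give every
    tagged chunk a small fixed score boost."""
    q = query.lower()
    boosts = {}
    for concept, chunk_ids in concept_index.items():
        if concept in q or q in concept:  # matching direction is an assumption
            for cid in chunk_ids:
                boosts[cid] = boost
    return boosts

concept_index = {
    "meltdown": ["chunk-1", "chunk-7"],
    "reizreduktion": ["chunk-1", "chunk-12"],
}
```

The boost is deliberately tiny relative to RRF scores: it nudges ties rather than overriding the ranked lists.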
Without enrichment: The concept index is empty. Concepts come from Stage 4 LLM analysis.
Optional: Cross-Encoder Reranking
When configured (INGEST_RERANKER_MODEL), a cross-encoder reranker rescores the top candidates from the initial retrieval:
- RRF fusion produces a candidate list (top INGEST_RERANKER_TOP_K, default 50)
- Cross-encoder scores each (query, passage) pair
- Final ranking blends the RRF score with the reranker score (INGEST_RERANKER_WEIGHT)
Cross-encoder models are more accurate than bi-encoders but slower — they process query and document together. This is why they're used as a second-pass reranker, not the primary retrieval.
Configuration:
- INGEST_RERANKER_MODEL: model name (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2). Empty = disabled.
- INGEST_RERANKER_TOP_K: candidates to rescore (default: 50)
- INGEST_RERANKER_WEIGHT: blend weight between RRF and reranker scores (default: 0.5)
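One way to sketch the blend step. The min-max normalization of cross-encoder logits is an assumption made so the two score scales are comparable; the project's actual blending formula may differ:

```python
def blend_scores(rrf_scores, ce_scores, weight=0.5):
    """Blend RRF scores with cross-encoder scores for the rescored candidates.
    Cross-encoder logits are min-max normalized into [0, 1] first."""
    lo, hi = min(ce_scores.values()), max(ce_scores.values())
    span = (hi - lo) or 1.0
    return {
        cid: (1 - weight) * rrf_scores[cid] + weight * (ce_scores[cid] - lo) / span
        for cid in ce_scores
    }
```

With weight 0.5 a candidate the cross-encoder strongly prefers can overtake one that merely ranked well in the initial retrieval.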
Optional: Sub-Question Decomposition
When enabled (INGEST_SUB_QUESTIONS_ENABLED=true), complex queries are decomposed into simpler sub-questions via LLM. Each sub-question is searched independently, and results are merged. This helps with multi-part questions like "Compare the methodology and results of studies A and B" where a single query might miss one aspect.
Configure with INGEST_MAX_SUB_QUESTIONS (default: 3).
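A sketch of the merge step, assuming each sub-question search returns (chunk_id, score) pairs and that the best score per chunk wins. That merge policy is one plausible choice, not necessarily the one implemented:

```python
def merge_sub_question_results(per_question_results):
    """Merge ranked result lists from each sub-question, keeping each
    chunk's best score across all sub-questions."""
    merged = {}
    for results in per_question_results:
        for chunk_id, score in results:
            merged[chunk_id] = max(score, merged.get(chunk_id, 0.0))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

A chunk that answers only one sub-question still surfaces, which is the point: a single combined query would have diluted its signal.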
The Merge: Reciprocal Rank Fusion (RRF)
Each system returns its own ranked list. RRF combines them without needing to normalize scores across systems (which would be comparing apples to oranges):
score(chunk) = vector_weight / (K + vector_rank)
+ bm25_weight / (K + bm25_rank)
+ concept_boost
Where K = 60 (standard RRF constant), vector_weight = 0.4, bm25_weight = 0.6.
Example
A chunk ranked #1 in vector search and #3 in BM25:
0.4 / (60 + 1) + 0.6 / (60 + 3) = 0.00656 + 0.00952 = 0.01608
A chunk ranked #10 in vector search only (not in BM25 top results):
0.4 / (60 + 10) + 0 = 0.00571
The first chunk wins — it appeared in both systems. This is the key property of RRF: chunks that multiple systems agree on get naturally boosted.
The match_sources field on each result tells you which systems contributed: ["vector", "bm25"] means both agreed, ["vector"] means only semantic similarity matched.
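The fusion formula translates directly into code; this sketch reproduces the worked example above:

```python
def rrf_score(ranks, weights, k=60, concept_boost=0.0):
    """ranks: {system: 1-based rank, or None if the chunk wasn't returned
    by that system}. Implements weighted Reciprocal Rank Fusion."""
    score = concept_boost
    for system, rank in ranks.items():
        if rank is not None:
            score += weights[system] / (k + rank)
    return score

weights = {"vector": 0.4, "bm25": 0.6}
both = rrf_score({"vector": 1, "bm25": 3}, weights)             # ~0.01608
vector_only = rrf_score({"vector": 10, "bm25": None}, weights)  # ~0.00571
```

Note that ranks, not raw scores, feed the formula, which is what makes fusing incomparable scoring systems safe.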
Version-Aware Weighting
When document versioning is active, superseded chunks (from older versions) receive a 0.3x multiplier on their final RRF score. This means current-version results naturally rank higher while older versions remain discoverable if the query specifically matches them.
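The multiplier is a simple post-fusion step; as a sketch:

```python
def apply_version_weight(score, is_superseded, multiplier=0.3):
    """Chunks from superseded document versions keep 30% of their RRF score,
    so current-version chunks outrank them unless the old match is much stronger."""
    return score * multiplier if is_superseded else score
```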
Corpus Search (Cross-Document)
File: search/corpus.py
Corpus search queries all ingested documents simultaneously:
- Discovers all document directories under data/documents/
- Loads a HybridSearcher for each document (LRU-cached)
- Runs hybrid search against each document in parallel
- Merges results across documents via cross-document RRF
- Results include the source document title for attribution
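The flow above can be sketched with a thread pool and cross-document RRF. The searcher shape and the unweighted 1/(k + rank) fusion are assumptions for the example:

```python
from concurrent.futures import ThreadPoolExecutor

def corpus_search(query, searchers, k=60):
    """searchers: {doc_title: callable(query) -> ranked list of chunk_ids}.
    Runs every document's search in parallel, then fuses the per-document
    rankings with cross-document RRF, keeping the title for attribution."""
    titles = list(searchers)
    with ThreadPoolExecutor() as pool:
        ranked_lists = list(pool.map(lambda t: searchers[t](query), titles))
    fused = {}
    for title, ranking in zip(titles, ranked_lists):
        for rank, chunk_id in enumerate(ranking, start=1):
            key = (title, chunk_id)
            fused[key] = fused.get(key, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Because RRF works on ranks, each document's searcher can score however it likes; the fusion stays meaningful across documents.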
Use via CLI (ingest corpus-search "query") or the web UI search page.
The Full Retrieval Flow
When an AI queries a document, the interaction follows three steps:
Step 1: Get the Map (L0)
The AI receives the document overview: title, total pages, executive summary, and the full table of contents with chapter/section titles. This is ~500-800 tokens and gives the AI a structural understanding of where topics live.
Step 2: Search
The AI sends a natural language query. The hybrid search (vector + BM25 + concept) returns the top N passages ranked by relevance. Each result includes:
- The passage content (or summary if enriched)
- Score and which search systems matched
- Chapter/section location
- Page numbers
Step 3: Drill Down
The AI can request full content of specific passages or sections based on what looked promising in Step 2.
Total context per query: ~1,000-2,000 tokens instead of the full document (which could be 90,000+ tokens for a 500-page book).
Impact of Enrichment on Search Quality
| Aspect | Without Enrichment | With Enrichment |
|---|---|---|
| Embeddings per passage | 1 (content only) | 5-6 (content + summary + questions) |
| L0 digest embeddings | None | Executive summary + key findings |
| BM25 index | Content only | Content + extracted keywords |
| Concept index | Empty | Populated with LLM-extracted concepts |
| Query matching | Must match content wording | Can match questions, summaries, concepts |
| Result display | Raw content snippet | Summary + content |
Enrichment is a one-time cost at ingestion. Query time is always cheap (embedding lookup + BM25 score + concept lookup, no LLM in the loop).