Usage Guide
Installation
cd Ingestible
# Create virtual environment (requires Python 3.11-3.13, ChromaDB incompatible with 3.14)
python3.12 -m venv .venv
source .venv/bin/activate
# Install
pip install -e ".[dev]"
Docker
cp .env.example .env
# Edit .env with your API keys
docker compose up -d
The API and web UI are available at http://localhost:8081. Data is persisted via a Docker volume.
The API runs behind gunicorn with multiple workers for production use. For development, use ingest serve directly.
Configuration
Copy .env.example to .env and fill in:
cp .env.example .env
Required for enrichment (Stage 4):
INGEST_LLM_PROVIDER=anthropic # or "openai"
INGEST_OPENAI_API_KEY=sk-... # needed if provider=openai
INGEST_OPENAI_MODEL=gpt-4o-mini
INGEST_ANTHROPIC_API_KEY=sk-ant-... # needed if provider=anthropic
INGEST_ANTHROPIC_MODEL=claude-haiku-4-5-20251001
Embedding:
INGEST_EMBEDDING_MODEL=intfloat/e5-large-v2 # default
INGEST_EMBEDDING_DIMENSIONS= # Matryoshka truncation (empty = full)
INGEST_EMBEDDING_QUANTIZATION=none # "none" or "binary"
Chunking:
INGEST_MAX_CHUNK_TOKENS=500 # max L3 passage size
INGEST_MIN_CHUNK_TOKENS=100
INGEST_CHUNK_OVERLAP_FRACTION=0.1
INGEST_CHUNKING_STRATEGY=paragraph # "paragraph", "semantic", "recursive", "docling", or "code"
INGEST_SEMANTIC_THRESHOLD_PERCENTILE=10
INGEST_CONTEXTUAL_CHUNKING=false
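A rough sketch of how the max-size and overlap settings interact (assumed sliding-window behavior, not the library's exact chunker):

```python
# With max_tokens=500 and overlap_fraction=0.1, consecutive chunks
# share about 50 tokens; the window advances by max_tokens * 0.9.

def window_chunks(tokens: list[str], max_tokens: int,
                  overlap_fraction: float) -> list[list[str]]:
    step = max(1, int(max_tokens * (1 - overlap_fraction)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return chunks

tokens = [f"t{i}" for i in range(1200)]
chunks = window_chunks(tokens, max_tokens=500, overlap_fraction=0.1)
print(len(chunks), len(chunks[0]))  # -> 3 500
```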
Search tuning:
INGEST_VECTOR_WEIGHT=0.4 # vector vs BM25 balance
INGEST_BM25_WEIGHT=0.6
INGEST_EXACT_MATCH_BOOST=0.003 # bonus for exact term matches
INGEST_DEFAULT_SEARCH_RESULTS=5
INGEST_RERANKER_MODEL= # cross-encoder model (empty = disabled)
INGEST_RERANKER_TOP_K=50
INGEST_RERANKER_WEIGHT=0.5
INGEST_PINNED_TOP_N=3 # max pinned documents in corpus results
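How the weights above combine is sketched below (assumed formula for illustration; the implementation may normalize scores differently):

```python
# Weighted hybrid scoring: blend normalized vector and BM25 scores,
# then add a small per-term bonus for exact matches.

def hybrid_score(vector_score: float, bm25_score: float, exact_matches: int,
                 vector_weight: float = 0.4, bm25_weight: float = 0.6,
                 exact_boost: float = 0.003) -> float:
    return (vector_weight * vector_score
            + bm25_weight * bm25_score
            + exact_boost * exact_matches)

s = hybrid_score(vector_score=0.82, bm25_score=0.55, exact_matches=2)
print(round(s, 4))  # -> 0.664
```

The defaults favor BM25 slightly (0.6 vs 0.4), and the exact-match boost is deliberately tiny so it only breaks ties between otherwise similar chunks.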
Search enhancements:
INGEST_AUTO_MERGE_ENABLED=false # merge child chunks into parent when 3+ match
INGEST_AUTO_MERGE_THRESHOLD=3
INGEST_AUTO_MERGE_PREFER_L1=false # prefer L1 (chapter) over L2 (section) merges
INGEST_HYDE_ENABLED=false # HyDE query expansion via LLM
INGEST_SUB_QUESTIONS_ENABLED=false # decompose complex queries into sub-questions
INGEST_MAX_SUB_QUESTIONS=3
INGEST_KG_RETRIEVAL_ENABLED=false # boost results via knowledge graph
INGEST_KG_BOOST_WEIGHT=0.01
INGEST_KG_MAX_HOPS=1
INGEST_KG_MIN_CONFIDENCE=0.5
INGEST_KG_SEED_TOP_N=10
Audio/video transcription:
INGEST_WHISPER_MODEL=base # faster-whisper model: tiny, base, small, medium, large-v3
Requires pip install ingestible[audio]. Video also requires ffmpeg on PATH.
Knowledge graph:
INGEST_EXTRACT_KNOWLEDGE_GRAPH=false # extract entity-relationship triples
Enabled by default for paper and documentation profiles regardless of this setting.
Content classification:
INGEST_CONTENT_CLASSIFICATION=false # classify blocks into tiers (T0-T3)
INGEST_PROFILE_LLM_FALLBACK=false # LLM fallback for ambiguous profile detection
Other:
INGEST_DATA_DIR=data # where documents are stored
INGEST_LLM_CONCURRENCY=10 # parallel LLM calls
INGEST_CLEANING_ENABLED=true
INGEST_DEDUP_ENABLED=true
INGEST_CHECKPOINT_ENABLED=true
INGEST_API_KEYS=key1,key2 # API auth (empty = no auth)
Production / deployment:
INGEST_CORS_ORIGINS=* # comma-separated allowed origins
INGEST_RATE_LIMIT_INGEST=10/minute # rate limit for ingestion endpoints
INGEST_RATE_LIMIT_SEARCH=60/minute # rate limit for search endpoints
INGEST_RATE_LIMIT_DEFAULT=120/minute # rate limit for other endpoints
INGEST_MAX_UPLOAD_BYTES=500000000 # max file upload size (500 MB)
INGEST_MAX_CONCURRENT_INGESTIONS=2 # background ingestion workers
INGEST_PARSE_TIMEOUT=300 # seconds before parsing is aborted
INGEST_LLM_TIMEOUT=120 # seconds per LLM API call
INGEST_LLM_MAX_RETRIES=3 # retries on transient LLM errors
INGEST_EMBEDDING_DEVICE=auto # auto | cuda | mps | cpu
INGEST_LOG_JSON=true # JSON structured logging
INGEST_LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR
INGEST_MIN_DISK_SPACE_MB=500 # readiness check threshold
INGEST_SEARCHER_CACHE_SIZE=50 # max cached document searchers in memory
Embedding providers:
INGEST_EMBEDDING_PROVIDER=local # local | openai | cohere | voyage
INGEST_OPENAI_EMBEDDING_API_KEY= # falls back to INGEST_OPENAI_API_KEY
INGEST_COHERE_API_KEY= # for Cohere Embed v3/v4
INGEST_VOYAGE_API_KEY= # for Voyage-3
When using API providers, set INGEST_EMBEDDING_MODEL to the provider-specific model name (e.g., text-embedding-3-large for OpenAI).
Access control:
INGEST_ACCESS_CONTROL_ENABLED=false # enable per-document access filtering
Tag documents during ingestion with --access-tags admin,engineering. Filter at search time with the X-Access-Tags: admin header. Empty tags = public (accessible to all).
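The access model described above can be sketched as a simple tag-overlap check (illustrative, not the library's actual filtering code):

```python
# Empty document tags = public; otherwise the caller needs at least
# one tag in common with the document.

def visible(doc_tags: set[str], caller_tags: set[str]) -> bool:
    return not doc_tags or bool(doc_tags & caller_tags)

assert visible(set(), {"admin"})                     # untagged = public
assert visible({"admin", "engineering"}, {"admin"})  # overlap -> visible
assert not visible({"engineering"}, {"admin"})       # no overlap -> hidden
```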
Audit trail:
INGEST_AUDIT_ENABLED=false # log every search to data/audit.jsonl
LLM-as-judge evaluation:
INGEST_LLM_JUDGE_ENABLED=false # enable LLM-as-judge eval endpoint
INGEST_LLM_JUDGE_CONCURRENCY=5 # parallel judge calls
Sparse retrieval:
INGEST_SPARSE_RETRIEVAL=bm25 # bm25 or splade
INGEST_SPLADE_MODEL=naver/splade-cocondenser-ensembledistil
SPLADE uses learned sparse vectors with term expansion. Requires pip install transformers torch.
No API keys are needed if you use --skip-enrichment. The embedding model (E5-large-v2) runs locally.
See .env.example for the complete list with comments.
CLI Commands
Ingest a document
ingest add /path/to/document.pdf -v
Flags:
- -v/--verbose — show progress for each stage
- --skip-enrichment — skip LLM enrichment (no API cost, still builds search indexes)
- --profile auto|paper|article|documentation|general|code — extraction profile (default: auto)
- --chunking paragraph|semantic|recursive|docling|code — chunking strategy override
- --force — force re-ingestion even if document is unchanged
- --no-checkpoint — disable pipeline checkpoint/resume
- --no-recursive — don't recurse into subdirectories (when ingesting a directory)
- --parallel N — process N files concurrently (batch ingestion)
- --access-tags admin,engineering — access control tags for this document
- --weight 1.5 — corpus search weight boost (0.0 = neutral)
- --pinned — pin document so it always appears in corpus search results
- --tags research,ml — metadata tags for filtering
- --data-dir /custom/path — override the data directory
What happens:
- Document is parsed (Docling for PDFs with PyMuPDF4LLM fallback; native parsers for other formats)
- Extraction profile is auto-detected (paper, article, documentation, or general)
- Document structure is detected (TOC tables, heading patterns, page range heuristics)
- Content is chunked into L0/L1/L2/L3 hierarchy
- LLM enrichment generates summaries, concepts, and questions (unless --skip-enrichment)
- Embeddings are generated and indexed (ChromaDB + BM25 + concept index)
- Everything is saved to data/documents/{doc_id}/
Cloud storage:
ingest add s3://my-bucket/docs/report.pdf -v
ingest add gs://my-bucket/docs/paper.pdf -v
ingest add az://my-container/docs/spec.pdf -v
Requires pip install ingestible[cloud].
First run downloads the E5-large-v2 model (~1.3GB). Subsequent runs use the cache.
Batch ingestion
# Ingest all supported files in a directory
ingest add /path/to/docs/ -v
# Process files concurrently
ingest add /path/to/docs/ --parallel 4
List ingested documents
ingest list
Shows all ingested documents with their IDs, titles, page counts, and ingestion dates.
Search within a document
ingest search <doc_id> "your query here"
Flags:
- -n N — number of results (default: 5)
- --hyde — enable HyDE query expansion (requires LLM)
- --auto-merge — auto-merge child chunks into parent when 3+ match
- --kg — boost results using knowledge graph relationships
Example:
ingest search so-ticke-ich-buch-f7fbdddde343 "Wie geht man mit Reizüberflutung um?" -n 10
Returns ranked results with scores, match sources (vector/bm25/concept), and content snippets.
Search across all documents
ingest corpus-search "your query here"
Searches all ingested documents simultaneously with cross-document RRF fusion. Results include the source document title.
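Reciprocal Rank Fusion (RRF) is a standard technique for merging per-document rankings; the sketch below shows the general idea (the library's exact constant may differ, k=60 is the conventional default from the RRF literature):

```python
# Each ranking contributes 1/(k + rank) per result; results that rank
# well in several documents' lists rise to the top of the fused list.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two documents' result lists, fused into one ranking:
print(rrf_fuse([["a", "b", "c"], ["b", "d", "a"]]))  # -> ['b', 'a', 'd', 'c']
```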
Flags:
- -n N — number of results (default: 10)
- --doc-ids ID1 ID2 ... — limit search to specific document IDs
- --tags research ml — only search documents with these metadata tags
- --hyde — enable HyDE query expansion
- --auto-merge — auto-merge child chunks into parent
- --kg — boost results using knowledge graph relationships
Re-enrich a document
ingest enrich <doc_id>
Re-runs enrichment (Stage 4) without re-parsing or re-chunking. Useful when you want to update summaries and concepts after changing LLM settings or prompt templates.
Selective re-enrichment: When re-ingesting with --force, unchanged chunks (same content hash) automatically skip LLM enrichment — their summaries, concepts, and questions are copied from the previous version. When all L3 chunks are unchanged, L2/L1/L0 enrichment is also skipped. This happens automatically; no flags needed.
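The content-hash reuse described above can be sketched like this (assumed mechanics for illustration; the cache keys and structure are hypothetical):

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Enrichment results from the previous version, keyed by content hash:
previous = {content_hash("old paragraph"): {"summary": "cached summary"}}

def enrich(text: str) -> dict:
    cached = previous.get(content_hash(text))
    if cached is not None:
        return cached  # unchanged chunk: reuse, no LLM call
    return {"summary": f"(would call LLM for: {text[:20]})"}

print(enrich("old paragraph"))  # cache hit, no API cost
print(enrich("new paragraph"))  # cache miss, would be enriched
```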
View audit trail
ingest audit # show last 20 entries
ingest audit -n 50 # show last 50
ingest audit --query "machine" # filter by query text
Requires INGEST_AUDIT_ENABLED=true. Logs are stored in data/audit.jsonl.
Export
ingest export <doc_id> --format jsonl -o output.jsonl
Supported formats: jsonl, parquet, llamaindex, langchain, complete
Watch a directory for changes
ingest watch /path/to/docs/ -v
Watches a directory for file changes and auto-ingests on create/modify. Uses OS-native filesystem events with a 2-second debounce window to batch rapid writes (e.g. git checkout, editor batch-saves).
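The debounce idea can be sketched as grouping events whose timestamps fall within the window (illustrative logic only, not the watcher's implementation):

```python
# Events closer than `window` seconds to the previous event join the
# same batch; a gap larger than the window starts a new batch.

def batch_events(events: list[tuple[float, str]],
                 window: float = 2.0) -> list[list[str]]:
    """events: (timestamp, path) pairs, sorted by timestamp."""
    batches: list[list[str]] = []
    last_time = None
    for ts, path in events:
        if last_time is None or ts - last_time > window:
            batches.append([])
        batches[-1].append(path)
        last_time = ts
    return batches

events = [(0.0, "a.md"), (0.5, "b.md"), (5.0, "c.md")]
print(batch_events(events))  # -> [['a.md', 'b.md'], ['c.md']]
```

This is why a git checkout that touches many files triggers one ingestion batch rather than one per file.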
Flags:
- -v/--verbose — show ingestion progress
- --skip-enrichment — skip LLM enrichment for auto-ingested documents
- --data-dir /custom/path — override the data directory
- --cv-push — auto-push to CognitiveVault after each batch completes
- --cv-url, --cv-api-key, --cv-vault-id — CognitiveVault connection (or use env vars)
Requires the watch extra:
pip install -e ".[watch]"
Example with CV auto-push:
export CV_URL=https://cv.example.com
export CV_API_KEY=cvk_...
export CV_VAULT_ID=vault-abc
ingest watch ./docs/ -v --cv-push
Export entire corpus
ingest corpus-export -o corpus.json
ingest corpus-export --doc-ids doc1 doc2 -o subset.json
ingest corpus-export --tags research -o research.json
Exports all documents (or a filtered subset) into a single JSON file.
Export to CognitiveVault
ingest export-cv [doc_id] --cv-url https://cv.example.com \
--cv-api-key cvk_... --cv-vault-id vault-abc
See CognitiveVault Integration for details.
Document versions
ingest versions <doc_id>
Shows version history when a document has been re-ingested.
Clean up stale files
ingest cleanup --max-age 24
Removes stale checkpoint directories, lock files, and temp uploads older than the specified hours. Also runs automatically on API startup.
Evaluate retrieval quality
# Auto-generate queries from enrichment hypothetical questions
ingest eval <doc_id>
# Use manual Q&A pairs
ingest eval <doc_id> --queries eval_queries.jsonl
# Compare hybrid vs vector-only vs BM25-only
ingest eval <doc_id> --compare
# Save detailed results
ingest eval <doc_id> -k 10 --output results.json
# LLM-as-judge evaluation (requires LLM API)
ingest eval <doc_id> --judge
Metrics: Hit Rate@K, MRR (Mean Reciprocal Rank), Precision@K, Recall@K. With --judge, an LLM also scores faithfulness, relevance, and completeness (0-1 scale).
Synthetic mode uses the hypothetical questions generated during enrichment as queries — each question should retrieve the chunk it came from (self-consistency check).
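The listed metrics follow their standard definitions; a minimal sketch over (expected_chunk, ranked_results) pairs:

```python
# Hit Rate@K: fraction of queries whose expected chunk appears in the
# top K. MRR: mean of 1/rank of the expected chunk (0 if absent).

def hit_rate_at_k(runs: list[tuple[str, list[str]]], k: int) -> float:
    hits = sum(1 for expected, ranked in runs if expected in ranked[:k])
    return hits / len(runs)

def mrr(runs: list[tuple[str, list[str]]]) -> float:
    total = 0.0
    for expected, ranked in runs:
        if expected in ranked:
            total += 1.0 / (ranked.index(expected) + 1)
    return total / len(runs)

runs = [("c1", ["c1", "c9"]), ("c2", ["c7", "c2"]), ("c3", ["c8", "c9"])]
print(hit_rate_at_k(runs, k=2))  # -> 2/3: c3 never retrieved
print(mrr(runs))                 # -> (1 + 0.5 + 0) / 3 = 0.5
```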
Start MCP server
ingest mcp
Exposes Ingestible as an MCP (Model Context Protocol) server. AI agents can ingest documents, search the corpus, and retrieve chunk details directly.
Tools: ingest_document, search, list_documents, get_document_overview, get_chunk, get_knowledge_graph, delete_document, eval_judge.
Requires: pip install ingestible[mcp]
Claude Desktop config:
{
"mcpServers": {
"ingestible": {
"command": "python",
"args": ["-m", "ingestible.mcp_server"]
}
}
}
Show configuration
ingest config
Displays the effective configuration from all sources (defaults, env vars, .env file).
Web UI
ingest serve
Starts the web server with document browsing, hierarchical chunk viewer, search, and file upload.
Extraction Profiles
Ingestible auto-detects the document type and applies a tailored extraction profile:
| Profile | Detects via | Extras |
|---|---|---|
| paper | Academic headings (Abstract, Methodology, References...) | Citation extraction, methodology, key findings |
| article | HTML/Markdown without academic signals | Executive summary |
| documentation | Code blocks, API/install headings, .rst/.adoc format | Code-aware chunking |
| code | Source code file extensions (.py, .ts, .go, .rs, etc.) | AST-based chunking via tree-sitter, function/class boundaries |
| general | Fallback | Standard enrichment |
Override with --profile <name>. For ambiguous documents, enable LLM-based classification with INGEST_PROFILE_LLM_FALLBACK=true.
Supported Formats
| Format | Extensions | Parser |
|---|---|---|
| PDF | .pdf | Docling (primary), PyMuPDF4LLM (fallback), OCR retry |
| Markdown | .md | Native |
| DOCX | .docx | python-docx + markdownify |
| HTML | .html, .htm | markdownify |
| EPUB | .epub | ebooklib |
| PowerPoint | .pptx | python-pptx |
| Excel | .xlsx, .xls | openpyxl |
| CSV | .csv | Native |
| reStructuredText | .rst | docutils |
| AsciiDoc | .adoc, .asciidoc | Native |
| Plain text | .txt | Native |
| Audio | .mp3, .wav, .m4a, .flac, .ogg, .aac, .opus | faster-whisper ASR (requires pip install ingestible[audio]) |
| Video | .mp4, .mkv, .avi, .mov, .webm, .wmv, .flv | ffmpeg + faster-whisper (requires ffmpeg on PATH) |
| Images | .png, .jpg, .jpeg, .tiff, .bmp, .webp | OCR + layout via Docling VLM |
| Email | .eml, .msg | Headers, body, attachments |
| XML | .xml | Structured to markdown |
| JSON | .json, .jsonl | Object/array rendering |
| ZIP | .zip | Auto-detects Notion, Confluence, or generic |
| Source code | .py, .ts, .js, .go, .rs, .java, .kt, .c, .cpp, .cs, .rb, .swift, .scala | tree-sitter AST parsing (requires pip install ingestible[code]) |
PDF figure extraction: figures are extracted with captions, labels, page numbers, and classifications when using the Docling parser.
Typical Workflow
# 1. Validate the pipeline without API cost
ingest add mybook.pdf -v --skip-enrichment
# 2. Check it worked
ingest list
ingest search <doc_id> "test query"
# 3. Re-ingest with enrichment for better search quality
ingest add mybook.pdf -v --force
# 4. Search across all documents
ingest corpus-search "my query"
Data Storage
All data lives under data/documents/{doc_id}/:
data/documents/{doc_id}/
meta.json # document metadata + L0 chunk
structure.json # hierarchy tree
chapters/ # L1/L2/L3 JSON files
chroma/ # ChromaDB vector index (persistent)
bm25_index.pkl # BM25 sparse index
concept_index.json # concept → chunk_ids mapping
knowledge_graph.json # entity-relationship triples (when KG extraction is enabled)
.checkpoint/ # pipeline checkpoint files (cleared on success)
archive/ # previous versions (when re-ingested)
To delete a document, remove its directory:
rm -rf data/documents/<doc_id>
Known Limitations
- ChromaDB requires Python 3.11-3.13 (incompatible with 3.14)
- PDFs with non-standard page dimension metadata may fail Docling parsing (PyMuPDF4LLM fallback handles these)
- Visual-only hierarchy (same font size for all headings) requires TOC table extraction or page-range heuristics
- Figure extraction only works with the Docling parser (not PyMuPDF4LLM fallback)