Documentation


Add a reliability layer to your RAG pipeline with deterministic deduplication

Quick start

1. Install

# Go install
go install github.com/Siddhant-K-code/distill@latest

# Or Docker
docker pull ghcr.io/siddhant-k-code/distill:latest

# Or download binary from GitHub Releases
# https://github.com/Siddhant-K-code/distill/releases

2. Start the API server

# Standalone (no vector DB needed)
export OPENAI_API_KEY=sk-...
distill api --port 8080

# With vector DB
distill serve --backend pinecone --index my-index --port 8080

3. Deduplicate

curl -X POST http://localhost:8080/v1/dedupe \
  -H "Content-Type: application/json" \
  -d '{
    "chunks": [
      {"text": "Deploy with Vercel for Next.js apps"},
      {"text": "Use Vercel to deploy your Next.js application"},
      {"text": "Configure environment variables in .env"}
    ],
    "threshold": 0.15,
    "lambda": 0.5
  }'

API reference

POST /v1/dedupe

Deduplicate text chunks. No vector DB required - embeddings are computed on the fly.

Request body

{
  "chunks": [               // Array of text chunks
    {"text": "..."},
    {"text": "...", "embedding": [0.1, ...]}  // Optional pre-computed
  ],
  "threshold": 0.15,        // Clustering threshold (default: 0.15)
  "lambda": 0.5             // MMR lambda (default: 0.5)
}

Response

{
  "unique_chunks": [
    {
      "text": "...",
      "cluster_id": 0,
      "score": 0.95
    }
  ],
  "stats": {
    "input_count": 12,
    "output_count": 8,
    "reduction_ratio": 0.33,
    "clusters_formed": 8,
    "latency_ms": 12
  }
}
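The stats in the sample response are internally consistent: with 12 input chunks and 8 unique chunks, the reduction ratio is the fraction of chunks removed. A minimal sketch of that arithmetic (illustrative only, not distill's source):

```python
def reduction_ratio(input_count: int, output_count: int) -> float:
    # fraction of input chunks removed by deduplication
    return (input_count - output_count) / input_count

# matches the sample response: 12 in, 8 out -> 0.33
print(round(reduction_ratio(12, 8), 2))
```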
POST /v1/retrieve

Query a vector DB and deduplicate results. Requires a configured backend (Pinecone or Qdrant).

Request body

{
  "query": string,           // Natural language query
  "query_embedding": float[], // Or provide embedding directly
  "target_k": number,        // Chunks to return (default: 8)
  "over_fetch_k": number,    // Chunks to retrieve (default: 50)
  "threshold": number,       // Clustering threshold (default: 0.15)
  "lambda": number,          // MMR lambda (default: 0.5)
  "namespace": string,       // Optional namespace
  "filter": object           // Optional metadata filter
}

Response

{
  "chunks": [
    {
      "id": "chunk_123",
      "text": "...",
      "score": 0.92,
      "cluster_id": 0,
      "metadata": {}
    }
  ],
  "stats": {
    "retrieved": 50,
    "clustered": 12,
    "returned": 8,
    "retrieval_latency_ms": 45,
    "clustering_latency_ms": 12,
    "total_latency_ms": 57
  }
}
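The interplay of target_k and over_fetch_k: the server fetches more candidates than it returns, so that after near-duplicates collapse there are still enough distinct chunks. A hedged sketch of that flow, with a hypothetical search callable standing in for the vector DB:

```python
def retrieve_and_dedupe(search, query, target_k=8, over_fetch_k=50, dedupe=None):
    # over-fetch candidates from the vector DB, collapse near-duplicates,
    # then keep only the top target_k survivors
    candidates = search(query, top_k=over_fetch_k)      # hypothetical backend call
    unique = dedupe(candidates) if dedupe else candidates
    return unique[:target_k]
```

Here dedupe is any callable that drops redundant chunks; the defaults mirror the request-body defaults above (target_k=8, over_fetch_k=50).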
GET /metrics

Prometheus metrics endpoint. Exposes request counts, latency histograms, reduction ratios, and cluster counts.

GET /health

Health check. Returns {"status": "ok"}.

How it works

Cache → Cluster → Select → Compress → MMR → Output

1. Cache lookup

Check for previously seen patterns. If a matching result exists, return it instantly - no recomputation needed.
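One way such a cache can work (an illustrative sketch, not distill's implementation): derive a deterministic key from the request payload, so identical inputs always hit the same entry.

```python
import hashlib
import json

def cache_key(chunks, threshold, lam):
    # deterministic key over the request payload: same inputs -> same key
    payload = json.dumps(
        {"chunks": chunks, "threshold": threshold, "lambda": lam},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

cache = {}

def dedupe_with_cache(chunks, threshold, lam, compute):
    key = cache_key(chunks, threshold, lam)
    if key in cache:
        return cache[key]          # cache hit: no recomputation
    result = compute(chunks, threshold, lam)
    cache[key] = result
    return result
```

Because the key covers the threshold and lambda as well as the chunks, changing any parameter correctly bypasses stale entries.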

2. Cluster

Group semantically similar chunks using agglomerative clustering with average linkage. Chunks that say the same thing end up in the same cluster. No need to specify K.
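The idea can be sketched in a few lines of Python (a naive O(n^3) illustration, not distill's optimized code): keep merging the two clusters with the smallest average pairwise cosine distance until no pair falls under the threshold. The threshold replaces K as the stopping criterion.

```python
import numpy as np

def cosine_dist(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_linkage_cluster(vectors, threshold):
    # each chunk starts in its own cluster
    clusters = [[i] for i in range(len(vectors))]

    def avg_dist(ci, cj):
        # average pairwise distance between two clusters (average linkage)
        return float(np.mean([cosine_dist(vectors[i], vectors[j])
                              for i in ci for j in cj]))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        if avg_dist(clusters[i], clusters[j]) > threshold:
            break                      # nothing left that is similar enough
        clusters[i] += clusters[j]     # merge the closest pair
        del clusters[j]
    return clusters
```

With the default threshold of 0.15, two paraphrases of the same sentence (cosine distance near 0) merge, while unrelated chunks stay apart.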

3. Select

Pick the best representative from each cluster based on relevance score, centroid proximity, or a hybrid of both.
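A hedged sketch of those three strategies (the function name, alpha weight, and strategy labels here are illustrative, not distill's API):

```python
import numpy as np

def pick_representative(cluster_vecs, cluster_scores, strategy="hybrid", alpha=0.5):
    # returns the index of the chunk that best represents the cluster
    centroid = np.mean(cluster_vecs, axis=0)

    def centroid_sim(v):
        return float(np.dot(v, centroid) /
                     (np.linalg.norm(v) * np.linalg.norm(centroid)))

    sims = [centroid_sim(v) for v in cluster_vecs]
    if strategy == "score":        # highest relevance score wins
        ranking = list(cluster_scores)
    elif strategy == "centroid":   # most "typical" member wins
        ranking = sims
    else:                          # hybrid: weighted blend of both
        ranking = [alpha * s + (1 - alpha) * c
                   for s, c in zip(cluster_scores, sims)]
    return int(np.argmax(ranking))
```

"score" favors the chunk the retriever ranked highest; "centroid" favors the chunk closest to the cluster's semantic center; the hybrid trades the two off.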

4. Compress

Remove low-information content from selected chunks - stopwords, filler, and noise. Preserves meaning while reducing token count.
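A toy illustration of the idea, assuming a simple stopword list (distill's actual compression is presumably more sophisticated):

```python
# small illustrative stopword list; a real one would be much larger
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "that", "very"}

def compress(text: str) -> str:
    # drop stopwords, keep content words in their original order and casing
    kept = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)
```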

5. MMR re-rank

Apply Maximal Marginal Relevance to balance relevance and diversity. λ = 1.0 for pure relevance, λ = 0.0 for pure diversity.
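The standard MMR formulation greedily picks the candidate maximizing λ·relevance − (1−λ)·redundancy, where redundancy is the maximum similarity to anything already selected. A self-contained sketch:

```python
import numpy as np

def mmr(query_vec, doc_vecs, lam=0.5, k=8):
    # greedy MMR selection: balance relevance to the query against
    # similarity to chunks already selected
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ = 0.5, a near-duplicate of an already-selected chunk scores poorly even if it is highly relevant, so a less redundant chunk gets picked instead.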

Configuration

Generate a config file with distill config init, then validate with distill config validate.

# distill.yaml
server:
  port: 8080
  host: "0.0.0.0"

embedding:
  provider: "openai"
  model: "text-embedding-3-small"

dedup:
  threshold: 0.15
  lambda: 0.5
  selection_strategy: "score"

retriever:
  backend: "pinecone"
  index: "my-index"

auth:
  api_keys:
    - "${DISTILL_API_KEY}"

telemetry:
  tracing:
    enabled: true
    exporter: "otlp"
    endpoint: "localhost:4317"
    sampling_rate: 1.0

Environment variables are interpolated with ${VAR:-default} syntax.
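That syntax follows the shell convention: use the variable's value if set, otherwise fall back to the default after ":-". A minimal sketch of such an interpolator (illustrative, not distill's parser):

```python
import os
import re

def interpolate(value: str) -> str:
    # expand ${VAR} and ${VAR:-default} using the process environment
    def repl(match):
        var, default = match.group(1), match.group(2) or ""
        return os.environ.get(var, default)
    return re.sub(r"\$\{(\w+)(?::-([^}]*))?\}", repl, value)
```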

Observability

Prometheus metrics

Scrape /metrics for request rates, latency distributions, reduction ratios, and cluster counts.

distill_requests_total
distill_request_duration_seconds
distill_chunks_processed
distill_reduction_ratio
distill_active_requests
distill_clusters_formed

A Grafana dashboard template is included in the repo.

OpenTelemetry tracing

Distributed traces for every pipeline stage. Export to Jaeger, Grafana Tempo, or any OTLP-compatible collector.

Spans:
  distill.request
  ├── distill.embedding
  ├── distill.clustering
  ├── distill.selection
  └── distill.mmr

W3C Trace Context propagation. Configurable sampling rate.

Integrations

MCP (Claude, Cursor)

Run as an MCP server for AI assistants. Works with Claude Desktop, Cursor, and any MCP-compatible client.

distill mcp
distill mcp --backend pinecone --index my-index

Pinecone

Connect your Pinecone index for retrieval with deduplication.

distill serve --backend pinecone --index my-index

Qdrant

Works with self-hosted or Qdrant Cloud instances.

distill serve --backend qdrant --collection my-collection

Docker

Pre-built images for every release. Multi-arch (amd64/arm64).

docker run -p 8080:8080 \
  -e OPENAI_API_KEY=sk-... \
  ghcr.io/siddhant-k-code/distill:latest

CLI commands

Command                  Description
distill api              Standalone API server (no vector DB)
distill serve            Server with vector DB connection
distill mcp              MCP server for AI assistants
distill query            Query vector DB from command line
distill sync             Bulk upload with deduplication
distill analyze          Analyze vectors for duplicates
distill config init      Generate distill.yaml template
distill config validate  Validate config file

Learn more

The Engineering Guide to Context Window Efficiency

Deep dive into the algorithms behind deterministic deduplication and LLM reliability.