Persistent memory, semantic dedup, and context compression for LLM agents.
# Go install
go install github.com/Siddhant-K-code/distill@latest
# Docker
docker pull ghcr.io/siddhant-k-code/distill:latest
# Binary: download from GitHub Releases
# https://github.com/Siddhant-K-code/distill/releases

# Standalone (no vector DB needed)
export OPENAI_API_KEY=sk-...
distill api --port 8080
# With memory + sessions
distill api --port 8080 --memory --session
# With vector DB
distill serve --backend pinecone --index my-index --port 8080

curl -X POST http://localhost:8080/v1/dedupe \
-H "Content-Type: application/json" \
-d '{
"chunks": [
{"id": "1", "text": "React is a JavaScript library for building UIs."},
{"id": "2", "text": "React.js is a JS library for building user interfaces."},
{"id": "3", "text": "Vue is a progressive framework for building UIs."}
]
}'

Response: chunks 1 and 2 are clustered together; only the best representative is returned. Chunk 3 is a separate cluster.
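
To make the clustering behavior concrete, here is a minimal sketch of greedy threshold clustering on cosine distance. It is an illustration under stated assumptions, not Distill's actual algorithm: the function names, the toy 2-D vectors, and the choice of the first cluster member as representative are all hypothetical. The 0.15 threshold mirrors the `dedup.threshold` default shown in distill.yaml.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity. Assumes non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def greedy_cluster(chunks, threshold=0.15):
    """Put each chunk in the first cluster whose seed is within `threshold`.

    chunks: list of (id, embedding) pairs. Hypothetical helper, not Distill's API.
    Returns a list of clusters; each cluster is a list of ids, seed first.
    """
    clusters = []  # each item: (seed_embedding, [ids])
    for chunk_id, emb in chunks:
        for seed, ids in clusters:
            if cosine_distance(seed, emb) <= threshold:
                ids.append(chunk_id)
                break
        else:
            # No existing cluster is close enough, so this chunk seeds a new one.
            clusters.append((emb, [chunk_id]))
    return [ids for _, ids in clusters]

# Toy embeddings: chunks 1 and 2 point in nearly the same direction, chunk 3 does not.
toy = [("1", [1.0, 0.0]), ("2", [0.98, 0.2]), ("3", [0.0, 1.0])]
clusters = greedy_cluster(toy)
representatives = [ids[0] for ids in clusters]  # assumed rule: seed represents the cluster
```

With these toy vectors, chunks 1 and 2 land in one cluster and chunk 3 in its own, which matches the shape of the response described above. How Distill picks the "best" representative is not specified here, so the seed-first rule is only a stand-in.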
The repo ships a benchmark suite covering cluster, MMR, selector, and compress. Run it against your hardware to validate the ~12ms claim on your data size.
git clone https://github.com/Siddhant-K-code/distill
cd distill
go test -bench=. -benchmem ./...
# Example output:
# BenchmarkCluster/50-chunks 1000 1.2ms/op 480KB alloc
# BenchmarkMMR/50-chunks 1000 0.3ms/op 96KB alloc
# BenchmarkCompress/50-chunks 1000 0.8ms/op 240KB alloc

# Start MCP server with memory + sessions
distill mcp --memory --session
# claude_desktop_config.json
{
"mcpServers": {
"distill": {
"command": "/path/to/distill",
"args": ["mcp", "--memory", "--session"],
"env": { "OPENAI_API_KEY": "sk-..." }
}
}
}

Interactive docs at distill-api-4u92.onrender.com/docs. Full OpenAPI 3.1 spec at /openapi.yaml.

| Method | Path | Description |
|---|---|---|
| POST | /v1/dedupe | Deduplicate chunks |
| POST | /v1/dedupe/stream | SSE streaming dedup with per-stage progress |
| POST | /v1/pipeline | Full pipeline: cache → cluster → select → compress → MMR → summarize |
| POST | /v1/batch | Submit async batch job |
| GET | /v1/batch/{id} | Poll batch job status and progress |
| GET | /v1/batch/{id}/results | Retrieve completed batch results |
| POST | /v1/retrieve | Query vector DB with dedup (requires backend) |
| POST | /v1/memory/store | Store memories with write-time dedup + sensitivity tagging |
| POST | /v1/memory/recall | Recall memories by relevance + recency |
| POST | /v1/memory/forget | Remove memories by ID, tag, or age |
| POST | /v1/memory/expire | Soft-delete memories without removing them |
| POST | /v1/memory/supersede | Replace a memory with a newer version |
| GET | /v1/memory/stats | Memory store statistics |
| POST | /v1/session/create | Create a session with token budget |
| POST | /v1/session/push | Push entries with dedup + budget enforcement |
| POST | /v1/session/context | Read current context window |
| POST | /v1/session/delete | Delete a session |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |
| GET | /docs | Swagger UI (interactive) |
| GET | /openapi.yaml | OpenAPI 3.1 spec |
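
For callers that prefer code over curl, the following is a small sketch of building a request for `POST /v1/dedupe` with only the Python standard library. The helper name `build_dedupe_request` is hypothetical, and the sketch stops short of parsing the response because its exact fields are not listed in this section.

```python
import json
import urllib.request

def build_dedupe_request(base_url, chunks):
    """Build (but do not send) a JSON POST request for /v1/dedupe."""
    body = json.dumps({"chunks": chunks}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/dedupe",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_dedupe_request(
    "http://localhost:8080",
    [{"id": "1", "text": "React is a JavaScript library for building UIs."}],
)

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```

Separating request construction from sending keeps the payload easy to inspect or unit-test without a running server.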
Six stages run in sequence. Each stage is independently toggleable. The summarize stage is off by default — enable it for long conversation threads.
POST /v1/pipeline
{
"chunks": [{"id": "1", "text": "..."}],
"options": {
"dedup": {"enabled": true, "threshold": 0.15},
"compress": {"enabled": true, "target_reduction": 0.5},
"summarize": {"enabled": true, "max_tokens": 4000, "levels": 3}
}
}

When enabled, the summarize stage collapses long conversation threads into progressively shorter representations. Level 1 is a full extractive summary. Level 2 is a paragraph. Level 3 is a single sentence. The level used depends on how far the entry has aged in the session window: older entries get shorter representations.

# Summarize a long conversation thread
curl -X POST localhost:8080/v1/pipeline -d '{
"chunks": [...],
"options": {
"dedup": {"enabled": true},
"compress": {"enabled": true},
"summarize": {"enabled": true, "max_tokens": 4000, "levels": 3}
}
}'
# Response includes "summary_level" per chunk: 0=full, 1=paragraph, 2=sentence

# Submit
curl -X POST localhost:8080/v1/batch -d '{"chunks":[...],"options":{...}}'
# → {"job_id":"batch_1234","status":"queued"}
# Poll
curl localhost:8080/v1/batch/batch_1234
# → {"status":"processing","progress":0.45}
# Results
curl localhost:8080/v1/batch/batch_1234/results
# → {"chunks":[...],"stats":{...}}

Persistent memory that accumulates knowledge across agent sessions. Enable with --memory on api or mcp commands.

curl -X POST localhost:8080/v1/memory/store -d '{
"entries": [{
"text": "Auth uses JWT with RS256 signing",
"source": "code_review",
"tags": ["auth", "security"],
"auto_classify": true,
"expires_at": "2026-12-01T00:00:00Z"
}]
}'
# Response includes "conflicts" array if similar entries exist

curl -X POST localhost:8080/v1/memory/recall -d '{
"query": "how does authentication work?",
"max_results": 5,
"boost_tags": ["auth"],
"min_relevance": 0.3,
"task_context": "fixing login bugs"
}'

# Soft-delete (excluded from recall, kept for audit)
curl -X POST localhost:8080/v1/memory/expire -d '{"ids": ["abc123"]}'
# Replace with newer version
curl -X POST localhost:8080/v1/memory/supersede -d '{
"old_id": "abc123", "new_id": "ghi789"
}'

Full text → Summary (~20%) → Keywords (~5%) → Evicted
(24h)       (7 days)         (30 days)
# Accessing a memory resets its decay clock.

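
The decay schedule above can be sketched as a lookup from a memory's age to its current representation. This is one plausible reading, not the confirmed implementation: it assumes each duration (24h, 7 days, 30 days) is the age at which a memory leaves the stage listed above it, and the names `STAGES` and `decay_stage` are hypothetical. Age is measured from the last access, which models the rule that accessing a memory resets its decay clock.

```python
from datetime import datetime, timedelta, timezone

# Assumed reading of the diagram: a memory stays in a stage until it reaches
# the age limit paired with that stage. The real schedule may differ.
STAGES = [
    (timedelta(hours=24), "full_text"),
    (timedelta(days=7), "summary"),
    (timedelta(days=30), "keywords"),
]

def decay_stage(last_accessed, now):
    """Return the representation a memory would have at time `now`."""
    age = now - last_accessed  # measured from last access, so a read resets decay
    for limit, stage in STAGES:
        if age < limit:
            return stage
    return "evicted"

now = datetime(2026, 1, 31, tzinfo=timezone.utc)
fresh = decay_stage(now - timedelta(hours=2), now)
three_days = decay_stage(now - timedelta(days=3), now)
ten_days = decay_stage(now - timedelta(days=10), now)
stale = decay_stage(now - timedelta(days=45), now)
```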
| Level | Value | Detected patterns |
|---|---|---|
| None | 0 | No sensitive content |
| PII | 1 | Email, phone, credit card, SSN |
| Internal | 2 | Internal domains, pricing, roadmaps |
| Credentials | 3 | API keys, tokens, passwords, AWS/OpenAI/GitHub secrets |
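
As a rough illustration of how pattern-based sensitivity tagging could map text to the levels in this table, consider the sketch below. The regular expressions are simplified, hypothetical examples chosen for readability; Distill's real detectors are not shown in this section and are likely more thorough.

```python
import re

# (level, pattern) pairs. Hypothetical patterns for illustration only.
PATTERNS = [
    # Credentials: key-like prefixes (OpenAI "sk-", AWS "AKIA", GitHub "ghp_").
    (3, re.compile(r"\b(sk-[A-Za-z0-9]{8,}|AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{20,})\b")),
    (3, re.compile(r"\bpassword\s*[:=]", re.IGNORECASE)),
    # Internal: a few keywords and an ".internal" domain suffix.
    (2, re.compile(r"\b(roadmap|pricing)\b|\.internal\b", re.IGNORECASE)),
    # PII: email address and US SSN shape.
    (1, re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
    (1, re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
]

def classify(text):
    """Return the highest sensitivity level (0-3) whose pattern matches."""
    return max((level for level, rx in PATTERNS if rx.search(text)), default=0)

levels = {
    "none": classify("Auth uses JWT with RS256 signing"),
    "pii": classify("Contact alice@example.com for access"),
    "internal": classify("See the Q3 roadmap"),
    "credentials": classify("export OPENAI_API_KEY=sk-abcdefgh12345678"),
}
```

Taking the maximum matching level means a text containing both an email address and an API key is tagged as Credentials, the stricter of the two.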
Token-budgeted context windows for long-running agent tasks. Enable with --session.
# Create session
curl -X POST localhost:8080/v1/session/create -d '{
"session_id": "task-42", "max_tokens": 128000
}'
# Push entries
curl -X POST localhost:8080/v1/session/push -d '{
"session_id": "task-42",
"entries": [
{"role": "tool", "content": "...", "source": "file_read", "importance": 0.8}
]
}'
# Read context window
curl -X POST localhost:8080/v1/session/context -d '{"session_id": "task-42"}'

When a push exceeds the token budget: oldest entries outside the preserve_recent window are compressed through levels (Full → Summary → Sentence → Keywords), then evicted. Lowest-importance entries go first.

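
The budget rule described above can be sketched as a loop that steps one entry at a time down the compression levels and evicts it once it cannot shrink further. This is a simplified model, not Distill's code: the `Entry` type, the per-level size percentages, and the interpretation of `preserve_recent` as a count of most-recent entries are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace

LEVELS = ["full", "summary", "sentence", "keywords"]
# Assumed size of each level as a percentage of the full text (illustrative).
PERCENT = {"full": 100, "summary": 20, "sentence": 10, "keywords": 5}

@dataclass(frozen=True)
class Entry:
    full_tokens: int   # token count of the uncompressed content
    importance: float
    level: str = "full"

    @property
    def tokens(self):
        return max(1, self.full_tokens * PERCENT[self.level] // 100)

def enforce_budget(entries, max_tokens, preserve_recent=2):
    """Compress, then evict, entries until the total fits max_tokens.

    `entries` is ordered oldest first. The last `preserve_recent` entries are
    never touched. Among the rest, the lowest-importance entry goes first,
    with the oldest winning ties.
    """
    entries = list(entries)
    while sum(e.tokens for e in entries) > max_tokens:
        cutoff = max(len(entries) - preserve_recent, 0)
        if cutoff == 0:
            break  # only protected entries remain; the budget cannot be met
        idx = min(range(cutoff), key=lambda i: (entries[i].importance, i))
        pos = LEVELS.index(entries[idx].level)
        if pos + 1 < len(LEVELS):
            entries[idx] = replace(entries[idx], level=LEVELS[pos + 1])
        else:
            del entries[idx]  # already at keywords, so evict
    return entries

window = [Entry(1000, 0.2), Entry(1000, 0.8), Entry(500, 0.5), Entry(500, 0.9)]
compressed = enforce_budget(window, max_tokens=2300)
evicted = enforce_budget(window, max_tokens=2000)
```

In the first call, compressing the least important old entry to a summary is enough to fit the budget. In the second, that entry is stepped through every level and finally evicted, while the two most recent entries stay untouched.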
Distill supports OpenAI, Ollama, and Cohere. Use --embedding-provider to switch.
# OpenAI (default)
distill api --embedding-provider openai
# Ollama — fully local, no API key needed
distill api --embedding-provider ollama \
--embedding-base-url http://localhost:11434
# Cohere
distill api --embedding-provider cohere
# requires COHERE_API_KEY
# Skip embeddings entirely — send pre-computed vectors
curl -X POST localhost:8080/v1/dedupe -d '{
"chunks": [
{"id": "1", "text": "...", "embedding": [0.1, 0.2, ...]}
]
}'

| Command | Description |
|---|---|
| distill api | Start standalone API server |
| distill serve | Start server with vector DB connection |
| distill pipeline | Run full pipeline (cache → cluster → select → compress → MMR → summarize) |
| distill mcp | Start MCP server for AI assistants |
| distill memory | Store, recall, and manage persistent memories |
| distill session | Manage token-budgeted context windows |
| distill analyze | Analyze a file for duplicates |
| distill sync | Upload vectors to Pinecone with dedup |
| distill query | Test a query from command line |
| distill config init | Generate distill.yaml template |
| distill config validate | Validate config file |
| distill completion | Shell completions (bash/zsh/fish/PowerShell) |
# distill.yaml
server:
port: 8080
embedding:
provider: openai # openai | ollama | cohere
model: text-embedding-3-small
dedup:
threshold: 0.15 # cosine distance (lower = stricter)
lambda: 0.5 # MMR balance: 1.0=relevance, 0.0=diversity
enable_mmr: true
memory:
db_path: distill-memory.db
dedup_threshold: 0.15
conflict_threshold: 0.35
session:
db_path: distill-sessions.db
max_tokens: 128000
auth:
api_keys:
- ${DISTILL_API_KEY}

Priority: CLI flags > environment variables > config file > defaults.
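
The precedence order can be pictured as a layered lookup in which the first layer that defines a key wins. The sketch below uses Python's `collections.ChainMap` to show the idea; the function name `resolve_config` and the flat key names are hypothetical and do not reflect how Distill loads its configuration internally.

```python
from collections import ChainMap

def resolve_config(cli_flags, env_vars, file_values, defaults):
    """Merge four config layers; earlier layers win on key conflicts."""
    # Drop unset (None) values so an absent CLI flag does not mask lower layers.
    layers = [
        {k: v for k, v in layer.items() if v is not None}
        for layer in (cli_flags, env_vars, file_values, defaults)
    ]
    return dict(ChainMap(*layers))

settings = resolve_config(
    cli_flags={"port": 9090, "embedding_provider": None},
    env_vars={"embedding_provider": "ollama"},
    file_values={"port": 8080, "embedding_provider": "openai", "threshold": 0.15},
    defaults={"port": 8080, "threshold": 0.15, "enable_mmr": True},
)
```

Here the CLI flag sets the port, the environment variable overrides the provider from the config file, and the remaining keys fall through to the file and the defaults.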
Metrics at /metrics. Grafana dashboard template in grafana/.
distill_requests_total       # request count by endpoint + status
distill_latency_seconds      # latency histogram per stage
distill_chunks_in_total      # input chunk count
distill_chunks_out_total     # output chunk count (after dedup)
distill_reduction_ratio      # chunk reduction ratio histogram
distill_cache_hit_rate       # per-call-site cache hit rate
distill_cache_write_cost     # cache write cost accounting
distill_cache_ttl_evictions  # TTL-aware evictions during batch runs

# OTLP export
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
distill api
# Stdout (local dev)
distill api --otel-stdout