v0.9.1: OpenAPI 3.1 spec + interactive docs shipped (release notes)
v0.9.1 · MIT · open source

Your agent remembers
what matters.

Persistent memory across sessions. Semantic dedup. Context compression. ~12ms. No LLM calls.

Other tools compress what goes into your agent. Distill controls what your agent remembers: across sessions, without conflicts, ranked by what matters now.

[Pipeline diagram: Sources (RAG chunks, tool outputs, memory, docs, conversation) → Distill pipeline (Cache → Cluster → Select → Compress → MMR) → LLM. ~12ms, no LLM calls, deterministic; 30-40% fewer tokens; same input, same output; full audit trail.]
/Problem

Redundant context breaks reliability.

Context deduplication is not an optimization. It is a correctness measure.

Non-deterministic outputs
Same prompt, different answers. Redundant context confuses the model about which version of a fact is current.
30-40% wasted tokens
RAG retrieval returns overlapping chunks. The model processes all of them, paying for the same information multiple times.
Production failures at scale
Context bloat triggers compaction. Compaction is lossy. Critical constraints and rejected approaches disappear silently.
Agents start from zero
Every new session re-learns the same constraints, preferences, and facts. Nothing carries over.
/Pipeline

Math, not magic. Fully deterministic.

No LLM calls. Same input, same output. Every decision is auditable.

01
Cache
Annotate stable prefixes. Sub-ms retrieval for known patterns. TTL-aware eviction for batch workloads.
02
Cluster
Agglomerative clustering groups semantically similar chunks.
03
Select
Picks the best representative from each cluster, by centroid proximity or hybrid scoring.
04
Compress
Extractive compression removes noise, preserves signal.
05
MMR
Maximal Marginal Relevance re-ranks for relevance + diversity.
06
Summarize
Hierarchical multi-level summarization collapses long conversations into progressively shorter representations without losing structure.
~12ms total overhead · O(n²) distance matrix · 50 chunks in <2ms · POST /v1/pipeline chains all six stages
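The cluster, select, and MMR stages above can be sketched in a few lines. This is an illustrative Python toy, not Distill's actual implementation: the greedy single-linkage clustering, the 0.15 merge threshold, and the λ=0.7 relevance/diversity weighting are simplifying assumptions for demonstration.

```python
import numpy as np

def cosine_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cluster(embs, threshold=0.15):
    """Greedy single-linkage clustering: a chunk joins the first
    cluster whose seed member is within `threshold` cosine distance."""
    clusters = []  # list of lists of chunk indices
    for i, e in enumerate(embs):
        for c in clusters:
            if cosine_dist(e, embs[c[0]]) < threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def select_representatives(embs, clusters):
    """Keep one chunk per cluster: the member closest to the centroid."""
    reps = []
    for c in clusters:
        centroid = np.mean([embs[i] for i in c], axis=0)
        reps.append(min(c, key=lambda i: cosine_dist(embs[i], centroid)))
    return reps

def mmr(query, embs, candidates, k, lam=0.7):
    """Maximal Marginal Relevance: greedily pick items that are relevant
    to the query but dissimilar to what was already selected."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(i):
            relevance = 1.0 - cosine_dist(query, embs[i])
            redundancy = max((1.0 - cosine_dist(embs[i], embs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

Two near-duplicate chunks collapse into one cluster, one representative survives, and MMR then orders representatives by relevance while penalizing overlap with earlier picks.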
/Memory

Persistent memory across sessions.

Store once. Recall by relevance. Forget what's outdated. The model does not know what matters to you. Distill does.

# Enable memory + sessions
distill mcp --memory --session

# Store a memory
distill memory store \
  --text "Auth uses JWT with RS256" \
  --tags auth,security \
  --source code_review \
  --auto-classify

# Recall by relevance
distill memory recall \
  --query "how does auth work?" \
  --boost-tags auth \
  --min-relevance 0.3
Write-time dedup
Entries within cosine distance 0.15 are merged, not duplicated.
Conflict detection
Entries between 0.15 and 0.35 cosine distance are flagged as conflicts. Caller resolves via supersede.
Sensitivity tagging
Pattern-based PII, credential, and internal reference detection. <1ms. No LLM calls.
Expiry & supersession
Soft-delete without losing audit trail. Replace outdated memories, preserve history.
Task-relevance ranking
Boost by tags, task context, and min relevance filter on recall.
Hierarchical decay
Full text → Summary → Keywords → Evicted. Access resets the decay clock.
MCP tools: store_memory · recall_memory · forget_memory · memory_expire · memory_supersede · memory_stats
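The write-time dedup and conflict thresholds above reduce to a nearest-neighbor check at store time. A minimal sketch, assuming plain embedding vectors (illustrative Python; `classify_write` and its helper are hypothetical names, not Distill's storage layer):

```python
import numpy as np

DUP_THRESHOLD = 0.15       # below this: merge as a duplicate
CONFLICT_THRESHOLD = 0.35  # between the two: flag as a conflict

def cosine_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify_write(new_emb, existing):
    """Classify a new memory against stored embeddings.
    Returns ('duplicate' | 'conflict' | 'new', nearest index or None)."""
    if not existing:
        return "new", None
    nearest = min(range(len(existing)),
                  key=lambda i: cosine_dist(new_emb, existing[i]))
    d = cosine_dist(new_emb, existing[nearest])
    if d < DUP_THRESHOLD:
        return "duplicate", nearest   # merge, don't store twice
    if d < CONFLICT_THRESHOLD:
        return "conflict", nearest    # caller resolves via supersede
    return "new", None
```

Duplicates are merged silently; conflicts are surfaced with the index of the entry they contradict, so the caller can decide which version supersedes the other.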
/Sessions

Token-budgeted context windows.

Push context incrementally as the agent works. Distill deduplicates entries, compresses aging ones, and evicts when the budget is exceeded.

128K token budget
Default max_tokens. Configurable per session.
Auto-compress
Full text → Summary → Sentence → Keywords as entries age.
Importance eviction
Lowest-importance entries compressed and evicted first.
MCP tools: create_session · push_session · session_context · delete_session
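The importance-based eviction rule can be sketched roughly like this (a simplified Python illustration; the real pipeline compresses aging entries before evicting, which this sketch skips):

```python
def enforce_budget(entries, max_tokens=128_000):
    """Evict lowest-importance entries until the session fits the budget.
    entries: list of dicts with 'tokens' and 'importance' keys."""
    total = sum(e["tokens"] for e in entries)
    kept = sorted(entries, key=lambda e: e["importance"], reverse=True)
    while total > max_tokens and kept:
        evicted = kept.pop()          # lowest importance goes first
        total -= evicted["tokens"]
    return kept, total
```

When the running total exceeds the budget, the least important entries are dropped until the session fits again; everything the agent marked as important stays resident.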
/Code Intelligence

Understand blast radius before you merge.

For engineering teams running automated code-mod rollouts, CVE patching, or any workflow where the same change pattern has caused problems before.

Code Change Impact Graph
Map structural dependencies across your codebase. Query which files are affected by a change before it lands. Blast-radius analysis without running the build.
Semantic Commit Analysis
Find similar past changes. Score risk based on historical patterns. Surface commits that caused incidents so the same mistake isn't repeated.
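Blast-radius analysis of this kind reduces to a reachability query over the dependency graph. A minimal sketch, assuming a precomputed reverse-dependency map (illustrative Python; `blast_radius` is a hypothetical name, not Distill's API):

```python
from collections import deque

def blast_radius(reverse_deps, changed):
    """Files transitively affected by a change: BFS over the
    reverse-dependency graph (file -> files that depend on it)."""
    affected, queue = set(changed), deque(changed)
    while queue:
        f = queue.popleft()
        for dependent in reverse_deps.get(f, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected - set(changed)
```

Because the graph is built from static structure, the query answers "what could this change break?" without compiling or running anything.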
/Observability

Production observability included.

Prometheus metrics
  - Request counts + latency histograms
  - Chunk reduction ratios
  - Cluster counts per request
  - Per-call-site cache hit rate
  - Cache write cost accounting
  - TTL-aware cache eviction events
OpenTelemetry tracing
  - Per-stage spans (embed, cluster, select, MMR)
  - W3C Trace Context propagation
  - OTLP export (Jaeger, Tempo)
  - Stdout exporter for local dev
Structured logging
  - log/slog with JSON + text output
  - Request ID + trace ID on every log
  - Debug / info / warn / error levels
  - Grafana dashboard template included
/Integrations

Fits your existing stack.

HTTP API, MCP server, or CLI. Run fully local with Ollama. No API key, no external calls.

OpenAI
Anthropic
Ollama
Cohere
Claude (MCP)
Cursor (MCP)
LangChain
LlamaIndex
Pinecone
Qdrant
Prometheus
Grafana
OpenTelemetry
Docker
Fly.io
Render

Python SDK (LangChain + LlamaIndex native) in progress (#5).

/Comparison

Why not use an LLM for compression?

                   Distill                   LLM compression
Latency            ~12ms                     ~500ms
Cost per call      $0.0001                   $0.01+
Deterministic      Yes, always               No, varies per run
Auditable          Full cluster log          Black box
Requires API key   Optional (Ollama works)   Always
Batch support      Async job queue           Sequential
/FAQ

Common questions.

Is this removing exact duplicates?
No. Exact dedup is trivial (hash comparison). Distill does semantic dedup. It identifies chunks that convey the same information in different words. Two paragraphs explaining JWT auth with different wording will be clustered together, and only the best one is kept.
What is Context Memory?
Persistent memory that accumulates knowledge across agent sessions. Store context once, recall it later by semantic similarity and recency. Memories are deduplicated on write, compressed over time through hierarchical decay, and automatically classified for sensitivity.
How is Memory different from Sessions?
Memory is cross-session: knowledge persists after a session ends and can be recalled in future sessions. Sessions are within-task: a bounded context window that tracks what the agent has seen during a single task, enforcing a token budget.
How does conflict detection work?
When storing a memory, Distill checks existing entries by cosine distance. Entries below 0.15 are duplicates and are skipped. Entries between 0.15 and 0.35 are flagged as conflicts: semantically related but different enough to be contradictory. The caller resolves via supersede.
Can I use this with local models?
Yes. The dedup pipeline makes no LLM calls. For embeddings, use --embedding-provider ollama with a local Ollama instance. No API key needed. You can also send pre-computed embeddings to skip embedding generation entirely.
Why not increase the context window?
Larger windows don't solve redundancy. If you stuff 50 chunks into a 128K window and 20 say the same thing, the model still processes all of them. Distill ensures the model sees unique, diverse chunks instead of overlapping ones.
/Contact

Want to use Distill in production?

Whether you need a hosted API, enterprise deployment, or want to talk through your use case, reach out directly.