v0.9.1: OpenAPI 3.1 spec + interactive docs shipped (release notes)
v0.9.1 · MIT · open source

Your agent remembers
what matters.

Persistent memory across sessions. Semantic dedup. Context compression. ~12ms. No LLM calls.

Other tools compress what goes into your agent. Distill controls what your agent remembers: across sessions, without conflicts, ranked by what matters now.

[Pipeline diagram: Sources (RAG chunks, tool outputs, memory, docs, conversation) → Distill pipeline (Cache → Cluster → Select → Compress → MMR) → LLM. ~12ms, no LLM calls, deterministic; 30-40% fewer tokens; same input, same output; full audit trail.]
/Problem

Redundant context breaks reliability.

Context deduplication is not an optimization. It is a correctness measure.

Non-deterministic outputs
Same prompt, different answers. Redundant context confuses the model about which version of a fact is current.
30-40% wasted tokens
RAG retrieval returns overlapping chunks. The model processes all of them, paying for the same information multiple times.
Production failures at scale
Context bloat triggers compaction. Compaction is lossy. Critical constraints and rejected approaches disappear silently.
Agents start from zero
Every new session re-learns the same constraints, preferences, and facts. Nothing carries over.
/Pipeline

Math, not magic. Fully deterministic.

No LLM calls. Same input, same output. Every decision is auditable.

01
Cache
Annotate stable prefixes. Sub-ms retrieval for known patterns. TTL-aware eviction for batch workloads.
02
Cluster
Agglomerative clustering groups semantically similar chunks.
03
Select
Picks the best representative from each cluster, by centroid proximity or hybrid scoring.
04
Compress
Extractive compression removes noise, preserves signal.
05
MMR
Maximal Marginal Relevance re-ranks for relevance + diversity.
06
Summarize
Hierarchical multi-level summarization collapses long conversations into progressively shorter representations without losing structure.
~12ms total overhead · O(n²) distance matrix · 50 chunks in <2ms · POST /v1/pipeline chains all six stages
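The cluster, select, and MMR stages above can be sketched in a few lines. This is an illustrative Python toy, not Distill's actual implementation: the greedy single-linkage clustering, the 0.15 merge threshold, and the λ=0.7 relevance/diversity weighting are simplifying assumptions for demonstration.

```python
import numpy as np

def cosine_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cluster(embs, threshold=0.15):
    """Greedy single-linkage clustering: a chunk joins the first
    cluster whose seed member is within `threshold` cosine distance."""
    clusters = []  # list of lists of chunk indices
    for i, e in enumerate(embs):
        for c in clusters:
            if cosine_dist(e, embs[c[0]]) < threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def select_representatives(embs, clusters):
    """Keep one chunk per cluster: the member closest to the centroid."""
    reps = []
    for c in clusters:
        centroid = np.mean([embs[i] for i in c], axis=0)
        reps.append(min(c, key=lambda i: cosine_dist(embs[i], centroid)))
    return reps

def mmr(query, embs, candidates, k, lam=0.7):
    """Maximal Marginal Relevance: greedily pick items that are relevant
    to the query but dissimilar to what was already selected."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(i):
            relevance = 1.0 - cosine_dist(query, embs[i])
            redundancy = max((1.0 - cosine_dist(embs[i], embs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

Two near-duplicate chunks collapse into one cluster, one representative survives, and MMR then orders representatives by relevance while penalizing overlap with earlier picks.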
/Memory

Persistent memory across sessions.

Store once. Recall by relevance. Forget what's outdated. The model does not know what matters to you. Distill does.

# Enable memory + sessions
distill mcp --memory --session

# Store a memory
distill memory store \
  --text "Auth uses JWT with RS256" \
  --tags auth,security \
  --source code_review \
  --auto-classify

# Recall by relevance
distill memory recall \
  --query "how does auth work?" \
  --boost-tags auth \
  --min-relevance 0.3
Write-time dedup
Entries within cosine distance 0.15 are merged, not duplicated.
Conflict detection
Entries between 0.15 and 0.35 cosine distance are flagged as conflicts. Caller resolves via supersede.
Sensitivity tagging
Pattern-based PII, credential, and internal reference detection. <1ms. No LLM calls.
Expiry & supersession
Soft-delete without losing audit trail. Replace outdated memories, preserve history.
Task-relevance ranking
Boost by tags, task context, and min relevance filter on recall.
Hierarchical decay
Full text → Summary → Keywords → Evicted. Access resets the decay clock.
MCP tools: store_memory · recall_memory · forget_memory · memory_expire · memory_supersede · memory_stats
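The write-time dedup and conflict thresholds above reduce to a nearest-neighbor check at store time. A minimal sketch, assuming plain embedding vectors (illustrative Python; `classify_write` and its helper are hypothetical names, not Distill's storage layer):

```python
import numpy as np

DUP_THRESHOLD = 0.15       # below this: merge as a duplicate
CONFLICT_THRESHOLD = 0.35  # between the two: flag as a conflict

def cosine_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify_write(new_emb, existing):
    """Classify a new memory against stored embeddings.
    Returns ('duplicate' | 'conflict' | 'new', nearest index or None)."""
    if not existing:
        return "new", None
    nearest = min(range(len(existing)),
                  key=lambda i: cosine_dist(new_emb, existing[i]))
    d = cosine_dist(new_emb, existing[nearest])
    if d < DUP_THRESHOLD:
        return "duplicate", nearest   # merge, don't store twice
    if d < CONFLICT_THRESHOLD:
        return "conflict", nearest    # caller resolves via supersede
    return "new", None
```

Duplicates are merged silently; conflicts are surfaced with the index of the entry they contradict, so the caller can decide which version supersedes the other.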
/Sessions

Token-budgeted context windows.

Push context incrementally as the agent works. Distill deduplicates entries, compresses aging ones, and evicts when the budget is exceeded.

128K token budget
Default max_tokens. Configurable per session.
Auto-compress
Full text → Summary → Sentence → Keywords as entries age.
Importance eviction
Lowest-importance entries compressed and evicted first.
MCP tools: create_session · push_session · session_context · delete_session
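The importance-based eviction rule can be sketched roughly like this (a simplified Python illustration; the real pipeline compresses aging entries before evicting, which this sketch skips):

```python
def enforce_budget(entries, max_tokens=128_000):
    """Evict lowest-importance entries until the session fits the budget.
    entries: list of dicts with 'tokens' and 'importance' keys."""
    total = sum(e["tokens"] for e in entries)
    kept = sorted(entries, key=lambda e: e["importance"], reverse=True)
    while total > max_tokens and kept:
        evicted = kept.pop()          # lowest importance goes first
        total -= evicted["tokens"]
    return kept, total
```

When the running total exceeds the budget, the least important entries are dropped until the session fits again; everything the agent marked as important stays resident.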
/Code Intelligence

Understand blast radius before you merge.

For engineering teams running automated code-mod rollouts, CVE patching, or any workflow where the same change pattern has caused problems before.

Code Change Impact Graph
Map structural dependencies across your codebase. Query which files are affected by a change before it lands. Blast-radius analysis without running the build.
Semantic Commit Analysis
Find similar past changes. Score risk based on historical patterns. Surface commits that caused incidents so the same mistake isn't repeated.
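Blast-radius analysis of this kind reduces to a reachability query over the dependency graph. A minimal sketch, assuming a precomputed reverse-dependency map (illustrative Python; `blast_radius` is a hypothetical name, not Distill's API):

```python
from collections import deque

def blast_radius(reverse_deps, changed):
    """Files transitively affected by a change: BFS over the
    reverse-dependency graph (file -> files that depend on it)."""
    affected, queue = set(changed), deque(changed)
    while queue:
        f = queue.popleft()
        for dependent in reverse_deps.get(f, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected - set(changed)
```

Because the graph is built from static structure, the query answers "what could this change break?" without compiling or running anything.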
/Observability

Production observability included.

Prometheus metrics
  - Request counts + latency histograms
  - Chunk reduction ratios
  - Cluster counts per request
  - Per-call-site cache hit rate
  - Cache write cost accounting
  - TTL-aware cache eviction events
OpenTelemetry tracing
  - Per-stage spans (embed, cluster, select, MMR)
  - W3C Trace Context propagation
  - OTLP export (Jaeger, Tempo)
  - Stdout exporter for local dev
Structured logging
  - log/slog with JSON + text output
  - Request ID + trace ID on every log
  - Debug / info / warn / error levels
  - Grafana dashboard template included
/Integrations

Fits your existing stack.

HTTP API, MCP server, or CLI. Run fully local with Ollama. No API key, no external calls.

OpenAI
Anthropic
Ollama
Cohere
Claude (MCP)
Cursor (MCP)
LangChain
LlamaIndex
Pinecone
Qdrant
Prometheus
Grafana
OpenTelemetry
Docker
Fly.io
Render

Python SDK (LangChain + LlamaIndex native) in progress (#5).

/Comparison

Why not use an LLM for compression?

                   Distill                   LLM compression
Latency            ~12ms                     ~500ms
Cost per call      $0.0001                   $0.01+
Deterministic      Yes, always               No, varies per run
Auditable          Full cluster log          Black box
Requires API key   Optional (Ollama works)   Always
Batch support      Async job queue           Sequential
/FAQ

Common questions.

Is this removing exact duplicates?
No. Exact dedup is trivial (hash comparison). Distill does semantic dedup. It identifies chunks that convey the same information in different words. Two paragraphs explaining JWT auth with different wording will be clustered together, and only the best one is kept.
What is Context Memory?
Persistent memory that accumulates knowledge across agent sessions. Store context once, recall it later by semantic similarity and recency. Memories are deduplicated on write, compressed over time through hierarchical decay, and automatically classified for sensitivity.
How is Memory different from Sessions?
Memory is cross-session: knowledge persists after a session ends and can be recalled in future sessions. Sessions are within-task: a bounded context window that tracks what the agent has seen during a single task, enforcing a token budget.
How does conflict detection work?
When storing a memory, Distill checks existing entries by cosine distance. Entries below 0.15 are duplicates and are skipped. Entries between 0.15 and 0.35 are flagged as conflicts: semantically related but different enough to be contradictory. The caller resolves via supersede.
Can I use this with local models?
Yes. The dedup pipeline makes no LLM calls. For embeddings, use --embedding-provider ollama with a local Ollama instance. No API key needed. You can also send pre-computed embeddings to skip embedding generation entirely.
Why not increase the context window?
Larger windows don't solve redundancy. If you stuff 50 chunks into a 128K window and 20 say the same thing, the model still processes all of them. Distill ensures the model sees unique, diverse chunks instead of overlapping ones.
/Contact

Want to use Distill in production?

Whether you need a hosted API, enterprise deployment, or want to talk through your use case, reach out directly.