# Distill

Add a reliability layer to your RAG pipeline with deterministic deduplication.
## Installation

```shell
# Go
go install github.com/Siddhant-K-code/distill@latest

# Or Docker
docker pull ghcr.io/siddhant-k-code/distill:latest

# Or download a binary from GitHub Releases:
# https://github.com/Siddhant-K-code/distill/releases
```
## Quick Start

```shell
# Standalone (no vector DB needed)
export OPENAI_API_KEY=sk-...
distill api --port 8080

# With a vector DB
distill serve --backend pinecone --index my-index --port 8080
```
```shell
curl -X POST http://localhost:8080/v1/dedupe \
  -H "Content-Type: application/json" \
  -d '{
    "chunks": [
      {"text": "Deploy with Vercel for Next.js apps"},
      {"text": "Use Vercel to deploy your Next.js application"},
      {"text": "Configure environment variables in .env"}
    ],
    "threshold": 0.15,
    "lambda": 0.5
  }'
```

## API

### POST /v1/dedupe

Deduplicate text chunks. No vector DB required; embeddings are computed on the fly.
Request:

```jsonc
{
  "chunks": [                                  // Array of text chunks
    {"text": "..."},
    {"text": "...", "embedding": [0.1, ...]}   // Optional pre-computed embedding
  ],
  "threshold": 0.15,                           // Clustering threshold (default: 0.15)
  "lambda": 0.5                                // MMR lambda (default: 0.5)
}
```

Response:

```json
{
  "unique_chunks": [
    {
      "text": "...",
      "cluster_id": 0,
      "score": 0.95
    }
  ],
  "stats": {
    "input_count": 12,
    "output_count": 8,
    "reduction_ratio": 0.33,
    "clusters_formed": 8,
    "latency_ms": 12
  }
}
```

### POST /v1/retrieve

Query a vector DB and deduplicate the results. Requires a configured backend (Pinecone or Qdrant).
Request:

```jsonc
{
  "query": string,            // Natural language query
  "query_embedding": float[], // Or provide an embedding directly
  "target_k": number,         // Chunks to return (default: 8)
  "over_fetch_k": number,     // Chunks to retrieve (default: 50)
  "threshold": number,        // Clustering threshold (default: 0.15)
  "lambda": number,           // MMR lambda (default: 0.5)
  "namespace": string,        // Optional namespace
  "filter": object            // Optional metadata filter
}
```

Response:

```json
{
  "chunks": [
    {
      "id": "chunk_123",
      "text": "...",
      "score": 0.92,
      "cluster_id": 0,
      "metadata": {}
    }
  ],
  "stats": {
    "retrieved": 50,
    "clustered": 12,
    "returned": 8,
    "retrieval_latency_ms": 45,
    "clustering_latency_ms": 12,
    "total_latency_ms": 57
  }
}
```

### GET /metrics

Prometheus metrics endpoint. Exposes request counts, latency histograms, reduction ratios, and cluster counts.
### GET /health

Health check. Returns `{ status: 'ok' }`.
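Any HTTP client can call these endpoints. A minimal Python sketch against a locally running server (the `build_dedupe_request` and `dedupe` helper names are illustrative, not part of the tool):

```python
import json
import urllib.request

def build_dedupe_request(chunks, threshold=0.15, lam=0.5):
    """Assemble a /v1/dedupe request body from raw text chunks."""
    return {
        "chunks": [{"text": t} for t in chunks],
        "threshold": threshold,
        "lambda": lam,
    }

def dedupe(chunks, url="http://localhost:8080/v1/dedupe"):
    """POST chunks to a running distill server; returns the parsed response."""
    body = json.dumps(build_dedupe_request(chunks)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires a running server, e.g.:
# result = dedupe(["Deploy with Vercel for Next.js apps",
#                  "Use Vercel to deploy your Next.js application"])
# print(result["stats"]["reduction_ratio"])
```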
## How It Works

1. **Cache lookup.** Check for previously seen patterns. If a matching result exists, return it instantly; no recomputation needed.
2. **Clustering.** Group semantically similar chunks using agglomerative clustering with average linkage. Chunks that say the same thing end up in the same cluster, and there is no need to specify K.
3. **Selection.** Pick the best representative from each cluster based on relevance score, centroid proximity, or a hybrid of both.
4. **Compression.** Remove low-information content from selected chunks: stopwords, filler, and noise. Preserves meaning while reducing token count.
5. **MMR re-ranking.** Apply Maximal Marginal Relevance to balance relevance and diversity: λ = 1.0 for pure relevance, λ = 0.0 for pure diversity.
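The clustering and MMR stages can be sketched in Python. This is an illustrative re-implementation under simplifying assumptions (cosine distance, greedy merging), not the tool's actual Go code; `cluster` and `mmr` are hypothetical helper names:

```python
import math

def cos_dist(a, b):
    """Cosine distance between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def cluster(embs, threshold=0.15):
    """Agglomerative clustering with average linkage: repeatedly merge the
    two closest clusters until the smallest average inter-cluster distance
    exceeds the threshold. No K needs to be specified."""
    clusters = [[i] for i in range(len(embs))]

    def avg_link(c1, c2):
        return sum(cos_dist(embs[i], embs[j])
                   for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min((avg_link(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def mmr(query_emb, cand_embs, lam=0.5, k=8):
    """Maximal Marginal Relevance: greedily pick candidates that are relevant
    to the query (weight lam) yet dissimilar to already-picked ones."""
    sim = lambda a, b: 1.0 - cos_dist(a, b)
    selected, remaining = [], list(range(len(cand_embs)))
    while remaining and len(selected) < k:
        def score(i):
            rel = sim(query_emb, cand_embs[i])
            red = max((sim(cand_embs[i], cand_embs[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With three toy embeddings where the first two nearly coincide, `cluster` groups them together and leaves the orthogonal one alone; `mmr` then orders survivors by relevance-minus-redundancy.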
## Configuration

Generate a config file with `distill config init`, then validate it with `distill config validate`.
```yaml
# distill.yaml
server:
  port: 8080
  host: "0.0.0.0"
embedding:
  provider: "openai"
  model: "text-embedding-3-small"
dedup:
  threshold: 0.15
  lambda: 0.5
  selection_strategy: "score"
retriever:
  backend: "pinecone"
  index: "my-index"
auth:
  api_keys:
    - "${DISTILL_API_KEY}"
telemetry:
  tracing:
    enabled: true
    exporter: "otlp"
    endpoint: "localhost:4317"
    sampling_rate: 1.0
```

Environment variables are interpolated with `${VAR:-default}` syntax.
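The `${VAR:-default}` behavior can be sketched as follows (a hypothetical Python re-implementation for illustration; the tool resolves these internally):

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def interpolate(text, env=os.environ):
    """Replace ${VAR} with the environment value, falling back to the
    default after ':-' (or an empty string) when the variable is unset."""
    def repl(m):
        name, default = m.group(1), m.group(2)
        return env.get(name, default if default is not None else "")
    return _PATTERN.sub(repl, text)
```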
## Observability

### Metrics

Scrape `/metrics` for request rates, latency distributions, reduction ratios, and cluster counts.

```
distill_requests_total
distill_request_duration_seconds
distill_chunks_processed
distill_reduction_ratio
distill_active_requests
distill_clusters_formed
```

A Grafana dashboard template is included in the repo.
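For example, a minimal Prometheus scrape config pointed at the server (the job name and target host are assumptions based on the quick-start defaults):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "distill"
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8080"]
```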
### Tracing

Distributed traces for every pipeline stage. Export to Jaeger, Grafana Tempo, or any OTLP-compatible collector.

```
distill.request
├── distill.embedding
├── distill.clustering
├── distill.selection
└── distill.mmr
```

W3C Trace Context propagation. Configurable sampling rate.
## MCP Server

Run as an MCP server for AI assistants. Works with Claude Desktop, Cursor, and any MCP-compatible client.

```shell
distill mcp
distill mcp --backend pinecone --index my-index
```
## Vector DB Backends

### Pinecone

Connect your Pinecone index for retrieval with deduplication.

```shell
distill serve --backend pinecone --index my-index
```

### Qdrant

Works with self-hosted or Qdrant Cloud instances.

```shell
distill serve --backend qdrant --collection my-collection
```
## Docker

Pre-built images for every release. Multi-arch (amd64/arm64).

```shell
docker run -p 8080:8080 \
  -e OPENAI_API_KEY=sk-... \
  ghcr.io/siddhant-k-code/distill:latest
```
## CLI Commands

| Command | Description |
|---|---|
| `distill api` | Standalone API server (no vector DB) |
| `distill serve` | Server with a vector DB connection |
| `distill mcp` | MCP server for AI assistants |
| `distill query` | Query a vector DB from the command line |
| `distill sync` | Bulk upload with deduplication |
| `distill analyze` | Analyze vectors for duplicates |
| `distill config init` | Generate a distill.yaml template |
| `distill config validate` | Validate the config file |
Deep dive into the algorithms behind deterministic deduplication and LLM reliability.