Documentation


Add a reliability layer to your RAG pipeline with deterministic deduplication

Quick start

1. Install

# Go install
go install github.com/Siddhant-K-code/distill@latest

# Or Docker
docker pull ghcr.io/siddhant-k-code/distill:latest

# Or download binary from GitHub Releases
# https://github.com/Siddhant-K-code/distill/releases

2. Start the API server

# Standalone (no vector DB needed)
export OPENAI_API_KEY=sk-...
distill api --port 8080

# With vector DB
distill serve --backend pinecone --index my-index --port 8080

3. Deduplicate

curl -X POST http://localhost:8080/v1/dedupe \
  -H "Content-Type: application/json" \
  -d '{
    "chunks": [
      {"text": "Deploy with Vercel for Next.js apps"},
      {"text": "Use Vercel to deploy your Next.js application"},
      {"text": "Configure environment variables in .env"}
    ],
    "threshold": 0.15,
    "lambda": 0.5
  }'

API reference

POST /v1/dedupe

Deduplicate text chunks. No vector DB required - embeddings are computed on the fly.

Request body

{
  "chunks": [               // Array of text chunks
    {"text": "..."},
    {"text": "...", "embedding": [0.1, ...]}  // Optional pre-computed
  ],
  "threshold": 0.15,        // Clustering threshold (default: 0.15)
  "lambda": 0.5             // MMR lambda (default: 0.5)
}

Response

{
  "unique_chunks": [
    {
      "text": "...",
      "cluster_id": 0,
      "score": 0.95
    }
  ],
  "stats": {
    "input_count": 12,
    "output_count": 8,
    "reduction_ratio": 0.33,
    "clusters_formed": 8,
    "latency_ms": 12
  }
}
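The stats in the sample response are internally consistent: with 12 input chunks and 8 unique chunks, the reduction ratio is the fraction of chunks removed. A minimal sketch of that arithmetic (illustrative only, not distill's source):

```python
def reduction_ratio(input_count: int, output_count: int) -> float:
    # fraction of input chunks removed by deduplication
    return (input_count - output_count) / input_count

# matches the sample response: 12 in, 8 out -> 0.33
print(round(reduction_ratio(12, 8), 2))
```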
POST /v1/retrieve

Query a vector DB and deduplicate results. Requires a configured backend (Pinecone or Qdrant).

Request body

{
  "query": string,           // Natural language query
  "query_embedding": float[], // Or provide embedding directly
  "target_k": number,        // Chunks to return (default: 8)
  "over_fetch_k": number,    // Chunks to retrieve (default: 50)
  "threshold": number,       // Clustering threshold (default: 0.15)
  "lambda": number,          // MMR lambda (default: 0.5)
  "namespace": string,       // Optional namespace
  "filter": object           // Optional metadata filter
}

Response

{
  "chunks": [
    {
      "id": "chunk_123",
      "text": "...",
      "score": 0.92,
      "cluster_id": 0,
      "metadata": {}
    }
  ],
  "stats": {
    "retrieved": 50,
    "clustered": 12,
    "returned": 8,
    "retrieval_latency_ms": 45,
    "clustering_latency_ms": 12,
    "total_latency_ms": 57
  }
}
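The interplay of target_k and over_fetch_k: the server fetches more candidates than it returns, so that after near-duplicates collapse there are still enough distinct chunks. A hedged sketch of that flow, with a hypothetical search callable standing in for the vector DB:

```python
def retrieve_and_dedupe(search, query, target_k=8, over_fetch_k=50, dedupe=None):
    # over-fetch candidates from the vector DB, collapse near-duplicates,
    # then keep only the top target_k survivors
    candidates = search(query, top_k=over_fetch_k)      # hypothetical backend call
    unique = dedupe(candidates) if dedupe else candidates
    return unique[:target_k]
```

Here dedupe is any callable that drops redundant chunks; the defaults mirror the request-body defaults above (target_k=8, over_fetch_k=50).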
GET /metrics

Prometheus metrics endpoint. Exposes request counts, latency histograms, reduction ratios, and cluster counts.

GET /health

Health check. Returns {"status": "ok"}.

How it works

Cache → Cluster → Select → Compress → MMR → Output

1. Cache lookup

Check for previously seen patterns. If a matching result exists, return it instantly - no recomputation needed.
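One way such a cache can work (an illustrative sketch, not distill's implementation): derive a deterministic key from the request payload, so identical inputs always hit the same entry.

```python
import hashlib
import json

def cache_key(chunks, threshold, lam):
    # deterministic key over the request payload: same inputs -> same key
    payload = json.dumps(
        {"chunks": chunks, "threshold": threshold, "lambda": lam},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

cache = {}

def dedupe_with_cache(chunks, threshold, lam, compute):
    key = cache_key(chunks, threshold, lam)
    if key in cache:
        return cache[key]          # cache hit: no recomputation
    result = compute(chunks, threshold, lam)
    cache[key] = result
    return result
```

Because the key covers the threshold and lambda as well as the chunks, changing any parameter correctly bypasses stale entries.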

2. Cluster

Group semantically similar chunks using agglomerative clustering with average linkage. Chunks that say the same thing end up in the same cluster. No need to specify K.
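The idea can be sketched in a few lines of Python (a naive O(n^3) illustration, not distill's optimized code): keep merging the two clusters with the smallest average pairwise cosine distance until no pair falls under the threshold. The threshold replaces K as the stopping criterion.

```python
import numpy as np

def cosine_dist(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_linkage_cluster(vectors, threshold):
    # each chunk starts in its own cluster
    clusters = [[i] for i in range(len(vectors))]

    def avg_dist(ci, cj):
        # average pairwise distance between two clusters (average linkage)
        return float(np.mean([cosine_dist(vectors[i], vectors[j])
                              for i in ci for j in cj]))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        if avg_dist(clusters[i], clusters[j]) > threshold:
            break                      # nothing left that is similar enough
        clusters[i] += clusters[j]     # merge the closest pair
        del clusters[j]
    return clusters
```

With the default threshold of 0.15, two paraphrases of the same sentence (cosine distance near 0) merge, while unrelated chunks stay apart.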

3. Select

Pick the best representative from each cluster based on relevance score, centroid proximity, or a hybrid of both.
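A hedged sketch of those three strategies (the function name, alpha weight, and strategy labels here are illustrative, not distill's API):

```python
import numpy as np

def pick_representative(cluster_vecs, cluster_scores, strategy="hybrid", alpha=0.5):
    # returns the index of the chunk that best represents the cluster
    centroid = np.mean(cluster_vecs, axis=0)

    def centroid_sim(v):
        return float(np.dot(v, centroid) /
                     (np.linalg.norm(v) * np.linalg.norm(centroid)))

    sims = [centroid_sim(v) for v in cluster_vecs]
    if strategy == "score":        # highest relevance score wins
        ranking = list(cluster_scores)
    elif strategy == "centroid":   # most "typical" member wins
        ranking = sims
    else:                          # hybrid: weighted blend of both
        ranking = [alpha * s + (1 - alpha) * c
                   for s, c in zip(cluster_scores, sims)]
    return int(np.argmax(ranking))
```

"score" favors the chunk the retriever ranked highest; "centroid" favors the chunk closest to the cluster's semantic center; the hybrid trades the two off.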

4. Compress

Remove low-information content from selected chunks - stopwords, filler, and noise. Preserves meaning while reducing token count.
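A toy illustration of the idea, assuming a simple stopword list (distill's actual compression is presumably more sophisticated):

```python
# small illustrative stopword list; a real one would be much larger
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "that", "very"}

def compress(text: str) -> str:
    # drop stopwords, keep content words in their original order and casing
    kept = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)
```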

5. MMR re-rank

Apply Maximal Marginal Relevance to balance relevance and diversity. λ = 1.0 for pure relevance, λ = 0.0 for pure diversity.
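The standard MMR formulation greedily picks the candidate maximizing λ·relevance − (1−λ)·redundancy, where redundancy is the maximum similarity to anything already selected. A self-contained sketch:

```python
import numpy as np

def mmr(query_vec, doc_vecs, lam=0.5, k=8):
    # greedy MMR selection: balance relevance to the query against
    # similarity to chunks already selected
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ = 0.5, a near-duplicate of an already-selected chunk scores poorly even if it is highly relevant, so a less redundant chunk gets picked instead.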

Configuration

Generate a config file with distill config init, then validate with distill config validate.

# distill.yaml
server:
  port: 8080
  host: "0.0.0.0"

embedding:
  provider: "openai"
  model: "text-embedding-3-small"

dedup:
  threshold: 0.15
  lambda: 0.5
  selection_strategy: "score"

retriever:
  backend: "pinecone"
  index: "my-index"

auth:
  api_keys:
    - "${DISTILL_API_KEY}"

telemetry:
  tracing:
    enabled: true
    exporter: "otlp"
    endpoint: "localhost:4317"
    sampling_rate: 1.0

Environment variables are interpolated with ${VAR:-default} syntax.
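That syntax follows the shell convention: use the variable's value if set, otherwise fall back to the default after ":-". A minimal sketch of such an interpolator (illustrative, not distill's parser):

```python
import os
import re

def interpolate(value: str) -> str:
    # expand ${VAR} and ${VAR:-default} using the process environment
    def repl(match):
        var, default = match.group(1), match.group(2) or ""
        return os.environ.get(var, default)
    return re.sub(r"\$\{(\w+)(?::-([^}]*))?\}", repl, value)
```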

Observability

Prometheus metrics

Scrape /metrics for request rates, latency distributions, reduction ratios, and cluster counts.

distill_requests_total
distill_request_duration_seconds
distill_chunks_processed
distill_reduction_ratio
distill_active_requests
distill_clusters_formed

A Grafana dashboard template is included in the repo.

OpenTelemetry tracing

Distributed traces for every pipeline stage. Export to Jaeger, Grafana Tempo, or any OTLP-compatible collector.

Spans:
  distill.request
  ├── distill.embedding
  ├── distill.clustering
  ├── distill.selection
  └── distill.mmr

W3C Trace Context propagation. Configurable sampling rate.

Integrations

MCP (Claude, Cursor)

Run as an MCP server for AI assistants. Works with Claude Desktop, Cursor, and any MCP-compatible client.

distill mcp
distill mcp --backend pinecone --index my-index

Pinecone

Connect your Pinecone index for retrieval with deduplication.

distill serve --backend pinecone --index my-index

Qdrant

Works with self-hosted or Qdrant Cloud instances.

distill serve --backend qdrant --collection my-collection

Docker

Pre-built images for every release. Multi-arch (amd64/arm64).

docker run -p 8080:8080 \
  -e OPENAI_API_KEY=sk-... \
  ghcr.io/siddhant-k-code/distill:latest

CLI commands

Command                  Description
distill api              Standalone API server (no vector DB)
distill serve            Server with vector DB connection
distill mcp              MCP server for AI assistants
distill query            Query vector DB from command line
distill sync             Bulk upload with deduplication
distill analyze          Analyze vectors for duplicates
distill config init      Generate distill.yaml template
distill config validate  Validate config file

Learn more

The Engineering Guide to Context Window Efficiency

Deep dive into the algorithms behind deterministic deduplication and LLM reliability.