Prompt Template for ChromaSQL¶
Embed this prompt when you want an agent or LLM to produce ChromaSQL statements that execute cleanly against ChromaDB collections. It captures the guardrails our runtime expects such as supported clause order, filtering rules, and multi-collection constraints.
## Core Facts
- Read-only DSL: ChromaSQL issues `SELECT`/`EXPLAIN` queries; it never mutates data.
- Single collection per query: There are no joins. To span models, filter on the discriminator metadata key and run one ChromaSQL statement at a time.
- Every collection stores `id`, `document`, `embedding`, `metadata`, and query-time `distance`. Project only what the request needs; embeddings must be explicitly selected.
- Use dotted paths for metadata: `metadata.model_name`, `metadata.category`, etc. Alias with `AS` when helpful.
- Clause order:
`SELECT … FROM … [USING EMBEDDING (…)] [WHERE …] [SIMILARITY …] [WITH SCORE THRESHOLD …] [RERANK …] [ORDER BY …] [LIMIT … [OFFSET …]];`
`EXPLAIN` precedes the `SELECT`.
Note: `USING EMBEDDING` always requires parentheses around TEXT/VECTOR/BATCH.
## Filtering
- Use `WHERE` to filter on metadata OR document fields. Combine predicates with `AND` / `OR` (wrap complex logic in parentheses).
- **Metadata filtering:** Use dotted paths like `metadata.category = 'finance'`. Supported operators: `=`, `!=`, `<`, `<=`, `>`, `>=`, `IN`, `NOT IN`, `BETWEEN`.
- **Document filtering (text matching):** ChromaSQL supports the following operators for full-text search on the `document` field:
- `CONTAINS 'text'` - Case-sensitive substring matching
- `NOT CONTAINS 'text'` - Excludes documents containing the text
- `LIKE '%pattern%'` - Substring matching with strict `%value%` format (exactly two % signs)
- `NOT LIKE '%pattern%'` - Excludes documents matching the pattern
- `REGEX 'pattern'` - Regular expression matching (supports full regex syntax)
- `NOT REGEX 'pattern'` - Excludes documents matching the regex
- **CRITICAL:** Text operators (`CONTAINS`, `LIKE`, `REGEX`) only work on `document`, NOT on metadata (ChromaDB limitation). For metadata, use exact matches or comparisons.
- **CRITICAL:** Different filter types (metadata vs. document) can ONLY be combined with `AND`, not `OR`. Within each type, use `OR` freely. ChromaDB limitation.
- **Operator precedence:** AND binds tighter than OR. Always use parentheses to group document OR expressions before combining with metadata filters.
- ✅ VALID: `WHERE metadata.x = 'a' AND document CONTAINS 'b'`
- ✅ VALID: `WHERE (metadata.x = 'a' OR metadata.y = 'b') AND document CONTAINS 'c'`
- ✅ VALID: `WHERE metadata.package != '$TMP' AND (document CONTAINS 'check' OR document CONTAINS 'extract')`
- ❌ INVALID: `WHERE metadata.x = 'a' OR document CONTAINS 'b'`
- ❌ INVALID: `WHERE document CONTAINS 'a' OR document CONTAINS 'b' AND metadata.x = 'c'` -- AND binds tighter, creates OR between filter types
- Examples:
- `WHERE metadata.category = 'finance' AND metadata.amount > 1000`
- `WHERE document CONTAINS 'oauth' OR document CONTAINS 'authentication'`
- `WHERE document NOT CONTAINS 'deprecated'` -- Exclude deprecated docs
- `WHERE document LIKE '%positive pay%'` -- Multi-word phrase (space is literal)
- `WHERE document LIKE '%Bank of America%'` -- Three-word phrase
- `WHERE document LIKE '%US_POSIPAY%'` -- Underscore is treated as literal character
- `WHERE document NOT LIKE '%test%'` -- Exclude test documents
- `WHERE document REGEX '[a-z]+@[a-z]+\\.com'` -- Email pattern matching
- `WHERE document NOT REGEX '\\d{3}-\\d{2}-\\d{4}'` -- Exclude SSNs
- `WHERE (document CONTAINS 'vendor' AND document LIKE '%payment%') OR document CONTAINS 'invoice'`
- `WHERE metadata.status = 'active' AND document NOT CONTAINS 'archived'`
- `WHERE metadata.package != '$TMP' AND ((document CONTAINS 'check' AND document CONTAINS 'extract') OR document CONTAINS 'RFCHKE00')` -- Complex: metadata filter AND document OR expression (use parentheses!)
## Vector Search
- Add `USING EMBEDDING (TEXT '...')` for similarity search. The parentheses around TEXT are REQUIRED.
- For batch queries (multiple search terms), use `USING EMBEDDING BATCH (TEXT '...', TEXT '...', ...)`
- Examples:
- `USING EMBEDDING (TEXT 'positive pay processing')`
- `USING EMBEDDING (TEXT 'vendor invoice reconciliation')`
- `USING EMBEDDING BATCH (TEXT 'vendor invoice', TEXT 'payment receipt', TEXT 'purchase order')`
- Always add `TOPK n` with embedding queries; the default is 10.
- Change the metric with `SIMILARITY COSINE | L2 | IP`. Omit to keep the collection default.
- Optional `WITH SCORE THRESHOLD x` discards rows whose distance exceeds `x`.
- `RERANK BY MMR(lambda = …, candidates = …)` adds reranking metadata to the query plan.
## Sorting & Pagination
- `ORDER BY` supports `distance`, `id`, or metadata paths. Vector queries sort by `distance ASC` by default.
- Use `LIMIT n` and optional `OFFSET m` to paginate after retrieval. These operate on the merged result set.
## Filter-Only Retrieval
- Omitting `USING EMBEDDING` yields a metadata-only lookup (`collection.get`). `distance` cannot be selected or ordered in this mode.
## Diagnostics
- `EXPLAIN` before `SELECT` returns the planned request without hitting ChromaDB.
## Limitations & Pitfalls
- No DML, DDL, joins, aggregations, window functions, or arithmetic expressions in `ORDER BY`.
- **Text search operators** (`LIKE`, `CONTAINS`, `REGEX`) **only work on `document` field**, not on metadata. ChromaDB limitation: metadata filters don't support substring/pattern matching. Use `WHERE document LIKE '%value%'` or `WHERE document CONTAINS 'text'` for text search. For metadata, use exact matches (`=`, `IN`) or comparisons (`<`, `>`, `BETWEEN`).
- **LIKE pattern restrictions**: `LIKE` requires patterns in the form `'%value%'` (literal string with exactly two `%` wildcards at start and end). Underscores are treated as literal characters (not wildcards).
- ✅ VALID: `LIKE '%oauth%'`, `LIKE '%positive pay%'`, `LIKE '%Bank of America%'`, `LIKE '%US_POSIPAY%'` (underscore is literal)
- ❌ INVALID: `LIKE '%positive%pay%'` (extra % in middle), `LIKE '%oauth'` (missing ending %), `LIKE 'oauth%'` (missing starting %)
- For multi-word searches, include the entire phrase: `'%positive pay%'` NOT `'%positive%pay%'`
- For searching multiple separate terms anywhere in document: `WHERE document LIKE '%term1%' AND document LIKE '%term2%'`
- **REGEX patterns**: Use standard regex syntax. `REGEX` supports full regular expression patterns:
- Email: `REGEX '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}'`
- Phone: `REGEX '\\d{3}-\\d{4}'`
- Case-insensitive: `REGEX '(?i)python'` (matches PYTHON, python, Python)
- Word boundaries and other regex features are supported
- **Case-sensitive matching**: All text operators (`CONTAINS`, `LIKE`, `REGEX`) are case-sensitive by default. `WHERE document CONTAINS 'BofA'` will NOT match "bofa" or "BOFA". To query case-insensitively:
- Use OR for multiple variants: `WHERE document CONTAINS 'BofA' OR document CONTAINS 'bofa'`
- Use REGEX with (?i) flag: `WHERE document REGEX '(?i)bofa'` (matches BofA, bofa, BOFA, etc.)
- **Mixed int/float comparisons**: ChromaDB may coerce types when comparing integer metadata values with float boundaries. For reliable results, use integer boundaries with integer metadata, or ensure float boundaries clearly exclude/include values (e.g., use `>= 1499` instead of `>= 1500.5` when metadata is `1500`).
- Batch queries apply `TOPK` per batch item; results may contain duplicates across batch items.
- `RERANK` clauses add metadata to the query plan for downstream processing.
- Have Ideas to Improve the Prompt?
Submit a Pull Request.