Query Configuration¶
The query configuration file maps each model name to the ChromaDB collections that contain it, so a query for a given model can target only those collections instead of every partition.
Generating Query Config¶
CLI Command¶
idxr vectorize generate-query-config \
--partition-out-dir build/vector \
--output query_config.json \
--model path/to/model_registry.yaml \
--collection-prefix my-index
Arguments:
- --partition-out-dir (required): Directory containing partition subdirectories with resume state files
- --output (required): Path to write query config JSON file
- --model (required): Python module path to model registry (e.g., my_package.registry:MODEL_REGISTRY)
- --collection-prefix (optional): Prefix to add to collection names
Python API¶
from indexer.vectorize_lib import generate_query_config
from pathlib import Path
config = generate_query_config(
partition_out_dir=Path("build/vector"),
output_path=Path("query_config.json"),
collection_prefix="ecc-prod",
)
Config File Structure¶
Full Example¶
{
"metadata": {
"generated_at": "2025-10-31T10:00:00.123456",
"total_models": 3,
"total_collections": 5,
"collection_prefix": "ecc-prod"
},
"model_to_collections": {
"Table": {
"collections": ["ecc-prod_partition_00001", "ecc-prod_partition_00002"],
"total_documents": 450000,
"partitions": ["partition_00001", "partition_00002"]
},
"Field": {
"collections": ["ecc-prod_partition_00002", "ecc-prod_partition_00003", "ecc-prod_partition_00005"],
"total_documents": 680000,
"partitions": ["partition_00002", "partition_00003", "partition_00005"]
},
"Domain": {
"collections": ["ecc-prod_partition_00004", "ecc-prod_partition_00005"],
"total_documents": 120000,
"partitions": ["partition_00004", "partition_00005"]
}
},
"collection_to_models": {
"ecc-prod_partition_00001": ["Table"],
"ecc-prod_partition_00002": ["Table", "Field"],
"ecc-prod_partition_00003": ["Field"],
"ecc-prod_partition_00004": ["Domain"],
"ecc-prod_partition_00005": ["Field", "Domain"]
}
}
Field Descriptions¶
metadata¶
- generated_at: ISO 8601 timestamp of config generation
- total_models: Number of distinct models with indexed documents
- total_collections: Number of ChromaDB collections (partitions)
- collection_prefix: Optional prefix applied to collection names
model_to_collections¶
Mapping from model name to collection information:
- collections: List of collection names containing this model
- total_documents: Total document count across all collections for this model
- partitions: List of partition names (same as collections unless prefix is used)
collection_to_models¶
Inverse mapping from collection name to list of model names it contains.
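To make the relationship between the two mappings concrete, here is a minimal sketch that works on a hand-written dictionary with unprefixed collection names. It does not call the library; the helper names collections_for and invert are hypothetical and exist only for this illustration.

```python
# Hand-written fragment shaped like a query config; not produced by idxr.
config = {
    "model_to_collections": {
        "Table": {"collections": ["partition_00001", "partition_00002"]},
        "Field": {"collections": ["partition_00002", "partition_00003"]},
    },
    "collection_to_models": {
        "partition_00001": ["Table"],
        "partition_00002": ["Table", "Field"],
        "partition_00003": ["Field"],
    },
}


def collections_for(config: dict, models: list) -> list:
    """Return the sorted union of collections holding any of the given models."""
    found = set()
    for model in models:
        entry = config["model_to_collections"].get(model, {})
        found.update(entry.get("collections", []))
    return sorted(found)


def invert(model_to_collections: dict) -> dict:
    """Rebuild collection -> models from model -> collections."""
    inverse = {}
    for model, entry in model_to_collections.items():
        for collection in entry["collections"]:
            inverse.setdefault(collection, []).append(model)
    return inverse


table_only = collections_for(config, ["Table"])
both = collections_for(config, ["Table", "Field"])
print(table_only)
print(both)

# The two mappings should describe the same relationship.
consistent = invert(config["model_to_collections"]) == config["collection_to_models"]
print(consistent)
```

A lookup for a single model touches only the collections listed for it, which is the point of keeping both directions of the mapping in one file.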
Resume State Requirements¶
The query config is generated by scanning *_resume_state.json files in each partition directory. These files must contain:
{
"ModelName": {
"started": true,
"complete": true,
"collection_count": 100000,
"documents_indexed": 100000,
"indexed_at": "2025-10-31T10:00:00"
}
}
Required fields:
- started: Must be true (models not started are excluded)
- collection_count: Must be > 0 (models with no documents are excluded)
Optional fields:
- complete: Indicates if indexing completed (doesn't affect inclusion)
- documents_indexed: Actual documents indexed (may differ from collection_count)
- indexed_at: Timestamp of indexing
Collection Filtering Logic¶
When generating the query config:
- Scan partition directories for *_resume_state.json files
- Filter models where started == true and collection_count > 0
- Build mappings for model-to-collections and collection-to-models
- Sort collections for deterministic ordering
Excluded Models¶
Models are excluded if:
- started is false or missing
- collection_count is 0 or missing
- Resume state file is malformed
- Model state is not a dictionary
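The scan, filter, build, and sort steps, together with the exclusion rules, can be sketched in plain Python as follows. This is an approximation written from the description above, not the library's implementation: the function names is_included and build_mappings are hypothetical, collection names are taken to be the partition directory names (no prefix), and total_documents is assumed to be the sum of collection_count values.

```python
import json
import tempfile
from pathlib import Path


def is_included(info) -> bool:
    """Inclusion rule: state is a dict, started is true, collection_count > 0."""
    if not isinstance(info, dict):
        return False  # model state is not a dictionary
    count = info.get("collection_count")
    return info.get("started") is True and isinstance(count, int) and count > 0


def build_mappings(partition_out_dir: Path) -> dict:
    """Scan resume state files and build both mappings (illustrative only)."""
    model_to_collections = {}
    collection_to_models = {}
    for state_file in sorted(partition_out_dir.glob("*/*_resume_state.json")):
        partition = state_file.parent.name
        try:
            state = json.loads(state_file.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # malformed resume state file: skip it
        if not isinstance(state, dict):
            continue
        for model, info in state.items():
            if not is_included(info):
                continue
            entry = model_to_collections.setdefault(
                model, {"collections": [], "total_documents": 0}
            )
            entry["collections"].append(partition)
            # Assumption: total_documents sums collection_count per partition.
            entry["total_documents"] += info["collection_count"]
            collection_to_models.setdefault(partition, []).append(model)
    # Sort everything so repeated runs produce identical output.
    for entry in model_to_collections.values():
        entry["collections"].sort()
    for models in collection_to_models.values():
        models.sort()
    return {
        "model_to_collections": dict(sorted(model_to_collections.items())),
        "collection_to_models": dict(sorted(collection_to_models.items())),
    }


# Demo on a throwaway directory with three hand-made partitions.
states = {
    "partition_00001": json.dumps({
        "Table": {"started": True, "collection_count": 10},
        "Field": {"started": False, "collection_count": 3},
    }),
    "partition_00002": json.dumps({
        "Table": {"started": True, "collection_count": 5},
        "Domain": {"started": True, "collection_count": 0},
    }),
    "partition_00003": "not valid json",
}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, text in states.items():
        (root / name).mkdir()
        (root / name / f"{name}_resume_state.json").write_text(text)
    result = build_mappings(root)

print(json.dumps(result, indent=2))
```

In the demo, Field (not started), Domain (zero documents), and the malformed third partition are all dropped, leaving only Table across two collections.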
Using Collection Prefix¶
When indexing with a collection prefix, generate the query config with the same prefix so that its collection names match the ones created during indexing:
Indexing with Prefix¶
idxr vectorize index \
--collection-prefix ecc-prod \
# ... other args
Query Config with Prefix¶
idxr vectorize generate-query-config \
--collection-prefix ecc-prod \
# ... other args
The prefix is applied to collection names in the config:
{
"model_to_collections": {
"Table": {
"collections": ["ecc-prod_partition_00001", "ecc-prod_partition_00002"]
}
}
}
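Judging from the example above, a prefixed collection name appears to be the prefix and the partition name joined by an underscore. The sketch below encodes that reading; the function prefixed_name is hypothetical, and the actual naming rule in the library may differ.

```python
def prefixed_name(partition: str, prefix=None) -> str:
    """Join prefix and partition with "_" (separator inferred from the example)."""
    return f"{prefix}_{partition}" if prefix else partition


partitions = ["partition_00001", "partition_00002"]
names = [prefixed_name(p, "ecc-prod") for p in partitions]
print(names)
```

Under this assumption, a config generated without the prefix would list partition_00001 while the index holds ecc-prod_partition_00001, which is why the two commands need to agree.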
Config Validation¶
When loading a query config, the following validations are performed:
- File exists and is readable
- Valid JSON format
- Required keys present: metadata, model_to_collections, collection_to_models
- Metadata fields include total_models and total_collections
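For readers who want to run the same checks outside the library, the list above can be approximated with the standard library alone. This is a sketch under assumptions: check_query_config is a hypothetical name, and the error messages are illustrative rather than the ones the library produces.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("metadata", "model_to_collections", "collection_to_models")
REQUIRED_METADATA = ("total_models", "total_collections")


def check_query_config(path: Path) -> dict:
    """Return the parsed config, or raise ValueError for the first problem found."""
    if not path.is_file():
        raise ValueError(f"{path} does not exist")
    try:
        config = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError) as exc:
        raise ValueError(f"could not read {path}: {exc}") from exc
    if not isinstance(config, dict):
        raise ValueError("top-level JSON value must be an object")
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing required keys: {', '.join(missing)}")
    metadata = config["metadata"] if isinstance(config["metadata"], dict) else {}
    missing_meta = [key for key in REQUIRED_METADATA if key not in metadata]
    if missing_meta:
        raise ValueError(f"missing metadata fields: {', '.join(missing_meta)}")
    return config


# Demo: one minimal valid file and one file missing two required keys.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    good.write_text(json.dumps({
        "metadata": {"total_models": 0, "total_collections": 0},
        "model_to_collections": {},
        "collection_to_models": {},
    }))
    bad = Path(tmp) / "bad.json"
    bad.write_text(json.dumps({"metadata": {}}))

    loaded = check_query_config(good)
    try:
        check_query_config(bad)
        error = None
    except ValueError as exc:
        error = str(exc)

print(error)
```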
Validation Errors¶
from indexer.vectorize_lib import load_query_config
from pathlib import Path
try:
config = load_query_config(Path("query_config.json"))
except ValueError as e:
# Possible errors:
# - "Query config file <path> does not exist"
# - "Failed to read query config: <json error>"
# - "Query config is missing required keys: ..."
print(f"Config error: {e}")
Advanced Configuration¶
Multi-Environment Configs¶
Generate separate configs for different environments:
# Production index
idxr vectorize generate-query-config \
--partition-out-dir prod/vector \
--output query_config_prod.json \
--collection-prefix prod
# Staging index
idxr vectorize generate-query-config \
--partition-out-dir staging/vector \
--output query_config_staging.json \
--collection-prefix staging
Partial Model Indexing¶
If you index models at different times, regenerate the query config after each indexing batch:
# After indexing Table model
idxr vectorize generate-query-config \
--partition-out-dir build/vector \
--output query_config.json
# After adding Field model
# Regenerate to include Field in the config
idxr vectorize generate-query-config \
--partition-out-dir build/vector \
--output query_config.json
Because each run rescans the resume state files, the regenerated config includes every model that has been indexed so far.
Version Control¶
Store query configs in version control alongside model registries:
project/
├── config/
│ ├── model_registry.yaml
│ ├── query_config_prod.json
│ └── query_config_staging.json
├── src/
└── README.md
Update the config after indexing changes:
# After re-indexing
idxr vectorize generate-query-config \
--partition-out-dir build/vector \
--output config/query_config_prod.json
# Commit changes
git add config/query_config_prod.json
git commit -m "Update query config after re-indexing"
Troubleshooting¶
No Collections Found¶
Problem: Query config has total_collections: 0
Causes:
- No resume state files in partition directories
- All models have started: false
- All models have collection_count: 0
Solution:
# Check partition directory structure
ls -la build/vector/partition_*/
# Check resume state content
cat build/vector/partition_00001/partition_00001_resume_state.json
Model Not in Config¶
Problem: Model exists in resume state but not in query config
Causes:
- Model has started: false
- Model has collection_count: 0
- Resume state file is malformed
Solution:
import json
from pathlib import Path
# Check resume state
resume_file = Path("build/vector/partition_00001/partition_00001_resume_state.json")
state = json.loads(resume_file.read_text())
for model, info in state.items():
print(f"{model}:")
print(f" started: {info.get('started')}")
print(f" collection_count: {info.get('collection_count')}")
Collection Count Mismatch¶
Problem: Config shows fewer collections than expected
Causes:
- Some partitions have no valid models (excluded from config)
- Collection prefix mismatch
Solution:
# Verify all partitions have resume states
find build/vector -name "*_resume_state.json"
# Check for empty or invalid states
for f in build/vector/partition_*/*_resume_state.json; do
echo "=== $f ==="
cat "$f"
done
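The manual checks above can also be scripted. The following sketch lists the partitions that would contribute no collection and gives a reason for each, using only the rules described on this page (started must be true, collection_count must be > 0). The function names has_valid_model and find_missing_partitions are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def has_valid_model(state) -> bool:
    """True if any model has started == true and collection_count > 0."""
    if not isinstance(state, dict):
        return False
    for info in state.values():
        if not isinstance(info, dict):
            continue
        count = info.get("collection_count")
        if info.get("started") is True and isinstance(count, int) and count > 0:
            return True
    return False


def find_missing_partitions(partition_out_dir: Path) -> dict:
    """Map partition name -> reason it would not appear in the config."""
    problems = {}
    partition_dirs = sorted(p for p in partition_out_dir.iterdir() if p.is_dir())
    for partition_dir in partition_dirs:
        state_files = sorted(partition_dir.glob("*_resume_state.json"))
        if not state_files:
            problems[partition_dir.name] = "no resume state file"
            continue
        valid = False
        for state_file in state_files:
            try:
                state = json.loads(state_file.read_text())
            except json.JSONDecodeError:
                continue  # malformed file counts as having no valid models
            valid = valid or has_valid_model(state)
        if not valid:
            problems[partition_dir.name] = "no valid models"
    return problems


# Demo: one healthy partition, one without a state file, one not started.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("partition_00001", "partition_00002", "partition_00003"):
        (root / name).mkdir()
    (root / "partition_00001" / "partition_00001_resume_state.json").write_text(
        json.dumps({"Table": {"started": True, "collection_count": 10}})
    )
    (root / "partition_00003" / "partition_00003_resume_state.json").write_text(
        json.dumps({"Table": {"started": False, "collection_count": 10}})
    )
    problems = find_missing_partitions(root)

print(problems)
```

To use it on a real build, call find_missing_partitions(Path("build/vector")) and compare the reported partitions with the collections missing from the config.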
Next Steps¶
- Review the API Reference for programmatic config access
- Check the Examples for query config usage patterns
- Read Getting Started for the complete workflow