Query Configuration

The query configuration file maps model names to the ChromaDB collections that contain them, so a query for a given model can be routed to only the relevant collections.

Generating Query Config

CLI Command

idxr vectorize generate-query-config \
  --partition-out-dir build/vector \
  --output query_config.json \
  --model my_package.registry:MODEL_REGISTRY \
  --collection-prefix my-index

Arguments:

  • --partition-out-dir (required): Directory containing partition subdirectories with resume state files
  • --output (required): Path to write query config JSON file
  • --model (required): Python module path to model registry (e.g., my_package.registry:MODEL_REGISTRY)
  • --collection-prefix (optional): Prefix to add to collection names

Python API

from indexer.vectorize_lib import generate_query_config
from pathlib import Path

config = generate_query_config(
    partition_out_dir=Path("build/vector"),
    output_path=Path("query_config.json"),
    collection_prefix="ecc-prod",
)

Config File Structure

Full Example

{
  "metadata": {
    "generated_at": "2025-10-31T10:00:00.123456",
    "total_models": 3,
    "total_collections": 5,
    "collection_prefix": "ecc-prod"
  },
  "model_to_collections": {
    "Table": {
      "collections": ["ecc-prod_partition_00001", "ecc-prod_partition_00002"],
      "total_documents": 450000,
      "partitions": ["partition_00001", "partition_00002"]
    },
    "Field": {
      "collections": ["ecc-prod_partition_00002", "ecc-prod_partition_00003", "ecc-prod_partition_00005"],
      "total_documents": 680000,
      "partitions": ["partition_00002", "partition_00003", "partition_00005"]
    },
    "Domain": {
      "collections": ["ecc-prod_partition_00004", "ecc-prod_partition_00005"],
      "total_documents": 120000,
      "partitions": ["partition_00004", "partition_00005"]
    }
  },
  "collection_to_models": {
    "ecc-prod_partition_00001": ["Table"],
    "ecc-prod_partition_00002": ["Table", "Field"],
    "ecc-prod_partition_00003": ["Field"],
    "ecc-prod_partition_00004": ["Domain"],
    "ecc-prod_partition_00005": ["Field", "Domain"]
  }
}
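
The three top-level sections of a config can be cross-checked against each other. The sketch below assumes that total_models and total_collections equal the number of keys in the two mappings, which holds for the example above but is not stated as a guarantee. The function name consistency_problems is hypothetical and not part of indexer.vectorize_lib.

```python
def consistency_problems(config: dict) -> list:
    """Return human-readable inconsistencies found in a query config dict."""
    problems = []
    metadata = config.get("metadata", {})
    model_to_collections = config.get("model_to_collections", {})
    collection_to_models = config.get("collection_to_models", {})

    # Assumption: each count equals the number of keys in its mapping.
    if metadata.get("total_models") != len(model_to_collections):
        problems.append("total_models does not match model_to_collections")
    if metadata.get("total_collections") != len(collection_to_models):
        problems.append("total_collections does not match collection_to_models")

    # Rebuild the inverse mapping from model_to_collections and compare it
    # with the declared collection_to_models (order-insensitive).
    rebuilt = {}
    for model, entry in model_to_collections.items():
        for collection in entry.get("collections", []):
            rebuilt.setdefault(collection, set()).add(model)
    declared = {name: set(models) for name, models in collection_to_models.items()}
    if rebuilt != declared:
        problems.append("collection_to_models is not the inverse of model_to_collections")
    return problems


# Trimmed-down config with the same shape as the full example.
example = {
    "metadata": {"total_models": 2, "total_collections": 2},
    "model_to_collections": {
        "Table": {"collections": ["partition_00001", "partition_00002"]},
        "Field": {"collections": ["partition_00002"]},
    },
    "collection_to_models": {
        "partition_00001": ["Table"],
        "partition_00002": ["Table", "Field"],
    },
}
print(consistency_problems(example))  # prints []
```

An empty list means the counts and both mappings agree. Any returned message suggests the file was edited by hand or generated from a different set of partitions.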

Field Descriptions

metadata

  • generated_at: ISO 8601 timestamp of config generation
  • total_models: Number of distinct models with indexed documents
  • total_collections: Number of ChromaDB collections (partitions)
  • collection_prefix: Optional prefix applied to collection names

model_to_collections

Mapping from model name to collection information:

  • collections: List of collection names containing this model
  • total_documents: Total document count across all collections for this model
  • partitions: List of partition names (identical to collections unless a collection prefix is used)

collection_to_models

Inverse mapping from collection name to list of model names it contains.
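
As a minimal sketch of how a client might use model_to_collections, the snippet below resolves a list of model names to the de-duplicated set of collections to query. The helper name collections_for_models is hypothetical and not part of indexer.vectorize_lib. It relies only on the config structure described above.

```python
from typing import Iterable


def collections_for_models(config: dict, models: Iterable[str]) -> list:
    """Return the sorted, de-duplicated collections holding any of `models`."""
    mapping = config.get("model_to_collections", {})
    names = set()
    for model in models:
        entry = mapping.get(model)
        if entry is None:
            # Unknown model: nothing was indexed for it, so skip it.
            continue
        names.update(entry.get("collections", []))
    return sorted(names)


# Trimmed-down config with the same shape as the documented structure.
config = {
    "model_to_collections": {
        "Table": {"collections": ["partition_00001", "partition_00002"]},
        "Field": {"collections": ["partition_00002", "partition_00003"]},
    }
}

print(collections_for_models(config, ["Table", "Field"]))
# prints ['partition_00001', 'partition_00002', 'partition_00003']
```

Because partition_00002 holds both models, it appears only once in the result, so a query spanning several models does not hit the same collection twice.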

Resume State Requirements

The query config is generated by scanning *_resume_state.json files in each partition directory. These files must contain:

{
  "ModelName": {
    "started": true,
    "complete": true,
    "collection_count": 100000,
    "documents_indexed": 100000,
    "indexed_at": "2025-10-31T10:00:00"
  }
}

Required fields:

  • started: Must be true (models not started are excluded)
  • collection_count: Must be > 0 (models with no documents are excluded)

Optional fields:

  • complete: Indicates if indexing completed (doesn't affect inclusion)
  • documents_indexed: Actual documents indexed (may differ from collection_count)
  • indexed_at: Timestamp of indexing

Collection Filtering Logic

When generating the query config:

  1. Scan partition directories for *_resume_state.json files
  2. Filter models where started == true and collection_count > 0
  3. Build mappings for model-to-collections and collection-to-models
  4. Sort collections for deterministic ordering
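
The steps above can be sketched in memory as follows. This is an illustration of the described logic, not the library's implementation. It takes already-parsed resume states keyed by partition name, and the function name build_mappings is hypothetical. total_documents is left out because the source does not say which resume state field feeds it.

```python
def build_mappings(states_by_partition: dict) -> dict:
    """Build both mappings from {partition_name: parsed_resume_state}."""
    model_to_collections = {}
    collection_to_models = {}

    # Iterating in sorted order keeps every list deterministic (step 4).
    for partition in sorted(states_by_partition):
        state = states_by_partition[partition]
        if not isinstance(state, dict):
            continue
        for model in sorted(state):
            info = state[model]
            if not isinstance(info, dict):
                continue
            count = info.get("collection_count", 0)
            # Step 2: keep only started models with a positive count.
            if info.get("started") is not True:
                continue
            if not isinstance(count, int) or count <= 0:
                continue
            # Step 3: record the pair in both directions.
            model_to_collections.setdefault(model, []).append(partition)
            collection_to_models.setdefault(partition, []).append(model)

    return {
        "model_to_collections": model_to_collections,
        "collection_to_models": collection_to_models,
    }


states = {
    "partition_00002": {
        "Table": {"started": True, "collection_count": 10},
        "Field": {"started": True, "collection_count": 5},
    },
    "partition_00001": {
        "Table": {"started": True, "collection_count": 3},
        "Domain": {"started": False, "collection_count": 7},
    },
    "partition_00003": {"Field": {"started": True, "collection_count": 0}},
}
result = build_mappings(states)
print(result["model_to_collections"])
print(result["collection_to_models"])
```

In this example partition_00003 contributes nothing, because its only model has collection_count 0, so it does not appear in collection_to_models at all.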

Excluded Models

Models are excluded if:

  • started is false or missing
  • collection_count is 0 or missing
  • Resume state file is malformed
  • Model state is not a dictionary
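
When a model is unexpectedly absent, it can help to see which of these rules applies. The sketch below mirrors the four exclusion rules on the raw text of a resume state file. The names exclusion_reason and diagnose_resume_state are hypothetical helpers for illustration, and the real generator may apply the checks in a different order.

```python
import json
from typing import Optional


def exclusion_reason(model_state: object) -> Optional[str]:
    """Return why a model would be excluded, or None if it qualifies."""
    if not isinstance(model_state, dict):
        return "model state is not a dictionary"
    if model_state.get("started") is not True:
        return "started is false or missing"
    count = model_state.get("collection_count")
    if not isinstance(count, int) or count <= 0:
        return "collection_count is 0 or missing"
    return None


def diagnose_resume_state(text: str) -> dict:
    """Map each model in a resume state document to its exclusion reason."""
    try:
        state = json.loads(text)
    except json.JSONDecodeError:
        return {"<file>": "resume state file is malformed"}
    if not isinstance(state, dict):
        return {"<file>": "resume state file is malformed"}
    return {model: exclusion_reason(info) for model, info in state.items()}


sample = """
{
  "Table": {"started": true, "collection_count": 100},
  "Field": {"started": true, "collection_count": 0},
  "Domain": "oops",
  "Index": {"collection_count": 5}
}
"""
for model, reason in diagnose_resume_state(sample).items():
    print(f"{model}: {reason or 'included'}")
```

Here only Table would be included. Field has a zero count, Domain is not a dictionary, and Index is missing started.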

Using Collection Prefix

When indexing with a collection prefix, generate the query config with the same prefix so that the collection names match:

Indexing with Prefix

idxr vectorize index \
  --collection-prefix ecc-prod \
  # ... other args

Query Config with Prefix

idxr vectorize generate-query-config \
  --collection-prefix ecc-prod \
  # ... other args

The prefix is applied to collection names in the config:

{
  "model_to_collections": {
    "Table": {
      "collections": ["ecc-prod_partition_00001", "ecc-prod_partition_00002"]
    }
  }
}
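
A prefix mismatch can be caught before querying. The sketch below assumes the prefix and partition name are joined with an underscore, which is inferred from the example above rather than documented. Both function names are hypothetical.

```python
from typing import Optional


def collection_name(partition: str, prefix: Optional[str] = None) -> str:
    """Join prefix and partition name; the "_" separator is an assumption."""
    return f"{prefix}_{partition}" if prefix else partition


def collections_missing_prefix(config: dict, prefix: str) -> list:
    """List collections in the config that do not start with `<prefix>_`."""
    expected = f"{prefix}_"
    names = {
        name
        for entry in config.get("model_to_collections", {}).values()
        for name in entry.get("collections", [])
    }
    return sorted(name for name in names if not name.startswith(expected))


config = {
    "model_to_collections": {
        "Table": {
            "collections": ["ecc-prod_partition_00001", "ecc-prod_partition_00002"]
        }
    }
}
print(collection_name("partition_00001", "ecc-prod"))  # ecc-prod_partition_00001
print(collections_missing_prefix(config, "ecc-prod"))  # []
print(collections_missing_prefix(config, "staging"))
```

A non-empty result from collections_missing_prefix suggests the config was generated with a different prefix than the one the caller expects.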

Config Validation

When loading a query config, the following validations are performed:

  1. File exists and is readable
  2. Valid JSON format
  3. Required keys present: metadata, model_to_collections, collection_to_models
  4. Metadata fields include total_models and total_collections

Validation Errors

from indexer.vectorize_lib import load_query_config
from pathlib import Path

try:
    config = load_query_config(Path("query_config.json"))
except ValueError as e:
    # Possible errors:
    # - "Query config file <path> does not exist"
    # - "Failed to read query config: <json error>"
    # - "Query config is missing required keys: ..."
    print(f"Config error: {e}")

Advanced Configuration

Multi-Environment Configs

Generate separate configs for different environments:

# Production index
idxr vectorize generate-query-config \
  --partition-out-dir prod/vector \
  --output query_config_prod.json \
  --collection-prefix prod

# Staging index
idxr vectorize generate-query-config \
  --partition-out-dir staging/vector \
  --output query_config_staging.json \
  --collection-prefix staging

Partial Model Indexing

If you index models at different times, regenerate the query config after each indexing batch:

# After indexing Table model
idxr vectorize generate-query-config \
  --partition-out-dir build/vector \
  --output query_config.json

# After adding Field model
# Regenerate to include Field in the config
idxr vectorize generate-query-config \
  --partition-out-dir build/vector \
  --output query_config.json

Each regeneration rescans the resume state files, so the config includes every model that has been indexed so far.
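
To confirm that a regeneration picked up the new batch, the model sets of the old and new configs can be compared. This is a small sketch with a hypothetical helper name, model_changes. It only looks at the keys of model_to_collections.

```python
def model_changes(old_config: dict, new_config: dict) -> tuple:
    """Return (added, removed) model names between two query configs."""
    old_models = set(old_config.get("model_to_collections", {}))
    new_models = set(new_config.get("model_to_collections", {}))
    return sorted(new_models - old_models), sorted(old_models - new_models)


# Config generated after indexing only the Table model.
before = {"model_to_collections": {"Table": {"collections": ["partition_00001"]}}}
# Config regenerated after the Field model was indexed as well.
after = {
    "model_to_collections": {
        "Table": {"collections": ["partition_00001"]},
        "Field": {"collections": ["partition_00002"]},
    }
}

added, removed = model_changes(before, after)
print("added:", added)      # added: ['Field']
print("removed:", removed)  # removed: []
```

An unexpected entry under removed is worth investigating before the new config replaces the old one.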

Version Control

Store query configs in version control alongside model registries:

project/
├── config/
│   ├── model_registry.yaml
│   ├── query_config_prod.json
│   └── query_config_staging.json
├── src/
└── README.md

Update the config after indexing changes:

# After re-indexing
idxr vectorize generate-query-config \
  --partition-out-dir build/vector \
  --output config/query_config_prod.json

# Commit changes
git add config/query_config_prod.json
git commit -m "Update query config after re-indexing"

Troubleshooting

No Collections Found

Problem: Query config has total_collections: 0

Causes:

  • No resume state files in partition directories
  • All models have started: false
  • All models have collection_count: 0

Solution:

# Check partition directory structure
ls -la build/vector/partition_*/

# Check resume state content
cat build/vector/partition_00001/partition_00001_resume_state.json

Model Not in Config

Problem: Model exists in resume state but not in query config

Causes:

  • Model has started: false
  • Model has collection_count: 0
  • Resume state file is malformed

Solution:

import json
from pathlib import Path

# Check resume state
resume_file = Path("build/vector/partition_00001/partition_00001_resume_state.json")
state = json.loads(resume_file.read_text())

for model, info in state.items():
    print(f"{model}:")
    print(f"  started: {info.get('started')}")
    print(f"  collection_count: {info.get('collection_count')}")

Collection Count Mismatch

Problem: Config shows fewer collections than expected

Causes:

  • Some partitions have no valid models (excluded from config)
  • Collection prefix mismatch

Solution:

# Verify all partitions have resume states
find build/vector -name "*_resume_state.json"

# Check for empty or invalid states
for f in build/vector/partition_*/*_resume_state.json; do
    echo "=== $f ==="
    cat "$f"
done

Next Steps