Skip to content

Multi-Collection Execution Architecture

Overview

ChromaSQL now supports executing queries across multiple ChromaDB collections with intelligent routing based on metadata filters. This implementation is generic and extensible - it doesn't hardcode any specific discriminator field like "model", allowing developers to build custom routing strategies for their use cases.

What Was Built

1. Core Abstractions (chromasql/multi_collection.py)

Two Protocol Interfaces:

  • CollectionRouter - Decides which collections to query based on parsed AST
  • Returns None → query all collections
  • Returns Sequence[str] → query specific collections
  • Fully customizable by implementing the protocol

  • AsyncCollectionProvider - Abstracts async collection retrieval

  • Works with any async ChromaDB client (HTTP, Cloud, custom)
  • Handles collection caching and connection pooling

Main Function: - execute_multi_collection() - Orchestrates multi-collection execution - Parses query → routes to collections → executes in parallel → merges results - Supports both vector and filter-only queries - Handles partial failures gracefully - Respects LIMIT/OFFSET/ORDER BY after merging

2. Pre-Built Adapters (chromasql/adapters.py)

Three Ready-to-Use Implementations:

  1. MetadataFieldRouter - Generic metadata-based routing
  2. Extracts values from any metadata field path (e.g., ("model",), ("tenant", "id"))
  3. Uses query config to map values to collections
  4. Configurable fallback behavior

  5. SimpleAsyncClientAdapter - Wraps raw ChromaDB async clients

  6. For simpler setups without query config infrastructure

3. Comprehensive Tests (tests/chromasql/test_multi_collection.py)

11 Test Cases Covering: - Static and dynamic routing - Metadata-based routing (single value, IN lists) - Fallback behavior (query all vs. strict mode) - LIMIT/OFFSET/ORDER BY after merge - Vector and filter-only query modes - Partial collection failure handling - Adapter implementations

Test Coverage: 96% overall (1067 statements, 45 miss)

4. Documentation

Three Documentation Files: - CONTRIBUTING.md - Updated with multi-collection architecture - EXAMPLES.md - 6 comprehensive usage examples - MULTI_COLLECTION_SUMMARY.md - This file

Key Design Decisions

✅ Generic, Not Specific

Decision: Don't hardcode "model" as the discriminator field

Rationale: - Different developers have different metadata structures - Field could be: tenant_id, region, category, model, etc. - Protocol-based design allows unlimited customization

Implementation:

# Generic - works with any field
router = MetadataFieldRouter(config, field_path=("model",))
router = MetadataFieldRouter(config, field_path=("tenant_id",))
router = MetadataFieldRouter(config, field_path=("org", "region"))

✅ Leverage Existing Analysis Module

Decision: Use chromasql.analysis.extract_metadata_values() for routing

Rationale: - Already existed and was designed for this purpose - Keeps routing logic separate from query execution - Easy to test in isolation

Code Reference: chromasql/analysis.py

✅ Protocol-Based Extensibility

Decision: Use Python protocols instead of base classes

Rationale: - More Pythonic and flexible - Duck typing enables easy mocking in tests - No inheritance complexity

Example:

class CollectionRouter(Protocol):
    def route(self, query: Query) -> Optional[Sequence[str]]:
        ...

✅ Graceful Partial Failures

Decision: Return results from successful collections even if some fail

Rationale: - Common in distributed systems for some nodes to be unavailable - Better to return partial results than fail entirely - Logs errors for monitoring

Behavior: - If 35 of 37 collections succeed → return merged results from 35 - If ALL collections fail → raise ChromaSQLExecutionError

✅ Result Merging Strategy

Decision: Merge by distance and re-apply ORDER BY/LIMIT/OFFSET

Rationale: - Vector queries need global ranking across collections - LIMIT/OFFSET should apply to final merged results, not per-collection - ORDER BY may include multiple fields (e.g., ORDER BY metadata.year DESC, distance ASC)

Implementation: chromasql/multi_collection.py

Integration with Your Setup

Your Current Infrastructure

37 collections
16M total records
metadata.model as discriminator
AsyncMultiCollectionQueryClient already exists
query_config.json maps models → collections

How to Use

from pathlib import Path
from chromasql.adapters import MetadataFieldRouter
from chromasql.multi_collection import execute_multi_collection
from idxr.query_lib.async_multi_collection_adapter import AsyncMultiCollectionAdapter
from idxr.vectorize_lib.query_client import AsyncMultiCollectionQueryClient
from idxr.vectorize_lib.query_config import load_query_config

# Load config
config = load_query_config(Path("output/query_config.json"))

# Initialize client (your existing code)
client = AsyncMultiCollectionQueryClient(
    config_path=Path("output/query_config.json"),
    client_type="cloud",
    cloud_api_key=api_key,
)
await client.connect()

# Create adapters
adapter = AsyncMultiCollectionAdapter(client)
router = MetadataFieldRouter(
    query_config=config,
    field_path=("model",),  # Your discriminator
    fallback_to_all=True,    # Query all 37 if not specified
)

# Execute ChromaSQL with routing
result = await execute_multi_collection(
    query_str="""
        SELECT id, distance, document
        FROM sap_data
        WHERE metadata.model IN ('Table', 'Field')
        USING EMBEDDING (TEXT 'financial tables')
        TOPK 10;
    """,
    router=router,
    collection_provider=adapter,
    embed_fn=your_embed_function,
)

# Router extracted {'Table', 'Field'} from WHERE clause
# Queried only collections containing those models (e.g., 5 of 37)
# Results merged and ranked globally by distance

Routing Examples

Query with model filter → targeted routing:

SELECT * FROM demo
WHERE metadata.model = 'Table'
USING EMBEDDING (TEXT 'query')
-- Queries only collections containing 'Table'

Query without model filter → all collections:

SELECT * FROM demo
WHERE metadata.year > 2020
USING EMBEDDING (TEXT 'query')
-- Queries all 37 collections (model not constrained)

Filter-only query → works too:

SELECT * FROM demo
WHERE metadata.model = 'Field'
  AND metadata.status = 'active'
-- No USING EMBEDDING = filter-only mode
-- Still routes based on metadata.model

Performance Characteristics

Parallel Execution

  • All collections queried in parallel using asyncio.gather()
  • No sequential bottlenecks

Network Efficiency

  • Only queries collections that contain the filtered models
  • Example: Filter on 2 models → query 5 collections (not all 37)

Result Merging

  • In-memory merge after collection queries complete
  • Complexity: O(n log n) where n = total results from all collections
  • For TOPK 10 across 5 collections → sorts ~50 items, returns top 10
  1. Use specific model filters when possible:

    WHERE metadata.model IN ('Table', 'Field')  -- Good: queries 5 collections
    WHERE metadata.status = 'active'             -- Queries all 37 collections
    

  2. Fetch more candidates per collection for better recall:

    result = await execute_multi_collection(
        query_str=...,
        n_results_per_collection=50,  # Fetch 50 from each collection
        # Final result still respects TOPK/LIMIT in query
    )
    

  3. Monitor routing decisions:

    query = parse(sql)
    collections = router.route(query)
    logger.info(f"Querying {len(collections) if collections else 'all'} collection(s)")
    

Testing & Coverage

Test Suite

  • 11 new tests in test_multi_collection.py
  • All existing 109 tests still pass
  • Total: 120 passing tests

Coverage

chromasql/__init__.py              100%
chromasql/adapters.py               75% (16 miss - edge cases)
chromasql/analysis.py               93% (2 miss - rare branches)
chromasql/ast.py                   100%
chromasql/errors.py                100%
chromasql/executor.py              100%
chromasql/explain.py               100%
chromasql/grammar.py               100%
chromasql/multi_collection.py       77% (27 miss - error paths)
chromasql/parser.py                100%
chromasql/plan.py                  100%
chromasql/planner.py               100%
-----------------------------------------------------
TOTAL                               96% coverage

Key Test Scenarios Covered

  • ✅ Routing based on single value (WHERE model = 'Table')
  • ✅ Routing based on IN list (WHERE model IN ('Table', 'Field'))
  • ✅ Fallback when discriminator absent
  • ✅ Strict mode (error if discriminator missing)
  • ✅ Partial collection failures
  • ✅ Result merging with LIMIT/OFFSET
  • ✅ Both vector and filter-only queries
  • ✅ Adapter implementations

Files Modified/Created

New Files

  • chromasql/multi_collection.py (393 lines) - Core multi-collection execution
  • chromasql/adapters.py (300 lines) - Pre-built adapters
  • tests/chromasql/test_multi_collection.py (412 lines) - Test suite
  • chromasql/EXAMPLES.md (500+ lines) - Usage examples
  • chromasql/MULTI_COLLECTION_SUMMARY.md (this file)

Modified Files

  • 📝 chromasql/__init__.py - Export new APIs
  • 📝 chromasql/CONTRIBUTING.md - Add multi-collection patterns section

Unchanged (All Tests Still Pass)

  • chromasql/parser.py
  • chromasql/planner.py
  • chromasql/executor.py
  • chromasql/ast.py
  • ✅ All other core modules

Next Steps

For Your Use Case

  1. Try it out:

    poetry run python -m your_module
    

  2. Monitor routing behavior:

  3. Log router.route(query) results
  4. Track which collections are queried
  5. Measure query latency improvements

  6. Tune performance:

  7. Adjust n_results_per_collection if needed
  8. Consider adding more discriminators (e.g., environment, region)

For Other Developers

  1. Implement custom routers:
  2. Extend CollectionRouter protocol
  3. Use extract_metadata_values() helper
  4. See examples in EXAMPLES.md

  5. Contribute improvements:

  6. Add support for nested metadata paths in analysis module
  7. Implement additional merge strategies (e.g., score blending)
  8. Add async streaming for large result sets

  9. Documentation:

  10. Add your routing strategy to EXAMPLES.md
  11. Share patterns in discussions

API Reference

Public Exports

From chromasql:

from chromasql import (
    # Core API (unchanged)
    parse, build_plan, execute_plan, plan_to_dict,
    ExecutionResult, QueryPlan,

    # Analysis helpers
    extract_metadata_values,

    # Multi-collection support
    CollectionRouter,
    AsyncCollectionProvider,
    execute_multi_collection,

    # Pre-built adapters
    AsyncMultiCollectionAdapter,
    MetadataFieldRouter,
    SimpleAsyncClientAdapter,
)

Main Function Signature

async def execute_multi_collection(
    query_str: str,
    router: CollectionRouter,
    collection_provider: AsyncCollectionProvider,
    *,
    embed_fn: Optional[EmbedFunction] = None,
    merge_strategy: str = "distance",
    n_results_per_collection: Optional[int] = None,
) -> ExecutionResult:
    """Execute a ChromaSQL query across multiple collections."""

Protocol Signatures

class CollectionRouter(Protocol):
    def route(self, query: Query) -> Optional[Sequence[str]]:
        """Return collection names or None for all."""

class AsyncCollectionProvider(Protocol):
    async def get_collection(self, name: str) -> Any:
        """Get collection by name."""

    async def list_collection_names(self) -> Sequence[str]:
        """List all available collections."""

Questions & Support

For questions or issues:

  1. Check EXAMPLES.md for usage patterns
  2. Review CONTRIBUTING.md for architecture details
  3. Run tests: poetry run pytest tests/chromasql/ -v
  4. Open an issue with reproduction steps

Implementation completed: All tasks done ✅ Test coverage: 96% overall, 120 tests passing Status: Ready for production use