Problem
- Adding a new provider is harder than it should be.
- Extraction and transformation are coupled too early in the pipeline.
Today, the sync path requires a provider implementation, source config, auth handling, and a FOCUS mapper before a provider can be meaningfully tested. This makes it difficult to answer the first important question: “Can this provider authenticate and pull raw billing data?”
The project already has a raw billing layer, but the pipeline does not treat raw extraction as a standalone capability.
As a result, provider development feels all-or-nothing: a new provider must be wired through the full FOCUS transform path before its pull behavior can be validated.
Proposed solution
Introduce a clearer separation between provider extraction and FOCUS transformation.
Ideally, providers should support an extract-only path first:
- Validate provider config/auth.
- Test connection.
- Discover or define raw billing sources.
- Pull a small sample of raw records.
- Store raw records in raw_billing_data.
- Optionally transform raw records to FOCUS when a mapper exists.
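The extract-only steps above could be exercised end to end with something like the sketch below. It is illustrative only: RawProviderRecord is given a guessed shape, and InMemoryRawStore, FakeConnector, and extract_only() are hypothetical names standing in for the real raw_billing_data writer, a real provider, and the pipeline entry point.

```python
import asyncio
from dataclasses import dataclass, field
from datetime import date


@dataclass
class RawProviderRecord:
    # Assumed shape: the issue names this type but does not define its fields.
    provider: str
    payload: dict


@dataclass
class InMemoryRawStore:
    # Hypothetical stand-in for writes to the raw_billing_data table.
    rows: list = field(default_factory=list)

    def save(self, records):
        self.rows.extend(records)
        return len(records)


class FakeConnector:
    """Hypothetical connector that returns canned records instead of calling an API."""

    async def test_connection(self) -> bool:
        return True

    async def pull(self, start_date, end_date, limit=None):
        records = [
            RawProviderRecord("fake", {"day": i, "cost": 1.5 * i})
            for i in range(1, 21)
        ]
        return records[:limit] if limit is not None else records


async def extract_only(connector, store, start_date, end_date, sample_size=5):
    # Validate config/auth and test the connection before pulling anything.
    if not await connector.test_connection():
        raise RuntimeError("connection test failed")
    # Pull a small sample of raw records.
    records = await connector.pull(start_date, end_date, limit=sample_size)
    # Persist the raw payloads; no mapper is involved at any point.
    return store.save(records)


store = InMemoryRawStore()
saved = asyncio.run(
    extract_only(FakeConnector(), store, date(2024, 1, 1), date(2024, 1, 31))
)
```

Because the function only depends on test_connection() and pull(), a new provider could be validated with this path before any FOCUS mapping exists.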
This could be modeled with two separate contracts:
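    # Extraction contract: auth check and raw pull, no FOCUS knowledge.
    class ProviderConnector:
        async def test_connection(self) -> ProviderTestResult: ...
        async def pull(self, start_date, end_date, limit=None) -> list[RawProviderRecord]: ...

    # Transformation contract: one raw record in, zero or more FOCUS records out.
    class FocusMapper:
        def map_record(self, raw: RawProviderRecord) -> list[FocusRecord]: ...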
The pipeline should allow:
- extract_only: run provider auth/source/pull and save raw data.
- transform_only: transform existing raw records using a mapper.
- full sync: extract, transform, and load FOCUS records.
A provider without a mapper should still be valid as an extract-only provider. The registry could expose capabilities such as can_extract and can_transform, making provider readiness explicit.
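One possible way to make readiness explicit is to derive the capability flags from what a provider has registered. The sketch below is an assumption about how that could look: ProviderEntry, connector_class, supported_modes(), and the "full_sync" mode string are hypothetical names, not the existing registry API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderEntry:
    # Hypothetical registry entry; only `name` is mandatory.
    name: str
    connector_class: Optional[type] = None
    mapper_class: Optional[type] = None

    @property
    def can_extract(self) -> bool:
        return self.connector_class is not None

    @property
    def can_transform(self) -> bool:
        return self.mapper_class is not None

    def supported_modes(self) -> list:
        # A mode is offered only when every capability it needs is present.
        modes = []
        if self.can_extract:
            modes.append("extract_only")
        if self.can_transform:
            modes.append("transform_only")
        if self.can_extract and self.can_transform:
            modes.append("full_sync")
        return modes


class DummyConnector:
    """Placeholder connector class for the example."""


class DummyMapper:
    """Placeholder mapper class for the example."""


new_provider = ProviderEntry("newcloud", connector_class=DummyConnector)
mature_provider = ProviderEntry("oldcloud", DummyConnector, DummyMapper)
```

With this shape, a provider that has no mapper yet is still a valid registry entry and simply reports extract_only as its only mode.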
Alternatives considered
Keep the current provider registry and decorator structure, but make mapper_class optional during extraction. This is a smaller change and may be a good first step, but it still leaves provider pull behavior hidden behind DLT source config objects.
Another option is to keep using get_sources() as the provider contract and add a dedicated “preview extraction” API around the existing extractors. This would avoid a larger interface change, but it may still be harder to unit test than a direct pull()/pull_sample() provider contract.
A more incremental approach would be:
- Allow pipeline_context() to create providers and sources without requiring a mapper.
- Move mapper lookup into the transform stage.
- Add tests for extract-only providers.
- Later introduce a formal ProviderConnector abstraction.
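Moving the mapper lookup into the transform stage might look roughly like the following. This is a simplified sketch, not the current orchestration code: MAPPERS, MapperNotFound, extract_stage(), and transform_stage() are hypothetical names, and raw records are plain dicts for brevity.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical mapper registry: provider name -> function mapping one raw
# record to zero or more FOCUS-shaped records.
MAPPERS: Dict[str, Callable[[dict], List[dict]]] = {}


class MapperNotFound(LookupError):
    """Raised by the transform stage only; extraction never triggers it."""


def extract_stage(pull: Callable[[], Iterable[dict]]) -> List[dict]:
    # No mapper lookup here, so a mapper-less provider can still extract.
    return list(pull())


def transform_stage(provider_name: str, raw_records: List[dict]) -> List[dict]:
    # The mapper is resolved lazily, at the point where it is first needed.
    mapper = MAPPERS.get(provider_name)
    if mapper is None:
        raise MapperNotFound(f"no FOCUS mapper registered for {provider_name!r}")
    focus_records = []
    for raw in raw_records:
        focus_records.extend(mapper(raw))
    return focus_records


# Extraction succeeds even though "newcloud" has no mapper registered yet.
raw = extract_stage(lambda: [{"sku": "vm-small", "cost": "3.20"}])
```

An extract-only test can then assert that extract_stage() succeeds while transform_stage() raises MapperNotFound, which documents the staged behavior directly.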
Additional context
The codebase already has the right data model split:
- raw_billing_data stores extracted provider data before transformation.
- billing_data stores transformed FOCUS records.
However, the current orchestration requires a mapper before extraction can run. That makes it hard to test a new provider in stages.
Desired provider onboarding flow:
- Add provider auth/config metadata.
- Confirm connection works.
- Pull 5-10 raw records.
- Save raw payloads.
- Add fixture-based mapper tests.
- Enable full sync once transform is ready.
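The fixture-based mapper test step could be as small as the sketch below, which replays a saved raw payload through a mapper with no network access. Everything provider-specific here is invented for illustration: the payload fields, ExampleMapper, and the three-field FocusRecord are assumptions, and the real FocusRecord model will have more columns.

```python
import json
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical raw payload, as it might be saved from an extract-only run.
RAW_FIXTURE = json.loads("""
{
  "currency": "USD",
  "line_items": [
    {"service": "compute", "amount": "12.50"},
    {"service": "storage", "amount": "0.75"}
  ]
}
""")


@dataclass
class FocusRecord:
    # Reduced, assumed subset of FOCUS columns for the example.
    service_name: str
    billed_cost: Decimal
    billing_currency: str


class ExampleMapper:
    """Hypothetical mapper: one raw payload fans out to one record per line item."""

    def map_record(self, raw: dict) -> list:
        return [
            FocusRecord(
                service_name=item["service"],
                billed_cost=Decimal(item["amount"]),
                billing_currency=raw["currency"],
            )
            for item in raw["line_items"]
        ]


def test_mapper_against_fixture():
    records = ExampleMapper().map_record(RAW_FIXTURE)
    assert [r.service_name for r in records] == ["compute", "storage"]
    assert sum(r.billed_cost for r in records) == Decimal("13.25")
    assert {r.billing_currency for r in records} == {"USD"}


test_mapper_against_fixture()
```

Since the fixture comes from raw records that were already pulled and saved, the mapper can be developed and reviewed independently of provider credentials.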