feat: decouple provider extraction from transformation #151

@oskarkocol

Problem

  • adding a new provider is harder than it should be
  • extraction and transformation are coupled too early in the pipeline

Today, the sync path requires a provider implementation, source config, auth handling, and FOCUS mapper before a provider can be meaningfully tested. This makes it difficult to answer the first important question: “Can this provider authenticate and pull raw billing data?”

The project already has a raw billing layer, but the pipeline does not treat raw extraction as a standalone capability.

As a result, provider development feels all-or-nothing: a new provider must be wired through the full FOCUS transform path before its pull behavior can be validated.

Proposed solution

Introduce a clearer separation between provider extraction and FOCUS transformation.

Ideally, providers should support an extract-only path first:

  1. Validate provider config/auth.
  2. Test connection.
  3. Discover or define raw billing sources.
  4. Pull a small sample of raw records.
  5. Store raw records in raw_billing_data.
  6. Optionally transform raw records to FOCUS when a mapper exists.

This could be modeled with two separate contracts:

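class ProviderConnector:
    # Cheap auth/connectivity check; no billing data is pulled.
    async def test_connection(self) -> ProviderTestResult: ...
    # Pull raw records for a date range; limit caps the sample size.
    async def pull(self, start_date, end_date, limit=None) -> list[RawProviderRecord]: ...

class FocusMapper:
    # Map one raw record to a list of FOCUS records.
    def map_record(self, raw: RawProviderRecord) -> list[FocusRecord]: ...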
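
A minimal runnable sketch of these two contracts, assuming they are expressed as typing.Protocol classes. The dataclass fields, InMemoryConnector, and DemoMapper are hypothetical stand-ins for illustration and are not taken from the codebase:

```python
import asyncio
from dataclasses import dataclass
from datetime import date
from typing import Any, Optional, Protocol


@dataclass
class ProviderTestResult:  # fields are assumptions
    ok: bool
    message: str = ""


@dataclass
class RawProviderRecord:  # fields are assumptions
    provider: str
    payload: dict[str, Any]


@dataclass
class FocusRecord:  # reduced to two illustrative fields
    billed_cost: float
    service_name: str


class ProviderConnector(Protocol):
    async def test_connection(self) -> ProviderTestResult: ...
    async def pull(
        self, start_date: date, end_date: date, limit: Optional[int] = None
    ) -> list[RawProviderRecord]: ...


class FocusMapper(Protocol):
    def map_record(self, raw: RawProviderRecord) -> list[FocusRecord]: ...


class InMemoryConnector:
    """Hypothetical connector backed by a fixed list of rows, for tests only."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    async def test_connection(self) -> ProviderTestResult:
        return ProviderTestResult(ok=True, message="in-memory source")

    async def pull(self, start_date, end_date, limit=None):
        # The date range is ignored here; a real connector would filter on it.
        records = [RawProviderRecord("demo", row) for row in self._rows]
        return records if limit is None else records[:limit]


class DemoMapper:
    """Hypothetical mapper; the payload keys are made up for this sketch."""

    def map_record(self, raw: RawProviderRecord) -> list[FocusRecord]:
        return [FocusRecord(float(raw.payload["cost"]), raw.payload["service"])]
```

With this shape, a connector can be exercised without any mapper in scope, e.g. `asyncio.run(connector.pull(date(2024, 1, 1), date(2024, 1, 31), limit=5))`.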

The pipeline should allow:

  • extract_only: run provider auth/source/pull and save raw data.
  • transform_only: transform existing raw records using a mapper.
  • full sync: extract, transform, and load FOCUS records.
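
One way the three modes could be dispatched is sketched below. SyncMode, run_sync, and the list-based stores are hypothetical names standing in for the real pipeline entry point and the raw_billing_data / billing_data tables:

```python
import asyncio
from enum import Enum


class SyncMode(Enum):
    EXTRACT_ONLY = "extract_only"
    TRANSFORM_ONLY = "transform_only"
    FULL = "full"


async def run_sync(mode, raw_store, focus_store, pull=None, map_record=None):
    """Run one sync in the given mode and return the number of raw records handled.

    raw_store and focus_store are plain lists in this sketch.
    pull is an async callable returning raw records; map_record maps one raw
    record to a list of FOCUS records.
    """
    extract = mode in (SyncMode.EXTRACT_ONLY, SyncMode.FULL)
    transform = mode in (SyncMode.TRANSFORM_ONLY, SyncMode.FULL)

    # Each stage only demands the collaborator it actually uses.
    if extract and pull is None:
        raise ValueError("extraction requires a pull callable")
    if transform and map_record is None:
        raise ValueError("transformation requires a mapper")

    if extract:
        batch = await pull()
        raw_store.extend(batch)
    else:
        # transform_only works on raw records that are already stored.
        batch = list(raw_store)

    if transform:
        for raw in batch:
            focus_store.extend(map_record(raw))

    return len(batch)
```

The point of the sketch is that the mapper check lives next to the transform stage, so extract_only never touches it.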

A provider without a mapper should still be valid as an extract-only provider. The registry could expose capabilities such as can_extract and can_transform, making provider readiness explicit.
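
A rough sketch of how the registry might expose those capabilities, assuming an entry simply records which classes were registered. ProviderEntry and ProviderRegistry are hypothetical and do not mirror the existing decorator-based registry:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderEntry:
    name: str
    connector_class: Optional[type] = None
    mapper_class: Optional[type] = None  # optional: extract-only providers omit it

    @property
    def can_extract(self) -> bool:
        return self.connector_class is not None

    @property
    def can_transform(self) -> bool:
        return self.mapper_class is not None


class ProviderRegistry:
    def __init__(self) -> None:
        self._entries: dict[str, ProviderEntry] = {}

    def register(self, entry: ProviderEntry) -> None:
        self._entries[entry.name] = entry

    def get(self, name: str) -> ProviderEntry:
        return self._entries[name]

    def extract_only_providers(self) -> list[str]:
        """Names of providers that can pull raw data but have no mapper yet."""
        return sorted(
            e.name for e in self._entries.values()
            if e.can_extract and not e.can_transform
        )
```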

Alternatives considered

Keep the current provider registry and decorator structure, but make mapper_class optional during extraction. This is a smaller change and may be a good first step, but it still leaves provider pull behavior hidden behind DLT source config objects.

Another option is to keep using get_sources() as the provider contract and add a dedicated “preview extraction” API around the existing extractors. This would avoid a larger interface change, but it may still be harder to unit test than a direct pull()/pull_sample() provider contract.

A more incremental approach would be:

  • Allow pipeline_context() to create providers and sources without requiring a mapper.
  • Move mapper lookup into the transform stage.
  • Add tests for extract-only providers.
  • Later introduce a formal ProviderConnector abstraction.
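
The first two incremental steps might look roughly like the sketch below. It does not reproduce the real pipeline_context() signature; build_context, PipelineContext, MAPPERS, and MapperNotFoundError are hypothetical names used only to show where the mapper lookup moves:

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical lookup table: provider name -> function mapping one raw record
# to a list of FOCUS-shaped dicts.
MAPPERS: dict[str, Callable[[dict], list[dict]]] = {}


class MapperNotFoundError(LookupError):
    """Raised by the transform stage when a provider has no mapper."""


@dataclass
class PipelineContext:
    provider_name: str
    sources: list[str] = field(default_factory=list)
    # Deliberately no mapper field: building a context never needs one.


def build_context(provider_name: str, sources: list[str]) -> PipelineContext:
    return PipelineContext(provider_name, list(sources))


def transform_stage(ctx: PipelineContext, raw_records: list[dict]) -> list[dict]:
    # The mapper is resolved here, at the point of use, not at context creation.
    try:
        mapper = MAPPERS[ctx.provider_name]
    except KeyError:
        raise MapperNotFoundError(
            f"no mapper registered for {ctx.provider_name!r}"
        ) from None
    return [focus for raw in raw_records for focus in mapper(raw)]
```

An extract-only test can then assert that build_context() succeeds for a provider with no mapper and that only transform_stage() raises.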

Additional context

The codebase already has the right data model split:

  • raw_billing_data stores extracted provider data before transformation.
  • billing_data stores transformed FOCUS records.

However, the current orchestration requires a mapper before extraction can run. That makes it hard to test a new provider in stages.

Desired provider onboarding flow:

  1. Add provider auth/config metadata.
  2. Confirm connection works.
  3. Pull 5-10 raw records.
  4. Save raw payloads.
  5. Add fixture-based mapper tests.
  6. Enable full sync once transform is ready.
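
Steps 4 and 5 could be connected by saving the pulled sample as a JSON fixture and running the mapper against it offline. The following is only a sketch: the helper names, the fixture layout, and the raw field names (amount, sku) are assumptions, not the project's format:

```python
import json
from pathlib import Path


def save_fixture(records: list[dict], path: Path) -> None:
    """Persist a small raw sample so mapper tests need no network access."""
    path.write_text(json.dumps(records, indent=2, sort_keys=True))


def load_fixture(path: Path) -> list[dict]:
    return json.loads(path.read_text())


def map_record(raw: dict) -> list[dict]:
    """Hypothetical mapper under test; raw field names are illustrative."""
    return [{"BilledCost": float(raw["amount"]), "ServiceName": raw["sku"]}]


def check_mapper_against_fixture(path: Path) -> int:
    """Map every fixture record, check basic invariants, return the count."""
    count = 0
    for raw in load_fixture(path):
        for focus in map_record(raw):
            assert focus["BilledCost"] >= 0
            assert focus["ServiceName"]
            count += 1
    return count
```

Because the fixture is plain JSON, it can be committed next to the mapper tests and reviewed like any other test data.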
