@marco-virgolin-ist marco-virgolin-ist commented Jun 11, 2025

Hi,

This PR adds support for causal DAGs in synthetic data generation/imputation.
DAGs are expressed as a dictionary { int: list[int] } where each key is a dependent column_idx and its value is the list of column indices that column depends on.
E.g., if column 0 is independent, column 1 depends on 0, and column 2 depends on 1 and 0, you'd have:

{ 0: [], 1: [0], 2: [1,0] }
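
For illustration only, here is a minimal sketch of how a dictionary in this format could be sanity-checked before use. The helper name `validate_dag` and the `n_columns` argument are hypothetical and are not part of this PR; the sketch relies only on Python's standard-library `graphlib`, whose `TopologicalSorter` accepts exactly this "node -> list of predecessors" mapping.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(dag: dict[int, list[int]], n_columns: int) -> None:
    """Hypothetical helper: raise ValueError for unknown columns or cycles."""
    for child, parents in dag.items():
        # Every index mentioned in the DAG must refer to an existing column.
        for idx in (child, *parents):
            if not 0 <= idx < n_columns:
                raise ValueError(f"column index {idx} is out of range")
    try:
        # prepare() raises CycleError if the graph is not acyclic.
        TopologicalSorter(dag).prepare()
    except CycleError as exc:
        raise ValueError("dependencies contain a cycle") from exc


# The example from the description passes silently.
validate_dag({0: [], 1: [0], 2: [1, 0]}, n_columns=3)
```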

The PR modifies (IMHO minimally) src/tabpfn_extensions/unsupervised/unsupervised.py and adds a corresponding example, examples/unsupervised/generate_data_following_dag.py. Some small changes apply to other files.

Regarding the src file:

  • allows passing the dag dictionary to the appropriate function calls
  • reads the dag dictionary in impute_ and orders the variables using Python's graphlib TopologicalSorter so that they are synthesized in the right order (e.g., independent columns first)
  • sets the conditional_idx of the column_idx being generated/imputed to its dependencies (e.g., if column_idx==2 then conditional_idx==[0,1])
  • proceeds as before
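
The ordering and conditioning steps above can be sketched roughly as follows. This is not the code from unsupervised.py; `generation_plan` is a hypothetical name used only to show how `graphlib.TopologicalSorter` yields a valid generation order and how each column's conditioning set falls out of the dag dictionary.

```python
from graphlib import TopologicalSorter


def generation_plan(dag: dict[int, list[int]]) -> list[tuple[int, list[int]]]:
    """Hypothetical helper: pair each column with the columns it is conditioned on.

    Columns are returned in an order where every dependency precedes
    the columns that depend on it.
    """
    # static_order() yields each node only after all of its predecessors.
    order = list(TopologicalSorter(dag).static_order())
    # A column that appears only as a dependency has no entry of its own,
    # so it is treated as independent (empty conditioning set).
    return [(col, sorted(dag.get(col, []))) for col in order]


plan = generation_plan({0: [], 1: [0], 2: [1, 0]})
print(plan)  # [(0, []), (1, [0]), (2, [0, 1])]
```

Because the example graph admits only one valid ordering, the printed plan is deterministic; for graphs with several valid orderings, any order returned by `static_order()` respects the dependencies.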

What this PR does not do (based on the contribution guidelines):

  • does not create a new extension and applicable aspects
  • does not include tests (!!!). Unfortunately, I see no existing tests for unsupervised to add to (am I missing something?). I hope the limited scope of the proposed changes plus the provided example suffice. If you feel otherwise, it would be great if a test suite for unsupervised were created, to which we could then add tests specific to the DAG support.


CLAassistant commented Jun 11, 2025

CLA assistant check
All committers have signed the CLA.

marco-virgolin-ist and others added 2 commits June 24, 2025 16:52
  - Fix density_() to use column_idx instead of hardcoded column 0 for target
  - Remove incorrect categorical_features index remapping during DAG processing

  When the DAG reorders features for generation, the first feature to generate
  may not be the column at index 0. Previously, density_() always used column 0
  as the target regardless of which column was being generated, causing
  incorrect model training.

  Fixes synthetic data generation with DAG-based feature ordering.
@priorphil priorphil requested a review from noahho August 29, 2025 07:16