Support for DAG in unsupervised synthesis / imputation #93

marco-virgolin-ist · 2025-06-11T09:24:01Z

Hi,

This PR adds support for causal DAGs in synthetic data generation/imputation.
DAGs are expressed as a dictionary { int: list[int] } where each key is a dependent column_idx and its value is the list of column indices that the key depends on.
E.g. for column 0 independent, column 1 depending on 0, column 2 depending on 1 and 0, you'd have:

{ 0: [], 1: [0], 2: [1,0] }
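As a minimal sketch of why this layout is convenient: the dictionary maps each node to its predecessors, which is the input format the standard library's graphlib.TopologicalSorter accepts. The helper name generation_order is hypothetical and not part of this PR; it only illustrates how the example above yields an ordering and how a cyclic dictionary could be rejected.

```python
from graphlib import CycleError, TopologicalSorter


def generation_order(dag):
    """Return column indices so that every column follows its dependencies.

    Hypothetical helper for illustration; not taken from the PR.
    """
    try:
        # graphlib expects {node: predecessors}, the same layout as the DAG dict.
        return list(TopologicalSorter(dag).static_order())
    except CycleError as err:
        # The second element of args holds the detected cycle.
        raise ValueError(f"dependency graph has a cycle: {err.args[1]}") from err


# The example from the description: 0 independent, 1 <- 0, 2 <- 1 and 0.
order = generation_order({0: [], 1: [0], 2: [1, 0]})
print(order)  # [0, 1, 2]
```

For this particular dictionary the ordering is unique, because column 1 needs column 0 and column 2 needs column 1.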

The PR modifies (IMHO minimally) src/tabpfn_extensions/unsupervised/unsupervised.py and adds a corresponding example, examples/unsupervised/generate_data_following_dag.py. Some small changes apply to other files.

Regarding the src file, the PR:

- enables passing the dag dictionary to the appropriate function calls;
- reads the dag dictionary in impute_ and orders the variables using Python's graphlib TopologicalSorter, so that they are synthesized in the right order (e.g. independent columns first);
- sets the conditional_idx of the column_idx being generated/imputed to its dependencies (e.g., if column_idx==2 then conditional_idx==[0,1]);
- proceeds as before.
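The steps above can be sketched roughly as follows. This is not the code in unsupervised.py: plan_generation and toy_sampler are hypothetical names, the sampler is a stand-in for the model, and treating columns that are missing from the dictionary as having no dependencies is an assumption made here for illustration.

```python
import random
from graphlib import TopologicalSorter


def plan_generation(dag, n_features):
    """Return (column_idx, conditional_idx) pairs in a valid generation order."""
    # Assumption: a column absent from the dict has no dependencies.
    full = {j: sorted(dag.get(j, [])) for j in range(n_features)}
    order = TopologicalSorter(full).static_order()
    return [(j, full[j]) for j in order]


def toy_sampler(rows, conditional_idx, rng):
    """Hypothetical stand-in for the model: sum of the parents plus noise."""
    return [sum(row[i] for i in conditional_idx) + rng.gauss(0.0, 1.0)
            for row in rows]


rng = random.Random(0)
rows = [[None] * 3 for _ in range(5)]  # 5 samples, 3 columns, nothing filled yet

plan = plan_generation({1: [0], 2: [1, 0]}, n_features=3)
for column_idx, conditional_idx in plan:
    # Parents are already filled because the plan is topologically sorted;
    # a wrong order would fail here when summing over None.
    values = toy_sampler(rows, conditional_idx, rng)
    for row, value in zip(rows, values):
        row[column_idx] = value
```

Here plan evaluates to [(0, []), (1, [0]), (2, [0, 1])], matching the column_idx==2 example above.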

What this PR does not do (based on the contribution guidelines):

- It does not create a new extension, nor cover the aspects that apply to one.
- It does not include tests (!!!). Unfortunately, I see no existing tests for unsupervised to add to (am I missing something?). I hope the limited scope of the proposed changes plus the provided example suffice. If you feel otherwise, it would be great to create a test suite for unsupervised, to which we could then add tests specific to DAG support.
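For reference, a DAG-specific test could look something like the sketch below. It is only an assumption about what such a test might check: it exercises the ordering property with the standard library alone and does not call anything from tabpfn_extensions. The names respects_dag and test_order_respects_dag are hypothetical.

```python
from graphlib import TopologicalSorter


def respects_dag(order, dag):
    """True if every column appears after all the columns it depends on."""
    position = {col: i for i, col in enumerate(order)}
    return all(
        position[parent] < position[child]
        for child, parents in dag.items()
        for parent in parents
    )


def test_order_respects_dag():
    dag = {0: [], 1: [0], 2: [1, 0], 3: [2]}
    order = list(TopologicalSorter(dag).static_order())
    assert sorted(order) == [0, 1, 2, 3]        # every column is generated once
    assert respects_dag(order, dag)              # parents come before children
    assert not respects_dag([2, 1, 0, 3], dag)   # a wrong order is detected
```

Written in this style, the function would be collected by pytest if placed in a test module.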

CLAassistant · 2025-06-11T09:24:06Z

All committers have signed the CLA.

- Fix density_() to use column_idx instead of hardcoded column 0 for target
- Remove incorrect categorical_features index remapping during DAG processing

When DAG reorders features for generation, the first feature to generate may not be the column at index 0. Previously, density_() always used column 0 as target regardless of which column was being generated, causing incorrect model training.

Fixes synthetic data generation with DAG-based feature ordering.
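To illustrate the first fix, here is a small sketch of selecting the target by column_idx rather than by position 0. It is not the body of density_(); split_features_target is a hypothetical helper, and the data is made up for the example.

```python
def split_features_target(rows, column_idx, conditional_idx):
    """Split rows into conditioning features and the target column.

    Hypothetical helper: the target is the column being generated
    (column_idx), not whichever column happens to sit at index 0.
    """
    features = [[row[i] for i in conditional_idx] for row in rows]
    target = [row[column_idx] for row in rows]
    return features, target


rows = [[10, 20, 30], [11, 21, 31]]

# With a DAG such as {2: [], 0: [2]}, column 2 is generated first.
features, target = split_features_target(rows, column_idx=2, conditional_idx=[])
print(target)  # [30, 31]

# Taking column 0 as the target here would give [10, 11] instead.
wrong_target = [row[0] for row in rows]
```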

marco-virgolin-ist added 2 commits June 11, 2025 10:41

support for DAGs

c6e6d76

working also if not all deps are expressed

f97b458

marco-virgolin-ist and others added 2 commits June 24, 2025 16:52

fix change of cat indices due to dag sorting

bb56568

priorphil requested a review from noahho August 29, 2025 07:16

3 participants
