feature: Add Python-native DataPipe interface for fluent data preprocessing #115

Open
mathysgrapotte opened this issue Feb 19, 2025 · 1 comment

Comments

@mathysgrapotte
Owner

Is your feature request related to a problem? Please describe.

Python users currently need to rely on YAML/config files for defining data preprocessing pipelines, which can be cumbersome for interactive experimentation and native Python workflows.

Describe the solution you'd like

A fluent Python interface (DataPipe) that enables chaining of data processing operations (split-transform-encode) while maintaining compatibility with existing config-based workflows. The interface should provide a clear pipeline construction pattern similar to Nextflow processes but in native Python.
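One way to picture the compatibility requirement is a chain that only records its steps, so the same pipeline can be dumped to, or rebuilt from, a config-style dictionary. The sketch below is purely illustrative: `PipeSpec`, `to_config`, `from_config`, and the `kind`/`name`/`params` layout are hypothetical names chosen for this example, not the project's existing config schema.

```python
from __future__ import annotations

import copy
from typing import Any


class PipeSpec:
    """Hypothetical recorder: each chained call appends one step description."""

    def __init__(self, steps: list[dict[str, Any]] | None = None) -> None:
        self.steps: list[dict[str, Any]] = list(steps or [])

    def _add(self, kind: str, name: str, **params: Any) -> PipeSpec:
        # Return a new object so a partially built chain can be reused safely.
        step = {"kind": kind, "name": name, "params": params}
        return PipeSpec(self.steps + [step])

    def split(self, name: str, **params: Any) -> PipeSpec:
        return self._add("split", name, **params)

    def transform(self, name: str, **params: Any) -> PipeSpec:
        return self._add("transform", name, **params)

    def encode(self, name: str, **params: Any) -> PipeSpec:
        return self._add("encode", name, **params)

    def to_config(self) -> dict[str, Any]:
        # Plain dicts and lists only, so the result can be written out as YAML or JSON.
        return {"steps": copy.deepcopy(self.steps)}

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> PipeSpec:
        return cls(copy.deepcopy(config["steps"]))


spec = (PipeSpec()
        .split("RandomSplitter", ratios=[0.8, 0.2])
        .transform("AddNoise", columns=["ecg"], std=0.1)
        .encode("LabelEncoder", column="diagnosis"))

config = spec.to_config()
# Round trip: a spec rebuilt from the config describes the same pipeline.
assert PipeSpec.from_config(config).to_config() == config
```

Keeping the chain as data, rather than executing each step immediately, is what would let the fluent interface and the config files describe the same pipeline.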

@mathysgrapotte
Owner Author

A very rough suggestion:

pipe = (DataPipe(raw_df, loader)
        .split(RandomSplitter, ratios=[0.8, 0.2])
        .transform(AddNoise, columns=['ecg'], std=0.1)
        .encode(LabelEncoder, column='diagnosis')
        .build())

dataset = HandlerTorch(pipe.get_tensors()).to_dataset()

But we will have to revisit this once the refactoring is done.
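To make the chaining pattern concrete, here is a minimal runnable sketch of a lazy split-transform-encode pipeline. It is not the proposed implementation: it assumes plain lists of dicts instead of a DataFrame and loader, passes plain functions (`random_split`, `add_noise`, `label_encode`, all hypothetical) instead of the `RandomSplitter`, `AddNoise`, and `LabelEncoder` classes, and has `build()` return the processed splits directly rather than an object with `get_tensors()`.

```python
from __future__ import annotations

import random
from typing import Any, Callable

Row = dict[str, Any]
Splits = dict[str, list[Row]]
Step = Callable[[Splits, random.Random], Splits]


def random_split(rows: list[Row], rng: random.Random, ratios: list[float]) -> Splits:
    """Shuffle the rows and cut them into train/test using the first ratio."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * ratios[0])
    return {"train": shuffled[:cut], "test": shuffled[cut:]}


def add_noise(row: Row, rng: random.Random, columns: list[str], std: float) -> Row:
    """Return a copy of the row with Gaussian noise added to the given columns."""
    return {**row, **{c: row[c] + rng.gauss(0.0, std) for c in columns}}


def label_encode(splits: Splits, column: str) -> Splits:
    """Map each distinct label (across all splits) to an integer index."""
    labels = sorted({r[column] for part in splits.values() for r in part})
    index = {label: i for i, label in enumerate(labels)}
    return {name: [{**r, column: index[r[column]]} for r in part]
            for name, part in splits.items()}


class DataPipe:
    """Hypothetical fluent pipeline: methods queue steps, build() runs them."""

    def __init__(self, rows: list[Row], seed: int = 0) -> None:
        self._rows = [dict(r) for r in rows]
        self._seed = seed
        self._steps: list[Step] = []

    def split(self, splitter: Callable[..., Splits], **params: Any) -> DataPipe:
        def step(splits: Splits, rng: random.Random) -> Splits:
            rows = [r for part in splits.values() for r in part]
            return splitter(rows, rng, **params)
        self._steps.append(step)
        return self

    def transform(self, fn: Callable[..., Row], **params: Any) -> DataPipe:
        def step(splits: Splits, rng: random.Random) -> Splits:
            return {name: [fn(r, rng, **params) for r in part]
                    for name, part in splits.items()}
        self._steps.append(step)
        return self

    def encode(self, encoder: Callable[..., Splits], **params: Any) -> DataPipe:
        self._steps.append(lambda splits, rng: encoder(splits, **params))
        return self

    def build(self) -> Splits:
        # A fresh seeded RNG per build keeps repeated builds reproducible.
        rng = random.Random(self._seed)
        splits: Splits = {"all": [dict(r) for r in self._rows]}
        for step in self._steps:
            splits = step(splits, rng)
        return splits


raw_rows = [{"ecg": float(i), "diagnosis": "afib" if i % 2 else "normal"}
            for i in range(10)]

pipe = (DataPipe(raw_rows, seed=42)
        .split(random_split, ratios=[0.8, 0.2])
        .transform(add_noise, columns=["ecg"], std=0.1)
        .encode(label_encode, column="diagnosis"))

splits = pipe.build()
```

Because each method returns `self` and only queues a step, nothing runs until `build()`, which is where a real implementation could hand the result to something like `HandlerTorch`.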

@mathysgrapotte mathysgrapotte moved this to Todo - long issues in Stimulus v1.0 Feb 19, 2025
@mathysgrapotte mathysgrapotte moved this from Todo - long issues to Todo - depend on other issues in Stimulus v1.0 Feb 19, 2025
@mathysgrapotte mathysgrapotte moved this from Todo - depend on other issues to Todo in Stimulus v1.0 Feb 21, 2025