marcelo-pires-dsa/synthetic-data

Synthetic Data with SDV

This project demonstrates how to create synthetic datasets using real data samples as a starting point. One of the main advantages is the ability to generate a large amount of data that would otherwise be very difficult or even impossible to obtain given the project scope or resources.

Requirements

  • Python >= 3.9
  • Pip
  • Pipenv

Repository Structure

├───dataset
│   ├───raw -> Real data
│   └───synthetic -> Synthesized data
├───output -> Output for synthesizer, etc.
├───R -> Data Wrangling project (not pushed yet)
├───synthetic-data-modeling.py
├───synthetic-metadata.py
└───synthetic-data-metadata.json

NOTE: For this example, I added the datasets with Git LFS, which is not how I usually do it. Normally all my datasets are versioned with an appropriate method and only the metadata is stored in Git for tracking.

Installing

Installing Python

Check the official Python documentation

Installing pip

Installing pipenv

This project uses pipenv to manage the virtual environment in an easy and straightforward way. All dependencies are defined in the Pipfile and Pipfile.lock.

When using pipenv, avoid manually editing Pipfile and Pipfile.lock. Use the pipenv commands (for example, pipenv install <package>) to manage the dependencies instead.

Generating Synthetic data

Due to GitHub LFS limitations on my user account, the raw dataset has to be downloaded separately:

  1. Get the raw dataset from https://drive.google.com/file/d/1IZRzR-9-S9taKQTM2sPQqzK8Jhy3V6YK/view?usp=sharing
  2. Put the file in the dataset/raw folder
  3. Run synthetic-metadata.py
  4. Run synthetic-data-modeling.py
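
Steps 3 and 4 above can be sketched as a small driver script. This is only a minimal sketch and not part of the repository: the names find_raw_files and run_pipeline are hypothetical, and the dataset/raw path is an assumption taken from the Repository Structure section. It checks that a raw file is in place, then runs the two scripts in order and stops if either one fails.

```python
import subprocess
import sys
from pathlib import Path

# Assumed layout, taken from the Repository Structure section.
RAW_DIR = Path("dataset") / "raw"
# The two scripts from steps 3 and 4, in the order they must run.
SCRIPTS = ["synthetic-metadata.py", "synthetic-data-modeling.py"]


def find_raw_files(raw_dir):
    """Return the regular, non-hidden files in raw_dir, sorted by path."""
    raw_dir = Path(raw_dir)
    if not raw_dir.is_dir():
        return []
    return sorted(
        p for p in raw_dir.iterdir()
        if p.is_file() and not p.name.startswith(".")
    )


def run_pipeline(repo_root="."):
    """Run the metadata script, then the modeling script, from repo_root."""
    repo_root = Path(repo_root)
    if not find_raw_files(repo_root / RAW_DIR):
        raise FileNotFoundError(f"No raw dataset found in {repo_root / RAW_DIR}")
    for script in SCRIPTS:
        # check=True raises CalledProcessError when a script exits non-zero,
        # so the modeling step never runs after a failed metadata step.
        subprocess.run([sys.executable, script], cwd=repo_root, check=True)
```

Calling run_pipeline() from the repository root, inside the pipenv environment (for example through pipenv run python), uses the same interpreter and dependencies for both scripts.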
