- `propmpts` contains all the prompts used in synthetic data generation.
- `config` contains the YAML file that maps prompts to their corresponding locations.
- `utils.py` contains the helper code to extract and parse data from LLM responses.
- `data_pipeline.py` contains the main source code for generating synthetic data according to the pipeline explained in our paper.
- `commitpackft_subset.csv` contains the `repo` and `commit` fields of the samples used in training; these can be used to map back to the original commitpackft dataset and extract the respective samples.
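
The mapping from `commitpackft_subset.csv` back to the original commitpackft samples can be sketched as a simple key lookup. This is a minimal illustration, not code from this repository: the helper names `load_keys` and `select_samples` are hypothetical, the toy records stand in for the real dataset, and the field names in the original commitpackft records may differ from `repo` and `commit`, so adjust `repo_field` and `commit_field` as needed.

```python
import csv
import io


def load_keys(csv_file):
    """Read (repo, commit) pairs from an open CSV file.

    Assumes the file has a header row with `repo` and `commit` columns,
    as described for commitpackft_subset.csv.
    """
    reader = csv.DictReader(csv_file)
    return {(row["repo"], row["commit"]) for row in reader}


def select_samples(records, keys, repo_field="repo", commit_field="commit"):
    """Keep only the records whose (repo, commit) pair appears in `keys`.

    `records` is any iterable of dict-like samples. The field names are
    assumptions and may need to be changed for the real dataset.
    """
    return [r for r in records if (r[repo_field], r[commit_field]) in keys]


# Toy stand-ins for the CSV file and the original dataset (hypothetical data).
subset_csv = io.StringIO("repo,commit\norg/a,abc123\norg/b,def456\n")
keys = load_keys(subset_csv)

records = [
    {"repo": "org/a", "commit": "abc123", "message": "fix bug"},
    {"repo": "org/c", "commit": "zzz999", "message": "unrelated"},
]
selected = select_samples(records, keys)
```

With the real file, `load_keys` would be called on `open("commitpackft_subset.csv", newline="")`, and `records` would be whatever iterable of samples you load from the original dataset.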
- Make sure the required packages are installed via the `environment.yaml` file provided in the root folder.
- Run the following command to generate data with the Llama-3.3-70B-Instruct model:
python data_pipeline.py --output_dir /path_to_output --language "python" --data_path /path_to_seed_code --llm_path huggingface_id_or_local_path_to_LLM