diff --git a/rdagent/scenarios/data_science/scen/prompts.yaml b/rdagent/scenarios/data_science/scen/prompts.yaml
index a4f7fd7c..cdb68ef7 100644
--- a/rdagent/scenarios/data_science/scen/prompts.yaml
+++ b/rdagent/scenarios/data_science/scen/prompts.yaml
@@ -5,73 +5,30 @@ scen_desc: |-
   ------The expected output & submission format specifications------
   {{scen.submission_specifications}}

-description_template:
+competition_description_template:
   system: |-
-    You are an assistant that extracts structured information from unstructured text.
+    You are a data science assistant that extracts structured information from unstructured text.
     The user will provide you a Kaggle competition description, and you need to extract specific details from it.
     For the dataset, the competition may not include detailed information about the dataset. The user has read the dataset and provide you the relevant information. Please include it in your response.
     Please answer in Json format with the following schema:
     {
-      "Competition Type": "The type of competition, e.g., 'Classification', 'Regression', 'Clustering', 'Prediction", "Time-Series Forecasting",
-      "Competition Description": "A brief description of the competition",
-      "Target Description": "A description of the target variable to be predicted",
-      "Competition Features": "Two-line description of the overall features involved within the competition as background."
+      "Competition Task Type": "The type of competition task, e.g., 'Classification', 'Regression', 'Clustering', 'Recommendation', 'Time-Series Forecasting'",
+      "Competition Data Type": "The type of competition data, e.g., 'Tabular', 'Time Series', 'Text (Natural Language Processing)', 'Image (Computer Vision)', 'Audio', 'Video'",
+      "Competition Brief Description": "A brief description of the competition",
+      "Competition Target Description": "A description of the target variable to be predicted",
       "Submission Specifications": "The submission specification & sample submission csv descriptions for the model to output."
       "Submission channel number to each sample": "The number of channels in the output for each sample, e.g., 1 for regression, N for N class classification with probabilities, etc. A Integer. If not specified, it is 1."
-      "Evaluation Description": "A brief description of the metrics used in the evaluation. Please note that if `evaluation_metric_direction` is True, it indicates that higher values are better; if False, lower values are preferred."
     }
-    Since these might be very similar column names in data like one_hot_encoded columns, you can use some regex to group them together.
   user: |-
     Competition Description:
-    {{ competition_descriptions }}
-    Evaluation_metric_direction:
-    {{ evaluation_metric_direction }}
+    {{ competition_raw_description }}

 competition_background: |-
-  You are solving a data science tasks and the type of the competition is {{ competition_type }}.
-  The competition description is: {{competition_description}}.
-
-  We provide an overall script in file: train.py. The user will run the train.py script along with several feature and model scripts to train several model to get a good performance on this task.
+  You are a world-class data scientist and machine learning engineer with deep expertise in statistics, mathematics, and computer science.
+  Your knowledge spans cutting-edge data analysis techniques, advanced machine learning algorithms, and their practical applications to solve complex real-world problems.
+  You are dedicated to producing accurate, efficient, and innovative solutions.

-  The train.py script is as follows:
-  ```python
-  {{ train_script }}
-  ```
-
-  The final output of our pipeline is from a ensemble of up to four models. Each model is trained on a different subset of the data.
-  The four model types are: XGBoost, RandomForest, LightGBM and Neural Network (A Pytorch model).
-  About the Neural Network model, You can try different architectures and hyperparameters to improve the performance. You can even use a pytorch model to ensemble the other three types of models. Try to open your mind on the NN model.
-
-  The data is extracted from the competition dataset, focusing on relevant attributes in {{ competition_features }}.
-
-  The user firstly designs and implements a feature book for each model. The feature book is a combination of several features and feature groups.
-  The feature book is built from:
-  - Raw features: The raw features are the original features from the dataset.
-  - generated features: The generated features are the features that are calculated based on the raw features according to some formulations. The calculation should be align with some physical or logical meaning. Don't just simply apply some numeric operations to the raw features.
-  - feature groups: The feature groups are preprocessed group of features from the raw features like normalization, one hot encoding, etc.
-  The feature or feature group is defined in the following parts:
-  - Name: The name of the feature or feature group.
-  - Description: A description of the feature or feature group.
-  - Formulation: The formulation of the feature or feature group.
-  - Variables: The variable list used in the formulation. Notice: The variable should be a specific feature in the dataset. Please make sure the feature name is exactly the same as the feature name in the dataset.
-
-  For each model, the user will design and implement the model in a separate script.
-  The model is defined in the following parts:
-  - Name: The name of the model.
-  - Description: A description of the model.
-  - Architecture: The detailed architecture of the model, such as neural network layers or tree structures.
-  - ModelType: The type of the model, which should be one of ["XGBoost", "RandomForest", "LightGBM", "NN"].
-  The model should provide clear and detailed documentation of its architecture and hyperparameters.
-
-  The user tries to optimize the performance iteratively by employing one of the feature related or model related action items:
-  - Feature related:
-    - "Feature engineering": The user will design several new tasks and implement several new features. The new feature might only affect the model using all the feature book.
-    - "Feature processing": The user will design a new task to process the feature book like normalization or one hot encoding to improve the model performance. Any processing with help of a deep model is not included in this task.
-  - Model related:
-    - "Model feature selection": The user will modify one model to select the part of the features from the feature book to improve the model performance.
-    - "Model tuning": The user will tune the hyperparameters of XGBoost, RandomForest or LightGBM or build or improve the NN model to improve the model performance.
-      Notice: You can automatically optimize the hyperparameters of the model using some library when training the model. Since we don't have a lot of time to train the model, please use a small number of trials to optimize the hyperparameters.
-  Our validation set split is not deterministic, so when you are using hyperparameter tuning, you can merge training and validation and use cross validation method to tune the hyperparameters.
-  One you have determine the best model parameter, you should retrain the model on all training and validation set to get the final model.
-
-  For each loop, you need to help user decide which action item to choose and provide the corresponding code to implement the action item.
\ No newline at end of file
+  The task type for this competition is {{ competition_task_type }}.
+  The data type used in this competition is {{ competition_data_type }}.
+  Briefly, the competition involves: {{ competition_brief_description }}.
+  #TODO: Add more details about the competition?
\ No newline at end of file
diff --git a/rdagent/scenarios/data_science/scen/scen.py b/rdagent/scenarios/data_science/scen/scen.py
index 79e0de7d..a519f1ef 100644
--- a/rdagent/scenarios/data_science/scen/scen.py
+++ b/rdagent/scenarios/data_science/scen/scen.py
@@ -21,18 +21,17 @@ class DataScienceScen(Scenario):

     def __init__(self, competition: str) -> None:
         self.competition = competition
-        self.competition_descriptions = crawl_descriptions(competition, DS_RD_SETTING.local_data_path)
+        self.competition_raw_description = crawl_descriptions(competition, DS_RD_SETTING.local_data_path)

         leaderboard = leaderboard_scores(competition)
-        self.evaluation_metric_direction = float(leaderboard[0]) > float(leaderboard[-1])
+        self.competition_metric_direction = float(leaderboard[0]) > float(leaderboard[-1])

         self._analysis_competition_description()

     def _analysis_competition_description(self):
-        sys_prompt = T(".prompts:description_template.system").r()
-        user_prompt = T(".prompts:description_template.user").r(
-            competition_descriptions=self.competition_descriptions,
-            evaluation_metric_direction=self.evaluation_metric_direction,
+        sys_prompt = T(".prompts:competition_description_template.system").r()
+        user_prompt = T(".prompts:competition_description_template.user").r(
+            competition_raw_description=self.competition_raw_description,
         )

         response_analysis = APIBackend().build_messages_and_create_chat_completion(
@@ -42,41 +41,34 @@ def _analysis_competition_description(self):
         )

         response_json_analysis = json.loads(response_analysis)
-        self.competition_type = response_json_analysis.get("Competition Type", "No type provided")
-        self.competition_description = response_json_analysis.get("Competition Description", "No description provided")
-        self.target_description = response_json_analysis.get("Target Description", "No target provided")
-        self.competition_features = response_json_analysis.get("Competition Features", "No features provided")
+        self.competition_task_type = response_json_analysis.get("Competition Task Type", "No type provided")
+        self.competition_data_type = response_json_analysis.get("Competition Data Type", "No data type provided")
+        self.competition_brief_description = response_json_analysis.get("Competition Brief Description", "No brief description provided")
+        self.competition_target_description = response_json_analysis.get("Competition Target Description", "No target description provided")
         self.submission_specifications = response_json_analysis.get(
             "Submission Specifications", "No submission requirements provided"
         )
         self.model_output_channel = response_json_analysis.get("Submission channel number to each sample", 1)
-        self.evaluation_desc = response_json_analysis.get(
-            "Evaluation Description", "No evaluation specification provided."
-        )

     def get_competition_full_desc(self) -> str:
-        evaluation_direction = "higher the better" if self.evaluation_metric_direction else "lower the better"
-        return f"""Competition Type: {self.competition_type}
-        Competition Description: {self.competition_description}
-        Target Description: {self.target_description}
-        Competition Features: {self.competition_features}
+        return f"""Competition Task Type: {self.competition_task_type}
+        Competition Data Type: {self.competition_data_type}
+        Competition Brief Description: {self.competition_brief_description}
+        Competition Target Description: {self.competition_target_description}
         Submission Specifications: {self.submission_specifications}
         Model Output Channel: {self.model_output_channel}
-        Evaluation Descriptions: {self.evaluation_desc}
-        Is the evaluation metric the higher the better: {evaluation_direction}
         """

     @property
     def background(self) -> str:
         background_template = T(".prompts:competition_background")
         background_prompt = background_template.r(
-            competition_type=self.competition_type,
-            competition_description=self.competition_description,
-            target_description=self.target_description,
-            competition_features=self.competition_features,
+            competition_task_type=self.competition_task_type,
+            competition_data_type=self.competition_data_type,
+            competition_brief_description=self.competition_brief_description,
+            target_description=self.competition_target_description,
             submission_specifications=self.submission_specifications,
-            evaluation_desc=self.evaluation_desc,
-            evaluate_bool=self.evaluation_metric_direction,
+            evaluate_bool=self.competition_metric_direction,
         )
         return background_prompt
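
The renamed `competition_description_template` above asks the model for a fixed set of JSON keys, which `_analysis_competition_description` then reads back with per-key defaults. As a hedged illustration, a response following the new schema might look like this (the field values are hypothetical; only the keys come from prompts.yaml):

```python
import json

# Hypothetical model response following the schema in prompts.yaml;
# the concrete values are made up for illustration.
response_analysis = """{
    "Competition Task Type": "Classification",
    "Competition Data Type": "Tabular",
    "Competition Brief Description": "Predict which passengers survived the shipwreck.",
    "Competition Target Description": "A binary label, 1 if the passenger survived and 0 otherwise.",
    "Submission Specifications": "A CSV with header PassengerId,Survived, one row per test sample.",
    "Submission channel number to each sample": 1
}"""

response_json_analysis = json.loads(response_analysis)
# Each field is read with a fallback default, mirroring the .get(...) calls in scen.py.
task_type = response_json_analysis.get("Competition Task Type", "No type provided")
data_type = response_json_analysis.get("Competition Data Type", "No data type provided")
assert task_type == "Classification" and data_type == "Tabular"
```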
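
The renamed `competition_metric_direction` flag is still inferred from the leaderboard rather than from the LLM. A minimal sketch of that inference, assuming `leaderboard_scores` returns the public leaderboard ordered best-first (the scores below are made up):

```python
# Minimal sketch; assumes leaderboard_scores() yields scores ordered best-first,
# as the comparison in DataScienceScen.__init__ implies.
def metric_direction(leaderboard: list[str]) -> bool:
    """True means higher is better; False means lower is better."""
    return float(leaderboard[0]) > float(leaderboard[-1])

print(metric_direction(["0.98", "0.95", "0.42"]))  # True: accuracy-style metric
print(metric_direction(["0.12", "0.35", "1.87"]))  # False: error-style metric (e.g. RMSE)
```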
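
`T(".prompts:competition_description_template.user").r(...)` is RD-Agent's prompt-template helper. The sketch below is a rough stand-in, under the assumption that the helper loads the named entry from the sibling prompts.yaml and renders it as a Jinja2 template; it is not the helper's actual implementation:

```python
# Rough stand-in for T(".prompts:competition_description_template.user").r(...),
# assuming the helper resolves the sibling prompts.yaml and renders it with Jinja2.
import yaml
from jinja2 import Template

with open("rdagent/scenarios/data_science/scen/prompts.yaml") as f:
    prompts = yaml.safe_load(f)

user_prompt = Template(prompts["competition_description_template"]["user"]).render(
    competition_raw_description="(raw Kaggle competition description here)"
)
print(user_prompt)
```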