We need to build a model that predicts a car's price from its features.
- No ready-made dataset is available, so we have to parse the auto.ru website to collect data for machine learning.
- The metric used to evaluate the quality of the model is MAPE (mean absolute percentage error).
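For reference, MAPE averages the absolute error of each prediction relative to the true price. The function below is a minimal sketch in plain Python, not the competition's own scoring code; the function name `mape` and the sample prices are hypothetical, and the result is reported in percent to match the leaderboard figure.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned in percent."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    # Relative error of each prediction, measured against the true value
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical prices, for illustration only
actual = [500_000, 1_000_000, 2_000_000]
predicted = [550_000, 900_000, 2_000_000]
score = mape(actual, predicted)
print(f"MAPE: {score:.2f}%")
```

Because each error is divided by the true price, a miss of 100,000 on a cheap car is penalised more than the same miss on an expensive one.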
- The project should be done by a team.
- The parsing code should be presented (in a Kaggle notebook or uploaded to GitHub).
- The project code must be presented on GitHub and Kaggle.
- The final model's MAPE must beat the baseline's result (lower is better).
- Predicted price values must be submitted to Kaggle, and the leaderboard result must be presented on GitHub.
- Gathering of a team.
- Data enrichment.
- Parsing of relevant data from auto.ru.
- Unification and merging of test and parsed datasets.
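The unification and merging step can be sketched with pandas as follows. This is an illustration under assumptions, not the project's actual code: the column names (`brand`, `mileage`, `sell_id`, `price`, `is_train`) and all values are hypothetical, and the real datasets would need many more columns brought to a common format.

```python
import pandas as pd

# Hypothetical column names and values, for illustration only
test_df = pd.DataFrame({
    "brand": ["BMW", "AUDI"],
    "mileage": [120000, 80000],
    "sell_id": [1, 2],
})
parsed_df = pd.DataFrame({
    "brand": ["bmw", "Audi"],
    "mileage": ["95 000 км", "40 000 км"],
    "price": [1500000, 2100000],
})

# Unify formats so both sources use the same representation
parsed_df["brand"] = parsed_df["brand"].str.upper()
parsed_df["mileage"] = (
    parsed_df["mileage"].str.replace(r"\D", "", regex=True).astype(int)
)

# Flag the origin of each row, then stack the two frames
test_df["is_train"] = 0
parsed_df["is_train"] = 1
combined = pd.concat([parsed_df, test_df], ignore_index=True, sort=False)
print(combined)
```

Columns present in only one source (here `price` and `sell_id`) are filled with missing values in the rows from the other source, and the `is_train` flag allows the frames to be separated again before modelling.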
- EDA
- Quick dataset overview using a profile report.
- Handling of duplicates.
- Handling of missing values.
- Visualisation of feature distributions and their relationship with the target value.
- Outlier analysis.
- Dividing features into categories.
- Analysis of the relationships between feature categories and with the target value.
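A few of the EDA steps above (duplicates, missing values, outliers) can be sketched with pandas. The source does not say which strategies the team chose, so the ones here (dropping exact duplicates, median and mode imputation, the 1.5 × IQR rule) are assumptions, as are the column names and values.

```python
import pandas as pd

# Hypothetical listings, for illustration only
df = pd.DataFrame({
    "brand": ["BMW", "BMW", "AUDI", "AUDI", "VOLVO"],
    "engine_power": [184.0, 184.0, None, 150.0, 190.0],
    "owners": ["1", "1", "2", None, "1"],
    "price": [1500000, 1500000, 2100000, 900000, 30000000],
})

# Handling of duplicates: drop fully identical rows
df = df.drop_duplicates().reset_index(drop=True)

# Handling of missing values: median for numeric, mode for categorical
df["engine_power"] = df["engine_power"].fillna(df["engine_power"].median())
df["owners"] = df["owners"].fillna(df["owners"].mode()[0])

# Outlier analysis: flag prices outside the 1.5 * IQR fences
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df["price_outlier"] = ~df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df)
```

Flagging outliers instead of deleting them keeps the decision about whether to drop, cap, or keep them open until their effect on MAPE is known.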
- Feature Engineering
- Two new features were added to the dataset.
- ML
- Encoding of all binary and categorical features.
- Testing of five models: Random Forest, CatBoost, Gradient Boosting, XGBoost, LightGBM. Bagging and stacking were also tested.
- Standardisation of numeric variables did not improve quality, so it was not used.
- Stacking of Gradient Boosting and XGBoost gave the best result.
- The best MAPE metric on the leaderboard is 11.84955%.
- The leaderboard ranking is 22.
- Performing further feature engineering.
- Trying some NLP methods to extract useful data.
- Better hyperparameter tuning.
- Testing of other models (e.g. ExtraTrees).
- Deeper analysis of the data and the results to understand what impacts MAPE the most.
- The project is too large to be done within a week.
- The team had a lot of issues with Kaggle and parsing and lost a lot of time solving them.
- Time for feature engineering had to be reduced because of the deadline.
- Working under these conditions was very stressful, and the team is not satisfied.