trangthao/side-projects

Side Projects Portfolio

Welcome to my portfolio of side projects in data science, machine learning, and analytics. The projects span a range of methods, industries, and tasks, and each one shows how I approach a real-world problem with data.


Table of Contents

  1. Projects by Industry
  2. Projects by Tasks

Projects by Industry

1. Travel and Hospitality

  • Expedia Hotel Recommendation System
    • Description: Developed a machine learning model to predict which of 100 hotel clusters a user is most likely to book, based on Expedia's customer interaction data. The system enhances personalized hotel recommendations, improving user satisfaction and platform engagement.
    • Tech Stack: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, XGBoost.
    • Key Highlights: Predicted hotel clusters using classification models to recommend contextually relevant hotels. Employed clustering to group similar hotels and handle sparse data for newly listed hotels.
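
Scoring 100 hotel clusters per user naturally leads to a ranked shortlist. The sketch below shows one way to turn per-cluster probabilities into recommendations. It assumes the classifier outputs one probability per cluster; the function name `top_k_clusters`, the cutoff `k`, and the toy probabilities are illustrative and not taken from the project.

```python
def top_k_clusters(probabilities, k=5):
    """Return the indices of the k highest-probability clusters, best first."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: probabilities[i],
        reverse=True,
    )
    return ranked[:k]


# Toy probabilities over 6 clusters (hypothetical; the project scores 100 clusters).
probs = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
recommended = top_k_clusters(probs, k=3)
print(recommended)  # [1, 3, 4]
```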

2. Banking

  • Attention-based NLP Classification

    • Description: Developed a text classification model leveraging attention mechanisms to classify consumer complaints into financial product categories.
    • Tech Stack: Python, PyTorch, GloVe, NLTK, Scikit-learn.
    • Key Highlights: Preprocessed complaint narratives, used pre-trained GloVe embeddings for text encoding, implemented a bidirectional LSTM with an attention mechanism, and achieved high classification accuracy with interpretable attention weights.
  • Customer Churn Prediction Analysis

    • Description: Developed an ensemble-based machine learning model to predict customer churn for a bank, enabling proactive retention strategies to reduce revenue loss.
    • Tech Stack: Python, Scikit-learn, XGBoost, Random Forest, Matplotlib, Seaborn.
    • Key Highlights: Performed EDA to uncover churn drivers, engineered features that capture customer behavior, achieved high recall and precision in identifying at-risk customers, and proposed actionable retention strategies to reduce the churn rate.
  • Loan Eligibility Prediction Analysis

    • Description: Built a machine learning model to predict loan eligibility for applicants based on financial and demographic data. The project aimed to assist financial institutions in streamlining loan approval processes and improving decision-making accuracy.
    • Tech Stack: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, XGBoost.
    • Key Highlights: Predicted loan eligibility using classification models, incorporating Gradient Boosting for improved accuracy. Addressed class imbalance with techniques such as class weighting, and used feature engineering to improve model robustness and reliability.
  • Credit Card Default Prediction

    • Description: Built a machine learning model to predict credit card default risk using borrower data, enabling banks to identify high-risk customers and reduce lending risks. The model incorporates personal and financial history details for accurate forecasting of payment delinquencies.
    • Tech Stack: Python, Pandas, Scikit-learn, LightGBM, XGBoost, Keras, TensorFlow, Matplotlib, Seaborn.
    • Key Highlights: Cleaned and preprocessed imbalanced datasets, performed exploratory data analysis (EDA), implemented multiple machine learning models, evaluated model performance using precision, recall, and AUC, and optimized model selection for high-risk customer prediction.
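
Several of the banking projects deal with imbalanced classes, and the Loan Eligibility highlights mention class weighting. The sketch below illustrates the idea with the inverse-frequency heuristic `n_samples / (n_classes * class_count)`, the same formula scikit-learn uses for `class_weight="balanced"`. Whether the project used exactly this weighting is an assumption, and the labels are toy data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    weight(c) = n_samples / (n_classes * count(c))
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (hypothetical): 8 majority-class rows, 2 minority-class rows.
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

The minority class receives the larger weight, so misclassifying it costs more during training.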

3. Advertising

  • Criteo Advertising CATE Estimation

    • Description: Implemented Conditional Average Treatment Effect (CATE) estimation using the Criteo Uplift Modeling Dataset to predict and optimize user responsiveness to advertising campaigns.
    • Tech Stack: Python, EconML, NumPy, Pandas, Scikit-learn, Matplotlib.
    • Key Highlights: Used causal inference techniques like Double Machine Learning and meta-learners, generated Qini and uplift curves for evaluation, and provided actionable insights for ad campaign targeting and budget optimization.
  • Budget Optimization for Marketing Channels

    • Description: Developed an optimization model to allocate marketing budgets effectively across multiple channels, maximizing returns based on historical performance data.
    • Tech Stack: Python, GEKKO, Pandas, Matplotlib.
    • Key Highlights: Modeled multi-channel budget constraints, maximized campaign ROI, handled complex channel-wise constraints, and provided actionable budget recommendations.
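
The CATE project evaluates its models with Qini curves. As a rough illustration, the sketch below computes one common form of the curve: users are ranked by predicted uplift, and at each cutoff the treated responses are compared with the control responses rescaled to the treated group size. The function name `qini_curve` and the toy data are assumptions, and the project may compute the curve differently.

```python
def qini_curve(uplift_scores, treatment, outcome):
    """Cumulative incremental responses when targeting users by predicted uplift.

    After the top-k users: Q(k) = Y_t(k) - Y_c(k) * N_t(k) / N_c(k),
    where Y_* are summed outcomes and N_* are group sizes within the top k.
    """
    order = sorted(
        range(len(uplift_scores)),
        key=lambda i: uplift_scores[i],
        reverse=True,
    )
    n_t = n_c = 0  # treated / control users seen so far
    y_t = y_c = 0  # summed outcomes in each group so far
    curve = [0.0]
    for i in order:
        if treatment[i] == 1:
            n_t += 1
            y_t += outcome[i]
        else:
            n_c += 1
            y_c += outcome[i]
        # With no control users yet, there is nothing to subtract.
        control_term = y_c * n_t / n_c if n_c > 0 else 0.0
        curve.append(y_t - control_term)
    return curve


# Toy data (hypothetical): predicted uplift, treatment flag, observed conversion.
scores = [0.9, 0.8, 0.3, 0.1]
treated = [1, 0, 1, 0]
converted = [1, 0, 0, 1]
print(qini_curve(scores, treated, converted))  # [0.0, 1.0, 1.0, 1.0, 0.0]
```

A curve that rises early and flattens later suggests the model ranks the most persuadable users first.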

4. Retail

  • Power BI Dashboard: Amazon Prime Videos

    • Description: Designed and developed an interactive Power BI dashboard to analyze key performance indicators (KPIs) for Amazon Prime. The dashboard provides actionable insights into subscription trends, customer engagement, and service performance.
    • Tech Stack: Power BI
    • Key Highlights: Implemented advanced DAX measures for dynamic filtering and drill-through analysis. Created interactive charts and graphs for in-depth analysis of regional trends, customer segments, and content preferences.
  • Avocado Price Prediction

    • Description: Developed a machine learning-based approach to forecast avocado prices and trends, aiding strategic decision-making for a farming company. The project analyzed seasonality, regional variations, and sales trends to optimize pricing and inventory strategies.
    • Tech Stack: Python, Pandas, NumPy, Matplotlib, Seaborn, Statsmodels, Prophet.
    • Key Highlights: Compared ARIMA and Prophet models for price forecasting, with Prophet outperforming in capturing seasonality and long-term trends. Delivered actionable insights for dynamic pricing, promotion timing, and regional sales optimization to boost profitability.
  • Bigmart Sales Prediction

    • Description: Developed a machine learning model to predict sales for Bigmart stores, helping the company optimize inventory management and promotional strategies. The model uses historical sales data along with product and store-specific features to accurately forecast future sales and improve operational efficiency.
    • Tech Stack: Python, Pandas, Scikit-learn, XGBoost, LightGBM, Matplotlib, Seaborn.
    • Key Highlights: Cleaned and preprocessed sales data, performed exploratory data analysis (EDA), explored various machine learning models including Random Forest, Gradient Boosting, and Neural Networks, evaluated model performance using metrics like R-squared, MAE, and RMSE, and selected the best-performing model for sales prediction.
  • Text Classification with LSTM

    • Description: Developed a Long Short-Term Memory (LSTM) model to classify text into predefined categories. The model processes sequential data and automatically classifies text for applications like sentiment analysis, spam detection, or content categorization. By leveraging LSTM's ability to capture dependencies in text data, the project aims to automate text classification tasks with high accuracy.
    • Tech Stack: Python, TensorFlow, Keras, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn.
    • Key Highlights: Preprocessed text data using tokenization and padding, built an LSTM-based model for classification, evaluated the model using accuracy, precision, recall, and F1 score, and performed hyperparameter tuning to optimize model performance.
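
The LSTM project preprocesses text with tokenization and padding. The sketch below shows what those two steps do, using plain Python instead of the Keras utilities so the mechanics stay visible. Whitespace tokenization, the `<pad>` and `<unk>` tokens, padding at the end of each sequence, and the function names are all assumptions made for this illustration.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct lowercase token."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad(texts, vocab, max_len):
    """Convert texts to id sequences of exactly max_len entries."""
    sequences = []
    for text in texts:
        ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
        ids = ids[:max_len]  # truncate long texts
        ids += [vocab["<pad>"]] * (max_len - len(ids))  # pad short texts
        sequences.append(ids)
    return sequences


# Toy corpus (hypothetical).
train_texts = ["great product", "not great at all"]
vocab = build_vocab(train_texts)
encoded = encode_and_pad(["great product", "really not great"], vocab, max_len=4)
print(encoded)  # [[2, 3, 0, 0], [1, 4, 2, 0]]
```

Fixed-length sequences let the model process texts in batches, and the unknown-token id handles words that never appeared in training.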

5. Logistics

  • Shipment Multilabel Classification
    • Description: Developed a machine learning model to predict the optimal mode of transport for shipments, enabling businesses to optimize logistics operations. The model uses shipment data including product types, distances, and destinations to determine the most suitable transport method, improving delivery efficiency, reducing costs, and enhancing customer satisfaction.
    • Tech Stack: Python, Pandas, Scikit-learn, LightGBM, XGBoost, Keras, TensorFlow, Matplotlib, Seaborn.
    • Key Highlights: Explored and preprocessed the dataset, addressing missing values and encoding categorical features. Implemented multiple multilabel classification approaches such as naive independent models, classifier chains, natively multilabel models, and the multilabel-to-multiclass approach. Evaluated model performance using precision, recall, F1 score, and AUC, and optimized the model selection for accurate prediction of shipment modes.
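
Of the multilabel approaches listed above, the multilabel-to-multiclass approach is the easiest to show in a few lines: every distinct combination of labels becomes a single class, so any ordinary multiclass classifier can be trained on the result. The sketch below is an illustration under assumptions; the function names and the transport labels are hypothetical and not taken from the project's dataset.

```python
def to_multiclass(label_sets):
    """Map each distinct combination of labels to one class id."""
    combo_to_class = {}
    classes = []
    for labels in label_sets:
        combo = tuple(sorted(labels))  # order-independent key
        classes.append(combo_to_class.setdefault(combo, len(combo_to_class)))
    return classes, combo_to_class


def to_multilabel(class_ids, combo_to_class):
    """Invert the mapping: turn predicted class ids back into label sets."""
    class_to_combo = {cid: combo for combo, cid in combo_to_class.items()}
    return [set(class_to_combo[cid]) for cid in class_ids]


# Toy targets (hypothetical transport modes per shipment).
shipments = [{"air"}, {"road", "rail"}, {"air"}, {"rail", "road"}]
classes, mapping = to_multiclass(shipments)
print(classes)  # [0, 1, 0, 1]
print(mapping)  # {('air',): 0, ('rail', 'road'): 1}
```

The trade-off is that the number of classes grows with the number of observed label combinations, and combinations never seen in training cannot be predicted.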

6. Public Sector

  • CNN Image Classification
    • Description: Developed a Convolutional Neural Network (CNN) to classify images into predefined categories. The model is designed to automate image sorting, product tagging, and object recognition tasks. It uses deep learning techniques to efficiently classify images with high accuracy.
    • Tech Stack: Python, TensorFlow, Keras, NumPy, Matplotlib, Seaborn.
    • Key Highlights: Preprocessed image data, built and trained a CNN model using Keras, implemented model evaluation with accuracy, precision, and recall, fine-tuned the model using hyperparameter optimization, and visualized model performance through confusion matrices and ROC curves.

7. Health

  • Medical Embeddings for Clinical Trial Data
    • Description: Developed custom word embeddings for clinical trial data to enhance information retrieval and improve decision-making in the healthcare field. The model captures semantic relationships between medical terms, enabling the creation of a search engine that retrieves relevant clinical trials based on textual queries.
    • Tech Stack: Python, Gensim, TensorFlow, Keras, Pandas, NumPy, Matplotlib, Seaborn.
    • Key Highlights: Preprocessed clinical trial data, trained Word2Vec and FastText models to generate medical embeddings, built a search engine using cosine similarity, and evaluated model performance based on query relevance and accuracy in retrieving the most similar trials.
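
The search engine described above ranks trials by cosine similarity between embeddings. The sketch below illustrates that ranking step with tiny hand-written vectors standing in for trained Word2Vec or FastText embeddings. Averaging word vectors to embed a text, the function names, and all data values are assumptions made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def embed(text, word_vectors):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def search(query, documents, word_vectors, top_n=2):
    """Return the titles of the top_n documents most similar to the query."""
    query_vec = embed(query, word_vectors)
    if query_vec is None:
        return []
    scored = []
    for title, text in documents.items():
        doc_vec = embed(text, word_vectors)
        if doc_vec is not None:
            scored.append((cosine_similarity(query_vec, doc_vec), title))
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_n]]


# Toy 2-dimensional embeddings and trial texts (hypothetical).
word_vectors = {
    "diabetes": [1.0, 0.0],
    "insulin": [0.9, 0.1],
    "asthma": [0.0, 1.0],
    "inhaler": [0.1, 0.9],
}
trials = {"Trial A": "insulin diabetes", "Trial B": "asthma inhaler"}
print(search("diabetes", trials, word_vectors, top_n=1))  # ['Trial A']
```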

Projects by Tasks

1. Causal Inference

  • Criteo Advertising CATE Estimation
    • Description: Implemented Conditional Average Treatment Effect (CATE) estimation using the Criteo Uplift Modeling Dataset to predict and optimize user responsiveness to advertising campaigns.
    • Tech Stack: Python, EconML, NumPy, Pandas, Scikit-learn, Matplotlib.
    • Key Highlights: Used causal inference techniques like Double Machine Learning and meta-learners, generated Qini and uplift curves for evaluation, and provided actionable insights for ad campaign targeting and budget optimization.

2. Classification

  • Attention-based NLP Classification

    • Description: Developed a text classification model leveraging attention mechanisms to classify consumer complaints into financial product categories.
    • Tech Stack: Python, PyTorch, GloVe, NLTK, Scikit-learn.
    • Key Highlights: Preprocessed complaint narratives, used pre-trained GloVe embeddings for text encoding, implemented a bidirectional LSTM with an attention mechanism, and achieved high classification accuracy with interpretable attention weights.
  • Customer Churn Prediction Analysis

    • Description: Developed an ensemble-based machine learning model to predict customer churn for a bank, enabling proactive retention strategies to reduce revenue loss.
    • Tech Stack: Python, Scikit-learn, XGBoost, Random Forest, Matplotlib, Seaborn.
    • Key Highlights: Performed EDA to uncover churn drivers, engineered features that capture customer behavior, achieved high recall and precision in identifying at-risk customers, and proposed actionable retention strategies to reduce the churn rate.
  • Loan Eligibility Prediction Analysis

    • Description: Built a machine learning model to predict loan eligibility for applicants based on financial and demographic data. The project aimed to assist financial institutions in streamlining loan approval processes and improving decision-making accuracy.
    • Tech Stack: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, XGBoost.
    • Key Highlights: Predicted loan eligibility using classification models, incorporating Gradient Boosting for improved accuracy. Addressed class imbalance with techniques such as class weighting, and used feature engineering to improve model robustness and reliability.
  • Credit Card Default Prediction

    • Description: Built a machine learning model to predict credit card default risk using borrower data, enabling banks to identify high-risk customers and reduce lending risks. The model incorporates personal and financial history details for accurate forecasting of payment delinquencies.
    • Tech Stack: Python, Pandas, Scikit-learn, LightGBM, XGBoost, Keras, TensorFlow, Matplotlib, Seaborn.
    • Key Highlights: Cleaned and preprocessed imbalanced datasets, performed exploratory data analysis (EDA), implemented multiple machine learning models, evaluated model performance using precision, recall, and AUC, and optimized model selection for high-risk customer prediction.
  • Shipment Multilabel Classification

    • Description: Developed a machine learning model to predict the optimal mode of transport for shipments, enabling businesses to optimize logistics operations. The model uses shipment data including product types, distances, and destinations to determine the most suitable transport method, improving delivery efficiency, reducing costs, and enhancing customer satisfaction.
    • Tech Stack: Python, Pandas, Scikit-learn, LightGBM, XGBoost, Keras, TensorFlow, Matplotlib, Seaborn.
    • Key Highlights: Explored and preprocessed the dataset, addressing missing values and encoding categorical features. Implemented multiple multilabel classification approaches such as naive independent models, classifier chains, natively multilabel models, and the multilabel-to-multiclass approach. Evaluated model performance using precision, recall, F1 score, and AUC, and optimized the model selection for accurate prediction of shipment modes.
  • Text Classification with LSTM

    • Description: Developed a Long Short-Term Memory (LSTM) model to classify text into predefined categories. The model processes sequential data and automatically classifies text for applications like sentiment analysis, spam detection, or content categorization. By leveraging LSTM's ability to capture dependencies in text data, the project aims to automate text classification tasks with high accuracy.
    • Tech Stack: Python, TensorFlow, Keras, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn.
    • Key Highlights: Preprocessed text data using tokenization and padding, built an LSTM-based model for classification, evaluated the model using accuracy, precision, recall, and F1 score, and performed hyperparameter tuning to optimize model performance.

3. Image Classification

  • CNN Image Classification
    • Description: Developed a Convolutional Neural Network (CNN) to classify images into predefined categories. The model is designed to automate image sorting, product tagging, and object recognition tasks. It uses deep learning techniques to efficiently classify images with high accuracy.
    • Tech Stack: Python, TensorFlow, Keras, NumPy, Matplotlib, Seaborn.
    • Key Highlights: Preprocessed image data, built and trained a CNN model using Keras, implemented model evaluation with accuracy, precision, and recall, fine-tuned the model using hyperparameter optimization, and visualized model performance through confusion matrices and ROC curves.

4. Prediction

  • Avocado Price Prediction

    • Description: Developed a machine learning-based approach to forecast avocado prices and trends, aiding strategic decision-making for a farming company. The project analyzed seasonality, regional variations, and sales trends to optimize pricing and inventory strategies.
    • Tech Stack: Python, Pandas, NumPy, Matplotlib, Seaborn, Statsmodels, Prophet.
    • Key Highlights: Compared ARIMA and Prophet models for price forecasting, with Prophet outperforming in capturing seasonality and long-term trends. Delivered actionable insights for dynamic pricing, promotion timing, and regional sales optimization to boost profitability.
  • Bigmart Sales Prediction

    • Description: Developed a machine learning model to predict sales for Bigmart stores, helping the company optimize inventory management and promotional strategies. The model uses historical sales data along with product and store-specific features to accurately forecast future sales and improve operational efficiency.
    • Tech Stack: Python, Pandas, Scikit-learn, XGBoost, LightGBM, Matplotlib, Seaborn.
    • Key Highlights: Cleaned and preprocessed sales data, performed exploratory data analysis (EDA), explored various machine learning models including Random Forest, Gradient Boosting, and Neural Networks, evaluated model performance using metrics like R-squared, MAE, and RMSE, and selected the best-performing model for sales prediction.

5. Recommendation Systems

  • Expedia Hotel Recommendation System
    • Description: Developed a machine learning model to predict which of 100 hotel clusters a user is most likely to book, based on Expedia's customer interaction data. The system enhances personalized hotel recommendations, improving user satisfaction and platform engagement.
    • Tech Stack: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, XGBoost.
    • Key Highlights: Predicted hotel clusters using classification models to recommend contextually relevant hotels. Employed clustering to group similar hotels and handle sparse data for newly listed hotels.

6. Information Retrieval

  • Medical Embeddings for Clinical Trial Data
    • Description: Developed custom word embeddings for clinical trial data to enhance information retrieval and improve decision-making in the healthcare field. The model captures semantic relationships between medical terms, enabling the creation of a search engine that retrieves relevant clinical trials based on textual queries.
    • Tech Stack: Python, Gensim, TensorFlow, Keras, Pandas, NumPy, Matplotlib, Seaborn.
    • Key Highlights: Preprocessed clinical trial data, trained Word2Vec and FastText models to generate medical embeddings, built a search engine using cosine similarity, and evaluated model performance based on query relevance and accuracy in retrieving the most similar trials.

7. Optimization

  • Budget Optimization for Marketing Channels
    • Description: Developed an optimization model to allocate marketing budgets effectively across multiple channels, maximizing returns based on historical performance data.
    • Tech Stack: Python, GEKKO, Pandas, Matplotlib.
    • Key Highlights: Modeled multi-channel budget constraints, maximized campaign ROI, handled complex channel-wise constraints, and provided actionable budget recommendations.

8. Visualization

  • Power BI Dashboard: Amazon Prime Videos
    • Description: Designed and developed an interactive Power BI dashboard to analyze key performance indicators (KPIs) for Amazon Prime. The dashboard provides actionable insights into subscription trends, customer engagement, and service performance.
    • Tech Stack: Power BI
    • Key Highlights: Implemented advanced DAX measures for dynamic filtering and drill-through analysis. Created interactive charts and graphs for in-depth analysis of regional trends, customer segments, and content preferences.
