This repository uses publicly available datasets from the awesome-public-datasets GitHub repository (https://github.com/awesomedata/awesome-public-datasets/tree/master/Datasets) to perform data exploration and preprocessing, implement machine learning models, and visualize the results, using Python only.
Dataset 1: Titanic Dataset Analysis and Machine Learning

This repository contains code for analyzing the Titanic dataset and implementing machine learning models to predict passenger survival. The Titanic dataset is a famous dataset widely used for data analysis and machine learning tasks. It contains information about passengers on the Titanic, including their age, sex, class, fare, and survival status.

Dataset Selection Rationale: The Titanic dataset is chosen for analysis due to its popularity and the rich information it provides about the passengers. It is an excellent dataset for practicing data preprocessing, exploratory data analysis (EDA), and implementing machine learning algorithms.
Dataset 2: Scorecard

This dataset is provided in two forms:
- Scorecard.csv, a single CSV file with all years' data combined. In it, categorical variables represented by integer keys in the original data have been converted to their labels, and a Year column has been added.
- database.sqlite, a SQLite database containing a single Scorecard table that holds the same information as Scorecard.csv.
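Since the CSV file and the SQLite database are described as holding the same information, a quick consistency check can be useful before analysis. The following is a minimal sketch using only the Python standard library. The helper names (`count_csv_rows`, `count_table_rows`, `rows_per_year`) are hypothetical and not part of this repository; the file paths are whatever locations you saved the two files to.

```python
import csv
import sqlite3
from contextlib import closing


def count_csv_rows(csv_path):
    """Count data rows (excluding the header) in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def count_table_rows(db_path, table="Scorecard"):
    """Count rows in one table of a SQLite database."""
    with closing(sqlite3.connect(db_path)) as conn:
        # Table names cannot be bound as parameters, so quote the identifier.
        query = f'SELECT COUNT(*) FROM "{table}"'
        return conn.execute(query).fetchone()[0]


def rows_per_year(db_path):
    """Return (Year, row count) pairs from the Scorecard table, sorted by Year."""
    with closing(sqlite3.connect(db_path)) as conn:
        query = "SELECT Year, COUNT(*) FROM Scorecard GROUP BY Year ORDER BY Year"
        return conn.execute(query).fetchall()
```

If both files are intact, `count_csv_rows("Scorecard.csv")` and `count_table_rows("database.sqlite")` should return the same number, and `rows_per_year("database.sqlite")` gives a quick view of how the rows are distributed across years.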
Instructions for Running the Code: To run the code and perform the analysis, follow these steps:
- Clone this repository to your local machine.
- Make sure you have the required Python packages installed. If not, you can install them using pip: `pip install rpy2 pandas matplotlib seaborn scikit-learn`
- Download the Titanic dataset from Kaggle (https://www.kaggle.com/c/titanic/data) and save it as "titanic.csv" in the "Python_Assignment10" folder.
- Open the "Assignment10.py" file in your preferred Python IDE or text editor.
- Execute the code in your Python environment. The code will read the dataset, perform data preprocessing, implement two machine learning models (Logistic Regression and Random Forest Classifier), and print the accuracy and confusion matrix for each model.
- After the code finishes, a heatmap visualization shows the correlation between features in the dataset.

Note: Ensure that both the R and Python environments are set up correctly. The code uses the "rpy2" package to interact with R and perform some operations in R. If you encounter issues with the R environment setup or R packages, refer to the relevant documentation.
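The workflow in the steps above (preprocessing, two classifiers, accuracy and confusion matrix, correlation heatmap) can be sketched roughly as follows. This is an illustrative outline, not the contents of "Assignment10.py". It assumes the column names from the Kaggle file (`Pclass`, `Sex`, `Age`, `Fare`, `Survived`), fills missing values with the median, and uses an 80/20 split; the function names are hypothetical, and the R/rpy2 portion is not covered here.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumed column names, taken from the Kaggle Titanic file.
FEATURES = ["Pclass", "Sex", "Age", "Fare"]
TARGET = "Survived"


def preprocess(df):
    """Return (X, y) with missing Age/Fare filled and Sex encoded as 0/1."""
    data = df[FEATURES + [TARGET]].copy()
    data["Age"] = data["Age"].fillna(data["Age"].median())
    data["Fare"] = data["Fare"].fillna(data["Fare"].median())
    data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
    return data[FEATURES], data[TARGET]


def evaluate_models(df, test_size=0.2, random_state=42):
    """Train both classifiers and return accuracy and confusion matrix for each."""
    X, y = preprocess(df)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, predictions),
            "confusion_matrix": confusion_matrix(y_test, predictions),
        }
    return results


def plot_correlation_heatmap(df):
    """Draw a heatmap of correlations between the numeric columns."""
    # Imported here so the modelling functions work without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    correlations = df.select_dtypes("number").corr()
    sns.heatmap(correlations, annot=True, cmap="coolwarm")
    plt.title("Feature correlation")
    plt.show()
```

A typical run would load the file with `df = pd.read_csv("titanic.csv")`, call `evaluate_models(df)` and print each model's accuracy and confusion matrix, then call `plot_correlation_heatmap(df)`.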