GitHub - JamesMcCullochDickens/Machine-Learning-Project: CSI 5155 Machine Learning Final Project

Author: James Dickens, s.n. 7118781

Final Project - CSI 5155: Machine Learning, Course taught by Dr. Herna Viktor.

This is my code for a binary classification task on the data available at http://archive.ics.uci.edu/ml/datasets/Census-Income+(KDD), which consists of weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. The goal is to evaluate five commonly used machine learning models (including a semi-supervised neural network!) on classifying whether a given instance makes more than 50K a year.

My code is organized as follows:

Preprocess.py takes the initial census-income.data file and census-income.test file and

prints information about the data and its attributes
removes duplicates from the training data
deals with instance weight conflicts
replaces missing values with their defaults
writes the result to files: 'census-income.data/training_data_preprocess1', 'census-income.test/testing_data_preprocess1'
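
As a rough illustration of the duplicate-removal and missing-value steps listed above, here is a minimal standard-library sketch. It is not the code in Preprocess.py: the "?" missing-value marker, the choice of the most frequent value as the "default", and the function names are all assumptions made for the example.

```python
from collections import Counter

MISSING = "?"  # assumed missing-value marker; the real files may differ


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing(rows):
    """Replace MISSING with the most frequent non-missing value of its column.

    Using the column mode as the default is an assumption of this sketch.
    """
    if not rows:
        return []
    defaults = []
    for col in range(len(rows[0])):
        counts = Counter(r[col] for r in rows if r[col] != MISSING)
        defaults.append(counts.most_common(1)[0][0] if counts else MISSING)
    return [
        [defaults[col] if value == MISSING else value for col, value in enumerate(row)]
        for row in rows
    ]


if __name__ == "__main__":
    sample = [["Private", "1"], ["Private", "1"], ["?", "2"], ["Private", "2"]]
    print(fill_missing(drop_duplicates(sample)))
```

Duplicates are dropped before defaults are computed so that repeated rows do not skew the column counts.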

Preprocess2.py

eliminates certain features
simplifies certain features by using binning
writes the result to the files: 'census-income.data/training_data_preprocess2', 'census-income.test/testing_data_preprocess2'
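
Binning, as mentioned above, maps a numeric feature onto a small set of ranges. The sketch below shows one way to do that with the standard library; the age boundaries and bin labels are hypothetical and are not taken from Preprocess2.py.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels, chosen only for illustration.
AGE_EDGES = [18, 35, 65]
AGE_LABELS = ["child", "young_adult", "adult", "senior"]


def bin_value(value, edges, labels):
    """Return the label of the bin containing value.

    A value equal to an edge falls into the bin that starts at that edge.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


if __name__ == "__main__":
    for age in (17, 18, 40, 65):
        print(age, bin_value(age, AGE_EDGES, AGE_LABELS))
```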

PreProcess3.py

applies one-hot encoding to the categorical attributes: class of worker, education, enrolled education, married, race, sex, employment status, and tax filer status
applies feature calibration to the occupation code and industry code
binarizes sex, country of birth of parents, country of birth of the person, and income category
writes the result to the files: 'census-income.data/train_data', 'census-income.test/test_data'
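
To make the one-hot encoding and binarization steps above concrete, here is a small standard-library sketch. The function names and the sample category strings are placeholders for illustration, not the values or code used in the repository.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns the sorted category list and one 0/1 vector per input value.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


def binarize(values, positive):
    """Map values to 1 when equal to `positive`, else 0."""
    return [1 if value == positive else 0 for value in values]


if __name__ == "__main__":
    # Placeholder category strings, not the dataset's actual labels.
    cats, vectors = one_hot(["married", "single", "married"])
    print(cats, vectors)
    print(binarize(["over_50k", "under_50k", "over_50k"], positive="over_50k"))
```

Sorting the categories keeps the column order stable between the training and test files, provided both contain the same set of values.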

Models.py

Trains and evaluates the five models on the training/testing sets and prints evaluation metrics
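
This README does not say which metrics Models.py prints. As a hedged illustration of what evaluating a binary classifier typically involves, the sketch below computes accuracy, precision, recall, and F1 from true and predicted labels; the function name and the 0/1 label convention are assumptions of the example.

```python
def binary_metrics(y_true, y_pred):
    """Compute common binary-classification metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

On an imbalanced target such as income above 50K, precision and recall on the positive class are usually more informative than accuracy alone.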

For convenience, I have included Weka attribute files for the full original dataset in the file WekaFullAttributeList.txt, as well as a reduced feature list, as per the output of Preprocess2.py, in the file ReducedFeatureListWeka.txt.

The rules output by the decision tree are written to the file Tree Rules.txt, and the rules output by the Skope Rules classifier are available in the file Skope Rules.txt.

Sample output from the models program is available in the file Sample Output.txt.

Thanks for stopping by!


JamesMcCullochDickens/Machine-Learning-Project

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages