Data & Notebooks — ML / Spark Examples

This repository is a collection of datasets, example scripts and Jupyter notebooks for learning data processing, Spark-based text exercises, and machine learning experiments. It is intended as a personal learning workspace and a reference for common data-science exercises.

Contents (high level)

Datasets: CSV files used across notebooks (e.g. DailyDelhiClimateTrain.csv, homeprices.csv, Titanic.csv).
Notebooks: interactive notebooks for tutorials and experiments (e.g. Spark_Text_Exercises.ipynb, LSTM_Daily_Climate_Forecasting.ipynb).
Scripts: helper and runnable scripts (e.g. spark_text_lab.py, setup_emr.py, commit_each_file.ps1).

See the repository root for the full file list.

Quick setup

Create and activate a Python virtual environment, then install requirements:

python -m venv .venv
.\.venv\Scripts\Activate.ps1    # PowerShell on Windows
source .venv/bin/activate       # bash/zsh on macOS or Linux (use instead of the line above)
pip install -r requirements.txt
python -m nltk.downloader punkt    # optional: for text examples

Run a sample script (some examples use PySpark, which runs in local mode by default):

python spark_text_lab.py --corpus sample_corpus.txt
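The internals of spark_text_lab.py are not reproduced in this README. As a rough illustration of what a local-mode PySpark text job looks like, here is a minimal word-count sketch. The function names (tokenize, top_words) and the app name are hypothetical and are not taken from the script; only the corpus file name comes from the command above.

```python
import re
from operator import add


def tokenize(line):
    """Lowercase a line and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())


def top_words(corpus_path, n=10):
    """Return the n most frequent words in corpus_path as (word, count) pairs."""
    # Imported lazily so tokenize() stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")            # local mode: use all cores on this machine
        .appName("word-count-sketch")  # hypothetical app name
        .getOrCreate()
    )
    try:
        counts = (
            spark.sparkContext.textFile(corpus_path)
            .flatMap(tokenize)
            .map(lambda word: (word, 1))
            .reduceByKey(add)
        )
        # takeOrdered sorts ascending, so negate the count for a descending top-n.
        return counts.takeOrdered(n, key=lambda pair: -pair[1])
    finally:
        spark.stop()
```

With PySpark installed, calling top_words("sample_corpus.txt") should return the ten most frequent tokens in the sample corpus. Keeping the tokenizer as a plain function makes it easy to test without starting a Spark session.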

Open notebooks in Jupyter Lab/Notebook:

jupyter lab

Best practices

Do not commit secrets: add sensitive files (for example .env) to .gitignore before committing.
Keep large binary files and heavy datasets out of git; use external storage or Git LFS when needed.
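Both practices above can be checked mechanically before a push. The following is a minimal sketch, not a tool shipped with this repository: it lists the files git tracks and flags any that look like secrets files or exceed a size threshold. The set of sensitive names, the 50 MiB threshold, and the helper names are all assumptions chosen for illustration.

```python
import subprocess
from pathlib import Path

SENSITIVE_NAMES = {".env"}       # illustrative; extend for your own secrets files
MAX_BYTES = 50 * 1024 * 1024     # hypothetical 50 MiB threshold


def tracked_files(repo="."):
    """List the files git currently tracks in repo."""
    out = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    # -z separates paths with NUL bytes, which is safe for names with spaces.
    return [Path(repo) / name for name in out.split("\0") if name]


def flag_risky(paths, max_bytes=MAX_BYTES):
    """Return (path, reason) pairs for files that look sensitive or oversized."""
    findings = []
    for path in paths:
        path = Path(path)
        if path.name in SENSITIVE_NAMES:
            findings.append((path, "possible secrets file"))
        elif path.is_file() and path.stat().st_size > max_bytes:
            findings.append((path, f"larger than {max_bytes} bytes"))
    return findings
```

A typical use would be flag_risky(tracked_files()) run from the repository root; an empty list means nothing was flagged under these assumptions.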

Automating single-file commits

The repository includes commit_each_file.ps1, a helper PowerShell script that can commit and push files individually with pre-written messages. Use the -DryRun flag to preview actions before executing.
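The PowerShell script itself is not shown here. To illustrate the general dry-run pattern it describes, here is a small Python sketch of the same idea: build the git commands for each file first, then either print them or execute them. The function names and the exact git invocations are assumptions for illustration and may differ from what commit_each_file.ps1 actually runs.

```python
import shlex
import subprocess


def plan_commits(messages):
    """Build one add/commit/push command triple per file.

    messages maps a file path to its pre-written commit message.
    """
    plan = []
    for path, message in messages.items():
        plan.append(["git", "add", "--", path])
        plan.append(["git", "commit", "-m", message, "--", path])
        plan.append(["git", "push"])
    return plan


def run(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(("DRY RUN: " if dry_run else "") + shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Separating planning from execution means the preview and the real run share the same command list, so the dry run cannot drift from what is eventually executed.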

Contributing

This repo is primarily a personal/learning collection. If you want to contribute or suggest edits, open a PR with small, focused commits and a clear description of the change.

Notes

Notebooks are stored as JSON and can contain large embedded outputs, which makes plain text diffs noisy; use a notebook-aware diff tool (e.g. nbdime) when reviewing changes.
Some examples require local services (e.g. MongoDB) or additional setup; see notebook headers for per-example requirements.
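To see how much of a notebook's size comes from stored outputs rather than source, a standard-library sketch like the one below can help. It assumes the nbformat 4 layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"); strip_outputs and json_size are hypothetical helpers and are not part of nbdime or this repository.

```python
import json


def strip_outputs(notebook):
    """Return a copy of an nbformat-4 notebook dict with code outputs removed."""
    stripped = json.loads(json.dumps(notebook))  # deep copy via JSON round-trip
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def json_size(notebook):
    """Size in characters of the notebook when serialized as JSON."""
    return len(json.dumps(notebook))
```

Loading a notebook with json.load and comparing json_size before and after strip_outputs gives a quick estimate of how much of a diff is likely to be output noise.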


