Spark Text Lab
Quick start
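- Create a Python virtualenv and install dependencies (the activation command shown is for Windows; on macOS or Linux, use source venv/bin/activate instead):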
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python -m nltk.downloader punkt
- Start MongoDB locally (or use a remote URI). Update MONGO_URI in the script or pass it as an argument.
- Run a quick test with the sample corpus:
python spark_text_lab.py --corpus sample_corpus.txt --mongo-uri mongodb://localhost:27017
Files
- spark_text_lab.py: main script with implementations for Exercises 1-6.
- sample_corpus.txt: small sample corpus for quick testing.
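The quick-start command passes --corpus and --mongo-uri to spark_text_lab.py. As a rough illustration of how a script might accept those options and fall back to a MONGO_URI default, here is a minimal sketch using argparse. The function name parse_args and the default value are assumptions for illustration; the actual script may handle its arguments differently.

```python
import argparse

# Assumed module-level default. The README only says a MONGO_URI value
# exists in the script; this particular value is illustrative.
MONGO_URI = "mongodb://localhost:27017"


def parse_args(argv=None):
    """Parse the two options used in the quick-start command (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Spark Text Lab (illustrative CLI sketch)"
    )
    parser.add_argument(
        "--corpus", required=True, help="Path to the input text corpus"
    )
    parser.add_argument(
        "--mongo-uri",
        default=MONGO_URI,
        help="MongoDB connection URI (defaults to MONGO_URI)",
    )
    # With argv=None, argparse reads the options from sys.argv[1:].
    return parser.parse_args(argv)
```

Calling parse_args(["--corpus", "sample_corpus.txt"]) would yield the default URI, while adding --mongo-uri overrides it without editing the script.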
Notes
- The notebook Spark_Text_Exercises.ipynb shows step-by-step usage (created next).
- The script uses PySpark in local mode; to run on a cluster, adjust the SparkSession builder settings.
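To illustrate the local-versus-cluster note, the sketch below shows one way the SparkSession builder's master setting could be switched. The helper names resolve_master and build_session, the app name, and the example cluster URL are all assumptions; the script's own builder code may look different.

```python
from typing import Optional


def resolve_master(cluster_url: Optional[str] = None) -> str:
    """Return the Spark master URL: local mode unless a cluster URL is given."""
    return cluster_url if cluster_url else "local[*]"


def build_session(cluster_url: Optional[str] = None, app_name: str = "SparkTextLab"):
    """Create (or reuse) a SparkSession for the resolved master (hypothetical helper)."""
    # Imported lazily so resolve_master can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(resolve_master(cluster_url))
        .appName(app_name)
        .getOrCreate()
    )
```

With no argument, build_session() would run in local mode on all available cores; passing something like "spark://host:7077" (a placeholder standalone-cluster URL) would target a cluster instead.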