Spark Text Lab

Quick start

  1. Create a Python virtualenv and install dependencies (the activation command below is for Windows; on macOS/Linux use source venv/bin/activate instead):
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python -m nltk.downloader punkt
  2. Start MongoDB locally (or use a remote URI). Update MONGO_URI in the script or pass the URI with the --mongo-uri argument.

  3. Run a quick test with the sample corpus:

python spark_text_lab.py --corpus sample_corpus.txt --mongo-uri mongodb://localhost:27017
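
The two flags in the command above could be wired up roughly as follows. This is a minimal sketch and not the script's actual code: the parse_args helper, the DEFAULT_MONGO_URI name (standing in for MONGO_URI), and the choice to make --corpus required are assumptions made for illustration.

```python
import argparse

# Assumption: stands in for the MONGO_URI constant mentioned in step 2.
DEFAULT_MONGO_URI = "mongodb://localhost:27017"


def parse_args(argv=None):
    """Parse the command-line flags shown in the quick-start command."""
    parser = argparse.ArgumentParser(description="Spark Text Lab (illustrative sketch)")
    # Assumption: the corpus path is mandatory.
    parser.add_argument("--corpus", required=True, help="path to the corpus text file")
    # Falls back to the in-script default when the flag is omitted.
    parser.add_argument("--mongo-uri", default=DEFAULT_MONGO_URI, help="MongoDB connection URI")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"corpus={args.corpus} mongo_uri={args.mongo_uri}")
```

With this layout, omitting --mongo-uri uses the local default, and passing it overrides the default without editing the script.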

Files

  • spark_text_lab.py: main script with implementations for Exercises 1-6.
  • sample_corpus.txt: small sample corpus for quick testing.

Notes

  • The notebook Spark_Text_Exercises.ipynb shows step-by-step usage (created next).
  • The script uses PySpark in local mode; to run on a cluster, adjust the SparkSession builder settings (for example, the master URL).
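
One way to make the master URL adjustable without editing the builder each time is sketched below. It is an illustration under stated assumptions, not the script's own code: the resolve_master and build_session helpers, the SPARK_MASTER environment variable, and the application name are all hypothetical.

```python
import os


def resolve_master(cli_master=None, env=os.environ):
    """Pick the Spark master URL: explicit value, then SPARK_MASTER, then local mode."""
    # Assumption: SPARK_MASTER is a hypothetical variable name chosen for this sketch.
    return cli_master or env.get("SPARK_MASTER") or "local[*]"


def build_session(app_name="spark_text_lab", master=None):
    """Create (or reuse) a SparkSession pointed at the resolved master."""
    # Imported here so resolve_master can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .master(resolve_master(master))
        .getOrCreate()
    )
```

Calling build_session() with no arguments keeps local mode, while build_session(master="spark://host:7077") or setting the environment variable targets a cluster.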