DataKRK #9 - Python for Data Science

This repository contains the notebooks and data from my talk at DataKRK.

Online versions:

Requirements

  • ipython
  • numpy
  • scipy
  • pandas
  • scikit-learn
  • matplotlib
  • ggplot

I recommend using Anaconda and installing ggplot from HEAD (the latest development version of its repository).

PySpark

To run the PySpark code, you need to set up a Spark cluster (AWS EMR has built-in support for Spark), though you can probably also run it on your local machine in standalone mode.
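For a local run, one possible approach is to reuse the environment settings from the startup script further down and swap only the `MASTER` value. The sketch below is an assumption rather than something tested for this talk: `local_mode_envs` is a hypothetical helper, and `local[N]` is Spark's master URL for running in-process with N worker threads.

```python
def local_mode_envs(cluster_envs, cores=2):
    """Return a copy of cluster_envs that targets a local master instead of YARN."""
    envs = dict(cluster_envs)  # copy, so the cluster settings stay untouched
    # "local[N]" asks Spark to run in-process with N worker threads.
    envs["MASTER"] = "local[{0}]".format(cores)
    return envs


# Hypothetical usage with a subset of the cluster settings:
cluster_envs = {
    "MASTER": "yarn-client",
    "SPARK_HOME": "/home/hadoop/spark/",
}
print(local_mode_envs(cluster_envs)["MASTER"])  # prints local[2]
```

`SPARK_HOME` would still have to point at a Spark installation on the local machine.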

Install IPython and create a new profile for PySpark:

$ ipython profile create pyspark

Edit ~/.ipython/profile_pyspark/ipython_config.py and paste:

c = get_config()

c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8743 # or whatever you want; be aware of conflicts with CDH
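Since the chosen port may clash with another service, it can help to check that it is free before starting the notebook server. This is a minimal sketch using only the standard library. `port_is_free` is a hypothetical helper, and it only checks the local interface given in `host`.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except socket.error:
        # Binding fails when another socket already holds the port.
        return False
    finally:
        sock.close()


# The result depends on what is running on this machine.
print(port_is_free(8743))
```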

Edit ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py and paste:

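import os
import sys

# Environment variables that PySpark reads at startup; adjust the paths to your cluster.
ENVS = {
    'PYSPARK_PYTHON': 'python2.7',
    'MASTER': 'yarn-client',
    'SPARK_HOME': '/home/hadoop/spark/',
    'PYSPARK_SUBMIT_ARGS': '--jars /home/hadoop/spark/lib/spark-examples-1.1.0-hadoop2.4.0.jar',
}


def set_envs():
    for key, val in ENVS.items():
        os.environ[key] = val


set_envs()
SPARK_HOME = os.environ['SPARK_HOME']
# Make the pyspark package and its bundled py4j importable.
sys.path.insert(0, os.path.join(SPARK_HOME, 'python'))
sys.path.insert(0, os.path.join(SPARK_HOME, 'python/lib/py4j-0.8.2.1-src.zip'))
# Run the PySpark shell bootstrap, which creates the SparkContext `sc`.
# execfile is Python 2 only, which matches PYSPARK_PYTHON above.
execfile(os.path.join(SPARK_HOME, 'python/pyspark/shell.py'))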
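The py4j version in the path above is tied to a specific Spark release, so the hard-coded file name breaks on other versions. One way around this, sketched here as an assumption and not part of the original setup, is to look the archive up with a glob. `find_py4j_zip` is a hypothetical helper name.

```python
import glob
import os


def find_py4j_zip(spark_home):
    """Return the path of a py4j source zip under spark_home, or None if absent."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    # If several archives match, take the last one in lexicographic order.
    return matches[-1] if matches else None
```

In the startup script, the result could replace the hard-coded path passed to `sys.path.insert`, after checking that it is not `None`.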

Run the notebook server with the new profile:

$ ipython notebook --profile=pyspark