Commit
Merge pull request #9 from KaveIO/speedups
Speedups + spark example
Showing 14 changed files with 345 additions and 69 deletions.
@@ -0,0 +1,4 @@
import multiprocessing

# number of cores to use in parallel processing
ncores = multiprocessing.cpu_count()
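For illustration only: a core count obtained this way is typically handed to a worker pool to fan work out over all CPUs. The snippet below is a generic sketch, not code from this repository, and the _square helper is hypothetical.

import multiprocessing

ncores = multiprocessing.cpu_count()

def _square(x):
    return x * x

if __name__ == '__main__':
    # spread the work over all available cores
    with multiprocessing.Pool(processes=ncores) as pool:
        results = pool.map(_square, range(10))
    print(results)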
@@ -0,0 +1,195 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Phi_K Spark tutorial\n",
    "\n",
    "This notebook shows you how to obtain the Phi_K correlation matrix for a Spark DataFrame.\n",
    "Calculating the Phi_K matrix consists of two steps:\n",
    "\n",
    "- Obtain the 2d contingency tables for all variable pairs. To make these we use the `popmon` package, which relies on the Spark `histogrammar` package.\n",
    "- Calculate the Phi_K value for each variable pair from its contingency table.\n",
    "\n",
    "Make sure the popmon package is installed; it is used to make the 2d histograms from which the Phi_K values are calculated."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "!pip install popmon"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools\n",
    "import pandas as pd\n",
    "import phik\n",
    "from phik import resources\n",
    "from phik.phik import spark_phik_matrix_from_hist2d_dict\n",
    "import popmon\n",
    "from popmon.analysis.hist_numpy import get_2dgrid\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Histogramming in popmon is done using the histogrammar library"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "spark = SparkSession.builder.config('spark.jars.packages', 'org.diana-hep:histogrammar-sparksql_2.11:1.0.4').getOrCreate()\n",
    "sc = spark.sparkContext"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load data\n",
    "\n",
    "A simulated dataset is part of the phik package. The dataset contains fake car insurance data. Load the dataset here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(resources.fixture('fake_insurance_data.csv.gz'))\n",
    "sdf = spark.createDataFrame(data)\n",
    "sdf.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "combis = itertools.combinations_with_replacement(sdf.columns, 2)\n",
    "combis = [list(c) for c in combis]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(combis)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 1: create histograms (this runs Spark histogrammar in the background)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# see the doc-string of pm_make_histograms() for binning options.\n",
    "hists = sdf.pm_make_histograms(combis)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grids = {k: get_2dgrid(h) for k, h in hists.items()}\n",
    "print(grids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we can store the 2d grids if we want to\n",
    "if False:\n",
    "    import pickle\n",
    "\n",
    "    with open('grids.pkl', 'wb') as outfile:\n",
    "        pickle.dump(grids, outfile)\n",
    "\n",
    "    with open('grids.pkl', 'rb') as handle:\n",
    "        grids = pickle.load(handle)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 2: calculate the Phi_K matrix (runs RDD parallelization over all 2d histograms)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "phik_matrix = spark_phik_matrix_from_hist2d_dict(sc, grids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "phik_matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
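For context (not part of this diff): when the data fits in memory, the same matrix can be computed directly on a pandas DataFrame through the accessor that importing phik registers, with no Spark session needed. A minimal sketch, assuming the bundled fake insurance dataset and the default auto-detection of numeric (interval) columns:

import pandas as pd
import phik  # registers the .phik_matrix() accessor on pandas DataFrames
from phik import resources

# load the simulated car insurance dataset that ships with phik
data = pd.read_csv(resources.fixture('fake_insurance_data.csv.gz'))

# compute the Phi_K correlation matrix without Spark;
# numeric columns are binned automatically before the calculation
phik_matrix = data.phik_matrix()
print(phik_matrix)

For small datasets this gives a quick cross-check of the Spark result; the RDD route in the notebook is only needed when the pairwise histogramming itself has to be distributed.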