| 1 | +================================== |
| 2 | +histogrammar Python implementation |
| 3 | +================================== |
| 4 | + |
| 5 | +histogrammar is a Python package for creating histograms. histogrammar has multiple histogram types, |
| 6 | +supports numeric and categorical features, and works with Numpy arrays and Pandas and Spark dataframes. |
| 7 | +Once a histogram is filled, it's easy to plot it, store it in JSON format (and retrieve it), or convert |
| 8 | +it to Numpy arrays for further analysis. |
| 9 | + |
| 10 | +At its core, histogrammar is a suite of data aggregation primitives designed for use in parallel processing. |
| 11 | +In the simplest case, you can use these primitives to compute histograms, but their generality |
| 12 | +allows much more. |
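To illustrate why aggregation primitives suit parallel processing, here is a minimal, self-contained sketch. The `ToyHistogram` class is hypothetical and is not part of the histogrammar API; it only shows the general idea that partial results filled on separate data partitions can be merged by addition, giving the same answer as a single pass over all the data.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ToyHistogram:
    """Hypothetical fixed-width histogram; NOT the histogrammar API."""
    num: int
    low: float
    high: float
    counts: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start with empty bins unless counts were supplied (e.g. by a merge).
        if not self.counts:
            self.counts = [0] * self.num

    def fill(self, values: Iterable[float]) -> "ToyHistogram":
        width = (self.high - self.low) / self.num
        for v in values:
            if self.low <= v < self.high:
                # Clamp to guard against floating-point rounding at the top edge.
                index = min(int((v - self.low) / width), self.num - 1)
                self.counts[index] += 1
        return self

    def __add__(self, other: "ToyHistogram") -> "ToyHistogram":
        # Merging is only meaningful when both sides share the same binning.
        if (self.num, self.low, self.high) != (other.num, other.low, other.high):
            raise ValueError("binning must match to merge")
        summed = [a + b for a, b in zip(self.counts, other.counts)]
        return ToyHistogram(self.num, self.low, self.high, summed)


# Two partitions, as if processed by two independent workers.
partition_a = [1, 5, 12, 47]
partition_b = [3, 48, 99]

merged = (ToyHistogram(10, 0, 100).fill(partition_a)
          + ToyHistogram(10, 0, 100).fill(partition_b))
single = ToyHistogram(10, 0, 100).fill(partition_a + partition_b)

# The merged partial results equal one pass over all the data.
assert merged.counts == single.counts
```

Because the merge is a plain element-wise sum, the partial histograms can be combined in any order, which is the property a parallel framework needs.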
| 13 | + |
| 14 | +Several common histogram types can be plotted in Matplotlib, Bokeh and PyROOT with a single method call. |
| 15 | +If Numpy or Pandas is available, histograms and other aggregators can be filled from arrays ten to a hundred times |
| 16 | +more quickly via Numpy commands, rather than Python for loops. If PyROOT is available, histograms and other |
| 17 | +aggregators can be filled from ROOT TTrees hundreds of times more quickly by JIT-compiling a specialized C++ filler. |
| 18 | +Histograms and other aggregators may also be converted into CUDA code for inclusion in a GPU workflow. And if |
| 19 | +PyCUDA is available, they can also be filled from Numpy arrays by JIT-compiling the CUDA code. |
| 20 | +This Python implementation of histogrammar has been tested to guarantee compatibility with its Scala implementation. |
| 21 | + |
| 22 | +Latest Python release: v1.0.20 (Feb 2021). |
| 23 | + |
| 24 | +Announcements |
| 25 | +============= |
| 26 | + |
| 27 | +Spark 3.0 |
| 28 | +--------- |
| 29 | + |
| 30 | +With Spark 3.0, based on Scala 2.12, make sure to pick up the correct histogrammar jar file: |
| 31 | + |
| 32 | +.. code-block:: python |
| 33 | +
|
| 34 | + spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11").getOrCreate() |
| 35 | +
|
| 36 | +For Spark 2.X compiled against Scala 2.11, simply replace "2.12" with "2.11" in the string above. |
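The substitution described above can be sketched as a small helper that builds the package coordinate for a given Scala version. The function name `histogrammar_spark_package` and its `release` parameter are hypothetical, introduced only for illustration; the coordinate format and the `1.0.11` release are taken from the snippet above.

```python
def histogrammar_spark_package(scala_version: str, release: str = "1.0.11") -> str:
    """Build the spark.jars.packages coordinate (hypothetical helper)."""
    # Only the two Scala versions mentioned in this README are accepted.
    if scala_version not in ("2.11", "2.12"):
        raise ValueError(f"unsupported Scala version: {scala_version}")
    return f"io.github.histogrammar:histogrammar-sparksql_{scala_version}:{release}"


spark3_package = histogrammar_spark_package("2.12")  # Spark 3.0
spark2_package = histogrammar_spark_package("2.11")  # Spark 2.X
print(spark3_package)
print(spark2_package)
```

The returned string is what would be passed as the value of `"spark.jars.packages"` in the `SparkSession.builder.config(...)` call.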
| 37 | + |
| 38 | +February, 2021 |
| 39 | + |
| 40 | +Example notebooks |
| 41 | +================= |
| 42 | + |
| 43 | +.. list-table:: |
| 44 | + :widths: 80 20 |
| 45 | + :header-rows: 1 |
| 46 | + |
| 47 | + * - Tutorial |
| 48 | + - Colab link |
| 49 | + * - `Basic tutorial <https://nbviewer.jupyter.org/github/histogrammar/histogrammar-python/blob/master/histogrammar/notebooks/histogrammar_tutorial_basic.ipynb>`_ |
| 50 | + - |notebook_basic_colab| |
| 51 | + * - `Detailed example (featuring configuration, Apache Spark and more) <https://nbviewer.jupyter.org/github/histogrammar/histogrammar-python/blob/master/histogrammar/notebooks/histogrammar_tutorial_advanced.ipynb>`_ |
| 52 | + - |notebook_advanced_colab| |
| 53 | + |
| 54 | +Documentation |
| 55 | +============= |
| 56 | + |
| 57 | +See `histogrammar-docs <https://histogrammar.github.io/histogrammar-docs/>`_ for a complete introduction to `histogrammar`. |
| 58 | +(The documentation is somewhat dated but still useful.) There you can also find documentation about the Scala implementation of `histogrammar`. |
| 59 | + |
| 60 | +Check it out |
| 61 | +============ |
| 62 | + |
| 63 | +The `histogrammar` library requires Python 3.6+ and can be installed with pip. To get started, run: |
| 64 | + |
| 65 | +.. code-block:: bash |
| 66 | +
|
| 67 | + $ pip install histogrammar |
| 68 | +
|
| 69 | +or check out the code from our GitHub repository: |
| 70 | + |
| 71 | +.. code-block:: bash |
| 72 | +
|
| 73 | + $ git clone https://github.com/histogrammar/histogrammar-python |
| 74 | + $ pip install -e histogrammar-python |
| 75 | +
|
| 76 | +In this example the code is installed in editable mode (option -e). |
| 77 | + |
| 78 | +You can now use the package in Python with: |
| 79 | + |
| 80 | +.. code-block:: python |
| 81 | +
|
| 82 | + import histogrammar |
| 83 | +
|
| 84 | +**Congratulations, you are now ready to use the histogrammar library!** |
| 85 | + |
| 86 | +Quick run |
| 87 | +========= |
| 88 | + |
| 89 | +As a quick example, you can do: |
| 90 | + |
| 91 | +.. code-block:: python |
| 92 | +
|
| 93 | + import pandas as pd |
| 94 | + import histogrammar as hg |
| 95 | + from histogrammar import resources |
| 96 | +
|
| 97 | + # open synthetic data |
| 98 | + df = pd.read_csv(resources.data('test.csv.gz'), parse_dates=['date']) |
| 99 | + df.head() |
| 100 | +
|
| 101 | + # create a histogram, tell it to look for column 'age' |
| 102 | + # fill the histogram with column 'age' and plot it |
| 103 | + hist = hg.Histogram(num=100, low=0, high=100, quantity='age') |
| 104 | + hist.fill.numpy(df) |
| 105 | + hist.plot.matplotlib() |
| 106 | +
|
| 107 | + # generate histograms of all features in the dataframe using automatic binning |
| 108 | + # (importing histogrammar automatically adds this functionality to a pandas or spark dataframe) |
| 109 | + hists = df.hg_make_histograms() |
| 110 | + print(hists.keys()) |
| 111 | +
|
| 112 | + # multi-dimensional histograms are also supported. e.g. features longitude vs latitude |
| 113 | + hists = df.hg_make_histograms(features=['longitude:latitude']) |
| 114 | + ll = hists['longitude:latitude'] |
| 115 | + ll.plot.matplotlib() |
| 116 | +
|
| 117 | + # store histogram and retrieve it again |
| 118 | + ll.toJsonFile('longitude_latitude.json') |
| 119 | + ll2 = hg.Factory().fromJsonFile('longitude_latitude.json') |
| 120 | +
|
| 121 | +
|
| 122 | +These examples also work with Spark dataframes. For more examples please see the notebooks and tutorials. |
| 123 | + |
| 124 | + |
| 125 | +Project contributors |
| 126 | +==================== |
| 127 | + |
| 128 | +This package was originally authored by DIANA-HEP and is now maintained by volunteers. |
| 129 | + |
| 130 | +Contact and support |
| 131 | +=================== |
| 132 | + |
| 133 | +* Issues & Ideas & Support: https://github.com/histogrammar/histogrammar-python/issues |
| 134 | + |
| 135 | +Please note that `histogrammar` is supported only on a best-effort basis. |
| 136 | + |
| 137 | +License |
| 138 | +======= |
| 139 | +`histogrammar` is completely free, open-source and licensed under the `Apache-2.0 license <https://en.wikipedia.org/wiki/Apache_License>`_. |
| 140 | + |
| 141 | +.. |notebook_basic_colab| image:: https://colab.research.google.com/assets/colab-badge.svg |
| 142 | + :alt: Open in Colab |
| 143 | + :target: https://colab.research.google.com/github/histogrammar/histogrammar-python/blob/master/histogrammar/notebooks/histogrammar_tutorial_basic.ipynb |
| 144 | +.. |notebook_advanced_colab| image:: https://colab.research.google.com/assets/colab-badge.svg |
| 145 | + :alt: Open in Colab |
| 146 | + :target: https://colab.research.google.com/github/histogrammar/histogrammar-python/blob/master/histogrammar/notebooks/histogrammar_tutorial_advanced.ipynb |