Large timeseries #1205

Closed
wants to merge 20 commits
Changes from 2 commits
4 changes: 4 additions & 0 deletions doc/user_guide/index.rst
@@ -89,6 +89,9 @@ rather than Matplotlib.
* `Timeseries Data <Timeseries_Data.html>`_
Using hvPlot when working with timeseries data.

* `Large Timeseries Data <visualizing_large_timeseries.html>`_
Using hvPlot when working with large timeseries data.

* `Statistical Plots <Statistical_Plots.html>`_
A number of statistical plot types modeled on the pandas.plotting module.

@@ -117,5 +120,6 @@ rather than Matplotlib.
Network Graphs <NetworkX>
Geographic Data <Geographic_Data>
Timeseries Data <Timeseries_Data>
Large Timeseries Data <visualizing_large_timeseries>
Statistical Plots <Statistical_Plots>
Pandas API <Pandas_API>
276 changes: 276 additions & 0 deletions examples/user_guide/visualizing_large_timeseries.ipynb
@@ -0,0 +1,276 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "artificial-english",
"metadata": {},
"source": [
"This notebook illustrates the various ways supported by HoloViz tools that a set of timeseries line plots can be visualized. Nearly all of these approaches will work well for small datasets, but choosing an appropriate method is crucial for being able to work naturally with larger datasets of different types and for different goals. Examples are shown using hvPlot syntax, but the results would be similar in HoloViews and (for the simplest cases) pure Bokeh syntax.\n",
"\n",
"\n",
"\n",
"## Getting the data \n",
"\n",
"Here we have a DataFrame with 1.2 million rows containing standardized data from 5 different sensors."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "arabic-container",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_parquet(\"https://datasets.holoviz.org/sensor/v1/data.parq\")\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "banned-richards",
"metadata": {},
"outputs": [],
"source": [
"df0 = df[df.sensor=='0']"
]
},
{
"cell_type": "markdown",
"id": "fourth-sentence",
"metadata": {},
"source": [
"Let's go ahead and plot this data using each approach.\n",
"\n",
"# The old way: Bokeh's custom Canvas rendering\n",
"\n",
"Prior to HoloViews version 1.16.0 (May 2023), the default hvPlot and HoloViews Bokeh plotting used Bokeh's custom HTML Canvas rendering, which can handle up to a few tens of thousand datapoints without any issue, but struggles above 100K points. Bokeh also supports WebGL rendering that uses your machine's graphic card GPU, but until mid-2023 Bokeh's WebGL support was incomplete and so it was not typically used in HoloViews. Here, we are using newer versions of Bokeh, HoloViews, and hvPlot, where WebGL is enabled by default, but we can explicitly disable it to see the previous supported approach:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "subsequent-adult",
"metadata": {},
"outputs": [],
"source": [
"import holoviews as hv\n",
"import hvplot.pandas # noqa: F401\n",
"\n",
"hv.renderer(\"bokeh\").webgl = False # True by default\n",
"\n",
"df0.hvplot(x=\"time\", y=\"value\", responsive=True, min_height=300, autorange='y', title=\"Canvas: Outdated\")"
]
},
{
"cell_type": "markdown",
"id": "generous-marathon",
"metadata": {},
"source": [
"Here we are using `autorange='y'` from HoloViews 1.17 and hvPlot 0.9.0 to scale the y axis to fit the entire visible curve after a zoom or pan; you can omit that option if you prefer to set the y scaling manually using the zoom tool.\n",
"\n",
"With this size of data, the Canvas approach should be pretty usable, but anything larger will generally be quite slow to zoom or pan. For this reason, the default is normally `webgl=True`, and you'll want to choose one of the other plotting methods below instead of Canvas. If you want to see how slow it gets with all the data, just change `df` to `df0` and add `by='sensor'`."
]
},
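{
"cell_type": "markdown",
"id": "added-canvas-sketch-note",
"metadata": {},
"source": [
"The following cell is a sketch of that full-data comparison (the cell and its title are additions for illustration; the only changes from the cell above are using `df` instead of `df0` and adding `by='sensor'`). Expect it to be noticeably slow to render, zoom, and pan while WebGL is still disabled."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-canvas-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the same Canvas (non-WebGL) plot as above, but with the full 1.2M-point\n",
"# dataset and one curve per sensor; expect slow rendering, zooming, and panning.\n",
"df.hvplot(x=\"time\", y=\"value\", by='sensor', responsive=True, min_height=300,\n",
"          autorange='y', title=\"Canvas: full data (slow)\")"
]
},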
{
"cell_type": "markdown",
"id": "dangerous-arrow",
"metadata": {},
"source": [
"# 1. WebGL: new baseline for timeseries plotting\n",
"\n",
"Nowadays, you should be able to forget about the old Canvas approach. As a baseline, timeseries plotting will now use WebGL rendering by default, taking advantage of your computer's graphic card to accelerate plotting. The result should be visually difficult to distinguish from the Canvas approach, but able to zoom and pan smoothly for larger datasets. \n",
"\n",
"Note that WebGL performance depends greatly on the details of the machine where your web browser is running. Unfortunately, most recent Macs have only relatively slow WebGL support (as fast as Canvas, but not necessarily much faster) due to lacking on-board Nvidia or AMD GPUs, so if you want to compare WebGL to Canvas you'll need a Windows or Linux machine locally."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aerial-pizza",
"metadata": {},
"outputs": [],
"source": [
"hv.renderer(\"bokeh\").webgl = True # The default\n",
"\n",
"df0.hvplot(x=\"time\", y=\"value\", responsive=True, min_height=300, autorange='y', title=\"WebGL\")"
]
},
{
"cell_type": "markdown",
"id": "directed-proof",
"metadata": {},
"source": [
"WebGL plotting should work well for up to a million points, but beyond that, it suffers from the fact that standard Bokeh plotting sends *all* of your data from the server to the local browser. Especially if your code is running on a remote server, transferring that data can be very slow, and updating it on pan and zoom can take a long time even with WebGL for larger datasets. The rest of the methods below use various techniques for not sending all of that data, focusing only on the data needed at any one time.\n",
"\n",
"# 2. LTTB Downsampling\n",
"\n",
"To get smaller sizes, you could simply plot every _n_th datapoint using `df.sample`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "commercial-nursery",
"metadata": {},
"outputs": [],
"source": [
"df0.sample(500).hvplot(x=\"time\", y=\"value\", responsive=True, min_height=300, autorange='y',\n",
" title = \"Decimation: Don't do this!\")"
]
},
{
"cell_type": "markdown",
"id": "constitutional-metabolism",
"metadata": {},
"source": [
"However, arbitrarily strided sampling like that will suffer from [aliasing](https://en.wikipedia.org/wiki/Downsampling_(signal_processing)), e.g. misrepresenting the curve because the selected samples miss important peaks, troughs, or slopes in the signal. In the example here, large spikes are clearly seen in the WebGL plot but are not visible in the decimated plot, which is why simple decimation of this type is not recommended.\n",
"\n",
"Instead, there are ways to reduce the number of samples while preserving the curve shape, such as the [Largest Triangle Three Buckets](https://skemman.is/handle/1946/15343) algorithm (LTTB). LTTB allows data points not contributing significantly to the visible shape to be dropped, reducing the amount of data to send to the browser but preserving the appearance (and particularly the envelope, i.e. highest and lowest values in a region).\n",
"\n",
"In hvPlot, adding `downsample=True` will enable the LTTB algorithm, which will automatically choose an appropriate number of samples for the current plot, updating with additional plots as you zoom in:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "powerful-command",
"metadata": {},
"outputs": [],
"source": [
"df0.hvplot(x=\"time\", y=\"value\", downsample=True, responsive=True, min_height=300, autorange='y', title=\"LTTB\")"
]
},
{
"cell_type": "markdown",
"id": "conservative-maldives",
"metadata": {},
"source": [
"Here you should see that the LTTB plot is visually quite similar to the WebGL plot, but it is rendered much more quickly (especially for local browsing of remote computation). The plot will then be updated with additional detail automatically if you zoom in, as long as the Python code underlying this page is still running (as LTTB depends on Python dynamically, while a Bokeh WebGL plot depends only on JavaScript running in your web browser, which is an advantage when sending static HTML files).\n",
"\n",
"With LTTB, it is now practical to include all of the different sensors in a single plot without slowdown, updating to show more detail when zooming in:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "accessible-evanescence",
"metadata": {},
"outputs": [],
"source": [
"df.hvplot(x=\"time\", y=\"value\", downsample=True, by='sensor', responsive=True, min_height=300, autorange='y',\n",
" title=\"Categorical LTTB\")"
]
},
{
"cell_type": "markdown",
"id": "conscious-collector",
"metadata": {},
"source": [
"LTTB is thus a good default way to browse a timeseries dataset if you don't know how large it might be or if you already know it is too large for WebGL.\n",
"\n",
"# 3. Datashader rasterizing\n",
"\n",
"Bokeh WebGL and LTTB both send data to the web browser and ask the web browser to \"connect the dots\" between them by drawing a line in the browser page, with LTTB simply sending fewer points. [Datashader](https://datashader.org) works in a different way, rendering the data into a frame buffer on the server, and then sending that buffer to the web browser rather than the individual data points. Thus Datashader will send only a fixed amount of data (the rendered plot), potentially greatly speeding up plots of the largest datasets. As for LTTB, plots will only be updated after a zoom or pan if Python is still running, because Python is what renders and supplies the updated image. Setting the argument `line_width` to a value above 0 will enable [anti-aliasing](https://en.wikipedia.org/wiki/Anti-aliasing) of the line. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "conventional-patch",
"metadata": {},
"outputs": [],
"source": [
"df0.hvplot(x=\"time\", y=\"value\", rasterize=True, cnorm='eq_hist', padding=(0, 0.1), line_width=1,\n",
" responsive=True, min_height=300, autorange='y', title=\"Rasterize\")"
]
},
{
"cell_type": "markdown",
"id": "helpful-content",
"metadata": {},
"source": [
"If you zoom in enough, you'll see a normal line, but for a long timeseries in a zoomed out plot like this one, what you will see is Datashader's \"aggregation\" of *all* the line segments between the points, with darker colors indicating areas where the data trace goes back and forth multiple times in a single pixel (with the number of \"switchbacks\" indicated in the color key). This representation conveys a lot more about the behavior of this data, with the previous plots showing a single solid color regardless of how many line segments crossed that pixel. Datashader rendering can be used to get a good overview of the full shape of a long timeseries, helping you understand how the signal varies even when the steps involved are smaller than the pixels on the screen.\n",
"\n",
"For data with different \"categories\" (sensors, in this case), Datashader can assign a different color to each of the sensor categories and then aggregating all of them into the final display by mixing their colors:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "imposed-liechtenstein",
"metadata": {},
"outputs": [],
"source": [
"df.hvplot(x=\"time\", y=\"value\", datashade=True, hover=True, padding=(0, 0.1), responsive=True,\n",
" min_height=300, autorange='y', line_width=1, by='sensor', title=\"Rasterize categories\")"
]
},
{
"cell_type": "markdown",
"id": "honey-globe",
"metadata": {},
"source": [
"This categorical color mixing can help indicate when traces overlap each other, to give you a clue when to zoom in, and becomes particularly important the more categories there are."
]
},
{
"cell_type": "markdown",
"id": "naughty-adventure",
"metadata": {},
"source": [
"[The example above needs `rasterize`, plus instant inspection. Also needs to illustrate what happens when very large numbers of traces overlap.] "
]
},
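{
"cell_type": "markdown",
"id": "added-rasterize-hover-note",
"metadata": {},
"source": [
"As a provisional sketch of that rasterized version with inspection (the option choices and title below are placeholders, reusing only parameters already shown in the cells above), `rasterize=True` can be combined with `hover=True` so the aggregated values can be inspected directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-rasterize-hover",
"metadata": {},
"outputs": [],
"source": [
"# Provisional sketch: categorical rasterization with hover inspection of the\n",
"# per-pixel aggregates, using only options already introduced above.\n",
"df.hvplot(x=\"time\", y=\"value\", rasterize=True, hover=True, by='sensor', line_width=1,\n",
"          padding=(0, 0.1), responsive=True, min_height=300, autorange='y',\n",
"          title=\"Rasterized categories with hover\")"
]
},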
{
"cell_type": "markdown",
"id": "expired-gallery",
"metadata": {},
"source": [
"# 4. Minimap\n",
"\n",
"The LTTB and Datashader options are about rendering or omitting datapoints when showing a large time range that would include many data points. What if you have years of data, but the timescale involved is such that you typically study a single day or a single hour? In that case the new \"minimap\" approach can help you ensure that you see the larger context while actually plotting only the smaller time range.\n",
"\n",
"A minimap is added using the HoloViews RangeToolLink:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "smoking-findings",
"metadata": {},
"outputs": [],
"source": [
"from holoviews.plotting.links import RangeToolLink\n",
"\n",
"# Does not yet work with downsample1d. For now, to make it easier on the browser, let's just take a subset of the data\n",
"downsampled_df = df.iloc[::10]\n",
"\n",
"plot = df0.hvplot(x=\"time\", y=\"value\", height=500)\n",
"minimap = df0.hvplot(x=\"time\", y=\"value\", height=150).opts(ylabel='', xlabel='')\n",
"\n",
"link = RangeToolLink(minimap, plot, axes=[\"x\", \"y\"], boundsx=(None, pd.Timestamp(\"2022-02-01\")), boundsy=(-5, 5))\n",
"\n",
"(plot + minimap).opts(shared_axes=False).cols(1)"
]
},
{
"cell_type": "markdown",
"id": "lesser-magazine",
"metadata": {},
"source": [
"Here, you can drag the grey box on the bottom plot and the top plot will update to show that range of the data, letting you explore a large dataset while plotting only a short stretch at a time."
]
}
],
"metadata": {
"language_info": {
"name": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}