---
slug: /use-cases/AI/jupyter-notebook
sidebar_label: 'Exploring data in Jupyter notebooks with chDB'
title: 'Exploring data in Jupyter notebooks with chDB'
description: 'This guide explains how to set up and use chDB to explore data from ClickHouse Cloud or local files in Jupyter notebooks'
keywords: ['ML', 'Jupyter', 'chDB', 'pandas']
doc_type: 'guide'
---

import Image from '@theme/IdealImage';
import image_1 from '@site/static/images/use-cases/AI_ML/jupyter/1.png';
import image_2 from '@site/static/images/use-cases/AI_ML/jupyter/2.png';
import image_3 from '@site/static/images/use-cases/AI_ML/jupyter/3.png';
import image_4 from '@site/static/images/use-cases/AI_ML/jupyter/4.png';
import image_5 from '@site/static/images/use-cases/AI_ML/jupyter/5.png';
import image_6 from '@site/static/images/use-cases/AI_ML/jupyter/6.png';
import image_7 from '@site/static/images/use-cases/AI_ML/jupyter/7.png';
import image_8 from '@site/static/images/use-cases/AI_ML/jupyter/8.png';
import image_9 from '@site/static/images/use-cases/AI_ML/jupyter/9.png';

# Exploring data with Jupyter notebooks and chDB

In this guide, you will learn how to explore a dataset hosted on ClickHouse Cloud from a Jupyter notebook with the help of [chDB](/chdb) - a fast in-process SQL OLAP engine powered by ClickHouse.

**Prerequisites**:
- a virtual environment
- a working ClickHouse Cloud service and your [connection details](/cloud/guides/sql-console/gather-connection-details)

**What you'll learn:**
- Connect to ClickHouse Cloud from Jupyter notebooks using chDB
- Query remote datasets and convert results to Pandas DataFrames
- Combine cloud data with local CSV files for analysis
- Visualize data using matplotlib

We'll be using the UK Property Price dataset, which is available on ClickHouse Cloud as one of the starter datasets.
It contains the prices that houses were sold for in the United Kingdom from 1995 to 2024.

## Setup {#setup}

To add this dataset to an existing ClickHouse Cloud service, log in to [console.clickhouse.cloud](https://console.clickhouse.cloud/) with your account details.

In the left-hand menu, click `Data sources`, then click `Predefined sample data`:

<Image size="md" img={image_1} alt="Add example data set"/>

Select `Get started` on the UK property price paid data (4GB) card:

<Image size="md" img={image_2} alt="Select UK price paid dataset"/>

Then click `Import dataset`:

<Image size="md" img={image_3} alt="Import UK price paid dataset"/>

ClickHouse will automatically create the `pp_complete` table in the `default` database and fill it with 28.92 million rows of price paid data.

To reduce the risk of exposing your credentials, we recommend adding your Cloud username and password as environment variables on your local machine.
From a terminal, run the following commands to add your username and password as environment variables:

```bash
export CLICKHOUSE_USER=default
export CLICKHOUSE_PASSWORD=your_actual_password
```

:::note
The environment variables above persist only for the duration of your terminal session.
To set them permanently, add them to your shell configuration file.
:::

Now activate your virtual environment.
From within it, install Jupyter Notebook, along with the pandas and matplotlib libraries used later in this guide:

```bash
pip install notebook pandas matplotlib
```

Then launch Jupyter Notebook:

```bash
jupyter notebook
```

A new browser window should open with the Jupyter interface on `localhost:8888`.
Click `File` > `New` > `Notebook` to create a new Notebook.

<Image size="md" img={image_4} alt="Create a new notebook"/>

You will be prompted to select a kernel.
Select any Python kernel available to you; in this example, we will select `ipykernel`:

<Image size="md" img={image_5} alt="Select kernel"/>

In a blank cell, run the following command to install chDB, which we will use to connect to our remote ClickHouse Cloud service:

```python
!pip install chdb
```

You can now import chDB and run a simple query to check that everything is set up correctly:

```python
import chdb

result = chdb.query("SELECT 'Hello, ClickHouse!' as message")
print(result)
```
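
As an additional sanity check, you can print the version of the ClickHouse engine embedded in your chDB build (the exact output will vary by release):

```python
import chdb

# Print the ClickHouse engine version bundled with this chDB build.
print(chdb.query("SELECT version()"))
```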

## Exploring the data {#exploring-the-data}

With the UK price paid dataset loaded and chDB up and running in a Jupyter notebook, we can now get started exploring our data.

Let's imagine we are interested in how prices have changed over time in a specific area of the UK, such as the capital city, London.
ClickHouse's [`remoteSecure`](/sql-reference/table-functions/remote) function allows you to easily retrieve data from ClickHouse Cloud.
You can instruct chDB to return this data in-process as a Pandas DataFrame, which is a convenient and familiar way of working with data.

Write the following query to fetch the UK price paid data from your ClickHouse Cloud service and turn it into a `pandas.DataFrame`:

```python
import os
import chdb
import pandas as pd
import matplotlib.pyplot as plt

# Read the credentials from the environment variables set earlier
username = os.environ.get('CLICKHOUSE_USER')
password = os.environ.get('CLICKHOUSE_PASSWORD')

query = f"""
SELECT
    toYear(date) AS year,
    avg(price) AS avg_price
FROM remoteSecure(
    '****.europe-west4.gcp.clickhouse.cloud',
    default.pp_complete,
    '{username}',
    '{password}'
)
WHERE town = 'LONDON'
GROUP BY toYear(date)
ORDER BY year;
"""

df = chdb.query(query, "DataFrame")
df.head()
```

In the snippet above, `chdb.query(query, "DataFrame")` runs the specified query and returns the result as a Pandas DataFrame.
In the query, we use the `remoteSecure` function to connect to ClickHouse Cloud.
The `remoteSecure` function takes the following parameters:
- a connection string
- the name of the database and table to use
- your username
- your password

As a security best practice, prefer reading the username and password from environment variables, as above, rather than hard-coding them in the function call, although that is possible if you wish.
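
A minimal sketch of this pattern with a fail-fast check, so that missing credentials raise an error instead of being sent to `remoteSecure` as empty strings (the variable names match the ones exported earlier):

```python
import os

# Read the credentials from the environment rather than hard-coding them.
username = os.environ.get('CLICKHOUSE_USER')
password = os.environ.get('CLICKHOUSE_PASSWORD')

# Fail fast if either variable is missing or empty.
if not username or not password:
    raise RuntimeError(
        "Set CLICKHOUSE_USER and CLICKHOUSE_PASSWORD before running this notebook"
    )
```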

The `remoteSecure` function connects to the remote ClickHouse Cloud service, runs the query, and returns the result.
Depending on the size of your data, this could take a few seconds.
In this case, we return the average price per year and filter by `town = 'LONDON'`.
The result is then stored as a DataFrame in a variable called `df`.

`df.head()` displays only the first few rows of the returned data:

<Image size="md" img={image_6} alt="dataframe preview"/>

Run the following command in a new cell to check the types of the columns:

```python
df.dtypes
```

```response
year          uint16
avg_price    float64
dtype: object
```

Notice that while `date` is of type `Date` in ClickHouse, the `year` column produced by `toYear(date)` arrives in the DataFrame as `uint16`.
chDB automatically infers the most appropriate type when returning the DataFrame.

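If you'd rather plot against a true datetime axis, you can convert the integer years in pandas - a minimal sketch, assuming the `df` returned above (`year_ts` is just an illustrative column name):

```python
# Convert the uint16 year values to pandas timestamps (January 1st of each year).
df['year_ts'] = pd.to_datetime(df['year'].astype(str), format='%Y')
df.dtypes
```
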
With the data now available to us in a familiar form, let's explore how the prices of property in London have changed over time.

In a new cell, run the following command to build a simple chart of price over time for London using matplotlib:

```python
plt.figure(figsize=(12, 6))
plt.plot(df['year'], df['avg_price'], marker='o')
plt.xlabel('Year')
plt.ylabel('Price (£)')
plt.title('Price of London property over time')

# Show every 2nd year to avoid crowding
years_to_show = df['year'][::2]  # Every 2nd year
plt.xticks(years_to_show, rotation=45)

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

<Image size="md" img={image_7} alt="Plot of London property prices over time"/>

Perhaps unsurprisingly, property prices in London have increased substantially over time.

A fellow data scientist has sent us a .csv file with additional housing-related variables and is curious how the number of houses sold in London has changed over time.
Let's plot some of these against the housing prices and see if we can discover any correlation.

You can use the `file` table function to read files directly from your local machine.
In a new cell, run the following command to create a new DataFrame from the local .csv file:

```python
query = """
SELECT
    toYear(date) AS year,
    sum(houses_sold) * 1000 AS housesSold
FROM file('/Users/datasci/Desktop/housing_in_london_monthly_variables.csv')
WHERE area = 'city of london' AND houses_sold IS NOT NULL
GROUP BY toYear(date)
ORDER BY year;
"""

df_2 = chdb.query(query, "DataFrame")
df_2.head()
```
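
If you're unsure which columns the CSV contains, you can first ask chDB to describe the schema it infers from the file - a quick sanity check (the path is the same illustrative one used above):

```python
# Inspect the schema that chDB infers for the local CSV file.
# Substitute your own file location.
print(chdb.query("""
DESCRIBE file('/Users/datasci/Desktop/housing_in_london_monthly_variables.csv')
"""))
```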

<details>
<summary>Read from multiple sources in a single step</summary>

It's also possible to read from multiple sources in a single step by using a `JOIN`, as in the query below:

```python
query = f"""
SELECT
    toYear(date) AS year,
    avg(price) AS avg_price,
    housesSold
FROM remoteSecure(
    '****.europe-west4.gcp.clickhouse.cloud',
    default.pp_complete,
    '{username}',
    '{password}'
) AS remote
JOIN (
    SELECT
        toYear(date) AS year,
        sum(houses_sold) * 1000 AS housesSold
    FROM file('/Users/datasci/Desktop/housing_in_london_monthly_variables.csv')
    WHERE area = 'city of london' AND houses_sold IS NOT NULL
    GROUP BY toYear(date)
) AS local ON local.year = remote.year
WHERE town = 'LONDON'
GROUP BY toYear(date), housesSold
ORDER BY year;
"""
```
</details>

<Image size="md" img={image_8} alt="dataframe preview"/>

Although the local dataset is missing data from 2020 onwards, we can plot the two datasets against each other for the years 1995 to 2019.
In a new cell, run the following command:

```python
# Create a figure with two y-axes
fig, ax1 = plt.subplots(figsize=(14, 8))

# Plot houses sold on the left y-axis
color = 'tab:blue'
ax1.set_xlabel('Year')
ax1.set_ylabel('Houses Sold', color=color)
ax1.plot(df_2['year'], df_2['housesSold'], marker='o', color=color, label='Houses Sold', linewidth=2)
ax1.tick_params(axis='y', labelcolor=color)
ax1.grid(True, alpha=0.3)

# Create a second y-axis for price data
ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('Average Price (£)', color=color)

# Plot price data up until 2019
ax2.plot(df[df['year'] <= 2019]['year'], df[df['year'] <= 2019]['avg_price'], marker='s', color=color, label='Average Price', linewidth=2)
ax2.tick_params(axis='y', labelcolor=color)

# Format price axis with currency formatting
ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'£{x:,.0f}'))

# Set the title
plt.title('London Housing Market: Sales Volume vs Prices Over Time', fontsize=14, pad=20)

# Use years only up to 2019 for both datasets, showing every 2nd year to avoid crowding
all_years = sorted(set(df_2[df_2['year'] <= 2019]['year']).union(set(df[df['year'] <= 2019]['year'])))
years_to_show = all_years[::2]  # Every 2nd year
ax1.set_xticks(years_to_show)
ax1.set_xticklabels(years_to_show, rotation=45)

# Add legends
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()
```

<Image size="md" img={image_9} alt="Plot of remote data set and local data set"/>

From the plotted data, we see that sales volumes started at around 160,000 in 1995 and surged quickly, peaking at around 540,000 in 1999.
After that, volumes declined sharply through the mid-2000s, dropped severely during the 2007-2008 financial crisis, and fell to around 140,000.
Prices, on the other hand, showed steady, consistent growth from about £150,000 in 1995 to around £300,000 by 2005.
Growth accelerated significantly after 2012, rising steeply from roughly £400,000 to over £1,000,000 by 2019.
Unlike sales volume, prices showed minimal impact from the 2008 crisis and maintained an upward trajectory. Yikes!
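
To put a number on this relationship rather than eyeballing the chart, you could merge the two DataFrames on `year` and compute a correlation coefficient - a minimal sketch, assuming the `df` and `df_2` built above:

```python
# Join the yearly price and sales-volume series over the years both
# datasets cover, then compute the Pearson correlation between them.
merged = df.merge(df_2, on='year')
merged = merged[merged['year'] <= 2019]
print(merged['avg_price'].corr(merged['housesSold']))
```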

## Summary {#summary}

This guide demonstrated how chDB enables seamless data exploration in Jupyter notebooks by connecting ClickHouse Cloud with local data sources.
Using the UK Property Price dataset, we showed how to query remote ClickHouse Cloud data with the `remoteSecure()` function, read local CSV files with the `file()` table function, and convert results directly to Pandas DataFrames for analysis and visualization.
Through chDB, data scientists can leverage ClickHouse's powerful SQL capabilities alongside familiar Python tools like Pandas and matplotlib, making it easy to combine multiple data sources for comprehensive analysis.

While many a London-based data scientist may not be able to afford their own home or apartment any time soon, at least they can analyze the market that priced them out!