Commit 4453f28

Merge pull request #4462 from Blargian/add_jupyer_integration_guide

AI/ML: add integration guide on Jupyter notebooks + chdb

2 parents 1e4e195 + 9c7bc28

File tree

14 files changed: +347 -11 lines changed

Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+---
+slug: /cloud/guides/sql-console/gather-connection-details
+sidebar_label: 'Gather your connection details'
+title: 'Gather your connection details'
+description: 'Gather your connection details'
+doc_type: 'guide'
+---
+
+import ConnectionDetails from '@site/docs/_snippets/_gather_your_details_http.mdx';
+
+<ConnectionDetails />

docs/use-cases/AI_ML/index.md

Lines changed: 6 additions & 4 deletions

@@ -2,7 +2,7 @@
 description: 'Landing page for Machine Learning and GenAI use case guides'
 pagination_prev: null
 pagination_next: null
-slug: /use-cases/AI
+slug: /use-cases/AI/ask-ai
 title: 'Machine learning and GenAI'
 keywords: ['machine learning', 'genAI', 'AI']
 doc_type: 'landing-page'
@@ -15,6 +15,8 @@ With ClickHouse, it's easier than ever to unleash GenAI on your analytics data.
 In this section, you'll find some guides around how ClickHouse is used for
 Machine Learning and GenAI.
 
-| Section | Description |
-|---------|-------------|
-| [MCP](/use-cases/AI/MCP) | A collection of guides to get you setup using Model Context Protocol (MCP) with ClickHouse |
+| Section | Description |
+|---------|-------------|
+| [AI chat](/use-cases/AI_ML/AIChat) | This guide explains how to enable and use the AI Chat feature in the ClickHouse Cloud Console. |
+| [MCP](/use-cases/AI/MCP) | A collection of guides to get you set up using Model Context Protocol (MCP) with ClickHouse |
+| [AI-powered SQL generation](/use-cases/AI/ai-powered-sql-generation) | This feature allows users to describe their data requirements in plain text, which the system then translates into corresponding SQL statements. |

Lines changed: 321 additions & 0 deletions

@@ -0,0 +1,321 @@
---
slug: /use-cases/AI/jupyter-notebook
sidebar_label: 'Exploring data in Jupyter notebooks with chDB'
title: 'Exploring data in Jupyter notebooks with chDB'
description: 'This guide explains how to set up and use chDB to explore data from ClickHouse Cloud or local files in Jupyter notebooks'
keywords: ['ML', 'Jupyter', 'chDB', 'pandas']
doc_type: 'guide'
---

import Image from '@theme/IdealImage';
import image_1 from '@site/static/images/use-cases/AI_ML/jupyter/1.png';
import image_2 from '@site/static/images/use-cases/AI_ML/jupyter/2.png';
import image_3 from '@site/static/images/use-cases/AI_ML/jupyter/3.png';
import image_4 from '@site/static/images/use-cases/AI_ML/jupyter/4.png';
import image_5 from '@site/static/images/use-cases/AI_ML/jupyter/5.png';
import image_6 from '@site/static/images/use-cases/AI_ML/jupyter/6.png';
import image_7 from '@site/static/images/use-cases/AI_ML/jupyter/7.png';
import image_8 from '@site/static/images/use-cases/AI_ML/jupyter/8.png';
import image_9 from '@site/static/images/use-cases/AI_ML/jupyter/9.png';

# Exploring data with Jupyter notebooks and chDB

In this guide, you will learn how to explore a dataset hosted on ClickHouse Cloud from a Jupyter notebook with the help of [chDB](/chdb) - a fast in-process SQL OLAP engine powered by ClickHouse.

**Prerequisites**:
- a virtual environment
- a working ClickHouse Cloud service and your [connection details](/cloud/guides/sql-console/gather-connection-details)

**What you'll learn:**
- Connect to ClickHouse Cloud from Jupyter notebooks using chDB
- Query remote datasets and convert results to Pandas DataFrames
- Combine cloud data with local CSV files for analysis
- Visualize data using matplotlib

We'll be using the UK Property Price dataset, which is available on ClickHouse Cloud as one of the starter datasets.
It contains data about the prices that houses were sold for in the United Kingdom from 1995 to 2024.

## Setup {#setup}

To add this dataset to an existing ClickHouse Cloud service, log in to [console.clickhouse.cloud](https://console.clickhouse.cloud/) with your account details.

In the left-hand menu, click `Data sources`. Then click `Predefined sample data`:

<Image size="md" img={image_1} alt="Add example data set"/>

Select `Get started` in the UK property price paid data (4GB) card:

<Image size="md" img={image_2} alt="Select UK price paid dataset"/>

Then click `Import dataset`:

<Image size="md" img={image_3} alt="Import UK price paid dataset"/>

ClickHouse will automatically create the `pp_complete` table in the `default` database and fill the table with 28.92 million rows of price point data.

To reduce the likelihood of exposing your credentials, we recommend adding your Cloud username and password as environment variables on your local machine.
From a terminal, run the following commands to add your username and password as environment variables:

```bash
export CLICKHOUSE_USER=default
export CLICKHOUSE_PASSWORD=your_actual_password
```

:::note
The environment variables above persist only as long as your terminal session.
To set them permanently, add them to your shell configuration file.
:::

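If you want to confirm the variables were exported correctly, a quick sanity check from Python is (a small diagnostic sketch; run it in the same shell session you will launch Jupyter from):

```python
import os

# Print the credentials this process can see.
# If either prints None, re-export the variables in the shell
# that you will use to launch Jupyter.
print(os.environ.get('CLICKHOUSE_USER'))
print(os.environ.get('CLICKHOUSE_PASSWORD'))
```
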
Now activate your virtual environment.
From within your virtual environment, install Jupyter Notebook with the following command:

```bash
pip install notebook
```

Launch Jupyter Notebook with the following command:

```bash
jupyter notebook
```

A new browser window should open with the Jupyter interface on `localhost:8888`.
Click `File` > `New` > `Notebook` to create a new Notebook.

<Image size="md" img={image_4} alt="Create a new notebook"/>

You will be prompted to select a kernel.
Select any Python kernel available to you; in this example we will select `ipykernel`:

<Image size="md" img={image_5} alt="Select kernel"/>

In a blank cell, type the following command to install chDB, which we will use to connect to our remote ClickHouse Cloud instance:

```python
!pip install chdb
```

You can now import chDB and run a simple query to check that everything is set up correctly:

```python
import chdb

result = chdb.query("SELECT 'Hello, ClickHouse!' as message")
print(result)
```

## Exploring the data {#exploring-the-data}

With the UK price paid data set up and chDB up and running in a Jupyter notebook, we can now get started exploring our data.

Let's imagine we are interested in checking how price has changed with time for a specific area in the UK, such as the capital city, London.
ClickHouse's [`remoteSecure`](/sql-reference/table-functions/remote) function allows you to easily retrieve data from ClickHouse Cloud.
You can instruct chDB to return this data in-process as a Pandas DataFrame - a convenient and familiar way of working with data.

Write the following query to fetch the UK price paid data from your ClickHouse Cloud service and turn it into a `pandas.DataFrame`:

```python
import os
from dotenv import load_dotenv
import chdb
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Load environment variables from .env file
load_dotenv()

username = os.environ.get('CLICKHOUSE_USER')
password = os.environ.get('CLICKHOUSE_PASSWORD')

query = f"""
SELECT
    toYear(date) AS year,
    avg(price) AS avg_price
FROM remoteSecure(
    '****.europe-west4.gcp.clickhouse.cloud',
    default.pp_complete,
    '{username}',
    '{password}'
)
WHERE town = 'LONDON'
GROUP BY toYear(date)
ORDER BY year;
"""

df = chdb.query(query, "DataFrame")
df.head()
```

In the snippet above, `chdb.query(query, "DataFrame")` runs the specified query and returns the result as a Pandas DataFrame.
In the query we use the `remoteSecure` function to connect to ClickHouse Cloud.
The `remoteSecure` function takes as parameters:
- a connection string (the address of your ClickHouse Cloud service)
- the name of the database and table to use
- your username
- your password

As a security best practice, you should prefer using environment variables for the username and password parameters rather than specifying them directly in the function, although this is possible if you wish.

The `remoteSecure` function connects to the remote ClickHouse Cloud service, runs the query, and returns the result.
Depending on the size of your data, this could take a few seconds.
In this case we return an average price point per year, and filter by `town = 'LONDON'`.
The result is then stored as a DataFrame in a variable called `df`.

`df.head()` displays the first few rows of the returned data:

<Image size="md" img={image_6} alt="dataframe preview"/>

Run the following command in a new cell to check the types of the columns:

```python
df.dtypes
```

```response
year         uint16
avg_price    float64
dtype: object
```

Notice that while `date` is of type `Date` in ClickHouse, `toYear(date)` returns a `UInt16`, so the `year` column in the resulting DataFrame has type `uint16`.
chDB automatically infers the most appropriate type when returning the DataFrame.

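The inferred dtypes behave like any other pandas dtypes, so you can adjust or build on them with ordinary pandas operations. For example (a sketch using toy values in place of the real query result):

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by chdb.query(..., "DataFrame")
df = pd.DataFrame({
    'year': pd.array([1995, 1996, 1997], dtype='uint16'),
    'avg_price': [150000.0, 158000.0, 171000.0],
})

# Widen the unsigned year column and compute year-on-year price growth
df['year'] = df['year'].astype('int64')
df['yoy_pct'] = df['avg_price'].pct_change() * 100

print(df.dtypes)
print(df)
```
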
With the data now available to us in a familiar form, let's explore how prices of property in London have changed with time.

In a new cell, run the following command to build a simple chart of time vs price for London using matplotlib:

```python
plt.figure(figsize=(12, 6))
plt.plot(df['year'], df['avg_price'], marker='o')
plt.xlabel('Year')
plt.ylabel('Price (£)')
plt.title('Price of London property over time')

# Show every 2nd year to avoid crowding
years_to_show = df['year'][::2]  # Every 2nd year
plt.xticks(years_to_show, rotation=45)

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

<Image size="md" img={image_7} alt="Plot of London property prices over time"/>

Perhaps unsurprisingly, property prices in London have increased substantially over time.

A fellow data scientist has sent us a .csv file with additional housing-related variables and is curious how the number of houses sold in London has changed over time.
Let's plot some of these against the housing prices and see if we can discover any correlation.

You can use the `file` table function to read files directly on your local machine.
In a new cell, run the following command to make a new DataFrame from the local .csv file:

```python
query = """
SELECT
    toYear(date) AS year,
    sum(houses_sold) * 1000 AS houses_sold
FROM file('/Users/datasci/Desktop/housing_in_london_monthly_variables.csv')
WHERE area = 'city of london' AND houses_sold IS NOT NULL
GROUP BY toYear(date)
ORDER BY year;
"""

df_2 = chdb.query(query, "DataFrame")
df_2.head()
```

<details>
<summary>Read from multiple sources in a single step</summary>
It's also possible to read from multiple sources in a single step. You could use the query below with a `JOIN` to do so:

```python
query = f"""
SELECT
    toYear(date) AS year,
    avg(price) AS avg_price,
    housesSold
FROM remoteSecure(
    '****.europe-west4.gcp.clickhouse.cloud',
    default.pp_complete,
    '{username}',
    '{password}'
) AS remote
JOIN (
    SELECT
        toYear(date) AS year,
        sum(houses_sold) * 1000 AS housesSold
    FROM file('/Users/datasci/Desktop/housing_in_london_monthly_variables.csv')
    WHERE area = 'city of london' AND houses_sold IS NOT NULL
    GROUP BY toYear(date)
) AS local ON local.year = toYear(remote.date)
WHERE town = 'LONDON'
GROUP BY toYear(date), housesSold
ORDER BY year;
"""
```
</details>

<Image size="md" img={image_8} alt="dataframe preview"/>

Although we are missing data from 2020 onwards, we can plot the two datasets against each other for the years 1995 to 2019.
In a new cell, run the following command:

```python
# Create a figure with two y-axes
fig, ax1 = plt.subplots(figsize=(14, 8))

# Plot houses sold on the left y-axis
color = 'tab:blue'
ax1.set_xlabel('Year')
ax1.set_ylabel('Houses Sold', color=color)
ax1.plot(df_2['year'], df_2['houses_sold'], marker='o', color=color, label='Houses Sold', linewidth=2)
ax1.tick_params(axis='y', labelcolor=color)
ax1.grid(True, alpha=0.3)

# Create a second y-axis for price data
ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('Average Price (£)', color=color)

# Plot price data up until 2019
ax2.plot(df[df['year'] <= 2019]['year'], df[df['year'] <= 2019]['avg_price'], marker='s', color=color, label='Average Price', linewidth=2)
ax2.tick_params(axis='y', labelcolor=color)

# Format price axis with currency formatting
ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'£{x:,.0f}'))

# Set title and show every 2nd year
plt.title('London Housing Market: Sales Volume vs Prices Over Time', fontsize=14, pad=20)

# Use years only up to 2019 for both datasets
all_years = sorted(list(set(df_2[df_2['year'] <= 2019]['year']).union(set(df[df['year'] <= 2019]['year']))))
years_to_show = all_years[::2]  # Every 2nd year
ax1.set_xticks(years_to_show)
ax1.set_xticklabels(years_to_show, rotation=45)

# Add legends
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()
```

<Image size="md" img={image_9} alt="Plot of remote data set and local data set"/>

From the plotted data, we see that sales started around 160,000 in the year 1995 and surged quickly, peaking at around 540,000 in 1999.
After that, volumes declined sharply through the mid-2000s, dropping severely during the 2007-2008 financial crisis and falling to around 140,000.
Prices, on the other hand, showed steady, consistent growth from about £150,000 in 1995 to around £300,000 by 2005.
Growth accelerated significantly after 2012, rising steeply from roughly £400,000 to over £1,000,000 by 2019.
Unlike sales volume, prices showed minimal impact from the 2008 crisis and maintained an upward trajectory. Yikes!

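The relationship the chart hints at can also be quantified. Since both query results are ordinary DataFrames, a Pearson correlation between yearly prices and sales volume is a one-liner in pandas (a sketch with toy values standing in for the real `df` and `df_2`):

```python
import pandas as pd

# Toy stand-ins for the two query results used above
df = pd.DataFrame({'year': [1995, 2000, 2005, 2010, 2015],
                   'avg_price': [150000, 220000, 300000, 400000, 600000]})
df_2 = pd.DataFrame({'year': [1995, 2000, 2005, 2010, 2015],
                     'houses_sold': [160000, 500000, 350000, 180000, 150000]})

# Align the two results on year, then compute the Pearson correlation
merged = df.merge(df_2, on='year')
correlation = merged['avg_price'].corr(merged['houses_sold'])
print(f"Correlation between price and sales volume: {correlation:.2f}")
```

A value near -1 would support the visual impression that volumes fell while prices rose; on the real data you would run this against the actual DataFrames returned by chDB.
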
## Summary {#summary}

This guide demonstrated how chDB enables seamless data exploration in Jupyter notebooks by connecting ClickHouse Cloud with local data sources.
Using the UK Property Price dataset, we showed how to query remote ClickHouse Cloud data with the `remoteSecure()` function, read local CSV files with the `file()` table function, and convert results directly to Pandas DataFrames for analysis and visualization.
Through chDB, data scientists can leverage ClickHouse's powerful SQL capabilities alongside familiar Python tools like Pandas and matplotlib, making it easy to combine multiple data sources for comprehensive analysis.

While many a London-based data scientist may not be able to afford their own home or apartment any time soon, at least they can analyze the market that priced them out!

docs/use-cases/index.md

Lines changed: 6 additions & 6 deletions

@@ -9,9 +9,9 @@ doc_type: 'landing-page'
 
 In this section of the docs you can find our use case guides.
 
-| Page | Description |
-|------|-------------|
-| [Observability](observability/index.md) | Use case guide on how to setup and use ClickHouse for Observability |
-| [Time-Series](time-series/index.md) | Use case guide on how to setup and use ClickHouse for time-series |
-| [Data Lake](data_lake/index.md) | Use case guide on Data Lakes in ClickHouse |
-| [Machine Learning and GenAI](/use-cases/AI) | Use case guides for Machine Learning and GenAI applications with ClickHouse |
+| Page | Description |
+|------|-------------|
+| [Observability](observability/index.md) | Use case guide on how to set up and use ClickHouse for Observability |
+| [Time-Series](time-series/index.md) | Use case guide on how to set up and use ClickHouse for time-series |
+| [Data Lake](data_lake/index.md) | Use case guide on Data Lakes in ClickHouse |
+| [Machine Learning and GenAI](/use-cases/AI/ask-ai) | Use case guides for Machine Learning and GenAI applications with ClickHouse |

scripts/aspell-ignore/en/aspell-dict.txt

Lines changed: 3 additions & 1 deletion

@@ -2534,6 +2534,8 @@ marimo's
 matcher
 materializations
 materializedview
+matplot
+matplotlib
 maxIntersections
 maxIntersectionsPosition
 maxMap
@@ -3733,4 +3735,4 @@ setinputsizes
 setoutputsizes
 stmt
 subclasses
-truncations
+truncations
Five binary files added: 300 KB, 271 KB, 232 KB, 424 KB, 151 KB.
