guides/20240920_guide_building_a_duckdb_playground_with_daytona.md
+52 -52
@@ -1,60 +1,60 @@
1
1
---
2
2
title: "Building DuckDB Playground Environment in Daytona Workspace."
3
-
description: "Set up a DuckDB environment in Daytona Workspace and master some data tasks including cleaning, reformating and splitting CSV file, with this step-by-step guide."
3
+
description: "Set up a DuckDB environment in Daytona Workspace and master some data tasks including cleaning, reformatting, and splitting a CSV file, with this step-by-step guide."
4
4
date: 2024-09-20
5
5
author: "Jeffrey Whewhetu"
6
-
tags: ["DuckDB", "OLAP", "daytona", "Python"]
6
+
tags: ["DuckDB", "OLAP", "Daytona", "Python"]
7
7
---
8
8
9
9
# Building DuckDB Playground Environment in Daytona Workspace
10
10
11
11
# Introduction
12
-
This is a comprehensive hands-on guide in using [DuckDB](20240922_definition_duckdb.md) database to perform a realworld data project in a containerized [workspace](20240819_definition_daytona workspace.md) using Daytona. You'll follow me along from setup to actually working with DuckDB cli and even with [Python](20240820_defintion_python.md) via it's Client API. So it's a long ride and you can get a coffee closed by.
12
+
This is a comprehensive hands-on guide to using the [DuckDB](20240922_definition_duckdb.md) database for a real-world data project in a containerized [workspace](20240819_definition_daytona workspace.md) using Daytona. You'll follow along from setup to working with the DuckDB CLI and with [Python](20240820_defintion_python.md) via its client API. It's a long ride, so keep a coffee nearby.
13
13
14
-
In this comprehensive guide, you will learn how to prepare personal loan marketing campaign data for importation into a DuckDB database and do some analysis on the dataset. Your tasks will include collecting and reviewing the data, cleaning and structuring it according to a specification, handling errors and inconsistencies, transforming and splitting it into multiple CSV files. The CSV file you'll work on is called `bank_marketing.csv`, download from GitHub [here](https://github.com/c0d33ngr/playground-duckdb/blob/main/bank_marketing.csv)
14
+
In this comprehensive guide, you will learn how to prepare personal loan marketing campaign data for import into a DuckDB database and analyze the dataset. Your tasks will include collecting and reviewing the data, cleaning and structuring it according to a specification, handling errors and inconsistencies, and transforming and splitting it into multiple CSV files. The CSV file you'll work on is called `bank_marketing.csv`; download it from GitHub [here](https://github.com/c0d33ngr/playground-duckdb/blob/main/bank_marketing.csv).
15
15
16
16
# TL;DR
17
17
18
18
- What you need to follow along with the guide.
19
-
- What's DuckDB and Why use it
19
+
- What's DuckDB and Why Use it
20
20
- Set up a Daytona Workspace with DuckDB [environment](20240819_definition_development environment.md)
21
21
- Hands-on practice using DuckDB as a CLI Tool
22
22
- Hands-on practice using DuckDB client API with [Python](20240820_defintion_python.md)
23
23
- Conclusion
24
24
25
25
# Prerequisites
26
26
27
-
To follow along with hands-on guide about DuckDB Playground in Daytona, you'll need to have the following;
27
+
To follow along with this hands-on guide to the DuckDB Playground in Daytona, you'll need the following:
28
28
29
29
- An [IDE](20240819_definition_integrated development environment _ide_.md) (VS Code or JetBrains, for example) or just a terminal.
30
-
-[Docker](20240819_definition_docker.md) installation on your PC or Mac. Click here for more info
31
-
- Daytona installation on your PC or Mac. Click here for more info
32
-
- A GitHub account to create a [repository](20240819_definition_repository.md). Link here to create one, if you don’t have
33
-
- Basic knowledge of [Git](20240819_definition_git.md) and GitHub
30
+
- [Docker](20240819_definition_docker.md) installation on your PC or Mac. Click here for more info.
31
+
- Daytona installation on your PC or Mac. Click here for more info.
32
+
- A GitHub account to create a [repository](20240819_definition_repository.md). Link here to create one, if you don’t have one.
33
+
- Basic knowledge of [Git](20240819_definition_git.md) and GitHub.
34
34
35
-
# What's DuckDB and Why use it
35
+
# What's DuckDB and Why Use it
36
36
37
37
## DuckDB
38
38
39
-
[DuckDB](20240922_definition_duckdb.md) is a fast in-process data analytical database with support of feature-rich SQL dialect complemented with deep integrations into client APIs. It's designed to provide high performance on complex queries against large databases in embedded configuration, such as combining tables with hundreds of columns and billions of rows. It's specialized for [online analytical processing (OLAP)](20240922_definition_online_analytical_processing_olap.md) workloads
39
+
[DuckDB](20240922_definition_duckdb.md) is a fast in-process analytical database with a feature-rich SQL dialect and deep integrations into client APIs. It's designed to run complex queries against large databases in an embedded configuration, such as combining tables with hundreds of columns and billions of rows. It's specialized for [online analytical processing (OLAP)](20240922_definition_online_analytical_processing_olap.md) workloads.
40
40
41
41
## Features of DuckDB
42
42
43
-
DuckDB has lots of features that make it stand out among other databases which focus on [OLAP](20240922_definition_online_analytical_processing_olap.md). Some of the features are:
43
+
DuckDB has many features that make it stand out among other databases focusing on [OLAP](20240922_definition_online_analytical_processing_olap.md). Some of the features are:
44
44
45
-
-**Simple:** It's very simple to install and perform embedded in-process operation.
45
+
- **Simple:** It's straightforward to install and perform embedded in-process operations.
46
46
- **Portable:** Since it has no external dependencies, it's extremely portable and can be compiled for all major operating systems and CPU architectures.
47
-
-**Feature-Rich:** DuckDB has some interesting features such as extensive support for SQL complex queries, integrations to languages like [Python](20240820_defintion_python.md), R and Java and data can be stored as persistent, single-file databases.
48
-
-**Speed:**it's faster as it uses columnar-vectorized query execution engine which improves performance to run [OLAP](20240922_definition_online_analytical_processing_olap.md) workloads.
47
+
- **Feature-Rich:** DuckDB offers extensive support for complex SQL queries, integrations with languages like [Python](20240820_defintion_python.md), R, and Java, and the option to store data as persistent, single-file databases.
48
+
- **Speed:** It's fast because it uses a columnar-vectorized query execution engine, which improves performance on [OLAP](20240922_definition_online_analytical_processing_olap.md) workloads.
49
49
- **Free:** Lastly, it's a free [open source](20240819_definition_open source.md) database system that anyone can use under its permissive MIT License.
50
50
51
51
# Setting up Daytona Workspace for DuckDB Playground
52
52
53
-
Alright that's enough reading, now let us get started to writing codes. To do so you will need to set up a DuckDB [environment](20240819_definition_development environment.md) in a [Daytona workspace](20240819_definition_daytona workspace.md). Let’s begin.
53
+
Alright, that's enough reading; now let's start writing code. To do so, you will need to set up a DuckDB [environment](20240819_definition_development environment.md) in a [Daytona workspace](20240819_definition_daytona workspace.md). Let’s begin.
54
54
55
55
## Step 1: Create a GitHub Repository
56
56
57
-
First head to GitHub website and create a [repository](20240819_definition_repository.md) with the name of your choice. For my repository name, I’ll use `playground-duckdb`. The full URL path to the repository is `https://github.com/c0d33ngr/playground-duckdb`
57
+
First, head to the GitHub website and create a [repository](20240819_definition_repository.md) with the name of your choice. For my repository name, I’ll use `playground-duckdb`. The full URL of the repository is `https://github.com/c0d33ngr/playground-duckdb`.
58
58
59
59
## Step 2: Clone the repository using Git
60
60
@@ -64,19 +64,19 @@ In my case, it’s `git clone https://github.com/c0d33ngr/playground-duckdb`
64
64
65
65
## Step 3: Prepare your `devcontainer.json` file and dataset in CSV format
66
66
67
-
Run the command to move into your cloned repository but don’t forget to replace `playground-duckdb` with your own repository name you created if yours isn’t the same with mine.
67
+
Run the command below to move into your cloned repository, replacing `playground-duckdb` with your own repository name if it differs from mine.
68
68
69
69
```bash
70
70
cd playground-duckdb
71
71
```
72
72
73
-
Download the bank campaign dataset you are going to perform data tasks on which is in CSV format, from GitHub repo [here](https://github.com/c0d33ngr/playground-duckdb/blob/main/bank_marketing.csv).
73
+
Download the bank campaign dataset, the CSV file you will perform the data tasks on, from the GitHub repo [here](https://github.com/c0d33ngr/playground-duckdb/blob/main/bank_marketing.csv).
74
74
75
75
Note: It has to be in the directory of your cloned repository. In my case, it's inside `playground-duckdb`.
76
76
77
-
Now, lets proceed to the next step.
77
+
Now, let us proceed to the next step.
78
78
79
-
Create a hidden directory named `.devcontainer` where our `devcontainer.json` file will be. Let’s do so and move into it
79
+
Create a hidden directory named `.devcontainer` where our `devcontainer.json` file will be. Let’s do so and move into it.
80
80
81
81
Run the command to do so
82
82
@@ -92,7 +92,7 @@ I use `nano` to create my `.devcontainer.json` file using this command.
92
92
nano devcontainer.json
93
93
```
94
94
95
-
Paste this code into your `devcontainer.json` file
95
+
Paste this code into your `devcontainer.json` file.
96
96
97
97
```json
98
98
{
@@ -109,81 +109,81 @@ Paste this code into your `devcontainer.json` file
109
109
The `devcontainer.json` file contains the configuration needed to start your DuckDB environment in a [Daytona workspace](20240819_definition_daytona workspace.md).
110
110
111
111
- `name`: This sets the name of the development container environment to `DuckDB Playground`.
112
-
-`image`: This uses a base Ubuntu image from Microsoft image repository.
113
-
-`features`: This configuration add DuckDB installation and Python setups in the Daytona workspace
114
-
-`postCreateComand`: This install the Python packages needed for this guide into the workspace.
112
+
- `image`: This uses a base Ubuntu image from the Microsoft image repository.
113
+
- `features`: This adds the DuckDB installation and the Python setup to the Daytona workspace.
114
+
- `postCreateCommand`: This installs the Python packages needed for this guide into the workspace.
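
As a quick sanity check before committing, the four keys described above can be verified with a few lines of Python. This is a minimal sketch, not part of the original guide: the helper name `missing_devcontainer_keys` is hypothetical, the example values are placeholders rather than the guide's real image or install command, and it assumes the file is plain JSON without comments.

```python
import json

# The four top-level keys described in the list above.
REQUIRED_KEYS = ("name", "image", "features", "postCreateCommand")


def missing_devcontainer_keys(config_text: str) -> list[str]:
    """Return the required keys that are absent from a devcontainer.json string."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical example content: only the key names and the container name
# come from the guide; the other values are placeholders.
example = json.dumps(
    {
        "name": "DuckDB Playground",
        "image": "<base-ubuntu-image>",
        "features": {},
        "postCreateCommand": "<pip install command>",
    }
)

print(missing_devcontainer_keys(example))  # prints []
```

To check a real file, read it with `open(".devcontainer/devcontainer.json").read()` and pass the text to the helper.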
115
115
116
-
After created and saved the `devcontainer.json` file, move up back to the root directory of your clone [repository](20240819_definition_repository.md). For me, I run the command below
116
+
After creating and saving the `devcontainer.json` file, move back up to the root directory of your cloned [repository](20240819_definition_repository.md). In my case, I run the command below.
117
117
118
118
```bash
119
119
cd ../..
120
120
```
121
121
122
122
## Step 4: Commit and Push Changes to GitHub
123
123
124
-
Run this commands to push your changes to GitHub
124
+
Run these commands to push your changes to GitHub.
125
125
126
126
```bash
127
127
git add .
128
128
git commit -m "add devcontainer.json file"
129
129
git push
130
130
```
131
131
132
-
Now, you have successfully push our updated repository that contains our configuration file (`devcontainer.json`) for our DuckDB environment
132
+
You have now successfully pushed the updated repository, which contains the configuration file (`devcontainer.json`) for the DuckDB environment.
133
133
134
134
## Step 5: Verify Daytona Installation
135
135
136
-
Run this command to check `daytona` is properly installed in your PC or Mac
136
+
Run this command to check that `daytona` is properly installed on your PC or Mac.
137
137
138
138
```bash
139
139
daytona --version
140
140
```
141
141
142
-
You should see your version of `daytona` installed
142
+
You should see your version of `daytona` installed.
143
143
144
144
## Step 6: Create a Daytona Workspace with DuckDB Playground Environment in it
145
145
146
-
Let’s start daytona server by running the command
146
+
Let’s start the Daytona server by running the command below.
147
147
148
148
```bash
149
149
daytona serve
150
150
```
151
151
152
-
You should see logs like my screenshot
152
+
You should see logs like my screenshot.
153
153
154
154
Open a new tab in your terminal; on Linux, it's `Shift + Ctrl + T`.
155
155
156
-
Run the command below in a new tab of your terminal and follow the prompt instruction. It would ask you for a [workspace](20240819_definition_daytona workspace.md) name to use, just choose the default.
156
+
Run the command below in the new tab of your terminal and follow the prompt instructions. It will ask you for a [workspace](20240819_definition_daytona workspace.md) name to use; choose the default.
157
157
158
158
Replace `USERNAME` and `REPOSITORY-NAME` with your username for GitHub and the repository name you created earlier.
After you successfully ran the above command you should see screenshot like mine showing your Daytona workspace that contains the DuckDB environment is running
170
+
After the command above completes successfully, you should see output like my screenshot, showing that your Daytona workspace containing the DuckDB environment is running.
171
171
172
172
You can now run this command to open the DuckDB [environment](20240819_definition_development environment.md) in the default [IDE](20240819_definition_integrated development environment _ide_.md) you chose when installing Daytona (replace `WORKSPACE-NAME` with the name you used when creating the workspace above; in my case it's `playground-duckdb`).
173
173
174
174
```bash
175
175
daytona code WORKSPACE-NAME
176
176
```
177
177
178
-
That’s it. Daytona will create a DuckDB playground environment for you and open it in your default IDE you set.
178
+
That’s it. Daytona will create a DuckDB playground environment for you and open it in the default IDE you set.
179
179
180
180
# Using DuckDB as a Command Line Interface (CLI) Tool
181
181
182
-
In this section, you'll learn how to work with [DuckDB](20240922_definition_duckdb.md) by creating a database from a CSV file, examining its structure, retrieving distinct values, and exporting data to separate CSV files for client, campaign, and economics data. Finally, you'll verify the exported data, gaining hands-on experience with DuckDB's querying and data manipulation capabilities. Lets get started
182
+
In this section, you'll learn how to work with [DuckDB](20240922_definition_duckdb.md) by creating a database from a CSV file, examining its structure, retrieving distinct values, and exporting data to separate CSV files for client, campaign, and economics data. Finally, you'll verify the exported data, gaining hands-on experience with DuckDB's querying and data manipulation capabilities. Let us get started.
183
183
184
184
## Step 1: Enter DuckDB Interactive Shell
185
185
186
-
By now, you should be in your default IDE set up using `daytona`. In your IDE terminal, type the command below to enter into DuckDB database shell in interactive mode where you'll run some SQLbased queries that conformed to DuckDB database.
186
+
By now, you should be in your default IDE set up using `daytona`. In your IDE terminal, type the command below to enter the DuckDB shell in interactive mode, where you'll run SQL queries written in DuckDB's dialect.
187
187
188
188
```bash
189
189
duckdb
@@ -231,18 +231,18 @@ COPY (
231
231
) TO 'client.csv' (DELIMITER ',', HEADER TRUE);
232
232
```
233
233
234
-
## Step 5: Retrieve List of Distinct Records in `day` Column
234
+
## Step 5: Retrieve the List of Distinct Records in `day` Column
235
235
236
-
Run the following SQL query to retrieve a list of distinct days from the bank_marketing table. The results would be useful in the preparation of the SQL query for step 7. We need to know the unique records in the `day` column.
236
+
Run the following SQL query to retrieve a list of distinct days from the `bank_marketing` table. The results will be useful when preparing the SQL query in step 7, since we need to know the unique values in the `day` column.
237
237
238
238
```sql
239
239
SELECT DISTINCT day
240
240
FROM 'bank_marketing.csv';
241
241
```
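
If you want to cross-check the distinct values outside the DuckDB shell, the same idea can be sketched with Python's standard `csv` module. This is plain Python rather than DuckDB, the helper name `distinct_values` is hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io


def distinct_values(csv_file, column: str) -> set[str]:
    """Collect the distinct raw string values of one column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return {row[column] for row in reader}


# Tiny made-up sample standing in for bank_marketing.csv.
sample = io.StringIO("day,month\n13,may\n19,jun\n13,may\n")
print(sorted(distinct_values(sample, "day")))  # prints ['13', '19']
```

For the real dataset, open the file with `open("bank_marketing.csv", newline="")` and pass the file object in place of `sample`. Note that the values come back as strings, not integers.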
242
242
243
-
## Step 6: Retrieve List of Distinct Records in `month` Column
243
+
## Step 6: Retrieve the List of Distinct Records in `month` Column
244
244
245
-
Run the following SQL query to retrieve list of distinct months from the `bank_marketing` table. The results are also needed for the creation of a new column called `last_contact_date` later in step 7.
245
+
Run the following SQL query to retrieve the list of distinct months from the `bank_marketing` table. The results are also needed for the creation of a new column called `last_contact_date` later in step 7.
246
246
247
247
```sql
248
248
SELECT DISTINCT month
@@ -283,15 +283,15 @@ COPY (
283
283
WHEN LOWER(month) = 'oct' THEN 10
284
284
WHEN LOWER(month) = 'nov' THEN 11
285
285
WHEN LOWER(month) = 'dec' THEN 12
286
-
ELSE NULL-- default value if month is unknown
286
+
ELSE NULL -- default value if the month is unknown
287
287
END,
288
288
CAST(day AS BIGINT)
289
289
) AS last_contact_date
290
290
FROM bank_marketing
291
291
) TO 'campaign.csv' (DELIMITER ',', HEADER TRUE);
292
292
```
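
The month-to-number mapping in the `CASE` expression above can be easier to follow in Python. The sketch below is an illustration only: the function name is hypothetical, the year `2022` is an assumed placeholder (the year used in the query is not shown in this excerpt), and the months before `oct` are assumed to follow the same three-letter pattern as the visible branches.

```python
from datetime import date
from typing import Optional

# Three-letter month abbreviations mapped to month numbers.
MONTH_NUMBERS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def last_contact_date(year: int, month: str, day: int) -> Optional[date]:
    """Build a date from a month abbreviation, or None if the month is unknown."""
    month_number = MONTH_NUMBERS.get(month.lower())
    if month_number is None:
        return None  # mirrors ELSE NULL in the CASE expression
    return date(year, month_number, day)


print(last_contact_date(2022, "Oct", 13))  # prints 2022-10-13
print(last_contact_date(2022, "xyz", 13))  # prints None
```

Unlike the SQL version, `date(...)` raises a `ValueError` for an impossible day such as 31 in a 30-day month, so real data may need an extra guard.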
293
293
294
-
## Step 8: Export Economical Data to CSV
294
+
## Step 8: Export Economic Data to CSV
295
295
296
296
Run the following SQL query to export economics data to a CSV file named `economics.csv`
297
297
@@ -307,7 +307,7 @@ COPY (
307
307
308
308
## Step 9: Read Data from Exported CSV files
309
309
310
-
Run the following SQL queries to read data from the `client.csv`, `campaign.csv` and `economics.csv` files.
310
+
Run the following SQL queries to read data from the `client.csv`, `campaign.csv`, and `economics.csv` files.
311
311
312
312
```sql
313
313
SELECT *
@@ -324,11 +324,11 @@ SELECT *
324
324
FROM 'economics.csv';
325
325
```
326
326
327
-
Now, our three CSV files are prepared for some analysis using DuckDB Client API via Python. Let head to the next section for the analysis.
327
+
Now, our three CSV files have been prepared for analysis using DuckDB Client API via Python. Let's head to the next section for the analysis.
328
328
329
329
# Using DuckDB with Python through its Client API
330
330
331
-
In this section, you'll learn how to analyze and visualize data using [DuckDB](20240922_definition_duckdb.md) and [Matplotlib](20240922_definition_matplotlib.md). You'll calculate the campaign success rate, create a bar chart to compare average client age by education level, and generate a scatter plot to explore the relationship between contact duration and campaign outcome. We'll use the cleaned and transformed CSV files spilt from our `bank_marketing.csv` in this section.
331
+
In this section, you'll learn how to analyze and visualize data using [DuckDB](20240922_definition_duckdb.md) and [Matplotlib](20240922_definition_matplotlib.md). You'll calculate the campaign success rate, create a bar chart to compare average client age by education level and generate a scatter plot to explore the relationship between contact duration and campaign outcome. We'll use the cleaned and transformed CSV files split from our `bank_marketing.csv` in this section.
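
Before running the real analysis, it may help to see what "campaign success rate" means as arithmetic: the share of contacts whose outcome is a success, expressed as a percentage. The sketch below is plain Python on a made-up list, not the DuckDB query used in this guide; the function name and the `"yes"` success label are assumptions for illustration.

```python
def success_rate(outcomes: list[str], success_label: str = "yes") -> float:
    """Return the percentage of outcomes equal to success_label."""
    if not outcomes:
        return 0.0  # avoid dividing by zero on an empty campaign
    successes = sum(1 for outcome in outcomes if outcome == success_label)
    return 100.0 * successes / len(outcomes)


# Made-up outcomes: two successes out of four contacts.
print(success_rate(["yes", "no", "no", "yes"]))  # prints 50.0
```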
332
332
333
333
## Step 1: Analysis of Customer Campaign Success Rate
334
334
@@ -385,7 +385,7 @@ plt.tight_layout()
385
385
plt.show()
386
386
```
387
387
388
-
Run the file in your IDE terminal using `python3 client_age_by_education.py` and you should see visualization.
388
+
Run the file in your IDE terminal using `python3 client_age_by_education.py` and you should see the visualization.
389
389
390
390
## Step 3: Analysis and Visualization of Contact Duration and Campaign Outcome through Correlation
391
391
@@ -421,10 +421,10 @@ That's it. You have done lots of data tasks using [DuckDB](20240922_definition_d
421
421
422
422
# Conclusion
423
423
424
-
In this comprehensive guide, you have explored the capabilities of using DuckDB in a Daytona Workspace with no stress through hands-on example.
424
+
In this comprehensive guide, you have explored the capabilities of using DuckDB in a Daytona Workspace with no stress through hands-on examples.
425
425
Throughout this guide, you have gained practical experience in:
426
-
- Creating and managing database with DuckDB in memory
427
-
- Perform SQL queries for data cleaning, transformation and splitting
426
+
- Creating and managing a database with DuckDB in memory
427
+
- Performing SQL queries for data cleaning, transformation, and splitting
428
428
- Integrating DuckDB with Python through its client API for data analysis.