Commit d229df0

update guide t remove grammar errors
Signed-off-by: Jeffrey <[email protected]>
1 parent 1898776 commit d229df0

1 file changed

+52
-52
lines changed

guides/20240920_guide_building_a_duckdb_playground_with_daytona.md

+52-52
Original file line number | Diff line number | Diff line change
@@ -1,60 +1,60 @@
11
---
22
title: "Building DuckDB Playground Environment in Daytona Workspace."
3-
description: "Set up a DuckDB environment in Daytona Workspace and master some data tasks including cleaning, reformating and splitting CSV file, with this step-by-step guide."
3+
description: "Set up a DuckDB environment in Daytona Workspace and master some data tasks including cleaning, reformatting, and splitting a CSV file, with this step-by-step guide."
44
date: 2024-09-20
55
author: "Jeffrey Whewhetu"
6-
tags: ["DuckDB", "OLAP", "daytona", "Python"]
6+
tags: ["DuckDB", "OLAP", "Daytona", "Python"]
77
---
88

99
# Building DuckDB Playground Environment in Daytona Workspace
1010

1111
# Introduction
12-
This is a comprehensive hands-on guide in using [DuckDB](20240922_definition_duckdb.md) database to perform a real world data project in a containerized [workspace](20240819_definition_daytona workspace.md) using Daytona. You'll follow me along from setup to actually working with DuckDB cli and even with [Python](20240820_defintion_python.md) via it's Client API. So it's a long ride and you can get a coffee closed by.
12+
This is a comprehensive hands-on guide in using [DuckDB](20240922_definition_duckdb.md) database to perform a real-world data project in a containerized [workspace](20240819_definition_daytona workspace.md) using Daytona. You'll follow me along from setup to actually working with DuckDB cli and even with [Python](20240820_defintion_python.md) via its Client API. So it's a long ride and you can get a coffee nearby.
1313

14-
In this comprehensive guide, you will learn how to prepare personal loan marketing campaign data for importation into a DuckDB database and do some analysis on the dataset. Your tasks will include collecting and reviewing the data, cleaning and structuring it according to a specification, handling errors and inconsistencies, transforming and splitting it into multiple CSV files. The CSV file you'll work on is called `bank_marketing.csv`, download from GitHub [here](https://github.com/c0d33ngr/playground-duckdb/blob/main/bank_marketing.csv)
14+
In this comprehensive guide, you will learn how to prepare personal loan marketing campaign data for importation into a DuckDB database and analyze the dataset. Your tasks will include collecting and reviewing the data, cleaning and structuring it according to a specification, handling errors and inconsistencies, and transforming and splitting it into multiple CSV files. The CSV file you'll work on is called `bank_marketing.csv`, download from GitHub [here](https://github.com/c0d33ngr/playground-duckdb/blob/main/bank_marketing.csv)
1515

1616
# TL;DR
1717

1818
- What you need to follow along with the guide.
19-
- What's DuckDB and Why use it
19+
- What's DuckDB and Why Use it
2020
- Set up a Daytona Workspace with DuckDB [environment](20240819_definition_development environment.md)
2121
- Hands-on practice using DuckDB as a CLI Tool
2222
- Hands-on practice using DuckDB client API with [Python](20240820_defintion_python.md)
2323
- Conclusion
2424

2525
# Prerequisites
2626

27-
To follow along with hands-on guide about DuckDB Playground in Daytona, you'll need to have the following;
27+
To follow along with a hands-on guide about DuckDB Playground in Daytona, you'll need to have the following;
2828

2929
- An [IDE](20240819_definition_integrated development environment _ide_.md)(It could be VS Code, or JetBrains) or just a terminal.
30-
- [Docker](20240819_definition_docker.md) installation on your PC or Mac. Click here for more info
31-
- Daytona installation on your PC or Mac. Click here for more info
32-
- A GitHub account to create a [repository](20240819_definition_repository.md). Link here to create one, if you don’t have
33-
- Basic knowledge of [Git](20240819_definition_git.md) and GitHub
30+
- [Docker](20240819_definition_docker.md) installation on your PC or Mac. Click here for more info.
31+
- Daytona installation on your PC or Mac. Click here for more info.
32+
- A GitHub account to create a [repository](20240819_definition_repository.md). Link here to create one, if you don’t have one.
33+
- Basic knowledge of [Git](20240819_definition_git.md) and GitHub.
3434

35-
# What's DuckDB and Why use it
35+
# What's DuckDB and Why Use it
3636

3737
## DuckDB
3838

39-
[DuckDB](20240922_definition_duckdb.md) is a fast in-process data analytical database with support of feature-rich SQL dialect complemented with deep integrations into client APIs. It's designed to provide high performance on complex queries against large databases in embedded configuration, such as combining tables with hundreds of columns and billions of rows. It's specialized for [online analytical processing (OLAP)](20240922_definition_online_analytical_processing_olap.md) workloads
39+
[DuckDB](20240922_definition_duckdb.md) is a fast in-process data analytical database with support of feature-rich SQL dialect complemented with deep integrations into client APIs. It's designed to perform highly complex queries against large databases in embedded configuration, such as combining tables with hundreds of columns and billions of rows. It's specialized for [online analytical processing (OLAP)](20240922_definition_online_analytical_processing_olap.md) workloads.
4040

4141
## Features of it
4242

43-
DuckDB has lots of features that make it stand out among other databases which focus on [OLAP](20240922_definition_online_analytical_processing_olap.md). Some of the features are:
43+
DuckDB has many features that make it stand out among other databases focusing on [OLAP](20240922_definition_online_analytical_processing_olap.md). Some of the features are:
4444

45-
- **Simple:** It's very simple to install and perform embedded in-process operation.
45+
- **Simple:** It's straightforward to install and perform embedded in-process operations.
4646
- **Portable:** Since it has no external dependencies, it's extremely portable and can be compiled for all major operating systems and CPU architectures.
47-
- **Feature-Rich:** DuckDB has some interesting features such as extensive support for SQL complex queries, integrations to languages like [Python](20240820_defintion_python.md), R and Java and data can be stored as persistent, single-file databases.
48-
- **Speed:** it's faster as it uses columnar-vectorized query execution engine which improves performance to run [OLAP](20240922_definition_online_analytical_processing_olap.md) workloads.
47+
- **Feature-Rich:** DuckDB has some interesting features such as extensive support for SQL complex queries, integrations to languages like [Python](20240820_defintion_python.md), R and Java, and data can be stored as persistent, single-file databases.
48+
- **Speed:** It's faster as it uses a columnar-vectorized query execution engine which improves performance to run [OLAP](20240922_definition_online_analytical_processing_olap.md) workloads.
4949
- **Free:** Lastly, it's a free [open source](20240819_definition_open source.md) database system which anyone can use because of its permissive MIT License.
5050

5151
# Setting up Daytona Workspace for DuckDB Playground
5252

53-
Alright that's enough reading, now let us get started to writing codes. To do so you will need to set up a DuckDB [environment](20240819_definition_development environment.md) in a [Daytona workspace](20240819_definition_daytona workspace.md). Let’s begin.
53+
Alright, that's enough reading, now let us start writing codes. To do so you will need to set up a DuckDB [environment](20240819_definition_development environment.md) in a [Daytona workspace](20240819_definition_daytona workspace.md). Let’s begin.
5454

5555
## Step 1: Create a GitHub Repository
5656

57-
First head to GitHub website and create a [repository](20240819_definition_repository.md) with the name of your choice. For my repository name, I’ll use `playground-duckdb`. The full URL path to the repository is `https://github.com/c0d33ngr/playground-duckdb`
57+
First head to the GitHub website and create a [repository](20240819_definition_repository.md) with the name of your choice. For my repository name, I’ll use `playground-duckdb`. The full URL path to the repository is `https://github.com/c0d33ngr/playground-duckdb`
5858

5959
## Step 2: Clone the repository using Git
6060

@@ -64,19 +64,19 @@ In my case, it’s `git clone https://github.com/c0d33ngr/playground-duckdb`
6464

6565
## Step 3: Prepare your `devcontainer.json` file and dataset in CSV format
6666

67-
Run the command to move into your cloned repository but don’t forget to replace `playground-duckdb` with your own repository name you created if yours isn’t the same with mine.
67+
Run the command to move into your cloned repository but don’t forget to replace `playground-duckdb` with the repository name you created if yours isn’t the same as mine.
6868

6969
```bash
7070
cd playground-duckdb
7171
```
7272

73-
Download the bank campaign dataset you are going to perform data tasks on which is in CSV format, from GitHub repo [here](https://github.com/c0d33ngr/playground-duckdb/blob/main/bank_marketing.csv).
73+
Download the bank campaign dataset you are going to perform data tasks on which is in CSV format, from the GitHub repo [here](https://github.com/c0d33ngr/playground-duckdb/blob/main/bank_marketing.csv).
7474

7575
Note: It has to be in the directory of your clone repository. In my case, it's inside `playground-duckdb`.
7676

77-
Now, lets proceed to the next step.
77+
Now, let us proceed to the next step.
7878

79-
Create a hidden directory named `.devcontainer` where our `devcontainer.json` file will be. Let’s do so and move into it
79+
Create a hidden directory named `.devcontainer` where our `devcontainer.json` file will be. Let’s do so and move into it.
8080

8181
Run the command to do so
8282

@@ -92,7 +92,7 @@ I use `nano` to create my `.devcontainer.json` file using this command.
9292
nano devcontainer.json
9393
```
9494

95-
Paste this code into your `devcontainer.json` file
95+
Paste this code into your `devcontainer.json` file.
9696

9797
```json
9898
{
@@ -109,81 +109,81 @@ Paste this code into your `devcontainer.json` file
109109
The `devcontainer.json` content contains configurations to start your DuckDB environment in a [Daytona workspace](20240819_definition_daytona workspace.md).
110110

111111
- `name`: This sets the name of the development container environment to `DuckDB Playground`.
112-
- `image`: This uses a base Ubuntu image from Microsoft image repository.
113-
- `features`: This configuration add DuckDB installation and Python setups in the Daytona workspace
114-
- `postCreateComand`: This install the Python packages needed for this guide into the workspace.
112+
- `image`: This uses a base Ubuntu image from the Microsoft image repository.
113+
- `features`: This configuration adds DuckDB installation and Python setups in the Daytona workspace
114+
- `postCreateCommand`: This installs the Python packages needed for this guide into the workspace.
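
A quick way to catch a misspelled key before pushing is to check the file for the four keys listed above. This is a minimal sketch and not part of the guide's repository: `missing_keys` and the sample values are hypothetical, and it assumes the file is plain JSON without comments. It spells the last key `postCreateCommand`, which is the spelling the dev container format uses.

```python
import json
from typing import Dict, List

# The four top-level keys the guide describes for devcontainer.json.
EXPECTED_KEYS = ("name", "image", "features", "postCreateCommand")


def missing_keys(config_text: str) -> List[str]:
    """Return the expected keys that are absent from the given JSON text."""
    config: Dict[str, object] = json.loads(config_text)
    return [key for key in EXPECTED_KEYS if key not in config]


# Placeholder values only: the real image, features, and command come from
# the guide's own devcontainer.json file.
sample = json.dumps(
    {"name": "DuckDB Playground", "image": "PLACEHOLDER", "features": {}}
)
print(missing_keys(sample))  # prints ['postCreateCommand']
```

To check a real file, read it with `open(".devcontainer/devcontainer.json").read()` and pass the text to `missing_keys`.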
115115

116-
After created and saved the `devcontainer.json` file, move up back to the root directory of your clone [repository](20240819_definition_repository.md). For me, I run the command below
116+
After creating and saving the `devcontainer.json` file, move up back to the root directory of your clone [repository](20240819_definition_repository.md). For me, I run the command below.
117117

118118
```bash
119119
cd ../..
120120
```
121121

122122
## Step 4: Commit and Push Changes to GitHub
123123

124-
Run this commands to push your changes to GitHub
124+
Run these commands to push your changes to GitHub.
125125

126126
```bash
127127
git add .
128128
git commit -m "add devcontainer.json file"
129129
git push
130130
```
131131

132-
Now, you have successfully push our updated repository that contains our configuration file (`devcontainer.json`) for our DuckDB environment
132+
Now, you have successfully pushed our updated repository, which contains our configuration file (`devcontainer.json`) for our DuckDB environment.
133133

134134
## Step 5: Verify Daytona Installation
135135

136-
Run this command to check `daytona` is properly installed in your PC or Mac
136+
Run this command to check `daytona` is properly installed on your PC or Mac.
137137

138138
```bash
139139
daytona --version
140140
```
141141

142-
You should see your version of `daytona` installed
142+
You should see your version of `daytona` installed.
143143

144144
## Step 6: Create a Daytona Workspace with DuckDB Playground Environment in it
145145

146-
Let’s start daytona server by running the command
146+
Let’s start the daytona server by running the command.
147147

148148
```bash
149149
daytona serve
150150
```
151151

152-
You should see logs like my screenshot
152+
You should see logs like my screenshot.
153153

154154
Open a new tab in your terminal, for Linux its `Shift + Ctrl + T`
155155

156-
Run the command below in a new tab of your terminal and follow the prompt instruction. It would ask you for a [workspace](20240819_definition_daytona workspace.md) name to use, just choose the default.
156+
Run the command below in a new tab of your terminal and follow the prompt instructions. It would ask you for a [workspace](20240819_definition_daytona workspace.md) name to use, choose the default.
157157

158158
Replace `USERNAME` and `REPOSITORY-NAME` with your username for GitHub and the repository name you created earlier.
159159

160160
```bash
161161
daytona create https://github.com/USERNAME/REPOSITORY-NAME
162162
```
163163

164-
In my case, it's this
164+
In my case, it's this.
165165

166166
```bash
167167
daytona create https://github.com/c0d33ngr/playground-duckdb
168168
```
169169

170-
After you successfully ran the above command you should see screenshot like mine showing your Daytona workspace that contains the DuckDB environment is running
170+
After you successfully run the above command you should see a screenshot like mine showing your Daytona workspace that contains the DuckDB environment is running.
171171

172172
You can now run this command to open the DuckDB [environment](20240819_definition_development environment.md) in your default [IDE](20240819_definition_integrated development environment _ide_.md) you choose when installing Daytona (Replace `WORKSPACE-NAME` with the name you used when creating the workspace above, in my case it's `playground-duckdb`).
173173

174174
```bash
175175
daytona code WORKSPACE-NAME
176176
```
177177

178-
That’s it. Daytona will create a DuckDB playground environment for you and open it in your default IDE you set.
178+
That’s it. Daytona will create a DuckDB playground environment for you and open it in the default IDE you set.
179179

180180
# Using DuckDB as a Command Line Interface (CLI) Tool
181181

182-
In this section, you'll learn how to work with [DuckDB](20240922_definition_duckdb.md) by creating a database from a CSV file, examining its structure, retrieving distinct values, and exporting data to separate CSV files for client, campaign, and economics data. Finally, you'll verify the exported data, gaining hands-on experience with DuckDB's querying and data manipulation capabilities. Lets get started
182+
In this section, you'll learn how to work with [DuckDB](20240922_definition_duckdb.md) by creating a database from a CSV file, examining its structure, retrieving distinct values, and exporting data to separate CSV files for client, campaign, and economics data. Finally, you'll verify the exported data, gaining hands-on experience with DuckDB's querying and data manipulation capabilities. Let us get started.
183183

184184
## Step 1: Enter DuckDB Interactive Shell
185185

186-
By now, you should be in your default IDE set up using `daytona`. In your IDE terminal, type the command below to enter into DuckDB database shell in interactive mode where you'll run some SQL based queries that conformed to DuckDB database.
186+
By now, you should be in your default IDE set up using `daytona`. In your IDE terminal, type the command below to enter into the DuckDB database shell in interactive mode where you'll run some SQL-based queries that conform to the DuckDB database.
187187

188188
```sql
189189
duckdb
@@ -231,18 +231,18 @@ COPY (
231231
) TO 'client.csv' (DELIMITER ',', HEADER TRUE);
232232
```
233233

234-
## Step 5: Retrieve List of Distinct Records in `day` Column
234+
## Step 5: Retrieve the List of Distinct Records in `day` Column
235235

236-
Run the following SQL query to retrieve a list of distinct days from the bank_marketing table. The results would be useful in the preparation of the SQL query for step 7. We need to know the unique records in the `day` column.
236+
Run the following SQL query to retrieve a list of distinct days from the bank_marketing table. The results would be useful in preparing the SQL query for step 7. We need to know the unique records in the `day` column.
237237

238238
```sql
239239
SELECT DISTINCT day
240240
FROM 'bank_marketing.csv';
241241
```
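
For readers who want to see what `SELECT DISTINCT` does to a column, the same idea can be sketched in plain Python. This is an illustration under stated assumptions: `distinct_values` is a hypothetical helper, the three sample rows are made up and are not taken from `bank_marketing.csv`, and the result is sorted only for readability (SQL `DISTINCT` guarantees no particular order).

```python
import csv
import io
from typing import List


def distinct_values(csv_text: str, column: str) -> List[str]:
    """Collect the unique values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # A set keeps one copy of each value, like DISTINCT does.
    unique = {row[column] for row in reader}
    # Values are strings here, so the sort is lexicographic.
    return sorted(unique)


# Made-up rows for illustration only.
sample = "day,month\n5,may\n13,may\n5,jun\n"
print(distinct_values(sample, "day"))    # prints ['13', '5']
print(distinct_values(sample, "month"))  # prints ['jun', 'may']
```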
242242

243-
## Step 6: Retrieve List of Distinct Records in `month` Column
243+
## Step 6: Retrieve the List of Distinct Records in `month` Column
244244

245-
Run the following SQL query to retrieve list of distinct months from the `bank_marketing` table. The results are also needed for the creation of a new column called `last_contact_date` later in step 7.
245+
Run the following SQL query to retrieve the list of distinct months from the `bank_marketing` table. The results are also needed for the creation of a new column called `last_contact_date` later in step 7.
246246

247247
```sql
248248
SELECT DISTINCT month
@@ -283,15 +283,15 @@ COPY (
283283
WHEN LOWER(month) = 'oct' THEN 10
284284
WHEN LOWER(month) = 'nov' THEN 11
285285
WHEN LOWER(month) = 'dec' THEN 12
286-
ELSE NULL -- default value if month is unknown
286+
ELSE NULL -- default value if the month is unknown
287287
END,
288288
CAST(day AS BIGINT)
289289
) AS last_contact_date
290290
FROM bank_marketing
291291
) TO 'campaign.csv' (DELIMITER ',', HEADER TRUE);
292292
```
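
The month-to-number mapping in the `CASE` expression above can be sketched in Python to make the logic easier to test in isolation. This is a hedged illustration, not the guide's code: `last_contact_date` and `MONTH_NUMBERS` are hypothetical names, the hunk shows only the `oct` to `dec` branches so the earlier months are assumed to follow the same pattern, and the year is passed in as a parameter because the visible part of the query does not show which year it uses.

```python
from datetime import date
from typing import Optional

# Lowercase month abbreviations mapped to month numbers, mirroring the
# WHEN LOWER(month) = '...' THEN n branches (jan to sep are assumed).
MONTH_NUMBERS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def last_contact_date(month: str, day: int, year: int) -> Optional[date]:
    """Build a date from a month abbreviation, a day, and a caller-supplied year.

    Returns None for an unknown month, like the ELSE NULL branch.
    """
    month_number = MONTH_NUMBERS.get(month.lower())
    if month_number is None:
        return None
    return date(year, month_number, day)


# The year below is an arbitrary example value.
print(last_contact_date("Oct", 5, 2022))  # prints 2022-10-05
print(last_contact_date("xyz", 5, 2022))  # prints None
```

Unlike the SQL version, this sketch raises `ValueError` for an impossible day such as 31 in a 30-day month, so invalid rows surface immediately.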
293293

294-
## Step 8: Export Economical Data to CSV
294+
## Step 8: Export economic data to CSV
295295

296296
Run the following SQL query to export economics data to a CSV file named `economics.csv`
297297

@@ -307,7 +307,7 @@ COPY (
307307

308308
## Step 9: Read Data from Exported CSV files
309309

310-
Run the following SQL queries to read data from the `client.csv`, `campaign.csv` and `economics.csv` files.
310+
Run the following SQL queries to read data from the `client.csv`, `campaign.csv`, and `economics.csv` files.
311311

312312
```sql
313313
SELECT *
@@ -324,11 +324,11 @@ SELECT *
324324
FROM 'economics.csv';
325325
```
326326

327-
Now, our three CSV files are prepared for some analysis using DuckDB Client API via Python. Let head to the next section for the analysis.
327+
Now, our three CSV files have been prepared for analysis using DuckDB Client API via Python. Let's head to the next section for the analysis.
328328

329329
# Using DuckDB with Python through its Client API
330330

331-
In this section, you'll learn how to analyze and visualize data using [DuckDB](20240922_definition_duckdb.md) and [Matplotlib](20240922_definition_matplotlib.md). You'll calculate the campaign success rate, create a bar chart to compare average client age by education level, and generate a scatter plot to explore the relationship between contact duration and campaign outcome. We'll use the cleaned and transformed CSV files spilt from our `bank_marketing.csv` in this section.
331+
In this section, you'll learn how to analyze and visualize data using [DuckDB](20240922_definition_duckdb.md) and [Matplotlib](20240922_definition_matplotlib.md). You'll calculate the campaign success rate, create a bar chart to compare average client age by education level and generate a scatter plot to explore the relationship between contact duration and campaign outcome. We'll use the cleaned and transformed CSV files split from our `bank_marketing.csv` in this section.
332332

333333
## Step 1: Analysis of Customer Campaign Success Rate
334334

@@ -385,7 +385,7 @@ plt.tight_layout()
385385
plt.show()
386386
```
387387

388-
Run the file in your IDE terminal using `python3 client_age_by_education.py` and you should see visualization.
388+
Run the file in your IDE terminal using `python3 client_age_by_education.py` and you should see the visualization.
389389

390390
## Step 3: Analysis and Visualization of Contact Duration and Campaign Outcome through Correlation
391391

@@ -421,10 +421,10 @@ That's it. You have done lots of data tasks using [DuckDB](20240922_definition_d
421421

422422
# Conclusion
423423

424-
In this comprehensive guide, you have explored the capabilities of using DuckDB in a Daytona Workspace with no stress through hands-on example.
424+
In this comprehensive guide, you have explored the capabilities of using DuckDB in a Daytona Workspace with no stress through hands-on examples.
425425
Throughout this guide, you have gained practical experience in:
426-
- Creating and managing database with DuckDB in memory
427-
- Perform SQL queries for data cleaning, transformation and splitting
426+
- Creating and managing a database with DuckDB in memory
427+
- Perform SQL queries for data cleaning, transformation, and splitting
428428
- Integration of DuckDB using its Client API with Python for data analysis.
429429

430430
# References
