[Doc] Files() supports NAS (StarRocks#54030)
Signed-off-by: 絵空事スピリット <[email protected]>
EsoragotoSpirit authored Dec 17, 2024
1 parent e78d2f6 commit 82ef268
Showing 6 changed files with 172 additions and 10 deletions.
@@ -37,7 +37,7 @@ This document outlines the features of various data loading and unloading method
</tr>
<tr>
<td>INSERT from FILES</td>
<td rowspan="2">HDFS, S3, OSS, Azure, GCS</td>
<td rowspan="2">HDFS, S3, OSS, Azure, GCS, NFS(NAS) [5]</td>
<td>Yes (v3.3+)</td>
<td>To be supported</td>
<td>Yes (v3.1+)</td>
@@ -104,6 +104,8 @@ This document outlines the features of various data loading and unloading method

[4]\: Currently, only INSERT from FILES is supported for loading with PIPE.

[5]\: You need to mount the same NAS device as NFS under the same directory on each BE or CN node to access the files in NFS via the `file://` protocol.

:::

#### JSON CDC formats
@@ -159,7 +161,7 @@ This document outlines the features of various data loading and unloading method
<tr>
<td>INSERT INTO FILES</td>
<td>N/A</td>
<td>HDFS, S3, OSS, Azure, GCS</td>
<td>HDFS, S3, OSS, Azure, GCS, NFS(NAS) [3]</td>
<td>Yes (v3.3+)</td>
<td>To be supported</td>
<td>Yes (v3.2+)</td>
@@ -208,6 +210,8 @@ This document outlines the features of various data loading and unloading method

[2]\: Currently, unloading data using PIPE is not supported.

[3]\: You need to mount the same NAS device as NFS under the same directory on each BE or CN node to access the files in NFS via the `file://` protocol.

:::

## File format-related parameters
58 changes: 56 additions & 2 deletions docs/en/sql-reference/sql-functions/table-functions/files.md
@@ -19,6 +19,7 @@ Currently, the FILES() function supports the following data sources and file for
- Google Cloud Storage
- Other S3-compatible storage system
- Microsoft Azure Blob Storage
- NFS(NAS)
- **File formats:**
- Parquet
- ORC
@@ -90,6 +91,19 @@ The URI used to access the files. You can specify a path or a file.
-- Example: "path" = "wasbs://[email protected]/path/file.parquet"
```

- To access NFS(NAS):

```SQL
"path" = "file:///<absolute_path>"
-- Example: "path" = "file:///home/ubuntu/parquetfile/file.parquet"
```

:::note

To access the files in NFS via the `file://` protocol, you need to mount the same NAS device as NFS under the same directory on each BE or CN node.

:::

#### data_format

The format of the data file. Valid values: `parquet`, `orc`, and `csv`.
@@ -448,6 +462,15 @@ SELECT * FROM FILES(
2 rows in set (22.335 sec)
```
Query the data from the Parquet files in NFS(NAS):
```SQL
SELECT * FROM FILES(
'path' = 'file:///home/ubuntu/parquetfile/*.parquet',
'format' = 'parquet'
);
```
#### Example 2: Insert the data rows from a file
Insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`:
@@ -465,6 +488,18 @@ Query OK, 2 rows affected (23.03 sec)
{'label':'insert_d8d4b2ee-ac5c-11ed-a2cf-4e1110a8f63b', 'status':'VISIBLE', 'txnId':'2440'}
```
Insert the data rows from the CSV files in NFS(NAS) into the table `insert_wiki_edit`:
```SQL
INSERT INTO insert_wiki_edit
SELECT * FROM FILES(
'path' = 'file:///home/ubuntu/csvfile/*.csv',
'format' = 'csv',
'csv.column_separator' = ',',
'csv.row_delimiter' = '\n'
);
```
#### Example 3: CTAS with data rows from a file
Create a table named `ctas_wiki_edit` and insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table:
@@ -657,8 +692,7 @@ DESC FILES(
Unload all data rows in `sales_records` as multiple Parquet files under the path **/unload/partitioned/** in the HDFS cluster. These files are stored in different subpaths distinguished by the values in the column `sales_time`.
```SQL
INSERT INTO
FILES(
INSERT INTO FILES(
"path" = "hdfs://xxx.xx.xxx.xx:9000/unload/partitioned/",
"format" = "parquet",
"hadoop.security.authentication" = "simple",
@@ -669,3 +703,23 @@
)
SELECT * FROM sales_records;
```
Unload the query results into CSV and Parquet files in NFS(NAS):
```SQL
-- CSV
INSERT INTO FILES(
'path' = 'file:///home/ubuntu/csvfile/',
'format' = 'csv',
'csv.column_separator' = ',',
'csv.row_delimiter' = '\n'
)
SELECT * FROM sales_records;
-- Parquet
INSERT INTO FILES(
'path' = 'file:///home/ubuntu/parquetfile/',
'format' = 'parquet'
)
SELECT * FROM sales_records;
```
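To verify the unload, you can read the files back with FILES() from the same mount. The following is a minimal sketch that assumes the Parquet files above were written to `/home/ubuntu/parquetfile/`, mounted at the same path on every BE or CN node:
```SQL
-- Read the unloaded Parquet files back from the NFS mount as a quick sanity check.
SELECT * FROM FILES(
    'path' = 'file:///home/ubuntu/parquetfile/*.parquet',
    'format' = 'parquet'
)
LIMIT 10;
```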
26 changes: 25 additions & 1 deletion docs/en/unloading/unload_using_insert_into_files.md
@@ -12,7 +12,7 @@ Compared to other data export methods supported by StarRocks, unloading data wit

> **NOTE**
>
> Please note that unloading data with INSERT INTO FILES does not support exporting data into local file systems.
> Please note that unloading data with INSERT INTO FILES does not support exporting data directly into local file systems. However, you can export data to local files via NFS. See [Unload to local files using NFS](#unload-to-local-files-using-nfs).
## Preparation

@@ -201,6 +201,30 @@ FILES(
SELECT * FROM sales_records;
```

### Unload to local files using NFS

To access the files in NFS via the `file://` protocol, you need to mount the same NAS device as NFS under the same directory on each BE or CN node.

Example:

```SQL
-- Unload data into CSV files.
INSERT INTO FILES(
'path' = 'file:///home/ubuntu/csvfile/',
'format' = 'csv',
'csv.column_separator' = ',',
'csv.row_delimiter' = '\n'
)
SELECT * FROM sales_records;

-- Unload data into Parquet files.
INSERT INTO FILES(
'path' = 'file:///home/ubuntu/parquetfile/',
'format' = 'parquet'
)
SELECT * FROM sales_records;
```
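If you want to confirm the schema of the files you just unloaded, DESC FILES() can read it back from the same mount. This is a sketch under the same assumption that the Parquet files landed in `/home/ubuntu/parquetfile/` on every BE or CN node:

```SQL
-- Inspect the schema StarRocks infers from the unloaded Parquet files on the NFS mount.
DESC FILES(
    'path' = 'file:///home/ubuntu/parquetfile/*.parquet',
    'format' = 'parquet'
);
```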

## See also

- For more instructions on the usage of INSERT, see [SQL reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md).
@@ -37,7 +37,7 @@ sidebar_label: "能力边界"
</tr>
<tr>
<td>INSERT from FILES</td>
<td rowspan="2">HDFS, S3, OSS, Azure, GCS</td>
<td rowspan="2">HDFS, S3, OSS, Azure, GCS, NFS(NAS) [5]</td>
<td>Yes (v3.3+)</td>
<td>To be supported</td>
<td>Yes (v3.1+)</td>
@@ -104,6 +104,8 @@ sidebar_label: "能力边界"

[4]\: Currently, only INSERT from FILES is supported for loading with PIPE.

[5]\: You need to mount the same NAS device as NFS under the same directory on each BE or CN node to access the files in NFS via the `file://` protocol.

:::

#### JSON CDC formats
@@ -159,7 +161,7 @@ sidebar_label: "能力边界"
<tr>
<td>INSERT INTO FILES</td>
<td>N/A</td>
<td>HDFS, S3, OSS, Azure, GCS</td>
<td>HDFS, S3, OSS, Azure, GCS, NFS(NAS) [3]</td>
<td>Yes (v3.3+)</td>
<td>To be supported</td>
<td>Yes (v3.2+)</td>
@@ -208,6 +210,8 @@ sidebar_label: "能力边界"

[2]\: Currently, unloading data using PIPE is not supported.

[3]\: You need to mount the same NAS device as NFS under the same directory on each BE or CN node to access the files in NFS via the `file://` protocol.

:::

## File format-related parameters
58 changes: 56 additions & 2 deletions docs/zh/sql-reference/sql-functions/table-functions/files.md
@@ -18,6 +18,7 @@ displayed_sidebar: docs
- AWS S3
- Google Cloud Storage
- Microsoft Azure Blob Storage
- NFS(NAS)
- **File formats:**
- Parquet
- ORC
@@ -89,6 +90,19 @@ FILES( data_location , data_format [, schema_detect ] [, StorageCredentialParams
-- Example: "path" = "wasbs://[email protected]/path/file.parquet"
```

- To access NFS(NAS), you need to specify this parameter as:

```SQL
"path" = "file:///<absolute_path>"
-- Example: "path" = "file:///home/ubuntu/parquetfile/file.parquet"
```

:::note

To access the files in NFS via the `file://` protocol, you need to mount the same NAS device as NFS under the same directory on each BE or CN node.

:::

#### data_format

The format of the data file. Valid values: `parquet`, `orc`, and `csv`.
@@ -447,6 +461,15 @@ SELECT * FROM FILES(
2 rows in set (22.335 sec)
```
Query the data from the Parquet files in NFS(NAS):
```SQL
SELECT * FROM FILES(
'path' = 'file:///home/ubuntu/parquetfile/*.parquet',
'format' = 'parquet'
);
```
#### Example 2: Insert the data rows from a file
Insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`:
@@ -464,6 +487,18 @@ Query OK, 2 rows affected (23.03 sec)
{'label':'insert_d8d4b2ee-ac5c-11ed-a2cf-4e1110a8f63b', 'status':'VISIBLE', 'txnId':'2440'}
```
Insert the data rows from the CSV files in NFS(NAS) into the table `insert_wiki_edit`:
```SQL
INSERT INTO insert_wiki_edit
SELECT * FROM FILES(
'path' = 'file:///home/ubuntu/csvfile/*.csv',
'format' = 'csv',
'csv.column_separator' = ',',
'csv.row_delimiter' = '\n'
);
```
#### Example 3: CTAS with data rows from a file
Create a table named `ctas_wiki_edit` and insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table:
@@ -656,8 +691,7 @@ DESC FILES(
Unload all data rows in `sales_records` as multiple Parquet files under the path **/unload/partitioned/** in the HDFS cluster. These files are stored in different subpaths distinguished by the values in the column `sales_time`.
```SQL
INSERT INTO
FILES(
INSERT INTO FILES(
"path" = "hdfs://xxx.xx.xxx.xx:9000/unload/partitioned/",
"format" = "parquet",
"hadoop.security.authentication" = "simple",
@@ -668,3 +702,23 @@
)
SELECT * FROM sales_records;
```
Unload the query results into CSV or Parquet files in NFS(NAS):
```SQL
-- CSV
INSERT INTO FILES(
'path' = 'file:///home/ubuntu/csvfile/',
'format' = 'csv',
'csv.column_separator' = ',',
'csv.row_delimiter' = '\n'
)
SELECT * FROM sales_records;
-- Parquet
INSERT INTO FILES(
'path' = 'file:///home/ubuntu/parquetfile/',
'format' = 'parquet'
)
SELECT * FROM sales_records;
```
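To verify the unload, you can read the files back with FILES() from the same mount. This is a minimal sketch that assumes the Parquet files above were written to `/home/ubuntu/parquetfile/`, mounted at the same path on every BE or CN node:
```SQL
-- Read the unloaded Parquet files back from the NFS mount as a quick sanity check.
SELECT * FROM FILES(
    'path' = 'file:///home/ubuntu/parquetfile/*.parquet',
    'format' = 'parquet'
)
LIMIT 10;
```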
24 changes: 23 additions & 1 deletion docs/zh/unloading/unload_using_insert_into_files.md
@@ -12,7 +12,7 @@ displayed_sidebar: docs

> **NOTE**
>
> Unloading data with INSERT INTO FILES does not support exporting data into local file systems.
> Unloading data with INSERT INTO FILES does not support exporting data directly into local file systems. However, you can export data to local files via NFS. See [Unload to local files using NFS](#unload-to-local-files-using-nfs).
## Preparation

@@ -199,6 +199,28 @@ FILES(
SELECT * FROM sales_records;
```

### Unload to local files using NFS

To access the files in NFS via the `file://` protocol, you need to mount the same NAS device as NFS under the same directory on each BE or CN node.

```SQL
-- Unload data into CSV files.
INSERT INTO FILES(
'path' = 'file:///home/ubuntu/csvfile/',
'format' = 'csv',
'csv.column_separator' = ',',
'csv.row_delimiter' = '\n'
)
SELECT * FROM sales_records;

-- Unload data into Parquet files.
INSERT INTO FILES(
'path' = 'file:///home/ubuntu/parquetfile/',
'format' = 'parquet'
)
SELECT * FROM sales_records;
```
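If you want to confirm the schema of the files you just unloaded, DESC FILES() can read it back from the same mount. This is a sketch under the same assumption that the Parquet files landed in `/home/ubuntu/parquetfile/` on every BE or CN node:

```SQL
-- Inspect the schema StarRocks infers from the unloaded Parquet files on the NFS mount.
DESC FILES(
    'path' = 'file:///home/ubuntu/parquetfile/*.parquet',
    'format' = 'parquet'
);
```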

## See also

- For more instructions on the usage of INSERT, see [SQL reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md).
