
Commit 2f04812

AI Translate 17-table-functions to Simplified-Chinese (#2816)
* [INIT] Start translation to Simplified-Chinese
* 🌐 Translate 01-infer-schema.md to Simplified-Chinese

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 0c9391a commit 2f04812

File tree

2 files changed: +201 -45 lines changed

.translation-init

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-Translation initialization: 2025-09-26T01:16:02.268240
+Translation initialization: 2025-09-26T04:21:21.696179

docs/cn/sql-reference/20-sql-functions/17-table-functions/01-infer-schema.md

Lines changed: 200 additions & 44 deletions
@@ -4,11 +4,26 @@ title: INFER_SCHEMA

Automatically detects the file metadata schema and retrieves column definitions.

+`infer_schema` currently supports the following file formats:
+- **Parquet** - Native schema inference support
+- **CSV** - Supports custom delimiters and header detection
+- **NDJSON** - Newline-delimited JSON files

-:::caution
+**Compression support**: All formats support compressed files with the `.zip`, `.xz`, and `.zst` extensions.

-`infer_schema` currently supports only the Parquet file format.
+:::info File size limit
+Schema inference is limited to a maximum of **100MB** per individual file.
+:::
+
+:::info Schema merging
+When processing multiple files, `infer_schema` automatically merges differing schemas:
+
+- **Compatible types** are promoted (for example, INT8 + INT16 → INT16)
+- **Incompatible types** fall back to **VARCHAR** (for example, INT + FLOAT → VARCHAR)
+- **Missing columns** in some files are marked as **nullable**
+- **New columns** in later files are added to the final schema

+This ensures that all files can be read with a unified schema.
:::

## Syntax
@@ -17,81 +32,222 @@ title: INFER_SCHEMA
INFER_SCHEMA(
  LOCATION => '{ internalStage | externalStage }'
  [ PATTERN => '<regex_pattern>']
+  [ FILE_FORMAT => '<format_name>' ]
+  [ MAX_RECORDS_PRE_FILE => <number> ]
+  [ MAX_FILE_COUNT => <number> ]
)
```

-Where:
+## Parameters

-### internalStage
+| Parameter | Description | Default | Example |
+|-----------|-------------|---------|---------|
+| `LOCATION` | Stage location: `@<stage_name>[/<path>]` | Required | `'@my_stage/data/'` |
+| `PATTERN` | File name matching pattern | All files | `'*.csv'`, `'*.parquet'` |
+| `FILE_FORMAT` | Name of the file format used for parsing | Stage format | `'csv_format'`, `'NDJSON'` |
+| `MAX_RECORDS_PRE_FILE` | Maximum number of records sampled per file | All records | `100`, `1000` |
+| `MAX_FILE_COUNT` | Maximum number of files to process | All files | `5`, `10` |
+
+## Examples
+
+### Parquet files

```sql
-internalStage ::= @<internal_stage_name>[/<path>]
+-- Create a stage and export data
+CREATE STAGE test_parquet;
+COPY INTO @test_parquet FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'PARQUET');
+
+-- Infer the schema from Parquet files using a pattern
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_parquet',
+    pattern => '*.parquet'
+);
+```
+
+Result:
+```
++-------------+-----------------+----------+-----------+----------+
+| column_name | type            | nullable | filenames | order_id |
++-------------+-----------------+----------+-----------+----------+
+| number      | BIGINT UNSIGNED | false    | data_...  | 0        |
++-------------+-----------------+----------+-----------+----------+
```

-### externalStage
+### CSV files

```sql
-externalStage ::= @<external_stage_name>[/<path>]
+-- Create a stage and export CSV data
+CREATE STAGE test_csv;
+COPY INTO @test_csv FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'CSV');
+
+-- Create a CSV file format
+CREATE FILE FORMAT csv_format TYPE = 'CSV';
+
+-- Infer the schema using a pattern and a file format
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_csv',
+    pattern => '*.csv',
+    file_format => 'csv_format'
+);
```

-### PATTERN = 'regex_pattern'
+Result:
+```
++-------------+---------+----------+-----------+----------+
+| column_name | type    | nullable | filenames | order_id |
++-------------+---------+----------+-----------+----------+
+| column_1    | BIGINT  | true     | data_...  | 0        |
++-------------+---------+----------+-----------+----------+
+```

-A regular expression pattern string based on [PCRE2](https://www.pcre.org/current/doc/html/), enclosed in single quotes, specifying the file names to match. Click [here](#loading-data-with-pattern-matching) for examples. For PCRE2 syntax, see http://www.pcre.org/current/doc/html/pcre2syntax.html.
+CSV files with headers:

-## Examples
+```sql
+-- Create a CSV file format that supports headers
+CREATE FILE FORMAT csv_headers_format
+    TYPE = 'CSV'
+    field_delimiter = ','
+    skip_header = 1;
+
+-- Export data with headers
+CREATE STAGE test_csv_headers;
+COPY INTO @test_csv_headers FROM (
+    SELECT number as user_id, 'user_' || number::string as user_name
+    FROM numbers(5)
+) FILE_FORMAT = (TYPE = 'CSV', output_header = true);
+
+-- Infer the schema with headers
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_csv_headers',
+    file_format => 'csv_headers_format'
+);
+```

-Generate a Parquet file in a stage:
+Limit the number of records to speed up inference:

```sql
-CREATE STAGE infer_parquet FILE_FORMAT = (TYPE = PARQUET);
-COPY INTO @infer_parquet FROM (SELECT * FROM numbers(10)) FILE_FORMAT = (TYPE = PARQUET);
+-- Sample only the first 5 records for schema inference
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_csv',
+    pattern => '*.csv',
+    file_format => 'csv_format',
+    max_records_pre_file => 5
+);
```

+### NDJSON files
+
```sql
-LIST @infer_parquet;
-+-------------------------------------------------------+------+------------------------------------+-------------------------------+---------+
-| name                                                  | size | md5                                | last_modified                 | creator |
-+-------------------------------------------------------+------+------------------------------------+-------------------------------+---------+
-| data_e0fd9cba-f45c-4c43-aa07-d6d87d134378_0_0.parquet | 258  | "7DCC9FFE04EA1F6882AED2CF9640D3D4" | 2023-02-09 05:21:52.000 +0000 | NULL    |
-+-------------------------------------------------------+------+------------------------------------+-------------------------------+---------+
+-- Create a stage and export NDJSON data
+CREATE STAGE test_ndjson;
+COPY INTO @test_ndjson FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'NDJSON');
+
+-- Infer the schema using a pattern and the NDJSON format
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_ndjson',
+    pattern => '*.ndjson',
+    file_format => 'NDJSON'
+);
```

-### `infer_schema`
+Result:
+```
++-------------+---------+----------+-----------+----------+
+| column_name | type    | nullable | filenames | order_id |
++-------------+---------+----------+-----------+----------+
+| number      | BIGINT  | true     | data_...  | 0        |
++-------------+---------+----------+-----------+----------+
+```

+Limit the number of records to speed up inference:

```sql
-SELECT * FROM INFER_SCHEMA(location => '@infer_parquet/data_e0fd9cba-f45c-4c43-aa07-d6d87d134378_0_0.parquet');
-+-------------+-----------------+----------+----------+
-| column_name | type            | nullable | order_id |
-+-------------+-----------------+----------+----------+
-| number      | BIGINT UNSIGNED | 0        | 0        |
-+-------------+-----------------+----------+----------+
+-- Sample only the first 5 records for schema inference
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_ndjson',
+    pattern => '*.ndjson',
+    file_format => 'NDJSON',
+    max_records_pre_file => 5
+);
```

-### `infer_schema` with pattern matching
+### Multi-file schema merging
+
+When file schemas differ, `infer_schema` merges them intelligently:

```sql
-SELECT * FROM infer_schema(location => '@infer_parquet/', pattern => '.*parquet');
-+-------------+-----------------+----------+----------+
-| column_name | type            | nullable | order_id |
-+-------------+-----------------+----------+----------+
-| number      | BIGINT UNSIGNED | 0        | 0        |
-+-------------+-----------------+----------+----------+
+-- Suppose there are several CSV files with different schemas:
+-- file1.csv: id(INT), name(VARCHAR)
+-- file2.csv: id(INT), name(VARCHAR), age(INT)
+-- file3.csv: id(FLOAT), name(VARCHAR), age(INT)
+
+SELECT * FROM INFER_SCHEMA(
+    location => '@my_stage/',
+    pattern => '*.csv',
+    file_format => 'csv_format'
+);
```

-### Create a table from Parquet files
+The result shows the merged schema:
+```
++-------------+---------+----------+-----------+----------+
+| column_name | type    | nullable | filenames | order_id |
++-------------+---------+----------+-----------+----------+
+| id          | VARCHAR | true     | file1,... | 0        |  -- INT+FLOAT→VARCHAR
+| name        | VARCHAR | true     | file1,... | 1        |
+| age         | BIGINT  | true     | file1,... | 2        |  -- missing in file1 → nullable
++-------------+---------+----------+-----------+----------+
+```

-`infer_schema` can only display the schema of Parquet files; it cannot create a table from them.
+### Pattern matching and file limits

-To create a table from Parquet files:
+Infer the schema from multiple files using pattern matching:

```sql
-CREATE TABLE mytable AS SELECT * FROM @infer_parquet/ (pattern=>'.*parquet') LIMIT 0;
-
-DESC mytable;
-+--------+-----------------+------+---------+-------+
-| Field  | Type            | Null | Default | Extra |
-+--------+-----------------+------+---------+-------+
-| number | BIGINT UNSIGNED | NO   | 0       |       |
-+--------+-----------------+------+---------+-------+
+-- Infer the schema from all CSV files in the directory
+SELECT * FROM INFER_SCHEMA(
+    location => '@my_stage/',
+    pattern => '*.csv'
+);
+```
+
+Limit the number of files processed to improve performance:
+
+```sql
+-- Process only the first 5 matching files
+SELECT * FROM INFER_SCHEMA(
+    location => '@my_stage/',
+    pattern => '*.csv',
+    max_file_count => 5
+);
+```
+
+### Compressed files
+
+`infer_schema` handles compressed files automatically:
+
+```sql
+-- Works with compressed CSV files
+SELECT * FROM INFER_SCHEMA(location => '@my_stage/data.csv.zip');
+
+-- Works with compressed NDJSON files
+SELECT * FROM INFER_SCHEMA(
+    location => '@my_stage/data.ndjson.xz',
+    file_format => 'NDJSON',
+    max_records_pre_file => 50
+);
+```
+
+### Create a table from the inferred schema
+
+The `infer_schema` function displays the schema but does not create a table. To create a table from the inferred schema:
+
+```sql
+-- Create the table structure from the file schema
+CREATE TABLE my_table AS
+SELECT * FROM @my_stage/ (pattern=>'*.parquet')
+LIMIT 0;
+
+-- Verify the table structure
+DESC my_table;
```
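The merging rules added in this commit can be exercised end to end by writing two files with overlapping but different columns into one stage and running `INFER_SCHEMA` over both. The sketch below is illustrative only and is not part of the commit: the stage name `merge_demo` and the sample data are hypothetical, and the expectations in the comments simply restate the rules documented in the new text.

```sql
-- Illustrative sketch (hypothetical stage name and data, not part of the commit).
-- Two Parquet files with overlapping but different columns:
CREATE STAGE merge_demo;

-- File 1: "id" written as an integer, no "score" column.
COPY INTO @merge_demo FROM (SELECT number AS id FROM numbers(3))
FILE_FORMAT = (TYPE = 'PARQUET');

-- File 2: "id" written as a float, plus a new "score" column.
COPY INTO @merge_demo FROM (SELECT number::FLOAT AS id, number * 10 AS score FROM numbers(3))
FILE_FORMAT = (TYPE = 'PARQUET');

-- Per the documented rules, the merged schema should report "id" as VARCHAR
-- (INT + FLOAT are incompatible) and "score" as nullable (it is missing
-- from the first file).
SELECT * FROM INFER_SCHEMA(
    location => '@merge_demo',
    pattern => '*.parquet'
);
```

If the two files instead used compatible integer widths (say INT8 and INT16), the same query should report the promoted type rather than VARCHAR, per the "compatible types are promoted" rule above.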