Commit 0c9391a: Update 01-infer-schema.md
1 parent ccb2355

docs/en/sql-reference/20-sql-functions/17-table-functions/01-infer-schema.md
1 file changed: +200 additions, -44 deletions
---
title: INFER_SCHEMA
---

Automatically detects the file metadata schema and retrieves the column definitions.

`infer_schema` currently supports the following file formats:

- **Parquet** - Native support for schema inference
- **CSV** - With customizable delimiters and header detection
- **NDJSON** - Newline-delimited JSON files

**Compression Support**: All formats also support compressed files with the extensions `.zip`, `.xz`, and `.zst`.

:::info File Size Limit
Each individual file has a maximum size limit of **100MB** for schema inference.
:::

:::info Schema Merging
When processing multiple files, `infer_schema` automatically merges their differing schemas:

- **Compatible types** are promoted (e.g., INT8 + INT16 → INT16)
- **Incompatible types** fall back to **VARCHAR** (e.g., INT + FLOAT → VARCHAR)
- **Missing columns** in some files are marked as **nullable**
- **New columns** from later files are added to the final schema

This ensures all files can be read using the unified schema.
:::
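The merging rules above can be modeled with a short sketch. The following Python is illustrative only: it is not Databend's implementation, and the type names, the promotion table, and the nullable handling are simplified assumptions that model just the rules listed (in practice, columns may also be nullable for other reasons).

```python
# Illustrative model of the schema-merging rules described above.
# Not Databend's implementation; types and promotion are simplified.

# Widening order for "compatible" integer types.
INT_WIDTH = {"INT8": 1, "INT16": 2, "INT32": 3, "BIGINT": 4}

def merge_types(a: str, b: str) -> str:
    if a == b:
        return a
    if a in INT_WIDTH and b in INT_WIDTH:
        # Compatible types are promoted, e.g. INT8 + INT16 -> INT16.
        return a if INT_WIDTH[a] >= INT_WIDTH[b] else b
    # Incompatible types fall back to VARCHAR, e.g. INT32 + FLOAT -> VARCHAR.
    return "VARCHAR"

def merge_schemas(schemas: list[dict]) -> dict:
    """Merge per-file schemas {column: type} into {column: (type, nullable)}."""
    merged: dict[str, str] = {}
    for schema in schemas:
        for col, typ in schema.items():
            # New columns from later files are added to the final schema.
            merged[col] = merge_types(merged[col], typ) if col in merged else typ
    # Columns missing from some files are marked nullable.
    return {col: (typ, any(col not in s for s in schemas))
            for col, typ in merged.items()}

files = [
    {"id": "INT32", "name": "VARCHAR"},                  # file1.csv
    {"id": "INT32", "name": "VARCHAR", "age": "INT32"},  # file2.csv
    {"id": "FLOAT", "name": "VARCHAR", "age": "INT32"},  # file3.csv
]
print(merge_schemas(files))
# {'id': ('VARCHAR', False), 'name': ('VARCHAR', False), 'age': ('INT32', True)}
```

The unified result is what every file is then read with, which is why incompatible columns degrade to VARCHAR rather than failing.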
## Syntax

```sql
INFER_SCHEMA(
  LOCATION => '{ internalStage | externalStage }'
  [ PATTERN => '<regex_pattern>' ]
  [ FILE_FORMAT => '<format_name>' ]
  [ MAX_RECORDS_PRE_FILE => <number> ]
  [ MAX_FILE_COUNT => <number> ]
)
```

## Parameters

| Parameter | Description | Default | Example |
|-----------|-------------|---------|---------|
| `LOCATION` | Stage location: `@<stage_name>[/<path>]` | Required | `'@my_stage/data/'` |
| `PATTERN` | File name pattern to match | All files | `'*.csv'`, `'*.parquet'` |
| `FILE_FORMAT` | File format name for parsing | Stage's format | `'csv_format'`, `'NDJSON'` |
| `MAX_RECORDS_PRE_FILE` | Maximum records to sample per file | All records | `100`, `1000` |
| `MAX_FILE_COUNT` | Maximum number of files to process | All files | `5`, `10` |
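How `PATTERN` and `MAX_FILE_COUNT` narrow the set of files considered can be sketched as follows. This is an illustrative model, assuming glob-style patterns as used in the examples; the staged file names are made up:

```python
from fnmatch import fnmatch

# Illustrative: PATTERN filters the staged file names, then
# MAX_FILE_COUNT caps how many matches are processed.
def select_files(names, pattern="*", max_file_count=None):
    matched = [n for n in names if fnmatch(n, pattern)]
    return matched if max_file_count is None else matched[:max_file_count]

staged = ["data_0.csv", "data_1.csv", "data_2.parquet", "notes.txt"]
print(select_files(staged, pattern="*.csv"))                    # ['data_0.csv', 'data_1.csv']
print(select_files(staged, pattern="*.csv", max_file_count=1))  # ['data_0.csv']
```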
## Examples

### Parquet Files

```sql
-- Create stage and export data
CREATE STAGE test_parquet;
COPY INTO @test_parquet FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'PARQUET');

-- Infer schema from parquet files using pattern
SELECT * FROM INFER_SCHEMA(
    location => '@test_parquet',
    pattern => '*.parquet'
);
```

Result:

```
+-------------+-----------------+----------+-----------+----------+
| column_name | type            | nullable | filenames | order_id |
+-------------+-----------------+----------+-----------+----------+
| number      | BIGINT UNSIGNED | false    | data_...  | 0        |
+-------------+-----------------+----------+-----------+----------+
```

### CSV Files

```sql
-- Create stage and export CSV data
CREATE STAGE test_csv;
COPY INTO @test_csv FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'CSV');

-- Create a CSV file format
CREATE FILE FORMAT csv_format TYPE = 'CSV';

-- Infer schema using pattern and file format
SELECT * FROM INFER_SCHEMA(
    location => '@test_csv',
    pattern => '*.csv',
    file_format => 'csv_format'
);
```

Result:

```
+-------------+--------+----------+-----------+----------+
| column_name | type   | nullable | filenames | order_id |
+-------------+--------+----------+-----------+----------+
| column_1    | BIGINT | true     | data_...  | 0        |
+-------------+--------+----------+-----------+----------+
```

For CSV files with headers:

```sql
-- Create a CSV file format with header support
CREATE FILE FORMAT csv_headers_format
    TYPE = 'CSV'
    field_delimiter = ','
    skip_header = 1;

-- Export data with headers
CREATE STAGE test_csv_headers;
COPY INTO @test_csv_headers FROM (
    SELECT number as user_id, 'user_' || number::string as user_name
    FROM numbers(5)
) FILE_FORMAT = (TYPE = 'CSV', output_header = true);

-- Infer schema with headers
SELECT * FROM INFER_SCHEMA(
    location => '@test_csv_headers',
    file_format => 'csv_headers_format'
);
```

Limit records for faster inference:

```sql
-- Sample only the first 5 records for schema inference
SELECT * FROM INFER_SCHEMA(
    location => '@test_csv',
    pattern => '*.csv',
    file_format => 'csv_format',
    max_records_pre_file => 5
);
```
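The trade-off behind limiting sampled records can be sketched in Python. This is an illustrative model, not Databend's inference code: inspecting only the first N values is faster, but a value that only appears later in the file can change the inferred type.

```python
# Illustrative: infer a column type from a sample of string values,
# widening BIGINT -> DOUBLE -> VARCHAR as needed.
def infer_cell(value: str) -> str:
    for typ, cast in (("BIGINT", int), ("DOUBLE", float)):
        try:
            cast(value)
            return typ
        except ValueError:
            pass
    return "VARCHAR"

def infer_column(rows, max_records=None):
    # Only the first `max_records` values are inspected, like
    # max_records_pre_file limits sampling per file.
    sample = rows if max_records is None else rows[:max_records]
    types = {infer_cell(v) for v in sample}
    if types <= {"BIGINT"}:
        return "BIGINT"
    if types <= {"BIGINT", "DOUBLE"}:
        return "DOUBLE"
    return "VARCHAR"

rows = ["1", "2", "3", "4.5", "oops"]
print(infer_column(rows, max_records=3))  # 'BIGINT'  (sample misses later values)
print(infer_column(rows))                 # 'VARCHAR' (full scan sees "oops")
```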

### NDJSON Files

```sql
-- Create stage and export NDJSON data
CREATE STAGE test_ndjson;
COPY INTO @test_ndjson FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'NDJSON');

-- Infer schema using pattern and NDJSON format
SELECT * FROM INFER_SCHEMA(
    location => '@test_ndjson',
    pattern => '*.ndjson',
    file_format => 'NDJSON'
);
```

Result:

```
+-------------+--------+----------+-----------+----------+
| column_name | type   | nullable | filenames | order_id |
+-------------+--------+----------+-----------+----------+
| number      | BIGINT | true     | data_...  | 0        |
+-------------+--------+----------+-----------+----------+
```

Limit records for faster inference:

```sql
-- Sample only the first 5 records for schema inference
SELECT * FROM INFER_SCHEMA(
    location => '@test_ndjson',
    pattern => '*.ndjson',
    file_format => 'NDJSON',
    max_records_pre_file => 5
);
```

### Schema Merging with Multiple Files

When files have different schemas, `infer_schema` merges them intelligently:

```sql
-- Suppose you have multiple CSV files with different schemas:
-- file1.csv: id(INT), name(VARCHAR)
-- file2.csv: id(INT), name(VARCHAR), age(INT)
-- file3.csv: id(FLOAT), name(VARCHAR), age(INT)

SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/',
    pattern => '*.csv',
    file_format => 'csv_format'
);
```

The result shows the merged schema:

```
+-------------+---------+----------+-----------+----------+
| column_name | type    | nullable | filenames | order_id |
+-------------+---------+----------+-----------+----------+
| id          | VARCHAR | true     | file1,... | 0        |  -- INT + FLOAT → VARCHAR
| name        | VARCHAR | true     | file1,... | 1        |
| age         | BIGINT  | true     | file1,... | 2        |  -- missing in file1 → nullable
+-------------+---------+----------+-----------+----------+
```

### Pattern Matching and File Limits

Use pattern matching to infer the schema from multiple files:

```sql
-- Infer schema from all CSV files in the directory
SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/',
    pattern => '*.csv'
);
```

Limit the number of files processed to improve performance:

```sql
-- Process only the first 5 matching files
SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/',
    pattern => '*.csv',
    max_file_count => 5
);
```

### Compressed Files

`infer_schema` automatically handles compressed files:

```sql
-- Works with compressed CSV files
SELECT * FROM INFER_SCHEMA(location => '@my_stage/data.csv.zip');

-- Works with compressed NDJSON files
SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/data.ndjson.xz',
    file_format => 'NDJSON',
    max_records_pre_file => 50
);
```

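Conceptually, the compression layer is peeled off the file name to reveal the inner data format. A minimal sketch of that idea, using only the compression extensions listed in the note at the top of this page (`detect_format` and the extension tables are hypothetical helpers, not Databend APIs):

```python
# Illustrative: strip a known compression suffix to find the inner format.
COMPRESSION_EXTS = (".zip", ".xz", ".zst")  # per the compression note above
FORMAT_EXTS = {".parquet": "PARQUET", ".csv": "CSV", ".ndjson": "NDJSON"}

def detect_format(filename: str) -> tuple[str, bool]:
    """Return (format, is_compressed) for a staged file name."""
    compressed = False
    for ext in COMPRESSION_EXTS:
        if filename.endswith(ext):
            filename = filename[: -len(ext)]
            compressed = True
            break
    for ext, fmt in FORMAT_EXTS.items():
        if filename.endswith(ext):
            return fmt, compressed
    return "UNKNOWN", compressed

print(detect_format("data.csv.zip"))    # ('CSV', True)
print(detect_format("data.ndjson.xz"))  # ('NDJSON', True)
```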
### Create Table from Inferred Schema

The `infer_schema` function displays the schema but does not create tables. To create a table from the inferred schema:

```sql
-- Create the table structure from the file schema
CREATE TABLE my_table AS
SELECT * FROM @my_stage/ (pattern => '*.parquet')
LIMIT 0;

-- Verify the table structure
DESC my_table;
```
