---
title: INFER_SCHEMA
---

Automatically detects the schema of staged data files and retrieves the column definitions.

`infer_schema` currently supports the following file formats:

- **Parquet** - Native support for schema inference
- **CSV** - With customizable delimiters and header detection
- **NDJSON** - Newline-delimited JSON files

**Compression Support**: All formats also support compressed files with the extensions `.zip`, `.xz`, and `.zst`.

:::info File Size Limit
Each individual file has a maximum size limit of **100MB** for schema inference.
:::
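
Because the limit applies per file, it can help to check staged file sizes before running inference. A minimal sketch using `LIST` (the stage name `my_stage` is a placeholder):

```sql
-- The size column is in bytes; files over 100MB (104,857,600 bytes)
-- exceed the schema-inference limit
LIST @my_stage;
```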

:::info Schema Merging
When processing multiple files, `infer_schema` automatically merges different schemas:

- **Compatible types** are promoted (e.g., INT8 + INT16 → INT16)
- **Incompatible types** fall back to **VARCHAR** (e.g., INT + FLOAT → VARCHAR)
- **Missing columns** in some files are marked as **nullable**
- **New columns** from later files are added to the final schema

This ensures all files can be read using the unified schema.
:::

## Syntax

```sql
INFER_SCHEMA(
    LOCATION => '{ internalStage | externalStage }'
    [ PATTERN => '<regex_pattern>' ]
    [ FILE_FORMAT => '<format_name>' ]
    [ MAX_RECORDS_PRE_FILE => <number> ]
    [ MAX_FILE_COUNT => <number> ]
)
```

## Parameters

| Parameter | Description | Default | Example |
|-----------|-------------|---------|---------|
| `LOCATION` | Stage location: `@<stage_name>[/<path>]` | Required | `'@my_stage/data/'` |
| `PATTERN` | File name pattern to match | All files | `'*.csv'`, `'*.parquet'` |
| `FILE_FORMAT` | File format name for parsing | Stage's format | `'csv_format'`, `'NDJSON'` |
| `MAX_RECORDS_PRE_FILE` | Max records to sample per file | All records | `100`, `1000` |
| `MAX_FILE_COUNT` | Max number of files to process | All files | `5`, `10` |
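
The optional parameters can be combined in a single call. A sketch using all of them together (the stage and file format names are placeholders):

```sql
-- Combine pattern matching, an explicit format, and sampling limits
SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/data/',
    pattern => '*.csv',
    file_format => 'csv_format',
    max_records_pre_file => 100,
    max_file_count => 10
);
```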

## Examples

### Parquet Files

```sql
-- Create stage and export data
CREATE STAGE test_parquet;
COPY INTO @test_parquet FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'PARQUET');

-- Infer schema from parquet files using pattern
SELECT * FROM INFER_SCHEMA(
    location => '@test_parquet',
    pattern => '*.parquet'
);
```

Result:
```
+-------------+-----------------+----------+-----------+----------+
| column_name | type            | nullable | filenames | order_id |
+-------------+-----------------+----------+-----------+----------+
| number      | BIGINT UNSIGNED | false    | data_...  | 0        |
+-------------+-----------------+----------+-----------+----------+
```

### CSV Files

```sql
-- Create stage and export CSV data
CREATE STAGE test_csv;
COPY INTO @test_csv FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'CSV');

-- Create a CSV file format
CREATE FILE FORMAT csv_format TYPE = 'CSV';

-- Infer schema using pattern and file format
SELECT * FROM INFER_SCHEMA(
    location => '@test_csv',
    pattern => '*.csv',
    file_format => 'csv_format'
);
```

Result:
```
+-------------+---------+----------+-----------+----------+
| column_name | type    | nullable | filenames | order_id |
+-------------+---------+----------+-----------+----------+
| column_1    | BIGINT  | true     | data_...  | 0        |
+-------------+---------+----------+-----------+----------+
```

For CSV files with headers:

```sql
-- Create a CSV file format with header support
CREATE FILE FORMAT csv_headers_format
    TYPE = 'CSV'
    field_delimiter = ','
    skip_header = 1;

-- Export data with headers
CREATE STAGE test_csv_headers;
COPY INTO @test_csv_headers FROM (
    SELECT number AS user_id, 'user_' || number::string AS user_name
    FROM numbers(5)
) FILE_FORMAT = (TYPE = 'CSV', output_header = true);

-- Infer schema with headers
SELECT * FROM INFER_SCHEMA(
    location => '@test_csv_headers',
    file_format => 'csv_headers_format'
);
```

Limit records for faster inference:

```sql
-- Sample only the first 5 records for schema inference
SELECT * FROM INFER_SCHEMA(
    location => '@test_csv',
    pattern => '*.csv',
    file_format => 'csv_format',
    max_records_pre_file => 5
);
```

### NDJSON Files

```sql
-- Create stage and export NDJSON data
CREATE STAGE test_ndjson;
COPY INTO @test_ndjson FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'NDJSON');

-- Infer schema using pattern and NDJSON format
SELECT * FROM INFER_SCHEMA(
    location => '@test_ndjson',
    pattern => '*.ndjson',
    file_format => 'NDJSON'
);
```

Result:
```
+-------------+---------+----------+-----------+----------+
| column_name | type    | nullable | filenames | order_id |
+-------------+---------+----------+-----------+----------+
| number      | BIGINT  | true     | data_...  | 0        |
+-------------+---------+----------+-----------+----------+
```

Limit records for faster inference:

```sql
-- Sample only the first 5 records for schema inference
SELECT * FROM INFER_SCHEMA(
    location => '@test_ndjson',
    pattern => '*.ndjson',
    file_format => 'NDJSON',
    max_records_pre_file => 5
);
```

### Schema Merging with Multiple Files

When files have different schemas, `infer_schema` merges them intelligently:

```sql
-- Suppose you have multiple CSV files with different schemas:
-- file1.csv: id(INT), name(VARCHAR)
-- file2.csv: id(INT), name(VARCHAR), age(INT)
-- file3.csv: id(FLOAT), name(VARCHAR), age(INT)

SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/',
    pattern => '*.csv',
    file_format => 'csv_format'
);
```

Result shows merged schema:
```
+-------------+---------+----------+-----------+----------+
| column_name | type    | nullable | filenames | order_id |
+-------------+---------+----------+-----------+----------+
| id          | VARCHAR | true     | file1,... | 0        |  -- INT + FLOAT → VARCHAR
| name        | VARCHAR | true     | file1,... | 1        |
| age         | BIGINT  | true     | file1,... | 2        |  -- Missing in file1 → nullable
+-------------+---------+----------+-----------+----------+
```

### Pattern Matching and File Limits

Use pattern matching to infer schema from multiple files:

```sql
-- Infer schema from all CSV files in the directory
SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/',
    pattern => '*.csv'
);
```

213
+
214
+ Limit the number of files processed to improve performance:
215
+
216
+ ``` sql
217
+ -- Process only the first 5 matching files
218
+ SELECT * FROM INFER_SCHEMA(
219
+ location => ' @my_stage/' ,
220
+ pattern => ' *.csv' ,
221
+ max_file_count => 5
222
+ );
223
+ ```
### Compressed Files

`infer_schema` automatically handles compressed files:

```sql
-- Works with compressed CSV files
SELECT * FROM INFER_SCHEMA(location => '@my_stage/data.csv.zip');

-- Works with compressed NDJSON files
SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/data.ndjson.xz',
    file_format => 'NDJSON',
    max_records_pre_file => 50
);
```

### Create Table from Inferred Schema

The `infer_schema` function only displays the schema; it does not create tables. To create a table from the inferred schema:

```sql
-- Create table structure from file schema
CREATE TABLE my_table AS
SELECT * FROM @my_stage/ (pattern => '*.parquet')
LIMIT 0;

-- Verify the table structure
DESC my_table;
```
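
Once the table exists, the staged files can be loaded into it. A sketch, assuming the `my_table` and `my_stage` names from the example above:

```sql
-- Load the staged parquet files into the newly created table
COPY INTO my_table FROM @my_stage/ PATTERN = '.*parquet';
```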