docs/_snippets/_service_actions_menu.md (2 changes: 1 addition & 1 deletion)
@@ -1,6 +1,6 @@
import Image from '@theme/IdealImage';
import cloud_service_action_menu from '@site/static/images/_snippets/cloud-service-actions-menu.png';

-Select your service, followed by `Data souces` -> `Predefined sample data`.
+Select your service, followed by `Data sources` -> `Predefined sample data`.

<Image size="md" img={cloud_service_action_menu} alt="ClickHouse Cloud service Actions menu showing Data sources and Predefined sample data options" border />
docs/dictionary/index.md (2 changes: 1 addition & 1 deletion)
@@ -208,7 +208,7 @@ LIMIT 5
FORMAT PrettyCompactMonoBlock

┌───────Id─┬─Title─────────────────────────────────────────────────────────┬─Location──────────────┐
-│ 52296928 │ Comparision between two Strings in ClickHouse │ Spain │
+│ 52296928 │ Comparison between two Strings in ClickHouse │ Spain │
│ 52345137 │ How to use a file to migrate data from mysql to a clickhouse? │ 中国江苏省Nanjing Shi │
│ 61452077 │ How to change PARTITION in clickhouse │ Guangzhou, 广东省中国 │
│ 55608325 │ Clickhouse select last record without max() on all table │ Moscow, Russia │
docs/getting-started/example-datasets/dbpedia.md (2 changes: 1 addition & 1 deletion)
@@ -116,7 +116,7 @@ LIMIT 20
```

Note down the query latency so that we can compare it with the query latency of ANN (using vector index).
-Also record the query latency with cold OS file cache and with `max_theads=1` to recognize the real compute
+Also record the query latency with cold OS file cache and with `max_threads=1` to recognize the real compute
usage and storage bandwidth usage (extrapolate it to a production dataset with millions of vectors!)
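
For reference, a minimal sketch of such a single-threaded re-run (the table and column names, and the stand-in query vector, are assumptions for illustration rather than content from this PR):

```sql
-- Hedged sketch: repeat the brute-force search with one thread to expose the
-- per-core compute cost and storage bandwidth. To approximate a cold OS file
-- cache, clear the host page cache first (e.g. `echo 3 > /proc/sys/vm/drop_caches` as root).
SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, (SELECT vector FROM dbpedia LIMIT 1)) ASC  -- stand-in for the query embedding
LIMIT 20
SETTINGS max_threads = 1;
```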

## Build a vector similarity index {#build-vector-similarity-index}
docs/getting-started/example-datasets/laion.md (4 changes: 2 additions & 2 deletions)
@@ -53,7 +53,7 @@ data = data[['url', 'caption', 'NSFW', 'similarity', "image_embedding", "text_em
data['image_embedding'] = data['image_embedding'].apply(lambda x: x.tolist())
data['text_embedding'] = data['text_embedding'].apply(lambda x: x.tolist())

-# this small hack is needed becase caption sometimes contains all kind of quotes
+# this small hack is needed because caption sometimes contains all kind of quotes
data['caption'] = data['caption'].apply(lambda x: x.replace("'", " ").replace('"', " "))

# export data as CSV file
@@ -132,7 +132,7 @@ For now, we can run the embedding of a random LEGO set picture as `target`.
10 rows in set. Elapsed: 4.605 sec. Processed 100.38 million rows, 309.98 GB (21.80 million rows/s., 67.31 GB/s.)
```

-## Run an approximate vector similarity search with a vector simialrity index {#run-an-approximate-vector-similarity-search-with-a-vector-similarity-index}
+## Run an approximate vector similarity search with a vector similarity index {#run-an-approximate-vector-similarity-search-with-a-vector-similarity-index}

Let's now define two vector similarity indexes on the table.
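
The actual definitions are collapsed in this diff view; purely as an illustration, here is a hedged sketch of what adding such indexes can look like (the table name, index names, distance function, dimension, and the experimental setting are assumptions, and the `vector_similarity` argument list varies between ClickHouse versions):

```sql
-- Hedged sketch, not the definitions from laion.md: one HNSW index per embedding column.
SET allow_experimental_vector_similarity_index = 1;  -- may be required depending on the server version

ALTER TABLE laion ADD INDEX image_idx image_embedding TYPE vector_similarity('hnsw', 'cosineDistance', 512);
ALTER TABLE laion ADD INDEX text_idx  text_embedding  TYPE vector_similarity('hnsw', 'cosineDistance', 512);

-- Build the indexes for data that is already in the table:
ALTER TABLE laion MATERIALIZE INDEX image_idx;
ALTER TABLE laion MATERIALIZE INDEX text_idx;
```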

docs/getting-started/example-datasets/tpcds.md (2 changes: 1 addition & 1 deletion)
@@ -408,7 +408,7 @@ CREATE TABLE store (
s_zip LowCardinality(Nullable(String)),
s_country LowCardinality(Nullable(String)),
s_gmt_offset Nullable(Decimal(7,2)),
-s_tax_precentage Nullable(Decimal(7,2)),
+s_tax_percentage Nullable(Decimal(7,2)),
PRIMARY KEY (s_store_sk)
);

@@ -426,7 +426,7 @@ object NativeSparkWrite extends App {
from pyspark.sql import SparkSession
from pyspark.sql import Row

-# Feel free to use any other packages combination satesfying the compatability martix provided above.
+# Feel free to use any other packages combination satesfying the compatibility matrix provided above.
packages = [
"com.clickhouse.spark:clickhouse-spark-runtime-3.4_2.12:0.8.0",
"com.clickhouse:clickhouse-client:0.7.0",
@@ -461,7 +461,7 @@ df.writeTo("clickhouse.default.example_table").append()
<TabItem value="SparkSQL" label="Spark SQL">

```sql
--- resultTalbe is the Spark intermediate df we want to insert into clickhouse.default.example_table
+-- resultTable is the Spark intermediate df we want to insert into clickhouse.default.example_table
INSERT INTO TABLE clickhouse.default.example_table
SELECT * FROM resultTable;

@@ -42,7 +42,7 @@ If ClickPipes tries to resume replication and the required binlog files have bee

By default, Aurora MySQL purges the binary log as soon as possible (i.e., _lazy purging_). We recommend increasing the binlog retention interval to at least **72 hours** to ensure availability of binary log files for replication under failure scenarios. To set an interval for binary log retention ([`binlog retention hours`](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/mysql-stored-proc-configuring.html#mysql_rds_set_configuration-usage-notes.binlog-retention-hours)), use the [`mysql.rds_set_configuration`](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/mysql-stored-proc-configuring.html#mysql_rds_set_configuration) procedure:

-[//]: # "NOTE Most CDC providers recommend the maximum retention period for Aurora RDS (7 days/168 hours). Since this has an impact on disk usage, we conservatively recommend a mininum of 3 days/72 hours."
+[//]: # "NOTE Most CDC providers recommend the maximum retention period for Aurora RDS (7 days/168 hours). Since this has an impact on disk usage, we conservatively recommend a minimum of 3 days/72 hours."

```text
mysql=> call mysql.rds_set_configuration('binlog retention hours', 72);
@@ -42,7 +42,7 @@ If ClickPipes tries to resume replication and the required binlog files have bee

By default, Amazon RDS purges the binary log as soon as possible (i.e., _lazy purging_). We recommend increasing the binlog retention interval to at least **72 hours** to ensure availability of binary log files for replication under failure scenarios. To set an interval for binary log retention ([`binlog retention hours`](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/mysql-stored-proc-configuring.html#mysql_rds_set_configuration-usage-notes.binlog-retention-hours)), use the [`mysql.rds_set_configuration`](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/mysql-stored-proc-configuring.html#mysql_rds_set_configuration) procedure:

-[//]: # "NOTE Most CDC providers recommend the maximum retention period for RDS (7 days/168 hours). Since this has an impact on disk usage, we conservatively recommend a mininum of 3 days/72 hours."
+[//]: # "NOTE Most CDC providers recommend the maximum retention period for RDS (7 days/168 hours). Since this has an impact on disk usage, we conservatively recommend a minimum of 3 days/72 hours."

```text
mysql=> call mysql.rds_set_configuration('binlog retention hours', 72);
docs/integrations/data-ingestion/s3/index.md (2 changes: 1 addition & 1 deletion)
@@ -1027,7 +1027,7 @@ ClickHouse Keeper is responsible for coordinating the replication of data across

See the [network ports](../../../guides/sre/network-ports.md) list when you configure the security settings in AWS so that your servers can communicate with each other, and you can communicate with them.

-All three servers must listen for network connections so that they can communicate between the servers and with S3. By default, ClickHouse listens ony on the loopback address, so this must be changed. This is configured in `/etc/clickhouse-server/config.d/`. Here is a sample that configures ClickHouse and ClickHouse Keeper to listen on all IP v4 interfaces. see the documentation or the default configuration file `/etc/clickhouse/config.xml` for more information.
+All three servers must listen for network connections so that they can communicate between the servers and with S3. By default, ClickHouse listens only on the loopback address, so this must be changed. This is configured in `/etc/clickhouse-server/config.d/`. Here is a sample that configures ClickHouse and ClickHouse Keeper to listen on all IP v4 interfaces. see the documentation or the default configuration file `/etc/clickhouse/config.xml` for more information.

```xml title="/etc/clickhouse-server/config.d/networking.xml"
<clickhouse>
docs/integrations/index.mdx (4 changes: 2 additions & 2 deletions)
@@ -224,7 +224,7 @@ We are actively compiling this list of ClickHouse integrations below, so it's no
|Google Cloud Storage|<Gcssvg alt="GCS Logo" style={{width: '3rem', 'height': '3rem'}}/>|Data ingestion|Import from, export to, and transform GCS data in flight with ClickHouse built-in `S3` functions.|[Documentation](/integrations/data-ingestion/s3/index.md)|
|Golang|<Golangsvg alt="Golang logo" style={{width: '3rem' }}/>|Language client|The Go client uses the native interface for a performant, low-overhead means of connecting to ClickHouse.|[Documentation](/integrations/language-clients/go/index.md)|
|HDFS|<Hdfssvg alt="HDFS logo" style={{width: '3rem'}}/>|Data ingestion|Provides integration with the [Apache Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop) ecosystem by allowing to manage data on [HDFS](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) via ClickHouse.|[Documentation](/engines/table-engines/integrations/hdfs)|
-|Hive|<Hivesvg alt="Hive logo" style={{width: '3rem'}}/>|Data ingestionn|The Hive engine allows you to perform `SELECT` quries on HDFS Hive table.|[Documentation](/engines/table-engines/integrations/hive)|
+|Hive|<Hivesvg alt="Hive logo" style={{width: '3rem'}}/>|Data ingestionn|The Hive engine allows you to perform `SELECT` queries on HDFS Hive table.|[Documentation](/engines/table-engines/integrations/hive)|
|Hudi|<Image img={hudi} size="logo" alt="Apache Hudi logo"/>|Data ingestion| provides a read-only integration with existing Apache [Hudi](https://hudi.apache.org/) tables in Amazon S3.|[Documentation](/engines/table-engines/integrations/hudi)|
|Iceberg|<Image img={iceberg} size="logo" alt="Apache Iceberg logo"/>|Data ingestion|Provides a read-only integration with existing Apache [Iceberg](https://iceberg.apache.org/) tables in Amazon S3.|[Documentation](/engines/table-engines/integrations/iceberg)|
|Java, JDBC|<Javasvg alt="Java logo" style={{width: '3rem'}}/>|Language client|The Java client and JDBC driver.|[Documentation](/integrations/language-clients/java/index.md)|
@@ -327,7 +327,7 @@ We are actively compiling this list of ClickHouse integrations below, so it's no
|SiSense|<Image img={sisense_logo} size="logo" alt="SiSense logo"/>|Data visualization|Embed analytics into any application or workflow|[Website](https://www.sisense.com/data-connectors/)|
|SigNoz|<Image img={signoz_logo} size="logo" alt="SigNoz logo"/>|Data visualization|Open Source Observability Platform|[Documentation](https://www.signoz.io/docs/architecture/)|
|Snappy Flow|<Image img={snappy_flow_logo} size="logo" alt="Snappy Flow logo"/>|Data management|Collects ClickHouse database metrics via plugin.|[Documentation](https://docs.snappyflow.io/docs/Integrations/clickhouse/instance)|
-|Soda|<Image img={soda_logo} size="logo" alt="Soda logo"/>|Data quality|Soda integration makes it easy for organziations to detect, resolve, and prevent data quality issues by running data quality checks on data before it is loaded into the database.|[Website](https://www.soda.io/integrations/clickhouse)|
+|Soda|<Image img={soda_logo} size="logo" alt="Soda logo"/>|Data quality|Soda integration makes it easy for organizations to detect, resolve, and prevent data quality issues by running data quality checks on data before it is loaded into the database.|[Website](https://www.soda.io/integrations/clickhouse)|
|Splunk|<Image img={splunk_logo} size="logo" alt="Splunk logo"/>|Data integration|Splunk modular input to import to Splunk the ClickHouse Cloud Audit logs.|[Website](https://splunkbase.splunk.com/app/7709),<br/>[Documentation](/integrations/tools/data-integration/splunk/index.md)|
|StreamingFast|<Image img={streamingfast_logo} size="logo" alt="StreamingFast logo"/>|Data ingestion| Blockchain-agnostic, parallelized and streaming-first data engine. |[Website](https://www.streamingfast.io/)|
|Streamkap|<Image img={streamkap_logo} size="logo" alt="Streamkap logo"/>|Data ingestion|Setup real-time CDC (Change Data Capture) streaming to ClickHouse with high throughput in minutes.|[Documentation](https://docs.streamkap.com/docs/clickhouse)|
@@ -77,7 +77,7 @@ Authentication by an access token requires setting access token by calling `setA
.build();
```

-Authentication by a SSL Client Certificate require setting username, enabling SSL Authentication, setting a client sertificate and a client key by calling `setUsername(String)`, `useSSLAuthentication(boolean)`, `setClientCertificate(String)` and `setClientKey(String)` accordingly:
+Authentication by a SSL Client Certificate require setting username, enabling SSL Authentication, setting a client certificate and a client key by calling `setUsername(String)`, `useSSLAuthentication(boolean)`, `setClientCertificate(String)` and `setClientKey(String)` accordingly:
```java showLineNumbers
Client client = new Client.Builder()
.useSSLAuthentication(true)
@@ -150,7 +150,7 @@ Configuration is defined during client creation. See `com.clickhouse.client.api.
| `setServerTimeZone(String timeZone)` | `timeZone` - string value of java valid timezone ID (see `java.time.ZoneId`) | Sets server side timezone. UTC timezone will be used by default. <br/> <br/> Default: `UTC` <br/> Enum: `ClientConfigProperties.SERVER_TIMEZONE` <br/> Key: `server_time_zone` |
| `useAsyncRequests(boolean async)` | `async` - flag that indicates if the option should be enabled. | Sets if client should execute request in a separate thread. Disabled by default because application knows better how to organize multi-threaded tasks and running tasks in separate thread do not help with performance. <br/> <br/> Default: `false` <br/> Enum: `ClientConfigProperties.ASYNC_OPERATIONS` <br/> Key: `async` |
| `setSharedOperationExecutor(ExecutorService executorService)` | `executorService` - instance of executor service. | Sets executor service for operation tasks. <br/> <br/> Default: `none` <br/> Enum: `none` <br/> Key: `none`|
-| `setClientNetworkBufferSize(int size)` | - `size` - size in bytes | Sets size of a buffer in application memory space that is used to copy data back-and-forth between socket and application. Greater reduces system calls to TCP stack, but affects how much memory is spent on every connection. This buffer is also subject for GC because connections are shortlive. Also keep in mind that allocating big continious block of memory might be a problem. <br/> <br/> Default: `300000` <br/> Enum: `ClientConfigProperties.CLIENT_NETWORK_BUFFER_SIZE` <br/> Key: `client_network_buffer_size`|
+| `setClientNetworkBufferSize(int size)` | - `size` - size in bytes | Sets size of a buffer in application memory space that is used to copy data back-and-forth between socket and application. Greater reduces system calls to TCP stack, but affects how much memory is spent on every connection. This buffer is also subject for GC because connections are shortlive. Also keep in mind that allocating big continuous block of memory might be a problem. <br/> <br/> Default: `300000` <br/> Enum: `ClientConfigProperties.CLIENT_NETWORK_BUFFER_SIZE` <br/> Key: `client_network_buffer_size`|
| `retryOnFailures(ClientFaultCause ...causes)` | - `causes` - enum constant of `com.clickhouse.client.api.ClientFaultCause` | Sets recoverable/retriable fault types. <br/> <br/> Default: `NoHttpResponse,ConnectTimeout,ConnectionRequestTimeout` <br/> Enum: `ClientConfigProperties.CLIENT_RETRY_ON_FAILURE` <br/> Key: `client_retry_on_failures` |
| `setMaxRetries(int maxRetries)` | - `maxRetries` - number of retries | Sets maximum number of retries for failures defined by `retryOnFailures(ClientFaultCause ...causes)` <br/> <br/> Default: `3` <br/> Enum: `ClientConfigProperties.RETRY_ON_FAILURE` <br/> Key: `retry` |
| `allowBinaryReaderToReuseBuffers(boolean reuse)` | - `reuse` - flag that indicates if the option should be enabled | Most datasets contain numeric data encoded as small byte sequences. By default reader will allocate required buffer, read data into it and then transform into a target Number class. That may cause significant GC preasure because of many small objects are being allocated and released. If this option is enabled then reader will use preallocated buffers to do numbers transcoding. It is safe because each reader has own set of buffers and readers are used by one thread. |
@@ -349,7 +349,7 @@ try (InputStream dataStream = getDataStream()) {

### insert(String tableName, List&lt;?> data, InsertSettings settings) {#insertstring-tablename-listlt-data-insertsettings-settings}

-Sends a write request to database. The list of objects is converted into an efficient format and then is sent to a server. The class of the list items should be registed up-front using `register(Class, TableSchema)` method.
+Sends a write request to database. The list of objects is converted into an efficient format and then is sent to a server. The class of the list items should be registered up-front using `register(Class, TableSchema)` method.
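
A hedged usage sketch of this call (the endpoint, table name, and row class below are assumptions for illustration, not part of the documented API surface):

```java
import com.clickhouse.client.api.Client;
import com.clickhouse.client.api.insert.InsertSettings;
import java.util.List;

public class InsertExample {
    // Hypothetical row class for an assumed table `events(page String, views Int64)`.
    public static class PageView {
        private final String page;
        private final long views;
        public PageView(String page, long views) { this.page = page; this.views = views; }
        public String getPage() { return page; }
        public long getViews() { return views; }
    }

    public static void main(String[] args) throws Exception {
        Client client = new Client.Builder()
                .addEndpoint("http://localhost:8123")  // assumed endpoint and credentials
                .setUsername("default")
                .setPassword("")
                .build();

        // Register the row class against the table schema once, up front.
        client.register(PageView.class, client.getTableSchema("events"));

        // The list items are converted to an efficient format and sent in a single request.
        List<PageView> rows = List.of(new PageView("home", 1), new PageView("docs", 3));
        client.insert("events", rows, new InsertSettings()).get();
    }
}
```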

**Signatures**
```java