From 672974dcf6ba73d43b344c0c479431132d54a68a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Emil=20Stenstr=C3=B6m?=
Date: Fri, 16 Aug 2024 19:11:04 +0200
Subject: [PATCH] Spelling and small clarifications to README

---
 README.md | 86 +++++++++++++++++++++----------------------------------
 1 file changed, 33 insertions(+), 53 deletions(-)

diff --git a/README.md b/README.md
index 538410a1..6e0a2c92 100644
--- a/README.md
+++ b/README.md
@@ -1,95 +1,82 @@
 # 🚚 Bulker

-Bulker is a tool for streaming and batching large amount of semi-structured data into data warehouses. It uses Kafka internally
+Bulker is a tool for streaming and batching large amounts of semi-structured data into data warehouses. It uses Kafka internally.

-## How it works?
+## How it works

-Send and JSON object to Bulker HTTP endpoint, and it will make sure it will be saved to data warehouse:
+Send a JSON object to the Bulker HTTP endpoint, and it will ensure it is saved to the data warehouse:

- * **JSON flattening**. Your object will be flattened - `{a: {b: 1}}` becomes `{a_b: 1}`
- * **Schema managenent** for **semi-structured** data. For each field, bulker will make sure that a corresponding column exist in destination table. If not, Bulker
-will create it. Type will be best-guessed by value, or it could be explicitely set via type hint as in `{"a": "test", "__sql_type_a": "varchar(4)"}`
- * **Reliability**. Bulker will put the object to Kafka Queue immediately, so if datawarehouse is down, data won't be lost
- * **Streaming** or **Batching**. Bulker will send data to datawarehouse either as soon it become available in Kafka (streaming) or after some time (batching). Most
-data warehouses won't tolerate large number of inserts, that's why we implemented batching
+ * **JSON flattening**. The JSON object will be flattened - `{a: {b: 1}}` becomes `{a_b: 1}`.
+ * **Schema management** for **semi-structured** data. For each field, Bulker will ensure that a corresponding column exists in the destination table. If not, Bulker will create it. The type will be best-guessed by value, or it can be explicitly set via a type hint as in `{"a": "test", "__sql_type_a": "varchar(4)"}`.
+ * **Reliability**. Bulker will place the object in a Kafka queue immediately, so if the data warehouse is down, data won't be lost.
+ * **Streaming** or **Batching**. Bulker will send data to the data warehouse either as soon as it becomes available in Kafka (streaming) or after some time (batching). Most data warehouses won't tolerate a large number of inserts, which is why we implemented batching.
+Bulker is at the 💜 of [Jitsu](https://github.com/jitsucom/jitsu), an open-source data integration platform.

-Bulker is a 💜 of [Jitsu](https://github.com/jitsucom/jitsu), an open-source data integration platform.
+See the full list of features below.

-See full list of features below
-
-
-Bulker is also available as a go library if you want to embed it into your application as opposed to use a HTTP-server
+Bulker is also available as a Go library if you want to embed it into your application instead of using an HTTP server.

 ## Features

-* 🛢️ **Batching** - Bulker sends data in batches in most efficient way for particular database. For example, for Postgres it uses
-COPY command, for BigQuery it uses batch-files
-* 🚿 **Streaming** - alternatively, Bulker can stream data to database. It is useful when number of records is low. Up to 10 records
-per second for most databases
-* 🐫 **Deduplication** - if configured, Bulker will deduplicate records by primary key
-* 📋 **Schema management** - Bulker creates tables and columns on the fly. It also flattens nested JSON-objects. Example if you send `{"a": {"b": 1}}` to
-bulker, it will make sure that there is a column `a_b` in the table (and will create it)
-* 🦾 **Implicit typing** - Bulker infers types of columns from JSON-data.
-* 📌 **Explicit typing** - Explicit types can be by type hints that are placed in JSON. Example: for event `{"a": "test", "__sql_type_a": "varchar(4)"}`
-Bulker will make sure that there is a column `a`, and it's type is `varchar(4)`.
-* 📈 **Horizontal Scaling**. Bulker scales horizontally. Too much data? No problem, just add Bulker instances!
-* 📦 **Dockerized** - Bulker is dockerized and can be deployed to any cloud provider and k8s.
-* ☁️ **Cloud Native** - each Bulker instance is stateless and is configured by only few environment variables.
+* 🛢️ **Batching** - Bulker sends data in batches in the most efficient way for each particular database. For example, for Postgres, it uses the COPY command; for BigQuery, it uses batch files.
+* 🚿 **Streaming** - Alternatively, Bulker can stream data to the database. This is useful when the number of records is low, up to 10 records per second for most databases.
+* 🐫 **Deduplication** - If configured, Bulker will deduplicate records by primary key.
+* 📋 **Schema management** - Bulker creates tables and columns on the fly. It also flattens nested JSON objects. For example, if you send `{"a": {"b": 1}}` to Bulker, it will ensure that there is a column `a_b` in the table (and will create it if needed).
+* 🦾 **Implicit typing** - Bulker infers column types from JSON data.
+* 📌 **Explicit typing** - Explicit types can be set by type hints placed in the JSON. For example, for the event `{"a": "test", "__sql_type_a": "varchar(4)"}`, Bulker will ensure that there is a column `a`, and its type is `varchar(4)`.
+* 📈 **Horizontal Scaling** - Bulker scales horizontally. Too much data? No problem, just add more Bulker instances!
+* 📦 **Dockerized** - Bulker is containerized and can be deployed to any cloud provider and Kubernetes.
+* ☁️ **Cloud Native** - Each Bulker instance is stateless and is configured with only a few environment variables.

 ## Supported databases

 Bulker supports the following databases:
- * ✅ PostgresSQL
- * ✅ Redshit
+ * ✅ PostgreSQL
+ * ✅ Redshift
  * ✅ Snowflake
- * ✅ Clickhouse
+ * ✅ ClickHouse
  * ✅ BigQuery
  * ✅ MySQL
  * ✅ S3
  * ✅ GCS
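+
+To make the flattening and typing rules from the Features section concrete, here is a minimal, self-contained Go sketch of the behavior described above. It is an illustration only - the function names and logic here are not Bulker's actual implementation:
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"strings"
+)
+
+// flatten collapses nested objects into underscore-joined keys,
+// e.g. {"a": {"b": 1}} becomes {"a_b": 1}, mirroring the rule in the
+// README. Illustrative sketch, not Bulker's code.
+func flatten(prefix string, in map[string]any, out map[string]any) {
+	for k, v := range in {
+		key := k
+		if prefix != "" {
+			key = prefix + "_" + k
+		}
+		if nested, ok := v.(map[string]any); ok {
+			flatten(key, nested, out)
+		} else {
+			out[key] = v
+		}
+	}
+}
+
+func main() {
+	raw := `{"a": {"b": 1}, "c": "test", "__sql_type_c": "varchar(4)"}`
+	var event map[string]any
+	if err := json.Unmarshal([]byte(raw), &event); err != nil {
+		panic(err)
+	}
+
+	flat := map[string]any{}
+	flatten("", event, flat)
+
+	// Columns with a matching __sql_type_<name> hint get an explicit type;
+	// everything else would be implicitly typed from the value.
+	for col, val := range flat {
+		if strings.HasPrefix(col, "__sql_type_") {
+			continue
+		}
+		if hint, ok := flat["__sql_type_"+col]; ok {
+			fmt.Printf("column %q -> explicit type %v\n", col, hint)
+		} else {
+			fmt.Printf("column %q -> type inferred from value %v\n", col, val)
+		}
+	}
+}
+```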
-Please see [Compatibility Matrix](.docs/db-feature-matrix.md) to learn what Bulker features are supported by each database.
-
+Please see the [Compatibility Matrix](./.docs/db-feature-matrix.md) to learn which Bulker features are supported by each database.

 ## Documentation Links

 > **Note**
-> We highly recommend to read [Core Concepts](#core-concepts) below before diving into details
+> We highly recommend reading the [Core Concepts](#core-concepts) below before diving into the details.

 * [How to use Bulker as HTTP Service](./.docs/server-config.md)
 * [Server Configuration](./.docs/server-config.md)
 * [HTTP API](./.docs/http-api.md)
-* How to use bulker as Go-lib *(coming soon)*
+* How to use Bulker as a Go library *(coming soon)*

 ## Core Concepts

 ### Destinations

-Bulker operates with destinations. Destination is database or
-storage service (e.g. S3, GCS). Each destination has an ID and configuration
-which is represented by JSON object.
+Bulker operates with destinations. A destination is a database or storage service (e.g., S3, GCS). Each destination has an ID and configuration represented by a JSON object.

-Bulker exposes HTTP API to load data into destinations, where those
-destinations are referenced by their IDs.
+Bulker exposes an HTTP API to load data into destinations, where those destinations are referenced by their IDs.

-If destination is a database, you'll need to provide a destination table name.
+If the destination is a database, you'll need to provide a destination table name.

 ### Event

-The main unit of data in Bulker is an *event*. Event is a represented JSON-object
+The main unit of data in Bulker is an *event*. An event is represented as a JSON object.

 ### Batching and Streaming (aka Destination Mode)

-Bulker can send data to database in two ways:
- * **Streaming**. Bulker sends evens to destinaion one by one. It is useful when number of events is low (less than 10 events per second for most DBs).
- * **Batching**. Bulker accumulates events in batches and sends them periodically once batch is full or timeout is reached. Batching is more efficient for large amounts of events. Especially for cloud data-warehouses
-(e.g. Postgres, Clickhouse, BigQuery).
+Bulker can send data to databases in two ways:
+ * **Streaming**. Bulker sends events to the destination one by one. This is useful when the number of events is low (less than 10 events per second for most databases).
+ * **Batching**. Bulker accumulates events in batches and sends them periodically once the batch is full or a timeout is reached. Batching is more efficient for large amounts of events, especially for cloud data warehouses (e.g., PostgreSQL, ClickHouse, BigQuery).
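+
+As a rough sketch of the batch mode described above - flush when the batch is full or when a timeout fires - consider the following Go snippet. The batch size, interval, and function names are illustrative assumptions, not Bulker's defaults or API:
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// batcher accumulates events and flushes them either when maxSize is
+// reached or when flushEvery elapses - the trade-off batch mode makes.
+// Illustrative only; Bulker's real implementation reads from Kafka.
+func batcher(events <-chan string, maxSize int, flushEvery time.Duration, flush func([]string)) {
+	batch := make([]string, 0, maxSize)
+	ticker := time.NewTicker(flushEvery)
+	defer ticker.Stop()
+	for {
+		select {
+		case ev, ok := <-events:
+			if !ok {
+				if len(batch) > 0 {
+					flush(batch) // flush whatever is left on shutdown
+				}
+				return
+			}
+			batch = append(batch, ev)
+			if len(batch) >= maxSize {
+				flush(batch)
+				batch = batch[:0]
+			}
+		case <-ticker.C:
+			if len(batch) > 0 {
+				flush(batch)
+				batch = batch[:0]
+			}
+		}
+	}
+}
+
+func main() {
+	events := make(chan string)
+	go func() {
+		for i := 0; i < 5; i++ {
+			events <- fmt.Sprintf(`{"id": %d}`, i)
+		}
+		close(events)
+	}()
+	batcher(events, 2, 100*time.Millisecond, func(b []string) {
+		fmt.Println("flushing batch:", b)
+	})
+}
+```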

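+
+As a rough illustration of the HTTP flow from the Destinations section above, the snippet below posts a single event for a destination (referenced by its ID) into a destination table. The URL, port, query parameter, and auth header here are placeholders, not Bulker's documented API - see the [HTTP API](./.docs/http-api.md) doc for the real endpoint shape:
+
+```go
+package main
+
+import (
+	"bytes"
+	"fmt"
+	"net/http"
+)
+
+func main() {
+	// Placeholder endpoint: the destination ID and table name are made up
+	// for this example; the exact path and parameter names are assumptions.
+	url := "http://localhost:3042/post/my_postgres?tableName=events"
+
+	// One event: a nested object that will be flattened, plus a type hint.
+	event := []byte(`{"a": {"b": 1}, "c": "test", "__sql_type_c": "varchar(4)"}`)
+
+	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(event))
+	if err != nil {
+		panic(err)
+	}
+	req.Header.Set("Content-Type", "application/json")
+	// Authentication is typically required; this header is a placeholder.
+	req.Header.Set("Authorization", "Bearer <token>")
+
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		panic(err)
+	}
+	defer resp.Body.Close()
+	fmt.Println("status:", resp.Status)
+}
+```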
@@ -97,13 +84,6 @@ Bulker can send data to database in two ways:

 ### Primary Keys and Deduplication

-Optionally, Bulker can deduplicate events by primary key. It is useful when you same event can be sent to Bulker multiple times.
-If available, Bulker uses primary keys, but for some data warehouses alternative strategies are used.
+Optionally, Bulker can deduplicate events by primary key. This is useful when the same event can be sent to Bulker multiple times. If available, Bulker uses primary keys, but for some data warehouses, alternative strategies are used.

 >[Read more about deduplication »](./.docs/db-feature-matrix.md)
-
-
-
-
-
-
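+
+To illustrate what deduplication by primary key means in practice, here is a minimal Go sketch that keeps only the latest event per key. It is a conceptual illustration, not Bulker's implementation or any of the warehouse-specific strategies mentioned above:
+
+```go
+package main
+
+import "fmt"
+
+// dedupeByPrimaryKey keeps the last event seen for each primary key value,
+// preserving the order in which keys first appeared. Conceptual sketch only.
+func dedupeByPrimaryKey(events []map[string]any, pk string) []map[string]any {
+	latest := map[any]map[string]any{}
+	var order []any
+	for _, ev := range events {
+		key := ev[pk]
+		if _, seen := latest[key]; !seen {
+			order = append(order, key)
+		}
+		latest[key] = ev
+	}
+	out := make([]map[string]any, 0, len(order))
+	for _, key := range order {
+		out = append(out, latest[key])
+	}
+	return out
+}
+
+func main() {
+	events := []map[string]any{
+		{"id": 1, "value": "first"},
+		{"id": 2, "value": "second"},
+		{"id": 1, "value": "updated"}, // same primary key sent twice
+	}
+	for _, ev := range dedupeByPrimaryKey(events, "id") {
+		fmt.Println(ev)
+	}
+}
+```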