Update README document formatting (#114)
Showing 2 changed files with 82 additions and 58 deletions.
@@ -96,6 +96,9 @@ This part of the readme is also covered by autotests, so these code are always u

In any unclear situation, look into it first.

<details>
<summary>CLICK HERE to see the most complete description of ALL features!</summary>

<!-- full-yml -->
```yml
# It's a complete example of the CSV schema file in YAML format.

@@ -649,6 +652,7 @@ columns:
```
<!-- /full-yml -->

</details>

### Extra checks

@@ -907,14 +911,13 @@ Optional format `text` with highlighted keywords:

Of course, you'll want to know how fast it works. The thing is, it depends very much on the following factors:

* **The file size** - Width and height of the CSV file. The larger the dataset, the longer it will take to go through
  it. The dependence is linear and strongly depends on the speed of your hardware (CPU, SSD).
* **Number of rules used** - Obviously, the more of them there are for one column, the more iterations you will have to
  make. Also remember that they do not depend on each other, i.e. the execution of one rule will not optimize or slow
  down another rule in any way. In fact, their time and memory costs simply add up (see the rough estimate sketch after
  this list).
* Some validation rules are very time or memory intensive. For the most part you won't notice this, but there are some
  that are dramatically slow. For example, `interquartile_mean` processes about 4k lines per second, while the rest of
  the rules are about 30+ million lines per second.

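Put differently, the total time is roughly one full pass over the file per rule, and those passes just add up. The sketch below only illustrates that arithmetic; the throughput numbers and the placeholder rule names `rule_a`/`rule_b` are invented for the example, except for the `interquartile_mean` figure quoted above.

```php
<?php
// Back-of-the-envelope estimate: every rule scans the whole column independently,
// so the costs just add up. Throughputs are assumed placeholders, not benchmarks.
$lines = 2_000_000;

$assumedThroughput = [          // lines per second per rule (illustrative only)
    'rule_a'             => 900_000,
    'rule_b'             => 700_000,
    'interquartile_mean' => 4_000, // the slow outlier mentioned above
];

$totalSeconds = 0.0;
foreach ($assumedThroughput as $rule => $linesPerSecond) {
    $totalSeconds += $lines / $linesPerSecond; // one extra pass for each rule
}

// ~2.2 + ~2.9 + ~500 seconds: the slowest rule dominates the total.
echo 'Estimated time: ' . round($totalSeconds, 1) . " sec\n";
```
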
However, to get a rough picture, you can check out the table below.

@@ -927,7 +930,7 @@ However, to get a rough picture, you can check out the table below.

* Software: Latest Ubuntu + Docker.
  Also [see details about GA hardware](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-private-repositories).
* The main metric is the number of lines per second. Please note that the table is in thousands of lines per second
  (`100K` = `100,000 lines per second`).
* An additional metric is the peak RAM consumption over the entire time of the test case.

Since usage profiles can vary, I've prepared a few profiles to cover most cases.

@@ -937,17 +940,20 @@ Since usage profiles can vary, I've prepared a few profiles to cover most cases.

* **[Minimum](tests/Benchmarks/bench_1_mini_combo.yml)** - Normal rules with average performance, but 2 of each.
* **[Realistic](tests/Benchmarks/bench_2_realistic_combo.yml)** - A mix of rules that are most likely to be used in real
  life.
* **[All aggregations](tests/Benchmarks/bench_3_all_agg.yml)** - All aggregation rules at once. This is the
  worst-case scenario.

Also, there is an additional division into:

* `Cell rules` - only rules applicable for each row/cell.
* `Agg rules` - only rules applicable for the whole column.
* `Cell + Agg` - a simultaneous combination of the previous two.
* `Peak Memory` - the maximum memory consumption during the test case.

**Important note:** The `Peak Memory` value is only relevant for the aggregation cases. If you don't use aggregations,
the peak memory usage will always be no more than 2-4 megabytes, and it doesn't depend on the number of rules or the
size of the CSV file. No memory leaks!

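That flat memory profile is what you would expect from row-by-row streaming. The snippet below is an assumption-laden illustration (plain PHP, hypothetical file name, toy check), not the validator's actual code: each row is read, checked and discarded, so memory stays constant no matter how large the file is.

```php
<?php
// Minimal streaming sketch (not the actual validator): memory stays flat because
// only one row is held at a time; nothing is accumulated unless you aggregate.
$handle = fopen('huge.csv', 'rb');          // hypothetical input file
$header = fgetcsv($handle);                 // skip the header row

$errors = 0;
while (($row = fgetcsv($handle)) !== false) {
    foreach ($row as $cell) {
        if ($cell === '') {                 // a toy "not empty" cell check
            $errors++;
        }
    }
    // $row is overwritten on the next iteration, so peak memory does not grow
    // with the file size.
}
fclose($handle);

echo "Errors: {$errors}, peak: " . round(memory_get_peak_usage(true) / 1024 / 1024) . " MB\n";
```
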
<!-- benchmark-table -->
<table>
@@ -960,91 +966,91 @@ Also, there is an additional division into
<td align="left"><b>All aggregations</b></td>
</tr>
<tr>
<td>Columns: 1<br>Size: ~8 MB<br><br><br></td>
<td>Cell rules<br>Agg rules<br>Cell + Agg<br>Peak Memory</td>
<td align="right">
586K,  3.4 sec<br>
802K,  2.5 sec<br>
474K,  4.2 sec<br>
52 MB
</td>
<td align="right">
320K,  6.3 sec<br>
755K,  2.6 sec<br>
274K,  7.3 sec<br>
68 MB
</td>
<td align="right">
171K, 11.7 sec<br>
532K,  3.8 sec<br>
142K, 14.1 sec<br>
208 MB
</td>
<td align="right">
794K,  2.5 sec<br>
142K, 14.1 sec<br>
121K, 16.5 sec<br>
272 MB
</td>
</tr>
<tr>
<td>Columns: 5<br>Size: 64 MB<br><br><br></td>
<td>Cell rules<br>Agg rules<br>Cell + Agg<br>Peak Memory</td>
<td align="right">
443K,  4.5 sec<br>
559K,  3.6 sec<br>
375K,  5.3 sec<br>
52 MB
</td>
<td align="right">
274K,  7.3 sec<br>
526K,  3.8 sec<br>
239K,  8.4 sec<br>
68 MB
</td>
<td align="right">
156K, 12.8 sec<br>
406K,  4.9 sec<br>
131K, 15.3 sec<br>
208 MB
</td>
<td align="right">
553K,  3.6 sec<br>
139K, 14.4 sec<br>
111K, 18.0 sec<br>
272 MB
</td>
</tr>
<tr>
<td>Columns: 10<br>Size: 220 MB<br><br><br></td>
<td>Cell rules<br>Agg rules<br>Cell + Agg<br>Peak Memory</td>
<td align="right">
276K,  7.2 sec<br>
314K,  6.4 sec<br>
247K,  8.1 sec<br>
52 MB
</td>
<td align="right">
197K, 10.2 sec<br>
308K,  6.5 sec<br>
178K, 11.2 sec<br>
68 MB
</td>
<td align="right">
129K, 15.5 sec<br>
262K,  7.6 sec<br>
111K, 18.0 sec<br>
208 MB
</td>
<td align="right">
311K,  6.4 sec<br>
142K, 14.1 sec<br>
97K, 20.6 sec<br>
272 MB
</td>
</tr>
<tr>
<td>Columns: 20<br>Size: 1.2 GB<br><br><br></td>
<td>Cell rules<br>Agg rules<br>Cell + Agg<br>Peak Memory</td>
<td align="right">
102K, 19.6 sec<br>
@@ -1065,7 +1071,7 @@ Also, there is an additional division into
208 MB
</td>
<td align="right">
105K, 19.0 sec<br>
144K, 13.9 sec<br>
61K, 32.8 sec<br>
272 MB
</td>
@@ -1074,28 +1080,39 @@ Also, there is an additional division into
</table>
<!-- /benchmark-table -->

### Brief conclusions

* Cell rules are very CPU demanding, but use almost no RAM (always about 1-2 MB at peak).
  The more of them there are, the longer it will take to validate a column, as they are additional actions per(!) value.

* Aggregation rules work lightning fast (from 10 million to billions of rows per second), but require a lot of RAM.
  On the other hand, if you add 100+ different aggregation rules, the amount of memory consumed will not increase too
  much.

* Unfortunately, not all PHP array functions can work by reference (`&$var`).
  This is a very individual thing that depends on the algorithm.
  So if a dataset in a column is 20 MB, sometimes it is copied and the peak value becomes 40 MB (this is just an example).
  That's why reference optimization doesn't work most of the time; see the sketch right after this list.

* In fact, if you are willing to wait 30-60 seconds for a 1 GB file, and you have 200-500 MB of RAM,
  I don't see the point in thinking about it at all.

* No memory leaks have been detected.

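To see what that copying means in practice, here is a tiny standalone sketch of PHP's copy-on-write behaviour (illustration only, not code from this project): writing to a by-value array parameter forces a full copy of the array's structure, so peak memory briefly grows by roughly the array's size again.

```php
<?php
// Standalone illustration of copy-on-write (not project code).
// A "column" is modeled as a large array of values.
$column = array_fill(0, 500_000, 12345);

function sortedCopy(array $values): array
{
    sort($values);   // writing to the by-value parameter triggers a real copy
    return $values;
}

$before = memory_get_peak_usage(true);
$sorted = sortedCopy($column);
$peak   = memory_get_peak_usage(true);

// While sort() runs, the original and the copy exist at the same time,
// so the peak grows by roughly the size of the array structure.
// Passing by reference (&$values) avoids the copy, but not every
// algorithm or built-in array function can work that way.
echo 'Extra peak memory: ' . round(($peak - $before) / 1024 / 1024, 1) . " MB\n";
```
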
Btw, if you run the same tests on a MacBook 14" M2 Max 2023, the results are ~2 times better. On a MacBook 2019 Intel
2.4 GHz they are about the same as on GitHub Actions. So I think the table can be considered an average (but far from
the best) hardware of a regular engineer.

### Examples of CSV files

Below you will find examples of CSV files that were used for the benchmarks. They were created
with [PHP Faker](tests/Benchmarks/Commands/CreateCsv.php) (the first 2000 lines) and then
copied [1000 times into themselves](tests/Benchmarks/create-csv.sh), so really huge random files can be created in
seconds.

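The real generator is the linked shell script; the sketch below (hypothetical file names, plain PHP instead of the actual script) only shows the trick itself: write a small random seed once, then keep appending it to itself, which is far cheaper than generating millions of fresh random rows.

```php
<?php
// Illustration of the "copy the seed into itself" idea.
// Hypothetical paths; the real logic lives in tests/Benchmarks/create-csv.sh.
$seed   = __DIR__ . '/seed_2000_lines.csv'; // header + ~2000 random rows
$target = __DIR__ . '/huge.csv';

$rows   = file($seed);                      // read the seed once
$header = array_shift($rows);               // keep the header only once
$chunk  = implode('', $rows);

file_put_contents($target, $header);
for ($i = 0; $i < 1000; $i++) {
    file_put_contents($target, $chunk, FILE_APPEND); // ~2M rows in a few seconds
}
```
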
The basic principle is that the more columns there are, the longer the values in them, i.e. something like exponential
growth.

<details>
<summary>Columns: 1, Size: 8.48 MB</summary>
@@ -1110,7 +1127,7 @@ id

<details>
<summary>Columns: 5, Size: 64 MB</summary>

```csv
id,bool_int,bool_str,number,float
@@ -1122,7 +1139,7 @@ id,bool_int,bool_str,number,float

<details>
<summary>Columns: 10, Size: 220 MB</summary>

```csv
id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4
@@ -1134,7 +1151,7 @@ id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4

<details>
<summary>Columns: 20, Size: 1.2 GB</summary>

```csv
id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4,uuid,address,postcode,latitude,longitude,ip6,sentence_tiny,sentence_small,sentence_medium,sentence_huge
@@ -1144,7 +1161,7 @@ id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4,uuid,address,po

</details>

### Run benchmark locally

Make sure you have PHP 8.1+ and Docker installed.

@@ -1274,7 +1291,7 @@ I'm not sure if I will implement all of them. But I will try to do my best.

## Contributing

If you have any ideas or suggestions, feel free to open an issue or create a pull request.

```shell
# Fork the repo and build the project
git clone git@github.com:jbzoo/csv-blueprint.git ./jbzoo-csv-blueprint
cd ./jbzoo-csv-blueprint