Update README document formatting (#114)
SmetDenis authored Mar 30, 2024
1 parent d35883a commit 22b4951
Showing 2 changed files with 82 additions and 58 deletions.
121 changes: 69 additions & 52 deletions README.md
@@ -96,6 +96,9 @@ This part of the readme is also covered by autotests, so these code are always u

In any unclear situation, look into it first.

<details>
<summary>CLICK HERE to see the most complete description of ALL features!</summary>

<!-- full-yml -->
```yml
# It's a complete example of the CSV schema file in YAML format.
@@ -649,6 +652,7 @@ columns:
```
<!-- /full-yml -->

</details>

### Extra checks

@@ -907,14 +911,13 @@ Optional format `text` with highlited keywords:
Of course, you'll want to know how fast it works. The thing is, it depends very-very-very much on the following factors:

* **The file size** - Width and height of the CSV file. The larger the dataset, the longer it will take to go through
it.
The dependence is linear and strongly depends on the speed of your hardware (CPU, SSD).
it. The dependence is linear and strongly depends on the speed of your hardware (CPU, SSD).
* **Number of rules used** - Obviously, the more of them there are for one column, the more iterations you will have to
make.
Also remember that they do not depend on each other.
make. Also remember that they do not depend on each other. I.e. execution of one rule will not optimize or slow down
another rule in any way. In fact, it will be just summing up time and memory resources.
* Some validation rules are very time or memory intensive. For the most part you won't notice this, but there are some
that are dramatically slow. For example, `interquartile_mean` processes about 4k lines per second, while the rest of
the rules are about 0.3-1 million lines per second.
the rules are about 30+ millions lines per second.

However, to get a rough picture, you can check out the table below.

@@ -927,7 +930,7 @@ However, to get a rough picture, you can check out the table below.
* Software: Latest Ubuntu + Docker.
Also [see detail about GA hardware](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-private-repositories).
* The main metric is the number of lines per second. Please note that the table is thousands of lines per second
(`100 K` = `100,000 lines per second`).
(`100K` = `100,000 lines per second`).
* An additional metric is the peak RAM consumption over the entire time of the test case.

Since usage profiles can vary, I've prepared a few profiles to cover most cases.
@@ -937,17 +940,20 @@ Since usage profiles can vary, I've prepared a few profiles to cover most cases.
* **[Minimum](tests/Benchmarks/bench_1_mini_combo.yml)** - Normal rules with average performance, but 2 of each.
* **[Realistic](tests/Benchmarks/bench_2_realistic_combo.yml)** - A mix of rules that are most likely to be used in real
life.
* **[All aggregations at once](tests/Benchmarks/bench_3_all_agg.yml)** - All aggregation rules at once. This is the
* **[All aggregations](tests/Benchmarks/bench_3_all_agg.yml)** - All aggregation rules at once. This is the
worst-case scenario.

Also, there is an additional division into

* `Cell rules` - only rules applicable for each row/cell, 1000 lines per second.
* `Agg rules` - only rules applicable for the whole column, 1000 lines per second.
* `Cell + Agg` - a simultaneous combination of the previous two, 1000 lines per second.
* `Peak Memory` - the maximum memory consumption during the test case, megabytes. **Important note:** This value is
only for the aggregation case. Since if you don't have aggregations, the peak memory usage will always be
no more than a couple megabytes.
* `Cell rules` - only rules applicable for each row/cell.
* `Agg rules` - only rules applicable for the whole column.
* `Cell + Agg` - a simultaneous combination of the previous two.
* `Peak Memory` - the maximum memory consumption during the test case.

**Important note:** `Peak Memory` value is only for the aggregation case. Since if you don't have aggregations,
the peak memory usage will always be no more than 2-4 megabytes. No memory leaks!
It doesn't depend on the number of rules or the size of CSV file.


<!-- benchmark-table -->
<table>
@@ -960,91 +966,91 @@ Also, there is an additional division into
<td align="left"><b>All&nbspaggregations</b></td>
</tr>
<tr>
<td>Columns:&nbsp1<br>Size:&nbsp8.48&nbspMB<br><br><br></td>
<td>Columns:&nbsp1<br>Size:&nbsp~8&nbspMB<br><br><br></td>
<td>Cell&nbsprules<br>Agg&nbsprules<br>Cell&nbsp+&nbspAgg<br>Peak&nbspMemory</td>
<td align="right">
586K,&nbsp3.4&nbspsec<br>
802K,&nbsp2.5&nbspsec<br>
474K,&nbsp4.2&nbspsec<br>
586K,&nbsp&nbsp3.4&nbspsec<br>
802K,&nbsp&nbsp2.5&nbspsec<br>
474K,&nbsp&nbsp4.2&nbspsec<br>
52 MB
</td>
<td align="right">
320K,&nbsp6.3&nbspsec<br>
755K,&nbsp2.6&nbspsec<br>
274K,&nbsp7.3&nbspsec<br>
320K,&nbsp&nbsp6.3&nbspsec<br>
755K,&nbsp&nbsp2.6&nbspsec<br>
274K,&nbsp&nbsp7.3&nbspsec<br>
68 MB
</td>
<td align="right">
171K,&nbsp11.7&nbspsec<br>
532K,&nbsp3.8&nbspsec<br>
532K,&nbsp&nbsp3.8&nbspsec<br>
142K,&nbsp14.1&nbspsec<br>
208 MB
</td>
<td align="right">
794K,&nbsp2.5&nbspsec<br>
794K,&nbsp&nbsp2.5&nbspsec<br>
142K,&nbsp14.1&nbspsec<br>
121K,&nbsp16.5&nbspsec<br>
272 MB
</td>
</tr>
<tr>
<td>Columns:&nbsp5<br>Size:&nbsp64.04&nbspMB<br><br><br></td>
<td>Columns:&nbsp5<br>Size:&nbsp64&nbspMB<br><br><br></td>
<td>Cell&nbsprules<br>Agg&nbsprules<br>Cell&nbsp+&nbspAgg<br>Peak&nbspMemory</td>
<td align="right">
443K,&nbsp4.5&nbspsec<br>
559K,&nbsp3.6&nbspsec<br>
375K,&nbsp5.3&nbspsec<br>
443K,&nbsp&nbsp4.5&nbspsec<br>
559K,&nbsp&nbsp3.6&nbspsec<br>
375K,&nbsp&nbsp5.3&nbspsec<br>
52 MB
</td>
<td align="right">
274K,&nbsp7.3&nbspsec<br>
526K,&nbsp3.8&nbspsec<br>
239K,&nbsp8.4&nbspsec<br>
274K,&nbsp&nbsp7.3&nbspsec<br>
526K,&nbsp&nbsp3.8&nbspsec<br>
239K,&nbsp&nbsp8.4&nbspsec<br>
68 MB
</td>
<td align="right">
156K,&nbsp12.8&nbspsec<br>
406K,&nbsp4.9&nbspsec<br>
406K,&nbsp&nbsp4.9&nbspsec<br>
131K,&nbsp15.3&nbspsec<br>
208 MB
</td>
<td align="right">
553K,&nbsp3.6&nbspsec<br>
553K,&nbsp&nbsp3.6&nbspsec<br>
139K,&nbsp14.4&nbspsec<br>
111K,&nbsp18&nbspsec<br>
111K,&nbsp18.0&nbspsec<br>
272 MB
</td>
</tr>
<tr>
<td>Columns:&nbsp10<br>Size:&nbsp220.02&nbspMB<br><br><br></td>
<td>Columns:&nbsp10<br>Size:&nbsp220&nbspMB<br><br><br></td>
<td>Cell&nbsprules<br>Agg&nbsprules<br>Cell&nbsp+&nbspAgg<br>Peak&nbspMemory</td>
<td align="right">
276K,&nbsp7.2&nbspsec<br>
314K,&nbsp6.4&nbspsec<br>
247K,&nbsp8.1&nbspsec<br>
276K,&nbsp&nbsp7.2&nbspsec<br>
314K,&nbsp&nbsp6.4&nbspsec<br>
247K,&nbsp&nbsp8.1&nbspsec<br>
52 MB
</td>
<td align="right">
197K,&nbsp10.2&nbspsec<br>
308K,&nbsp6.5&nbspsec<br>
308K,&nbsp&nbsp6.5&nbspsec<br>
178K,&nbsp11.2&nbspsec<br>
68 MB
</td>
<td align="right">
129K,&nbsp15.5&nbspsec<br>
262K,&nbsp7.6&nbspsec<br>
111K,&nbsp18&nbspsec<br>
262K,&nbsp&nbsp7.6&nbspsec<br>
111K,&nbsp18.0&nbspsec<br>
208 MB
</td>
<td align="right">
311K,&nbsp6.4&nbspsec<br>
311K,&nbsp&nbsp6.4&nbspsec<br>
142K,&nbsp14.1&nbspsec<br>
97K,&nbsp20.6&nbspsec<br>
272 MB
</td>
</tr>
<tr>
<td>Columns:&nbsp20<br>Size:&nbsp1.18&nbspGB<br><br><br></td>
<td>Columns:&nbsp20<br>Size:&nbsp1.2&nbspGB<br><br><br></td>
<td>Cell&nbsprules<br>Agg&nbsprules<br>Cell&nbsp+&nbspAgg<br>Peak&nbspMemory</td>
<td align="right">
102K,&nbsp19.6&nbspsec<br>
@@ -1065,7 +1071,7 @@ Also, there is an additional division into
208 MB
</td>
<td align="right">
105K,&nbsp19&nbspsec<br>
105K,&nbsp19.0&nbspsec<br>
144K,&nbsp13.9&nbspsec<br>
61K,&nbsp32.8&nbspsec<br>
272 MB
@@ -1074,28 +1080,39 @@ Also, there is an additional division into
</table>
<!-- /benchmark-table -->

Btw, if you run the same tests on a MacBook 14" M2 Max 2023, the results are ~2 times better. On MacBook 2019 Intel
2.4Gz about the same as on GitHub Actions. So I think the table can be considered an average (but too far from the best)
hardware at the regular engineer.

### Brief conclusions

* Cell rules are very CPU demanding, but use almost no RAM (always about 1-2 MB at peak).
The more of them there are, the longer it will take to validate a column, as they are additional actions per(!) value.

* Aggregation rules - work lightning fast (from 10 millions to billions of rows per second), but require a lot of RAM.
On the other hand, if you add 20 different aggregation rules, the amount of memory consumed will not increase.
On the other hand, if you add 100+ different aggregation rules, the amount of memory consumed will not increase too
much.

* Unfortunately, not all PHP array functions can work by reference (`&$var`).
This is a very individual thing that depends on the algorithm.
So if a dataset in a column is 20 MB sometimes it is copied and the peak value becomes 40 (this is just an example).
That's why link optimization doesn't work most of the time.

* In fact, if you are willing to wait 30-60 seconds for a 1 GB file, and you have 200-500 MB of RAM,
I don't see the point in thinking about it at all.

* No memory leaks have been detected.
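
The point above about by-reference array functions can be illustrated with a small sketch. It is not taken from the project: the column contents and the choice of `sort()` versus `array_unique()` are assumptions picked only to contrast an in-place function with one that returns a new array.

```php
<?php

declare(strict_types=1);

// Hypothetical column of values, standing in for one CSV column held in memory.
$column = \range(1, 200_000);

// sort() receives the array by reference and reorders it in place.
$before = \memory_get_usage();
\sort($column);
$inPlace = \memory_get_usage() - $before;

// array_unique() returns a new array, so the data is held twice while both variables exist.
$before = \memory_get_usage();
$unique = \array_unique($column);
$copied = \memory_get_usage() - $before;

echo "sort() delta: {$inPlace} bytes, array_unique() delta: {$copied} bytes\n";
```

The exact numbers depend on the PHP version and build, so only the difference between the two deltas is meaningful here.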

Btw, if you run the same tests on a MacBook 14" M2 Max 2023, the results are ~2 times better. On MacBook 2019 Intel
2.4Gz about the same as on GitHub Actions. So I think the table can be considered an average (but too far from the best)
hardware at the regular engineer.

### Examples of CSV files

Below you will find examples of CSV files that were used for the benchmarks. They were created
with [PHP Faker](tests/Benchmarks/Commands/CreateCsv.php) (the first 2000 lines) and then
copied [1000 times into themselves](tests/Benchmarks/create-csv.sh).
copied [1000 times into themselves](tests/Benchmarks/create-csv.sh). So we can create a really huge random files in
seconds.

The basic principle is that the more columns there are, the longer the values in them. I.e. something like exponential
growth.
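
The "generate a small block once, then copy it many times" idea can be sketched as follows. This is a hypothetical stand-in, not the project's `CreateCsv.php` or `create-csv.sh`: it uses `mt_rand()` instead of PHP Faker, the function name `createBenchmarkCsv` is made up, and writing the header only once is an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: build a block of random rows once,
 * then append that same block many times to reach the target size.
 */
function createBenchmarkCsv(string $path, int $seedRows, int $copies): int
{
    \mt_srand(42); // fixed seed so repeated runs produce the same file

    $block = '';
    for ($i = 1; $i <= $seedRows; $i++) {
        // Columns follow the 5-column example: id,bool_int,bool_str,number,float
        $block .= \implode(',', [
            $i,
            \mt_rand(0, 1),
            \mt_rand(0, 1) === 1 ? 'true' : 'false',
            \mt_rand(0, 1_000_000),
            \number_format(\mt_rand(0, 1_000_000) / 1000, 3, '.', ''),
        ]) . "\n";
    }

    $handle = \fopen($path, 'wb');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open {$path} for writing");
    }

    \fwrite($handle, "id,bool_int,bool_str,number,float\n");
    for ($copy = 0; $copy < $copies; $copy++) {
        \fwrite($handle, $block); // random generation happened only once, above
    }
    \fclose($handle);

    return $seedRows * $copies; // number of data rows written
}

$path = \sys_get_temp_dir() . '/bench_sketch.csv';
$rows = createBenchmarkCsv($path, 2000, 10); // scaled down from the 1000 copies described above
echo "{$rows} rows, ", \filesize($path), " bytes\n";
```

With 2000 seed rows and 1000 copies this would give 2,000,000 data rows, which matches `$numberOfLines` in the benchmark test.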

<details>
<summary>Columns: 1, Size: 8.48 MB</summary>
@@ -1110,7 +1127,7 @@ id


<details>
<summary>Columns: 5, Size: 64.04 MB</summary>
<summary>Columns: 5, Size: 64 MB</summary>

```csv
id,bool_int,bool_str,number,float
@@ -1122,7 +1139,7 @@ id,bool_int,bool_str,number,float


<details>
<summary>Columns: 10, Size: 220.02 MB</summary>
<summary>Columns: 10, Size: 220 MB</summary>

```csv
id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4
@@ -1134,7 +1151,7 @@ id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4


<details>
<summary>Columns: 20, Size: 1.18 GB</summary>
<summary>Columns: 20, Size: 1.2 GB</summary>

```csv
id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4,uuid,address,postcode,latitude,longitude,ip6,sentence_tiny,sentence_small,sentence_medium,sentence_huge
@@ -1144,7 +1161,7 @@ id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4,uuid,address,po

</details>

### Run the benchmark locally
### Run benchmark locally

Make sure you have PHP 8.1+ and Docker installed.

@@ -1274,7 +1291,7 @@ I'm not sure if I will implement all of them. But I will try to do my best.
## Contributing
If you have any ideas or suggestions, feel free to open an issue or create a pull request.

```sh
```shell
# Fork the repo and build project
git clone git@github.com:jbzoo/csv-blueprint.git ./jbzoo-csv-blueprint
cd ./jbzoo-csv-blueprint
19 changes: 13 additions & 6 deletions tests/ReadmeTest.php
@@ -139,6 +139,13 @@ public function testAdditionalValidationRules(): void
public function testBenchmarkTable(): void
{
$nbsp = static fn (string $text): string => \str_replace(' ', '&nbsp', $text);
$timeFormat = static fn (float $time): string => \str_pad(
\number_format($time, 1) . ' sec',
8,
' ',
\STR_PAD_LEFT,
);

$numberOfLines = 2_000_000;

$columns = [
@@ -149,25 +156,25 @@ public function testBenchmarkTable(): void
];

$table = [
'Columns: 1<br>Size: 8.48 MB' => [
'Columns: 1<br>Size: ~8 MB' => [
[586, 802, 474, 52],
[320, 755, 274, 68],
[171, 532, 142, 208],
[794, 142, 121, 272],
],
'Columns: 5<br>Size: 64.04 MB' => [
'Columns: 5<br>Size: 64 MB' => [
[443, 559, 375, 52],
[274, 526, 239, 68],
[156, 406, 131, 208],
[553, 139, 111, 272],
],
'Columns: 10<br>Size: 220.02 MB' => [
'Columns: 10<br>Size: 220 MB' => [
[276, 314, 247, 52],
[197, 308, 178, 68],
[129, 262, 111, 208],
[311, 142, 97, 272],
],
'Columns: 20<br>Size: 1.18 GB' => [
'Columns: 20<br>Size: 1.2 GB' => [
[102, 106, 95, 52],
[88, 103, 83, 68],
[70, 97, 65, 208],
Expand Down Expand Up @@ -199,8 +206,8 @@ public function testBenchmarkTable(): void
if ($key === 3) {
$testRes = $value . ' MB';
} else {
$execTime = \round($numberOfLines / ($value * 1000), 1);
$testRes = $nbsp("{$value}K, {$execTime} sec<br>");
$execTime = $timeFormat($numberOfLines / ($value * 1000));
$testRes = $nbsp("{$value}K, {$execTime}<br>");
}

$output[] = $testRes;
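
The effect of the new `$timeFormat` closure on the table cells can be sketched as a standalone script. The function names `formatTime` and `nbsp` are hypothetical wrappers around the closures from the hunk above, and the three throughput values are taken from the benchmark table.

```php
<?php

declare(strict_types=1);

// Mirrors the $timeFormat closure: one decimal place, " sec" suffix,
// left-padded with spaces to a fixed width of 8 characters.
function formatTime(float $seconds): string
{
    return \str_pad(\number_format($seconds, 1) . ' sec', 8, ' ', \STR_PAD_LEFT);
}

// Mirrors the $nbsp closure (the test emits the entity without a trailing semicolon).
function nbsp(string $text): string
{
    return \str_replace(' ', '&nbsp', $text);
}

$numberOfLines = 2_000_000;

// Values are thousands of lines per second, as in the table.
foreach ([586, 171, 111] as $kLinesPerSec) {
    $execTime = formatTime($numberOfLines / ($kLinesPerSec * 1000));
    echo nbsp("{$kLinesPerSec}K, {$execTime}<br>"), \PHP_EOL;
}
```

Because `number_format(..., 1)` always keeps one decimal, `18` becomes `18.0`, and the padding turns single-digit times such as `3.4 sec` into ` 3.4 sec`. That is why the README cells gain a second `&nbsp` in this commit.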