Renaming fields takes too much time #627

NikosAlexandris · 2021-08-16T13:49:55Z

With

[..] 204M Aug 16 16:28 madrid_composite_time_series_noqc_2019_ecostress_variables_best_unsparsified.csv

why would this rename

➜  time mlr --csv rename ecostress_emissivity_4_best,emissivity_4,ecostress_emissivity_5_best,emissivity_5,ecostress_emissivity_wideband_best,emissivity_wideband,ecostress_lst_best,lst,ecostress_pwv_best,pwv madrid_composite_time_series_noqc_2019_ecostress_variables_best_unsparsified.csv > madrid_composite_time_series_noqc_2019_ecostress.csv

take

real    0m14.036s
user    0m11.014s
sys     0m1.381s

?

Why does renaming fields need to parse the whole file and not only the header?

masgo · 2022-01-28T13:15:02Z

I guess it's because Miller supports much more than simple CSV files. For example, you could have something like

a,b,c
1,2,3

d,c,b,a
4,5,6,7

in the same file. Renaming b would still work and produce the expected output. If you then run unsparsify, you end up with a single table.

If you know for certain that you have a simple CSV file, you could extract the header with head -n1, rename the fields there, and then replace just the first line of the file.
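The header-only approach suggested above could be sketched in Python roughly as follows. This is only an illustration and is not part of Miller: the function name `rename_csv_header` and its parameters are hypothetical, and it assumes a plain CSV file with a single one-line header and no schema change further down.

```python
import csv
import shutil


def rename_csv_header(src_path, dst_path, renames):
    """Rename columns by rewriting only the first line of a simple CSV file.

    Hypothetical helper: assumes a single one-line header and that the
    remaining lines never introduce a new header.
    """
    with open(src_path, "r", newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        header_line = src.readline()
        if not header_line:
            return  # empty input: nothing to rename, output stays empty
        # Parse the header with the csv module so quoted names are handled.
        header = next(csv.reader([header_line]))
        new_header = [renames.get(name, name) for name in header]
        # Keep the line ending used by the original header line.
        terminator = "\r\n" if header_line.endswith("\r\n") else "\n"
        csv.writer(dst, lineterminator=terminator).writerow(new_header)
        # Copy the rest of the file in chunks, without parsing any records.
        shutil.copyfileobj(src, dst)
```

Called as, for example, `rename_csv_header("in.csv", "out.csv", {"ecostress_lst_best": "lst"})` (file names here are placeholders), this parses one line and copies the rest, so the cost is dominated by file I/O rather than by per-record processing.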

johnkerl · 2022-01-28T16:21:21Z

Maintainer

Indeed. Miller turns an input file into a stream of records, each of which is an ordered list of key-value pairs, then processes those records, then turns the result into an output file. This is what lets it abstract most processing in a file-format-independent way: CSV/TSV, JSON, DKVP/XTAB, and someday (hopefully soon) a subset of YAML can all use the same logic for cut, sort, filter, etc., independent of file format.

So even though the CSV file has keys only on line 1, in memory the keys are in every record:

a,b,c
1,2,3
4,5,6
...

becomes (in memory)

a=1,b=2,c=3
a=4,b=5,c=6
...

This is a very flexible and powerful design in general ... in the context of file format = CSV and transformation = rename, though, it does look absurd in terms of its performance ... :^/
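To make the cost concrete, the record-stream model described above can be sketched in Python. This is a toy illustration, not Miller's actual implementation: the function names `read_records`, `rename`, and `write_csv` are hypothetical, and the writer assumes every record has the same keys as the first one.

```python
import csv
import io


def read_records(lines):
    """Yield each CSV data row as an ordered dict of key-value pairs."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return
    for row in reader:
        # Every record carries its own copy of the keys.
        yield dict(zip(header, row))


def rename(records, renames):
    """Rename keys record by record; the transform never sees a 'header'."""
    for record in records:
        yield {renames.get(key, key): value for key, value in record.items()}


def write_csv(records, out):
    """Write records as CSV, taking the header from the first record."""
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for record in records:
        if not header_written:
            writer.writerow(record.keys())
            header_written = True
        writer.writerow(record.values())


if __name__ == "__main__":
    source = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
    sink = io.StringIO()
    write_csv(rename(read_records(source), {"b": "beta"}), sink)
    print(sink.getvalue(), end="")
```

Because the rename step works on records rather than on the file, it touches every row even though only the keys change. That keeps the transform independent of the input format, at the price of parsing and re-serializing the whole file.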
