Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mappify should set missing values (for rows shorter than header) to nil #65

Closed
ABeltramo opened this issue Jun 11, 2019 · 6 comments · May be fixed by #66
Closed

mappify should set missing values (for rows shorter than header) to nil #65

ABeltramo opened this issue Jun 11, 2019 · 6 comments · May be fixed by #66

Comments

@ABeltramo
Copy link

I found it weird and undocomunted but with the following malformed CSV (first two lines have less commas than the header)

sepal_length,sepal_width,petal_length,petal_width,label,id,test_train
5.8,4,1.2,0.2,15
4.8,3,1.4,
4.3,3,1.1,0.1,Iris-setosa,14,train

Using the mappify method will produce the following:

{:sepal_length "5.8", :sepal_width "4", :petal_length "1.2", :petal_width "0.2", :label "15"}
{:sepal_length "4.8", :sepal_width "3", :petal_length "1.4", :petal_width ""}
{:sepal_length "4.3", :sepal_width "3", :petal_length "1.1", :petal_width "0.1", :label "Iris-setosa", :id "14", :test_train "train"}

As you can see some rows are smaller than others, totally missing from the mappified results. I was expecting that all rows will have the same size, with nil values when something is missing from the CSV.

Bare in mind that using {:structs true} will produce the expected results:

{:sepal_length "5.8", :sepal_width "4", :petal_length "1.2", :petal_width "0.2", :label "15", :id nil, :test_train nil}
{:sepal_length "4.8", :sepal_width "3", :petal_length "1.4", :petal_width "", :label nil, :id nil, :test_train nil}
{:sepal_length "4.3", :sepal_width "3", :petal_length "1.1", :petal_width "0.1", :label "Iris-setosa", :id "14", :test_train "train"}

I have some other issues using structs but I will probably open another issue when I can get a reproducible environment.

I'll open up a pull request with a fix I have made in order to fix this.

@metasoarous
Copy link
Owner

Hi @ABeltramo. Thanks for bringing this up.

This is actually how I personally expect this to behave, at least by default. Ultimately, I think it's a consequence of how the underlying csv grammar library parses into vectors, and doesn't remeber what row dimensions it's seen. I'd consider a processing fn/option which produced the behavior you're after, but I don't think I'd make it default. Does that seem reasonable?

Thanks again.

@ABeltramo
Copy link
Author

Yeah sure, I was mistakely thinking that given an header all the rows will have the same number of elements no matter what is the content of the CSV.
I think it would be great to also add this to the documentation.

@metasoarous
Copy link
Owner

Hey @ABeltramo. Sorry for letting this hang for so long.

Is there a reason why just using {:structs true} doesn't suffice for you as an option for preserving the header for rows with missing data? You mentioned "I have some other issues using structs", but didn't leave any details. I've actually been considering making structs the default, which would solve this problem, and has the benefit of preserving column order (see: #68 ). But if there are unforeseen problems with this I'd like to be able to take into account.

Thanks!

@ABeltramo
Copy link
Author

Hi @metasoarous,
I tried to find back the issue I had with structify but unfortunately you'll have to trust my memory on this one.
I think there was an issue with apply struct or apply create-struct

(rf results (apply struct (apply create-struct @hdr) input))))))))))

given a large enough CSV file you'll get an exception in creating a struct bigger than X.
Unfortunately, I don't have a CSV file to reproduce this, and I'm not even sure this was the issue. I think you'll catch this by using spec generators instead of trivial examples in unit tests, if I can find the time I'll try something.

@metasoarous
Copy link
Owner

Thanks for this feedback @ABeltramo.

I was just able to create a struct with 10k fields, and didn't hit any issues.

I think you're right though that using some spec-based generative testing would be a good idea before taking this plunge though.

Thanks again

@metasoarous metasoarous changed the title If CSV is malformed rows will missing header mappify should set missing values (for rows shorter than header) to nil May 4, 2021
@metasoarous
Copy link
Owner

Thanks again for submitting this @ABeltramo.

I've decided that for now I'm going to close this issue. To my mind, nothing is broken, and this is actually how I'd expect the function to behave, at least by default. There are absolutely cases (triangular matrices, for example), where you wouldn't want to take up any more space than necessary in the dicts you parse.

I am not categorically apposed to an option that would preserve kv pairs (v=nil) for entries missing in this fashion, but would first need to hear a concrete rationale for not just using structs (which do have this behavior).

Please feel free to reopen if you disagree or have more pertinent information here.

Thanks again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants