Issue with double-quotes chars while trying to import a csv file #110
Comments
We are using a well-tested CSV library which seems to support this by default: The file seems to be semicolon-delimited rather than comma-delimited, and I don't recall whether that is detected or supported. Could you please upload the CSV file or email it to me so I can see exactly what is causing it? |
Ideally we could add a small unit test which triggers this behavior. |
Sure, here is the download link : https://adresse.data.gouv.fr/data/ban/adresses/latest/csv/adresses-france.csv.gz Thanks a lot! |
I agree the file seems valid according to a couple of open-source tools written in different languages:
xsv count -d ";" adresses-france.csv
26049046
csvlint --delimiter=';' adresses-france.csv
Warning: not using defaults, may not validate CSV to RFC 4180
file is valid |
Although both tools fail with errors similar to the OP's when the delimiter is specified incorrectly:
xsv count adresses-france.csv
CSV error: record 86898 (line: 86899, byte: 16594179): found record with 3 fields, but the previous record has 1 fields
/tmp # csvlint adresses-france.csv
Record #86898 has error: wrong number of fields
Record #86899 has error: wrong number of fields
Record #86900 has error: wrong number of fields
Record #86901 has error: wrong number of fields
Record #86902 has error: wrong number of fields
Record #86903 has error: wrong number of fields
Record #86904 has error: wrong number of fields
Record #86905 has error: wrong number of fields
Record #86906 has error: wrong number of fields
Record #86907 has error: wrong number of fields
Record #86908 has error: wrong number of fields
Record #86909 has error: wrong number of fields
Record #86910 has error: wrong number of fields
Record #86911 has error: wrong number of fields
Record #86912 has error: wrong number of fields
Record #258188 has error: bare " in non-quoted-field
unable to parse any further |
Moreover, it seems that the single-quote character (the apostrophe is frequent in French) triggers the same error. Logs from pelias import csv:
Example values:
Do you think that making the field delimiter configurable would overcome this kind of issue? Thanks! |
OK, so moving forward: if this is an actual bug, it lies within the CSV parser library rather than Pelias, so there's little we can do on our end except modify the config. I still suspect the semicolons. I can't find anything in the library's docs saying they are supported, so at this stage it is only an assumption that the library auto-detects the delimiter. As a next step, can you please rewrite the semicolon-delimited file as comma-delimited and try that? Please report back if the error persists. Something like this:
|
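The snippet originally posted here did not survive, so what follows is only a minimal sketch of such a conversion, assuming Python's standard `csv` module. The `convert_delimiter` helper and the two sample records are made up for illustration. Going through a real CSV reader and writer (rather than a blind search-and-replace of `;`) keeps the doubled quotes intact and automatically quotes any value that already contains a comma.

```python
import csv
import io


def convert_delimiter(src, dst, old=";", new=","):
    """Copy CSV records from src to dst, switching the field delimiter.

    Hypothetical helper: src and dst are text file objects. Returns the
    number of records written (header included).
    """
    reader = csv.reader(src, delimiter=old)
    # The default QUOTE_MINIMAL policy quotes only fields that need it,
    # e.g. values containing the new delimiter or a double quote.
    writer = csv.writer(dst, delimiter=new, lineterminator="\n")
    count = 0
    for record in reader:
        writer.writerow(record)
        count += 1
    return count


# Made-up sample: one value with doubled quotes, one with an embedded comma.
source = io.StringIO(
    "id;nom_voie;nom_commune\n"
    '1;"Lieu dit ""Entre deux Villes""";Sons-et-Ronchères\n'
    "2;12, Rue de la Paix;Paris\n"
)
target = io.StringIO()
written = convert_delimiter(source, target)
print(target.getvalue())
```

For a real file, the same function can be called with two files opened with `newline=""` and `encoding="utf-8"`, for example `open("adresses-france.csv", newline="", encoding="utf-8")` as the source and a second file opened in `"w"` mode as the target.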
Yes, possibly. If you're able to resolve the issue by manually rewriting the delimiters, then we can consider how we might make that configurable. |
IMO this convention is a bit silly and confusing. It's common in Europe simply because the comma is used instead of the period to separate the euros from the cents. My bank does it to my statements here in Germany and I understand why, but it's confusing that the file extension is still For tab-separated files there is a growing use of |
Hey folks, I believe the issue here is that delimiters must be specified when configuring the CSV parsing library we are using, as shown in their docs. We don't set the delimiter when configuring it, though I think it would be possible. The only thing that might be tricky is that we would have to add some way in our pelias.json config both to specify multiple CSV files (already possible, though usually a directory is given and all CSV files within the directory are imported) and to set a delimiter for each one. That's a lot of new configuration we don't currently have. I could see us theoretically adding support for this, though please discuss the potential format with us before doing any work on a PR. That said, I think the easiest practical way to proceed is to convert your file to comma-delimited. Assuming any commas in your actual data are quoted appropriately, everything should work. |
Thanks for your support @missinglink @orangejulius. Once the separator had been switched to commas, the import completed successfully. Since the separator seems to be redefinable when instantiating csv-parse, it might be worthwhile to make this attribute configurable. For example, a French address often contains commas to separate elements such as the house number, the street name and the postal code: Thanks again! |
Good to hear you got it working. Let's close this issue and open a new one to discuss some way of configuring the delimiter. |
The feature request is open: #111 |
Describe the bug
I am currently evaluating Pelias for geocoding,
and I am trying to import a CSV dataset containing French addresses.
However, the import job aborts when it encounters a double-quote character, with the following INVALID_OPENING_QUOTE error:
Yet the example above appears to comply with the CSV format as defined in RFC 4180: https://www.ietf.org/rfc/rfc4180.txt
Steps to Reproduce
id;id_fantoir;numero;rep;nom_voie;code_postal;code_insee;nom_commune;code_insee_ancienne_commune;nom_ancienne_commune;x;y;lon;lat;type_position;alias;nom_ld;libelle_acheminement;nom_afnor;source_position;source_nom_voie;certification_commune;cad_parcelles
02727_b017_00001;02727_B017;1;;"Lieu dit ""Entre deux Villes""";02270;02727;Sons-et-Ronchères;;;749035.43;6962499.63;3.680096;49.759545;logement;;;SONS ET RONCHERES;LIEU DIT ENTRE DEUX VILLES ;commune;commune;1;02727000ZB0042
pelias elastic start
pelias elastic create
pelias import csv
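As a quick sanity check of the sample record above, here is a minimal sketch using Python's standard `csv` module. It is not the csv-parse library that Pelias uses, and Python's parser is more lenient by default, so it only illustrates how the delimiter choice changes the result. The `parse` helper is a made-up name.

```python
import csv
import io

HEADER = (
    "id;id_fantoir;numero;rep;nom_voie;code_postal;code_insee;nom_commune;"
    "code_insee_ancienne_commune;nom_ancienne_commune;x;y;lon;lat;"
    "type_position;alias;nom_ld;libelle_acheminement;nom_afnor;"
    "source_position;source_nom_voie;certification_commune;cad_parcelles"
)
ROW = (
    '02727_b017_00001;02727_B017;1;;"Lieu dit ""Entre deux Villes""";02270;'
    "02727;Sons-et-Ronchères;;;749035.43;6962499.63;3.680096;49.759545;"
    "logement;;;SONS ET RONCHERES;LIEU DIT ENTRE DEUX VILLES ;commune;"
    "commune;1;02727000ZB0042"
)


def parse(text, delimiter):
    """Hypothetical helper: parse CSV text into a list of records."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


sample = HEADER + "\n" + ROW + "\n"

# With the correct delimiter the doubled quotes are read as an escaped quote.
semicolon_rows = parse(sample, ";")
print(len(semicolon_rows[1]))   # 23
print(semicolon_rows[1][4])     # Lieu dit "Entre deux Villes"

# With the default comma delimiter the whole line collapses into one field,
# so the quotes end up in the middle of an unquoted field.
comma_rows = parse(sample, ",")
print(len(comma_rows[1]))       # 1
```

A stricter parser that rejects a quote appearing inside an unquoted field would raise an error in the second case instead of returning a single field, which is consistent with the INVALID_OPENING_QUOTE error reported here.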
Environment (please complete the following information):
Linux (Debian-like)
References
A similar issue seems to have been fixed in pelias/transit with the help of the "relax" parameter from csv-parse: pelias/transit#46
And the parameter already seems to be in place in the csv-importer/lib/streams/recordStream.js file:
49: relax: true,
But do you think the solution could be to rename the parameter from relax to relax_quotes in the csv-parse instance, as suggested in the following?
https://stackoverflow.com/questions/70880341/csv-parse-is-throwing-invalid-opening-quote-a-quote-is-found-inside-a-field-at
https://stackoverflow.com/questions/73769717/csverror-invalid-opening-quote-a-quote-is-found-inside-a-field-at-line-9618?rq=3
Thanks a lot for your help!