Make parsing more robust #2

Closed


ThrawnCA

Queensland Government changes:

  • Convert empty strings to None when the database type is numeric or timestamp.
  • Don't try to convert a value to a timestamp if it has no day component (dateutil's parser can be too greedy and will accept partial dates).
  • Increase the parsing sample size to more reliably detect column types (match the sample size used before Tabulator).
  • Use chardet to better sniff encodings.
  • If decoding as UTF-8 fails, retry as Windows-1252, since the two encodings are indistinguishable while the sampled data is ASCII-only.
  • If all type guesses failed, mark the absence of a value.
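
For illustration only, the first item could be sketched roughly as below. This is not the code from the branch: `NULLABLE_TYPES`, `blank_to_none`, and the shape of `column_types` are hypothetical names chosen for the example, and it assumes each row is a dict keyed by column name.

```python
# Hypothetical sketch: replace empty strings with None for columns whose
# database type cannot hold an empty string.
NULLABLE_TYPES = {"numeric", "timestamp"}  # assumed type labels


def blank_to_none(row, column_types):
    """Return a copy of row with '' replaced by None in numeric/timestamp columns."""
    return {
        name: None
        if value == "" and column_types.get(name) in NULLABLE_TYPES
        else value
        for name, value in row.items()
    }


row = {"name": "", "amount": "", "seen": "", "count": "5"}
column_types = {
    "name": "text",
    "amount": "numeric",
    "seen": "timestamp",
    "count": "numeric",
}
cleaned = blank_to_none(row, column_types)
```

Text columns keep their empty strings; only the columns that the database would reject are nulled.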

ThrawnCA and others added 4 commits November 24, 2023 11:43
- The database will choke on empty strings if it expects a number or date.
- Use a post-processor to detect dates and numbers, instead of a custom parser, and exclude 'timestamp' values that don't have day components
- Use chardet to improve encoding guesses, and retry as Windows-1252 if UTF-8 fails since both can look the same at first
- Increase parsing sample size
- Properly detect when all type guesses have failed and mark it as 'no value'
- Add more unit tests for data edge cases
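
The encoding retry described in these commits might look something like the following. It is a minimal sketch using only the standard library: the chardet sniffing step is left out, and `decode_with_fallback` is a hypothetical helper name, not a function from this repository.

```python
# Hypothetical sketch: try UTF-8 first, then fall back to Windows-1252.
def decode_with_fallback(raw):
    """Decode bytes as UTF-8, retrying as Windows-1252 (cp1252) on failure.

    Returns a (text, encoding_used) pair.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so substitute a
        # replacement character rather than raising a second time.
        return raw.decode("cp1252", errors="replace"), "cp1252"


ascii_only = decode_with_fallback(b"plain text")
utf8_text = decode_with_fallback("caf\u00e9".encode("utf-8"))
legacy_text = decode_with_fallback("caf\u00e9".encode("cp1252"))
```

ASCII-only input decodes identically under both encodings, which is why a small sample can make a Windows-1252 file look like UTF-8.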
@ThrawnCA
Author

Obsolete.

@ThrawnCA ThrawnCA closed this Dec 17, 2024
@ThrawnCA ThrawnCA deleted the qgov-robustness-changes branch December 17, 2024 23:20