align definition of duplicate with its behavior #950
base: main
✅ Deploy Preview for jovial-piroshki-65ff4d ready!
Related Slack thread: https://sodadata.slack.com/archives/CAXRR8SS3/p1738086720079429
| `duplicate_count` | The number of rows that contain duplicate values.<br> Include one column in the argument to compare values relative to that one column. <br/>Include more than one column in the argument to search for duplicate pairs multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all | | ||
| `duplicate_percent` | The percentage of rows in a dataset that contain duplicate values.<br> Include one column in the argument to compare values relative to that one column. Include more than one column in the argument to search for duplicate pairs in multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all | | ||
| `duplicate_count` | The count of distinct values that have duplicates.<br> Include one column in the argument to compare values relative to that one column. <br/>Include more than one column in the argument to search for duplicate pairs multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all | | ||
| `duplicate_percent` | The percentage of rows in a dataset that have dupliates.<br> Include one column in the argument to compare values relative to that one column. Include more than one column in the argument to search for duplicate pairs in multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all | |
"dupliates" should be "duplicates"?
Still I'm not sure if this accurately captures the meaning of the duplicate percent metric...hmm...I think it's literally duplicate_count divided by row_count times 100. Do you think this matches the proposed description?
align definition of duplicate_chck duplicate_percent with its behavior
"The behavior of duplicate checks is a bit unintuitive and doesn't match our docs, and I am wondering if this is a bug. The value of a duplicate count check is equal to the count of distinct values that have duplicates. But in the docs it says a duplicate_count is "The number of rows that contain duplicate values". There is a similar issue for duplicate percent checks, for which the value is the count of distinct values that have duplicates divided by the total number of rows, but the docs say duplicate_percent is "The percentage of rows in a dataset that contain duplicate values"."
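To make the difference between the two readings concrete, here is a minimal Python sketch. It is not Soda's implementation; the function name `duplicate_metrics` and the sample data are hypothetical, and the percent formula assumes the "distinct values with duplicates divided by row count, times 100" reading described in this thread.

```python
from collections import Counter


def duplicate_metrics(values):
    """Return both readings of 'duplicate count' discussed in this thread.

    Hypothetical helper for illustration only; not part of Soda.
    """
    counts = Counter(values)
    row_count = len(values)

    # Behavior reported in the issue: the number of distinct values
    # that occur more than once.
    distinct_with_duplicates = sum(1 for c in counts.values() if c > 1)

    # Reading suggested by the old doc wording: the number of rows
    # whose value occurs more than once.
    rows_with_duplicates = sum(c for c in counts.values() if c > 1)

    # Assumed percent formula: distinct duplicated values / rows * 100.
    percent = 100.0 * distinct_with_duplicates / row_count if row_count else 0.0

    return distinct_with_duplicates, rows_with_duplicates, percent


# Eight rows: "a" appears three times, "b" twice, the rest once.
sample = ["a", "a", "a", "b", "b", "c", "d", "e"]
distinct_dups, dup_rows, dup_percent = duplicate_metrics(sample)
print(distinct_dups, dup_rows, dup_percent)
```

On this sample the reported behavior gives a duplicate count of 2 (the values "a" and "b"), while the old doc wording would suggest 5 (the rows holding "a" or "b"). Under the assumed formula the percent is 2 / 8 * 100 = 25.0, not the 62.5 that "percentage of rows that contain duplicate values" would imply.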