align definition of duplicate with its behavior #950

anopsy · 2025-02-06T08:16:29Z

Align the definitions of duplicate_count and duplicate_percent with their behavior

"The behavior of duplicate checks is a bit unintuitive and doesn't match our docs, and I am wondering if this is a bug. The value of a duplicate count check is equal to the count of distinct values that have duplicates. But in the docs it says a duplicate_count is "The number of rows that contain duplicate values". There is a similar issue for duplicate percent checks, for which the value is the count of distinct values that have duplicates divided by the total number of rows, but the docs say duplicate_percent is "The percentage of rows in a dataset that contain duplicate values"."

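To make the gap between the two definitions concrete, here is a minimal sketch in plain Python. It is not Soda's implementation; the function names and sample data are hypothetical, and it only reproduces the behavior as described in the quote above.

```python
from collections import Counter


def distinct_values_with_duplicates(values):
    """Count distinct values that occur more than once.

    This mirrors the observed behavior described in the PR text (an assumption,
    not taken from Soda's source code).
    """
    counts = Counter(values)
    return sum(1 for n in counts.values() if n > 1)


def rows_with_duplicate_values(values):
    """Count rows whose value occurs more than once (the old docs wording)."""
    counts = Counter(values)
    return sum(n for n in counts.values() if n > 1)


# Hypothetical column: "a" appears 3 times, "b" twice, "c" once.
sample = ["a", "a", "a", "b", "b", "c"]

print(distinct_values_with_duplicates(sample))  # 2 ("a" and "b")
print(rows_with_duplicate_values(sample))       # 5 (3 rows of "a" + 2 rows of "b")
```

On this sample the two readings give 2 and 5, so the wording of the docs changes what a threshold on the check would mean.
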
netlify · 2025-02-06T08:16:47Z

✅ Deploy Preview for jovial-piroshki-65ff4d ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | `b1292ca` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/jovial-piroshki-65ff4d/deploys/67a46fe0d020970008d6e8c4 |
| 😎 Deploy Preview | https://deploy-preview-950--jovial-piroshki-65ff4d.netlify.app |

To edit notification comments on pull requests, go to your Netlify site configuration.

adkinsty · 2025-03-04T05:05:16Z

Related Slack thread: https://sodadata.slack.com/archives/CAXRR8SS3/p1738086720079429

adkinsty · 2025-03-04T05:11:55Z

soda-cl/numeric-metrics.md

-| `duplicate_count`  | The number of rows that contain duplicate values.<br> Include one column in the argument to compare values relative to that one column. <br/>Include more than one column in the argument to search for duplicate pairs multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all  |
-| `duplicate_percent`  | The percentage of rows in a dataset that contain duplicate values.<br> Include one column in the argument to compare values relative to that one column. Include more than one column in the argument to search for duplicate pairs in multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all  |
+| `duplicate_count`  | The count of distinct values that have duplicates.<br> Include one column in the argument to compare values relative to that one column. <br/>Include more than one column in the argument to search for duplicate pairs multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all  |
+| `duplicate_percent`  | The percentage of rows in a dataset that have dupliates.<br> Include one column in the argument to compare values relative to that one column. Include more than one column in the argument to search for duplicate pairs in multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all  |


"dupliates" should be "duplicates"?

Still, I'm not sure this accurately captures the meaning of the duplicate percent metric. I think it's literally `duplicate_count` divided by `row_count`, times 100. Do you think this matches the proposed description?
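A quick way to check the wording against that formula is to compute both candidates on a small column. This is a rough sketch in plain Python, assuming the reviewer's reading (`duplicate_count / row_count * 100`) is right; the function names, the sample data, and the choice to return 0.0 for an empty column are all hypothetical and not taken from Soda.

```python
from collections import Counter


def duplicate_percent_as_described(values):
    """Distinct values that have duplicates, divided by row count, times 100.

    Assumption: this follows the reviewer's reading of the metric, not Soda's code.
    """
    if not values:
        return 0.0  # hypothetical choice for an empty column
    counts = Counter(values)
    duplicate_count = sum(1 for n in counts.values() if n > 1)
    return duplicate_count / len(values) * 100


def percent_of_rows_with_duplicates(values):
    """Rows whose value occurs more than once, divided by row count, times 100."""
    if not values:
        return 0.0  # hypothetical choice for an empty column
    counts = Counter(values)
    duplicated_rows = sum(n for n in counts.values() if n > 1)
    return duplicated_rows / len(values) * 100


# Hypothetical column with 8 rows: "a" x3, "b" x2, and three unique values.
column = ["a", "a", "a", "b", "b", "c", "d", "e"]

print(duplicate_percent_as_described(column))   # 25.0  (2 / 8 * 100)
print(percent_of_rows_with_duplicates(column))  # 62.5  (5 / 8 * 100)
```

If the first function matches the real behavior, "the percentage of rows that have duplicates" would describe the second number rather than the first.
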

align definition of duplicate with its behavior

b1292ca

adkinsty self-requested a review March 4, 2025 05:06

adkinsty requested changes Mar 4, 2025
