align definition of duplicate with its behavior #950

Open
wants to merge 1 commit into
base: main
from

Conversation

anopsy
Contributor

@anopsy anopsy commented Feb 6, 2025

Align the definitions of `duplicate_count` and `duplicate_percent` with their behavior.

"The behavior of duplicate checks is a bit unintuitive and doesn't match our docs, and I am wondering if this is a bug. The value of a duplicate count check is equal to the count of distinct values that have duplicates. But in the docs it says a duplicate_count is "The number of rows that contain duplicate values". There is a similar issue for duplicate percent checks, for which the value is the count of distinct values that have duplicates divided by the total number of rows, but the docs say duplicate_percent is "The percentage of rows in a dataset that contain duplicate values"."
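To make the gap concrete, here is a minimal sketch in plain Python. It is not Soda's implementation; the function names and the sample column are hypothetical, and the "observed" behavior is taken only from the description quoted above.

```python
from collections import Counter


def distinct_values_with_duplicates(values):
    """Count distinct values that occur more than once.

    This mirrors the behavior reported in the quote above (assumption:
    the report is accurate; this is not taken from the Soda source).
    """
    counts = Counter(values)
    return sum(1 for n in counts.values() if n > 1)


def rows_with_duplicate_values(values):
    """Count rows whose value occurs more than once.

    This is the reading suggested by the current docs wording.
    """
    counts = Counter(values)
    return sum(n for n in counts.values() if n > 1)


# Hypothetical column: "a" appears three times, "b" twice, "c" once.
sample = ["a", "a", "a", "b", "b", "c"]

print(distinct_values_with_duplicates(sample))  # 2
print(rows_with_duplicate_values(sample))       # 5
```

On the same six rows the two readings give 2 and 5, so the wording in the docs matters for anyone setting a threshold on this check.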


netlify bot commented Feb 6, 2025

Deploy Preview for jovial-piroshki-65ff4d ready!

Name Link
🔨 Latest commit b1292ca
🔍 Latest deploy log https://app.netlify.com/sites/jovial-piroshki-65ff4d/deploys/67a46fe0d020970008d6e8c4
😎 Deploy Preview https://deploy-preview-950--jovial-piroshki-65ff4d.netlify.app

@adkinsty
Contributor

adkinsty commented Mar 4, 2025

@adkinsty adkinsty self-requested a review March 4, 2025 05:06
- | `duplicate_count` | The number of rows that contain duplicate values.<br> Include one column in the argument to compare values relative to that one column. <br/>Include more than one column in the argument to search for duplicate pairs multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all |
- | `duplicate_percent` | The percentage of rows in a dataset that contain duplicate values.<br> Include one column in the argument to compare values relative to that one column. Include more than one column in the argument to search for duplicate pairs in multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all |
+ | `duplicate_count` | The count of distinct values that have duplicates.<br> Include one column in the argument to compare values relative to that one column. <br/>Include more than one column in the argument to search for duplicate pairs multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all |
+ | `duplicate_percent` | The percentage of rows in a dataset that have dupliates.<br> Include one column in the argument to compare values relative to that one column. Include more than one column in the argument to search for duplicate pairs in multiple columns. Be sure to add a space between the comma-separated values in the list of column names. <br />See also: [Duplicate check]({% link soda/quick-start-sodacl.md %}#duplicate-check)| number<br /> text<br /> time | all |

"dupliates" should be "duplicates"?


Still, I'm not sure this accurately captures the meaning of the duplicate percent metric... hmm... I think it's literally `duplicate_count` divided by `row_count`, times 100. Do you think this matches the proposed description?
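A quick sketch of why the two wordings can diverge, in plain Python. This is not the Soda code: the formula is only the reading suggested in the comment above, and the function names and sample data are hypothetical.

```python
from collections import Counter


def duplicate_percent_as_suggested(values):
    """duplicate_count / row_count * 100, where duplicate_count is the
    number of distinct values that occur more than once (assumed reading)."""
    counts = Counter(values)
    duplicate_count = sum(1 for n in counts.values() if n > 1)
    return duplicate_count / len(values) * 100


def percent_of_rows_with_duplicates(values):
    """Share of rows whose value occurs more than once, which is how
    'percentage of rows that have duplicates' would naturally be read."""
    counts = Counter(values)
    duplicated_rows = sum(n for n in counts.values() if n > 1)
    return duplicated_rows / len(values) * 100


# Hypothetical column with 8 rows: "a" x3, "b" x2, and three unique values.
rows = ["a", "a", "a", "b", "b", "c", "d", "e"]

print(duplicate_percent_as_suggested(rows))   # 25.0
print(percent_of_rows_with_duplicates(rows))  # 62.5
```

If the metric really is computed the first way, a description along the lines of "the count of distinct values that have duplicates, divided by the total number of rows" would match it more closely than "the percentage of rows that have duplicates".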

2 participants