Data cleaning is the process of fixing invalid and erroneous information in your dataset.
The result is a dataset that is valid, accurate, complete, consistent, and uniform.
- Validity: The degree to which the data conforms to defined business rules or constraints.
- Accuracy: The degree to which the data is close to the true values.
- Completeness: The degree to which all required data is known.
- Consistency: The degree to which the data agrees with itself, within the same dataset or across multiple datasets.
- Uniformity: The degree to which the data is specified using the same unit of measure.
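As a rough illustration of how one of these qualities can be measured, the sketch below scores completeness as the share of required fields that are actually filled in. The function name `completeness`, the sample records, and the choice to treat `None` and empty strings as "unknown" are all assumptions made for this example, not a standard definition.

```python
def completeness(records, required_fields):
    """Return the fraction of required values that are known (0.0 to 1.0).

    Assumption: a value counts as unknown if it is missing, None, or "".
    """
    total = len(records) * len(required_fields)
    if total == 0:
        # Nothing is required, so nothing can be missing.
        return 1.0
    known = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return known / total


# Hypothetical sample data: one of the four required values is missing.
people = [
    {"name": "Ana", "dob": "15-08-1990"},
    {"name": "Ben", "dob": None},
]
score = completeness(people, ["name", "dob"])
```

Here three of the four required values are known, so `score` is 0.75. The other qualities usually need more context to measure, such as a trusted reference for accuracy.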
WIP
🪄 Example: Data validation
With data validation, a form may accept a date of birth only if it's formatted a certain way, for example, as dd-mm-yyyy.
The day field will allow numbers up to 31, the month field up to 12, and the year field up to 2021. If any numbers exceed those values, the form won’t be submitted.
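The rule described above could be sketched roughly as follows. The function name `is_valid_dob`, the `max_year` parameter, and the lower bound of 1 for day and month are assumptions added for illustration. The check mirrors only the field limits in the example, so it does not catch impossible calendar dates such as 31-02-2000.

```python
import re

# dd-mm-yyyy: two digits, two digits, four digits, separated by hyphens.
DOB_PATTERN = re.compile(r"([0-9]{2})-([0-9]{2})-([0-9]{4})")


def is_valid_dob(text, max_year=2021):
    """Return True if text looks like dd-mm-yyyy and each field is in range."""
    match = DOB_PATTERN.fullmatch(text)
    if match is None:
        # Wrong format, for example yyyy-mm-dd or missing leading zeros.
        return False
    day, month, year = (int(group) for group in match.groups())
    # Field limits from the example: day <= 31, month <= 12, year <= 2021.
    # The lower bound of 1 for day and month is an added assumption.
    return 1 <= day <= 31 and 1 <= month <= 12 and year <= max_year
```

A form handler would submit the entry only when `is_valid_dob` returns `True`. For stricter checking, Python's `datetime.strptime(text, "%d-%m-%Y")` also rejects dates that don't exist on the calendar.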
WIP
🪄 Example: Inaccurate data
WIP
WIP
WIP
WIP
🪄 Example: Inconsistent data
WIP
WIP
WIP
🪄 Example: Nonuniform data
WIP
WIP