data_cleaning.md

File metadata and controls

54 lines (33 loc) · 1.36 KB

Data cleaning aka data cleansing

Definition

Data cleaning is the process of fixing invalid and erroneous information in your dataset.

This process yields a dataset that is valid, accurate, complete, consistent, and uniform.

Data quality metrics

  1. Validity: The degree to which the data conforms to defined business rules or constraints.
  2. Accuracy: The degree to which the data is close to the true values.
  3. Completeness: The degree to which all required data is known.
  4. Consistency: The degree to which the data agrees with itself, within the same dataset or across multiple datasets.
  5. Uniformity: The degree to which the data is specified using the same unit of measure.
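
As an illustration of how one of these metrics might be measured, here is a minimal sketch that scores completeness as the share of required fields that are known. The record layout (a list of dicts), the field names in the usage note, and the choice to treat `None` and the empty string as "unknown" are all assumptions made for this example, not a standard definition.

```python
def completeness(records, required_fields):
    """Return the fraction of required fields that hold a known value.

    Assumption: a value counts as unknown if the key is absent,
    or if the value is None or an empty string.
    """
    total = len(records) * len(required_fields)
    if total == 0:
        # Nothing is required, so nothing can be missing.
        return 1.0
    known = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return known / total
```

For instance, `completeness([{"name": "Ada", "email": "ada@example.org"}, {"name": "", "email": None}], ["name", "email"])` counts two known values out of four required ones.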

Validity

WIP

🪄 Example: Data validation

With data validation in place, a form may accept a date of birth only if it is formatted a certain way, for example as dd-mm-yyyy.

The day field allows numbers up to 31, the month field up to 12, and the year field up to 2021. If any number exceeds its limit, the form won't be submitted.
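
The rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `is_valid_dob`, the constant `MAX_YEAR`, and the lower bound of 1 for day and month are assumptions added for the example. It checks each field against its own limit and nothing more, so a calendar-impossible date such as 31-02-2000 still passes.

```python
import re

MAX_YEAR = 2021  # upper bound for the year field, taken from the example

# dd-mm-yyyy: two digits, two digits, four digits, separated by hyphens
_DOB_PATTERN = re.compile(r"([0-9]{2})-([0-9]{2})-([0-9]{4})")


def is_valid_dob(text: str) -> bool:
    """Return True if text is dd-mm-yyyy and each field is within its limit."""
    match = _DOB_PATTERN.fullmatch(text)
    if match is None:
        # Wrong format, e.g. yyyy-mm-dd or missing leading zeros
        return False
    day, month, year = (int(part) for part in match.groups())
    # Assumption: day and month start at 1; the text only states upper bounds
    return 1 <= day <= 31 and 1 <= month <= 12 and year <= MAX_YEAR
```

A form handler could call `is_valid_dob(value)` before submission and reject the input when it returns `False`. A stricter check that also rejects impossible dates could parse the value with `datetime.strptime(value, "%d-%m-%Y")` instead.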

Accuracy

WIP

🪄 Example: Inaccurate data

WIP

WIP

Completeness

WIP

Consistency

WIP

🪄 Example: Inconsistent data

WIP

WIP

Uniformity

WIP

🪄 Example: Nonuniform data

WIP

WIP