
TabularRedundancyPolluter

Project Description

This project focuses on the systematic pollution of tabular data by introducing redundancy-related errors. Given a selected set of datasets, we apply a variety of operations to deliberately inject redundant and repetitive information, simulating common data quality issues found in real-world scenarios.

Redundancy Errors

Exact Duplicates

Entire rows that are identical across all columns, resulting in redundant records. Because every field matches exactly, this error type is the easiest to detect.
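
As a minimal sketch (assuming the data is held in a pandas DataFrame; the function name and parameters are illustrative, not taken from the project code), exact duplicates can be injected like this:

```python
import pandas as pd

def inject_exact_duplicates(df: pd.DataFrame, fraction: float, seed: int = 42) -> pd.DataFrame:
    """Append exact copies of randomly sampled rows, creating fully redundant records."""
    duplicates = df.sample(frac=fraction, random_state=seed)  # rows to copy verbatim
    return pd.concat([df, duplicates], ignore_index=True)
```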

Fuzzy Duplicates

Rows that refer to the same real-world entity but differ slightly in one or more fields, for example due to inconsistent formatting ("2025-05-21" vs. "05/21/25"), abbreviations ("Mathematics 101" vs. "Math 101"), spelling mistakes ("Laura" vs. "Larua"), or different units ("kg" vs. "lbs"). Although these records describe the same entity, they cannot be matched as easily as exact duplicates, which makes detection and removal more complex.
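
For illustration, a fuzzy duplicate can be produced by copying a row and slightly perturbing one field, e.g. by swapping two adjacent characters. The helper below is a hypothetical sketch, not the project's implementation:

```python
import random
import pandas as pd

def make_fuzzy_duplicate(row: pd.Series, column: str, rng: random.Random) -> pd.Series:
    """Copy a row and swap two adjacent characters in one field, e.g. "Laura" -> "Larua"."""
    duplicate = row.copy()
    value = str(duplicate[column])
    if len(value) >= 2:
        i = rng.randrange(len(value) - 1)
        value = value[:i] + value[i + 1] + value[i] + value[i + 2:]
    duplicate[column] = value
    return duplicate
```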

Incorrect Values

In contrast to exact and fuzzy duplicates, incorrect values reflect redundancy at the column level. They do not create duplicated records, but they make it harder for a user to analyze and interpret the tabular dataset. We define two types of incorrect values (both are illustrated in the sketch after this list):

  • semantically-redundant: Values within a single column that represent the same concept or entity in inconsistent ways.
  • incorrect-unit: Values that do not match the column's definition because they have the wrong unit, wrong data type, or wrong value range.
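
A hedged sketch of both incorrect-value types, assuming pandas; the replacement map and conversion factor are illustrative placeholders rather than values used in the project:

```python
import pandas as pd

# Hypothetical replacement map and conversion factor, purely for illustration.
SEMANTIC_VARIANTS = {"United States": "USA"}   # same entity, inconsistent representation
KG_TO_LBS = 2.20462                            # value leaves the unit implied by the column

def inject_semantic_redundancy(df: pd.DataFrame, rows: pd.Index, column: str) -> None:
    """Rewrite selected cells with a semantically equivalent but inconsistent variant."""
    df.loc[rows, column] = df.loc[rows, column].replace(SEMANTIC_VARIANTS)

def inject_wrong_unit(df: pd.DataFrame, rows: pd.Index, column: str) -> None:
    """Convert selected cells to a different unit without changing the column definition."""
    df.loc[rows, column] = df.loc[rows, column].astype(float) * KG_TO_LBS
```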

Misfielded Values

We define values as misfielded when they are placed in the wrong column, breaking the structural integrity of the table and leading to ambiguity. Polluting with misfielded values does not create duplicate records; it only modifies existing ones. We introduce two types of misfielded values. In the first, the contents of two columns are switched; both fields in the row are marked as misfielded. In the second, the contents of one column are replaced with the contents of another column, causing a loss of information in the replaced column and duplicating the contents of the copied column; here, only the cell in the replaced column is marked as misfielded.
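
The two variants could look roughly like this in pandas (a sketch with placeholder function names, not the project's actual code):

```python
import pandas as pd

def switch_columns(df: pd.DataFrame, rows: pd.Index, col_a: str, col_b: str) -> None:
    """Swap the contents of two columns for the given rows; both cells become misfielded."""
    df.loc[rows, [col_a, col_b]] = df.loc[rows, [col_b, col_a]].to_numpy()

def replace_column(df: pd.DataFrame, rows: pd.Index, source: str, target: str) -> None:
    """Overwrite the target column with the source column; only the target cell is misfielded."""
    df.loc[rows, target] = df.loc[rows, source]
```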

Datasets

  • IMDB Dataset: Contains information on scenes shot for a movie; one row corresponds to one person in one scene of that movie. With 200,000 rows and 24 columns, it is the dataset with the largest number of rows.
  • Australian Weather Dataset: Contains one weather observation per day per region, including measurements such as temperature and wind speed. It has a total of 29,092 rows and 23 columns.
  • Medical Diabetes Dataset: Contains one row per hospital admission of a patient. With 20,353 rows and 53 columns, it is the dataset with the most columns.

Implementation

We built a Python application for polluting the three datasets. The application consists of a main "Polluter" class, which provides all methods for loading a dataset, modifying it, and exporting the polluted dataset together with the labels. For each of the three datasets, we define a subclass of the "Polluter" class that specifies the order of pollution operations, the columns they are applied to, and dataset-specific transformations such as unit conversions.
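
The structure described above might look roughly like the following sketch; the class and method names here are assumptions and may differ from the actual repository code:

```python
import pandas as pd

class Polluter:
    """Base class: loads a dataset, offers pollution operations, exports data and labels."""

    def __init__(self, path: str):
        self.df = pd.read_csv(path)
        # One label per cell; an empty string means "clean".
        self.labels = pd.DataFrame("", index=self.df.index, columns=self.df.columns)

    def pollute(self) -> None:
        raise NotImplementedError  # each dataset subclass defines its own pollution order

    def export(self, data_path: str, labels_path: str) -> None:
        self.df.to_csv(data_path, index=False)
        self.labels.to_csv(labels_path, index=False)


class WeatherPolluter(Polluter):
    """Dataset-specific subclass: picks columns and dataset-specific transformations."""

    def pollute(self) -> None:
        # e.g. inject duplicates, convert temperature units, switch wind-related columns ...
        pass
```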

To avoid conflicting labels, especially since exact and fuzzy duplicates require labeling every cell in a row, we introduce only one error type per row. This constraint prevents adding misfielded or incorrect values to rows that already contain exact or fuzzy duplicates. Polluting part of the dataset with one error type then mainly involves extracting (cutting) a random slice of the data, polluting the slice, setting the labels, and appending it to a list of polluted slices. This ensures that polluted rows are never available for further pollution. When exporting, we merge the remaining unpolluted data with the polluted slices and shuffle the result.
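
A minimal sketch of this slice-based workflow, assuming pandas DataFrames and an arbitrary pollution callback (all names are illustrative):

```python
import pandas as pd

def cut_and_pollute(df: pd.DataFrame, fraction: float, pollute_fn, seed: int):
    """Cut a random slice, pollute it with one error type, and return remainder + slice."""
    slice_df = df.sample(frac=fraction, random_state=seed)
    remainder = df.drop(slice_df.index)          # polluted rows leave the pool for good
    return remainder, pollute_fn(slice_df)

# Sketch of the export step: merge the clean remainder with all polluted slices and shuffle.
# final = pd.concat([remainder] + polluted_slices).sample(frac=1.0, random_state=seed)
```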

A limitation of introducing only one modification per row is that, in datasets with many columns, it becomes mathematically infeasible to reach a target such as 20% polluted cells; in the diabetes dataset with 53 columns, for example, switching a single pair of columns pollutes only 2 of a row's 53 cells (roughly 3.8%). To address this, we allow multiple modifications of the same error type per row for misfielded and incorrect values. For misfielded data, for instance, we switch several non-overlapping sets of columns within the same row. For incorrect values, we apply multiple independent unit transformations or other value corruptions across different columns in the same slice to increase the overall proportion of polluted cells.
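
Building on the column-switch sketch above, a hypothetical helper for swapping several disjoint column pairs in the same rows could look like this:

```python
import random
import pandas as pd

def switch_many_pairs(df: pd.DataFrame, rows: pd.Index, columns: list, n_pairs: int, seed: int = 0) -> None:
    """Swap several disjoint column pairs within the same rows to raise the polluted-cell ratio."""
    assert len(columns) >= 2 * n_pairs, "not enough columns for the requested number of pairs"
    rng = random.Random(seed)
    shuffled = list(columns)
    rng.shuffle(shuffled)
    for i in range(n_pairs):
        col_a, col_b = shuffled[2 * i], shuffled[2 * i + 1]   # pairs never share a column
        df.loc[rows, [col_a, col_b]] = df.loc[rows, [col_b, col_a]].to_numpy()
```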

Results

Dataset              Total Pollution   Exact Duplicates   Fuzzy Duplicates   Incorrect Values   Misfielded Values
IMDB                 18.65%            1.85%              12.96%             1.62%              2.22%
Australian Weather   18.32%            2.21%              7.37%              8.60%              0.15%
Medical Diabetes     18.70%            3.40%              8.50%              5.69%              1.11%

Usage

Run with Docker

  1. Install Docker and make sure the Docker daemon is running.
  2. Run docker-compose build to build the Docker image from the provided Dockerfile. This only needs to be done once, or whenever you
    • change the Dockerfile,
    • change the docker-compose.yml, or
    • add dependencies and thereby change pyproject.toml or poetry.lock.
  3. Run docker-compose up to start the Docker container. You need to execute this command whenever you make changes to the code base.

Commit Guideline

We use the Conventional Commits Specification v1.0.0 for writing commit messages. Refer to the website for instructions.

Commit Types

We use the recommended commit types from the specification, namely:

  • feat: A code change that introduces a new feature to the codebase (this correlates with MINOR in Semantic Versioning)
  • fix: A code change that patches a bug in your codebase (this correlates with PATCH in Semantic Versioning)
  • refactor: A code change that neither fixes a bug nor adds a feature
  • build: Changes that affect the build system or external dependencies (example scopes: pip, npm)
  • ci: Changes to CI configuration files and scripts (examples: GitHub Actions)
  • docs: Documentation only changes
  • perf: A code change that improves performance
  • test: Adding missing tests or correcting existing tests

How should I voice the commit message?

  • feat: commits: use the imperative, present tense – e.g. feat: add button, not feat: added button nor feat: adds button
  • fix: commits: describe the bug that is being fixed – e.g. fix: button is broken, not fix: repair broken button

What if I introduce a breaking change?

  • Option 1: include an exclamation mark (!) after the commit type to draw attention to a breaking change, e.g.
feat!: send an email to the customer when a product is shipped
  • Option 2: include a breaking change footer, e.g.
feat: allow provided config object to extend other configs

BREAKING CHANGE: `extends` key in config file is now used for extending other config files

What do I do if the commit conforms to more than one of the commit types?

Go back and make multiple commits whenever possible. Part of the benefit of Conventional Commits is its ability to drive us to make more organized commits and PRs.
