
TabularRedundancyPolluter

Project Description

This project focuses on the systematic pollution of tabular data by introducing redundancy-related errors. Given a selected set of datasets, we apply a variety of operations to deliberately inject redundant and repetitive information, simulating common data quality issues found in real-world scenarios.

Redundancy Errors

Exact Duplicates

Entire rows that are identical across all columns, resulting in redundant records. Because every field matches exactly, this error type is the easiest to detect.
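
As a minimal sketch (assuming the data is held in a pandas DataFrame; the function name and parameters are illustrative, not taken from the project code), exact duplicates can be injected like this:

```python
import pandas as pd

def inject_exact_duplicates(df: pd.DataFrame, fraction: float, seed: int = 42) -> pd.DataFrame:
    """Append exact copies of randomly sampled rows, creating fully redundant records."""
    duplicates = df.sample(frac=fraction, random_state=seed)  # rows to copy verbatim
    return pd.concat([df, duplicates], ignore_index=True)
```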

Fuzzy Duplicates

Rows that refer to the same real-world entity but differ slightly in one or more fields, for example due to inconsistent formatting ("2025-05-21" vs. "05/21/25"), abbreviations ("Mathematics 101" vs. "Math 101"), spelling mistakes ("Laura" vs. "Larua"), or different units ("kg" vs. "lbs"). Although these records describe the same entity, they cannot be matched as easily as exact duplicates, which makes detection and removal more complex.
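
For illustration, a fuzzy duplicate can be produced by copying a row and slightly perturbing one field, e.g. by swapping two adjacent characters. The helper below is a hypothetical sketch, not the project's implementation:

```python
import random
import pandas as pd

def make_fuzzy_duplicate(row: pd.Series, column: str, rng: random.Random) -> pd.Series:
    """Copy a row and swap two adjacent characters in one field, e.g. "Laura" -> "Larua"."""
    duplicate = row.copy()
    value = str(duplicate[column])
    if len(value) >= 2:
        i = rng.randrange(len(value) - 1)
        value = value[:i] + value[i + 1] + value[i] + value[i + 2:]
    duplicate[column] = value
    return duplicate
```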

Incorrect Values

In contrast to exact and fuzzy duplicates, incorrect values reflect redundancy at the column level. They do not create duplicated records, but they make it harder for a user to analyze and interpret the tabular dataset. We define two types of incorrect values (both are illustrated in the sketch after this list):

  • semantically-redundant: Values within a single column that represent the same concept or entity in inconsistent ways.
  • incorrect-unit: Values that do not match the column's definition because they have the wrong unit, wrong data type, or wrong value range.
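
A hedged sketch of both incorrect-value types, assuming pandas; the replacement map and conversion factor are illustrative placeholders rather than values used in the project:

```python
import pandas as pd

# Hypothetical replacement map and conversion factor, purely for illustration.
SEMANTIC_VARIANTS = {"United States": "USA"}   # same entity, inconsistent representation
KG_TO_LBS = 2.20462                            # value leaves the unit implied by the column

def inject_semantic_redundancy(df: pd.DataFrame, rows: pd.Index, column: str) -> None:
    """Rewrite selected cells with a semantically equivalent but inconsistent variant."""
    df.loc[rows, column] = df.loc[rows, column].replace(SEMANTIC_VARIANTS)

def inject_wrong_unit(df: pd.DataFrame, rows: pd.Index, column: str) -> None:
    """Convert selected cells to a different unit without changing the column definition."""
    df.loc[rows, column] = df.loc[rows, column].astype(float) * KG_TO_LBS
```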

Misfielded Values

We define values as misfielded when they are placed in the wrong column, breaking the structural integrity of the table and leading to ambiguity. Polluting with misfielded values does not create duplicate records; it only modifies existing ones. We introduce two types of misfielded values. In the first, the contents of two columns are switched; both fields in the row are marked as misfielded. In the second, the contents of one column are replaced with the contents of another column, causing a loss of information in the replaced column and duplicating the contents of the copied column; here, only the cell in the replaced column is marked as misfielded.
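
The two variants could look roughly like this in pandas (a sketch with placeholder function names, not the project's actual code):

```python
import pandas as pd

def switch_columns(df: pd.DataFrame, rows: pd.Index, col_a: str, col_b: str) -> None:
    """Swap the contents of two columns for the given rows; both cells become misfielded."""
    df.loc[rows, [col_a, col_b]] = df.loc[rows, [col_b, col_a]].to_numpy()

def replace_column(df: pd.DataFrame, rows: pd.Index, source: str, target: str) -> None:
    """Overwrite the target column with the source column; only the target cell is misfielded."""
    df.loc[rows, target] = df.loc[rows, source]
```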

Datasets

  • IMDB Dataset: Contains information on scenes shot for a movie; one row corresponds to one person in one scene of that movie. With 200,000 rows and 24 columns, it is the dataset with the largest number of rows.
  • Australian Weather Dataset: Contains one weather observation per day per region, including measurements such as temperature and wind speed. It has a total of 29,092 rows and 23 columns.
  • Medical Diabetes Dataset: Contains one row per hospital admission of a patient. With 20,353 rows and 53 columns, it is the dataset with the most columns.

Implementation

We built a Python application for polluting the three datasets. The application consists of a main "Polluter" class, which provides all methods for loading a dataset, modifying it, and exporting the polluted dataset together with the labels. For each of the three datasets, we define a subclass of the "Polluter" class that specifies the order of pollution operations, the columns they are applied to, and dataset-specific transformations such as unit conversions.
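
The structure described above might look roughly like the following sketch; the class and method names here are assumptions and may differ from the actual repository code:

```python
import pandas as pd

class Polluter:
    """Base class: loads a dataset, offers pollution operations, exports data and labels."""

    def __init__(self, path: str):
        self.df = pd.read_csv(path)
        # One label per cell; an empty string means "clean".
        self.labels = pd.DataFrame("", index=self.df.index, columns=self.df.columns)

    def pollute(self) -> None:
        raise NotImplementedError  # each dataset subclass defines its own pollution order

    def export(self, data_path: str, labels_path: str) -> None:
        self.df.to_csv(data_path, index=False)
        self.labels.to_csv(labels_path, index=False)


class WeatherPolluter(Polluter):
    """Dataset-specific subclass: picks columns and dataset-specific transformations."""

    def pollute(self) -> None:
        # e.g. inject duplicates, convert temperature units, switch wind-related columns ...
        pass
```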

To avoid conflicting labels, especially since exact and fuzzy duplicates require labeling every cell in a row, we introduce only one error type per row. This constraint prevents adding misfielded or incorrect values to rows that already contain exact or fuzzy duplicates. Polluting part of the dataset with one error type then mainly involves extracting (cutting) a random slice of the data, polluting the slice, setting the labels, and appending it to a list of polluted slices. This ensures that polluted rows are never available for further pollution. When exporting, we merge the remaining unpolluted data with the polluted slices and shuffle the result.
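
A minimal sketch of this slice-based workflow, assuming pandas DataFrames and an arbitrary pollution callback (all names are illustrative):

```python
import pandas as pd

def cut_and_pollute(df: pd.DataFrame, fraction: float, pollute_fn, seed: int):
    """Cut a random slice, pollute it with one error type, and return remainder + slice."""
    slice_df = df.sample(frac=fraction, random_state=seed)
    remainder = df.drop(slice_df.index)          # polluted rows leave the pool for good
    return remainder, pollute_fn(slice_df)

# Sketch of the export step: merge the clean remainder with all polluted slices and shuffle.
# final = pd.concat([remainder] + polluted_slices).sample(frac=1.0, random_state=seed)
```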

A limitation of introducing only one modification per row is that, in datasets with many columns, it becomes mathematically infeasible to reach a target such as 20% polluted cells; in the diabetes dataset with 53 columns, for example, switching a single pair of columns pollutes only 2 of a row's 53 cells (roughly 3.8%). To address this, we allow multiple modifications of the same error type per row for misfielded and incorrect values. For misfielded data, for instance, we switch several non-overlapping sets of columns within the same row. For incorrect values, we apply multiple independent unit transformations or other value corruptions across different columns in the same slice to increase the overall proportion of polluted cells.
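
Building on the column-switch sketch above, a hypothetical helper for swapping several disjoint column pairs in the same rows could look like this:

```python
import random
import pandas as pd

def switch_many_pairs(df: pd.DataFrame, rows: pd.Index, columns: list, n_pairs: int, seed: int = 0) -> None:
    """Swap several disjoint column pairs within the same rows to raise the polluted-cell ratio."""
    assert len(columns) >= 2 * n_pairs, "not enough columns for the requested number of pairs"
    rng = random.Random(seed)
    shuffled = list(columns)
    rng.shuffle(shuffled)
    for i in range(n_pairs):
        col_a, col_b = shuffled[2 * i], shuffled[2 * i + 1]   # pairs never share a column
        df.loc[rows, [col_a, col_b]] = df.loc[rows, [col_b, col_a]].to_numpy()
```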

Results

Dataset              Total Pollution   Exact Duplicates   Fuzzy Duplicates   Incorrect Values   Misfielded Values
IMDB                 18.65%            1.85%              12.96%             1.62%              2.22%
Australian Weather   18.32%            2.21%              7.37%              8.60%              0.15%
Medical Diabetes     18.70%            3.40%              8.50%              5.69%              1.11%

Usage

Run with Docker

  1. Install Docker and make sure the Docker daemon is running.
  2. Run docker-compose build to build the Docker image from the provided Dockerfile. This only needs to be done once, or whenever you
    • change the Dockerfile,
    • change the docker-compose.yml, or
    • add dependencies and thereby change pyproject.toml or poetry.lock.
  3. Run docker-compose up to start the Docker container. You need to execute this command whenever you make changes to the code base.

Commit Guideline

We use the Conventional Commits Specification v1.0.0 for writing commit messages. Refer to the website for instructions.

Commit Types

We use the recommended commit types from the specification, namely:

  • feat: A code change that introduces a new feature to the codebase (this correlates with MINOR in Semantic Versioning)
  • fix: A code change that patches a bug in your codebase (this correlates with PATCH in Semantic Versioning)
  • refactor: A code change that neither fixes a bug nor adds a feature
  • build: Changes that affect the build system or external dependencies (example scopes: pip, npm)
  • ci: Changes to CI configuration files and scripts (examples: GitHub Actions)
  • docs: Documentation only changes
  • perf: A code change that improves performance
  • test: Adding missing tests or correcting existing tests

How should I voice the commit message?

  • feat: commits: use the imperative, present tense – e.g. feat: add button, not feat: added button nor feat: adds button
  • fix: commits: describe the bug that is being fixed – e.g. fix: button is broken, not fix: repair broken button

What if I introduce a breaking change?

  • Option 1: include an exclamation mark (!) after the commit type to draw attention to a breaking change, e.g.
feat!: send an email to the customer when a product is shipped
  • Option 2: include a breaking change footer, e.g.
feat: allow provided config object to extend other configs

BREAKING CHANGE: `extends` key in config file is now used for extending other config files

What do I do if the commit conforms to more than one of the commit types?

Go back and make multiple commits whenever possible. Part of the benefit of Conventional Commits is its ability to drive us to make more organized commits and PRs.
