Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -48,6 +48,7 @@ coverage.xml

# Jupyter Notebook
.ipynb_checkpoints
test.ipynb

# pyenv
.python-version
51 changes: 51 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,51 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- **Heuristic reference removal**: Automatically detect and remove bibliographic reference lines based on pattern scoring
- **Batch processing script**: Process multiple markdown files in parallel with `scripts/clean_mds_in_folder.py`
- **Footnote pattern removal**: Remove footnote references in text (e.g., `.1`, `.23`)
- **Enhanced linebreak crimping**: Improved algorithm for fixing line break errors from PDF conversion
  - Connective-based crimping for lines ending with `-`, `–`, `—`, or `...`
  - Justified text crimping for adjacent lines of similar length
- **CLI options**:
  - `--keep-references`: Disable heuristic reference detection
  - `--no-crimping`: Disable linebreak crimping
  - Additional fine-grained control options for cleaning operations
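
As a rough illustration of the two crimping rules listed above, the sketch below joins a line to the next one when it ends with a connective (`-`, `–`, `—`, or `...`), or when two adjacent lines are long and of similar length, as in justified text. This is not the library's implementation: the function name `crimp_lines`, the `min_len` and `tolerance` parameters, and the choice to drop a trailing hyphen when joining are all assumptions made for the example.

```python
CONNECTIVES = ("-", "–", "—", "...")


def crimp_lines(text: str, min_len: int = 60, tolerance: int = 5) -> str:
    """Join lines that were probably split by PDF conversion (illustrative only)."""
    out: list[str] = []
    last_len = 0  # length of the last physical line merged into out[-1]
    for raw in text.split("\n"):
        line = raw.rstrip()
        prev = out[-1] if out else ""
        joined = None
        if prev and line.strip():
            if prev.endswith("-"):
                # Assumption: a trailing hyphen marks a word split across lines.
                joined = prev[:-1] + line.lstrip()
            elif prev.endswith(CONNECTIVES):
                # Dashes and ellipses are kept; the next line follows after a space.
                joined = prev + " " + line.lstrip()
            elif last_len >= min_len and abs(last_len - len(line)) <= tolerance:
                # Justified-text rule: two long neighbouring lines of similar length.
                joined = prev + " " + line.lstrip()
        if joined is None:
            out.append(line)
        else:
            out[-1] = joined
        last_len = len(line)
    return "\n".join(out)
```

A real implementation would also need to skip Markdown constructs that legitimately end in `-`, such as horizontal rules and empty list markers; the sketch makes no attempt to do so.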

### Changed
- **Default patterns**: Updated `default_cleaning_patterns.yaml` with:
  - Improved section removal patterns (e.g., "Authors' Note", "Note on sources")
  - Additional inline patterns (LaTeX footnotes, trailing ellipsis)
  - Refined keyword and conflict of interest patterns
- **API**: The `MarkdownCleaner` constructor now accepts an optional `patterns` parameter (defaults to `None`, which loads the default patterns)
- **Linebreak crimping**: Now enabled by default (`crimp_linebreaks: True`)
- **README**: Comprehensive documentation updates with new features and examples

### Fixed
- Linebreak crimping logic now properly handles various PDF conversion artifacts
- Test suite updated to match new implementation

## [0.2.0] - 2025-03-XX

Initial PyPI release with core markdown cleaning functionality.

### Features
- Remove references, bibliographies, and citations
- Remove copyright notices and legal disclaimers
- Remove acknowledgements and funding information
- Pattern-based text cleaning with customizable YAML configuration
- Command-line interface
- Python API for programmatic use
- Text replacement and pattern removal
- Duplicate headline removal
- Short line removal
- Empty line contraction
- Encoding fix support (mojibake)
- Quotation normalization
140 changes: 124 additions & 16 deletions README.md
@@ -5,14 +5,15 @@ A simple Python tool for cleaning and formatting markdown documents. Default con
## Description

`markdowncleaner` helps you clean up markdown files by removing unwanted content such as:
- References, bibliographies, and citations
- References, bibliographies, and citations (including heuristic detection of bibliographic lines)
- Footnotes and endnote references in text
- Copyright notices and legal disclaimers
- Acknowledgements and funding information
- Author information and contact details
- Specific patterns like DOIs, URLs, and email addresses
- Short lines and excessive whitespace
- Duplicate headlines (for example, because paper title and author names were reprinted on every page of a PDF)
- Erroneous line breaks from PDF conversion

This tool is particularly useful for processing academic papers, books, or any markdown document that needs formatting cleanup.
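
The in-text footnote removal mentioned in the list above targets markers such as `.1` or `.23` glued to the end of a sentence. A minimal sketch of one way to do this with a regular expression follows; the pattern and the name `strip_footnote_markers` are assumptions for illustration and are not taken from the package's default patterns.

```python
import re

# A period directly after a letter or closing bracket, followed by 1-3 digits and
# then whitespace or end of line. Requiring a non-digit before the period leaves
# decimals such as 3.14 untouched.
FOOTNOTE_AFTER_PERIOD = re.compile(r"(?<=[A-Za-z)\]])\.\d{1,3}(?=\s|$)", re.MULTILINE)


def strip_footnote_markers(text: str) -> str:
    """Replace '.<digits>' footnote markers with a plain period (illustrative only)."""
    return FOOTNOTE_AFTER_PERIOD.sub(".", text)
```

A pattern this simple also matches abbreviations followed by a number (for example `Fig.3`), so a production rule would need extra guards.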

@@ -24,7 +25,9 @@ pip install markdowncleaner

## Usage

### Basic Usage
### Python API

#### Basic Usage

```python
from markdowncleaner import MarkdownCleaner
@@ -42,7 +45,7 @@ cleaned_text = cleaner.clean_markdown_string(text)
print(cleaned_text)
```

### Customizing Cleaning Options
#### Customizing Cleaning Options

```python
from markdowncleaner import MarkdownCleaner, CleanerOptions
@@ -51,22 +54,25 @@ from markdowncleaner import MarkdownCleaner, CleanerOptions
options = CleanerOptions()
options.remove_short_lines = True
options.min_line_length = 50 # custom minimum line length
options.remove_duplicate_headlines = False
options.remove_duplicate_headlines = False
options.remove_footnotes_in_text = True
options.contract_empty_lines = True
options.fix_encoding_mojibake = True
options.normalize_quotation_symbols = True

# Initialize cleaner with custom options
cleaner = MarkdownCleaner(options=options)

# Use the cleaner as before
```

### Custom Cleaning Patterns
#### Custom Cleaning Patterns

You can also provide custom cleaning patterns:

```python
from markdowncleaner import MarkdownCleaner, CleaningPatterns
from markdowncleaner import MarkdownCleaner
from markdowncleaner.config.loader import CleaningPatterns
from pathlib import Path

# Load custom patterns from a YAML file
@@ -76,6 +82,102 @@ custom_patterns = CleaningPatterns.from_yaml(Path("my_patterns.yaml"))
cleaner = MarkdownCleaner(patterns=custom_patterns)
```

### Command Line Interface

Clean a single markdown file using the CLI:

```bash
# Basic usage - creates a new file with "_cleaned" suffix
markdowncleaner input.md

# Specify output file
markdowncleaner input.md -o output.md

# Specify output directory
markdowncleaner input.md --output-dir cleaned_files/

# Use custom configuration
markdowncleaner input.md --config my_patterns.yaml

# Enable encoding fixes and quotation normalization
markdowncleaner input.md --fix-encoding --normalize-quotation

# Customize line length threshold
markdowncleaner input.md --min-line-length 50

# Disable specific cleaning operations
markdowncleaner input.md --keep-short-lines --keep-sections --keep-footnotes

# Disable replacements and inline pattern removal
markdowncleaner input.md --no-replacements --keep-inline-patterns

# Disable formatting operations
markdowncleaner input.md --no-crimping --keep-empty-lines

# Keep references (disable heuristic reference detection)
markdowncleaner input.md --keep-references
```

**Available CLI Options:**

- `-o`, `--output`: Path to save the cleaned markdown file
- `--output-dir`: Directory to save the cleaned file
- `--config`: Path to custom YAML configuration file
- `--fix-encoding`: Fix encoding mojibake issues
- `--normalize-quotation`: Normalize quotation symbols to standard ASCII
- `--keep-short-lines`: Don't remove lines shorter than minimum length
- `--min-line-length`: Minimum line length to keep (default: 70)
- `--keep-bad-lines`: Don't remove lines matching bad line patterns
- `--keep-sections`: Don't remove sections like References, Acknowledgements
- `--keep-duplicate-headlines`: Don't remove duplicate headlines
- `--keep-footnotes`: Don't remove footnote references in text
- `--no-replacements`: Don't perform text replacements
- `--keep-inline-patterns`: Don't remove inline patterns like citations
- `--keep-empty-lines`: Don't contract consecutive empty lines
- `--no-crimping`: Don't crimp linebreaks (fix line break errors from PDF conversion)
- `--keep-references`: Don't heuristically detect and remove bibliographic reference lines

### Batch Processing Script

To process all markdown files in a folder and its subfolders, use the included batch processing script:

```bash
# Basic usage - will prompt for confirmation
python scripts/clean_mds_in_folder.py documents/

# Skip confirmation prompt
python scripts/clean_mds_in_folder.py documents/ --yes

# Use 8 parallel workers (default is your CPU count)
python scripts/clean_mds_in_folder.py documents/ --workers 8

# Use custom cleaning patterns
python scripts/clean_mds_in_folder.py documents/ --config my_patterns.yaml

# Combine options
python scripts/clean_mds_in_folder.py documents/ --yes --workers 4
```

**Features:**
- Recursively finds all `.md` files in the specified folder and subfolders
- Processes files in parallel across multiple CPU cores
- Shows real-time progress bar with `tqdm`
- Cleans files in-place (modifies original files)
- Asks for confirmation before processing (unless `--yes` is used)
- Continues processing even if some files fail
- Reports all successful and failed files at the end

**Script Options:**
- `folder`: Path to folder containing markdown files (required)
- `-y`, `--yes`: Skip confirmation prompt and proceed immediately
- `-w`, `--workers`: Number of parallel workers (default: CPU count)
- `--config`: Path to custom YAML configuration file
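
The behaviour described above (recursive discovery, in-place cleaning, continuing past failures, and a final report) can be sketched with the standard library alone. This is not the contents of `scripts/clean_mds_in_folder.py`: the function `clean_folder` and its `clean_text` callback are hypothetical, a thread pool is swapped in for the script's CPU-based workers to keep the example short, and the `tqdm` progress bar and confirmation prompt are left out.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def clean_folder(folder: Path, clean_text: Callable[[str], str], workers: int = 4):
    """Clean every .md file under `folder` in place; return (succeeded, failed)."""

    def clean_one(path: Path) -> Path:
        # Read, clean, and overwrite the original file.
        path.write_text(clean_text(path.read_text(encoding="utf-8")), encoding="utf-8")
        return path

    files = sorted(folder.rglob("*.md"))  # recursive search, subfolders included
    succeeded: list[Path] = []
    failed: list[tuple[Path, str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(clean_one, f): f for f in files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                succeeded.append(future.result())
            except Exception as exc:  # keep going when a single file fails
                failed.append((path, str(exc)))
    return sorted(succeeded), failed
```

In practice `clean_text` could be a bound method such as `cleaner.clean_markdown_string` from the Python API shown earlier in this README.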

**Note:** Requires `tqdm` for the progress bar:
```bash
pip install tqdm
```

## Configuration

The default cleaning patterns are defined in `default_cleaning_patterns.yaml` and include:
@@ -88,16 +190,22 @@ The default cleaning patterns are defined in `default_cleaning_patterns.yaml` an

## Options

- `remove_short_lines`: Remove lines shorter than `min_line_length` (default: 70 characters)
- `remove_whole_lines`: Remove lines matching specific patterns
- `remove_sections`: Remove entire sections based on section headings
- `remove_duplicate_headlines`: Remove duplicate headlines based on threshold
- `remove_duplicate_headlines_threshold`: Threshold for duplicate headline removal
- `remove_footnotes_in_text`: Remove footnote references
- `replace_within_lines`: Replace specific patterns within lines
- `remove_within_lines`: Remove specific patterns within lines
- `contract_empty_lines`: Normalize whitespace
- `crimp_linebreaks`: Improve line break formatting
All available `CleanerOptions`:

- `fix_encoding_mojibake`: Fix encoding issues and mojibake using ftfy (default: False)
- `normalize_quotation_symbols`: Normalize various quotation marks to standard ASCII quotes (default: False)
- `remove_short_lines`: Remove lines shorter than `min_line_length` (default: True)
- `min_line_length`: Minimum line length to keep when `remove_short_lines` is enabled (default: 70)
- `remove_whole_lines`: Remove lines matching specific patterns (default: True)
- `remove_sections`: Remove entire sections based on section headings (default: True)
- `remove_duplicate_headlines`: Remove duplicate headlines based on threshold (default: True)
- `remove_duplicate_headlines_threshold`: Number of occurrences needed to consider a headline duplicate (default: 2)
- `remove_footnotes_in_text`: Remove footnote references like ".1" or ".23" (default: True)
- `replace_within_lines`: Replace specific patterns within lines (default: True)
- `remove_within_lines`: Remove specific patterns within lines (default: True)
- `contract_empty_lines`: Reduce multiple consecutive empty lines to one (default: True)
- `crimp_linebreaks`: Fix line break errors from PDF conversion (default: True)
- `remove_references_heuristically`: Heuristically detect and remove bibliographic reference lines by scoring lines based on bibliographic patterns (default: True)
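
To make the idea behind `remove_references_heuristically` concrete, here is a minimal sketch of pattern-based scoring: each bibliographic cue found in a line adds one point, and lines at or above a threshold are dropped. The cue list, the threshold of 3, and the function names are assumptions chosen for the example; the package's actual patterns and scoring may differ.

```python
import re

# Hypothetical bibliographic cues; each one that matches adds a point to the score.
REFERENCE_CUES = [
    re.compile(r"\(\d{4}[a-z]?\)"),                 # year in parentheses: (2019)
    re.compile(r"\b[A-Z][a-z]+, [A-Z]\."),          # "Surname, I." author form
    re.compile(r"\bpp?\. ?\d+"),                    # page markers: p. 12, pp. 3
    re.compile(r"\b\d+\(\d+\)"),                    # volume(issue): 12(3)
    re.compile(r"\b\d+[-–]\d+\b"),                  # page range: 101-120
    re.compile(r"doi\.org|\bdoi:", re.IGNORECASE),  # DOI
]


def reference_score(line: str) -> int:
    """Count how many bibliographic cues appear in the line."""
    return sum(1 for cue in REFERENCE_CUES if cue.search(line))


def remove_reference_lines(text: str, threshold: int = 3) -> str:
    """Drop lines whose score reaches the threshold (illustrative only)."""
    kept = [line for line in text.split("\n") if reference_score(line) < threshold]
    return "\n".join(kept)
```

Requiring several cues at once keeps ordinary prose that merely mentions a year or a page number from being removed.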

## License
