josk0 · josk0 · Oct 23, 2025 · Oct 21, 2025 · Oct 22, 2025 · Oct 22, 2025
diff --git a/.gitignore b/.gitignore
@@ -48,6 +48,7 @@ coverage.xml
 
 # Jupyter Notebook
 .ipynb_checkpoints
+test.ipynb
 
 # pyenv
 .python-version

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,51 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+### Added
+- **Heuristic reference removal**: Automatically detect and remove bibliographic reference lines based on pattern scoring
+- **Batch processing script**: Process multiple markdown files in parallel with `scripts/clean_mds_in_folder.py`
+- **Footnote pattern removal**: Remove footnote references in text (e.g., `.1`, `.23`)
+- **Enhanced linebreak crimping**: Improved algorithm for fixing line break errors from PDF conversion
+  - Connective-based crimping for lines ending with `-`, `–`, `—`, or `...`
+  - Justified text crimping for adjacent lines of similar length
+- **CLI options**:
+  - `--keep-references`: Disable heuristic reference detection
+  - `--no-crimping`: Disable linebreak crimping
+  - Additional fine-grained control options for cleaning operations
+
+### Changed
+- **Default patterns**: Updated `default_cleaning_patterns.yaml` with:
+  - Improved section removal patterns (e.g., "Authors' Note", "Note on sources")
+  - Additional inline patterns (LaTeX footnotes, trailing ellipsis)
+  - Refined keyword and conflict of interest patterns
+- **API**: `MarkdownCleaner` constructor now accepts an optional `patterns` parameter (`None`, the default, loads the default patterns)
+- **Linebreak crimping**: Now enabled by default (`crimp_linebreaks: True`)
+- **README**: Documented the command-line interface, the batch processing script, and the defaults of all `CleanerOptions`
+
+### Fixed
+- Linebreak crimping logic now properly handles various PDF conversion artifacts
+- Test suite updated to match new implementation
+
+## [0.2.0] - 2025-03-XX
+
+Initial PyPI release with core markdown cleaning functionality.
+
+### Features
+- Remove references, bibliographies, and citations
+- Remove copyright notices and legal disclaimers
+- Remove acknowledgements and funding information
+- Pattern-based text cleaning with customizable YAML configuration
+- Command-line interface
+- Python API for programmatic use
+- Text replacement and pattern removal
+- Duplicate headline removal
+- Short line removal
+- Empty line contraction
+- Encoding fix support (mojibake)
+- Quotation normalization
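
The connective-based crimping listed under `### Added` can be pictured with a minimal sketch. This is not the package's implementation: the function name `crimp_connective_linebreaks` and the joining rules are assumptions made for illustration. Only the set of line endings (`-`, `–`, `—`, `...`) comes from the changelog entry.

```python
# Line endings that suggest the sentence continues on the next line.
# The endings come from the changelog entry; everything else is assumed.
CONNECTIVE_ENDINGS = ("-", "–", "—", "...")


def crimp_connective_linebreaks(text: str) -> str:
    """Join a line to the following one when it ends with a connective."""
    crimped: list[str] = []
    for line in text.split("\n"):
        head = crimped[-1].rstrip() if crimped else ""
        joinable = (
            head.endswith(CONNECTIVE_ENDINGS)
            # Skip lines without letters or digits, e.g. a "---" rule.
            and any(ch.isalnum() for ch in head)
            # Never pull the next paragraph across an empty line.
            and line.strip() != ""
        )
        if not joinable:
            crimped.append(line)
        elif head.endswith("-"):
            # Assumption: a trailing ASCII hyphen marks a word split by
            # the PDF layout, so the hyphen is dropped when joining.
            crimped[-1] = head[:-1] + line.lstrip()
        else:
            # Dashes and ellipses keep their character; join with a space.
            crimped[-1] = head + " " + line.lstrip()
    return "\n".join(crimped)
```

Dropping the hyphen is a simplification: it also merges genuine compounds such as `well-` / `known`, which a real implementation would have to treat more carefully.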
diff --git a/README.md b/README.md
@@ -5,14 +5,15 @@ A simple Python tool for cleaning and formatting markdown documents. Default con
 ## Description
 
 `markdowncleaner` helps you clean up markdown files by removing unwanted content such as:
-- References, bibliographies, and citations
+- References, bibliographies, and citations (including heuristic detection of bibliographic lines)
 - Footnotes and endnote references in text
 - Copyright notices and legal disclaimers
 - Acknowledgements and funding information
 - Author information and contact details
 - Specific patterns like DOIs, URLs, and email addresses
 - Short lines and excessive whitespace
 - Duplicate headlines (for example, because paper title and author names were reprinted on every page of a PDF)
+- Erroneous line breaks from PDF conversion
 
 This tool is particularly useful for processing academic papers, books, or any markdown document that needs formatting cleanup.
 
@@ -24,7 +25,9 @@ pip install markdowncleaner
 
 ## Usage
 
-### Basic Usage
+### Python API
+
+#### Basic Usage
 
 ```python
 from markdowncleaner import MarkdownCleaner
@@ -42,7 +45,7 @@ cleaned_text = cleaner.clean_markdown_string(text)
 print(cleaned_text)
 ```
 
-### Customizing Cleaning Options
+#### Customizing Cleaning Options
 
 ```python
 from markdowncleaner import MarkdownCleaner, CleanerOptions
@@ -51,22 +54,25 @@ from markdowncleaner import MarkdownCleaner, CleanerOptions
 options = CleanerOptions()
 options.remove_short_lines = True
 options.min_line_length = 50  # custom minimum line length
-options.remove_duplicate_headlines = False 
+options.remove_duplicate_headlines = False
 options.remove_footnotes_in_text = True
 options.contract_empty_lines = True
+options.fix_encoding_mojibake = True
+options.normalize_quotation_symbols = True
 
 # Initialize cleaner with custom options
 cleaner = MarkdownCleaner(options=options)
 
 # Use the cleaner as before
 ```
 
-### Custom Cleaning Patterns
+#### Custom Cleaning Patterns
 
 You can also provide custom cleaning patterns:
 
 ```python
-from markdowncleaner import MarkdownCleaner, CleaningPatterns
+from markdowncleaner import MarkdownCleaner
+from markdowncleaner.config.loader import CleaningPatterns
 from pathlib import Path
 
 # Load custom patterns from a YAML file
@@ -76,6 +82,102 @@ custom_patterns = CleaningPatterns.from_yaml(Path("my_patterns.yaml"))
 cleaner = MarkdownCleaner(patterns=custom_patterns)
 ```
 
+### Command Line Interface
+
+Clean a single markdown file using the CLI:
+
+```bash
+# Basic usage - creates a new file with "_cleaned" suffix
+markdowncleaner input.md
+
+# Specify output file
+markdowncleaner input.md -o output.md
+
+# Specify output directory
+markdowncleaner input.md --output-dir cleaned_files/
+
+# Use custom configuration
+markdowncleaner input.md --config my_patterns.yaml
+
+# Enable encoding fixes and quotation normalization
+markdowncleaner input.md --fix-encoding --normalize-quotation
+
+# Customize line length threshold
+markdowncleaner input.md --min-line-length 50
+
+# Disable specific cleaning operations
+markdowncleaner input.md --keep-short-lines --keep-sections --keep-footnotes
+
+# Disable replacements and inline pattern removal
+markdowncleaner input.md --no-replacements --keep-inline-patterns
+
+# Disable formatting operations
+markdowncleaner input.md --no-crimping --keep-empty-lines
+
+# Keep references (disable heuristic reference detection)
+markdowncleaner input.md --keep-references
+```
+
+**Available CLI Options:**
+
+- `-o`, `--output`: Path to save the cleaned markdown file
+- `--output-dir`: Directory to save the cleaned file
+- `--config`: Path to custom YAML configuration file
+- `--fix-encoding`: Fix encoding mojibake issues
+- `--normalize-quotation`: Normalize quotation symbols to standard ASCII quotes
+- `--keep-short-lines`: Don't remove lines shorter than minimum length
+- `--min-line-length`: Minimum line length to keep (default: 70)
+- `--keep-bad-lines`: Don't remove lines matching bad line patterns
+- `--keep-sections`: Don't remove sections like References, Acknowledgements
+- `--keep-duplicate-headlines`: Don't remove duplicate headlines
+- `--keep-footnotes`: Don't remove footnote references in text
+- `--no-replacements`: Don't perform text replacements
+- `--keep-inline-patterns`: Don't remove inline patterns like citations
+- `--keep-empty-lines`: Don't contract consecutive empty lines
+- `--no-crimping`: Don't crimp linebreaks (skips the fix for line break errors from PDF conversion)
+- `--keep-references`: Don't heuristically detect and remove bibliographic reference lines
+
+### Batch Processing Script
+
+To process every markdown file in a folder and its subfolders, use the included batch processing script:
+
+```bash
+# Basic usage - will prompt for confirmation
+python scripts/clean_mds_in_folder.py documents/
+
+# Skip confirmation prompt
+python scripts/clean_mds_in_folder.py documents/ --yes
+
+# Use 8 parallel workers (default is your CPU count)
+python scripts/clean_mds_in_folder.py documents/ --workers 8
+
+# Use custom cleaning patterns
+python scripts/clean_mds_in_folder.py documents/ --config my_patterns.yaml
+
+# Combine options
+python scripts/clean_mds_in_folder.py documents/ --yes --workers 4
+```
+
+**Features:**
+- Recursively finds all `.md` files in the specified folder and subfolders
+- Processes files in parallel across multiple CPU cores
+- Shows real-time progress bar with `tqdm`
+- Cleans files in-place (modifies original files)
+- Asks for confirmation before processing (unless `--yes` is used)
+- Continues processing even if some files fail
+- Reports all successful and failed files at the end
+
+**Script Options:**
+- `folder`: Path to folder containing markdown files (required)
+- `-y`, `--yes`: Skip confirmation prompt and proceed immediately
+- `-w`, `--workers`: Number of parallel workers (default: CPU count)
+- `--config`: Path to custom YAML configuration file
+
+**Note:** Requires `tqdm` for the progress bar:
+```bash
+pip install tqdm
+```
+
 ## Configuration
 
 The default cleaning patterns are defined in `default_cleaning_patterns.yaml` and include:
@@ -88,16 +190,22 @@ The default cleaning patterns are defined in `default_cleaning_patterns.yaml` an
 
 ## Options
 
-- `remove_short_lines`: Remove lines shorter than `min_line_length` (default: 70 characters)
-- `remove_whole_lines`: Remove lines matching specific patterns
-- `remove_sections`: Remove entire sections based on section headings
-- `remove_duplicate_headlines`: Remove duplicate headlines based on threshold
-- `remove_duplicate_headlines_threshold`: Threshold for duplicate headline removal
-- `remove_footnotes_in_text`: Remove footnote references
-- `replace_within_lines`: Replace specific patterns within lines
-- `remove_within_lines`: Remove specific patterns within lines
-- `contract_empty_lines`: Normalize whitespace
-- `crimp_linebreaks`: Improve line break formatting
+All available `CleanerOptions`:
+
+- `fix_encoding_mojibake`: Fix encoding issues and mojibake using ftfy (default: False)
+- `normalize_quotation_symbols`: Normalize various quotation marks to standard ASCII quotes (default: False)
+- `remove_short_lines`: Remove lines shorter than `min_line_length` (default: True)
+- `min_line_length`: Minimum line length to keep when `remove_short_lines` is enabled (default: 70)
+- `remove_whole_lines`: Remove lines matching specific patterns (default: True)
+- `remove_sections`: Remove entire sections based on section headings (default: True)
+- `remove_duplicate_headlines`: Remove duplicate headlines based on threshold (default: True)
+- `remove_duplicate_headlines_threshold`: Number of occurrences needed to consider a headline duplicate (default: 2)
+- `remove_footnotes_in_text`: Remove footnote references like ".1" or ".23" (default: True)
+- `replace_within_lines`: Replace specific patterns within lines (default: True)
+- `remove_within_lines`: Remove specific patterns within lines (default: True)
+- `contract_empty_lines`: Reduce multiple consecutive empty lines to one (default: True)
+- `crimp_linebreaks`: Fix line break errors from PDF conversion (default: True)
+- `remove_references_heuristically`: Detect and remove bibliographic reference lines by scoring each line against bibliographic patterns (default: True)
 
 ## License