A versatile Python library for generalizing and streamlining the processing of diverse file types. It provides a unified File
class that uses the Strategy Pattern to select appropriate processors based on file extensions. The library supports over 20 unique file processors and is designed for easy extensibility.
- Features
- Installation
- Quick Start
- Supported File Types
- Optional Features
- Architecture
- Extending the Library
- Contributing
- License
- Contact
- Unified File Interface: Interact with different file types using a single
File
class. - Strategy Pattern Implementation: Dynamically selects the appropriate processor based on file extensions.
- Extensible Design: Easily add support for new file types by creating custom processors.
- Metadata Extraction: Extracts comprehensive metadata and text content from files.
- Lazy Loading for Optional Features: Supports optional OCR and transcription capabilities via
file-processing-ocr
andfile-processing-transcription
.
To install the file-processing
library from GitHub (since it's not packaged yet), use pip with the repository URL:
pip install git+https://github.com/hc-sc-ocdo-bdpd/file-processing.git
Note: Optional dependencies for OCR and transcription are available through file-processing-ocr
and file-processing-transcription
.
Here's how to get started with file-processing
:
from file_processing import File
# Initialize a File object
file = File('path/to/your/file.pdf')
# Access metadata
print(f"File Name: {file.file_name}")
print(f"File Size: {file.size} bytes")
print(f"Owner: {file.owner}")
# Access extracted text (if applicable)
print(f"Text Content: {file.metadata.get('text', 'No text extracted')}")
The library supports a wide range of file types:
- Text-Based Files:
.txt
,.csv
,.json
,.xml
,.html
,.py
,.ipynb
,.gitignore
- Documents:
.pdf
,.docx
,.rtf
,.xlsx
,.pptx
,.msg
- Images:
.png
,.jpg
,.jpeg
,.gif
,.tif
,.tiff
,.heic
,.heif
- Audio/Video Files:
.mp3
,.wav
,.mp4
,.flac
,.aiff
,.ogg
- Archives:
.zip
- Packaged Software:
.whl
,.exe
- Model Files:
.gguf
(used withfile-processing-models
)
The file-processing
library can be extended with OCR and transcription capabilities by installing additional packages:
- OCR: file-processing-ocr
- Transcription: file-processing-transcription
The library utilizes the Strategy Pattern to select the appropriate processor based on the file extension. Here's how it works:
- The
File
class acts as a context that delegates the processing to a specificFileProcessorStrategy
. - Each file type has a corresponding processor class that implements the
FileProcessorStrategy
interface. - If a file type is not explicitly supported, a
GenericFileProcessor
is used as a fallback.
To add support for a new file type:
-
Create a New Processor Class:
from file_processing.file_processor_strategy import FileProcessorStrategy class CustomFileProcessor(FileProcessorStrategy): def __init__(self, file_path: str, open_file: bool = True) -> None: super().__init__(file_path, open_file) self.metadata = {} def process(self) -> None: # Implement processing logic pass def save(self, output_path: str = None) -> None: # Implement save logic pass
-
Register the New Processor in
file.py
:Add your new processor to the
PROCESSORS
dictionary infile_processing/file.py
:File.PROCESSORS['.custom_extension'] = CustomFileProcessor
-
Update the
__init__.py
File:Add an import statement for your new processor in
file_processing/processors/__init__.py
:from .custom_processor import CustomFileProcessor
Following these steps ensures your new processor is correctly integrated with the file-processing
library.
We welcome contributions from the community. If you'd like to contribute:
- Fork the Repository: Create your own fork on GitHub.
- Create a Feature Branch: Work on your feature or bug fix in a separate branch.
- Write Tests: Ensure your changes are covered by tests.
- Submit a Pull Request: When you're ready, submit a PR for review.
This project is licensed under the MIT License.
For questions or support, please contact:
- Email: [email protected]
We are committed to fostering collaboration and innovation within Health Canada and beyond. Explore our repository, contribute, or get in touch to learn more about our work.