Community maintained fork of pdfminer - we fathom PDF - RTL Version

a-on-seaside/pdfminerrtl

 
 

pdfminer.rtl

This is a fork of pdfminer.six that attempts to add right-to-left (RTL) text support using python-bidi. It is experimental and probably buggy, so please don't rely on it for critical projects.

Check out the full original documentation on Read the Docs.

Features

  • RTL support (added in this fork).
  • Written entirely in Python.
  • Parse, analyze, and convert PDF documents.
  • Extract content as text, images, html or hOCR.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Support for extracting images (JPG, JBIG2, Bitmaps).
  • Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode).
  • Support for RC4 and AES encryption.
  • Support for AcroForm interactive form extraction.
  • Table of contents extraction.
  • Tagged contents extraction.
  • Automatic layout analysis.

How to use

  • Install Python 3.8 or newer.

  • Install pdfminer.rtl.

    pip install pdfminer.rtl

  • (Optionally) install extra dependencies for extracting images.

    pip install 'pdfminer.rtl[image]'

  • Use the command-line interface to extract text from a PDF.

    pdf2txt.py example.pdf

  • Or use it with Python.

    from pdfminer.high_level import extract_text

    text = extract_text("example.pdf")
    print(text)
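Because RTL handling in this fork is experimental, it can help to check whether the extracted text actually contains right-to-left characters before inspecting its ordering. The following is a minimal sketch that uses only the standard library's `unicodedata` module; the helper name `rtl_ratio` is hypothetical and is not part of pdfminer.rtl.

```python
import unicodedata

# Unicode bidirectional classes for strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def rtl_ratio(text: str) -> float:
    """Return the share of strongly directional characters that are RTL.

    Hypothetical helper for illustration; not provided by pdfminer.rtl.
    Neutral characters (spaces, digits, punctuation) are ignored.
    """
    rtl = 0
    ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            rtl += 1
        elif bidi_class == "L":
            ltr += 1
    total = rtl + ltr
    # Text with no strongly directional characters counts as 0.0.
    return rtl / total if total else 0.0
```

Passing the string returned by `extract_text("example.pdf")` to `rtl_ratio` gives a rough indication of how much of the document is written in an RTL script. It says nothing about whether the characters came out in the correct visual order.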

Acknowledgement

This repository includes code from pyHanko; the original license has been included here. Thanks also to all the other contributors of the original project; see here.
