OCR Tibetan PDF to txt pipeline converts to jpg first, improving pipeline performance #2

Open
bri25yu wants to merge 4 commits into keutzer:master from bri25yu:master

bri25yu (Contributor) commented Sep 8, 2022

Transcribing a single ~500-page PDF takes around 10 min: ~1 min to convert the PDF to a series of JPG images, ~3 min to upload the JPG images to GCS, and ~5 min to perform OCR. Performing OCR with Google Cloud Vision costs ~$1.
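
To put the figures above in per-page terms, here is a rough back-of-the-envelope sketch. It uses only the approximate numbers quoted in this description. The names `STAGE_MINUTES` and `summarize` are hypothetical and are not part of the PR, and it assumes the ~$1 OCR cost refers to one ~500-page PDF.

```python
# Approximate stage timings quoted in the PR description,
# in minutes for a single ~500-page PDF.
STAGE_MINUTES = {
    "convert PDF to JPG": 1.0,
    "upload JPGs to GCS": 3.0,
    "perform OCR": 5.0,
}
PAGES = 500          # approximate page count from the description
OCR_COST_USD = 1.0   # assumption: the ~$1 figure is per ~500-page PDF


def summarize(stage_minutes, pages, cost_usd):
    """Derive per-page time, per-page cost, and each stage's share of the total."""
    total = sum(stage_minutes.values())
    return {
        "total_minutes": total,
        "seconds_per_page": total * 60 / pages,
        "cost_per_page_usd": cost_usd / pages,
        "share_by_stage": {name: m / total for name, m in stage_minutes.items()},
    }


summary = summarize(STAGE_MINUTES, PAGES, OCR_COST_USD)
print(f"total: {summary['total_minutes']:.0f} min")
print(f"per page: {summary['seconds_per_page']:.2f} s, ${summary['cost_per_page_usd']:.4f}")
for name, share in summary["share_by_stage"].items():
    print(f"{name}: {share:.0%} of pipeline time")
```

The three listed stages add up to about 9 minutes, in line with the "around 10 min" total, and under these figures OCR is the largest single stage, followed by the upload to GCS.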

This PR fixes the random ASCII mumbo jumbo that appeared at the end of each page.
