Doc-Intel-Transcription

Template for using Document Intelligence to extract text from non-OCR PDFs (aka AI-aided transcription).

Original code by Neil Aitken, UBC Digital Scholarship in the Arts

Summary

Sometimes we want to extract text from a PDF (or a group of PDFs) containing scanned pages whose text is not machine-readable (e.g. a non-OCR PDF). Using Azure Document Intelligence (a Microsoft Azure cloud service), we can extract this text automatically, replacing a transcription process that could otherwise take hundreds of hours to do manually.
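As a rough illustration of the extraction step described above, here is a minimal sketch. It is not the code in doc_intel_analyze.py. It assumes the azure-ai-formrecognizer package (version 3.2 or later) and the prebuilt "prebuilt-read" model; the function names make_client and extract_text are hypothetical.

```python
def make_client(endpoint: str, key: str):
    """Build a Document Intelligence client (assumes azure-ai-formrecognizer is installed)."""
    # Imported lazily so the rest of this sketch can be read and run without the SDK.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    return DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))


def extract_text(client, pdf_path) -> str:
    """Send one PDF to the prebuilt read model and return the recognised text."""
    with open(pdf_path, "rb") as pdf_file:
        # "prebuilt-read" is an assumption; other prebuilt models also return text.
        poller = client.begin_analyze_document("prebuilt-read", document=pdf_file)
        result = poller.result()  # blocks until the cloud service has finished
    return result.content


# Hypothetical usage (requires a real Endpoint and Key):
#   client = make_client("<your endpoint>", "<your key>")
#   print(extract_text(client, "scan.pdf"))
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before spending any Azure quota.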

Features

The provided files include:

doc_intel_analyze.py: a simplified version of the example code Microsoft offers for calling Document Intelligence; and,
di_transcription_template.py: a sample main file that includes both the transcription call and a framework with some example functions for post-extraction text-cleaning.
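To give a sense of what post-extraction text-cleaning can look like, here is a small sketch of a cleaning pipeline. The functions below are hypothetical examples, not the ones shipped in di_transcription_template.py, and the right set of steps depends on your documents.

```python
import re


def drop_page_numbers(text: str) -> str:
    """Remove lines that contain nothing but digits (likely page numbers)."""
    kept = [line for line in text.splitlines() if not re.fullmatch(r"\s*\d+\s*", line)]
    return "\n".join(kept)


def rejoin_hyphenated(text: str) -> str:
    """Join words split across a line break, e.g. 'transcrip-' + 'tion'.

    Caveat: this also merges genuine hyphenated compounds that happen to
    break at the end of a line.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def collapse_whitespace(text: str) -> str:
    """Squeeze runs of spaces/tabs and limit blank lines to one."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


# The order matters: page numbers are removed before line breaks are repaired.
CLEANING_STEPS = [drop_page_numbers, rejoin_hyphenated, collapse_whitespace]


def clean_text(text: str, steps=CLEANING_STEPS) -> str:
    """Apply each cleaning step in turn."""
    for step in steps:
        text = step(text)
    return text
```

Keeping each step as its own function makes it easy to add, remove, or reorder steps for a particular collection.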

Instructions (Jupyter or Google Colab Notebook)

If you want to test this process without installing anything new, you can use the Notebook file (for either Google Colab or Jupyter Notebook). Running large batches will be significantly slower in a web-based notebook, though.

1. Set up Document Intelligence on Microsoft Azure.
2. Download Demo_DocIntelligence_Transcription.ipynb.
3. Set your Endpoint and Key in the file.
4. Follow the steps in the Notebook.
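If you would rather not leave your Endpoint and Key typed into a file you might share, one option is to read them from environment variables. This is only a sketch: the variable names DOCINTEL_ENDPOINT and DOCINTEL_KEY and the function load_credentials are hypothetical and are not used by the Notebook or the scripts.

```python
import os

# Hypothetical variable names; choose whatever suits your setup.
ENDPOINT_VAR = "DOCINTEL_ENDPOINT"
KEY_VAR = "DOCINTEL_KEY"


def load_credentials(env=None):
    """Return (endpoint, key) from the environment, or raise if either is missing."""
    env = os.environ if env is None else env
    missing = [name for name in (ENDPOINT_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return env[ENDPOINT_VAR], env[KEY_VAR]
```

The returned pair can then be assigned wherever the Endpoint and Key are expected.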

Instructions (Local Install)

If you expect to process a lot of files, we recommend setting up a local installation of the code and running it on a higher-end computer.

1. Set up Document Intelligence on Microsoft Azure.
2. Download doc_intel_analyze.py and di_transcription_template.py.
3. In doc_intel_analyze.py, update ENDPOINT and API_KEY to use the Endpoint and Key values from your Document Intelligence resource on Microsoft Azure (refer to step 1).
4. Place any PDFs you wish to transcribe into the main folder where you've installed this project on your machine. To use another folder, change SOURCE_DIR as described in step 5.
5. Create folders for the transcribed text files and the cleaned text files (data and cleaned respectively). If you use other directory names, or if the original PDFs are stored elsewhere, update the SOURCE_DIR, DATA_DIR, and CLEANED_DIR entries in di_transcription_template.py.

   md data
   md cleaned

   On macOS or Linux, use mkdir instead of md.

6. Run the script from the command line (substitute the new file name if you have renamed di_transcription_template.py):

   python di_transcription_template.py
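The folder layout above (source PDFs, a data folder for raw transcriptions, a cleaned folder for tidied text) can be sketched as a simple batch loop. This is an illustration under assumptions, not the contents of di_transcription_template.py: the function transcribe_folder is hypothetical, and the transcribe and clean arguments stand in for whatever extraction and cleaning functions you use. Only the directory names come from the instructions above.

```python
from pathlib import Path

SOURCE_DIR = Path(".")        # where the PDFs live
DATA_DIR = Path("data")       # raw transcriptions
CLEANED_DIR = Path("cleaned")  # cleaned transcriptions


def transcribe_folder(transcribe, clean,
                      source_dir=SOURCE_DIR, data_dir=DATA_DIR, cleaned_dir=CLEANED_DIR):
    """Transcribe every PDF in source_dir and return the names of the files processed."""
    source_dir, data_dir, cleaned_dir = Path(source_dir), Path(data_dir), Path(cleaned_dir)
    # Create the output folders if they do not exist yet.
    data_dir.mkdir(parents=True, exist_ok=True)
    cleaned_dir.mkdir(parents=True, exist_ok=True)

    processed = []
    # Note: "*.pdf" is case-sensitive on Linux, so "SCAN.PDF" would be skipped there.
    for pdf_path in sorted(source_dir.glob("*.pdf")):
        raw_path = data_dir / (pdf_path.stem + ".txt")
        if raw_path.exists():
            # Reuse an earlier transcription rather than calling the service again.
            raw_text = raw_path.read_text(encoding="utf-8")
        else:
            raw_text = transcribe(pdf_path)
            raw_path.write_text(raw_text, encoding="utf-8")
        cleaned_path = cleaned_dir / (pdf_path.stem + ".txt")
        cleaned_path.write_text(clean(raw_text), encoding="utf-8")
        processed.append(pdf_path.name)
    return processed
```

Saving the raw text before cleaning means you can adjust the cleaning steps and rerun them without paying for a second transcription of the same file.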

