Doc-Intel-Transcription

Template for using Document Intelligence to extract text from non-OCR PDFs (aka AI-aided transcription).

Original code by Neil Aitken, UBC Digital Scholarship in the Arts

Summary

Sometimes we want to extract text from a PDF (or a group of PDFs) containing scanned text that is not machine-readable (e.g., a PDF with no OCR text layer). Using Azure Document Intelligence (a Microsoft Azure cloud service), we can now extract this text automatically, replacing a transcription process that previously would have taken hundreds of hours to do manually.

Features

The provided files include:

  1. doc_intel_analyze.py: a simplified version of the example code Microsoft offers for calling Document Intelligence; and,
  2. di_transcription_template.py: a sample main file that includes both the transcription call and a framework with some example functions for post-extraction text-cleaning.
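
As a rough illustration of what post-extraction text-cleaning can look like, here is a minimal sketch. The function names (rejoin_hyphenated_words, collapse_whitespace, clean_text) are hypothetical and are not taken from di_transcription_template.py; the cleaning rules your documents need will likely differ.

```python
import re


def rejoin_hyphenated_words(text: str) -> str:
    # Join a word that was split across a line break with a trailing hyphen.
    # Caveat: this also joins genuine hyphenated compounds that happen to
    # break at the hyphen, so review the output for your material.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def collapse_whitespace(text: str) -> str:
    # Squeeze runs of spaces/tabs to one space and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Reduce three or more consecutive newlines to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()


def clean_text(text: str) -> str:
    # Hypothetical cleaning pipeline: apply each step in order.
    return collapse_whitespace(rejoin_hyphenated_words(text))
```

Keeping each rule in its own small function makes it easy to add, remove, or reorder steps for a particular collection.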

Instructions (Jupyter or Google Colab Notebook)

If you want to test this process out without installing anything new, you can use the Notebook file (for either Google Colab or Jupyter Notebook). Running large batches will be significantly slower on a web-based notebook, though.

  1. Set up Document Intelligence on Microsoft Azure
  2. Download Demo_DocIntelligence_Transcription.ipynb
  3. Set your Endpoint and Key in the file
  4. Follow the steps in the Notebook
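
To show how the Endpoint and Key are typically used, here is a minimal sketch of a single transcription call. It assumes the azure-ai-formrecognizer package and the "prebuilt-read" model; the code in this repository may use a different package or model. The environment variable names DOC_INTEL_ENDPOINT and DOC_INTEL_KEY and both function names are hypothetical, chosen here so that the key does not have to be written into a file.

```python
import os


def load_credentials(env=os.environ):
    # Hypothetical variable names; use whatever names suit your setup.
    endpoint = env.get("DOC_INTEL_ENDPOINT", "").strip()
    key = env.get("DOC_INTEL_KEY", "").strip()
    if not endpoint or not key:
        raise ValueError("Set DOC_INTEL_ENDPOINT and DOC_INTEL_KEY first.")
    return endpoint, key


def transcribe_pdf(pdf_path, endpoint, key):
    # Imported here so the rest of the module loads without the Azure SDK.
    # Assumes: pip install azure-ai-formrecognizer
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    with open(pdf_path, "rb") as pdf_file:
        # "prebuilt-read" is the general text-extraction model (an assumption
        # about which model fits; other prebuilt models exist).
        poller = client.begin_analyze_document("prebuilt-read", document=pdf_file)
        result = poller.result()
    # result.content holds the extracted text of the whole document.
    return result.content
```

A call such as transcribe_pdf("scan.pdf", *load_credentials()) would then return the extracted text as one string, where scan.pdf is a placeholder filename.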

Instructions (Local Install)

If you expect to process a lot of files, we recommend setting up a local installation of the code and running it on a higher-end computer.

  1. Set up Document Intelligence on Microsoft Azure
  2. Download doc_intel_analyze.py and di_transcription_template.py
  3. In doc_intel_analyze.py, update ENDPOINT and API_KEY to use the Endpoint and Key values located in your Document Intelligence resource on Microsoft Azure (refer to step 1)
  4. Place any PDFs you wish to transcribe in the main folder where you've installed this project on your machine. To use a different folder, change SOURCE_DIR (see step 5).
  5. Create folders for the transcribed text files and the cleaned text files (data and cleaned respectively). If you use other directory names (or if you have the original PDFs stored elsewhere), update the SOURCE_DIR, DATA_DIR, and CLEANED_DIR entries in di_transcription_template.py. On Windows (use mkdir on macOS or Linux):
       md data
       md cleaned
  6. Run the following from the command line (substitute the new filename if you have renamed di_transcription_template.py):
       python di_transcription_template.py
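
The overall flow described in these steps can be sketched as follows. This is an illustration under stated assumptions, not the contents of di_transcription_template.py: the SOURCE_DIR, DATA_DIR, and CLEANED_DIR names come from the steps above, but their default values here, the process_folder function, and its transcribe and clean parameters are hypothetical.

```python
from pathlib import Path

# Assumed defaults; the template's actual values may differ.
SOURCE_DIR = Path(".")
DATA_DIR = Path("data")
CLEANED_DIR = Path("cleaned")


def process_folder(transcribe, clean, source_dir=SOURCE_DIR,
                   data_dir=DATA_DIR, cleaned_dir=CLEANED_DIR):
    """Transcribe every PDF in source_dir, saving raw and cleaned text.

    transcribe: callable taking a PDF path and returning its text.
    clean: callable taking raw text and returning cleaned text.
    """
    data_dir, cleaned_dir = Path(data_dir), Path(cleaned_dir)
    # Create the output folders if they do not exist yet.
    data_dir.mkdir(parents=True, exist_ok=True)
    cleaned_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf_path in sorted(Path(source_dir).glob("*.pdf")):
        raw_text = transcribe(pdf_path)
        # Keep the raw transcription so cleaning can be re-run later
        # without calling the cloud service again.
        (data_dir / f"{pdf_path.stem}.txt").write_text(raw_text, encoding="utf-8")
        cleaned_path = cleaned_dir / f"{pdf_path.stem}.txt"
        cleaned_path.write_text(clean(raw_text), encoding="utf-8")
        written.append(cleaned_path)
    return written
```

Saving the raw and cleaned text separately means the cleaning rules can be adjusted and re-applied to the files in data without transcribing the PDFs a second time.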
