-
Notifications
You must be signed in to change notification settings - Fork 249
Document Layout Analysis
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis. – Wikipedia
In our case, we are using the pdf document itself instead of image representation. The following categories of tools are available:
- Word extractors
- Page segmenters
- Reading order detectors
- Other layout tools
- Export – Viewing/exporting the results of document layout analysis
Word extractors deal with the task of building words using letters in a page. 2 different methods are currently available:
Words | Default method | Nearest Neighbour |
---|---|---|
Horizontal | ✔️ | ✔️ |
Axis aligned rotated | ✖️ | ✔️ |
Rotated | ❌ | ✔️ |
Curved | ❌ | ✔️ |
Legend: ✔️: supported, ✖️: partial support, ❌: not supported
TO DO
The nearest neighbour word extractor is useful to extract words from pdf documents with complex layouts.
It will seek to connect each glyph bound box's EndBaseLine
point with the closest glyph bound box StartBaseLine
point.
In order to decide wether two glyphs are close enough from each other, the algorithm uses the maximum of both candidates width and point size as a reference distance.
- For glyphs with known text direction (axis aligned), the Manhattan distance is used and the threshold is set to 20% of the reference distance.
- For glyphs with unknown text direction, the Euclidean distance is used and the threshold is set to 40% of the reference distance.
If the measured distance between the two glyphs is below this threshold, they are deemed to be connected.
Once glyphs are connected, they are then grouped to form words via a depth first search algorithm. The extractor should work for left-to-right and right-to-left scripts as it seems that they both have there glyphs' StartBaseLine
on the left and EndBaseLine
on the right of the bounding box.
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
foreach (var word in words)
{
// Do something
}
}
}
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;
using (var document = PdfDocument.Open(@"document.pdf"))
{
string[] punctuation = new[] { ".", "," };
string[] cannotStartWord = new[] { ")", "]", "!", "?", ";", ":" };
string[] cannotEndWord = new[] { "(", "[" };
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var nnOptions = new NearestNeighbourWordExtractor.NearestNeighbourWordExtractorOptions()
{
// Ignore the letters that are space or belong to 'punctuation' array
// These letters will be put in a single word
FilterPivot = letter => !string.IsNullOrWhiteSpace(letter.Value) &&
!punctuation.Contains(letter.Value),
Filter = (pivot, candidate) =>
{
if (string.IsNullOrWhiteSpace(candidate.Value) ||
cannotEndWord.Contains(candidate.Value))
{
// start new word if the candidate neighbour is
// a space or belongs to 'cannotEndWord' array
return false;
}
else if (cannotStartWord.Contains(pivot.Value))
{
// end word if pivot belongs to 'cannotStartWord' array
return false;
}
return true;
}
};
var nnWordExtracor = new NearestNeighbourWordExtractor(nnOptions);
var words = nnWordExtracor.GetWords(page.Letters);
foreach (var word in words)
{
// Do something
}
}
}
A more advanced example to read two columns. See #721
static void testAdv()
{
var sb = new StringBuilder();
string pdfFilePath = @"...\2302.14502.pdf";
using (var document = UglyToad.PdfPig.PdfDocument.Open(pdfFilePath))
{
foreach (var page in document.GetPages())
{
// 0. Preprocessing
var letters = page.Letters; // no preprocessing
// 1. Extract words
//var wordExtractor = NearestNeighbourWordExtractor.Instance;
var wordExtractorOptions = new NearestNeighbourWordExtractor.NearestNeighbourWordExtractorOptions()
{
Filter = (pivot, candidate) =>
{
// check if white space (default implementation of 'Filter')
if (string.IsNullOrWhiteSpace(candidate.Value))
{
// pivot and candidate letters cannot belong to the same word
// if candidate letter is null or white space.
// ('FilterPivot' already checks if the pivot is null or white space by default)
return false;
}
// check for height difference
var maxHeight = Math.Max(pivot.PointSize, candidate.PointSize);
var minHeight = Math.Min(pivot.PointSize, candidate.PointSize);
if (minHeight != 0 && maxHeight / minHeight > 2.0)
{
// pivot and candidate letters cannot belong to the same word
// if one letter is more than twice the size of the other.
return false;
}
// check for colour difference
var pivotRgb = pivot.Color.ToRGBValues();
var candidateRgb = candidate.Color.ToRGBValues();
if (!pivotRgb.Equals(candidateRgb))
{
// pivot and candidate letters cannot belong to the same word
// if they don't have the same colour.
return false;
}
return true;
}
};
var wordExtractor = new NearestNeighbourWordExtractor(wordExtractorOptions);
var words = wordExtractor.GetWords(letters);
// 2. Segment page
//var pageSegmenter = DocstrumBoundingBoxes.Instance;
var pageSegmenterOptions = new DocstrumBoundingBoxes.DocstrumBoundingBoxesOptions()
{
};
var pageSegmenter = new DocstrumBoundingBoxes(pageSegmenterOptions);
var textBlocks = pageSegmenter.GetBlocks(words);
// 3. Postprocessing
var readingOrder = UnsupervisedReadingOrderDetector.Instance;
var orderedTextBlocks = readingOrder.Get(textBlocks);
// 4. Extract text
foreach (var block in orderedTextBlocks)
{
sb.Append(block.Text.Normalize(NormalizationForm.FormKC)); // normalise text
sb.AppendLine();
}
sb.AppendLine();
}
}
Console.WriteLine(sb.ToString());
}
The algorithm was used on this map that has a complex layout, with glyphs/words having very diverse text directions. The algorithm is able to rebuild words independently of their direction.
Words are indicated by red rectangles:
Page segmenters deal with the task of finding blocks of text in a page. They return a list of TextBlock
s that can be thinked as paragraphs. Each TextBlock
will contain the list of lines (TextLine
) that belong to it. In turn, each TextLine
contains the list Word
s that belong to it. Each of these elements have their own bounding box and text.
3 different methods are currently available:
- Default method
- Recursive XY Cut – a top-down method
- Docstrum for bounding boxes – a bottom-up method
Text | Default method | Recursive XY Cut | Docstrum |
---|---|---|---|
Single Column | ✔️ | ✔️ | ✔️ |
Multi Columns | ✖️ | ✔️ | ✔️ |
L-shaped text | ❌ | ❌ | ✔️ |
Rotated lines/paragraphs | ❌ | ✖️ | ✔️ |
Legend: ✔️: supported, ✖️: partial support, ❌: not supported
This method returns one single block, containing all words in the page.
var blocks = DefaultPageSegmenter.Instance.GetBlocks(words);
The recursive X-Y cut is a top-down page segmentation technique that decomposes a document image recursively into a set of rectangular blocks. – Wikipedia
The algorithm works in a top-down manner because it starts at the page level, then scan it from left to right to find a large enough gap between words. Once the gap is found, a vertical cut is made at this level and the page is now separated in two blocks. Then the algorithm start scanning from bottom to top each block in order to find a large enough gap, and once it is found, a horizontal cut is made. This 2-step process is repeated until no cut is possible.
The height and width of the dominant font are used to decide vertical and horizontal gap sizes. A minimum block width can also be used to filter small blocks.
Once all blocks are discovered, lines of text are simply created by grouping words by their bottom coordinate and ordering them from left to right.
The default implementation will use the mode of the height and width of the letters in each block as relevant gap sizes. The minimum block width is 0 (no minimum).
- https://en.wikipedia.org/wiki/Recursive_X-Y_cut
- Recursive X-Y Cut using Bounding Boxes of Connected Components by Jaekyu Ha, Robert M.Haralick and Ihsin T. Phillips
using (var document = PdfDocument.Open(“document.pdf”))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords();
// Use default parameters
// - mode of letters' height and width used as gap size
// - no minimum block width
var blocks = RecursiveXYCut.Instance.GetBlocks(words);
foreach (var block in blocks)
{
// Do something
// E.g. Output the blocks
foreach (TextLine line in block.TextLines)
{
foreach (Word word in line.Words)
{
Console.Write(word.Text + " ");
}
}
}
}
}
The method can be tailored by providing a minimum block width, and horizontal and vertical gap sizes/functions:
- Minimum block width is set to 1/3 of page width:
var recursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
MinimumWidth = page.Width / 3.0
});
var blocks = recursiveXYCut.GetBlocks(words);
- Average of the page letters' height and width is used as gap size (the values won’t change), and minimum block width is set to 1/3 of page width:
var recursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
MinimumWidth = page.Width / 3.0,
DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width),
DominantFontHeightFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Height)
});
var blocks = recursiveXYCut.GetBlocks(words);
- A function that will be applied to each block letters’ height and width can also be provided. Here we use the average and minimum block width is set to 1/3 of page width:
var recursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
MinimumWidth = page.Width / 3.0,
DominantFontWidthFunc = letters => letters.Select(l => l.GlyphRectangle.Width).Average(),
DominantFontHeightFunc = letters => letters.Select(l => l.GlyphRectangle.Height).Average()
});
var blocks = recursiveXYCut.GetBlocks(words);
NB: Isolated bullet points can be handled by setting a minimum block width, e.g. RecursiveXYCut.Instance.GetBlocks(words, new RecursiveXYCut.RecursiveXYCutOptions() { MinimumWidth = page.Width / 3.0 })
NB: DominantFontHeightFunc: The examples above use the average letter glyph height for that page. But using the median glyph height would generally produce better results (the median ignores extremes that can impact the results). You may also want to consider using the median glyph height across all the pages in certain situations. With text where the gap between lines may be wider than the font height consider using the mean/median distance between the text lines.
Paraphrasing the abstract of the original paper, the document spectrum (or docstrum) is a method for structural page layout analysis based on bottom-up, nearest-neighbour clustering of page components. The method yields an accurate within-line, and between-line spacings and locates text lines and text blocks.
A description of the original algorithm’s steps can be found on the Wikipedia page.
This implementation loosely follows these steps, using only a subset them, as we don’t start from an image but from the document itself and don’t need to estimate the skew. It works in a bottom-up fashion, starting at word level:
- Estimate in line and between line spacing
- Within-line distance
- Between-line distance
- Build lines of text (from words)
- Build blocks of text (from text lines)
- https://en.wikipedia.org/wiki/Document_layout_analysis#Example_of_a_bottom_up_approach
- The Document Spectrum for Page Layout Analysis by Lawrence O’Gorman
- Document Structure and Layout Analysis by Anoop M. Namboodiri and Anil K. Jain
- Document Layout Analysis by Garrett Hoch
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords();
// Use default parameters
// - within line angle is set to [-30, +30] degree (horizontal angle)
// - between lines angle is set to [45, 135] degree (vertical angle)
// - between line multiplier is 1.3
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
foreach (var block in blocks)
{
// Do something
}
}
}
The method can be tailored by providing a within line angle, a between lines angle and a between line multiplier:
- Here, the within line angle is set to [-45, +45] degree (horizontal angle), the between lines angle is set to [35, 170] degree (vertical angle) and the between line multiplier is 1.5:
var docstrum = new DocstrumBoundingBoxes(new DocstrumBoundingBoxes.DocstrumBoundingBoxesOptions()
{
WithinLineBounds = new DocstrumBoundingBoxes.AngleBounds(-45, 45),
BetweenLineBounds = new DocstrumBoundingBoxes.AngleBounds(35, 170),
BetweenLineMultiplier = 1.5
});
docstrum.GetBlocks(words);
Reading order detectors deal with the task of finding the blocks' reading order in a page. Only TextBlock
are ordered at the moment. 3 different methods are currently available:
This reading order detector does nothing. It will return the blocks as they are provided and the ReadingOrder
of each TextBlock
remains -1.
n/a
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
var orderedBlocks = DefaultReadingOrderDetector.Instance.Get(blocks);
foreach (var block in orderedBlocks)
{
// Do something
}
}
}
n/a
This reading order detector uses the average of the Letter.TextSequence
contained in each block to determine the reading order.
n/a
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
var orderedBlocks = RenderingReadingOrderDetector.Instance.Get(blocks);
foreach (var block in orderedBlocks)
{
// Do something
}
}
}
TO DO
As per the original paper (see References): We follow the approach of Aiello et al., who defined a set of binary relations for intervals in X and Y direction that allow a certain amount of tolerance for the coordinate values. In total there are 13 relations in both X and Y direction, and for each pair of two-dimensional bounding boxes exactly one X relation and exactly one Y relation is true. This tolerance is implemented by a parameter T
; if two coordinates are closer than T
they are considered equal. This flexibility is necessary because due to the inherent noise in the PDF extraction text blocks in the same column might not be exactly aligned (here we choose T = 5
). Aiello et al. then defined the BeforeInReading
relation as a Boolean combination of binary relations for intervals in X and Y direction, which states for any pair of bounding boxes whether the first one occurs at some point (not necessarily immediately) before the other in a column-wise reading order.
In addition to [the above], we also define the BeforeInRendering
relation that tells whether a block is rendered at some time before another block in the PDF. We incorporate both relations into a single partial ordering of blocks by specifying a directed graph with an edge between every pair of blocks for which at least one of the two relations hold. – S. Klampfl et al.
- Section 5.1 of Unsupervised document structure analysis of digital scientific articles by S. Klampfl, M. Granitzer, K. Jack, R. Kern
- Document Understanding for a Broad Class of Documents by L. Todoran, M. Worring, M. Aiello and C. Monz.
- Spatial reasoning (Allen’s interval relations) Allens_Relations_Wikipedia
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
var orderedBlocks = UnsupervisedReadingOrderDetector.Instance.Get(blocks);
foreach (var block in orderedBlocks)
{
// Do something
}
}
}
- Set the tolerance parameter
T
to 10.
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
var unsupervisedReadingOrderDetector = new UnsupervisedReadingOrderDetector(10);
var orderedBlocks = unsupervisedReadingOrderDetector.Get(blocks);
foreach (var block in orderedBlocks)
{
// Do something
}
}
}
Viewing the exported xml file in PRImA Research Lab's LayoutEvalGUI:
NB: In this example, the decoration blocks (header and page number) are not ordered correctly. The pdf was generated using Word and it seems that the rendering order is not respected for decoration blocks. See the Decoration Text Block Classifier to know how to detect these blocks.
TO DO
TO DO
TO DO
TO DO
TO DO
TO DO: image
The algorithm finds a cover of the background whitespace of a document in terms of maximal empty rectangles. It is a top-down algorithm, which means it uses a whole page as its starting point, and works in a way analogous to quicksort or Branch and Bound algorithms. – Ø. R. Berg, 2011
- Section 3.2 of High precision text extraction from PDF documents by Øyvind Raddum Berg
- Section 2 of Two geometric algorithms for layout analysis by Thomas M. Breuel
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Use default parameters
// - maximum number of whitespace rectangles is 40 by default
// - whitespace fuzziness is 0.15
// - no limit to queue size
var whitespaces = WhitespaceCoverExtractor.GetWhitespaces(page.GetWords(),
page.GetImages());
}
}
The method can be tailored by providing maxRectangleCount, maxBoundQueueSize, whitespaceFuzziness, minWidth and minHeight (the fuzziness is defined as the percentage by which candidate whitespace rectangles are allowed to overlap the surrounding obstacles):
- Here, the maximum number of whitespace rectangles is set to 100 and the maximum queue size is set to 10,000:
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var whitespaces = WhitespaceCoverExtractor.GetWhitespaces(page.GetWords(),
page.GetImages(),
100,
10_000);
}
}
- Here, the maximum number of whitespace rectangles is set to 100, the maximum queue size is set to 10,000, the minimum whitespace rectangles' Width and Height is set to 0, and the fuzziness is set to 0 (instead of 0.15 by default):
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var whitespaces = WhitespaceCoverExtractor.GetWhitespaces(page.GetWords(),
page.GetImages(),
minWidth: 0,
minHeight: 0,
maxRectangleCount: 100,
whitespaceFuzziness: 0.0,
maxBoundQueueSize: 10_000);
}
}
The algorithm returns text blocks that are classified as decoration blocks. From the paper in reference: Many digital documents have archival information such as author names, publication titles, page numbers, and release dates printed repeatedly at the border of each page. Most prominently this content is placed inside headers or footers, but sometimes also at the left or right edge of the page. We refer to text blocks containing this type of information as decoration blocks. – S. Klampfl et al.
The document should contain more than 2 pages to work.
- Unsupervised document structure analysis of digital scientific articles by S. Klampfl, M. Granitzer, K. Jack, R. Kern.
using (var document = PdfDocument.Open(@"document.pdf"))
{
var docDecorations = DecorationTextBlockClassifier.Get(document.GetPages().ToList(),
DefaultWordExtractor.Instance,
DocstrumBoundingBoxes.Instance);
for (int i = 0; i < document.NumberOfPages; i++)
{
IReadOnlyList<TextBlock> decorations = docDecorations[i];
// Do something
}
}
If you want to use a different library to compute the minimum edit distance, used in the algorithm to compare strings, you can do as follow (using the Fastenshtein library):
using Fastenshtein;
using (var document = PdfDocument.Open(@"document.pdf"))
{
var docDecorations = DecorationTextBlockClassifier.Get(document.GetPages().ToList(),
DefaultWordExtractor.Instance,
DocstrumBoundingBoxes.Instance,
(string s1, string s2) => Levenshtein.Distance(s1, s2) / (double)Math.Max(s1.Length, s2.Length));
// The minimum edit distance should be normalised
for (int i = 0; i < document.NumberOfPages; i++)
{
IReadOnlyList<TextBlock> decorations = docDecorations[i];
// Do something
}
}
TO DO
Exporters deal with the task of transforming a pdf page into a text representation, while keeping as much information about the layout as possible. 3 different formats are currently available:
- PAGE (xml format)
- ALTO (xml format)
- hOcr (html format) The exporters allow tools to display a visual representation of the layout and the images show in this page are examples of that.
The PAGE (Page Analysis and Ground-Truth Elements) Format is an XML-based page image representation framework that records information on image characteristics (image borders, geometric distortions and corresponding corrections, binarisation etc.) in addition to layout structure and page content. – PRImA Research Lab
You can use one of the following tools from PRImA Research Lab's to open the exported file:
- LayoutEvalGUI
- PAGEViewer (This might be easier for beginners)
The PAGE xml exporter is the only exporter supporting reading order (as of Jan. 2020). See here for an example.
- https://www.primaresearch.org/publications/ICPR2010_Pletschacher_PAGE
- http://www.primaresearch.org/tools
- https://github.com/PRImA-Research-Lab/PAGE-XML
Using the DefaultWordExtractor
, RecursiveXYCut
and UnsupervisedReadingOrderDetector
:
PageXmlTextExporter pageXmlTextExporter = new PageXmlTextExporter(
DefaultWordExtractor.Instance,
RecursiveXYCut.Instance,
UnsupervisedReadingOrderDetector.Instance);
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Convert page to text
string xml = pageXmlTextExporter.Get(page);
// Save text to an xml file
File.WriteAllText("document.pagexml.xml", xml);
}
}
Opening the exported xml file in PRImA Research Lab's LayoutEvalGUI:
- Download from LayoutEvalGUI
- Execute above code to generate xml
- Inspect xml to work out width and length of the page image:
...
</Metadata>
<Page imageFilename="unknown" imageWidth="595" imageHeight="841">
...
(Note: if you set the imageFilename attribute above, PAGEViewer will automatically load the image and LayoutEvalGUI will automatically set it as the "Colour image" in step 6)
4. Create a black and white .tif file with those dimensions (to be used in the layout tool).
You can either create a black and white image of the pdf page or (easier) create a blank image and rely on the colour image (see step 5).
To create a blank image: On windows paint can be to create a new image. Set the canvas size and save as .tif file. Or below c# code will do the same job:
int width = 595;
int height = 841;
var bitmap = new Bitmap(width, height);
using (Graphics graph = Graphics.FromImage(bitmap))
{
var imageSize = new Rectangle(0, 0, width, height);
graph.FillRectangle(Brushes.White, imageSize);
}
bitmap.Save(path + ".tif", ImageFormat.Tiff);
- Create a png of the pdf page with the same size.
On windows you can use the snip it tool to create an image of the pdf page; and paint 3d (or your favourite drawing tool) to resize the canvas to the measurements from step 3, maintaining the aspect ratio. - You are now ready to pop it all into the layout tool:
a. Open the layout tool.
b. On the menubar click new.
c. On the "Evaluation files" popup: Then click "OK" and you are good to evaluate!
- Ground truth: The xml you generated
- Segmentation Result: The xml you generated
- Black and white image: The blank tiff you create/genearated
- Colour image: The png image you generate.
ALTO (Analyzed Layout and Text Object) is an open XML Schema developed by the EU-funded project called METAe. The standard was initially developed for the description of text OCR and layout information of pages for digitized material. The goal was to describe the layout and text in a form to be able to reconstruct the original appearance based on the digitized information - similar to the approach of a lossless image saving operation. – Wikipedia
You can use PRImA Research Lab's LayoutEvalGUI software to open the exported file.
- https://www.loc.gov/standards/alto/
- https://github.com/altoxml
- https://en.wikipedia.org/wiki/ALTO_(XML)
Using the NearestNeighbourWordExtractor
and DocstrumBoundingBoxes
:
AltoXmlTextExporter altoXmlTextExporter = new AltoXmlTextExporter(
NearestNeighbourWordExtractor.Instance,
DocstrumBoundingBoxes.Instance);
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Convert page to text
string xml = altoXmlTextExporter.Get(page);
// Save text to an xml file
File.WriteAllText("document.altoxml.xml", xml);
}
}
Opening the Alto xml file in should yield the same result as for the PAGE xml format.
hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML. – Wikipedia
- http://kba.cloud/hocr-spec/1.2/
- https://github.com/kba/hocr-spec
- https://en.wikipedia.org/wiki/HOCR
- https://github.com/kba/hocrjs
Using the DefaultWordExtractor
and DocstrumBoundingBoxes
:
HOcrTextExporter hocrTextExporter = new HOcrTextExporter(
DefaultWordExtractor.Instance,
DocstrumBoundingBoxes.Instance);
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Convert page to text
string html = hocrTextExporter.Get(page);
// Save text to an html file
File.WriteAllText("document.hocr.html", html);
}
}
Using the DefaultWordExtractor
and DocstrumBoundingBoxes
and adding the reference to the hocrjs script (more on hocrjs here):
HOcrTextExporter hocrTextExporter = new HOcrTextExporter(
DefaultWordExtractor.Instance,
DocstrumBoundingBoxes.Instance);
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Convert page to text, adding a reference to hocrjs script
string html = hocrTextExporter.Get(page, useHocrjs: true);
// Save text to an html file
File.WriteAllText("document.hocr.html", html);
}
}
Viewing the exported html file using hocrjs:
Converts the pdf page to SVG (Scalable Vector Graphics).
SvgTextExporter exporter = new SvgTextExporter();
var options = new ParsingOptions() { ClipPaths = true }; // true if clipped path are needed
using (var document = PdfDocument.Open(@"document.pdf", options ))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Convert page to text
var svg = exporter.Get(page);
// Save text to an html file
File.WriteAllText("document.html", svg);
}
}