---
marp: false
headingDivider: 1
paginate: true
footer: Lectures on Self Supervised Learning - Thomas Breuel - NVIDIA
---
Optical Character Recognition:
- image → text
- closely related to object recognition, autonomous driving, speech recognition, ...
- much better statistical understanding of...
- class structure
- noise
- ground truth
- syntactic structure / language modeling
- surprisingly, a hot topic!
- wanted: information extraction from unstructured documents
- NLP technology didn't use to be up to it
- with new LLMs, that has changed dramatically
We need to train multiple models:
- preprocessing
- segmentation
- recognition
- syntactic model (language model)
- geometric model (reading order model)
This is common to many computer vision tasks using deep learning.
Available data:
- thousands of scanned pages with manual segmentation and text
- tens of thousands of scanned pages with approximate text
- millions of scanned pages without text
- large amounts of text without document images
- the ability to generate new texts and printed documents
Typical unsupervised / semi-supervised learning problem.
Train supervised models on manually labeled / transcribed data:
- segmentation model
- image $\rightarrow$ text model
Yields good performance on data similar to training data.
Unlabeled training data primarily helps with generalization to new datasets.
Idea:
- Use our or other OCR systems to transcribe those pages and use the output as training data ("pseudolabel")
Questions:
- How can using a worse OCR system work for training a better OCR system?
- Is there anything we can do to improve this?
Let's focus just on recognizing words for the following examples (forget about the rest of the OCR system):
Idea:
- run the existing OCR system, giving word images and corresponding text
- construct a new training set by...
- rejecting any word that the OCR system has a low confidence in
- rejecting any word that is not found in the dictionary
- this way, we obtain a new training set that contains "mostly good" training samples
- iterate this multiple times
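The filtering step above can be sketched as follows; the dictionary, confidence threshold, and candidate data layout are illustrative assumptions, not part of the lecture:

```python
# Sketch of pseudolabel filtering for self-training.
# Each candidate is (word_image_id, predicted_text, confidence).

DICTIONARY = {"the", "quick", "brown", "fox"}  # assumed word list
CONF_THRESHOLD = 0.9                           # assumed cutoff

def filter_pseudolabels(candidates, dictionary=DICTIONARY,
                        threshold=CONF_THRESHOLD):
    """Keep only OCR outputs that are confident and in the dictionary."""
    kept = []
    for image_id, text, conf in candidates:
        if conf < threshold:                 # reject low-confidence words
            continue
        if text.lower() not in dictionary:   # reject non-dictionary words
            continue
        kept.append((image_id, text))
    return kept

candidates = [
    ("img0", "the", 0.98),    # accepted
    ("img1", "qvick", 0.95),  # rejected: not in dictionary
    ("img2", "fox", 0.55),    # rejected: low confidence
]
print(filter_pseudolabels(candidates))  # → [("img0", "the")]
```

Iterating this loop retrains the recognizer on a progressively cleaner pseudolabeled set.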
The network estimates the posterior $\tilde{P}(c|x)$.
For OCR, we know:
- true $P(c|x)$ is approximately 0 or 1 (no ambiguities)
- any uncertainty in classifier output is due to mislabeled training data
- e.g., if 20% of the training data is mislabeled: $\tilde{P}(c|x) = 0.8$ for the true class $c$
- if we use the pseudolabel $\arg\max_c \tilde{P}(c|x)$, many accidentally mislabeled training samples will actually be correctly relabeled
- as a result, the posterior probability is estimated higher on the next training round and the model improves
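The 20%-mislabeled example can be checked numerically; the class count and the symmetric-noise assumption are illustrative:

```python
import numpy as np

K = 10          # number of classes (assumed)
noise = 0.2     # fraction of mislabeled training samples

# Observed label distribution for examples whose true class is 0, under
# symmetric label noise: the true class keeps 1 - noise of the mass,
# the rest is spread uniformly over the other K - 1 classes.
p_tilde = np.full(K, noise / (K - 1))
p_tilde[0] = 1.0 - noise

print(p_tilde[0])               # → 0.8, matching the slide's example
print(int(np.argmax(p_tilde)))  # → 0: the argmax pseudolabel is the true class
```

As long as the noise rate stays below the point where the true class loses the argmax, relabeling with the pseudolabel strictly reduces the label error.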
- latent variables in semi-supervised OCR
- labels for unlabeled portion of training set
- outlier status for unlabeled portion of training set
- EM algorithms recover latent variables by...
- "making a best guess" given the current model
- retraining the model as if that guess is correct
- identify the latent variables that are being recovered
- identify the prior assumptions / inductive biases in the model
- identify the EM algorithm implementation
- ideally, the LSTM/Conv model outputs the correct class only where the character occurs
- outputs ϵ (no character) everywhere else
- our transcript does not contain the character positions
- CTC performs alignment between classifier output and transcript
- CTC estimates the most likely positions of characters given the current model
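The decoding side of this can be sketched as best-path CTC decoding: take the per-frame argmax of the classifier, merge repeats, then drop the blanks. This is a minimal sketch, not a full CTC forward-backward alignment:

```python
BLANK = "ϵ"  # CTC blank symbol (no character)

def ctc_collapse(frame_outputs, blank=BLANK):
    """Best-path CTC decoding: merge repeated symbols, then drop blanks.
    frame_outputs is the per-frame argmax of the LSTM/Conv classifier."""
    out = []
    prev = None
    for sym in frame_outputs:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# One frame per image column; the model fires at the character positions
# and emits the blank everywhere else.
print(ctc_collapse(["ϵ", "c", "c", "ϵ", "a", "ϵ", "t", "t", "ϵ"]))  # → "cat"
```

Note that a blank between two identical symbols is what distinguishes a doubled letter from a repeated frame: `["c", "ϵ", "c"]` decodes to `"cc"`.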
(diagram)
Normal EM-Training:
- image + transcript, EM-training recovers alignment
Language Model-Based EM-Training:
- image + language model, EM-training recovers text + alignment
There always has to be some source of information:
- unigram perplexity: 1000 $\leftarrow$ what the classifier outputs
- bigram perplexity: 200
- trigram perplexity: 100 $\leftarrow$ what the language model imposes
- full transcript perplexity: 1 $\leftarrow$ fully supervised training
Additionally: several bits per character for character location.
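For reference, perplexity here is just the exponentiated average negative log-probability the model assigns to each symbol; the probability values below are illustrative:

```python
import math

def perplexity(char_probs):
    """Perplexity = exp of the average negative log-probability
    assigned to each character of the text."""
    return math.exp(-sum(math.log(p) for p in char_probs) / len(char_probs))

# A model uniformly uncertain over 1000 symbols per position has
# perplexity 1000; a full transcript (p = 1 everywhere) has perplexity 1.
print(perplexity([1 / 1000] * 5))  # ≈ 1000.0
print(perplexity([1.0] * 5))       # → 1.0
```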
We have seen three different forms of weak/unsupervised training for image → text training:
- using image + transcript, lacking just the alignment/segmentation
- using image + language model only
- using just pseudolabels from weak classifier
Assume we start with an errorful segmenter trained on a small labeled training set.
Two kinds of errors:
- segmenter returns word for non-words
- segmenter misses actual words
Approach:
- validate each returned word using OCR (no OCR = not a word)
- mark everything that is not identified as a word as "don't know" and exclude from training
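The masking step can be sketched as follows; the label encoding, region representation, and validator are all hypothetical stand-ins for an OCR-based check:

```python
# Sketch of building a training mask from OCR validation (assumed setup):
# label 1 = confirmed word region, label -1 = "don't know"
# (excluded from the training loss).

def build_mask(regions, validate):
    """regions: list of (region_id, bbox); validate: OCR-based check."""
    labels = {}
    for region_id, bbox in regions:
        if validate(bbox):          # OCR recognizes a word here
            labels[region_id] = 1
        else:                       # no OCR output: not confirmed a word,
            labels[region_id] = -1  # but also not confirmed background
    return labels

# Hypothetical validator: treat sufficiently wide boxes as readable words.
validate = lambda bbox: (bbox[2] - bbox[0]) > 20
regions = [("r0", (0, 0, 50, 10)), ("r1", (0, 0, 5, 10))]
print(build_mask(regions, validate))  # → {"r0": 1, "r1": -1}
```

The point of the `-1` label is that the loss simply skips those pixels/regions, so segmenter errors do not get reinforced as hard negatives.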
Approaches to self-supervision usually incorporate prior knowledge and require some manual design:
- text recognizer
- reject non-words from "soft labels"
- use language models as part of EM training
- page segmentation
- validate segmentation via OCR
- introduce "don't know" mask during training
Prior knowledge about the task is required to choose meaningful self-supervision tasks.
- word recognition: pseudolabels, language-model based rejection
- segmentation: verification via OCR, introduction of "don't care" regions
- object recognition, natural image segmentation: later
Request Help from Oracle:
- run OCR as above and send low confidence outputs to manual transcribers
- output near decision boundaries
- output that can maximally help the classifier improve
Clustering:
- perform clustering on character or word images and manually transcribe each cluster
Question: what similarity measure do we use for clustering?
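One naive answer, offered as an illustration only: Euclidean distance between normalized pixel vectors. Real systems use alignment- or deformation-tolerant measures, but even this crude version separates identical from dissimilar shapes:

```python
import numpy as np

def shape_distance(a, b):
    """Naive shape similarity for character images: Euclidean distance
    between intensity-normalized pixel vectors."""
    a = a.astype(float).ravel()
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = b.astype(float).ravel()
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float(np.linalg.norm(a - b))

x = np.eye(8)                # toy "character" image
y = np.eye(8)                # identical shape
z = np.flipud(np.eye(8))     # mirrored, different shape
print(shape_distance(x, y))  # → 0.0 (identical)
print(shape_distance(x, y) < shape_distance(x, z))  # → True
```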
- cluster characters by shape
- use frequency and patterns to infer character identity
- e.g.: first word is likely "THE"
- letter frequencies
- word frequencies
- single letter word is likely "A"
- explicit form of many EM-based recognition algorithms
- clustering is based on a simple hierarchical Bayesian source model
- latent variables can be recovered by EM or Bayesian methods
- simple clustering assumes $\Sigma = 1$
- both $c$ and $\mu$ are latent; need oracle to recover actual class labels
Rendering can involve transformation parameters, giving rise to view manifolds.
Dimensionality reduction in pixel space:
- think of real data as clouds surrounding...
- union of manifolds with boundary (only shown for blue class)
- "manifold learning" = union-of-manifolds-with-boundary learning
Given or Learned:
- rendering
- degradation
Constructed or Learned (inverse problems):
- invariances
- preprocessing / image cleanup
CycleGAN can learn both directions simultaneously with no supervision.
This is the "forward path" in the channel view of recognition:
- digital typesetting = perfect "artificial" document generation
- take any text, generate pages of perfect text
- for OCR, we need "degraded images"
- printing, scanning, photographing, photocopying, ...
- physical processes for document image degradation are well known
- recognition is easier if documents are not degraded
- can we restore the "clean" image via unsupervised learning?
AV/object recognition: image translation prior to recognition
- binarization
- optimal thresholding
- optimal linear filtering (deconvolution, etc.)
- clustering
- performance-based dynamic thresholding
- deep learning
- supervised restoration (LSTM, pix2pix)
- unsupervised restoration via CycleGAN
Otsu's method:
- manually constructed inverse degradation model based on prior knowledge/assumptions
- assume some kind of mixture model of image pixel generation
- maximize inter-class variance
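The inter-class-variance criterion can be implemented in a few lines of NumPy; the bimodal toy image below stands in for a scanned page with dark ink and bright paper:

```python
import numpy as np

def otsu_threshold(image, nbins=256):
    """Otsu's method: pick the threshold that maximizes the between-class
    variance of the two-population pixel mixture."""
    hist, edges = np.histogram(image.ravel(), bins=nbins)
    p = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)              # probability of the "dark" class
    w1 = 1.0 - w0                  # probability of the "bright" class
    m = np.cumsum(p * centers)     # cumulative mean
    mu_total = m[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        var_between = (mu_total * w0 - m) ** 2 / (w0 * w1)
    var_between = np.nan_to_num(var_between)
    return centers[np.argmax(var_between)]

# Bimodal toy "document": dark ink near 0.1, bright paper near 0.9.
img = np.concatenate([np.full(100, 0.1), np.full(300, 0.9)])
t = otsu_threshold(img)
print(0.1 < t < 0.9)  # → True: the threshold separates the two modes
```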
- run thresholding with many parameters
- pick the parameters that yield the best OCR output
- no transcript, so use proxy
- use statistics/classifier, or #words in dictionary
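The threshold sweep can be sketched as follows; the `ocr` callable, the dictionary, and the dictionary-word proxy score are all hypothetical stand-ins for a real OCR engine:

```python
# Sketch of performance-based thresholding: try many threshold settings
# and keep the one whose OCR output contains the most dictionary words.

DICTIONARY = {"the", "quick", "brown", "fox"}  # assumed word list

def score(text, dictionary=DICTIONARY):
    """Proxy for OCR quality without a transcript: #dictionary words."""
    return sum(w.lower() in dictionary for w in text.split())

def best_threshold(image, thresholds, ocr):
    """Pick the binarization parameter that maximizes the proxy score."""
    return max(thresholds, key=lambda t: score(ocr(image, t)))

# Toy stand-in for an OCR engine: only the middle threshold "works".
fake_ocr = lambda image, t: "the quick fox" if t == 0.5 else "th3 qu1ck f0x"
print(best_threshold(None, [0.3, 0.5, 0.7], fake_ocr))  # → 0.5
```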
- closely related to GAN methods
- assumption/prior: local thresholding = good way of inverting degradation model
(a) original, (b) Sauvola, (c) LSTM-based binarization
- generate clean images, degrade with degradation model
- takes advantage of prior knowledge of degradation models
- take degraded images, use performance-based thresholding
- takes advantage of knowledge of statistical properties of output
CycleGAN replaces all those components with trainable networks:
- clean image $\rightarrow$ degraded image
- degraded image $\rightarrow$ clean image
- clean image detector
- degraded image detector
These correspond to the forward and backward arrows in our channel model.
CycleGAN can be trained end-to-end without any labeled data or (significant) prior assumptions.
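In the standard CycleGAN objective, writing $G$ for the clean $\rightarrow$ degraded generator, $F$ for the degraded $\rightarrow$ clean generator, and leaving the two detectors (discriminators) to supply the adversarial losses, the piece that replaces paired supervision is the cycle-consistency term:

$$\mathcal{L}_{\text{cyc}}(G, F) = \mathbb{E}_x\left[\|F(G(x)) - x\|_1\right] + \mathbb{E}_y\left[\|G(F(y)) - y\|_1\right]$$

Requiring that degrading and then restoring an image (and vice versa) approximately reproduces the input is what allows both mappings to be trained from unpaired clean and degraded document images.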
Recent Developments
- transformer models permit full end-to-end training for image-to-text transcriptions
- do not need intermediate segmentation, text-line recognition
Image-to-HTML tags.
- apply the same principles
- train an initial model using supervised data
- compute output on unlabeled data ("soft labels")
- correct the output using language models
- retrain
- possibly use auxiliary tasks for pretraining (later)
- CLIP
- image features
- image quality
- image type
- script
- LLMs
- language modeling during recognition
- OCR correction
- uses
- full recognition pipeline (slow!)
- training data generation followed by "distillation"
In the OCR example, we have seen most of the major concepts of unsupervised and semi-supervised training:
- EM training (aka soft labeling)
- use of language modeling as data source
- clustering
- unsupervised preprocessing / image enhancement