
Classification and Learning for Character Recognition

Abstract

This report describes the details of building a classifier for a character set containing three letters: 'S', 'T' and 'V'. Classification features are extracted from the magnitude spectrum of each individual letter, obtained by applying a Fourier Transform. The classifier is built using the k-Nearest Neighbour (kNN) algorithm and is tested on an additional set of characters that includes corner cases. For comparison, a Decision Tree classifier is trained and tested on the same set of characters. The report also discusses the differences between the two classifiers, as well as the advantages and disadvantages of each classification method.

Fourier Domain Analysis

Any periodic function can be represented as a sum of sines and cosines, each multiplied by a corresponding coefficient. An image can be represented as a function of two input variables whose output is the gray-scale value of a given pixel. Since pixels are discrete samples, we apply the Discrete Fourier Transform [2]. The Fourier Transform finds the coefficients of these sines and cosines. To apply the transform to an image, the following formula is used:

F(u, v) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)} \qquad (1)

Intuitively, this formula computes the average brightness value for a given pair of frequencies (u, v), using all of the image's pixels.

Our data consists of images, so we apply the Fourier Transform to move from the spatial domain to the 2D frequency domain. The Fourier Transform, and especially the magnitude spectrum, offers information about the change of intensity along each dimension. Smooth areas in the image result in low frequencies, whereas areas with rapidly changing intensity result in high frequencies. The output is another image containing the magnitude values at various frequency levels on both axes. For instance, when the frequency on both axes is 0, i.e. u = 0 and v = 0, we get the average image gray-level. The further we move from the centre of the magnitude spectrum, the higher the frequency. Therefore, information about the overall contrast of the original image is retained close to the centre, while information about finer details is contained in the outer regions of the image. This allows us to control and manipulate the features that we extract from the image, as will be seen below.
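As a concrete illustration of this step, the sketch below computes the shifted log magnitude spectrum of a single character image with NumPy. The file name and the use of Pillow for loading are assumptions for the example, not part of the original pipeline.

```python
import numpy as np
from PIL import Image

# Hypothetical file name; any grayscale character image will do.
img = np.asarray(Image.open("S01.png").convert("L"), dtype=float)

# 2D DFT, shifted so that the DC term (u = 0, v = 0) sits at the centre.
spectrum = np.fft.fftshift(np.fft.fft2(img))

# Log of the magnitude spectrum; the +1 avoids log(0) at zero-magnitude bins.
log_mag = np.log(1 + np.abs(spectrum))
```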

Images in the Fourier domain also exhibit conjugate symmetry: each point has a corresponding point with the same magnitude mirrored across the centre of the image. Looking at Figure 1, we can see that the top-left quarter of the image is symmetric with the bottom-right one, and the other two quarters are symmetric as well. In a memory-constrained application, only half of each spectrum would need to be stored.

s_avg t_avg v_avg

Figure 1: Average logarithm magnitude spectrum of the S, T and V characters

Feature Selection

Features describe the characteristics of the data. The combination of d features is represented as a d-dimensional column vector called a feature vector, and the d-dimensional space defined by the feature vector is called the feature space. In order to classify characters, we had to create features that clearly separate the character classes. We focused on the magnitude spectrum of the 2D Fourier transform. To make the magnitudes visible to the human eye, we took the logarithm of the magnitude spectrum; without the logarithm, all of the high values would be concentrated in the centre of the image, in the DC term. Figure 1 shows the average log magnitude spectrum of each of the three characters.

From the figure above, notice that the letter T has high magnitudes along the horizontal and vertical directions. This is due to the high rate of change of intensity in those two directions, corresponding to the vertical and horizontal strokes that form a letter T in the spatial domain. We used this observation when creating the features and decided that regions of high magnitude are the ones that will clearly distinguish the classes. Following this reasoning, we created five filters: rotated box, cross, plus, ring and sector, all of which can be seen in Figure 2. Note that all filters exclude the DC component, located at the centre of the magnitude spectrum. The DC component represents the average brightness of the image in the spatial domain, which is not useful in our model because the brightness of an image reveals nothing about the displayed character. Each filter is applied by extracting the pixels of the log magnitude spectrum located in the filter region and then calculating the average value of those pixels. We can therefore expect all T characters to return higher values than V and S characters when the plus filter is applied.

rotated-box-filter | cross-filter | plus-filter | ring-filter | sector-filter
:---:|:---:|:---:|:---:|:---:
Rotated Box Filter | Cross Filter | Plus Filter | Ring Filter | Sector Filter

Figure 2: Set of features selected from the original set
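The sketch below shows one way such a filter can be applied, assuming each filter is represented as a boolean mask over the log magnitude spectrum (the `log_mag` array from the earlier snippet). The mask construction, including the `band` and `dc_radius` parameters, is illustrative and not the exact geometry used in the report.

```python
import numpy as np

def plus_mask(shape, band=3, dc_radius=5):
    """Boolean mask keeping a band along each axis, with the DC region removed."""
    h, w = shape
    cy, cx = h // 2, w // 2
    mask = np.zeros(shape, dtype=bool)
    mask[cy - band:cy + band + 1, :] = True   # horizontal arm
    mask[:, cx - band:cx + band + 1] = True   # vertical arm
    mask[cy - dc_radius:cy + dc_radius + 1,
         cx - dc_radius:cx + dc_radius + 1] = False  # exclude the DC component
    return mask

def filter_feature(log_mag, mask):
    """Average of the log-magnitude values selected by a binary filter mask."""
    return float(log_mag[mask].mean())

# A T character is expected to score higher than S or V on the plus filter.
plus_value = filter_feature(log_mag, plus_mask(log_mag.shape))
```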

The choice of features is important, as it influences the accuracy of the classification, the time needed for classification and the number of learning examples required. Dimensionality reduction strives for a compact representation of the properties of the data. A compact representation removes redundancy and can be obtained through feature selection and feature extraction. Feature selection involves selecting a subset of the existing features without a transformation. We created a total of 5 features and used a heuristic strategy to select the 2 that give the clearest separation between the classes. We first assumed that all features are independent. We then selected the single best feature using a significance test based on observing the scatter plot matrix shown in Figure 3.

The scatter plot matrix shows every pair of features plotted against each other. On the diagonal, each feature is compared with itself, and we used this information to determine the single best feature. We chose feature 1 (i.e. the cross filter), as it makes the clearest separation between one character (T in this case) and the other two. We chose the second feature by comparing all of the remaining features conditioned on the first one selected. It was a difficult decision, as several features grouped the data into three distinguishable classes, but we selected feature 2 (i.e. the plus filter) because, apart from one outlier, it separates the data into three well-grouped classes.

feature-grid

Figure 3: Comparison between pairs of features
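A scatter plot matrix like Figure 3 can be produced with pandas. This is only a sketch: `features` is assumed to be an (n_samples, 5) array of the five filter responses and `labels` a list of 'S'/'T'/'V' strings, neither of which is named in the original report.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(features,
                  columns=["rotated_box", "cross", "plus", "ring", "sector"])
colours = {"S": "tab:blue", "T": "tab:orange", "V": "tab:green"}

# Plots every feature-pair combination, coloured by class.
pd.plotting.scatter_matrix(df, c=[colours[l] for l in labels], figsize=(8, 8))
plt.show()
```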

Feature Extraction

Feature extraction refers to linear or non-linear transformations of the original variables to a lower-dimensional feature space. It is useful when a feature space is too large, resulting in high computation times. A new set of features is built by mapping the original features such that the transformed vector preserves most of the information given by the original set. An optimal mapping is one that results in no increase in the minimum probability of error. Our feature space is two-dimensional and does not suffer from the Curse of Dimensionality. We experimented with reducing the dimensionality, but were not able to extract a single feature that preserves the clustering given by the two chosen features. This is because all three characters have distinctive magnitude spectra, so a single feature can only distinguish one character from the other two.
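For reference, the snippet below shows the kind of reduction meant here, projecting the two chosen features onto a single component with PCA. The use of PCA is purely illustrative and not necessarily the transformation experimented with in the report; `X` is an assumed (n_samples, 2) matrix of the cross/plus feature values.

```python
from sklearn.decomposition import PCA

# Project the two selected features onto a single principal component.
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)

# If the three clusters overlap along this single axis, the extracted
# feature cannot separate all three classes.
print(pca.explained_variance_ratio_)
```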

Nearest-neighbour Classifier

The k-Nearest Neighbours algorithm (kNN) is a non-parametric, lazy algorithm, meaning that it does not make any assumptions about the underlying model of the data [4]. For instance, it does not assume that the data follows a Gaussian Mixture Model [1]. The algorithm also retains all of the training data for use during the testing phase, which can be costly in terms of memory.

Whilst a very simple algorithm, kNN performs surprisingly well in practice. Below, in Figures 4, 5 and 6, you can see the different decision boundaries obtained for different values of k. kNN uses a majority rule: for each test data point, the algorithm takes the k nearest training data points and looks at their labels. Each of these k training points votes in favour of its own class, and the class of the test point is set to the label with the most votes. The distance measure is usually the Euclidean distance, and in binary classification an odd k is chosen to avoid ties between the two classes. As we have three classes, ties are still possible, but only when the test point lies close to where the boundaries meet.

1nnclassigication_training 3nnclassigication_training
Figure 4: Nearest Neighbour Classification boundaries Figure 5: 3-Nearest Neighbour Classification boundaries
5nnclassigication_training 5nnclassigication_test
Figure 6: 5-Nearest Neighbour Classification boundaries Figure 7: 5-Nearest Neighbour Classification with test data
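A minimal sketch of this setup using scikit-learn. The names `X_train`, `y_train` and `X_test` are assumptions: the X arrays hold the two normalised feature values per sample and `y_train` the 'S'/'T'/'V' labels.

```python
from sklearn.neighbors import KNeighborsClassifier

# k = 5 corresponds to the boundaries shown in Figures 6 and 7.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)
```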

The graph is drawn by taking all the points on a mesh grid and determining their corresponding class. Each point is coloured according to its class, producing a Voronoi-like partition of the feature space into three regions, one per class. The training data points are denoted by circles, while the test data points are represented by stars. The average magnitude values computed from the training data have been normalised using feature scaling, bringing all of the values into the [0, 1] interval. As can be seen in the graph, however, some of the test data falls outside this normalised interval.
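The sketch below shows how the normalised `X_train`/`X_test` used above can be computed and how the decision regions are drawn. `F_train` and `F_test` are hypothetical names for the raw feature matrices, and `knn` is the fitted classifier from the previous snippet.

```python
import numpy as np
import matplotlib.pyplot as plt

# Min-max feature scaling into [0, 1], using the range of the training data.
f_min, f_max = F_train.min(axis=0), F_train.max(axis=0)
X_train = (F_train - f_min) / (f_max - f_min)
X_test = (F_test - f_min) / (f_max - f_min)   # test data may leave [0, 1]

# Colour every point of a mesh grid by its predicted class.
xx, yy = np.meshgrid(np.linspace(-0.2, 1.2, 300), np.linspace(-0.2, 1.2, 300))
label_to_int = {"S": 0, "T": 1, "V": 2}
zz = np.vectorize(label_to_int.get)(knn.predict(np.c_[xx.ravel(), yy.ravel()]))

plt.contourf(xx, yy, zz.reshape(xx.shape), alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], marker="o")  # training points: circles
plt.scatter(X_test[:, 0], X_test[:, 1], marker="*")    # test points: stars
plt.show()
```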

s18 s11 s12 s13 s14 s15 s16-marked s17
t18 t11 t12 t13 t14 t15-marked t16 t17-marked
v18 v11 v12 v13-marked v14 v15 v16 v17

Figure 8: Generated test data points

When designing our test data, we tried to create data points that would show the limits of our classifier. Some peculiar test data points analysed below are marked in Figure 8. For instance, we included a small S, namely S16, which is sharp rather than curvy. Our classifier labels it correctly, but places it near the middle of the graph, meaning it is not particularly confident about its decision. We can understand why by looking at the letter's magnitude spectrum in Figure 9: it is quite different from all of the average spectra we computed, and the regions where a V would be bright are also somewhat brighter than the rest. A similar example is the letter V13. The most interesting cases are T15 and T17. The former is classified as a V because it is tilted, so its magnitude spectrum overlays one of the cross filter's arms. The latter has a curvy top part, similar to how an S would look, and is therefore classified as an S. Examined closely, its plus filter score is just a bit higher than its cross filter score. The magnitude spectrum in Figure 12 shows that, perhaps, the two arms of the plus filter cover the bright areas better than the arms of the cross filter.

While some outliers are present, most of our test data has been classified correctly.

s16_magnitude_spectrum v13_magnitude_spectrum t15_magnitude_spectrum t17_magnitude_spectrum
Figure 9: S16 magnitude spectrum Figure 10: V13 magnitude spectrum Figure 11: T15 magnitude spectrum Figure 12: T17 magnitude spectrum

Classifying A and B

An interesting exercise is to take the classifier trained on the letters S, T and V and test it on other letters, without tweaking or modifying anything, to see how well it performs. We were given some test data for the letters A and B. The letter B is classified as an S. Its magnitude spectrum, shown in Figure 15, reveals that the letter scores extremely high on the plus filter, with values very close to 1. When the horizontal, vertical and oblique lines are taken away, we are left with a magnitude spectrum similar to the average magnitude spectrum of S. This makes sense because B contains many curves, just like the S.

a-and-b
Figure 13: Classification of A and B
a_magnitude_spectrum b_magnitude_spectrum
Figure 14: A magnitude spectrum Figure 15: B magnitude spectrum

Decision Trees

Decision trees [3] are a common supervised classification technique in data mining. The algorithm works by making decisions at successive levels of a tree, depending on which features have been chosen. It uses the features that offer the most information about the data first, i.e. the features that classify the data best. The information gain can be determined using an impurity measure. The sklearn.tree.DecisionTreeClassifier object provides two similar impurity measures: the Gini impurity and the entropy. We decided to use the Gini impurity, as it does not involve the computationally expensive logarithm and is therefore slightly faster. The Gini impurity measures how often a randomly chosen element from the set of samples would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the set. The Gini impurity has the following formula, where c represents the number of classes and p_i the fraction of samples belonging to class i:

G = \sum_{i=1}^{c} p_i (1 - p_i) = 1 - \sum_{i=1}^{c} p_i^2 \qquad (2)

The decision tree is made up of multiple nodes. At each step, the algorithm splits the data points into two sets, until all the data points in a node belong to the same class or until there are no features left to split on. Figure 16 shows that the algorithm does a fairly good job at determining the decision surfaces.
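A minimal sketch of this classifier with scikit-learn, using the same hypothetical `X_train`/`y_train`/`X_test` names as in the kNN section.

```python
from sklearn.tree import DecisionTreeClassifier

# Gini impurity is scikit-learn's default criterion; it is spelled out here
# because the report contrasts it with entropy.
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(X_train, y_train)
print(clf.predict(X_test))
```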

decision-tree decision-tree-graphviz
Figure 16: Decision surface of a decision tree using paired features Figure 17: Diagram of the decision tree

Figure 17 shows a visual representation of the decision tree in a human-understandable manner. The algorithm starts with all the samples in the root node. The first decision is based on the cross feature: the samples are split so that the T training data points are to the left of the decision boundary (i.e. where the average brightness computed using the cross filter is smaller than 0.3416), while the other data points are to its right (i.e. where the values are greater than 0.3416). The left child, reached when the decision condition is True, contains only samples from the second class, so the algorithm is finished on this branch. For the right child, which contains all the sample points from classes 1 and 3, the decision condition is based on the second feature (i.e. the plus feature in our case). At this point the algorithm terminates, because all of the leaves of the tree contain training data points from a single class.
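A diagram like Figure 17 can be generated directly from the fitted tree. The original figure appears to use Graphviz; scikit-learn's built-in plot_tree, shown below with the fitted `clf` from the previous snippet, gives an equivalent picture.

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(8, 6))
plot_tree(clf, feature_names=["cross", "plus"],
          class_names=["S", "T", "V"], filled=True)
plt.show()
```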

Intuitively, decision trees emulate the way humans would try to set the decision boundaries. Ts score low on the cross feature, while the other two letters score relatively high; Vs, on the other hand, score low on the plus feature.

While decision trees do not tend to be as accurate as other approaches and tend to overfit the data if too many features are used, the technique has plenty of advantages. Being a white-box model, it allows for a graphical representation of the tree, making it easy to see why the algorithm took certain decisions and to gain more insight into the data. The Gini impurity allows statistical tests to be applied to determine which features are most relevant to the classification. A disadvantage is that the results are not consistent: small variations in the training data can lead to a completely different tree. To mitigate these disadvantages, choosing a small number of features is recommended. In fact, the decision tree confirmed our assumption that the cross and the plus features are the best: when all five features are fed into the algorithm, it selects those two, and the resulting tree is remarkably similar to the one that uses only two features.

Comparing the Decision Tree classifier with the k-Nearest Neighbours classifier, we can see that the results are quite similar, even if the boundaries generated by the decision tree are less precise. The mathematics behind the decision tree classifier is more involved, but it offers another way of visualising the data. Decision trees are less affected by outliers, whereas with kNN all points can have an equal "weighting" in building the decision boundary. In practice, naive kNN is faster than a naive decision tree. Assuming the decision tree is balanced, we have the following time complexities [5]:

  • naive kNN: O(1) training time + O(NK) query time = O(NK)
  • naive decision tree: O(N² · K · log N) training time + O(log N) query time = O(N² · K · log N)

Conclusion

This report explored how Fourier Space Analysis can be used in letter recognition. It first gave a brief explanation of what the Fourier Domain is and how to apply the Fourier Transform in order to move to the frequency domain. Then, we discussed how to select and extract features from the Fourier Space to get the best possible classification. Afterwards, we examined two supervised classification algorithms and tested them on test data we generated.

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[2] Sanjeev R. Kulkarni. Lecture Notes for ELE201 Introduction to Electrical Signals and Systems. 2002.

[3] scikit-learn: Decision Trees. May 2017. url: http://scikit-learn.org/stable/modules/tree.html

[4] scikit-learn: Nearest Neighbours. May 2017. url: http://scikit-learn.org/stable/modules/neighbors.html

[5] Why is KNN much faster than decision tree? May 2017. url: http://stackoverflow.com/questions/15428282/why-is-knn-much-faster-than-decision-tree
