Image to LaTeX Converter




Automating redundant tasks helps people work on more interesting things. In the era of smartphones and abundant computing power, one such redundant and laborious task is rewriting handwritten expressions in a digital format.

Most people find it comfortable to derive and think through mathematics while writing on paper or on a whiteboard. To communicate these mathematical ideas with other people, it is beneficial to typeset them using standard tools such as LaTeX. However, transferring the content into this format can be cumbersome and requires considerable human effort and time.

Hence, we propose to build an automated system that converts the mathematical equations in an image into LaTeX, using techniques from computer vision and machine learning for pattern recognition.


Block Diagram

(Figure: overall system block diagram)

Perspective Correction
The user may capture the image of the handwritten mathematical expression from various angles. To achieve better accuracy, we need to correct for the viewing angle to obtain a frontal view.

Binarization
This corrected image needs to be binarized before it is passed to the symbol classifier. This step is challenging because the method must be robust to uneven lighting conditions.

Segmentation
Next, we need to process the binarized image and draw bounding boxes around each symbol.

Symbol Classification
The segmented symbols are then passed to a symbol classifier that is trained on binary input images to output the correct symbol labels. We implemented an SVM classifier using HOG features, and we also implemented a convolutional neural network for this task.

Structure Recognition
To decode mathematical expressions, we need to understand the relations between adjacent symbols, for example superscripts, subscripts, fractions, and so on. Using these relations, we can generate the LaTeX code.
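The sketch below shows how such a relation might be inferred from the relative vertical positions of two bounding boxes; the box format and thresholds are illustrative assumptions, not the project's exact rules.

```python
# Hypothetical sketch: infer the relation between two adjacent symbols
# from their bounding boxes (x, y, width, height), with y growing downward.
def spatial_relation(left, right):
    lx, ly, lw, lh = left
    rx, ry, rw, rh = right
    r_center = ry + rh / 2.0
    if r_center < ly:          # right box sits above the left box's top edge
        return "superscript"
    if r_center > ly + lh:     # right box sits below the left box's bottom edge
        return "subscript"
    return "horizontal"

# Example: 'x' followed by a small raised '2' is read as x^2
print(spatial_relation((0, 10, 8, 10), (9, 2, 5, 5)))  # -> superscript
```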

All these steps are discussed in detail in the following sections.

Pre-processing

(Figure: pre-processing pipeline block diagram)

Canny edges: Edges present in the image are found using the Canny edge detection technique.

Hough transform: The lines corresponding to the clipboard boundary are determined from these edges using the Hough transform (see the sketch after this list).

Boundary: The four corner points are found by determining the intersection of the boundary lines.

Homography: The four points so determined are used to correct the perspective distortion using a homography.

Binary image: The image is then binarized with a threshold on intensity.

Corners: Harris features are detected on this image to obtain the cluster center of the text.

Image cropping: Image is cropped around this cluster center to get the text box.

Connected components: The text is segmented and each character is extracted from this text box.
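The sketch below illustrates the first three steps (edges, lines, corner intersections) using OpenCV in Python; the original pipeline was implemented in MATLAB, and the parameter values here are illustrative.

```python
import cv2
import numpy as np

img = cv2.imread("board.jpg")                       # hypothetical input photo
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

edges = cv2.Canny(gray, 50, 150)                    # Canny edge map
lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)  # lines as (rho, theta) pairs

def intersect(l1, l2):
    """Intersection of two (rho, theta) lines, solving
    x*cos(theta) + y*sin(theta) = rho for both lines."""
    (r1, t1), (r2, t2) = l1, l2
    A = np.array([[np.cos(t1), np.sin(t1)],
                  [np.cos(t2), np.sin(t2)]])
    return np.linalg.solve(A, np.array([r1, r2]))

# Intersecting the near-horizontal with the near-vertical boundary lines
# yields the four corner points (line selection logic omitted here).
```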


Classification

Block Diagram

We use a CNN classifier to classify the symbols in the mathematical expression.

Approach

Any classification problem has two steps: training and testing. We trained the classifier for 57 symbols.

Training

The following steps were carried out to train on the symbol dataset.

I. Data Collection

The CROHME dataset consists of .inkml files. These files contain the information of all the strokes in a symbol along with their time stamps.

The first step in our data collection was to convert these InkML files to images. We created a Python script to extract the InkML data into a CSV file. This CSV file contained all the strokes for each symbol along with its label. Using this CSV file, we generated images using MATLAB.

Dataset cleanup: Once we had the symbol images and their labels, we observed that there were a lot of errors in the dataset. Some of the symbols were mislabeled, and some of them were junk data. Hence, we cleaned up the dataset to minimize training errors. All the images of the symbols and their corresponding labels have been uploaded under "Processed Dataset" on this page. We have also uploaded the Python script and MATLAB code used to generate the data.

II. SVM

We trained a linear SVM on conventional HOG (Histogram of Oriented Gradients) features.
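A minimal sketch of this branch is shown below, assuming 64x64 symbol images and using scikit-image and scikit-learn; `train_images`, `train_labels`, and `new_symbol` are hypothetical placeholders, and the HOG parameters are common defaults rather than the project's exact settings.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(img):
    # HOG over 8x8 cells with 9 orientation bins (typical defaults)
    return hog(img, orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))

X = np.array([hog_features(im) for im in train_images])  # hypothetical data
clf = LinearSVC()
clf.fit(X, train_labels)

label = clf.predict([hog_features(new_symbol)])[0]       # classify one symbol
```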

III. Convolutional Neural Network (CNN)

We trained a CNN [1] with 3 convolutional layers and 2 linear layers. Training used dropout regularization [2] and the stochastic gradient descent method.
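The sketch below shows one possible PyTorch realization of a network with this shape (3 convolutional layers, 2 linear layers, dropout) for 64x64 inputs and 57 classes; the filter counts and kernel sizes are assumptions, not the trained network's exact configuration.

```python
import torch.nn as nn

class SymbolCNN(nn.Module):
    def __init__(self, num_classes=57):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 16 -> 8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                 # dropout regularization [2]
            nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):                      # x: (batch, 1, 64, 64)
        return self.classifier(self.features(x))
```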

Dropout: With CNNs, overfitting is a problem that can reduce test accuracy. We used the dropout method to avoid overfitting. In this technique, the output of each hidden node is set to 0 with probability 0.5 during training. When a node's output is 0, it participates in neither the feed-forward nor the backward propagation path. This forces the other nodes to learn more complicated and generalized features instead of relying on the presence of a particular node to give them a certain input. A different subnetwork is active on every pass, which keeps individual nodes from becoming over-specialized for the classification. The downside is that training time roughly doubles, but test accuracy improves because the network does not overfit. During the testing stage, no nodes are dropped, so to compensate for the larger number of active nodes compared to the training stage, we scale the output of each node by half. The training was done on GPUs and took approximately four hours.

Stochastic Gradient Descent: This technique is used to update the weights of the nodes. It is helpful because plain (batch) gradient descent needs the entire dataset in memory while computing the gradient, whereas here we pick a random sample (or mini-batch), compute the gradient on it, and update the weights accordingly. The weights are moved opposite to the gradient of the loss function, and the process converges when the weight values stop changing. Stochastic gradient descent has been shown to converge faster than batch gradient descent in some cases, and for convex problems it approaches the global minimum. A sketch of such a training loop is shown below.
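This sketch trains the `SymbolCNN` above with SGD; the learning rate, momentum, epoch count, and `train_loader` are illustrative placeholders.

```python
import torch
import torch.nn as nn

model = SymbolCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for images, labels in train_loader:   # random mini-batches (hypothetical loader)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()                   # gradient of the loss on this batch only
        optimizer.step()                  # step against the gradient direction
```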

Testing

Testing is divided into the following parts:

I. Testing only the Classifier Performance:

Here, the performance of the SVM/CNN classifiers is measured by providing images of symbols from the dataset as input. This gives the classification accuracy on the existing dataset.

II. Testing the Im2Latex System

The second part of testing involves taking an image of a mathematical equation written on a board or in an editor and feeding it to the Im2Latex system as input; in the real world, this is the intended use case. The entire flow from capturing the image to the LaTeX output is explained below.

1. Input:

The input to the Im2Latex system is an image containing a mathematical equation. We implemented two mechanisms for capturing the input image: through a webcam, and by taking a snapshot of a drawing made in an editor. In the first method, the user captures a picture of the handwritten equation with a webcam; the image capture and further processing are integrated within MATLAB. In the second case, a screen snapshot of a drawing editor (source code given under the Canvas tab) is taken and processed in MATLAB.

2. Image Rectification:

When the input image comes from a webcam, it usually has perspective distortion. If this image were fed directly to symbol extraction and classification, the symbols would be distorted, leading to misclassification. Hence, we perform perspective correction using a homography. The details of the pre-processing that removes the perspective distortion are discussed below.

• We assume that mathematical equations are written on a whiteboard or paper with a rectangular boundary, so the technique to correct perspective distortion is to find the corners of the board or paper. If the image has perspective distortion, these points will not geometrically be the corners of a rectangle. Our next step is to compute an approximation of these corners that does fit a rectangle, by taking the largest observed dimensions (length/breadth) of the distorted rectangle. To locate the corners, we use the Hough transform to find horizontal and vertical line segments in the image; the intersections of the horizontal and vertical lines are taken as the rectangle corners.

• Once we know the mapping from the corners of the distorted rectangle to the geometrically correct rectangle, we determine the homography matrix, which can be solved from 4 corresponding pairs of (x, y) coordinates.

• The homography transform is applied to the entire image to remove the perspective distortion, as sketched below.
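A minimal OpenCV sketch of the rectification step follows (the project used MATLAB; the corner coordinates here are example values):

```python
import cv2
import numpy as np

# Four detected board corners, ordered TL, TR, BR, BL (example values)
corners = np.float32([[120, 80], [980, 60], [1010, 700], [90, 720]])
w, h = 1000, 700   # target rectangle from the largest observed side lengths
target = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

# Four point correspondences determine the 3x3 homography exactly
H = cv2.getPerspectiveTransform(corners, target)
rectified = cv2.warpPerspective(img, H, (w, h))  # `img` is the captured photo
```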

3. Text Region Identification:

It is important to locate the region of the image where text is present; this is needed for the robust binarization discussed below. To extract the text region, we find Harris corners and estimate the centroid of the corner positions. The text portion is extracted by cropping the image around the Harris corner centroid with a margin of 2-3 standard deviations in the x and y directions.
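The sketch below illustrates this step with OpenCV; `gray` is the rectified grayscale image, and the response threshold is illustrative.

```python
import cv2
import numpy as np

response = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
ys, xs = np.where(response > 0.01 * response.max())  # strong corner pixels

cx, cy = xs.mean(), ys.mean()    # cluster center of the text
sx, sy = xs.std(), ys.std()
k = 3                            # margin of 3 standard deviations
x0, x1 = int(cx - k * sx), int(cx + k * sx)
y0, y1 = int(cy - k * sy), int(cy + k * sy)
text_box = gray[max(y0, 0):y1, max(x0, 0):x1]
```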

4. Binarization and Segmentation:

The image containing text is then binarized using an automatically chosen threshold. Segmentation is done with the help of connected components: connected component analysis takes the text image as input and produces a label matrix in which all pixels belonging to a particular letter or symbol receive the same label. Next, a bounding box is drawn around each symbol, and each symbol is extracted and resized to a 64x64 image.
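A sketch of this step with OpenCV follows, using Otsu's method as the automated threshold (`text_box` is the cropped text region from the previous step; the noise-area cutoff is illustrative).

```python
import cv2

_, binary = cv2.threshold(text_box, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
symbols = []
for i in range(1, n):                    # label 0 is the background
    x, y, w, h, area = stats[i]
    if area < 20:                        # drop tiny noise blobs
        continue
    crop = binary[y:y + h, x:x + w]
    symbols.append(cv2.resize(crop, (64, 64)))
```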

5. Classification:

Once each symbol is extracted as a 64x64 image, it needs to be classified into a particular LaTeX symbol category. In the case of the SVM, Histogram of Oriented Gradients (HOG) features are extracted and sent to the classifier. In the case of the CNN classifier, the symbol image is fed in directly, since a CNN is able to extract features by itself.

6. LaTeX Mapping:

The output of the classification is the class label to which each input symbol belongs. In this experiment, we trained a total of 57 LaTeX symbols. These labels are then used to generate LaTeX code with the help of a look-up table.
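An illustrative fragment of such a look-up table is shown below; the class indices and the selection of symbols are made up for the example, while the full table covers all 57 trained symbols.

```python
# Hypothetical class-index-to-LaTeX mapping (fragment)
LATEX_LUT = {
    0: "0", 1: "1", 10: "+", 11: "-", 12: "=",
    20: r"\alpha", 21: r"\beta", 30: r"\int", 31: r"\sum",
}

def to_latex(class_labels):
    return "$" + " ".join(LATEX_LUT[c] for c in class_labels) + "$"
```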

7. Final Output:

Once the LaTeX code is generated, it is sent to a LaTeX compiler, and the output is displayed as a LaTeX document containing the mathematical equation translated into standard LaTeX.
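As a sketch, the generated code can be wrapped in a minimal document and handed to a LaTeX compiler (assuming pdflatex is installed; `predicted_labels` is a hypothetical placeholder):

```python
import subprocess

doc = ("\\documentclass{article}\n\\begin{document}\n"
       + to_latex(predicted_labels)
       + "\n\\end{document}\n")
with open("output.tex", "w") as f:
    f.write(doc)
subprocess.run(["pdflatex", "output.tex"], check=True)
```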

Experiments and results

Images were successfully converted to LaTeX code with the following accuracy:
Using SVM: 68.83 %
Using CNN: 92.19 %

Comments / Feedback

Currently, we train our classifier on images converted to binary form. Some information could be lost during this process, so it might have been better to train the classifier on the original images.

Conclusion and future work

We have implemented an automated system that takes an image of a handwritten mathematical equation and converts it to LaTeX code.

The equation structure can be determined using the relative locations of the bounding boxes in the pre-processed image. This project could be extended to convert the board notes in online or classroom videos into digital notes.

References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[2] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[3] CROHME datasets: http://ivc.univ-nantes.fr/CROHME/datasets.php
[4] CROHME tasks: http://www.isical.ac.in/~crohme/CROHME_tasks2.html