Tuesday, December 22, 2015

[Machine Learning | CNN] Gradient-based learning applied to document recognition


Overview


To solve the on-line handwritten document recognition problem, this paper introduces two techniques: convolutional neural networks for recognizing individual handwritten characters, and graph transformer networks for field extraction, segmentation, recognition, and language modeling.



Convolutional Neural Network (CNN)

According to the paper, a basic fully-connected multi-layer network does not always perform well on image classification problems. The reasons can be summarized as follows:

1) Images are large, so the first hidden layer potentially needs a large number of hidden units, which in turn means many more weights to train (large input - more units - more weights - more data/time required). A rough count follows the list below.

2) It has no built-in invariance to translation, scaling, rotation, or local distortions of the input data.

3) Image (or audio) data sets often need to be normalized at the word level (or by some other grouping) to fit the fixed input size, which introduces variations in the size, slant, and position of individual characters.
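To put a rough number on point 1: the paper's 32 x 32 input has 1,024 pixels, so fully connecting it to even 100 hidden units (an arbitrary illustrative number) already requires 32 x 32 x 100 = 102,400 weights, plus biases, in the first layer alone.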



To deal with those problems, the paper proposes a new network structure called the Convolutional Neural Network. Unlike the basic fully-connected artificial neural network, a CNN uses a more localized connectivity structure. The three key ideas are: local receptive fields, shared weights (weight replication), and spatial/temporal sub-sampling.

Local receptive fields: "Each unit in a layer receives inputs from a set of units located in a small neighborhood in the previous layer". Put another way, each feature depends only on its neighbors; features are extracted locally rather than globally. This idea "was almost simultaneous with Hubel and Wiesel's discovery of locally-sensitive, orientation-selective neurons in the cat's visual system." The mechanism implicitly assumes that the input data contains local features that can be extracted.

Shared weights (weight replication): "Units in a layer are organized in planes within which all the units share the same set of weights". Each plane is a feature map; the same set of weights is applied at every position of the previous layer, so "this operation is equivalent to a convolution". In practice, training tends to turn each feature map into a meaningful "filter"; for example, it can learn mid-level features such as edges and corners from an image. Weight sharing also greatly reduces the number of weights that need to be trained. A minimal sketch of this operation appears below.
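To make the local receptive field and weight sharing concrete, here is a minimal numpy sketch of a single feature map: one 5x5 kernel and one bias reused at every position of the input. The function name is my own, and a plain tanh stands in for the paper's scaled-tanh squashing function.

import numpy as np

def conv2d_valid(image, kernel, bias):
    # "Valid" convolution: the shared kernel and bias slide over every
    # position, so an n x n input yields an (n - m + 1) x (n - m + 1) map.
    n, m = image.shape[0], kernel.shape[0]
    out = np.zeros((n - m + 1, n - m + 1))
    for i in range(n - m + 1):
        for j in range(n - m + 1):
            patch = image[i:i + m, j:j + m]            # local receptive field
            out[i, j] = np.sum(patch * kernel) + bias  # same 25 weights + 1 bias everywhere
    return np.tanh(out)                                # squashing function (tanh as a stand-in)

image = np.random.randn(32, 32)        # toy 32 x 32 input
kernel = 0.1 * np.random.randn(5, 5)   # one shared set of 25 weights
feature_map = conv2d_valid(image, kernel, bias=0.0)
print(feature_map.shape)               # (28, 28)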


Spatial/temporal sub-sampling: sub-sampling is based on the assumption that once a feature has been detected, "only its approximate position relative to other features is relevant". The "sub-sampling layers perform a local averaging and a sub-sampling, reducing the resolution of the feature map and reducing the sensitivity of the output to shifts and distortions". So besides shrinking the data to speed up training, sub-sampling also reduces the problems caused by translation, scaling, and rotation of the input; a small sketch follows.
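Here is a minimal numpy sketch of the 2x2 non-overlapping sub-sampling unit described later for LeNet-5: average four inputs, scale by a trainable coefficient, add a trainable bias, and squash. The function name is my own, and a plain logistic sigmoid stands in for the paper's squashing function.

import numpy as np

def subsample_2x2(feature_map, coeff, bias):
    # Average each non-overlapping 2x2 block, then apply the trainable
    # coefficient and bias shared by the whole map, then a sigmoid.
    h, w = feature_map.shape
    pooled = feature_map.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return 1.0 / (1.0 + np.exp(-(coeff * pooled + bias)))

fm = np.random.randn(28, 28)                         # toy feature map from the previous layer
print(subsample_2x2(fm, coeff=1.0, bias=0.0).shape)  # (14, 14) - resolution halved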


The structure of the LeNet-5 network



Input Layer :


The input is a 32 x 32 pixel image, "significantly larger than the largest character in the database", so that potential distinctive features can appear near the center of the image.


Pixel values are normalized to [-0.1, 1.175], which "makes the mean input roughly 0 and the variance roughly 1".
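A quick sketch of that normalization, assuming 8-bit grayscale pixels where 0 is the white background and 255 is full black ink (an assumption about the raw data, not something stated in these notes):

import numpy as np

def normalize_pixels(img_uint8):
    # Linearly map the background (assumed value 0) to -0.1 and full
    # foreground ink (assumed value 255) to 1.175, the range quoted above.
    return -0.1 + (img_uint8.astype(np.float32) / 255.0) * (1.175 + 0.1)

img = np.zeros((32, 32), dtype=np.uint8)  # toy blank 32 x 32 image
x = normalize_pixels(img)
print(x.min(), x.max())                   # -0.1 -0.1 for a blank image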


Convolution Layer :


Each convolution layer contains multiple feature maps of the same size. In LeNet-5, each unit (a pixel in the next layer) is connected to a 5x5 neighborhood of the previous layer, i.e. 25 trainable coefficients plus a trainable bias. All units within a feature map share the same set of 25 weights and the same bias (so each feature map has only 25 + 1 weights to train).


Because of the convolution, the kernel cannot be centered on boundary pixels, so for an n x n input and an m x m kernel the next layer has (n - m + 1) x (n - m + 1) units (in LeNet-5, m = 5).


"Each convolution process followed by an additive bias and squashing function".

Sub-sampling Layer :


The receptive field in LeNet-5 is 2x2, with no overlapping. "Each unit computes the average of its four inputs, multiplies it by a trainable coefficient, adds a trainable bias, and passes the result through a sigmoid function" (two trainable weights per feature map, plus a nonlinearity).
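Worked out for S2: each 28 x 28 map from C1 shrinks to 14 x 14, and with 6 feature maps at 2 trainable parameters each the whole layer has only 6 x 2 = 12 trainable parameters.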


S2-C3 :


S4-C5-F6 :


Output Layer : 


Loss Function : 


Graph Transformer Networks (GTN)


References


[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[2] Paper: IEEE Page