Showing posts with the label "Computer Vision". Show all posts

Tuesday, January 12, 2016

[Machine Learning | Computer Vision | DeepID] Deep Learning Face Representation from Predicting 10,000 Classes

Overview


The main task of this paper is human face verification, a sub-problem of verification. Unlike classification, a verification problem takes two instances as input and decides whether they belong to the same class.

At the end of the paper, the authors report that the algorithm reaches 97.45% verification accuracy on the LFW face data set, second only to human accuracy (97.53%).

Unlike other verification algorithms, the whole process is split into two parts: high-level feature generation and face verification. The paper focuses on how to use deep learning (a ConvNet producing DeepID features) to generate the high-level features rather than on the verification algorithm itself.

DeepID (High-level Feature Generation)


DeepID is essentially a set of high-level feature vectors. Each vector is generated by a deep model (a ConvNet in this paper) and is taken from the network's last hidden layer. The structure is shown in this figure:


To lead the deep models to generate more effective features, each model is trained as a multi-class classifier that identifies each instance (which class/face it belongs to) rather than as a binary classifier that verifies two instances (same class/face or not). The authors want to make full use of the network's learning capacity (identification "adds a strong regularization to ConvNets" and yields "shared hidden representations that can classify all the identities well") so that the high-level features generalize well and do not over-fit to a small subset of the training faces.

The deep models are trained with different parameters (network structures) and on different parts of the original data. The paper uses 60 ConvNets, and each face image is split into 60 different face patches (10 regions, three scales, and RGB or gray channels).
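The count of 60 patches follows from combining the three factors listed above. A minimal sketch of that enumeration, where the region and scale labels are hypothetical placeholders (the paper's actual crop definitions are not reproduced here):

```python
from itertools import product

# Hypothetical labels: only the counts (10 regions, 3 scales, 2 channel types)
# come from the text; the names themselves are placeholders.
REGIONS = [f"region_{i}" for i in range(10)]
SCALES = ["small", "medium", "large"]
CHANNELS = {"rgb": 3, "gray": 1}  # channel type -> number of input channels k


def enumerate_patches():
    """Return one (region, scale, channel_type, k) tuple per face patch."""
    return [
        (region, scale, channel, k)
        for region, scale, (channel, k) in product(REGIONS, SCALES, CHANNELS.items())
    ]


patches = enumerate_patches()
print(len(patches))  # 60, one ConvNet per patch
```

Each tuple would correspond to one ConvNet trained on that particular crop of the face.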

Deep ConvNet



Based on the properties of ConvNets, each hidden layer is a new set of features. In this paper, the number of neurons decreases layer by layer, which forces the ConvNet to summarize the information and produce more global, higher-level features at the top layers. The paper uses four convolutional layers (with max-pooling) to extract high-level features.

Input Layer: "39 x 31 x k for rectangle patches, and 31 x 31 x k for square patches and k = 3 for color patches and k = 1 for gray patches"


DeepID Layer: "The dimension of DeepID layer is fixed to 160" and "DeepID layer is fully connected to both the third and fourth convolutional layers (after max-pooling)"


Output Layer: "The dimension of output layer varies according to the number of classes it predicts". The authors use an n-way softmax to predict the probability distribution over n classes.


$$y_{i} = \frac{\exp\left ( y_{i}^{'} \right )}{\sum_{j=1}^{n}\exp\left ( y_{j}^{'} \right )}$$

$$y_{j}^{'} = \sum_{i=1}^{160}x_{i}\cdot w_{i,j} + b_{j}$$

"The ConvNet is learned by minimizing: $-\log{y_{t}}$ with the $t$-th target class"


Hidden Neurons: the ReLU function ($y = \max\left ( 0, x \right )$) is used for this ConvNet
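The output-layer formulas and the ReLU activation above can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: the function names are made up, and the loss is written in the standard cross-entropy form, i.e. the negative log-probability of the target class.

```python
import math


def relu(x):
    """ReLU activation: y = max(0, x)."""
    return max(0.0, x)


def linear_logits(x, w, b):
    """y'_j = sum_i x_i * w[i][j] + b_j for every output class j."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def softmax(logits):
    """n-way softmax: y_i = exp(y'_i) / sum_j exp(y'_j)."""
    shift = max(logits)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identification_loss(logits, target):
    """Negative log-probability of the target class, -log(y_t)."""
    return -math.log(softmax(logits)[target])


# Tiny example: a 2-dimensional feature vector and 3 classes
# (the paper uses a 160-dimensional DeepID vector).
x = [relu(-0.5), relu(2.0)]
w = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
b = [0.0, 0.0, 0.0]
probs = softmax(linear_logits(x, w, b))
print(probs, identification_loss(linear_logits(x, w, b), target=2))
```

The loss shrinks toward zero as the probability assigned to the target class approaches 1.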



Face Regions (Input Data)



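A total of 60 face patches are used per face: ten regions, three scales, and RGB or gray channels. Faces are aligned using the two eye centers and the mid-point of the two mouth corners.

Top: ten regions at medium scale. The regions on the left are global regions; the regions on the right are local regions centered on the five facial landmarks (two eye centers, nose tip and two mouth corners).


Bottom: three scales, shown for two of the patches.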



Face Verification


Two different techniques (algorithms) are applied to the face verification task. The first one is Joint Bayesian; according to the paper, "Joint Bayesian has been highly successful for face verification". The second is a neural network, used as a comparison "to see if other models can also learn from the extracted features and how much the features and a good face verification model contribute to the performance, respectively".
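To make the second stage concrete, here is a deliberately simplified stand-in verifier. It is neither Joint Bayesian nor the neural network from the paper: it thresholds the cosine similarity between two feature vectors, and both the function names and the threshold value are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_identity(feat_a, feat_b, threshold=0.5):
    """Stand-in verifier: declare 'same person' if the features are similar enough.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out pairs.
    """
    return cosine_similarity(feat_a, feat_b) >= threshold


print(same_identity([1.0, 0.2, 0.0], [0.9, 0.3, 0.1]))  # similar vectors
print(same_identity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal vectors
```

The point of the sketch is the interface: the verifier only sees the extracted features of the two faces, never the raw images.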

Joint Bayesian



Neural Network


Reference

Tuesday, December 22, 2015

[Machine Learning | CNN] Gradient-based learning applied to document recognition


Overview


To solve the on-line handwritten document recognition problem, the paper introduces two algorithms: a convolutional neural network for recognizing individual handwritten characters, and a graph transformer network for field extraction, segmentation, recognition and language modeling.


Convolutional Neural Network (CNN)

According to the paper, a basic fully-connected multi-layer network does not always perform well on image classification problems. The reasons can be summarized as follows:

1) Images are large, so the first hidden layer potentially needs a large number of hidden units, which in turn means more weights to train (large input - more units - more weights - more data/time required).

2) It cannot handle translation, scaling, rotation or local distortions of the input data.

3) Image (or audio) data often has to be normalized at the word level (or by some other group) to fit the input layer size, which can cause size, slant and position variations for individual characters.
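The first point can be made concrete by counting parameters. The sketch below assumes a hypothetical first hidden layer of 100 units (that number is not from the paper) and shows how the weight count grows with the input size.

```python
def fully_connected_params(n_inputs, n_hidden):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_hidden + n_hidden


# Hypothetical hidden layer of 100 units fed by square images of growing size.
for side in (16, 32, 64):
    print(side, fully_connected_params(side * side, 100))
```

Doubling the image side roughly quadruples the number of weights in the first layer.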



To deal with these problems, the paper presents a network structure called the Convolutional Neural Network. Unlike the basic fully-connected artificial neural network, a CNN has a more localized structure. The three key ideas are: local receptive fields, shared weights (weight replication) and spatial/temporal sub-sampling.

local receptive fields: "Each unit in a layer receives inputs from a set of units located in a small neighborhood in the previous layer". In other words, a feature depends only on its neighbors; features are local rather than global. The paper notes that this idea is almost simultaneous with "Hubel and Wiesel's discovery of locally-sensitive, orientation-selective neurons in the cat's visual system". The mechanism therefore assumes that the input data contains local features that can be extracted.

shared weights (weight replication): "Units in a layer are organized in planes within which all the units share the same set of weights". Each plane is a feature map, and each feature map is applied in turn to every position of the previous layer; "this operation is equivalent to a convolution". Empirically, this process leads the feature maps to learn meaningful "filters" automatically. For example, they can pick up mid-level features such as edges and corners in an image. Weight sharing also reduces the number of weights that need to be trained.
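A minimal sketch of one feature map, assuming a single input plane: the same small kernel (the shared weights) is slid over every position of the input. As in most CNN implementations, the kernel is not flipped, so strictly speaking this is a cross-correlation. The hand-made edge kernel is illustrative only; in a trained network the weights are learned.

```python
def convolve_valid(image, kernel, bias=0.0):
    """Apply one shared kernel at every position where it fits entirely."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(n_rows - k_rows + 1):
        row = []
        for c in range(n_cols - k_cols + 1):
            acc = bias
            # Local receptive field: only the k_rows x k_cols neighborhood is read.
            for i in range(k_rows):
                for j in range(k_cols):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# A dark-to-bright vertical edge and a hand-made 1 x 2 edge detector.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1]]
print(convolve_valid(image, kernel))  # [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
```

The output responds only where the edge is, regardless of which row it appears in, because every position uses the same two weights.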


spatial/temporal sub-sampling: "Only its approximate position relative to other features is relevant", and sub-sampling is based on this assumption. "sub-sampling layers which performs a local averaging and a sub-sampling, reducing the resolution of the feature map, and reduce the sensitivity of the output to shifts and distortions". Based on this description, besides reducing the data size to speed up training, sub-sampling also makes the network less sensitive to image translation, scaling and rotation.


The structure of LeNet-5 network



Input Layer :


The input is a 32 x 32 pixel image, "significantly larger than the largest character in database", so that potentially distinctive features can appear in the center of the receptive field of the highest-level feature detectors.


Values are normalized to [-0.1, 1.175]; "it makes the mean input roughly 0 and the variance roughly 1"
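A small sketch of this kind of normalization, assuming each pixel is first expressed as an ink level in [0, 1] where 0 is background and 1 is foreground. That input convention and the function name are assumptions; only the target range [-0.1, 1.175] comes from the text.

```python
BACKGROUND_VALUE = -0.1   # lower end of the target range
FOREGROUND_VALUE = 1.175  # upper end of the target range


def normalize_pixel(ink):
    """Linearly map an ink level in [0, 1] to the range [-0.1, 1.175]."""
    if not 0.0 <= ink <= 1.0:
        raise ValueError("ink level must be in [0, 1]")
    return BACKGROUND_VALUE + ink * (FOREGROUND_VALUE - BACKGROUND_VALUE)


print(normalize_pixel(0.0), normalize_pixel(0.5), normalize_pixel(1.0))
```

Because most pixels in a character image are background, placing the background slightly below zero is what pulls the mean input toward 0.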


Convolution Layer :


Each convolution layer contains multiple feature maps of the same size. In LeNet-5, each unit (a pixel in the next layer) has a 5x5 receptive field, i.e. 25 trainable coefficients plus a trainable bias. All units in a feature map share the same 25 weights and the same bias, so each feature map in the first convolution layer has 25 + 1 weights to train.


Because of the convolution, the filter cannot be centered on the boundary pixels, so an n x n input produces (n - m + 1) x (n - m + 1) units in the next layer (in LeNet-5, m = 5).
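The size formula and the per-map parameter count can be checked with a few lines. The helper names are hypothetical. The count of 6 feature maps for the first convolution layer is taken from the LeNet-5 paper rather than from the text above, and the 25 + 1 figure only applies to maps that read a single input plane.

```python
def feature_map_size(n, m):
    """Side length of the output when an m x m filter slides over an n x n input."""
    if m > n:
        raise ValueError("filter larger than input")
    return n - m + 1


def shared_params_per_map(m):
    """m * m shared weights plus one shared bias (single input plane)."""
    return m * m + 1


NUM_C1_MAPS = 6  # number of maps in LeNet-5's first convolution layer (from the paper)

print(feature_map_size(32, 5))                 # 28
print(shared_params_per_map(5))                # 26
print(NUM_C1_MAPS * shared_params_per_map(5))  # 156
```

So a 32 x 32 input yields 28 x 28 feature maps, and the whole first layer needs only 156 trainable parameters.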


"Each convolution process followed by an additive bias and squashing function".

Sub-sampling Layer :


The receptive field in LeNet-5 is 2x2, with no overlap. "Each unit compute the average of its four inputs, multiplies it by a trainable coefficient, add a trainable bias and passes the result through a sigmoid function" (two trainable weights plus a nonlinearity).
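The sub-sampling unit described above can be sketched as follows. The logistic sigmoid is an assumption here, since the quoted text only says "a sigmoid function", and the function names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic sigmoid (assumed form of the squashing function)."""
    return 1.0 / (1.0 + math.exp(-x))


def subsample(feature_map, coefficient, bias):
    """Average each non-overlapping 2x2 block, scale, shift and squash it."""
    out = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            avg = (feature_map[r][c] + feature_map[r][c + 1]
                   + feature_map[r + 1][c] + feature_map[r + 1][c + 1]) / 4.0
            # Only two trainable values per map: the coefficient and the bias.
            row.append(sigmoid(coefficient * avg + bias))
        out.append(row)
    return out


fmap = [[0, 0, 4, 4],
        [0, 0, 4, 4],
        [2, 2, 0, 0],
        [2, 2, 0, 0]]
print(subsample(fmap, coefficient=1.0, bias=0.0))
```

A 4 x 4 map becomes 2 x 2, so each sub-sampling layer halves the resolution in both directions.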


S2-C3 :


S4-C5-F6 :


Output Layer : 


Loss Function : 


Graph Transformer Networks (GTN)


Reference


[1] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition", Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[2] Paper: IEEE Page