Abstract
Incremental learning is a sub research area of machine learning. Most research work are focus on instance incremental learning (Training data increase) which is the number of classes is fixed but the training data is increased by time. Class incremental learning means the number of classes may increase by time and the learning algorithm should classify all new and old classes without re-train the entire algorithm. But currently this problem is not well studied by previous paper.
Algorithms
In this summary, we introduce 12 papers, from 2001 to 2014. By analysis their approach, we can separate those class incremental learning papers into 3 classes - single classifier [6, 7, 8], two layer multi classifier [1, 2, 3, 4, 5] and tree structure multi classifier. [9, 10, 11, 12].
Single Classifier
Two Layer Multi Classifier
Learn++ [1] is a typical two layer class incremental learning algorithm, they use the first layer (they call it weak learner) to learning how to classify each instance, then like Ada-boost algorithm, they use vote strategy to make the final decision.
Different with Ada-boost algorithm, all the weak classifier are not trained at the same time and same class number. Each time a new set of data comes in, they will generate and training a new set of weak classifier it with current data set and current class number. So the lasted set of weak classifier can classify all classes for currently.
But with the time goes on, the number of weak classifier will increase, and they will need more new classifiers to cover the wrong decision (generated by previous classifier).
Learn++ .NC [2] is a new version of Learning++ algorithm to fix those problem. They came up a strategy to avoid the previous classifiers make a wrong decision when the current instance they haven't learned.
Paper [3, 4, 5] gives other versions of Learn++ algorithm to solve unbalance data, class changes and concept drift ($P\left ( C \right )$ changes, $P\left ( F \right )$ changes and $P\left ( C | F\right )$ changes, $C$ is denote class space distribution, $F$ is denote feature space distribution)
Different with Ada-boost algorithm, all the weak classifier are not trained at the same time and same class number. Each time a new set of data comes in, they will generate and training a new set of weak classifier it with current data set and current class number. So the lasted set of weak classifier can classify all classes for currently.
But with the time goes on, the number of weak classifier will increase, and they will need more new classifiers to cover the wrong decision (generated by previous classifier).
Learn++ .NC [2] is a new version of Learning++ algorithm to fix those problem. They came up a strategy to avoid the previous classifiers make a wrong decision when the current instance they haven't learned.
Paper [3, 4, 5] gives other versions of Learn++ algorithm to solve unbalance data, class changes and concept drift ($P\left ( C \right )$ changes, $P\left ( F \right )$ changes and $P\left ( C | F\right )$ changes, $C$ is denote class space distribution, $F$ is denote feature space distribution)
Tree Structured Multi Classifier
Different with two layer learning algorithm, [9, 10, 11, 12] gives a new set of algorithms with tree structure. The tree structure can potentially give a hierarchical analysis for each classes (like cat and dog are sub-class of animal), and also the tree structure can keep the new class update in a small area (sub-tree) to optimize the training speed.
Paper [9] using several binary support vector machine (SVM) to build a binary decision tree. Each binary SVM will learning one specific class (one vs all classifier). The instance been marked as "other" will send to next level. When a new class comes, a new binary SVM will be created and trained as a new root node.
Paper [9] using several binary support vector machine (SVM) to build a binary decision tree. Each binary SVM will learning one specific class (one vs all classifier). The instance been marked as "other" will send to next level. When a new class comes, a new binary SVM will be created and trained as a new root node.
["C" or other]
/ \
["A or B"] "C"
/ \
"A" "B"
Different with paper [9], paper [10, 11] gives a more general binary tree structure, They use random forest algorithm to training multiple decision trees. For each decision tree, unsupervised split algorithm (different tree with different parameter) is applied on branch node to separate each instance and the leaf node with supervised multi-classification algorithm will make the final decision.
NCMFs (nearest class mean forests) algorithm [10] use NCM algorithm to separate each instance an make classification. Paper [11] gives more detail analysis and apply the structure on SVM (SVMFs).
For new classes comes, they have four different strategies to extend the tree - ULS (only update leaf node), IGT (generate new layer with old split function), RTST (new split function and new layer) and RUST (update part split function and generate new layer).
Paper [12] also use split function at branch node and classify at leaf node, But instead of using binary split, they decide to calculate a similarity matrix and use it to separate each instance to one of multiple branches. Each node in this tree structure is a convolutional neural network (with softmax output layer). All the CNN hidden network (without output layer) have the same structure (layer size and node size at each layer).
When a new class comes, they build two new classifiers, the first one is just extend current leaf node, and second is extend a new layer and modify current leaf node (classification) to a branch node (split). After training those new classifiers, they will let them compete and choose the best strategy.
To accelerate the training progress, they copy the previous CNN weights (without output layer) to the new CNN network and create a new softmax output layer by the requirement (branch node by the number of branches, leaf node by the number of classes) with random weights. So the new CNN network will be trained on previous knowledge rather from zero.
When a new class comes, they build two new classifiers, the first one is just extend current leaf node, and second is extend a new layer and modify current leaf node (classification) to a branch node (split). After training those new classifiers, they will let them compete and choose the best strategy.
To accelerate the training progress, they copy the previous CNN weights (without output layer) to the new CNN network and create a new softmax output layer by the requirement (branch node by the number of branches, leaf node by the number of classes) with random weights. So the new CNN network will be trained on previous knowledge rather from zero.
Annotate
Modify a existing multi-class algorithm to solve the new class problem may not the optimal solution. The reason we want to handle new class because training a classifier from ground zero is time consuming, We want accelerate this process and let the algorithm learning new knowledge base on previous experiences. The final goal is to develop an algorithm can keep learning and the accuracy maintain on some threshold.
For general (no class incremental) learning algorithm, the algorithm capability is fixed when it been initialized. So modify the original algorithm to fit new class without change the algorithm structure\parameter will not work very well. The classification accuracy will decrease when the number of classes increase. When the number of class or the complexity of decision surface over the algorithm's capability, the classification accuracy will dramatic decrease. If we set the algorithm capability in a high level at beginning, the whole training speed will slow (and structural redundancy for some algorithm), lose the significant advantage compared with non-incremental learning algorithm.
对于一般的非类增量学习算法, 算法的潜在承载能力在一开始算法初始化的时候就已经固定了. 所以在不改变算法本身结构或参数的情况下, 将其调节为一个类增量算法并不是最有的解决方案. 算法分类的准确率会随着类的增加而降低, 当触及当前算法承载能力的时候, 分类准确率会大大降低(预测). 而在一开始就将算法承载能力设大的话, 算法整体训练速度就会降低, 对于非增量学习算法就失去了明显的优势.
For general (no class incremental) learning algorithm, the algorithm capability is fixed when it been initialized. So modify the original algorithm to fit new class without change the algorithm structure\parameter will not work very well. The classification accuracy will decrease when the number of classes increase. When the number of class or the complexity of decision surface over the algorithm's capability, the classification accuracy will dramatic decrease. If we set the algorithm capability in a high level at beginning, the whole training speed will slow (and structural redundancy for some algorithm), lose the significant advantage compared with non-incremental learning algorithm.
对于一般的非类增量学习算法, 算法的潜在承载能力在一开始算法初始化的时候就已经固定了. 所以在不改变算法本身结构或参数的情况下, 将其调节为一个类增量算法并不是最有的解决方案. 算法分类的准确率会随着类的增加而降低, 当触及当前算法承载能力的时候, 分类准确率会大大降低(预测). 而在一开始就将算法承载能力设大的话, 算法整体训练速度就会降低, 对于非增量学习算法就失去了明显的优势.
Reference
[1]R. Polikar, L. Upda, S. Upda and V. Honavar, "Learn++: an incremental learning algorithm for supervised neural networks", IEEE Trans. Syst., Man, Cybern. C, vol. 31, no. 4, pp. 497-508, 2001.
[2]M. Muhlbaier, A. Topalis and R. Polikar, "Learn++.NC: Combining Ensemble of Classifiers With Dynamically Weighted Consult-and-Vote for Efficient Incremental Learning of New Classes", IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 152-168, 2009.
[3]G. Ditzler, M. Muhlbaier and R. Polikar, "Incremental Learning of New Classes in Unbalanced Datasets: Learn + + .UDNC", Multiple Classifier Systems, pp. 33-42, 2010.
[4]G. Ditzler, G. Rosen and R. Polikar, "Incremental learning of new classes from unbalanced data", The 2013 International Joint Conference on Neural Networks (IJCNN), 2013.
[5]R. Elwell and R. Polikar, "Incremental Learning of Concept Drift in Nonstationary Environments", IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1517-1531, 2011.
[6]B. Zhang, J. Su and X. Xu, "A Class-Incremental Learning Method for Multi-Class Support Vector Machines in Text Classification", 2006 International Conference on Machine Learning and Cybernetics, 2006.
[7]T. Mensink, J. Verbeek, F. Perronnin and G. Csurka, "Distance-Based Image Classification: Generalizing to New Classes at Near-Zero Cost", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2624-2637, 2013.
[8]I. Kuzborskij, F. Orabona and B. Caputo, "From N to N+1: Multiclass Transfer Incremental Learning", 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[9]B. Zhang, J. Su and X. Xu, "A Class-Incremental Learning Method for Multi-Class Support Vector Machines in Text Classification", 2006 International Conference on Machine Learning and Cybernetics, 2006.
[9]B. Zhang, J. Su and X. Xu, "A Class-Incremental Learning Method for Multi-Class Support Vector Machines in Text Classification", 2006 International Conference on Machine Learning and Cybernetics, 2006.
[10]M. Ristin, M. Guillaumin, J. Gall and L. Gool, "Incremental Learning of NCM Forests for Large-Scale Image Classification", 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[11]M. Ristin, M. Guillaumin, J. Gall and L. Van Gool, "Incremental Learning of Random Forests for Large-Scale Image Classification", IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1, 2015.
[12]T. Xiao, J. Zhang, K. Yang, Y. Peng and Z. Zhang, "Error-Driven Incremental Learning in Deep Convolutional Neural Network for Large-Scale Image Classification",Proceedings of the ACM International Conference on Multimedia - MM '14, 2014.