A Brief Introduction to Continual Learning / Lifelong Learning

What is Continual Learning?
A high-level definition. “Continual Learning is the constant development of increasingly complex behaviors; the process of building more complicated skills on top of those already developed.” (Ring, 1997, CHILD: A First Step Towards Continual Learning)
Continual Learning is also referred to as lifelong learning, sequential learning, or incremental learning; the terms share the same definition.
“Studies the problem of learning from an infinite stream of data, with the goal of gradually extending acquired knowledge and using it for future learning.” (Z. Chen, Lifelong Machine Learning)


In other words, Continual Learning tries to make machines learn the way humans do: adaptively and continuously, in a dynamic environment, acquiring tasks sequentially (from birth to death).
A low-level definition. Continual Learning (CL) refers to algorithms that train machine learning models on non-stationary data (as opposed to i.i.d. data) arriving as a sequence of tasks.

As an example [1], we define a sequence of tasks $D = \{D_1, \ldots, D_T\}$, where the $t$-th task $D_t = \{(\mathbf{x}_i^t, y_i^t)\}_{i=1}^{n_t}$ contains tuples of an input sample $\mathbf{x}_i^t \in \mathcal{X}$ and its label $y_i^t \in \mathcal{Y}$. The goal is to train a single model $f_\theta : \mathcal{X} \rightarrow \mathcal{Y}$, parameterized by $\theta$, that predicts the label $y = f_\theta(\mathbf{x}) \in \mathcal{Y}$, where $\mathbf{x}$ is an unseen test sample from an arbitrary task. Data from previous tasks may no longer be available when training on later tasks.
Motivation & Application Scenarios
AlphaGo beats the best human players at Go, but it is powerless when faced with chess. Similarly, YOLO (You Only Look Once) detects dogs easily, but it can only detect the object classes it was trained on. People therefore look for models that overcome these limitations, which calls for systems that adapt continually and keep learning over time.
As for application scenarios, Continual Learning can be used in many areas. To take some simple examples: a robot needs to acquire new skills in different environments to accomplish new tasks; a self-driving car needs to adapt to different environments (from a country road to a highway to a city); and conversational agents should adapt to different users, situations, and tasks.

The Challenges of Continual Learning
Nowadays, almost all Continual Learning methods are built on neural networks (CNNs, Transformers, and so on). Owing to the limitations of neural networks, Continual Learning faces two major challenges: catastrophic forgetting, and the balance between learning and forgetting (stability vs. plasticity).

Catastrophic Forgetting. When the data is updated incrementally, the model suffers catastrophic interference, or forgetting: after learning a new task it forgets how to solve the old ones.
For example, take a vision model that classifies images into two categories. First, we train it on a Cat vs. Dog dataset and obtain near-perfect accuracy (maybe 99.98%?). Second, we continue training the same model on another dataset (e.g. Car vs. Ship) and again obtain good performance on that dataset. However, when we go back to the Cat vs. Dog dataset, we find that the model has forgotten the earlier data and can no longer separate the two classes accurately.
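The effect can be reproduced on a toy problem. The sketch below is only an illustration under strong assumptions: a two-weight logistic regression stands in for the vision model, the data is synthetic, and task B is a deliberately extreme, hypothetical task that relabels the same inputs the opposite way, so that the two tasks conflict as much as possible.

```python
import numpy as np

def train(w, X, y, steps=200, lr=0.5):
    """Full-batch gradient descent on the logistic loss (no bias term)."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    """Fraction of samples whose predicted class matches the label."""
    return float(np.mean((X @ w > 0) == (y == 1)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_a = (X[:, 0] > 0).astype(float)   # task A: label is the sign of the first feature
y_b = 1.0 - y_a                     # task B (hypothetical, extreme): labels flipped

w = train(np.zeros(2), X, y_a)      # learn task A
acc_a_before = accuracy(w, X, y_a)
w = train(w, X, y_b)                # keep training on task B only
acc_a_after = accuracy(w, X, y_a)   # re-evaluate on task A
print(acc_a_before, acc_a_after)
```

Accuracy on task A is high after the first phase and collapses after the second, because nothing in plain sequential training protects the weights that task A relied on.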

Stability vs. Plasticity.
For people, the faster you learn, the faster you forget. The same is true for machines, and balancing the two is also a challenge.
1. Stability <=> the ability to retain the skills learned on old tasks
2. Plasticity <=> the ability to adapt to a new task


Although the problem is challenging, progress in Continual Learning has led to real-world applications starting to emerge.
Four Assumptions of Continual Learning
Because of the general difficulty and variety of challenges in Continual Learning, many methods relax the general setting to an easier, task-incremental one.
Before looking at the assumptions, we fix some notation, the same as in the low-level definition above:
X - input vector
Y - class label
T - task.

The concept ‘task’ refers to an isolated training phase with a new batch of data, belonging to a new group of classes, a new domain, or a different output space.

$(\mathcal{X}^t, \mathcal{Y}^t)$ - dataset for task $t$.
$\{\mathcal{Y}^t\}$ - class labels, e.g. Dog, Cat, Bird, …
$P(\mathcal{X}^t)$ - input distribution. For different tasks, $P(\mathcal{X}^t) \neq P(\mathcal{X}^{t+1})$.
$f_t(\mathcal{X}^t; \theta)$ - the predicted label for $\mathcal{Y}^t$; the model is parameterized by $\theta$.
The four assumptions of Continual Learning:

Task incremental Learning.
Class incremental Learning.
Domain incremental Learning.
Data incremental Learning / Task-Agnostic Learning.

Task ID observed at training:

Task ID observed at test: Task incremental Learning
Task ID not observed at test: Class incremental Learning and Domain incremental Learning

Task ID not observed at training:

Data incremental Learning / Task-Agnostic Learning

Detailed description of the four settings:
Task incremental Learning (the easiest scenario)
Task incremental learning considers a sequence of tasks, receiving the training data of just one task at a time and training on it until convergence. In this setting, the model is always told which task it must perform (at both training and test time). However, data from old tasks is no longer available, which prevents evaluating the statistical risk of the new parameter values on those tasks.
Expressed in formulas:

$(\mathcal{X}^t, \mathcal{Y}^t)$ is the training data of task $t$, and the current task is $\mathcal{T}$.

The goal is to control the statistical risk of all seen tasks, given limited or no access to data from previous tasks $t < \mathcal{T}$:
$$\sum_{t=1}^{\mathcal{T}} \mathbb{E}_{(\mathcal{X}^t, \mathcal{Y}^t)}\left[\mathscr{L}(f_t(\mathcal{X}^t; \theta), \mathcal{Y}^t)\right]$$
For the current task $\mathcal{T}$, the statistical risk can be approximated by the empirical risk:
$$\frac{1}{N_{\mathcal{T}}} \sum_{i=1}^{N_{\mathcal{T}}} \mathscr{L}(f_{\mathcal{T}}(x_i^{\mathcal{T}}; \theta), y_i^{\mathcal{T}})$$
where $N_{\mathcal{T}}$ is the number of samples in task $\mathcal{T}$.


In summary, the assumptions of this setting are $P(\mathcal{X}^t) \neq P(\mathcal{X}^{t+1})$, $\{\mathcal{Y}^t\} \neq \{\mathcal{Y}^{t+1}\}$ (different tasks have different labels), and $P(\mathcal{Y}^t) \neq P(\mathcal{Y}^{t+1})$, but the task is known at test time (each task has its own task label $t$).
Class incremental Learning
“An algorithm learns continuously from a sequential data stream in which new classes occur. At any time, the learner is able to perform multi-class classification for all classes observed so far.” [2]

Models must not only solve each task seen so far but also infer which task they are presented with (the task identity is unknown at test time). Each new task may add new class labels to the model.

The assumptions of this setting are $P(\mathcal{X}^t) \neq P(\mathcal{X}^{t+1})$, $\{\mathcal{Y}^t\} \subset \{\mathcal{Y}^{t+1}\}$ (class incremental), and $P(\mathcal{Y}^t) \neq P(\mathcal{Y}^{t+1})$, and the task is not known at test time.
Domain incremental Learning
Domain incremental learning keeps the same label space in every task while the input distribution changes.
Models only need to solve the task at hand; they are not required to infer which task it is. In other words, the task identity is not given at test time, although the data is still organized into tasks.
The assumptions of this setting are $\{\mathcal{Y}^t\} = \{\mathcal{Y}^{t+1}\}$ and $P(\mathcal{Y}^t) = P(\mathcal{Y}^{t+1})$.
Data incremental Learning / Task-Agnostic Learning (the hardest scenario)
It defines a more general continual learning setting for any data stream without any notion of task, class, or domain. Task identity is not available even at training time! Task-agnostic learning has no concept of task at all, and it is the ideal setting for Continual Learning.


For a clearer understanding of task incremental, class incremental, and domain incremental learning, consider the following two benchmark protocols [3]:

Split MNIST: split the ten digit classes into different tasks.
Permuted MNIST: permute the pixels of each MNIST image after vectorization. In practice, a set of random indices is used to shuffle the position of every element of the image vector, and a different set of random indices yields a different task.
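The two protocols can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: `split_tasks` and `permuted_tasks` are hypothetical helper names, two classes per task is an assumed default, and random arrays with MNIST's shape stand in for the real dataset so that nothing needs to be downloaded.

```python
import numpy as np

def split_tasks(images, labels, classes_per_task=2):
    """Split a labelled dataset into tasks made of disjoint class groups."""
    classes = np.unique(labels)
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        group = classes[start:start + classes_per_task]
        keep = np.isin(labels, group)
        tasks.append((images[keep], labels[keep]))
    return tasks

def permuted_tasks(images, labels, num_tasks, seed=0):
    """Create tasks by applying one fixed random pixel permutation per task."""
    rng = np.random.default_rng(seed)
    flat = images.reshape(len(images), -1)      # vectorize each image
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(flat.shape[1])   # one index shuffle per task
        tasks.append((flat[:, perm], labels))
    return tasks

# Stand-in data with MNIST's shape (28x28 images, digits 0-9); not real MNIST.
rng = np.random.default_rng(0)
images = rng.random((100, 28, 28))
labels = np.arange(100) % 10

split = split_tasks(images, labels)                      # 5 tasks: {0,1}, {2,3}, ...
permuted = permuted_tasks(images, labels, num_tasks=3)   # 3 tasks, same labels
```

In the split variant the label set changes from task to task, while in the permuted variant the labels stay the same and only the input distribution changes.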

The Difference between Continual Learning and Multi-Task Learning
Multi-task gradient dynamics: a tug-of-war.

However, in CL the tasks are not available simultaneously!
We need some form of memory, or a way to modify the gradients, so that training still takes into account which solutions are good for previous tasks.
Some key definitions: Transfer and Interference
Note: we need to maximize transfer and minimize interference.

[Figure: transfer and interference]

Possible Scenarios in CL

[Figure: possible scenarios in CL]

The Methods of Continual Learning
Following De Lange et al. [4], I drew a mind map (see the appendix) to better understand the current mainstream methods of Continual Learning.
The definition of each method [4]:
Replay Methods
As the name suggests, replay is the key. This line of work either stores samples in raw format or, when raw storage is not possible (for example because of privacy policy), generates pseudo-samples with a generative model (e.g. a GAN or diffusion model). These previous-task samples are then replayed while learning a new task to alleviate forgetting. Depending on how the samples are used, replay methods fall into the following three categories:
Rehearsal (easy to implement, but poor performance)
This is the easiest approach to understand: add a limited subset of stored samples from old tasks to the new task's data and retrain the model.

Advantage:
1. Easy to implement.
Disadvantages:
1. Prone to overfitting the subset of stored samples.
2. Performance is upper-bounded by joint training.
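A rehearsal memory can be sketched as a small fixed-size buffer. The class below is a hypothetical illustration, not the method of any particular paper: it assumes reservoir sampling as the rule for deciding which old samples to keep (one common choice among several), and `mix_with` simply appends replayed samples to the new task's batch.

```python
import random

class RehearsalBuffer:
    """Fixed-capacity memory of past (x, y) samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Offer one sample; every sample seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            # Overwrite a random slot with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def mix_with(self, new_batch, k):
        """Return the new batch followed by up to k replayed old samples."""
        replay = self.rng.sample(self.samples, min(k, len(self.samples)))
        return list(new_batch) + replay

buffer = RehearsalBuffer(capacity=4)
task1 = [(f"cat_{i}", 0) for i in range(10)]   # placeholder samples of an old task
for s in task1:
    buffer.add(s)

task2_batch = [("car_0", 2), ("car_1", 2)]     # placeholder batch of a new task
train_batch = buffer.mix_with(task2_batch, k=2)
```

Because the buffer holds only a handful of old samples, the model sees them many times during later tasks, which is where the overfitting risk listed above comes from.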

Pseudo Rehearsal
Feed random inputs to the previous model and use its outputs as pseudo-samples. (Generative models are also used nowadays, but they add training complexity.) [5]

A novel generative replay (GR) method [6]: internal or hidden representations are replayed, generated by the network's own context-modulated feedback connections.
Constrained Optimization
Minimize interference with old tasks by constraining the updates made on the new task. The goal is to optimize the loss on the current example(s) without increasing the losses on the previously learned examples.
Assume the examples are observed one at a time, and formulate the goal as the following constrained optimization problem:
$$\theta^{t} = \operatorname*{arg\,min}_{\theta} \ell(f(x_t; \theta), y_t) \quad \text{s.t.} \quad \ell(f(x_i; \theta), y_i) \leq \ell(f(x_i; \theta^{t-1}), y_i), \quad \forall i \in [0 \ldots t-1]$$
Here $f(\cdot\,; \theta)$ is a model parameterized by $\theta$, $\ell$ is the loss function, $t$ is the index of the current example, and $i$ indexes the previous examples.
The original constraints can be rephrased as constraints in gradient space:
$$\langle g, g_i \rangle = \left\langle \frac{\partial \ell(f(x_t; \theta), y_t)}{\partial \theta}, \frac{\partial \ell(f(x_i; \theta), y_i)}{\partial \theta} \right\rangle \geq 0$$
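With a single previous-example gradient, the gradient-space constraint can be enforced in closed form: if the inner product is negative, subtract from $g$ its component along the old gradient. The sketch below shows only this simplified one-constraint case; `project_gradient` and `g_ref` are hypothetical names, and methods such as GEM [7] that enforce many constraints at once solve a quadratic program instead.

```python
import numpy as np

def project_gradient(g, g_ref):
    """Return g if <g, g_ref> >= 0, otherwise remove g's component along g_ref."""
    dot = float(g @ g_ref)
    if dot >= 0.0:
        return g                      # no interference: keep the update as is
    return g - (dot / float(g_ref @ g_ref)) * g_ref

g = np.array([1.0, -2.0])             # gradient on the current example
g_ref = np.array([0.0, 1.0])          # gradient on a stored previous example
g_proj = project_gradient(g, g_ref)   # conflicting component along g_ref removed
```

After the projection the update is orthogonal to the old gradient, so to first order it no longer increases the loss on the previous example.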
Regularization-Based Methods
These methods avoid storing raw inputs, which protects privacy and reduces memory requirements.
In these methods, an extra regularization term is added to the loss function to consolidate previous knowledge when learning on new data. They can be further divided into data-focused and prior-focused methods. [4]
Data-Focused Methods
The basic building block in data-focused methods is knowledge distillation from a previous model (trained on a previous task) to the model being trained on the new data.
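A distillation term of this kind can be sketched as the cross-entropy between the softened outputs of the frozen previous model and those of the model being trained, both evaluated on new-task inputs. The function names and the temperature value below are assumptions chosen for illustration, not taken from a specific paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax with temperature scaling."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(new_logits, old_logits, temperature=2.0):
    """Cross-entropy between the old model's softened outputs and the new model's."""
    target = softmax(old_logits, temperature)
    pred = softmax(new_logits, temperature)
    return float(-np.mean(np.sum(target * np.log(pred + 1e-12), axis=1)))

old_logits = np.array([[2.0, 0.5, -1.0]])   # frozen previous model on a new-task input
same = distillation_loss(old_logits, old_logits)
drifted = distillation_loss(np.array([[-1.0, 0.5, 2.0]]), old_logits)
```

The loss is smallest when the new model reproduces the old model's outputs and grows as its predictions drift, so adding it to the new-task loss discourages forgetting without storing any old data.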
Prior-Focused Methods
To mitigate forgetting, prior-focused methods estimate a distribution over the model parameters, used as prior when learning from new data. Typically, importance of all neural network parameters is estimated, with parameters assumed independent to ensure feasibility. During training of later tasks, changes to important parameters are penalized.
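The penalty on important parameters is often written as a weighted quadratic term, $\frac{\lambda}{2} \sum_i \Omega_i (\theta_i - \theta_i^{*})^2$, where $\theta^{*}$ are the parameters after the previous task and $\Omega_i$ is the estimated importance of parameter $i$. The sketch below assumes this form; the importance values are made up for illustration, and how $\Omega$ is estimated differs from method to method.

```python
import numpy as np

def importance_penalty(theta, theta_old, importance, strength=1.0):
    """Quadratic penalty: strength / 2 * sum_i importance_i * (theta_i - theta_old_i)^2."""
    return 0.5 * strength * float(np.sum(importance * (theta - theta_old) ** 2))

theta_old = np.array([1.0, -1.0])     # parameters after the previous task
importance = np.array([10.0, 0.1])    # hypothetical per-parameter importance

# Moving the important parameter by 1 costs far more than moving the unimportant one.
move_important = importance_penalty(np.array([2.0, -1.0]), theta_old, importance)
move_unimportant = importance_penalty(np.array([1.0, 0.0]), theta_old, importance)
```

Treating the parameters independently keeps the penalty a simple element-wise sum, which is what makes it feasible for large networks.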
Parameter Isolation Methods
This family dedicates different model parameters to each task in order to prevent any possible forgetting.
Best suited for: the task-incremental setting, unconstrained model capacity, and cases where performance is the priority.
Fixed Network Methods
Network parts used for previous tasks are masked out when learning new tasks, either at the neuron level (HAT) or at the parameter level (PackNet, PathNet).
Dynamic Architecture Methods
When model size is not constrained: grow new branches for new tasks while freezing the parameters of previous tasks (RCL), or dedicate a model copy to each task (Expert Gate), etc.
Conclusion
TODO: a summary will be added once I am familiar enough with this field.
Appendix: Mind Map

Some representative Replay Methods (continuously updated):
Only a brief introduction is given here; read the original papers for more information.
iCaRL (Incremental Classifier and Representation Learning)
iCaRL is a rehearsal method for class incremental learning.
iCaRL allows learning in a class-incremental way: only the training data for a small number of classes has to be present at the same time (not all data: the new data plus some old data), and new classes can be added progressively.
The authors introduce three main components that, in combination, allow iCaRL to fulfill all the criteria put forth above.

classification by a nearest-mean-of-exemplars rule
prioritized exemplar selection based on herding
representation learning using knowledge distillation and prototype rehearsal.

Classification (nearest-mean-of-exemplars)
Algorithm 1 describes the mean-of-exemplars classifier that is used to classify images into the set of classes observed so far.

[Figure: Algorithm 1, iCaRL classification]

where $\mathcal{P} = (P_1, \ldots, P_t)$ are the sets of exemplar images that iCaRL selects dynamically out of the data stream,
and $t$ denotes the number of classes observed so far ($t$ increases with time).
$\varphi : \mathcal{X} \rightarrow \mathbb{R}^d$ is a trainable feature extractor, followed by a single classification layer with as many sigmoid output nodes as classes observed so far.
The class label is $y \in \{1, \ldots, t\}$.
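The nearest-mean-of-exemplars rule can be sketched as follows: compute the mean feature vector of each exemplar set $P_y$ and assign $x$ to the class whose mean is closest to $\varphi(x)$. This is a simplified sketch, not the paper's code: the identity function `phi` is a hypothetical stand-in for the trained feature extractor, the exemplars are made-up 2-D points, and details such as feature normalization are left out.

```python
import numpy as np

def class_means(exemplar_sets, phi):
    """Mean feature vector of each class's exemplar set P_y."""
    return np.stack([np.mean([phi(p) for p in P_y], axis=0) for P_y in exemplar_sets])

def classify(x, exemplar_sets, phi):
    """Return the 1-based label whose exemplar mean is nearest to phi(x)."""
    mu = class_means(exemplar_sets, phi)
    distances = np.linalg.norm(mu - phi(x), axis=1)
    return int(np.argmin(distances)) + 1

def phi(x):
    """Hypothetical stand-in for the trained feature extractor."""
    return np.asarray(x, dtype=float)

exemplars = [
    [[0.0, 0.0], [0.0, 2.0]],   # P_1, mean (0, 1)
    [[4.0, 4.0], [6.0, 4.0]],   # P_2, mean (5, 4)
]
label = classify([4.5, 3.0], exemplars, phi)   # closer to the mean of P_2
```

Because the class means are recomputed from the exemplars with the current $\varphi$, the classifier follows the feature extractor automatically whenever the representation changes.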
Training
For training, iCaRL processes batches of classes at a time using an incremental learning strategy. Every time data for new classes becomes available, iCaRL calls an update routine (Algorithm 2).

[Figure: Algorithm 2, iCaRL incremental training]
Other algorithms (for more detail, see DOI 10.1109/CVPR.2017.587):

[Figure: further iCaRL algorithms]

GEM: Gradient Episodic Memory for Continual Learning [7]
Some important definitions:
Note: these are analogous to transfer and interference.

Backward transfer (BWT) is the influence that learning a current task $t$ has on the performance on a previous task $k$ ($k < t$).
- Positive backward transfer: learning about some task $t$ increases the performance on some preceding task $k$.
- Negative backward transfer: learning about some task $t$ decreases the performance on some preceding task $k$. Large negative backward transfer is also known as catastrophic forgetting.
Forward transfer (FWT) is the influence that learning a current task $t$ has on the performance on a future task $k$ ($k > t$). (Rarely discussed because it is unpredictable.)
- Positive forward transfer: in particular, positive forward transfer is possible when the model is able to perform “zero-shot” learning, perhaps by exploiting the structure available in the task descriptors.

Evaluation:

[Figure: GEM evaluation metrics]
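The GEM paper [7] computes its metrics from a matrix $R$, where $R_{i,j}$ is the test accuracy on task $j$ after training on task $i$, and a vector $\bar{b}$ of test accuracies at random initialization. As recalled here, average accuracy is the mean of the last row, BWT compares the last row with the diagonal, and FWT compares the entries just above the diagonal with $\bar{b}$. The sketch below follows that reading with 0-based indices; the numbers in `R` and `b` are made up for illustration.

```python
import numpy as np

def gem_metrics(R, b):
    """Return (ACC, BWT, FWT) from accuracy matrix R and random-init accuracies b.

    R[i, j] is the test accuracy on task j after training on task i (0-based).
    """
    acc = float(np.mean(R[-1]))                          # final accuracy over all tasks
    bwt = float(np.mean(R[-1, :-1] - np.diag(R)[:-1]))   # final minus just-after-learning
    fwt = float(np.mean(np.diag(R, k=1) - b[1:]))        # before-learning minus random init
    return acc, bwt, fwt

R = np.array([
    [0.9, 0.5, 0.5],
    [0.7, 0.9, 0.6],
    [0.6, 0.8, 0.9],
])
b = np.array([0.5, 0.5, 0.5])
acc, bwt, fwt = gem_metrics(R, b)
```

In this made-up example BWT is negative, meaning the model forgot part of the earlier tasks, while FWT is slightly positive because the second task helped the third before it was trained on.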

GEM:

[Figure: GEM algorithm]

Experiments:

For More Blogs
TODO: The future of Continual Learning.
TODO: Details of some papers.
References & Acknowledgements

[1] Wang, Z., et al. (2022). Learning To Prompt for Continual Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Rebuffi, S., et al. (2017). iCaRL: Incremental Classifier and Representation Learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] van de Ven, G. M. and Tolias, A. S. (2019). Three scenarios for continual learning. arXiv:1904.07734.
[4] De Lange, M., et al. (2022). “A Continual Learning Survey: Defying Forgetting in Classification Tasks.” IEEE Transactions on Pattern Analysis and Machine Intelligence 44(7): 3366-3385.
[5] https://icml.cc/virtual/2021/tutorial/10833 Some of this blog's pictures come from this link. Thanks.
[6] van de Ven, G. M., et al. (2020). “Brain-inspired replay for continual learning with artificial neural networks.” Nature Communications 11(1): 4069.
[7] Lopez-Paz, D. and Ranzato, M. (2017). Gradient Episodic Memory for Continual Learning. Advances in Neural Information Processing Systems, Curran Associates, Inc.