About supervision in task-free continual learning

Hi everybody! I have some questions regarding task-free continual learning. I have just started digging into CL, so I apologize in advance if my questions do not make sense.

I was taking a look at [1] and I agree with the authors that a true continual learning agent should learn online and should not rely on knowledge of task labels and boundaries. Besides that, IMHO one of the strengths of [1] is that they run experiments with weak and self-supervision. In fact, it seems that in this context the ability to learn without supervision is crucial for three reasons:

  1. humans are very slow at labeling, so most of the advantages of online learning become useless if the system needs to wait for the label
  2. I can’t think of a reason why the task label should not be known if the data has been manually annotated. Couldn’t the annotator just give the task label as well?
  3. correct me if I’m wrong, but knowing the class labels entails knowledge of the task label by definition, since a task label corresponds to a subset of the set of classes. Also, it is very easy to use the class label to detect a task change (in Python: if current_label not in already_seen_labels: new_task())
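To make point 3 concrete, here is a minimal sketch of that label-based task-change detector. The function names are illustrative, not from any of the cited papers:

```python
# Hypothetical sketch of the "label trick": flag a task change whenever a
# never-before-seen class label arrives in the stream.
def make_task_detector():
    seen_labels = set()

    def on_label(label):
        """Return True if this label signals a new task."""
        is_new_task = label not in seen_labels
        seen_labels.add(label)
        return is_new_task

    return on_label

detect = make_task_detector()
stream = [0, 0, 1, 1, 2, 2, 0]            # class labels arriving online
boundaries = [detect(y) for y in stream]  # True exactly at new-label steps
```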

note that point 3 is probably related to this sentence I found in section 3 of [2]:

However, it is only applied to supervised tasks, and can exploit the “label” trick, inferring the task based on the class label

Given these thoughts, the task-free setting seems unreasonable under the supervised paradigm. Nonetheless, many recent works (for example [3][4] but I am confident that there are others that I cannot find now) seem to go in that direction. At first, I thought that it might be that supervised learning is used as a toy problem for testing task-free continual learning, but then I found out that these methods explicitly rely on labels (e.g. for picking data points to store in the buffer) and therefore cannot be trivially adapted for unsupervised learning.
DISCLAIMER: I assume there is something wrong with my understanding since these authors are super experts in the field. What am I missing here?

Furthermore, I am confused about the unsupervised task-free continual learning setting presented in [2]. They only include CL experiments where a single class is presented to the model at a time, so this would be something like unsupervised task-free class-incremental continual learning. If this is true, it means that the class label needs to be known at training time; otherwise, how do you split the data? Alternatively, you could assume that the data for each class comes from a different “source” and therefore doesn’t need to be labelled, but then you could just use the “source” as a label. Hence it seems to me that this unsupervised CL is not really unsupervised. Again, I am probably wrong, I just want to understand :slight_smile:
Moreover, it is true that in [2] there are experiments on

continuous drift, similar to the sequential case, but with classes gradually introduced by slowly increasing the number of samples from the new class within a batch

but my understanding is that this still requires the knowledge of the class label (which btw in this case is the same as the task label).
Also, because of how the dynamic expansion is devised, I reckon that it is not possible for CURL to handle more than one new class at a time, am I right?

Can’t wait for your replies. Sorry for the wall of text.

[1] Task-Free Continual Learning
[2] Continual Unsupervised Representation Learning
[3] Online Continual Learning with Maximal Interfered Retrieval
[4] Gradient based sample selection for online continual learning


Hi @RobiNoob21, welcome to the forum! :smiley:

I will answer the first questions and let the other authors of the papers you cited answer about their own work (for example @optimass and @mavenlin, who are here!)

Humans are very slow at labeling, so most of the advantages of online learning become useless if the system needs to wait for the label

That’s true. I believe we all agree that working with unlabeled data would be better for downstream applications, but we also need to study continual learning in settings where it is easy to disentangle causal factors and experiment easily! :slight_smile: Moreover, there are many scenarios where labels can be recovered very fast in semi-automated ways (think of recommendation systems, imitation learning, or an app with a labeling system based on temporal coherence)

I can’t think of a reason why the task label should not be known if the data has been manually annotated. Couldn’t the annotator just give the task label as well?

Yes, they could, but it’s an additional supervision signal we would need to provide during training and possibly every time we use the system as well (during inference). The less we label, the better, I think. A basic example is a robot recognizing objects in different rooms (tasks): I wouldn’t want to specify the room it is in every time I ask it to recognize an object.

correct me if I’m wrong, but knowing the class labels entails knowledge of the task label by definition since the task label represents a subset of the set of classes. Also, it is very easy to just use the class label to detect a task change (in python if current_label not in already_seen_labels: new_task() )

I don’t think this trick is generally applicable. A simple counterexample is a sequence of two tasks with the same labels but with different input distributions. Coming back to our example: two different objects with the same name in two different rooms. The distributional shift needs to be detected in the <X,Y> space, not just in one of them.
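A tiny sketch of that counterexample, with made-up toy numbers: two tasks share the label set {0, 1} but have different input distributions, so a detector that only watches labels never notices the switch.

```python
# Two tasks with identical labels but shifted inputs (toy data).
task_a = [(0.1, 0), (0.2, 1), (-0.1, 0)]    # inputs near 0
task_b = [(9.9, 0), (10.2, 1), (10.1, 0)]   # same labels, inputs near 10

seen = set()
alarms = []                                  # steps where the label trick fires
for t, (x, y) in enumerate(task_a + task_b):
    if y not in seen:                        # fires only on never-seen labels
        alarms.append(t)
        seen.add(y)

# The trick fires at steps 0 and 1 (first appearance of each label) and is
# silent at step 3, where the real task boundary is: the shift lives in the
# <X,Y> distribution, not in the label set.
```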


Thanks, Vincenzo, for your reply and for this very cool platform.

Indeed, I agree that in some scenarios it makes sense. The video is very cool! Though, the app is task-aware and supervised, isn’t it? I mean, you click and write the name of the new class, providing both the task boundary (the click) and the class label (the name of the class).

Yes, that is true, but it is still interesting at test time. This is kind of already possible with class-incremental methods, although the training data should contain a single class at a time, which could be a problem for robots that usually see multiple objects at once.

Right, for the domain-incremental scenario the label trick cannot be applied. But for task-incremental and class-incremental it is valid.

Hi @RobiNoob21
We all agree that totally unsupervised continual learning is one of the major goals of CL (maybe even of ML?), but IMO the field is too immature to seriously deal with unlabelled data in real scenarios. I mean, continual learning is far from being solved even on toy supervised benchmarks; before focusing on difficult tasks we should concentrate on simpler ones :slight_smile:
With this in mind, reducing the information available during training and testing by removing the task label is a step forward in the direction of unsupervised CL / CL with little knowledge about the data distribution and classes.

As far as I know, the task label is mainly used by multi-head or growing models such as [1], where the specific head, or the path the data travels through the network, differs according to the task.
Personally I’m not a fan of these kinds of models, so I tend to avoid using the task label.

Moreover, IMO, the concept of a task label is not naturally present in many real-world applications. As an example, suppose you have a robot that has to learn to classify and handle different objects in your house, and let’s say that room = task. Suppose that when the robot learns to handle objects in your bedroom, there is no mug in the room. During testing you put a mug in your bedroom. If the robot learned to classify objects based on the task label, it will probably fail to classify and handle the mug, simply because there was no mug in the bedroom during training, yet the same robot would probably handle a mug correctly if it were in the kitchen!
This behavior is usually unwanted, since the same object should always be correctly classified, regardless of the task/room in which it appears.

[1] Rusu, Andrei A., et al. “Progressive neural networks.” arXiv preprint arXiv:1606.04671 (2016).


The app is based on the AR1* algorithm described here: “Latent Replay for Real-Time Continual Learning”. AR1 works in a Single-Incremental-Task scenario (there’s no notion of task at all), so only class labels are provided through the app. AR1 is agnostic to the new batch content type (it can contain new classes, new examples of already encountered classes or both) :slight_smile:


Thanks @ggraffieti for joining the discussion.

True, this is what I was implying when I said that supervised classification is a “toy example”. My only concern is that maybe we are “overfitting” on simple tasks in a way that will probably be useless when we switch to more practical scenarios. I mean, most of the literature proposes solutions that are specific to supervised classification, and it will be hard to extend them to more challenging tasks (e.g. unsupervised ones).

Yeah, definitely :+1:. Also, I think it is unreasonable to expect that only one class will come at you at a time (class-incremental). Probably, as @vincenzo.lomonaco was saying, SIT and NIC (new instances and classes) settings are the most reasonable in terms of practical utility.

Thanks!! :slight_smile:


Hi Robi,

(Sorry, haven’t had time to read the other responses.)

First of all, you should always question what the “super experts” do, especially in CL :wink:

So in [3], we are not using the task label but we are using the task boundaries. Nevertheless, our method applies to unsupervised learning :slight_smile: See Table 5. Let me know if you have any questions about that. Long story short: just swap p(y|x) for p(x) everywhere.

That being said, I believe in the task-free setting. Let me start with an example: a recommendation system trying to predict the next song you will listen to. Let’s say x is the current song you’re listening to, y is the next one, and c is a hidden context variable representing your current music mood. The problem can thus be formulated as maximizing p(y|x,c). In this setting, c represents the task label. For this application, the model doesn’t have access to c, but it still gets tons of supervision via y whenever you switch songs (so: no task boundaries, no task labels).
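The music example above can be sketched in a few lines. Everything here (the song names, the `MOODS` mapping) is made up for illustration; the point is just that the learner sees (x, y) pairs but never the hidden context c:

```python
# Toy sketch of the task-free recommendation stream: supervision (the next
# song y) arrives at every step, but the mood c stays hidden.
MOODS = {"calm": ["song_a", "song_b"], "upbeat": ["song_c", "song_d"]}

# A raw stream of (mood, current_song) steps; the mood drifts with no
# boundary signal of any kind.
raw_stream = [("calm", "song_a"), ("calm", "song_b"),
              ("upbeat", "song_c"), ("upbeat", "song_d")]

def observed(stream):
    """What the learner actually sees: (x, y) pairs, never c."""
    for (c, x), (_, y) in zip(stream, stream[1:]):
        yield x, y          # y = next song = free supervision for p(y|x,c)

pairs = list(observed(raw_stream))
```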

You can think of a LOT of applications like this. Find more here

Let me know if that wasn’t helpful and you need further convincing :slight_smile:


Well, focusing mostly on the supervised setting is “a problem” of ML in general, not only of CL :slightly_smiling_face:
I don’t think we are overfitting the problem if we work with labelled data. In fact, there are many applications where labels are easy to obtain or generate. Moreover, the main problem of CL, catastrophic forgetting, is still an issue in the majority of supervised tasks! Trying to solve it for labelled data could give us some insight toward solving it in the more general unsupervised case. A good approach is the one described by @optimass. In general, semi-supervised and self-supervised learning are two fields that have been gaining more and more attention recently. IMO a hybrid approach, where sometimes you have labels and sometimes you don’t, would be a very good benchmark on the way towards unsupervised CL!


Hi Massimo,

Ahaha, yeah, I do, but I also question myself, especially since I’m just starting out, really.

My bad, I had missed that section. Thanks for pointing that out. Anyway, it was a general thought about the lines of research that are being explored, not specific to a single publication. :+1:

Quite a nice example, makes a lot of sense in this perspective, thanks.

That paper is in my backlog!!!


Happy it made it to your backlog :slight_smile:


Let’s hope so!

I tagged you in my comment in the reading group channel about this
