Two different ways of rehearsing patterns

Hello CLAI people :slight_smile:

I was wondering what your usual approach is when implementing a standard rehearsal algorithm. Assume a class-incremental setting.
At the beginning / end of each task you select a certain number of patterns for each class and store them in memory.
During training, you interleave current patterns with previously stored ones.

But, during training…

  1. Do you concatenate the memory of previous patterns with the current dataset, then shuffle the dataset and train on the result?
  2. For each batch, do you concatenate the current batch with (a subset of) the memory of previous patterns?

Method 1 is computationally more efficient (fewer patterns to learn in each epoch), but the rehearsal of previous patterns is sparser, since there will be batches with only current patterns. Method 2 costs more, but in each batch I am sure that there will be some training on previous patterns.
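
A minimal sketch of the two options in plain Python, to make the difference concrete. The function names (`method1_batches`, `method2_batches`), the `n_replay` parameter, and the toy sizes are assumptions made for illustration; they do not come from any specific library.

```python
import random

def method1_batches(current, memory, batch_size, rng):
    """Method 1: concatenate memory with the current data, shuffle, then split."""
    data = list(current) + list(memory)
    rng.shuffle(data)
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

def method2_batches(current, memory, batch_size, n_replay, rng):
    """Method 2: append n_replay stored patterns to every current mini-batch."""
    data = list(current)
    rng.shuffle(data)
    batches = []
    for i in range(0, len(data), batch_size):
        # Sample a subset of the memory without replacement for this mini-batch.
        replay = rng.sample(list(memory), min(n_replay, len(memory)))
        batches.append(data[i:i + batch_size] + replay)
    return batches

rng = random.Random(0)
current = [("new", i) for i in range(1000)]  # toy current-task patterns
memory = [("old", i) for i in range(20)]     # toy rehearsal memory

b1 = method1_batches(current, memory, batch_size=32, rng=rng)
b2 = method2_batches(current, memory, batch_size=32, n_replay=8, rng=rng)

# Count the method-1 mini-batches that contain no stored pattern at all.
without_old = sum(all(tag == "new" for tag, _ in b) for b in b1)
print(len(b1), "mini-batches with method 1,", without_old, "without any old pattern")
print(len(b2), "mini-batches with method 2, each with 8 old patterns")
```

With a memory this small, many method-1 mini-batches contain no old pattern, while every method-2 mini-batch contains exactly `n_replay` of them.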

What do you prefer? What do you see more often? Do you even care?

Ciao,
Andrea

Hi @andrea! I assume that by “batch” you mean “mini-batch” :stuck_out_tongue:

You are right, these are two different implementations that are not often discussed in the literature. I think people normally go for option 1, which is slightly easier to implement.

We explored option 2 in this paper (see Sec. 5.1), since it was necessary for latent replay. On CORe50-NICv2 we didn’t notice a significant difference in accuracy between 1) and 2) when the number of epochs is the same.

Hi @andrea, that’s a great question!

I’d suggest using the second approach, as it lets you independently control [1] how many samples to replay (replay’s “precision”) and [2] how strongly to weigh the loss of the replayed samples (replay’s “importance”). The first is important because you can get efficiency gains, as fully replaying previous tasks is not needed (not forgetting is easier than learning, see Fig 4A of this paper); the second is important because you want to properly balance the importance of the new and old data (see bottom of p4 of this paper for one possible way in the case of equally sized tasks).

Also, although the first approach you mention is indeed easier to implement, I don’t think it has an inherent benefit in terms of computational efficiency.
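
A small sketch of how the two knobs can be kept separate under the second approach. Everything here is an assumption for illustration: `replay_loss` and `weight_for_equal_tasks` are hypothetical names, the losses are plain floats, and giving every task seen so far an equal share of the loss is just one plausible heuristic for equally sized tasks, not necessarily the formula used in the paper mentioned above.

```python
def mean(values):
    return sum(values) / len(values)

def replay_loss(current_losses, replay_losses, replay_weight):
    """Combine the mean loss on new samples with the mean loss on replayed ones.

    replay_weight is the replay's "importance". The number of replayed samples
    (the "precision") only changes how well the replay mean is estimated,
    not how much it counts in the total.
    """
    return (1.0 - replay_weight) * mean(current_losses) + replay_weight * mean(replay_losses)

def weight_for_equal_tasks(n_tasks_seen):
    """Assumed heuristic: each task seen so far gets an equal share of the loss."""
    return 1.0 - 1.0 / n_tasks_seen

current_losses = [2.0] * 32               # toy per-sample losses on the new task
importance = weight_for_equal_tasks(4)    # 4 tasks seen -> replay gets 3/4 of the loss

few = replay_loss(current_losses, [1.0] * 4, importance)    # replay 4 samples
many = replay_loss(current_losses, [1.0] * 64, importance)  # replay 64 samples
print(few, many)
```

Replaying 4 or 64 samples gives the same weighting between old and new data; only the cost per mini-batch changes.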

I didn’t remember this bit in your paper, thanks. Yes, it seems reasonable for the difference to not be significant when rehearsing few patterns (wrt the total memory) per mini-batch.

I agree that the second method allows for a more fine-grained use of rehearsal. However, unless there are specific reasons to do that (see @vincenzo.lomonaco’s comment above), I still prefer method 1, since it has fewer hyperparameters to explore (e.g. the number of samples per mini-batch, your replay’s precision). Regarding the replay’s importance, I guess you can also implement it with approach 1, since you always know whether a pattern is a replayed one or not.
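
A minimal sketch of that last point, assuming each pattern in the concatenated dataset carries a flag saying whether it comes from memory. `weighted_batch_loss` and `replay_weight` are hypothetical names, and the per-sample losses are plain floats standing in for whatever your framework returns.

```python
def weighted_batch_loss(per_sample_losses, is_replay, replay_weight):
    """Weighted mean of the losses: replayed samples count replay_weight, new ones 1.0.

    Assumes the weights do not sum to zero (e.g. replay_weight > 0).
    """
    weights = [replay_weight if flag else 1.0 for flag in is_replay]
    total = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return total / sum(weights)

# A mini-batch from the shuffled, concatenated dataset: two new patterns, one replayed.
mixed = weighted_batch_loss([1.0, 1.0, 3.0], [False, False, True], replay_weight=2.0)

# With method 1 a mini-batch may contain no replayed pattern; the loss is then a plain mean.
only_new = weighted_batch_loss([1.0, 2.0], [False, False], replay_weight=2.0)
print(mixed, only_new)
```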

I agree that approach 1 is not more efficient in general. It depends on how many samples per mini-batch approach 2 rehearses (over an epoch, these could add up to more or fewer than the memory samples that method 1 adds to the dataset).

Thanks for the insightful papers, I will go through them. Shameless advert: if you have time you can add them to the CLAI wiki :slight_smile:

If I understand the results of @vincenzo.lomonaco correctly, method 1 works well whenever the old samples are a large percentage of the entire dataset. In this setting, balancing is not necessary.

I may be wrong, but I would be surprised if method 1 worked with small memories, where you only keep a small number of samples per class and the ratio between old and new data is extremely unbalanced (e.g. <10% of old samples). If you know of any results in this direction, please let me know.

Regarding efficiency, if you have GPUs, adding more samples to the batch is not much more expensive as long as you have enough memory. What matters more for efficiency is how many gradient updates you do. A balanced batch (method 2) usually helps convergence in a standard supervised setting. In a continual setting I don’t know if it matters.
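
To put rough numbers on the per-epoch cost, here is a small arithmetic sketch. The helper names and the dataset sizes are made up for illustration; it only counts gradient updates and processed samples, not wall-clock time.

```python
import math

def epoch_cost_method1(n_current, n_memory, batch_size):
    """Method 1: one pass over the concatenated dataset."""
    updates = math.ceil((n_current + n_memory) / batch_size)
    samples = n_current + n_memory
    return updates, samples

def epoch_cost_method2(n_current, batch_size, n_replay):
    """Method 2: one pass over the current data, plus n_replay stored samples per update."""
    updates = math.ceil(n_current / batch_size)
    samples = n_current + updates * n_replay
    return updates, samples

m1 = epoch_cost_method1(n_current=5000, n_memory=2000, batch_size=50)
m2_small = epoch_cost_method2(n_current=5000, batch_size=50, n_replay=10)
m2_large = epoch_cost_method2(n_current=5000, batch_size=50, n_replay=50)

print(m1)        # (140, 7000)
print(m2_small)  # (100, 6000)
print(m2_large)  # (100, 10000)
```

In this toy setting, method 2 does fewer gradient updates per epoch in both cases, and processes fewer or more samples than method 1 depending on `n_replay`.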

Yes, I agree that with approach 1 you can also control replay’s importance, just as with approach 2. The difference is that with approach 2 you can control replay’s precision independently of it. Although this introduces an additional hyperparameter, it’s not a “problematic” hyperparameter (such as lambda in EWC) that needs to be tuned for optimal performance, but a “good” hyperparameter that lets you balance performance and computational efficiency (and in many cases, I think it’s possible to gain a lot of efficiency for very little performance loss).
Such a trade-off might not always be of interest, but I think it’s good to have the option.

Thanks for the pointer to the CLAI wiki! I’ll look into adding those papers.

This example of having a very small number of stored samples from previous tasks is great. With method 1, you could still make it work by adding many copies of each stored example to your dataset to end up with the right balance. But then you might end up with multiple copies of the same stored example in one batch (and you would thus process the same sample multiple times). That could be prevented by using method 2.
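
A toy illustration of the duplicate issue, in plain Python. The sizes (1000 new patterns, 10 stored ones, mini-batches of 64) and all names are assumptions chosen only to make the effect visible.

```python
import random
from collections import Counter

rng = random.Random(0)
current = [("new", i) for i in range(1000)]
memory = [("old", i) for i in range(10)]   # very small memory

def has_duplicates(batch):
    return any(count > 1 for count in Counter(batch).values())

# Method 1 with oversampling: replicate the memory until it matches the new data in size.
copies = len(current) // len(memory)       # 100 copies of each stored pattern
data = current + memory * copies
rng.shuffle(data)
m1_batches = [data[i:i + 64] for i in range(0, len(data), 64)]
dup_m1 = sum(has_duplicates(b) for b in m1_batches)

# Method 2: draw the replayed patterns without replacement for each mini-batch.
n_replay = min(8, len(memory))
m2_batches = []
for i in range(0, len(current), 56):
    replay = rng.sample(memory, n_replay)
    m2_batches.append(current[i:i + 56] + replay)
dup_m2 = sum(has_duplicates(b) for b in m2_batches)

print(dup_m1, "of", len(m1_batches), "method-1 mini-batches contain repeated samples")
print(dup_m2, "of", len(m2_batches), "method-2 mini-batches contain repeated samples")
```

With method 2 the balance between old and new data would then have to come from the loss weighting rather than from extra copies.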

I 100% agree with @GMvandeVen .

The massive challenge with version 1 is that it is highly dependent on the number of stored samples, the diversity of your data distribution, the unknown complexity of the task, and all sorts of other factors.

The way I see it, you almost always want explicit control over how you sample (and avoid the additional memory overhead of needing to store additional copies for balancing).

If you check out Figure 8 and the corresponding discussion in our recent position paper (https://arxiv.org/abs/2009.01797), you will see that the observable differences in rehearsal can be quite massive, even for “easy” tasks, simply due to how the sampling is balanced. That is, concatenation (1), balancing stored examples with new data in each batch (2), or going even further and balancing each mini-batch according to classes/concepts (a possible 3) yield completely different behavior, independently of your exemplar “quality”.
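
For reference, a possible sketch of option (3), drawing each mini-batch with the same number of samples per class. This is only an illustration under stated assumptions, not the implementation from the position paper: `class_balanced_batch` is a hypothetical helper, samples are `(x, label)` pairs from current data and memory combined, and the batch size is assumed to be at least the number of classes.

```python
import random
from collections import Counter, defaultdict

def class_balanced_batch(samples, batch_size, rng):
    """Draw a mini-batch with the same number of samples for every class."""
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append((x, y))
    classes = sorted(by_class)
    per_class = batch_size // len(classes)
    batch = []
    for c in classes:
        pool = by_class[c]
        if len(pool) >= per_class:
            batch.extend(rng.sample(pool, per_class))      # without replacement
        else:
            batch.extend(rng.choices(pool, k=per_class))   # too few stored: repeats possible
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
# Classes 0 and 1 are old (5 stored exemplars each), classes 2 and 3 are new (500 each).
samples = [(i, 0) for i in range(5)] + [(i, 1) for i in range(5)]
samples += [(i, 2) for i in range(500)] + [(i, 3) for i in range(500)]

batch = class_balanced_batch(samples, batch_size=32, rng=rng)
label_counts = Counter(label for _, label in batch)
print(sorted(label_counts.items()))
```

When a class has fewer stored exemplars than its share of the batch, this sketch falls back to sampling with replacement, so repeated samples within a mini-batch are possible for those classes.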

In my personal opinion, these questions are currently often undervalued, and the biggest challenge is that you can essentially run into an apples-to-oranges comparison when you contrast rehearsal techniques and simply go with version 1.

Thanks, this was extremely useful. I look forward to hearing more about this at the next meetup :slight_smile:
