Replay pattern for data stream

Hello CL-Friends,

What is a good replay pattern for a standard replay algorithm on streaming data that contains multiple tasks sequentially, when the number of data points per task is not known in advance?

I could imagine keeping a memory of size m for each previous task. New data points belonging to the current task are collected from the data stream until a batch of fixed size is reached. This new-task batch is then concatenated with replay samples of previous tasks drawn from the memory, and the merged batch is used to train the NN.
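
For concreteness, something like the following mixing step is what I have in mind (a rough PyTorch-style sketch; `memory.sample` and the other names are placeholders, not an existing API):

```python
import torch

def train_step(model, optimizer, loss_fn, new_batch, memory, replay_size):
    """One update on the new-task batch concatenated with replayed samples."""
    x_new, y_new = new_batch
    x_old, y_old = memory.sample(replay_size)  # placeholder per-task memory API
    x = torch.cat([x_new, x_old])
    y = torch.cat([y_new, y_old])

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```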

But how do I address the problem that the length of the data stream belonging to the new task is unknown? Usually a network is trained for multiple epochs over an entire data set, but since the data stream could be infinitely long, the mini-batch-wise training process could theoretically last forever. Furthermore, every data point is shown to the network only once. Would it be useful to also store some data of the current task and replay it together with data from previous tasks in the memory and the newest data read from the stream?

Is there a better way to deal with this problem?


Hi @MLStudent94,

Our REMIND model deals with replay in a streaming setting without task boundaries or task labels. After a base initialization stage, REMIND is given samples one at a time. It stores a compressed representation of the new point and then mixes it with a subset of reconstructed previous samples and updates the network with a single iteration of stochastic gradient descent. We showed that this method works well for streaming classification on the ImageNet and CORe50 datasets and extended it to streaming Visual Question Answering (VQA) on the TDIUC and CLEVR datasets.
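
At a high level, one update then looks roughly like the sketch below. This is a simplified illustration only, not our actual implementation: `encode`/`decode` stand in for the compression scheme, and `memory` is just a list of (code, label) pairs.

```python
import random
import torch

def remind_style_step(model, optimizer, loss_fn,
                      x_new, y_new, memory, encode, decode, n_replay=32):
    """Simplified sketch of a streaming replay update (illustration only)."""
    z_new = encode(x_new)  # compress the new sample

    # Mix the new point with a random subset of reconstructed stored samples.
    replay = random.sample(memory, min(n_replay, len(memory)))
    xs = torch.stack([decode(z_new)] + [decode(z) for z, _ in replay])
    ys = torch.tensor([y_new] + [y for _, y in replay])

    # Single iteration of stochastic gradient descent on the mixed batch.
    optimizer.zero_grad()
    loss = loss_fn(model(xs), ys)
    loss.backward()
    optimizer.step()

    memory.append((z_new, y_new))  # keep the compressed representation
    return loss.item()
```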

Here are a few resources on our REMIND model:

More recently, we extended this work to streaming object detection:

I hope this helps! Please feel free to reach out if you have more questions!

Best,
Tyler Hayes

That’s a great question!

Based on your question I think there are 2 main problems here to address:

  1. Which samples to store in the buffer?
  2. Which samples to retrieve from the buffer?

There are quite a few techniques to do so, but an important thing to keep in mind is which assumptions are being made: for example, do we know the task label during learning?
You are referring to a scenario where the number of data points per task is not defined, which in this case is equivalent to requiring a task-agnostic storage and retrieval scheme.

What you could do, for example, is reservoir sampling, which assumes an i.i.d. data stream. Alternatively, you could exploit the class labels instead of the task indicator! We take this simple approach in our recent work, CoPE (https://arxiv.org/abs/2009.00919), which proves to be very effective on highly imbalanced data streams! On top of that, we also review the literature on retrieval/storage schemes that don’t need task labels. We explicitly advocate that the assumptions made are very important, both during learning and testing, which is why we define a framework in Section 2 to provide some perspective in this regard. (:
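
For reference, plain reservoir sampling fits in a few lines (a generic sketch, independent of any particular framework):

```python
import random

class ReservoirBuffer:
    """Reservoir sampling: every stream element ends up in the buffer with
    equal probability, without needing any task labels."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0

    def add(self, sample):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            # Replace a stored sample with probability capacity / n_seen.
            j = random.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = sample

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))
```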

Then " Is it useful to store some data of the current task and replay them together with data from previous tasks from the memory and the newest read data from the data stream?"
This is actually a constraint of your setup, where we don’t know during learning how long a task is (or equivalently we don’t know which task we are looking at). In this case you don’t have a choice but to store samples based on some criterion that doesn’t require a task label as mentioned before. So this also means that you don’t know for the current data if we are looking at a new task or not. So they might be stored in the replay buffer as well and retrieved during learning.

Hope that was clear, let me know if you have any more questions!
Matthias

Hello Matthias,

thanks a lot for your reply.

In general I can assume that the current task label is known at any time. There are no classes. My tasks correspond to different parts produced in mass production on a machine, and the machine can manufacture only one type of part at a time (task 1 means training/predicting the regression problem during production of part 1). That means that during the manufacturing process of a new part (new task) I want to train the NN and predict the output of my regression problem at the same time.

Since the data are collected during a mass production process, they follow a periodic scheme with a cycle duration equal to the time needed to manufacture a single part. These periodic data of course contain some variability and noise.

That means the information in the data stream doesn’t really change too much because of its “periodic” scheme. The longer I train on this data stream, the more robust my trained network gets against noise etc., but the longer I train, the more likely I am to overfit.

That’s an interesting problem setting! Regression makes the problem a whole lot more difficult and there is little research in this direction. However, knowing your task (or arguably you could call it a class as well) allows you to group your regression data into coherent groups.

Now, since in your setting the task label is known at any time, you also know how long each task lasted, i.e. how many samples you have seen for it. Based on this, you could for example allocate an equal amount of buffer space to each task (or allocate space based on frequency of occurrence) and fill it with random samples (or according to another criterion).
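
A minimal sketch of the equal-allocation variant (random replacement within each task’s slice; the class and the replacement criterion are just illustrative):

```python
import random

class PerTaskBuffer:
    """Equal buffer space per task (one possible criterion among many)."""

    def __init__(self, total_capacity):
        self.total_capacity = total_capacity
        self.buffers = {}   # task_id -> list of (x, y) samples
        self.n_seen = {}    # task_id -> number of samples seen so far

    def add(self, task_id, sample):
        self.buffers.setdefault(task_id, [])
        self.n_seen[task_id] = self.n_seen.get(task_id, 0) + 1
        per_task = self.total_capacity // max(1, len(self.buffers))

        # Shrink over-full buffers when a new task appears.
        for buf in self.buffers.values():
            while len(buf) > per_task:
                buf.pop(random.randrange(len(buf)))

        buf = self.buffers[task_id]
        if len(buf) < per_task:
            buf.append(sample)
        else:
            # Reservoir-style replacement within this task's slice.
            j = random.randrange(self.n_seen[task_id])
            if j < per_task:
                buf[j] = sample

    def sample(self, k):
        pool = [s for buf in self.buffers.values() for s in buf]
        return random.sample(pool, min(k, len(pool)))
```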

I don’t think that overfitting is actually a problem in your case, because your evaluation set will be very close to what you encounter during training. The test set is probably just another cycle after some cycles of learning?

Regarding

I want to train the NN and predict the output of my regression problem at the same time

For learning/training you need to predict anyway to train your NN, but in a supervised setting, you actually get the ground truth, so that your network can learn from its mistakes (error) w.r.t. the ground truth.

Thanks for your input!

Yes, the test sets just contain some supervised data points from a couple of “cycles”. When I speak about these test sets I mean the test sets generated from historic data sets.

Whenever a new task occurs, I just start generating new training data from the data stream. For the new task there is no test data available, since the new training data are just generated by the data stream. That means that during my “offline simulation” I basically use the historic test sets to validate my algorithm and check whether I can prevent catastrophic forgetting for these historic tasks learned in sequential order. Still, this is no proof that the algorithm will work on new tasks in the future. Furthermore, I don’t know how I could actually prove this, since no test data is available.

Maybe one approach could be to set a fixed number of training epochs based on the experience from previous tasks and stop training after that number of epochs, then generate some test samples and use them for further testing.

Hello everyone,

I just wanted to give an update on my work.
So far I am using a replay-based approach with a reservoir sampling memory. Whenever a new batch of data is read, it is mixed with samples from the memory. The combined batch is used to train the network with one Adam step. Afterwards, the new task data are written to the memory. Generally, this approach reduces catastrophic forgetting so far.
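
In pseudocode, one iteration of my current loop looks roughly like this (schematic only; the memory is a reservoir buffer like the one sketched earlier in this thread, storing (x, y) pairs):

```python
import torch

def stream_iteration(model, optimizer, loss_fn, new_batch, memory, replay_size):
    """One iteration: mix the new batch with replayed samples, one Adam step."""
    x_new, y_new = new_batch

    # Mix the freshly read batch with samples replayed from the reservoir memory.
    replayed = memory.sample(replay_size)
    if replayed:
        x_old = torch.stack([x for x, _ in replayed])
        y_old = torch.stack([y for _, y in replayed])
        x, y = torch.cat([x_new, x_old]), torch.cat([y_new, y_old])
    else:
        x, y = x_new, y_new

    # Single Adam step on the combined batch.
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    # Afterwards, write the new samples into the reservoir memory.
    for x_i, y_i in zip(x_new, y_new):
        memory.add((x_i, y_i))
    return loss.item()
```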

However, I face the problem that the number of batches of a new task shown to my network (= number of gradient updates) is not enough to converge to an acceptable error value. To increase the number of gradient updates, I prolong the training of my network after the data stream has ended by replaying samples from the memory. Although this leads to better results, I still face the problem of potential overfitting on the memory data, especially when a smaller memory size is used. Furthermore, I then rely heavily on the quality of the last state of the memory.
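
Concretely, the post-stream replay phase looks something like this (a sketch; the fixed number of passes is just an illustrative way of bounding the extra training on the memory):

```python
import random
import torch

def post_stream_replay(model, optimizer, loss_fn, memory, batch_size, n_passes=5):
    """Keep training from the replay memory after the data stream has ended."""
    for _ in range(n_passes):
        random.shuffle(memory.data)  # memory.data: list of (x, y) pairs
        for i in range(0, len(memory.data), batch_size):
            chunk = memory.data[i:i + batch_size]
            x = torch.stack([x_i for x_i, _ in chunk])
            y = torch.stack([y_i for _, y_i in chunk])
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
```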

I have already thought about reducing the batch size to increase the number of gradient updates, but this would lead to higher variance in each step.

I would like to hear your thoughts on my approach of increasing the number of gradient steps by replaying the memory multiple times.