Evaluation Modalities

Hello fellows,
I am defining the experimental settings and pipelines, together with the plotting functions, to compare different strategies on my multitask regression problem.
I distinguish between baselines and Continual Learning strategies, hoping to obtain improvements in the loss. My evaluation metrics are Mean Absolute Error (MAE) and Forgetting.
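For concreteness, here is a minimal sketch of how these two metrics could be computed. The forgetting definition is an assumption: since MAE is an error (lower is better), forgetting on a task is taken here as the increase in MAE between the evaluation right after learning that task and the final evaluation. The function names and the `mae_history` layout are illustrative and are not Avalanche's API.

```python
def mean_absolute_error(predictions, targets):
    """Average absolute difference between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def forgetting(mae_history):
    """Per-task forgetting for an error metric (assumed definition).

    mae_history[i][j] is the MAE on task j measured after training on
    experience i, with one experience per task (square layout).
    A positive value means the error on that task grew after it was learned.
    """
    final = mae_history[-1]
    return [final[j] - mae_history[j][j] for j in range(len(final))]


# Illustrative numbers only, not results from the experiments.
example_mae = mean_absolute_error([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
example_forgetting = forgetting([[0.5, 3.0], [1.25, 0.5]])
```

With this sign convention the last task always has forgetting 0, because it is evaluated right after being learned.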
This is the table of my experiments so far, built from the plugins that are already available to plug and play.


To show the MAE of some strategies (Replay or Naive, for example), I plot the individual scores, like the ones below: for each task test stream I show the mean over three different evaluation phases. For others (like Joint), I use the mean of the MAE over the different tasks (a single value), which makes sense to me.
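A small sketch of the aggregation described above, assuming the scores are stored as one row per evaluation phase and one column per task test stream. The helper names and all numbers are made up for illustration.

```python
from statistics import mean


def per_task_mean(mae_history):
    """For each task, average the MAE over all evaluation phases."""
    n_tasks = len(mae_history[0])
    return [mean(phase[j] for phase in mae_history) for j in range(n_tasks)]


def as_reference_line(value, n_points):
    """Repeat a single-valued baseline so it can be drawn next to per-task scores."""
    return [value] * n_points


# Rows: three evaluation phases. Columns: three task test streams.
replay_history = [
    [1.0, 4.0, 6.0],
    [2.0, 1.0, 5.0],
    [3.0, 1.0, 1.0],
]
replay_scores = per_task_mean(replay_history)

# Joint is evaluated once, so it reduces to a single mean over the tasks.
joint_value = mean([1.5, 2.0, 2.5])
joint_line = as_reference_line(joint_value, len(replay_scores))
```

Drawing `joint_line` as a horizontal line over the per-task scores is one way to keep single-valued baselines and multi-valued strategies in the same figure.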


Different strategies give me different kinds of curves. Representing them all together in a single plot this way seems good to me, at least for the MAE.

What are the best ways to represent each of the different strategies and baselines in your opinion?
I would like to add a column to the table above describing the evaluation modality (and therefore the corresponding representation) of each strategy. To make it even richer, I would love to add another column indicating whether the strategy is suitable for regression.

An example:
I have also included GDumb (Greedy Sampler and Dumb Learner) in the experiments. However, it does not really seem a coherent (though viable) baseline for my regression problem; it seems a better CL baseline for classification. Am I wrong? In the paper the authors do not write a line about regression.

Thank you very much in advance,
so that I can start thinking about the search for the best hyperparameters :smiling_face_with_tear:
Questions are in bold :slight_smile: :crossed_fingers:


Hi @vale_bonsi,
great work! As far as I know, many of these strategies were developed and tested only for classification, so evaluating the performance on regression is definitely interesting.
I have some doubts about whether some of these strategies can be easily ported from classification to regression. As an example, Deep SLDA is based on updating a mean vector for each class and a shared covariance matrix (all computed from the features extracted by a pretrained model). I don’t know how you could use this strategy for regression, since it seems to have been developed with classification in mind.

Moreover, I noticed a strange behavior in the plot: the joint training baseline has a greater error than the naive strategy (which has the lowest error of them all!). I’d expect the exact opposite, even with regression. Maybe I misinterpreted the plot, or the test is performed differently for the baseline strategies and the CL strategies. In any case, something may be wrong if naive is the best strategy, even better than joint training or cumulative.
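A toy sketch of how a difference in the evaluation protocol alone could produce this effect. It is purely illustrative and is not Avalanche code: `ToyMeanRegressor` is a hypothetical stand-in for a strategy that forgets everything except the last data it saw, and the data are invented.

```python
from statistics import mean


def mae(prediction, targets):
    """MAE of a constant prediction against a list of targets."""
    return mean(abs(prediction - t) for t in targets)


class ToyMeanRegressor:
    """Hypothetical strategy: predicts the mean target of its last training data."""

    def __init__(self):
        self.value = 0.0

    def train(self, targets):
        self.value = mean(targets)

    def eval(self, test_stream):
        return [mae(self.value, targets) for targets in test_stream]


train_stream = [[1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]  # two tasks
test_stream = [[1.0, 1.0], [5.0, 5.0]]

# Incremental protocol: train on one experience, then test on every task.
naive = ToyMeanRegressor()
naive_scores = []
for experience in train_stream:
    naive.train(experience)
    naive_scores.append(naive.eval(test_stream))

# Joint protocol: train once on all the data, then test once.
joint = ToyMeanRegressor()
joint.train([y for experience in train_stream for y in experience])
joint_scores = joint.eval(test_stream)

# Each task scored right after it was learned (flattering for naive).
naive_diagonal = mean(naive_scores[i][i] for i in range(len(train_stream)))
# All tasks scored after the last experience (comparable with joint).
naive_final = mean(naive_scores[-1])
joint_mean = mean(joint_scores)
```

In this toy setting `naive_diagonal` is lower than `joint_mean`, while `naive_final` is not. If the plotted naive scores mix evaluations taken at different points of the stream, naive can look better than joint even though it has forgotten the first task.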

Returning to your question, IMO the best way to add information about the strategies is a column that describes the most important characteristics of each one, e.g. the mechanism used to counter forgetting (replay, regularization, architectural) or the type of optimization used (Deep SLDA is not gradient-based, CWR freezes the feature extractor after the first task, etc.).

Just out of curiosity, what dataset and regression problem are you using (or planning to use) for this evaluation?

Indeed that is my worry! :smiley:
Thank you for your quick reply. :heart:

Using Avalanche as the framework for CL, this is (more or less) the code that I use for the Naive, Replay, GDumb, GEM and Cumulative strategies/baselines. Since training is done per task, I can evaluate the model's performance on the test set after the learning phase of each task and gather the scores that way.

for experience in benchmark.train_stream:

Meanwhile, this is the code for the Joint baseline:


Here is the plot on a similar dataset…

I am performing a preliminary overview of the different strategies and baselines. I am training the models for only 10 epochs, without diving particularly deep into the techniques. :slight_smile:

My dataset is an experimental dataset about traffic reconstruction…

I know the theory behind the different strategies to a certain degree; however, I have no prior expectations about the results :grin: