1 Million neurons and curve fitting

Suppose you have a layer of 1 million ReLU neurons in a conventional artificial neural network, fed by a prior layer of the same width. Each neuron then has 1 million weights connecting it back to the neurons in the prior layer. The neural network is trying to fit some complicated curve in higher dimensions. Each neuron is trying to match one place where the curve changes, and to change the response of the neural network there too.
I think those places are sometimes called breakpoints.
The problem is that you are using 1 million weights just to get 1 change decision (1 breakpoint decision). That seems excessive and inefficient for curve fitting.
What is the optimal number of weight parameters to use per change decision: 1000, 100? Or is the answer at the other extreme, so that you should use only 1 or 2 parameters per change decision?
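Getting down to 1 or 2 parameters is possible using fast distributive random projections or fast distributive transforms together with parametric activation functions. The projection or transform itself has no adjustable weights, so the only parameters are the 1 or 2 inside each activation function.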
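Here is a rough sketch of what such a layer could look like. It is only one possible reading of the idea, under these assumptions: the fast transform is a Walsh-Hadamard transform, the randomness comes from a fixed vector of random sign flips, and the parametric activation is a two-slope function with one slope for negative inputs and one for positive inputs. The names `fwht`, `two_slope` and `fast_layer` are made up for this example.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform of a 1-D array (length a power of 2)."""
    x = np.array(x, dtype=float)  # work on a copy
    n = x.shape[0]
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of 2"
    h = 1
    while h < n:
        # Butterfly step: pair up blocks of size h and take sums and differences.
        blocks = x.reshape(-1, 2, h)
        sums = blocks[:, 0] + blocks[:, 1]
        diffs = blocks[:, 0] - blocks[:, 1]
        x = np.stack([sums, diffs], axis=1).reshape(n)
        h *= 2
    return x

def two_slope(z, neg, pos):
    """Parametric activation: slope pos[i] where z[i] >= 0, slope neg[i] elsewhere."""
    return np.where(z >= 0, pos * z, neg * z)

def fast_layer(x, signs, neg, pos):
    """Fixed random projection (sign flips + transform) followed by the activation."""
    n = x.shape[0]
    z = fwht(signs * x) / np.sqrt(n)  # no trainable weights in this step
    return two_slope(z, neg, pos)

n = 8
rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=n)  # fixed, not trained
neg = np.full(n, 0.1)                    # trainable slopes for z < 0
pos = np.ones(n)                         # trainable slopes for z >= 0
x = rng.normal(size=n)
y = fast_layer(x, signs, neg, pos)

trainable = neg.size + pos.size
print(trainable / n)  # 2.0 parameters per activation, i.e. per breakpoint
```

Each activation switches slope at zero, so it contributes one breakpoint, and the layer as a whole has 2 trainable parameters per breakpoint instead of 1 million.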
Because of the high connectivity of biological neurons (each connects to 1000 to 10000 other neurons), the biological brain would need only 2 layers of 1 million neurons to do a large fast random projection, since 1000 × 1000 = 1 million. An algorithm on a digital computer that combines values two at a time would need about 20 mathematical layers to reach the same level of connectivity for full random distribution, since 2^20 is about 1 million.
If there were evolutionary pressure to reduce the number of parameters that need to be adjusted to get a behavior or response, then it is possible that nature could have found such an arrangement.
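The layer counts can be checked with a few lines of arithmetic. This sketch assumes that every layer multiplies the number of inputs that can influence a given output by the fan-in, and that the digital algorithm uses a fan-in of 2 per stage, as in a butterfly-style fast transform. The function name `layers_needed` is made up for this example.

```python
def layers_needed(width, fan_in):
    """Smallest number of layers so that fan_in ** layers >= width."""
    layers = 0
    reach = 1  # how many inputs can influence one output so far
    while reach < width:
        reach *= fan_in
        layers += 1
    return layers

print(layers_needed(1_000_000, 1000))  # 2
print(layers_needed(1_000_000, 2))     # 20
```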

To get even more complicated, you could mix parametric activation functions and standard non-parametric activation functions with fast random projections or fast transforms to get a fractional number of parameters per change decision, for example 0.5 parameters per change decision. I haven't tried that yet, but it is certainly possible.

There is a nice video about the ‘breakpoints’ of a conventional ReLU neural network that is very interesting:

However, if you think about a conventional neural network of width 1 million, each neuron needs 1 million weight parameters, and for that you get only 1 breakpoint!
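To put numbers on that, here is a small count for one dense layer. It assumes a square layer (input width equal to output width) and one breakpoint per ReLU neuron. The function name `dense_layer_cost` is made up for this example.

```python
def dense_layer_cost(width, use_bias=False):
    """Return (total_params, breakpoints, params_per_breakpoint) for a square dense ReLU layer."""
    params_per_neuron = width + (1 if use_bias else 0)
    total_params = width * params_per_neuron
    breakpoints = width  # one ReLU kink per neuron
    return total_params, breakpoints, total_params // breakpoints

total, breaks, ratio = dense_layer_cost(1_000_000)
print(total, breaks, ratio)  # 1000000000000 1000000 1000000
```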

You could even get down to a fractional number of parameters per curve decision (breakpoint) using fast transform neural networks. Instead of making all the activation functions in the neural network parametric, you could make, for example, 25% of them parametric and leave the other 75% as standard non-parametric activation functions. With 2 parameters in each parametric activation function, you would end up with 0.5 parameters per breakpoint.
I haven't tried that in code yet, so I don't know if it would be beneficial.
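As a starting point for trying it, here is an untested-in-practice sketch of the mixed activation step only. It assumes the parametric activation is a two-slope function with 2 trainable slopes, the non-parametric one is plain ReLU, every activation contributes one breakpoint, and the fast transform in front of it has no trainable weights. The name `mixed_activation` and the mask `is_parametric` are made up for this example.

```python
import numpy as np

def mixed_activation(z, is_parametric, neg, pos):
    """Two-slope parametric activation where is_parametric is True, plain ReLU elsewhere."""
    parametric = np.where(z >= 0, pos * z, neg * z)
    relu = np.maximum(z, 0.0)
    return np.where(is_parametric, parametric, relu)

n = 16
rng = np.random.default_rng(0)

# Pick 25% of the positions at random to be parametric.
is_parametric = np.zeros(n, dtype=bool)
is_parametric[rng.choice(n, size=n // 4, replace=False)] = True

# Slopes start out as ReLU; only the entries under the mask would be trained.
neg = np.zeros(n)
pos = np.ones(n)

z = rng.normal(size=n)
out = mixed_activation(z, is_parametric, neg, pos)

trainable = 2 * int(is_parametric.sum())
params_per_breakpoint = trainable / n
print(params_per_breakpoint)  # 0.5
```

Whether a network built this way trains well is a separate question that this count does not answer.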