Judea Pearl in a recent tweet expresses the intractable nature of generative processes:
@shell_ki Contrary to expectations, the definition of "causal modeling" is fairly easy to articulate. To me, "causal model" is a set of assumptions about the data generating process, which cannot be expressed as properties of the joint distribution of observed variables. #Bookofwhy
A generating process can conjure up complexity that cannot be predicted from the bulk statistics that are observed. This is best demonstrated in this graphic:
The algorithm to conjure up this deceptive distribution is very simple. Take an existing dataset and perturb it slightly while maintaining specific statistical properties. This is done by randomly selecting a point, adding a small perturbation, and then validating that the statistics remain within targeted bounds. Repeat these perturbations enough times and you can target different results. This illustrates the Achilles heel of using bulk statistics to characterize causal behavior.
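The perturb-and-validate loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the code behind the graphic: the function names (`summary`, `perturb_dataset`), the choice of statistics (means, standard deviations, correlation), and the values of `scale` and `tol` are all assumptions made here. The sketch also leaves out the step that steers the points toward a chosen target shape; it only shows that a dataset can wander while its bulk statistics stay put.

```python
import numpy as np

def summary(points):
    """Bulk statistics to preserve: means, standard deviations, correlation."""
    x, y = points[:, 0], points[:, 1]
    return np.array([x.mean(), y.mean(), x.std(), y.std(),
                     np.corrcoef(x, y)[0, 1]])

def perturb_dataset(points, steps=5000, scale=0.1, tol=0.01, seed=0):
    """Nudge one random point at a time; keep the move only if every
    statistic stays within `tol` of its original value."""
    rng = np.random.default_rng(seed)
    target = summary(points)
    current = points.copy()
    for _ in range(steps):
        i = rng.integers(len(current))            # pick a random point
        candidate = current.copy()
        candidate[i] += rng.normal(0.0, scale, size=2)  # small perturbation
        if np.all(np.abs(summary(candidate) - target) < tol):
            current = candidate                   # statistics still in bounds
    return current

rng = np.random.default_rng(1)
original = rng.normal(size=(100, 2))   # toy dataset (assumed, for illustration)
shifted = perturb_dataset(original)
```

After the loop, `summary(shifted)` agrees with `summary(original)` to within `tol`, even though the individual points have moved.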
This also reveals that small imperceptible perturbations can lead to emergent behavior. This emergent behavior cannot be used to explain its causes. This is what Daniel Dennett would describe as an inversion of reasoning.
Another good demonstration of this is in using Deep Learning to invert a diffusion process (code):
The authors train a system to invert a random walk process to reconstruct the original distribution from just random Gaussian noise. In this example, you can run time backward and it will go against entropy and reconstruct its original form. Imperceptible perturbations can evolve in the opposite direction of entropy.
Deep Learning's success rests fundamentally on exploiting imperceptible (alternatively, infinitesimal) perturbations on a massive scale. The entire idea of stochastic gradient descent is to perturb the entire network (note: top down) so that it evolves towards satisfying its fitness function. Top-down intentional behavior drives perturbations across millions of parameters in the network. The fitness function acts as an environmental constraint that these millions of parameters conform to.
The primary challenge for Deep Learning is to discover the rules for these perturbations. The machine learning community has historically based its methods on optimization used for curve fitting. The introduction of generative models such as GANs has changed this narrative. In the conventional method, a designer describes a fitness function, and through optimization a model learns to match the training data. With GANs, however, the fitness function is itself a trained neural network (i.e. the Discriminator). Furthermore, the objective is not to classify data but to generate it. In a sense, these systems learn a causal model (note: all computation is driven by causality) that can generate a target dataset. Many different causal models can generate the same data. Therefore, the method of inquiry should find ways to constrain the generation so that it approximates the unknown true causal model.
Recent research on “Perturbative Neural Networks” demonstrates a network whose performance is comparable to that of convolutional networks. The research describes a perturbation layer whose activation is a weighted linear combination of noise-perturbed inputs:
So, you can select any random perturbation; all that is needed is that a perturbation is performed. No hand-engineered exotic layer is required, just a random layer! Perturbation is all that is ever needed. In fact, randomness is so vitally important to Deep Learning that network parameters must be initialized randomly to ensure training convergence. The primary reason randomness is critical is that it relates to cognitive diversity. These massively parallel networks require sufficient diversity to ensure robust cognition. One can even claim that perturbation only works with diversity and that “Diversity is all you need”.
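To make the idea concrete, here is a rough sketch of what such a layer might look like in NumPy. It is not the paper's implementation: the class name `PerturbationLayer`, the use of one uniform noise mask per input channel, the `noise_scale` value, and the ReLU applied before the weighted sum are all assumptions of this sketch. The point it illustrates is that the noise is drawn once and frozen, and only the combining weights would be learned.

```python
import numpy as np

class PerturbationLayer:
    """Hypothetical layer: fixed additive noise, then a learned linear mix."""

    def __init__(self, in_channels, out_channels, height, width,
                 noise_scale=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed noise masks: drawn once at construction and never trained.
        self.noise = rng.uniform(-noise_scale, noise_scale,
                                 size=(in_channels, height, width))
        # Trainable weights that mix the perturbed channels
        # (equivalent to a 1x1 convolution across channels).
        self.weights = rng.normal(0.0, 1.0 / np.sqrt(in_channels),
                                  size=(out_channels, in_channels))

    def forward(self, x):
        # x has shape (batch, in_channels, height, width).
        perturbed = np.maximum(x + self.noise, 0.0)  # ReLU(x + noise), assumed
        # Weighted linear combination over the input channels.
        return np.einsum("oc,bchw->bohw", self.weights, perturbed)

layer = PerturbationLayer(in_channels=3, out_channels=8, height=4, width=4)
batch = np.ones((2, 3, 4, 4))
out = layer.forward(batch)   # shape (2, 8, 4, 4)
```

Note that there is no spatial filter anywhere in the layer; whatever spatial structure the layer responds to comes from the fixed noise pattern.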
To illustrate the intrinsic diversity that is captured in a neural network, one can “reprogram” a previously trained neural network to perform a task that it was not designed to perform. This is done without perturbing any of the original parameters (i.e. weights). This is illustrated in the paper “Adversarial Reprogramming of Neural Networks”, where Elsayed, Goodfellow, and Sohl-Dickstein show how to repurpose an ImageNet-trained network to perform MNIST classification only by manipulating the input:
In short, the intrinsic diversity of computation in a neural network is rich enough to be repurposed without customization (i.e. retraining). From the perspective of nano-intentionality, individual cells are highly competent and it is likely that neurons are recruited in a manner that restricts their original competence. Paradoxically, learning is achieved by constraining capability rather than adding capacity. This thought conforms to the principle of least action. That is, it is easier to remove something than to add something. This is in fact how the immune response is generated: it goes through hyper-mutation so that all the kinds of pattern-matching T-cells are created in the beginning.
One can call this the Lottery Ticket Hypothesis. The basic idea, applied to neural networks, is that you begin with a wide enough neural network with sufficient diversity. Then you train the network on a new task. After that, you prune the network, removing the parts that contribute negligibly to the final behavior. Deep Learning networks work because they are initialized with sufficient diversity. Very few talk about this intrinsic diversity because this mind-numbing, unintelligent characteristic is nothing to brag about!
A causal model discovery process is based on discovering the required constraints that can guide the imperceptible perturbations of the generative process. In biology, these ‘required constraints’ are encoded in DNA, and an analogous perturbative process generates a complex multicellular organism. So discovering the DNA (the generative language) of a complex process is the objective of any inquiry that seeks comprehension and control of that process. There exists a causal model, and there exists code that not only describes this causal model but also generates it. This is the surprising complexity of DNA's indirect encoding.
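The pruning step can be sketched as simple magnitude pruning. This is an assumed stand-in for "removing parts that contribute negligibly": the helper `magnitude_prune`, the 80% pruning fraction, and the random matrix standing in for trained weights are all illustrative choices. The lottery ticket procedure as published additionally resets the surviving weights to their original initial values before retraining, which this sketch does not show.

```python
import numpy as np

def magnitude_prune(weights, fraction):
    """Return a 0/1 mask that drops the `fraction` of weights
    with the smallest absolute value."""
    k = int(fraction * weights.size)       # number of weights to remove
    if k == 0:
        return np.ones_like(weights)
    # The k-th smallest magnitude becomes the cut-off.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
trained = rng.normal(size=(64, 32))   # stand-in for a trained weight matrix
mask = magnitude_prune(trained, 0.8)  # keep the largest 20% of weights
pruned = trained * mask
```

The mask, rather than the pruned values, is the interesting artifact: it records which part of the initial diversity turned out to matter.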
Can we design DL architectures that are capable of extracting the causal relationships in complex systems? The research titled Causal Effect Inference with Deep Latent-Variable Models explores this idea in greater detail:
We build on recent advances in latent variable modeling to simultaneously estimate the unknown latent space summarizing the confounders and the causal effect. Our method is based on Variational Autoencoders (VAE) which follow the causal structure of inference with proxies.
The recent paper on “Stylistic Generative Models” introduces a very promising method for controlling perturbative generative layers by constraining generation through style layers.
Evolution deals with the high-level rules as to why the organisms that survive are those most adaptive to their environments. Evolution does not prescribe the actual mechanism, only the general probability principle. There exist other principles that reveal the constraints that drive a perturbative process. The principle of least action is one such principle, and it tells you that an organism will select the perturbation that requires the least cost. The principle of least action motivates the principle of minimum description length: a system will most likely select the behavior that is described by the shortest program code. This, I suspect, is not universal, because the principle of adjacent possibles can restrict the code to be not what is shortest but what is available. An organism will select the perturbation that is already available and most useful. Complexity arises from the synergy of combining the perturbations that are most useful.
This highlights the flaws of many universal top-down methods like Bayesian approaches, maximum entropy, maximum likelihood, minimal free energy, etc. As Ilya Prigogine has alluded to, order is created only in far-from-equilibrium environments. The perturbations that created the building blocks of life and mind are not likely to be present in a tepid environment. These are forged in environments very different from what is present.
The curriculum to develop AGI will, therefore, require training in environments that are different from the target environment. This is because important skills that may be required can only be found elsewhere. Innovation always happens elsewhere. The Eukaryotic cell, the building block of all multicellular life, required symbiosis with the mitochondrion, another cell that was invented elsewhere under different environmental conditions. Evolution does not happen only through a selection process; it happens through cooperation with a diverse collection of capabilities that have been invented elsewhere.
The brain’s neuron as a consequence of evolving from a Eukaryotic cell has its own nano-intentionality. Therefore, the imperceptible perturbations that it performs can involve a lot more than just responding to a gradient signal. Its perturbation is more likely based on its own local group and the containment of that local group in a much larger cluster. There are bottom-up and top-down forces that influence the behavior of the whole. Each local group behavior could be more complex than its constituents and for every cluster, there is a bubble up of behavioral complexity. This leads to a bewildering soup of cognitive complexity, all due to imperceptible perturbations of every neuron.
This emergent complexity can lead any researcher to utter despair. However, the great discovery of Deep Learning is that we now have tools that reveal the massive capabilities that simple imperceptible perturbations can lead to. What we are dealing with here is a revolutionary new way of doing science. It is critically important for researchers to realize that many of their own mathematically inspired tools are now obsolete.
- Initialize a network with all zero weights, attempt to train the network to perform MNIST.
- Initialize the “Inverting Diffusion” system’s initial state not with random samples, but from a fixed set (i.e. in a line). Does it converge to the original swirl?
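
As a starting point for the first exercise, the following sketch compares the gradients of a zero-initialized network with those of a randomly initialized one. It is a toy stand-in, not an MNIST experiment: the random data, the one-hidden-layer tanh architecture, the mean squared error loss, and the helper name `gradients` are all assumptions made to keep the example small.

```python
import numpy as np

def gradients(params, x, y):
    """Gradients of mean squared error for a one-hidden-layer tanh network."""
    w1, b1, w2, b2 = params
    h = np.tanh(x @ w1 + b1)               # hidden activations
    out = h @ w2 + b2                      # network output
    d_out = 2.0 * (out - y) / len(x)       # dLoss/d_out
    g_w2 = h.T @ d_out
    g_b2 = d_out.sum(axis=0)
    d_h = (d_out @ w2.T) * (1.0 - h ** 2)  # backprop through tanh
    g_w1 = x.T @ d_h
    g_b1 = d_h.sum(axis=0)
    return g_w1, g_b1, g_w2, g_b2

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))   # toy inputs (assumed, not MNIST)
y = rng.normal(size=(32, 1))   # toy targets

zero_params = [np.zeros((4, 8)), np.zeros(8), np.zeros((8, 1)), np.zeros(1)]
random_params = [rng.normal(size=(4, 8)), np.zeros(8),
                 rng.normal(size=(8, 1)), np.zeros(1)]

zero_grads = gradients(zero_params, x, y)
random_grads = gradients(random_params, x, y)
```

With all-zero weights, the hidden activations are zero, so `zero_grads` is zero everywhere except the output bias, and gradient descent can only ever learn a constant. With other activations, such as a sigmoid, the gradients need not vanish, but every hidden unit receives the same update and the units stay identical, which is the lack of diversity the essay points to.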