Evolving Supra-Normal Art using Pre-trained Neural Networks as Aesthetic Selectors
Updated: Feb 5, 2021
Let us generalize the work of Tom White and others who have used stochastic search to produce artworks (images) that maximize the class probability of a pre-trained classifier. The generalization: take any pre-trained neural network, e.g. from here, and evolve an input to it which maximizes some aspect of the regressor's or classifier's output. In this way one finds supra-normal stimuli, of the kind exemplified by the Venus of Willendorf. The principle extends to reinforcement learning networks as well, e.g. finding the Go board position that is most confusing, or the Atari screen that most makes an agent want to go right, or to shoot. This approach sees art as that which maximizes some affordance for the viewer, in the Gibsonian sense. One might also take existing works of art and see how classifiers respond to them, and use this to produce an interpreted view of the original, as Paglen has to some extent done.

Above is an image I evolved to maximize the probability that ResNet50 thought it was a goldfish. You can see that a very dark image has evolved, one with a goldfishy quality about it.
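The search loop behind the goldfish image can be sketched in a few lines. Below is a minimal, self-contained version in numpy: the frozen "classifier" is a random linear map plus softmax standing in for a real network (in the actual experiment it would be e.g. torchvision's pre-trained ResNet50, queried as a black box), and the search is a simple (1+λ) evolutionary hill climb on the target class probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "pre-trained classifier": a frozen random linear map + softmax over
# 1000 ImageNet-style classes. A real run would swap in e.g. ResNet50; only
# its output probability is needed, no gradients.
N_CLASSES, N_PIXELS = 1000, 32 * 32
W = rng.normal(size=(N_CLASSES, N_PIXELS))

def class_prob(image, target):
    """Probability the frozen network assigns to `target` for a flat image."""
    logits = W @ image
    logits -= logits.max()                 # for numerical stability
    probs = np.exp(logits)
    return probs[target] / probs.sum()

def evolve(target=1, generations=100, children=16, sigma=0.05):
    """(1+lambda) evolution: mutate the elite, keep any improvement."""
    best = rng.random(N_PIXELS)            # start from noise, pixels in [0, 1]
    start_fit = best_fit = class_prob(best, target)
    for _ in range(generations):
        for _ in range(children):
            child = np.clip(best + sigma * rng.normal(size=N_PIXELS), 0, 1)
            fit = class_prob(child, target)
            if fit > best_fit:             # elitism: fitness never decreases
                best, best_fit = child, fit
    return best, best_fit, start_fit
```

Because the elite is only ever replaced by a fitter child, the fitness curve is monotone; the point of the sketch is that the network is used purely as an aesthetic selector, never trained or differentiated.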
This approach falls within the cognitive neuroscience of art, which claims that we ought to find a close coupling between artists' productive strategies and the operations of our perceptual systems (p. 27, Art, Aesthetics and the Brain); see e.g. Richard Latto's work. Latto asks why we prefer one abstract shape to another. Art works because we are visually limited: "We like looking at stimuli that we are good at seeing in some way," he says. Perhaps that is why we like Mondrian: we are simply better at seeing vertical and horizontal lines. Maybe artists are more sensitive to their own perceptual processes than the rest of us, and are able to identify and exploit this understanding. Faces have their own galleries, e.g. the National Portrait Gallery, because we so enjoy looking at them. I think this approach is more fruitful than trying to find an ideal formal-compositional aesthetic metric, as people such as Berlyne tried before; that older project is more closely related to current approaches in intrinsic motivation, which attempt to find equations for interestingness or beauty. There are many artistic games which can be played.
In summary, I would suggest that an interesting new art game is one which invents an interesting new goal, and an interesting new goal is one which identifies, and then manipulates with mastery, some interesting neural/cognitive property of our brains. There is an unlimited number of such manipulations possible. As with science, a lot is learned by maximizing or minimizing some quantity, e.g. height --> space rockets, temperature --> fridges. But art is not trying to manipulate and predict the world; it is trying to manipulate and predict the mind. The artistic method is the scientific method applied to the mind. There are trivial games to play in science just as there are in art: once you can manipulate sums, there isn't much point in manipulating sums-plus-three, for example. Trying to formalize what science is has been a rather difficult task for the philosophy of science. In short, science seems to be a set of heuristics for effective Bayesian model construction and selection: ensure that a theory is falsifiable, that it is parsimonious (as simple as possible), and so on. The artistic process also contains heuristics for inventing new artistic theories, which can be thought of as generative and evaluative constraints that manipulate and predict new data (the data in this case being not things in the world, but mental states). Let's consider some artistic games:
1. Manipulate what you think something is (photorealism; Tom White's work). An ImageNet classifier simulates the part of the temporal lobe that performs this recognition.
2. Manipulate what you feel. We don't yet have agents that feel much, although we do have agents that can behave, e.g. RL agents playing first-person shooters or Atari. I've been thinking of, for example, generating images that maximize the probability of an RL agent running away, or shooting. This view is about art having effects on affordances: manipulating (maximizing and minimizing) the affordances of things, e.g. maximizing the sit-on-able-comfortableness of something. Strictly, what I've described here is not quite feelings but affordances. In summary, a formula for making art: take a pre-trained neural network and make things that maximize or minimize its outputs.
1. Choose an RL agent, e.g. AlphaGo, or an agent trained to bake cakes or drive cars.
2. Produce (through a process of search/optimization) a visual image that the agent can look at which makes it do or think something, e.g. makes it want to resign, makes it confused about whether it will win, makes it want to add flour, makes it want to hit the brakes. This will be an art object.
3. The full set of art objects for this agent will be all those images which make it think or do all the things it can think or do.
4. If it can only think and do 3 things, then there are only 3 pieces of art that agent can appreciate.
5. But hopefully, if it's an agent that can e.g. drive a Tesla, the art that a self-driving car would appreciate would be quite diverse, consisting of images it would assign different values and affordances to, i.e. that would make it behave in different ways.
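The steps above can be sketched concretely. In this toy version the "policy" is a frozen random linear-softmax network over a flattened screen, and the action names are purely illustrative stand-ins for a real Atari or driving agent. Because the stand-in is differentiable, the optimization in step 2 can be plain gradient ascent on log P(action); for a real black-box agent the same evolutionary search used for the goldfish would do.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "RL policy": a frozen random linear map from a flattened screen to
# action logits. A real experiment would plug in a trained agent here and
# treat it as the selector.
ACTIONS = ["noop", "left", "right", "shoot"]
W = rng.normal(size=(len(ACTIONS), 16 * 16))

def action_probs(screen):
    """Softmax over the policy's action logits for a flat 16x16 screen."""
    logits = W @ screen
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def persuade(action="shoot", steps=1200, lr=0.004):
    """Ascend log P(action) with respect to the screen. For a linear-softmax
    policy the gradient is closed-form: W[a] - sum_j p_j * W[j]."""
    a = ACTIONS.index(action)
    screen = np.full(16 * 16, 0.5)            # grey start, pixels in [0, 1]
    for _ in range(steps):
        p = action_probs(screen)
        grad = W[a] - p @ W                   # d log P(action) / d screen
        screen = np.clip(screen + lr * grad, 0.0, 1.0)
    return screen, action_probs(screen)[a]
```

Each action the agent can take defines one optimization target, i.e. one "piece of art" this agent can appreciate, which is exactly the counting argument in steps 3-5.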
This is related to DeepDream, which used gradient ascent to maximize the L2 norm of a layer's activations by modifying the image rather than the weights. Some tricks were used: moving the image a little each step (jitter), normalizing the gradients flowing down to the image, and magnifying the image in 'octaves' so as not to accumulate artifacts at a single spatial scale. The effect was to select for images that manipulated the conclusions of the perceiver, and these images were of some artistic interest, although in many ways they lacked any real understanding of the image. DeepDream highlighted the fact that convolutional neural networks largely work on local texture relations, and so can be fooled by modifications of this local-texture type rather than by higher-order image properties. I like the use of jitter and octaves; it relates to what Tom White did in his work on maximizing ImageNet classifications, forcing the image modifications to be more human-interpretable and meaningful. Perhaps the images have the character they do because of the greediness of backpropagation, which makes only those changes to the image that, in the limit, are expected to maximize the decrease in error; backpropagation makes no guesses about how to change the image in more structured (macro) ways.
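The jitter and gradient-normalization tricks are easy to state in code. Below is a toy version in which the "layer" is a frozen random linear feature map, so the objective (the squared L2 norm of the activations) and its gradient are closed-form; DeepDream proper runs the same ascent against a conv layer of a trained network, and adds the octave pyramid, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "layer": a frozen random linear feature map over a 16x16 image.
# DeepDream ascends the L2 norm of a real conv layer's activations, modifying
# the image rather than the weights; the jitter and gradient normalization
# below are the same tricks.
SIDE = 16
FEATS = rng.normal(size=(64, SIDE * SIDE))

def layer_l2(img):
    """Objective: squared L2 norm of the layer's activations."""
    return float(np.sum((FEATS @ img.ravel()) ** 2))

def dream_step(img, lr=0.05, jitter=2):
    dy, dx = rng.integers(-jitter, jitter + 1, size=2)
    img = np.roll(img, (dy, dx), axis=(0, 1))         # jitter before the step
    grad = (2.0 * FEATS.T @ (FEATS @ img.ravel())).reshape(img.shape)
    grad /= np.abs(grad).mean() + 1e-8                # normalize the gradient
    img = img + lr * grad                             # gradient ascent step
    return np.roll(img, (-dy, -dx), axis=(0, 1))      # undo the jitter

def dream(steps=100):
    img0 = rng.random((SIDE, SIDE))
    img = img0
    for _ in range(steps):
        img = dream_step(img)
    return img0, img
```

The jitter means the ascent cannot latch onto one fixed pixel grid, which is what pushes the changes away from single-scale artifacts and toward more interpretable structure.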