It was a pleasure to find the work of Aaron Hertzmann via the NeurIPS 2020 workshop on creativity. His talk addresses some issues (failures) I ran into around May 2020, when I was trying to evolve drawings that would maximize the entropy of an ImageNet classifier's output, so that it would be maximally confused about what it was looking at. Here I describe my own experiments, and then discuss what Hertzmann's approach suggests.
I was interested in how to measure "aesthetic interestingness" in abstract art. What is a sufficient criterion for a good piece of abstract art? We have such a rich visual experience in life. We know what things look like, their shapes, their arrangements relative to each other as they normally appear, etc. The hopelessness of capturing an evaluation function in a simple computational aesthetic equation became somewhat evident. I wondered: if you showed good paintings and bad paintings to an ImageNet classifier, would there be any distributional property of its output probabilities that would tell you which was which? One could even show hierarchical image patches of the painting to the classifier and look for a function of the outputs that might tell you what was a good painting and what was not. If you showed an abstract image, e.g. a Jackson Pollock, to an ImageNet classifier, what distribution of object probabilities would it produce? What about a Rothko, or a regular square grid? If you could evolve a drawing, a drawing being a set of lines, say 2000 lines, and the fitness of the drawing was some function of the classifier's output probabilities when observing the drawing, then what drawing would make the classifier maximally confused? And what would this drawing look like? This is what I asked myself. In short, I was thinking of ways in which a pre-trained visual object discriminator could be used to tell what was and what was not an 'interesting' abstract image.
I showed ResNet50 an image of the moon and found that the highest class probabilities went to the following items: [[('n04372370', 'switch', 0.1295303), ('n02799071', 'baseball', 0.09941145), ('n03400231', 'frying_pan', 0.080728576), ('n04332243', 'strainer', 0.07138036), ('n03874599', 'padlock', 0.05697499)]]. I can understand frying_pan and padlock. I then evolved drawings to maximize the entropy of the 5 highest-probability classes. The drawings were produced by a wheeled robot that had a little camera looking down at the drawing around its pen. The camera image was fed into a neural network that controlled the robot's motors.
Some examples of such evolved drawings are shown above, and the top probabilities are as follows: [[('n04275548', 'spider_web', 0.98526186), ('n01773797', 'garden_spider', 0.0108058965), ('n01773549', 'barn_spider', 0.003682051), ('n01773157', 'black_and_gold_garden_spider', 0.0002397322), ('n03000134', 'chainlink_fence', 6.1695837e-06)]]. Notice that spider_web has by far the highest probability. This is not surprising, since the drawings were made as bright white lines on a black surface, and it seemed very hard for any drawing to look like anything but a spider's web. The robot certainly did manage to make drawings that looked more spider-webby, but it failed to increase the entropy of the top 5 class probabilities, i.e. to equalize the chances of the classifier assigning the drawing to each of the top 5 classes.
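To make the fitness measure concrete, here is a minimal sketch of a top-5 entropy computed on the two sets of probabilities quoted above. The function name `top_k_entropy` is my own, and renormalizing the top 5 so they sum to 1 is an assumption; the original experiments may have computed the entropy differently.

```python
import math

def top_k_entropy(probs, k=5):
    """Shannon entropy in bits of the k largest probabilities.

    Assumption: the top-k values are renormalized to sum to 1 first,
    so the maximum possible value is log2(k).
    """
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    # p * log2(1/p) is the contribution of one class to the entropy.
    return sum((p / total) * math.log2(total / p) for p in top if p > 0)

# Top-5 ResNet50 probabilities quoted in the text.
moon = [0.1295303, 0.09941145, 0.080728576, 0.07138036, 0.05697499]
web = [0.98526186, 0.0108058965, 0.003682051, 0.0002397322, 6.1695837e-06]

print("moon drawing entropy:", top_k_entropy(moon))
print("evolved drawing entropy:", top_k_entropy(web))
print("maximum for k=5:", math.log2(5))
```

Under this definition the moon image sits close to the maximum of log2(5) bits, while the evolved drawing scores near zero because spider_web takes almost all of the probability mass.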
Wang et al., on the other hand, used Artbreeder to generate images and asked people to provide textual descriptions of each image. They kept only the nouns and calculated the entropy of the descriptions. They found that high-entropy images were the most evocative. Hertzmann argued that the appeal of GAN-generated images was due to the indeterminacy of the images produced. Interestingly, the study gave human viewers different viewing durations for the images, found differences in the entropy of descriptions between short and long exposures, and used this 2D space to distinguish different kinds of image, see below...
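As a rough illustration of what "entropy of the description" could mean, the sketch below computes the Shannon entropy of the noun frequencies pooled across viewers of one image. It assumes the nouns have already been extracted, the function name `noun_entropy` and the example word lists are invented, and the exact measure used in the study may differ.

```python
import math
from collections import Counter

def noun_entropy(nouns):
    """Shannon entropy in bits of the noun frequency distribution.

    `nouns` is a flat list of nouns pooled from all viewers'
    descriptions of a single image (hypothetical input format).
    """
    counts = Counter(n.lower() for n in nouns)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Invented examples: everyone agrees vs. everyone sees something different.
agreed = ["dog"] * 8
mixed = ["dog", "cloud", "face", "tree", "dog", "smoke", "bird", "hand"]

print(noun_entropy(agreed))  # all viewers name the same thing
print(noun_entropy(mixed))   # viewers disagree, so entropy is higher
```

An image that everyone describes with the same noun scores zero, while an indeterminate image that evokes many different nouns scores high.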
It would seem that the most interesting abstract images have high entropy at both viewing durations. It would be wonderful if we could automate this process rather than depending on human observers' textual reports.
So, to return to the question of why I was not able to get ImageNet-trained classifiers to understand my line drawings, I refer to another piece of work by Hertzmann in which he tries to understand how line drawings work. He showed line drawings to a pre-trained depth estimation network and found that it did a good job of estimating depth from them, just as well as from photographs. What does this tell us about line drawings? He showed that if you take a simple rendered image of a 3D object, lit by a lamp on the viewer's head, and draw a line in the darkest parts of that image, then you get human-like line drawings of that object. This suggests that a line drawing is an effective way of communicating a 3D object iff the viewer assumes something about how the line drawing was made from the 3D scene and can map back to the rendered 3D scene.
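The "lamp on the viewer's head" idea can be sketched on a toy object. The example below shades a unit sphere with a light pointing along the viewing direction and marks the darkest pixels as line. The Lambertian shading, the sphere, the threshold value, and the function names are all my own assumptions for illustration; this is not Hertzmann's actual pipeline.

```python
import math

def headlight_line_drawing(size=64, threshold=0.3):
    """Return a size x size grid of booleans, True where a line is drawn.

    A unit sphere is viewed along the z axis with the light at the viewer,
    so the diffuse shading is simply the z component of the surface normal.
    Pixels on the sphere darker than `threshold` become line pixels.
    """
    lines = []
    for row in range(size):
        y = -1.0 + 2.0 * row / (size - 1)
        line_row = []
        for col in range(size):
            x = -1.0 + 2.0 * col / (size - 1)
            r2 = x * x + y * y
            if r2 < 1.0:
                shading = math.sqrt(1.0 - r2)  # normal . view direction
                line_row.append(shading < threshold)
            else:
                line_row.append(False)  # background: no line
        lines.append(line_row)
    return lines

def as_ascii(lines):
    """Render the boolean grid as text, '#' for line and '.' for blank."""
    return "\n".join("".join("#" if p else "." for p in row) for row in lines)

print(as_ascii(headlight_line_drawing(size=32)))
```

With the light at the viewer, the surface is darkest where it turns away from the eye, so the marked pixels form a ring at the sphere's silhouette, which is the outline a person would draw.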
ResNet50 was not trained on line drawings, and there is lots of evidence that such networks do not see images in the same way that we do. There is evidence that ImageNet-trained convolutional neural networks rely heavily on the texture of the image when discriminating, and much less on global relations between parts of the image. We would perhaps first need to train a neural network to convert line drawings into rendered 3D scenes (as seen in photographs, perhaps without textures) and then convert these photograph-like images into class probabilities. Such a neural network exists for faces (see Convolutional Sketch Inversion) and for more general images (described in Image-to-Image Translation with Conditional Adversarial Networks, which has an interactive demo).
So a proposed return to this work would be to evolve line drawings that are put through a variety of pix2pix-type systems to produce images, and to select for drawings whose resulting images receive high class probability for many different classes. Of course, this system is still very limited in the kinds of ambiguity it can produce: it only captures ambiguity between the class labels of a single object in the image. Real art can be ambiguous at many levels, from what a smile means, to whether the work intends to criticize through sarcasm or is an earnest statement.