Language Games and Drawing Games
How is language different from visual art? I recently wrote a paper exploring the conflict between Goodman's "Languages of Art" and Arnheim's "Art and Visual Perception" as approaches to visual representation. On the one hand, language (apart from onomatopoeia) is entirely arbitrary (symbolic); on the other, photorealistic representational art resembles the thing it represents. The notion of resemblance has been hard to pin down, but we can now make it operational: a visual classifier trained on images of a thing will also classify a drawing of that thing as that thing. Even so, this is a notion of resemblance grounded in the pragmatics of the photograph. We do not currently train visual classifiers with active vision in the real world, where they can move around an object and view it from closer and further away, and then require that same classifier to maximize a 2D image's probability of being that thing, with the classifier free to move around the flat image in the same way to make the discrimination. That is the natural experiment implied by Cubism, and it might soon be made, and sold, by Hockney once he gets into some machine learning.
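To make this operational notion of resemblance concrete, here is a minimal sketch, assuming PyTorch and torchvision with a standard ImageNet-pretrained classifier; the file names are hypothetical placeholders. On this view, a drawing "resembles" its referent when the classifier assigns the drawing and a photograph of the referent the same label.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing for a pretrained classifier.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

def top_class(path):
    """Index of the most probable ImageNet class for the image at `path`."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).argmax(dim=1).item()

# Classifier-level resemblance: the drawing gets the same label as the
# photograph of its referent. File names are placeholders.
resembles = top_class("photo_of_cat.jpg") == top_class("drawing_of_cat.png")
print("classifier-level resemblance:", resembles)
```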

Between the extremes of arbitrariness and verisimilitude sit abstractions such as maps, graphs, and charts of various kinds, which are not entirely arbitrary and yet are not photorealistic: a graph of the stock market going up does not look like a real stock market going up. How do such visual abstractions arise? They are not the kind of thing that could ever be produced by a GAN (Generative Adversarial Network), because a GAN's endpoint is always verisimilitude.
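To see why verisimilitude is the endpoint, recall the original GAN objective (Goodfellow et al., 2014), quoted here only to make that point explicit:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z))\right)\right]$$

The generator $G$ is rewarded only insofar as the discriminator $D$ cannot tell its samples from real data, and at the optimum the generator's distribution equals the data distribution. Nothing in this objective can reward a useful abstraction like a stock chart, which is exactly the kind of sample a discriminator would reject.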
In the paper I discuss ways in which such abstractions could arise. I think the answer lies in the fact that we do not yet have neural networks that can identify very general Gestalt principles, such as inside/outside, collinearity, numerosity, and other relational visual principles. Nor do we have networks that try to make sense of an image in terms of the affordances it provides: can I put things on it, can I climb it, can things in the image be put inside other things in the image? Once such networks exist, we can begin to generate diagrams that possess the same affordances as real events in the world. Once a Gibsonian perspective on vision is adopted, we can extend the existing work on maximizing classifier probability and generate images of increasing abstraction; a sketch of that starting point follows below. So in this sense I fall on the side of Arnheim rather than Goodman: very little about visual art is entirely arbitrary in the linguistic sense; it arises instead from a pre-trained visual system evolved to understand visual affordances.
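Here is a minimal sketch of that existing classifier-probability-maximization work (activation maximization in the style of Simonyan et al.), assuming PyTorch and torchvision; the class index, step count, and learning rate are arbitrary placeholders. The Gibsonian extension proposed above would swap the single class logit for affordance- or Gestalt-level objectives.

```python
import torch
from torchvision import models

# Frozen pretrained classifier; we optimize the image, not the weights.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 285  # arbitrary ImageNet class index (placeholder)
img = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    logits = model(img)
    # Ascend the target logit: the result is an image the classifier
    # takes to be the target class, not a photorealistic picture of it.
    loss = -logits[0, target_class]
    loss.backward()
    optimizer.step()
```

In practice such optimization is regularized (jitter, blurring, total-variation penalties) to keep the result interpretable; the point here is only the shape of the objective being maximized.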
The paper is available on arXiv here.