Neural Visual Grammars + Dual Encoder Evaluation = Jungle in the Tiger
Updated: Nov 10, 2021
The image below was produced by a neural L-system evolved to produce images that satisfy the text description "Jungle in the Tiger" according to a Dual Encoder trained on the ALIGN dataset. See our paper "Generative Art Using Neural Visual Grammars and Dual Encoders".

Submitted to the CVPR 2021 Computational Measurements of Machine Creativity workshop, it shows the output of 16 consecutive evolutionary runs with no further human curation. Click on it for a higher-resolution version. Each run evolved the parameters of two sets of LSTMs (Long Short-Term Memory networks) and an input sentence for the top-level LSTMs. Once an image was produced, it was evaluated using a Dual Encoder similar to OpenAI's CLIP (trained on the ALIGN dataset). The full algorithm, along with some of the artistic process involved in its production, is described in a paper I wrote with colleagues at DeepMind, available on arXiv here.
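For readers who want the shape of the outer loop, here is a minimal sketch, not the actual code: `render` and `dual_encoder_score` are hypothetical stand-ins for the neural visual grammar and the ALIGN-trained Dual Encoder, and the simple mutate-and-select strategy is an illustrative assumption (the paper describes the real algorithm).

```python
# Minimal sketch of the outer evolutionary loop. `render` and
# `dual_encoder_score` are hypothetical stand-ins; the selection scheme is a
# simple (1+lambda)-style strategy chosen for brevity, not the paper's method.
import numpy as np

def evolve(render, dual_encoder_score, prompt, genome_size=10_000,
           pop_size=16, sigma=0.02, generations=1_000):
    rng = np.random.default_rng(0)
    best = rng.normal(size=genome_size)            # flat vector of LSTM weights
    best_fit = dual_encoder_score(render(best), prompt)
    for _ in range(generations):
        # Mutate the current champion to produce a small population.
        children = best + sigma * rng.normal(size=(pop_size, genome_size))
        fits = [dual_encoder_score(render(c), prompt) for c in children]
        i = int(np.argmax(fits))
        if fits[i] > best_fit:                     # keep the best image so far
            best, best_fit = children[i], fits[i]
    return best, best_fit
```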
I've been using the system to produce some generative art. The results on this page show the outputs produced by exactly the same algorithm used to produce the above image, with no modifications other than the text input string. I wanted to fully explore the style and capability of this one precise algorithm (let's call it Arnheim 1.0 after Rudolf Arnheim, who was a pioneer in the demystification of the artistic process). It has some remarkable abilities resulting from its deep (well, 2-level) hierarchical structure, which mirrors the hierarchical structure of visual scenes and of objects in the world in general. Here I will demonstrate some of its abilities by looking at some of the images it produced.
"Jungle in the Tiger" was the text prompt given to the Duel Encoder. If instead you give the prompt "tiger in the jungle" then the following kind of image is produced instead.

See how parsimoniously the system has invented a systematic change in the marks along the back, producing the orange stripes, which get thicker nearer the neck of the tiger, and see how it has rotated them. Some idea of the evolutionary process can be obtained by watching a video of the production of the above image.

The evolutionary process quickly discovered some marks which the Dual Encoder scored highly as 'tiger in the jungle' and then refined them. The neural visual grammar permitted correlated variations to be produced in the image; this would not have been possible had the individual marks been encoded independently. The visual neural grammar is an indirect encoding of the image, and like all indirect encodings, its success or failure depends on how well it captures the fitness landscape of the problem, i.e. how many of the underlying priors about the core properties of good solutions it embodies. We will see that the priors of Arnheim 1.0 capture several properties of the world which make it a good image generator. Notice that evolution takes place at 224 x 224 image size, but the neural grammar is able to generate the final image at any desired resolution.
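To make the evaluation concrete: a CLIP-style dual encoder scores an image against a text prompt by embedding both and taking the cosine similarity, and that similarity is the evolutionary fitness. The encoder functions below are placeholders (the actual model is a Dual Encoder trained on ALIGN), and the embedding width is an assumption.

```python
# Sketch of CLIP-style dual-encoder scoring. `image_encoder` and
# `text_encoder` are placeholders for the Dual Encoder's two towers;
# the cosine-similarity fitness is the point here.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fitness(image_224, prompt, image_encoder, text_encoder):
    # Evolution always evaluates at 224 x 224; the grammar can re-render
    # the winning genome at any resolution afterwards.
    z_img = image_encoder(image_224)   # e.g. an embedding of length 512
    z_txt = text_encoder(prompt)
    return cosine(z_img, z_txt)
```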
Let us recap the experiments we did looking at encodings for scenes, all using the Dual Encoder as the critic/evaluator. Early experiments directly encoded simple geometric SVG primitives. Here is an example of directly evolved "face" and "cat" pictures.
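As a hypothetical illustration of what "direct" means here, the genome below is just a flat list of independent primitives, so a mutation perturbs one mark without touching any other; this is exactly the property that makes correlated, stripe-like variation hard to discover.

```python
# Hypothetical direct encoding: each mark is an independent gene.
import random

def random_circle():
    return {"x": random.random(), "y": random.random(),
            "r": random.uniform(0.01, 0.2),
            "rgb": tuple(random.random() for _ in range(3))}

genome = [random_circle() for _ in range(50)]   # 50 independent primitives

def mutate(genome):
    child = [dict(g) for g in genome]
    gene = random.choice(child)
    gene["x"] += random.gauss(0, 0.05)          # moves one circle, nothing else
    return child
```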

We soon moved to encoding closed Bezier shapes with some number of control points, along with the positions and colours at which flood fills were to be injected into the image during the production of the closed Bezier forms. This was very good at rapidly increasing Dual Encoder-determined fitness; see an example of evolving "scream" below.
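The sketch below shows the kind of primitive involved: a loop of cubic Bezier segments with a fill colour. It renders with matplotlib for brevity, and the genome layout is my illustrative assumption, not the original code (which also evolved flood-fill injection points).

```python
# Illustrative closed-Bezier primitive: consecutive groups of four control
# points (sharing endpoints, wrapping around) form one closed, filled shape.
import numpy as np
import matplotlib.pyplot as plt

def cubic_bezier(p0, p1, p2, p3, n=30):
    t = np.linspace(0, 1, n)[:, None]
    return (1-t)**3*p0 + 3*(1-t)**2*t*p1 + 3*(1-t)*t**2*p2 + t**3*p3

def draw_closed_shape(control_points, colour, ax):
    k = len(control_points)                     # k should be a multiple of 3
    segments = []
    for i in range(0, k, 3):
        seg = [np.asarray(control_points[(i + j) % k], float) for j in range(4)]
        segments.append(cubic_bezier(*seg))
    outline = np.vstack(segments)
    ax.fill(outline[:, 0], outline[:, 1], color=colour)

fig, ax = plt.subplots()
rng = np.random.default_rng(1)
draw_closed_shape(rng.random((9, 2)), (0.9, 0.4, 0.2), ax)
plt.show()
```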

There is a lot of perceptual information in the curves so evolved. Note that there seems to be some pathological overfitting, resulting in the image becoming very abstract after it has become veridical, perhaps mirroring the evolution of abstraction in other domains. Here is the system trying to produce the "mona lisa".

And here is one that my daughter Agi (who helps me in my search for AGI [Artificial General Intelligence]) asked me to produce: "Cute fluffy dog with a parrot sitting next to it".

See how the goofy dog has been drawn in various ways, with big eyes, and how the parrot has also been drawn in various ways, often resembling a duck or a chicken. In one case I think the dog is drawn from above, but I could be imagining it. The system scored well on "the last supper", but I have no idea what it has drawn; maybe a fried egg?

"A person walking" is rather evocative with exaggerations of the feet in a rather Lowryasque way.

I liked this encoding, but I felt it looked rather too cartoony, and not as general as brush strokes, so I decided to encode italic brush strokes using an LSTM neural network. I was immediately pleased with the initial random images produced by this system which had a strongly painterly quality. See how the colour of the stroke changes smoothly as new strokes are made.

I was very excited by one of my first pictures of 'tiger in the jungle', which, whilst it did not score highly, had an order I felt was remarkable. This uses 4 LSTMs, one for each kind of stroke. You may be able to see that there are distinct styles of stroke in this one "painting". The strong diagonal composition is entirely random, I think.

This is an image which really captivated me. It felt an order of magnitude more meaningful to me than any of the previous images. I felt it could stand alone as an image in itself, without any idea of how it had been made, though perhaps at a higher resolution. It seemed much more serious than the Bezier curve pictures above: interesting, parsimonious brush strokes had been used to evoke the tiger in the jungle, even though it looked more like goldfish in reeds. I refined the stroke-making code a bit and was very happy with some of the completely random initial images that were being produced. I wanted the initial random images to be diverse, so that selection would have many interesting initial conditions from which to start making a depiction. A particular favorite is below.

I love the contrasts (the colours are very saturated, making me think I should adjust the initial parameter values in my LSTM), and I like the mix of thick and thin marks, and the sequences of marks produced with rotations as the series progresses. I liked the subtle colour changes as a series of lines was produced that constituted a unified stroke. Random images produced by different numbers of LSTMs (2, 5, 10, 30) are shown on each row. The more pleasing images are definitely the simpler ones with fewer LSTMs. The marks are rather random in structure and have no very clear systematic relation to each other.

At this point the system is like a visual language generator. I evolve a set of input vectors which I call the input sequence; each word in the input sequence is a vector of length 10, say. One of these words encodes the position where a mark should be, which LSTM the word will be input to, and for how many iterations (s) that LSTM will run. The word is then fed into that LSTM for that number of iterations, and the LSTM outputs a set of marks which I call a stroke. The stroke consists of s lines. Each line has a colour specified by the LSTM output, as well as an opacity, and a displacement angle and thickness relative to the position specified in the top-level word. Note that this is still a single-layer system; I had not yet thought of the advance to the two-layer (deep) system, but was already very excited by the kinds of image I was able to evolve with this painterly system. A minimal sketch is below.
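The LSTM below is a bare numpy cell, and the field layout of the 10-dimensional word (position, LSTM index, iteration count) is my assumption about how to pack it, not the exact layout used.

```python
# Single-layer sketch: an evolved 'word' picks an LSTM, a start position and
# an iteration count s; unrolling the LSTM for s steps yields the s lines of
# one stroke (colour, opacity, angle, thickness). Shapes are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    def __init__(self, d_in, d_hid, rng):
        self.W = rng.normal(0, 0.1, (4 * d_hid, d_in + d_hid))
        self.b = np.zeros(4 * d_hid)
        self.d = d_hid
    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def word_to_stroke(word, lstms):
    x, y = word[0], word[1]                       # mark position
    lstm = lstms[int(abs(word[2]) * len(lstms)) % len(lstms)]
    s = 2 + int(abs(word[3]) * 8)                 # number of lines in the stroke
    h = c = np.zeros(lstm.d)
    lines = []
    for _ in range(s):
        h, c = lstm.step(word, h, c)
        r, g, b, opacity, angle, thickness = h[:6]
        lines.append((x, y, angle, thickness, (r, g, b, opacity)))
    return lines

rng = np.random.default_rng(0)
lstms = [TinyLSTM(10, 16, rng) for _ in range(4)]
stroke = word_to_stroke(rng.normal(size=10), lstms)
```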

Here are some examples. I like the parsimonious use of marks in these representations, e.g. the simple rotations that have been used to suggest the bright pink glasses, and the way that circles have been produced. Remember my previous obsession with the drawing of circles earlier in lockdown one? At this point I tried out a lot of input texts, but found myself slowly getting bored with the marks and what the system was trying to do. A whole load of videos of the process are available here. This is when I first explored 'tiger in the jungle', producing this image right at the start, while allowing the evolution of the background colour. I liked the dynamic feeling of the purple meteors.

At this point I developed the deep (2-level) LSTM grammar. Here I evolve an input sentence, put that through a top-level LSTM which produces another sentence, and that sentence goes through the low-level LSTM to produce the strokes. The details are in the paper, but the system immediately showed some amazing abilities in producing systematic strokes, compared with the rather messy pictures above. See the way it draws apples.
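Here is a sketch of the two-level idea, reusing `TinyLSTM` and `word_to_stroke` from the single-layer sketch above; the wiring is my illustrative guess, and the real details are in the paper.

```python
# Two-level (deep) grammar sketch: the evolved sentence drives a top-level
# LSTM, and each of its hidden states is reinterpreted as a low-level 'word'
# that the stroke LSTMs expand. Because one top-level weight change shifts
# every generated word, whole objects vary together in a correlated way.
import numpy as np

def deep_grammar(sentence, top_lstm, stroke_lstms):
    h = c = np.zeros(top_lstm.d)
    strokes = []
    for word in sentence:                    # evolved top-level input sentence
        h, c = top_lstm.step(word, h, c)
        low_level_word = h[:10]              # hidden state becomes a new word
        strokes.extend(word_to_stroke(low_level_word, stroke_lstms))
    return strokes

rng = np.random.default_rng(0)
top = TinyLSTM(10, 16, rng)
lows = [TinyLSTM(10, 16, rng) for _ in range(2)]
strokes = deep_grammar(rng.normal(size=(5, 10)), top, lows)
```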

What is more, it was able to repeat a fairly complex set of strokes constituting an object at various locations in the image, with variation. This made it ideal for evolving flocks of birds, for example.

And it could do this in various ways in different runs.

It's this algorithm which evolved the lovely tiger in the jungle. Low-res 224 x 224 image on the left (which is the one we actually select for) and high-res image on the right.
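Why does selecting at 224 x 224 still give a clean high-res picture? Because the genome is parametric: strokes live in normalized coordinates, so the rasterization size is a free choice at render time. The toy rasterizer below makes the point using the entirely assumed stroke fields of the earlier sketches (it ignores the angle and draws square daubs for brevity).

```python
# Toy resolution-independent rasterizer for the stroke tuples produced by the
# sketches above; it ignores angle and draws square daubs for brevity.
import numpy as np

def rasterize(strokes, size):
    canvas = np.ones((size, size, 3))                  # white canvas
    for (x, y, angle, thickness, (r, g, b, a)) in strokes:
        px = int(np.clip(x, 0, 0.999) * size)
        py = int(np.clip(y, 0, 0.999) * size)
        w = max(1, int(abs(thickness) * 0.02 * size))  # width scales with size
        alpha = float(np.clip(a, 0, 1))
        colour = np.clip([r, g, b], 0, 1)
        patch = canvas[py:py + w, px:px + w]
        canvas[py:py + w, px:px + w] = (1 - alpha) * patch + alpha * colour
    return canvas

low_res = rasterize(strokes, 224)     # what the Dual Encoder actually scores
high_res = rasterize(strokes, 2048)   # same genome, print-quality render
```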

I gave it the ability to make Bezier curves as well as line primitives.

I became slightly obsessed with the topic of plane crashes

clouds

and pineapple crashes.


I was disturbed by the fact that the cashew nuts in my Sri Lankan cashew nut curry did not rotate or scale, something I will get back to in the next round of algorithm development.

My unhealthy obsession with crashing fruit extended to avocados and bananas.

And then I really got into Sri Lankan masks.

and more Sri Lankan masks...

And I grew increasingly fond of abstract forms.


I'm currently focusing on coral. I've been exploring what these look like printed out.

As the screen is very forgiving, and brightness and contrast can easily be lost on paper, I felt this was important to do. I'm very excited because I think I've only just scratched the surface of what a deep visual grammar evolver can do in terms of structured design. I'm blown away by some of its colour choices, and I've started to explore the evolution of colour maps themselves, with some interesting results to follow in Arnheim 2.0, which comes next and which I hope will feature the rotatable cashew nut.