The Doodle: A Reinforcement Learning Approach to Drawing

The doodle is a fascinating thing. It is produced unconsciously. Maybe it is the visual equivalent of stream-of-consciousness thinking, or mind-wandering as it is now called. If you think too much about the doodle, then various conscious biases arise, such as trying to make it look good, and it's not a doodle any more.

Everyone's doodles are different, and it would be worth making an archive of them. Can a neural network be produced which can generate doodles of this general form? More generally, can a program be written which produces doodles like this on random initialization, i.e. with sufficient priors built in that it need not learn them from scratch?

How can a learning machine be put on top of these modules to control the doodling at a higher level? What is the underlying algorithm that could result in open-ended doodling?

Paul Klee, in his largely incomprehensible Pedagogical Sketchbook, talks about doodle-like processes. Pretty much the only thing anyone can remember from this book is the first sentence. He's interested in the forms that arise from graphical processes.

Doodles seem to exhibit a kind of flexible visual grammar. In fact, image grammars are now a neglected topic in machine learning, although Josh Tenenbaum's team continues to push hard at this within the rubric of program synthesis. When we make a doodle we invent the grammar of that doodle, and we follow its consequences, and we perhaps change the grammar. For example, we might compose independent marks into groups and operate on that higher-level object, i.e. identify a vocabulary of forms by finding primitives that occur together more frequently than by chance.

A parse tree relates the primitives in some way to construct the whole image. Sampling from the tree can generate many different possible images. The geometric relations of alignment, parallelism and symmetry, especially as created by occlusions, are the driving forces behind the grouping of lower-level parts into larger parts; primitive parts are linked by these relations. A visual grammar is much more interesting than a Lindenmayer system, or even than Wolfram's recent new theory of physics, which is just a set of rewrite rules on graphs that produces quite complex graphs. Whilst we might interpret these graphs as pen strokes and produce drawings from them, this would be a completely open-loop generative grammar, not at all reactive to what has actually been drawn, which makes it unsatisfactory. A true visual grammar is grounded because it operates directly on the page in a closed-loop manner. It is grounded by the use of neural networks to identify the components on the page itself, and then to modify those components on the page directly, rather than generating the full structure in some abstract space and then putting it on the page in a ballistic way. It is engaged in a real interaction with the page, just as a real tree does not grow according to an L-system, but is present in a real 3D space, in a world of physics which operates simultaneously on the whole system. It is not just a formal system but a physical system. A shape-matching-based, cellular-automaton-like rewrite system can be implemented using generalized Hough transforms for the shape matching.

Evolution can be used to evolve programs for doing drawings if one knows what one wants in the first place. This is not quite what a doodle is about, but it can be done. For example here I evolve a program using the Auto-ML framework to fill in a rectangle. The evolved program appears at the bottom of the video.

One of the problems is a lack of diversity: it is hard to get from one algorithm to another using mutations. At this point I attempted to develop an architecture to implement the overall process of art making. Let's call this algorithm Binduva-Draw. Roughly speaking, starting with a blank page:

While (not finished):

- Run an MCTS search to depth D, making M marks using generator G.

- Evaluate each proposed drawing state using an aesthetic value function C.

- Choose the actual mark to make.

- Update the page state by making that mark.

This allows us to ask questions like "How many marks ahead did Picasso think?" So this is a drawing agent that uses MCTS planning (as AlphaGo does) to decide which mark to make next. It also uses a value function (C) to evaluate which marks it thinks are currently best. But for any particular artistic game it needs to be given G (the rules for mark making) and C (a value function or equation for evaluating the quality of the drawing). So this is an RL approach to drawing. It differs from the evolutionary approach in that the drawing is evaluated throughout its execution, not just at the end. I think this framework is better suited to modeling doodling, where different drives may operate, and different goals may be pursued, over the course of making a single doodle. The very simplest application of this algorithm is planning to draw lines on a page that do not overlap: we might try a variety of random lines and choose those which don't overlap. [Note: this process describes an agent which is not quite an artist. The artist is an agent who invents G and C functions on the basis of neuroaesthetic criteria, which are based on how their brain (and how they think other brains) is affected by the artwork. We will need to leave that part aside for the moment, but we've got to make a start somewhere.]
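To make the non-overlapping-lines example concrete, here is a minimal sketch of the loop. It is not the code behind the experiments: it replaces the MCTS search with a greedy one-step lookahead (depth 1), and the names `generate_line`, `segments_cross`, `evaluate` and `draw` are all hypothetical stand-ins for G, the overlap test, C and the outer loop.

```python
import random

def generate_line(rng):
    """G (hypothetical): propose a random line segment in the unit square."""
    return ((rng.random(), rng.random()), (rng.random(), rng.random()))

def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): +1, 0 or -1."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_cross(a, b):
    """True if segments a and b cross (collinear edge cases are ignored)."""
    (p1, p2), (p3, p4) = a, b
    return (_orient(p1, p2, p3) != _orient(p1, p2, p4)
            and _orient(p3, p4, p1) != _orient(p3, p4, p2))

def evaluate(page):
    """C (hypothetical): minus the number of crossing pairs; 0 is the best score."""
    return -sum(segments_cross(page[i], page[j])
                for i in range(len(page)) for j in range(i + 1, len(page)))

def draw(n_marks=20, n_candidates=30, seed=0):
    """Greedy depth-1 stand-in for the search: try candidates, keep the best."""
    rng = random.Random(seed)
    page = []
    for _ in range(n_marks):                      # "while not finished"
        candidates = [generate_line(rng) for _ in range(n_candidates)]
        best = max(candidates, key=lambda m: evaluate(page + [m]))
        page.append(best)                         # update the page state
    return page
```

Calling `draw()` returns the list of chosen segments. Swapping in a different `generate_line` or `evaluate` changes the "artistic game" without touching the loop, which is the point of separating G from C.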

The video shows the process of drawing using the above algorithm...

It is possible to apply selection to the mark-making modules on the basis of the evaluation during the course of drawing a single image. Currently all the mark-making methods are blind: they do not see the image they are drawing on. It's as if the marks are made randomly by the artist, while someone else tells them how effective they have been. The artist can, however, modify the random variables in the mark-making procedure based on evaluative feedback after each mark. Now we add a simple reactive drawing procedure, which looks at the current state of the page and proposes where to draw the new object, e.g. away from existing objects. FindSpace applies a Gaussian blur and returns a position in the darkest region of the blurred image. Preventing overlap isn’t completely possible with the small number of rollouts (single step) done for each mark; increasing this should reduce overlap. Other kinds of evaluation are possible, e.g. here we reward lines drawn in a central invisible circle, and punish those drawn outside it.

The more rollouts (popSize) there are, the more likely it is that the chosen mark will be in the right place. The mark-making modules can be much more complex, which is what we experiment with next. We add a local-view-based drawing process as before. We add a neural network control function (with evolvable weights). We add a neural network method for choosing where to initialize the line. We add the LSTM-GMM model. The next few pictures show what happens when LSTM-GMMs are evolved online during the drawing to produce the marks that fill in a circle.
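A rough sketch of what a FindSpace-style procedure and the circle reward might look like is given here. This is an illustration under assumptions, not the original implementation: it assumes ink is stored as 1.0 on a 0.0 background (so "darkest" means emptiest), and the function names, the `sigma` default and the edge padding are all choices made for this example.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel, truncated at 3 sigma and normalised to sum to 1."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def find_space(page, sigma=4.0):
    """Return (row, col) of the minimum of the Gaussian-blurred page.

    Assumes ink = 1.0 and background = 0.0, so the minimum is the emptiest spot.
    """
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    # Replicate border values so the page edges are not favoured artificially.
    padded = np.pad(page.astype(float), pad, mode="edge")
    # Separable blur: convolve every row, then every column.
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, blurred)
    return np.unravel_index(np.argmin(blurred), blurred.shape)

def circle_reward(points, centre, radius):
    """+1 for every point inside the (invisible) circle, -1 for every point outside."""
    pts = np.asarray(points, dtype=float)
    inside = np.hypot(pts[:, 0] - centre[0], pts[:, 1] - centre[1]) <= radius
    return int(inside.sum()) - int((~inside).sum())
```

For instance, on a page whose left half is fully inked, `find_space` returns a position in the right half. `np.argmin` picks the first minimum it meets, so ties are broken deterministically rather than at random; a real drawing agent might prefer to sample among near-minimal positions.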

So far we've been using toy aesthetic functions (drawing circles and reducing overlap). What about using a more interesting one? Let us say we wish to produce a drawing that has a scale-invariant fractal dimension, which, it has been claimed, distinguishes Jackson Pollock drip paintings from fakes. Let us try to minimize the residual of the fit to the fractal dimension log-log plot. The image was erased every 100 marks. The mark type used is shown. The images produced do not resemble Pollock pictures; instead they show strong biases towards non-homogeneous distributions of lines. It is not clear that the residual has been minimized either.

Perhaps it is better to attempt to achieve a specific fractal dimension? It's not at all clear that hill-climbing over a fractal D estimate can work, because with very few marks you don’t get a reliable D estimate initially. So what was the actual algorithm used by Pollock to produce the image? And is there any existing generative process that can reproduce Pollocks in all aspects, and does it have an evaluation function which is based on fractal dimension, or is this something that arises as a side effect of another evaluation function? Further work is required on this problem.
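For readers unfamiliar with the measure, the following is a minimal box-counting sketch of the two quantities discussed above: the fractal dimension D (the slope of the log-log plot) and the residual of the straight-line fit. It is an illustration only; the function names, the choice of box sizes and the use of a least-squares fit via `np.polyfit` are assumptions, and the page is assumed to be a 2D array that is non-zero wherever there is ink.

```python
import numpy as np

def box_counts(img, sizes):
    """For each box side length s, count the s-by-s boxes that contain any ink."""
    counts = []
    for s in sizes:
        # Crop so the image tiles exactly into s-by-s blocks.
        h, w = (img.shape[0] // s) * s, (img.shape[1] // s) * s
        blocks = img[:h, :w].reshape(h // s, s, w // s, s)
        counts.append(int(blocks.any(axis=(1, 3)).sum()))
    return np.array(counts)

def fractal_fit(img, sizes=(1, 2, 4, 8, 16)):
    """Return (D, residual) for the line fitted to log N(s) against log(1/s).

    D is the slope; residual is the sum of squared deviations from the line.
    The image must contain some ink, otherwise log(0) is taken.
    """
    n = box_counts(img, sizes)
    x = np.log(1.0 / np.array(sizes, dtype=float))
    y = np.log(n)
    slope, intercept = np.polyfit(x, y, 1)
    residual = float(np.sum((y - (slope * x + intercept)) ** 2))
    return float(slope), residual
```

A reward in the spirit of the first experiment would be `-residual`, while targeting a specific dimension would look like `-(D - target) ** 2`. With only a handful of marks on the page most boxes are empty at every scale, which is one way to see why the early D estimates are unreliable.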

Consider another set of computational aesthetic evaluation (reward) functions:

  1. Fourier power spectrum === Box counting + standard deviation [also]

  2. PHOG self-similarity [here] [Shared weight convNet at multiple spatial scales?]

  3. HOG Complexity, HOG Anisotropy

  4. Pixel and pre-trained convNet activity fraction

  5. Classification properties of net trained on ImageNet convolved over drawing.

  6. GIF compression (Lossy compression factor, intermediate value)

  7. Symmetry

  8. Compositional “Balance” and non-overlap [rule of thirds]

  9. The Principle of Juxtaposition.

A variety of computational aesthetic functions have been proposed in the literature.

For example, we used the algorithm above to select for images that maximize this function:

rew = -np.power(1.7 - ev.get_power_spectrum(page[:,:,0], FLAGS.size)[0], 2)
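The helper `ev.get_power_spectrum` is not shown in this post, so its behaviour can only be guessed at. As a hedged illustration of the general idea, the sketch below estimates the exponent alpha of a power spectrum that falls off as 1/f^alpha, using a radially averaged 2D FFT, and penalises the squared distance from a target value. The names `spectral_slope` and `reward`, the radial binning and the sign convention are all assumptions, and the result will not necessarily match the original helper.

```python
import numpy as np

def spectral_slope(img):
    """Estimate alpha, assuming radially averaged power falls off as 1 / f**alpha."""
    img = img.astype(float) - img.mean()              # remove the DC component
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    # Integer distance of every frequency bin from the centre of the spectrum.
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    # Mean power within each integer-radius ring.
    radial = np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())
    f = np.arange(1, min(h, w) // 2)                  # skip DC, stay below Nyquist
    slope, _ = np.polyfit(np.log(f), np.log(radial[f] + 1e-12), 1)
    return -float(slope)

def reward(img, target=1.7):
    """Zero when the estimated exponent equals the target, negative otherwise."""
    return -(target - spectral_slope(img)) ** 2
```

Here `img` plays the role of `page[:,:,0]` in the line above, i.e. a single channel of the page.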


I am not very satisfied with this.

Extending the work of Richard Latto on Mondrian, who argues that vertical and horizontal lines are preferred to oblique lines because of their fit to the human visual system, what about using pre-trained convNets? We could take the first conv layers from ResNet, ignoring the classification layer, and observe the properties of activation of the convolutional layers at multiple levels in the network. Maybe a stimulus that simply activates intermediate layers of the convNet more is good?

A variety of combinations of the above aesthetic measures were used, along with a variety of generative mark making procedures, generating the following images.

I was not desperately happy with any of these outputs. My conclusion is that in all the above approaches, algorithms that can detect visual concepts are currently lacking and this contributes to the random-looking-ness of the drawings. E.g. there is no appreciation by the drawing agent of the following concepts...

  • Detecting the inside and outside of a shape, if one exists

  • Detecting the boundary of a shape

  • Detecting the center of a shape

  • Detecting a path or a region made by a cluster of forms

It does not understand the image: it is not sensitive enough to the implications, meanings and affordances of the image to prevent the images appearing random. In summary, an algorithm implementing the principle below was developed...

It used a variety of generators (G) and a variety of evaluators (C), with myself playing the role of meta-critic. I felt in retrospect that I had failed to devise interesting, non-trivial evaluators and generators, and the result was a bit of a mess. I think this is because the evaluative sensibility of the system was extremely low. The kind of work which has worked is, for example, Tom White's experiments using ImageNet-trained networks to evaluate a generated image. Such networks have much more evaluative sensibility than the computational aesthetic equations used above. This work put me completely off such hand-designed computational aesthetic equations. I think that for a critic to be effective it must be a neural function learned from data, as in GANs. The problem is that we don't have neural networks with a good understanding of what Clive Bell perhaps calls "significant form", of the non-representational interesting relations between things. This appears to be the next bottleneck we face. Can we invent a neural network that understands the meaning of images in a much richer way than the object classification networks we have now? What networks do we have now? What off-the-shelf image understanding can we exploit for the C part of this work?
