A few years ago I wrote an ASCII diagram parser in Python. I never got round to blogging about it at the time, but now I want to finally get some of my thoughts about it down in writing.

Ascidia

I called the project Ascidia (uh-sid-ee-uh), a combination of the words "ASCII" and "diagram". Ascidia is a command-line utility that parses ASCII-art technical diagrams in a particular format and outputs a prettier vector or raster image version. It was heavily inspired by Ditaa by Stathis Sideris, which has been around for longer. To give you an idea of the kind of thing I mean, it turns something like this:

               O     
              -|-  -.
              / \   | 
              User  | Request
                    V
 Foobar         +--------+       .------.
  Layer         |  Acme  |       '------'
- - - - - - +   | Widget |<----->|      |
   .----.   ;   +--------+       |      |
  | do-  |  ;       |            '------'
  |  dad |--^--<|---+            Database
   '----'   ;
            ;

into something like this:

Ascidia-converted output image

ASCII Diagrams

Why would anyone want to use ASCII text to make diagrams? Such diagrams are useful where only text is available and there is no facility for using images. Code comments are a good example of this. A picture speaks a thousand words, as they say, and good code comments can be essential for easily-maintainable code:


class Widget(object):
    """
    The widget takes user input and relays messages to the doodad, which encodes
    them and sends them to the remote machine to be decoded at the other end:
    
     O      +----+    +---+        +---+    +----+      O
    -|- <-->|    |<-->|   |<- - - >|   |<-->|    |<--> -|-
    / \     +----+    +---+        +---+    +----+     / \
            widget   doodad       doodad    widget
    """
    
    def __init__(self):
        ...    
        

If such ASCII diagrams are in a parsable format, they can be extracted and processed into images, perhaps as part of an automated process for generating documentation. In the Java world, for example, javadoc comments can be parsed and turned into HTML documentation using the javadoc tool. Similarly, Python has a pydoc tool for generating documentation from docstrings.

Markdown

Why create another ASCII diagram tool if tools like Ditaa and ASCIIToSVG already exist? Well, for fun mostly. But beyond that, first you have to understand that I'm a big fan of John Gruber's text formatting language Markdown (and formats like it: atx, reStructuredText, etc). In particular, I like the philosophy Markdown follows.

The idea of Markdown is that it allows you to describe a text document with rich formatting (emphasis, paragraphs, bulleted lists, etc.), as you can with HTML. Unlike HTML, however, Markdown does away with all of the messy angle brackets and other syntactic clutter that make the source difficult to read. Instead, Markdown's syntax uses such minimal and intuitive constructs that anyone reading the plain-text source wouldn't even know it was there. It essentially formalises many of the conventions that people adopt in plain-text documents anyway. The result is a format that is perfectly readable in its plain-text form, yet structured enough to be parsed and displayed with all of the beautiful formatting of HTML.

As an example, consider this fragment of HTML:

<h2>Rich Text</h2>
<p>
    Text <em>formatting</em>
    <ul>
        <li>Wow</li>
        <li>So text</li>
        <li>Very format</li>
    </ul>
</p>

It isn't designed to be read by anyone other than the author. It's a set of instructions to the computer about how the document is structured. You need an HTML renderer (such as a web browser) in order to read it. Compare this with the equivalent Markdown:

Rich Text
---------

Text *formatting*

* Wow
* So text
* Very format

It's easy to read, and this makes it easy to author too. The reason I personally like Markdown so much is that it allows me to write a document in any plain-text editor, where I can focus on the content and not be distracted by the presentation of the document while I do so. Many people use LaTeX for the same reason (it's pronounced lay-tek, apparently), although despite the intention of its design I still found that my LaTeX documents were littered with presentational formatting instructions.

What You See is What You Get

I wanted Ascidia to be for diagrams what Markdown is for text: a plain-text Ascidia diagram should be as readable as its parsed and rendered form. I wanted to be able to write a perfectly readable document, with diagrams, entirely in plain text and have the option of rendering the whole thing in a prettier format later. Ditaa and ASCIIToSVG are both great, but I feel that neither quite fits this requirement as they don't take the Markdown philosophy all the way.

For example, Ditaa allows you to draw rectangular boxes in ASCII, and without further specification these will become rectangles in the output. However, more exotic boxes are specified by adding a tag to a rectangular box, indicating that its shape should be transformed in the output:

+-----+   +-----+   +-----+
|{d}  |   |{s}  |   |{io} |
|     |   |     |   |     |
|     |   |     |   |     |
+-----+   +-----+   +-----+
Document, storage and I/O boxes using Ditaa

Similarly, lines can be specified with dashes in ASCII, and will be rendered as plain lines in the output. But they can be modified by adding a single special character to the line:

----+   ----+  
    |       :
    |       |
    v       v 
                
Solid and dashed lines using Ditaa

Colours can be specified using colour tags:

/----\ /----\
|c33F| |cC02|
|    | |    |
\----/ \----/

/----\ /----\
|c1FF| |c1AB|
|    | |    |
\----/ \----/
Coloured boxes using Ditaa

For me, Ditaa's (and similarly ASCIIToSVG's) use of metadata tags to control presentational attributes doesn't entirely mesh with Markdown's philosophy. While convenient for the diagram author, the tags have no meaning in the plain-text source. To anyone viewing the diagram in its ASCII form, they will be obscure and cryptic. They could even be confusing if the viewer tries to work out their meaning in the context of the diagram's content.

If a diagram is going to contain an I/O symbol, I feel it should be recognisable as such in the ASCII source, otherwise it no longer serves the dual purpose of being readable plain text and a parsable specification. So the approach I took with Ascidia was to use ASCII patterns that look like the symbols they represent. What you see in the ASCII version is what you get in the rendered version:

+-----+   .-----.       +-----+
|     |   '-----'      /     /
|     |   |     |     /     /
|     |   |     |    /     /
'._.-.|   '-----'   +-----+
Document, storage and I/O symbols using Ascidia

----+   - - +
    |       ;
    |       ;
    v       v
    
Solid and dashed lines using Ascidia

I made a conscious decision to make sure that Ascidia's rendered symbols are as close as possible to their ASCII versions. So far, this has meant avoiding any "extra" decoration in the rendered output such as drop-shadows, gradient shading, etc. This makes Ascidia more flexible with regard to the use of its symbols. For example, if the user draws a diagram like the following:

+------+
|      |
|      |---+
|      |   |
+------+   |
    |      |
    +------+
    

Ascidia recognises the top left part as a rectangular box, and renders it as such. The bottom right part of the diagram is recognised only as a series of connected lines. If Ascidia added a drop shadow to its rectangular boxes, the diagram above would have a drop-shadow on the top left part and not the bottom right. This might have been the user's intention - a box with a line connecting it to itself - but on the other hand, the user may have been trying to represent a pair of boxes overlapping each other. By not adding a shadow, Ascidia doesn't make too many assumptions about the user's intended meaning of the diagram.

With and without drop shadow

What About Colours?

So how do you specify colours with Ascidia? Well, the short answer is that you don't. My thinking is this: if you require colours to convey the meaning of your diagram, then the diagram is already inadequate in its ASCII form. Basically, you're doing it wrong! ;) Ascidia does let you specify the global background and foreground colours when rendering, however.

How it Works

Ascidia has a state machine for each type of symbol it recognises. When a document is being parsed, it makes a pass over the characters in the document, in order from start to finish, for each of these state machines. For each character in the document, it creates a new instance of the state machine as well as feeding the current character into the existing instances. Thus every sub-sequence of characters in the document is checked for a match against each symbol.

Matching symbols using state machines

As soon as a state machine is fed a sequence of characters that doesn't match its symbol, it is discarded. As soon as a state machine is fed a matching sequence, it records the size and position of the matched symbol and is added to a list of matches. Where more than one state machine matches, the one with the longest matching sequence is used, and all other currently-running state machines are discarded.
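
To make the idea concrete, here is a minimal sketch of the same scheme, simplified to a one-dimensional string and to symbols that are fixed pieces of text. It is not Ascidia's actual code: the names LiteralMachine, find_symbols and keep_longest are made up for illustration, and the real patterns are two-dimensional and far more involved.

```python
REJECT, PENDING, MATCH = "reject", "pending", "match"


class LiteralMachine:
    """Hypothetical state machine matching one fixed string, a character at a time."""

    def __init__(self, name, text, start):
        self.name = name
        self.text = text
        self.start = start  # document index where this instance began
        self.pos = 0        # number of characters matched so far

    def feed(self, char):
        if char != self.text[self.pos]:
            return REJECT
        self.pos += 1
        return MATCH if self.pos == len(self.text) else PENDING


def find_symbols(document, symbols):
    """Return (name, start, length) for every match, one pass per symbol type."""
    found = []
    for name, text in symbols.items():
        running = []
        for index, char in enumerate(document):
            # Start a new instance at every character...
            running.append(LiteralMachine(name, text, index))
            still_running = []
            # ...and feed the character to all live instances.
            for machine in running:
                status = machine.feed(char)
                if status == MATCH:
                    found.append((machine.name, machine.start, machine.pos))
                elif status == PENDING:
                    still_running.append(machine)
                # On REJECT the instance is simply dropped.
            running = still_running
    return found


def keep_longest(matches):
    """Resolve overlaps by letting longer matches claim their characters first."""
    claimed = set()
    kept = []
    for name, start, length in sorted(matches, key=lambda m: -m[2]):
        span = set(range(start, start + length))
        if not span & claimed:
            claimed |= span
            kept.append((name, start, length))
    return sorted(kept, key=lambda m: m[1])


matches = find_symbols("a-->b->", {"long_arrow": "-->", "short_arrow": "->"})
print(keep_longest(matches))
# [('long_arrow', 1, 3), ('short_arrow', 5, 2)]
```

The "->" inside "-->" is matched too, but it loses out to the longer match that overlaps it.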

Metadata about the symbol is also recorded against each of its constituent characters. This metadata includes things like whether the character has already been "claimed" by a symbol, whether it is the edge of a box, a line ending, and so on. Further symbol state machines may then base their match on this information. Arrow heads, for example, will only match where existing metadata denoting a line ending is found.
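
As a rough illustration of how one pass can depend on metadata left by an earlier one, the sketch below marks horizontal lines in a single row of text and then accepts arrow heads only where they touch a recorded line ending. The metadata keys ("claimed", "line_end") and both function names are assumptions made up for this example, not Ascidia's real data structures.

```python
def mark_lines(text):
    """Return per-character metadata after a hypothetical horizontal-line pass."""
    meta = [{"claimed": False, "line_end": False} for _ in text]
    index = 0
    while index < len(text):
        if text[index] == "-":
            start = index
            while index < len(text) and text[index] == "-":
                index += 1
            # Every dash in the run now belongs to a line symbol.
            for i in range(start, index):
                meta[i]["claimed"] = True
            # Both extremities of the run are line endings.
            meta[start]["line_end"] = True
            meta[index - 1]["line_end"] = True
        else:
            index += 1
    return meta


def find_arrowheads(text, meta):
    """Return indices of '>' or '<' characters that touch a recorded line ending."""
    heads = []
    for i, char in enumerate(text):
        if meta[i]["claimed"]:
            continue
        if char == ">" and i > 0 and meta[i - 1]["line_end"]:
            heads.append(i)
        elif char == "<" and i + 1 < len(text) and meta[i + 1]["line_end"]:
            heads.append(i)
    return heads


row = "a > b --> c"
print(find_arrowheads(row, mark_lines(row)))
# [8]
```

The first ">" in the row is ignored because there is no line ending next to it, so it stays ordinary text.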

Pattern Definitions

I wanted to make it as easy as possible to add new symbols to Ascidia, so I needed to have a nice way to specify the patterns of symbols it should look for. My initial thought was to use regular expressions to specify valid sequences of characters. However, I quickly found that they weren't flexible enough. Take, for example, a simple rectangle pattern:

+-----+
|     |
|     |
+-----+

We could use a regular expression such as the one below to match it:


                                   \+-{5}\+.*\n(\| {5}\|.*\n){2}\+-{5}\+
                                    ^ ^   ^ ^    ^ ^   ^ ^       ^ ^   ^
                                    | |   | |    | |   | |       | |   |
        /                  corner --' |   | |    | |   | |       | |   |
   top |          horizontal line ----'   | |    | |   | |       | |   |
  line |                   corner --------' |    | |   | |       | |   |
        \ remainder to line break ----------'    | |   | |       | |   |
middle  /           vertical line ---------------' |   | |       | |   |
  line |             inside space -----------------'   | |       | |   |
  (x2) |            vertical line ---------------------' |       | |   |
        \ remainder to line break -----------------------'       | |   |
        /                  corner -------------------------------' |   |
bottom |          horizontal line ---------------------------------'   |
  line  \                  corner -------------------------------------'
                  

There are a number of issues that this doesn't address, though. Firstly, the regex will only match a rectangle that is exactly 7 characters wide and 4 characters tall. We need to be able to match any size of rectangle, and this means accepting any width of top, middle and bottom line so long as they are all the same. Standard regular expressions don't have any way of specifying this.

Another issue is that of the position in the document. The expression above will only match a rectangle that is up against the left side of the page. That is, each of its lines starts at the first character. If we want to match a rectangle at an arbitrary position, we need to know where along the line the top corner is, and then demand that the same amount of space precedes each subsequent line.

A further issue is the other character data that might surround the rectangle; to the right of each of its lines and inside the box itself. We can use .* to allow any characters, but once the pattern is matched we have no way of determining which of them are part of the rectangle and which aren't. (OK, some regex libraries might be able to provide the starting position of each capturing group, but it gets tricky).
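
These limitations are easy to reproduce with Python's re module. The short demonstration below uses the fixed-size pattern with its middle-line group written as (\| {5}\|.*\n){2}. The sample strings and the matches helper are made up for the example.

```python
import re

# The fixed-size rectangle pattern: 7 characters wide, 4 lines tall.
RECT = re.compile(r"\+-{5}\+.*\n(\| {5}\|.*\n){2}\+-{5}\+")

small = "+-----+\n|     |\n|     |\n+-----+"
wide = "+-------+\n|       |\n|       |\n+-------+"
# The same small rectangle, shifted two columns to the right.
indented = "\n".join("  " + line for line in small.splitlines())


def matches(text):
    """True if the pattern is found anywhere in the text."""
    return RECT.search(text) is not None


print(matches(small))     # True: exactly the size the pattern hard-codes
print(matches(wide))      # False: any other width is rejected
print(matches(indented))  # False: the middle lines no longer start at column 0

# A repeated group only reports the position of its last repetition,
# so the first middle line's position is not available from the match.
print(RECT.search(small).span(1))  # (16, 24)
```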

Going beyond regular expressions, I began to think about how I would need to store information in variables during the match. Each pattern definition would need to be something almost code-like, and not just data. Lisp's S-Expressions, I hear, are great at the whole code-as-data / data-as-code thing, so I started thinking about some kind of S-Expression-based DSL. Something like:

(pattern (
    (cap-len a (n-of 0+ " "))
    (n-of 1 +)
    (cap-len b (n-of 1+ -))
    (n-of 1 +)
    (n-of 0+ " ")
    (n-of 1 "\n")
    (n-of a " ")
    ...

However, as I began to think about more complex patterns I soon realised that such a DSL would need not just variables, but also branches, loops and subroutines. This was well into "code" territory. I concluded that I might as well just use Python itself to specify pattern definitions. It's powerful, concise, and I could use generator functions as a convenient way to hold state information between each character processed.
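
To show what holding state in a generator looks like, here is a minimal sketch of a matcher for a single box edge of any width. It is an illustration only, not one of Ascidia's real pattern definitions: box_edge and run are hypothetical names, and a real pattern would have to track column positions across several lines as well.

```python
def box_edge():
    """Hypothetical generator-based matcher for a '+---+' edge of any width.

    Characters arrive through send(); the local variable 'width' survives
    between characters, which is the state a plain regex could not carry.
    Returns the number of dashes on a match, or None on a mismatch.
    """
    char = yield            # wait for the first character
    if char != "+":
        return None
    width = 0
    char = yield
    while char == "-":
        width += 1
        char = yield
    if char == "+" and width > 0:
        return width
    return None


def run(matcher_factory, text):
    """Feed text into a fresh matcher and return its result.

    Returns None if the matcher rejects the input or the text runs out first.
    """
    matcher = matcher_factory()
    next(matcher)           # advance to the first yield
    for char in text:
        try:
            matcher.send(char)
        except StopIteration as stop:
            return stop.value   # the generator's return value
    return None


print(run(box_edge, "+-----+ etc"))  # 5
print(run(box_edge, "+-x"))          # None
```

A matcher for the next line of the box could then be handed the recorded width and insist on the same number of characters, which is the "all the same width" requirement from earlier.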

The pattern definitions I've ended up with aren't exactly the elegant creatures I'd initially hoped for, and adding new patterns is a more involved process than I'd like it to be. The rectangle box definition, for example, is 134 lines of fairly baffling Python code. It could be worse, I suppose, but I definitely consider this to be Ascidia's dirty little secret.

Conclusion

I don't really think Ascidia is "done" yet, as there are lots more features I'd like to add, but it's in a functioning state and I don't have any plans to work on it in the immediate future. It was a fun project and, despite its shortcomings, I've found it to be one of the more useful tools I've created for myself.

Ascidia is open source and can be found on GitHub, here:

http://github.com/Frimkron/Ascidia