So far we have only done block-level parsing for our markdown source. Let's add functionality to parse inline (span) elements. How hard can it be? Let's just say this gave me more trouble than I expected, mostly because I went in with some wrong assumptions.
To be sure we agree on what we want to accomplish: we want to turn input with elements like _emphasis_, *strong* text, ~strikethrough~ and `code` into emphasis, strong, strikethrough and code, where the HTML output would look like <em>emphasis</em>, <strong>strong</strong> and <code>code</code>. Finally, a really useful addition: hyperlinks. In Markdown they look like [home](https://ron.srht.site) and should render as home, the corresponding HTML being <a href="https://ron.srht.site">home</a>. Note that we are deviating from Markdown a bit: in standard Markdown, single _underscores_ and *asterisks* would both translate to emphasized text, while double __underscores__ and **asterisks** would turn into strong text. Here, underscores produce emphasis and asterisks produce strong text.
Having read the excellent book Crafting Interpreters by Robert Nystrom, I figured I would try to base the additions to my parser on some of the concepts discussed there. If you follow along with the book, you will implement a recursive descent parser, which is a relatively simple but quite performant type of parser. The catch is that there are some requirements on the grammar you are trying to parse.
At some point I realised Markdown does not meet these requirements; for instance, Markdown cannot be expressed as a context-free grammar. When parsing something with a recursive descent parser, you are basically building a tree of expressions, and this cannot be done in an unambiguous way for Markdown. So... back to the drawing board!
Instead of building a tree of elements that we can write as html in a neat way, we will instead code some rules to do inline token substitution. Let's see how that looks!
As we are already doing block-level parsing, we just have to add a function to parse strings that may contain span elements. Let's use paragraphs as an example, but the same applies to headers, for instance:
fn parse_entry(raw_post: String) -> Page {
    let mut elements: Vec<Box<dyn Element>> = vec![];
    // ..
    elements.push(Box::new(Paragraph {
        // parse the inner text going into the paragraph
        inner: parse_string(inner),
    })); // ^^^^^^^^^^^^ new
    // ..
}

fn parse_string(block: String) -> Vec<Box<dyn Element>> {
    todo!()
}
To make applying our token substitution rules easier, we first turn our string into tokens, a process called tokenization. This means splitting up our text into parts that are just text or that are modifiers. Let's make an enum for that distinction and add a tokenize function:
enum Token {
    Text(String),
    Modifier(char),
}

fn tokenize(block: impl Into<String>) -> Vec<Token> {
    todo!()
}
To visualize what this function should do: it should turn the string "this is a *String* with _modifiers_" into a Vec like:
[
    Text(
        "this is a ",
    ),
    Modifier(
        '*',
    ),
    Text(
        "String",
    ),
    Modifier(
        '*',
    ),
    Text(
        " with ",
    ),
    Modifier(
        '_',
    ),
    Text(
        "modifiers",
    ),
    Modifier(
        '_',
    ),
]
The function itself:
fn tokenize(block: impl Into<String>) -> Vec<Token> {
    let block = block.into();
    let mut tokens = vec![];
    let mut iter = block.chars();
    // we scan every char in the string
    while let Some(c) = iter.next() {
        match c {
            // if the char is a modifier, add a modifier token
            '`' => tokens.push(Token::Modifier('`')),
            '*' => tokens.push(Token::Modifier('*')),
            '_' => tokens.push(Token::Modifier('_')),
            '~' => tokens.push(Token::Modifier('~')),
            '[' => tokens.push(Token::Modifier('[')),
            ']' => tokens.push(Token::Modifier(']')),
            '(' => tokens.push(Token::Modifier('(')),
            ')' => tokens.push(Token::Modifier(')')),
            // the escape char is a special case: we handle the next char
            // instead of the current one, and treat it as text
            '\\' => {
                if let Some(escaped) = iter.next() {
                    // if the last token in tokens is a text token,
                    // we append the character there
                    if let Some(Token::Text(t)) = tokens.last_mut() {
                        t.push(escaped);
                    } else {
                        // else we start a new text token
                        tokens.push(Token::Text(escaped.to_string()));
                    }
                }
            }
            c => {
                // for every other character, do the same as for the escape char:
                // add it to the last token if it is a text token, else start a new one
                if let Some(Token::Text(t)) = tokens.last_mut() {
                    t.push(c);
                } else {
                    tokens.push(Token::Text(c.to_string()));
                }
            }
        }
    }
    tokens
}
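To see the escape handling in action, here is the tokenizer again in a compact, self-contained form (the logic is the same as above; it is repeated only so this snippet compiles on its own), together with a check that an escaped asterisk stays plain text:

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Text(String),
    Modifier(char),
}

// same logic as the tokenize above, repeated so this snippet runs standalone
fn tokenize(block: impl Into<String>) -> Vec<Token> {
    let block = block.into();
    let mut tokens = vec![];
    let mut iter = block.chars();
    while let Some(c) = iter.next() {
        match c {
            '`' | '*' | '_' | '~' | '[' | ']' | '(' | ')' => tokens.push(Token::Modifier(c)),
            // a backslash escapes the next char: it is stored as plain text
            '\\' => {
                if let Some(escaped) = iter.next() {
                    match tokens.last_mut() {
                        Some(Token::Text(t)) => t.push(escaped),
                        _ => tokens.push(Token::Text(escaped.to_string())),
                    }
                }
            }
            c => match tokens.last_mut() {
                Some(Token::Text(t)) => t.push(c),
                _ => tokens.push(Token::Text(c.to_string())),
            },
        }
    }
    tokens
}
```

An input like `not \*strong\*` comes out as a single Text token, so the later rules never see the escaped asterisks as modifiers.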
Now we make it slightly more complex, as we want to substitute our tokens for elements according to certain rules. Since we apply the rules in succession, we really want to sort of... swap the tokens for elements in place... so we create another enum, called Position, and put our tokens into a Vec<Position>. Then we can feed our positions into our "rules functions" that will do the actual swaps.
enum Position {
    Token(Token),
    // We remember Box<dyn Element> from past articles.. right?
    Element(Box<dyn Element>),
}

fn parse_string(block: String) -> Vec<Box<dyn Element>> {
    let mut positions = tokenize(block)
        .into_iter()
        .map(Position::Token)
        .collect::<Vec<_>>();

    code(&mut positions);
    strong(&mut positions);
    emphasis(&mut positions);
    strike(&mut positions);
    links(&mut positions);

    todo!()
}
The only real precedence requirement here is that code(..) has to be called first, as we turn everything inside backticks into plain text. If we called strong(..) first, we would end up with bold text inside our inline code span, which should *not* happen. :^)
Let's not look at every rule; let's stick to code(..) and strong(..), since most of the rules work in a similar way. They call indexes_for_modifier(..) to get a Vec<usize> with all occurrences of the modifier for that rule. Then chunks_exact(..) is used to get pairs, which means that when there is only one modifier left over, it is ignored. Each pair of indexes is then used to apply the rule. For code, the ` modifier is turned into <code> or </code> and all tokens in between are turned into text elements. For strong, we only turn the modifier tokens into <strong> elements and leave the stuff in between untouched.
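The pairing behaviour comes straight from the standard library: chunks_exact(2) yields only complete pairs, and a trailing unpaired index lands in the remainder, which the rules simply never look at. A tiny sketch (the helper name modifier_pairs is made up for illustration):

```rust
// chunks_exact(2) yields complete pairs only; a trailing unpaired
// index ends up in the remainder and is ignored here
fn modifier_pairs(idxs: &[usize]) -> Vec<(usize, usize)> {
    idxs.chunks_exact(2).map(|pair| (pair[0], pair[1])).collect()
}
```

So for modifier indexes [2, 5, 9], only the pair (2, 5) is produced and the lone modifier at index 9 is left alone.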
fn code(positions: &mut Vec<Position>) {
    let idxs = indexes_for_modifier(positions, '`');
    for set in idxs.chunks_exact(2) {
        // swap the modifier tokens for InlineCode elements
        positions[set[0]] = Position::Element(Box::new(InlineCode { open: true }));
        positions[set[1]] = Position::Element(Box::new(InlineCode { open: false }));
        // everything inside the code span is turned to text so we don't
        // process it as modifiers later on
        for index in set[0] + 1..set[1] {
            positions[index] = Position::Element(Box::new(Text {
                inner: positions[index].to_string(),
            }))
        }
    }
}
fn strong(positions: &mut Vec<Position>) {
    let idxs = indexes_for_modifier(positions, '*');
    for set in idxs.chunks_exact(2) {
        // here we only swap the Modifier for an element
        positions[set[0]] = Position::Element(Box::new(Strong { open: true }));
        positions[set[1]] = Position::Element(Box::new(Strong { open: false }));
    }
}

fn indexes_for_modifier(positions: &Vec<Position>, modifier: char) -> Vec<usize> {
    // let's skip this one
    todo!()
}
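The article skips indexes_for_modifier, but one plausible implementation is a single pass that records the index of every still-unswapped Modifier token matching the given char. This is a sketch, not the author's actual code, with a stripped-down Position that leaves out the Element variant:

```rust
enum Token {
    Text(String),
    Modifier(char),
}

// minimal stand-in: the real Position also has an Element variant,
// which the catch-all arm below would skip just the same
enum Position {
    Token(Token),
}

// collect the index of every Modifier token matching the given char;
// anything already swapped for an Element is not a Token and is skipped
fn indexes_for_modifier(positions: &[Position], modifier: char) -> Vec<usize> {
    positions
        .iter()
        .enumerate()
        .filter_map(|(i, pos)| match pos {
            Position::Token(Token::Modifier(c)) if *c == modifier => Some(i),
            _ => None,
        })
        .collect()
}
```

For the tokenized "a *b*" this returns [1, 3], exactly the pair that strong(..) then swaps.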
After we are done with all the rules, we turn any Token that may not have been swapped (such as when there is only one `*` modifier in a line) into a text element. We can then return our new Vec of elements:
fn parse_string(block: String) -> Vec<Box<dyn Element>> {
    // snip
    positions
        .drain(0..)
        .map(|pos| -> Box<dyn Element> {
            match pos {
                Position::Token(t) => Box::new(Text {
                    inner: t.to_string(),
                }),
                Position::Element(e) => e,
            }
        })
        .collect::<Vec<_>>()
}
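The t.to_string() call on leftover tokens implies Token implements Display somewhere; that impl is not shown in the article, but a sketch of what it might look like:

```rust
use std::fmt;

enum Token {
    Text(String),
    Modifier(char),
}

// assumed, not shown in the article: a leftover token renders as the
// literal text or modifier character it was tokenized from, which also
// gives us to_string() for free
impl fmt::Display for Token {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Token::Text(t) => write!(f, "{t}"),
            Token::Modifier(c) => write!(f, "{c}"),
        }
    }
}
```

That way an unpaired `*` survives into the output as a plain asterisk instead of disappearing.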
Of course we also have to add our new elements, but they are quite straightforward:
pub struct InlineCode {
    pub open: bool,
}

impl Element for InlineCode {
    fn render(&self) -> String {
        if self.open {
            "<code>".to_string()
        } else {
            "</code>".to_string()
        }
    }
}

pub struct Strong {
    pub open: bool,
}

impl Element for Strong {
    fn render(&self) -> String {
        if self.open {
            String::from("<strong>")
        } else {
            String::from("</strong>")
        }
    }
}
Finally, we have to update our Paragraph implementation to hold an inner: Vec<Box<dyn Element>> instead of just a string. This is probably not the most efficient way to render and concatenate the inner elements, but it will do for now.
pub struct Paragraph {
    pub inner: Vec<Box<dyn Element>>,
}

impl Element for Paragraph {
    fn render(&self) -> String {
        let inner = self
            .inner
            .iter()
            .map(|element| element.render())
            .collect::<Vec<_>>();
        format!(
            r#"<p>
    {}
</p>
"#,
            inner.join("")
        )
    }
}
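On the efficiency note above: the intermediate Vec plus join("") can be skipped entirely, because an iterator of string-like items collects straight into a single String. A minimal illustration (join_rendered is a made-up helper name):

```rust
// collecting an iterator of &str into String concatenates in one pass,
// without building an intermediate Vec and calling join
fn join_rendered(parts: &[&str]) -> String {
    parts.iter().copied().collect()
}
```

Inside render this would look like `let inner: String = self.inner.iter().map(|e| e.render()).collect();`.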
That's it for this change! You can check out the code here