Build your own blog

This is my blog about building my own blog, starting with a homebrew static site generator. As the purpose of this blog is to learn stuff by building stuff, I will be building my own instead of using an existing tool!

The generator will take input in the form of Markdown (or a subset thereof) and output an HTML page. To do this we will write a small CLI that reads Markdown files from a directory, parses the Markdown into an internal representation, then writes that representation to an HTML file. As I would like to have something left to improve in the future, we will write all posts into one page, and the generator will implement just a subset of Markdown: headings, paragraphs and code blocks.

The process of generating the page will be straightforward, at least for now, and look as follows:


  ┌──────────┐          ┌──────────────────┐          ┌───────────┐
  │ Markdown │ parse()  │ Internal         │ write()  │   HTML    │
  │          ├─────────►│ Representation   ├─────────►│           │
  └──────────┘          └──────────────────┘          └───────────┘
  

From this, we can roughly determine the project structure we will need: a directory for the generator project, one for the input (Markdown files) and one for the output (HTML).

blog/
├─ generator/
│  ├─ src/
│  │  ├─ main.rs
│  │  ├─ model.rs       <- internal representation module
│  │  ├─ parser.rs      <- parsing module
│  │  ├─ writer.rs      <- writing module
├─ posts/
│  ├─ byob.md           <- the markdown source for this post
├─ www/
│  ├─ index.html        <- the HTML we will generate
│  ├─ style.css

Let's get started!

Internal Representation

As we are building version 0.1 of the generator, I feel like cutting some corners just to get a working thing going. This means we will write all articles into one big page, so we can model our page as a list of posts:

pub struct Blog {
    pub posts: Vec<Post>
}

A post consists of a list of markdown elements. As I have (conveniently) chosen to support only a subset of markdown (headings, paragraphs, code blocks), the result after parsing will be a flat list of elements, with no nested elements supported (hyperlinks, inline code, bold or italic text, etc).

pub trait Element {}

pub struct Post {
    pub elements: Vec<Box<dyn Element>>,
}

We then create structs for all the Markdown syntax elements we want to support, showing Heading here. inner is the actual heading text, level is the heading level. I chose "inner" because it makes sense from an HTML point of view, where the inner part is whatever goes between the tags of an HTML element, e.g. <h1> inner is whatever is here </h1>, <p> or here! :-) </p>.

pub struct Heading {
    pub level: usize,
    pub inner: String,
}

impl Element for Heading {}
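
The other two elements could follow the same pattern. This is only a sketch: `Paragraph` is an assumed name, `Code { inner }` matches how the parser constructs code blocks later in this post, and `sample_post` is a hypothetical helper that just shows the three types sharing one list.

```rust
pub trait Element {}

pub struct Heading {
    pub level: usize,
    pub inner: String,
}

// Assumed name; the post does not show this struct.
pub struct Paragraph {
    pub inner: String,
}

pub struct Code {
    pub inner: String,
}

impl Element for Heading {}
impl Element for Paragraph {}
impl Element for Code {}

pub struct Post {
    pub elements: Vec<Box<dyn Element>>,
}

/// Hypothetical helper: builds a tiny post by hand to show that all three
/// element types fit in the same `Vec<Box<dyn Element>>`.
pub fn sample_post() -> Post {
    Post {
        elements: vec![
            Box::new(Heading { level: 1, inner: "Build your own blog".to_string() }),
            Box::new(Paragraph { inner: "Hello!".to_string() }),
            Box::new(Code { inner: "cargo run\n".to_string() }),
        ],
    }
}
```

Boxing each element as a trait object is what lets one flat list hold different struct types.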

Parsing Markdown

Now that we have our data structure down, it is time to parse our first markdown file. I've created a parser module and added a pub function, which should be enough for now. So our parse function should look approximately like this:

pub fn parse_blog() -> Blog {
    // snip
}

In this function we will read files from our posts dir, and then parse the files to our Post type one by one. Reading files means we have to handle IO errors, so we have to update our function signature a bit.

use std::{fs, path::Path};

pub fn parse_blog() -> Result<Blog, std::io::Error> {
    let path = Path::new("../posts");

    let dir = fs::read_dir(path)?;

    let blog = Blog {
        posts: dir
            .filter(|dir_entry| dir_entry.is_ok())
            .map(|dir_entry| {
                let path = dir_entry.unwrap().path();
                let raw_post = fs::read_to_string(&path).unwrap();
                parse_entry(raw_post)
            })
            .collect(),
    };

    Ok(blog)
}

fn parse_entry(raw_post: String) -> Post {
    // here we will do the actual parsing of the files
}

This code definitely leaves something to be desired, as we are unwrapping even though we should be able to bubble the errors up. Let's agree to put that one on the backlog.
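
For when that backlog item comes up, here is a minimal sketch of one way to bubble the errors up instead of unwrapping. It relies on `collect` being able to gather an iterator of `Result`s into a single `Result<Vec<_>, _>`. The name `parse_blog_from`, its `dir` parameter and the stand-in `Post { raw }` are assumptions made so the example compiles on its own; they are not part of the actual generator.

```rust
use std::{fs, io, path::Path};

// Stand-in for the real Post type, so this sketch compiles on its own.
pub struct Post {
    pub raw: String,
}

pub struct Blog {
    pub posts: Vec<Post>,
}

// Placeholder for the real parser.
fn parse_entry(raw_post: String) -> Post {
    Post { raw: raw_post }
}

/// Reads every file in `dir` and returns the first I/O error encountered
/// instead of panicking.
pub fn parse_blog_from(dir: &Path) -> io::Result<Blog> {
    let posts = fs::read_dir(dir)?
        .map(|dir_entry| -> io::Result<Post> {
            let path = dir_entry?.path();
            let raw_post = fs::read_to_string(&path)?;
            Ok(parse_entry(raw_post))
        })
        // Collecting into a Result stops at the first Err.
        .collect::<io::Result<Vec<Post>>>()?;

    Ok(Blog { posts })
}
```

Unlike the filter-and-unwrap version, a failing directory entry is reported to the caller rather than silently skipped.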

On to parsing the files: The elements we chose allow us to parse our source in a line-by-line fashion. I usually automatically pick a for-in loop, which would look something like this:

fn parse_entry(raw_post: String) -> Post {
    let mut elements: Vec<Box<dyn Element>> = vec![];

    for line in raw_post.lines() {
        if line.starts_with("#") {
            // heading
        }

        else if line.starts_with("```") {
            // code block
        }

        else if line.is_empty() {
            // skip empty lines
            continue;
        }

        else {
            // paragraph
        }
    }

    Post { elements }
}

But now we have a problem. In Markdown, the end of an element is often indicated by an empty line, which means we can only determine that an element is finished in the next iteration of the loop. But! On the next iteration of the loop, how do we know a code block started on the previous line? We now have to keep track of what kind of block we are in (for instance, code block or paragraph) by introducing state:

let mut code_block = false;

for line in raw_post.lines() {
  // snip

  else if line.starts_with("```") {
    code_block = true;

    // do stuff
  }

  // snip
}

Furthermore, when we reach an empty line now, we have to check if we were currently parsing some kind of block:

let mut code_block = false;
let mut paragraph = false;
let mut inner = String::new();

for line in raw_post.lines() {
    // snip

    if line.is_empty() {
        if code_block {
            code_block = false;

            // finalise the code element
            let code = Code { inner };

            // don't forget to reset inner
            inner = String::new();
        } else if paragraph {
            // do the same thing for paragraph
        } else {
            continue;
        }
    }

    // snip
}

This is definitely not the way to go, as it spreads the logic for parsing a certain type of element across multiple locations in the code. Instead we will just take an iterator and move across it inside a loop {}. We can then keep calling line_iter.next() until we see the closing line, and handle everything we need to handle for this element within this single if-block:

    let mut line_iter = raw_post.lines();

    loop {
        // stop once the iterator runs out of lines
        let line = match line_iter.next() {
            Some(line) => line,
            None => break,
        };

        // code block
        if line.starts_with("```") {
            loop {
                // read until we see the closing "```"
            }
        }
    }

The code to read a code block then looks like this:

if line.starts_with("```") {
    let mut inner = String::new();

    loop {
        let next = line_iter.next();

        if let Some(next_line) = next {
            if next_line.starts_with("```") {
                break;
            } else {
                inner += &format!("{}\n", next_line);
            }
        } else {
            break;
        }
    }

    elements.push(Box::new(Code { inner }));
}

We handle paragraphs in a similar way. Headings are slightly different, as they are a one-line element, but we do have to count the number of # characters to determine the level of the heading:

if line.starts_with("#") {
    elements.push(Box::new(Heading {
        level: line.matches("#").count(),   // who can spot the bug here?
        inner: line.trim_start_matches("#").to_string(),
    }));
}
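
One candidate answer to the question in the comment: `matches("#")` counts every `#` in the line, so a heading like `## Learning C# the hard way` would end up at level 3. The heading text also keeps its leading space. A sketch of a version that only counts the leading run of `#` could look like this; `parse_heading` is a hypothetical helper, not a function from the generator.

```rust
pub struct Heading {
    pub level: usize,
    pub inner: String,
}

/// Parses a heading line, counting only the leading run of `#`.
/// Returns `None` when the line does not start with `#`.
pub fn parse_heading(line: &str) -> Option<Heading> {
    let level = line.chars().take_while(|&c| c == '#').count();
    if level == 0 {
        return None;
    }

    Some(Heading {
        level,
        // '#' is a single byte in UTF-8, so `level` is also a valid byte offset.
        inner: line[level..].trim().to_string(),
    })
}
```

This sketch does not cap the level at 6, so a line with seven leading `#` would still produce an invalid `<h7>` when rendered.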

Now, this method of parsing probably will not hold once we implement some of the nested elements Markdown specifies (like hyperlinks, text formatting, etc). But that will probably be a blog post on its own.
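
For completeness, the paragraph case mentioned above might look roughly like the following, using the same inner-loop pattern as the code block. This is a sketch under assumptions: `Paragraph` and `read_paragraph` are hypothetical names, a paragraph is taken to end at the first empty line or at the end of the input, and joining the lines with a single space is my own choice.

```rust
use std::str::Lines;

pub struct Paragraph {
    pub inner: String,
}

/// Collects a paragraph that starts at `first_line` and runs until the next
/// empty line (or the end of the input), joining the lines with spaces.
pub fn read_paragraph(first_line: &str, line_iter: &mut Lines<'_>) -> Paragraph {
    let mut inner = first_line.trim().to_string();

    loop {
        match line_iter.next() {
            Some(next_line) if !next_line.trim().is_empty() => {
                inner.push(' ');
                inner.push_str(next_line.trim());
            }
            // An empty line or the end of the input closes the paragraph.
            _ => break,
        }
    }

    Paragraph { inner }
}
```

The closing empty line is consumed by the helper, so the outer loop continues with whatever comes after it.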

Writing HTML

We have parsed our markdown files, and now we want to magic our data into HTML. I've tried to do this in the most straightforward way I could think of, and starting again I would definitely change some (lots of) things. But, let's stick to what I did.

We have a Blog which has Posts and those have Elements. We start by giving Blog a function render:

pub struct Blog {
    pub posts: Vec<Post>,
}

impl Blog {
    pub fn render(&self) -> String {
        // snip
    }
}

Inside render we define a head and tail and we write those to a mutable string, with our posts in between:

use std::fmt::Write; // needed for `write_str` on `String`

impl Blog {
    pub fn render(&self) -> String {
        let mut doc = String::new();

        let head = r#"raw html string goes here :-)"#;
        let tail = r#"some more raw html goes here"#;

        doc.write_str(head).unwrap();

        for post in self.posts.iter() {
            doc.write_str(&post.render()).unwrap();
        }

        doc.write_str(tail).unwrap();

        doc
    }
}

Blog calls render() on Post, so let's implement that as well:

pub struct Post {
    pub elements: Vec<Box<dyn Element>>,
}

impl Post {
    fn render(&self) -> String {
        let mut post = String::new();

        for elem in self.elements.iter() {
            post.write_str(&elem.render()).unwrap();
        }

        post
    }
}

Post calls render() on all of its elements. These are stored as dyn Element, a trait object, so we have to add render() to the trait:

pub trait Element {
    fn render(&self) -> String;
}

Then we add the function to the impl Element block we have for all our elements, using heading as an example:

pub struct Heading {
    pub level: usize,
    pub inner: String,
}

impl Element for Heading {
    fn render(&self) -> String {
        format!(
            r#"        <h{0}>
            {1}
        </h{0}>
"#,
            self.level, self.inner
        )
    }
}

Done! Or are we? When we load the page in a browser, our code block containing:

pub trait Element {}

pub struct Post {
    pub elements: Vec<Box<dyn Element>>,
}

Turns into:

pub trait Element {}

pub struct Post {
    pub elements: Vec>,
}

It turns out we forgot to HTML-encode our inner texts. I've added a quick module called escape, which contains one function, escape_html:

pub fn escape_html(html: &str) -> String {
    html.chars().map(|c| match c {
        '&' => "&amp;".to_string(),
        '<' => "&lt;".to_string(),
        '>' => "&gt;".to_string(),
        '"' => "&quot;".to_string(),
        '\'' => "&apos;".to_string(),
        _ => c.to_string(),

    }).collect::<String>()
}
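
To show where the escaping would plug in, here is a sketch of a render() for the Code element. The `<pre><code>` wrapper is an assumption about the markup, and `escape_html` is repeated in a slightly different form (pushing into one buffer instead of allocating a String per character) only so the example compiles on its own.

```rust
pub trait Element {
    fn render(&self) -> String;
}

pub struct Code {
    pub inner: String,
}

/// Same replacements as the version above, written into a single buffer.
pub fn escape_html(html: &str) -> String {
    let mut escaped = String::with_capacity(html.len());
    for c in html.chars() {
        match c {
            '&' => escaped.push_str("&amp;"),
            '<' => escaped.push_str("&lt;"),
            '>' => escaped.push_str("&gt;"),
            '"' => escaped.push_str("&quot;"),
            '\'' => escaped.push_str("&apos;"),
            _ => escaped.push(c),
        }
    }
    escaped
}

impl Element for Code {
    fn render(&self) -> String {
        // Escape the text first, then wrap it in the (assumed) markup.
        format!("<pre><code>{}</code></pre>\n", escape_html(&self.inner))
    }
}
```

The same call would go around `self.inner` in the render() of headings and paragraphs.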

For syntax highlighting I've added Prism, a client-side JavaScript syntax highlighter, to the page, and to mimic hot-reload, I've also added Live.js. Now we really should be done!
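
The last step of the pipeline, writing the rendered string to www/index.html, is not shown above. A minimal sketch of what it could look like follows; `write_site` and its parameters are hypothetical names, and the real writer module may well be organised differently.

```rust
use std::{
    fs, io,
    path::{Path, PathBuf},
};

/// Writes the rendered page to `<out_dir>/index.html`, creating the
/// directory if needed, and returns the path that was written.
pub fn write_site(html: &str, out_dir: &Path) -> io::Result<PathBuf> {
    fs::create_dir_all(out_dir)?;
    let target = out_dir.join("index.html");
    fs::write(&target, html)?;
    Ok(target)
}
```

In main.rs this could be called with the output of `blog.render()` and a path such as `../www`, which is assumed here from the project layout and the `../posts` path used in the parser.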

Run

For development and authoring, I really want to run all this locally. The easiest way I can think of to launch a dev server is to run one with Python.

python3 -m http.server 8000 -d www -b localhost

Then to run the generator:

cd generator
cargo run

Code

Check out the source code here