---
title: The Switch to Hakyll
published: May 14, 2013
excerpt: Migrating from Jekyll to Hakyll
tags: Hakyll, Haskell, Pandoc
---

* toc

This site was originally built with [Jekyll](http://jekyllrb.com/). Technically, I began with the pre-packaged distribution known as [Octopress](http://octopress.org/), which offered a Rakefile for common tasks as well as an out-of-the-box directory structure. I didn't use many of these features, however, so I had been wanting to shed the traces of Octopress, partly motivated by the pursuit of faster site generation. I found the opportunity to do this when Jekyll 1.0 was released recently.

To cut away the unnecessary components of Octopress, I decided to go through every file and keep only what I absolutely needed. This is evident in commits after [`712168ec`](https://github.com/blaenk/blaenk.github.com.jekyll/commit/712168ec33004b693cc8cfb553a6a861da6a8708). I was well on my way to making the site's source a lot leaner when I remembered that I had been wanting to try [Hakyll](http://jaspervdj.be/hakyll/), a static site generator written in Haskell that I had heard about on Hacker News. Given that I was more or less starting the site from scratch, I figured it was the perfect opportunity to try it.

Ultimately, this site is now compiled with Hakyll. It took me about a week to implement every feature I wanted in Hakyll and Pandoc. The net effect is a highly appreciable difference in speed and flexibility.

## File Structure

When new to a system, learning its directory structure can help one get oriented. Unlike some other static site generators, Hakyll doesn't enforce any particular directory structure or convention. The one I have adopted for my [repository](https://github.com/blaenk/blaenk.github.io) looks like this:

Entry                Purpose
---------------      ------------------------------
provider/            compilable content
src/                 Hakyll, Pandoc customizations
Setup.hs             build type
blaenk.cabal         dependency management
readme.markdown      repository information

I build the site binary with `cabal build`, which results in a new top-level directory `dist/`{.path} that stores the object files generated by GHC. The `site` binary, stored at the top level, is the actual binary used for generating and manipulating the site. This binary has a variety of options; the ones I commonly use are:

Option      Purpose
--------    --------------------------------------------------------------
build       Generate the entire site
preview     Generate changes on-the-fly and serve them on a preview server
deploy      Deploy the site using a custom deploy procedure

**Build** creates a top-level directory `generated/`{.path} with two sub-directories: `cache/`{.path} for cached content and `site/`{.path} where the compiled site is stored. **Deploy** puts the compiled site into the top-level directory `deploy/`{.path}, which is git-controlled, and force pushes the content to the master branch, effectively deploying the site (on GitHub).

## Hakyll

As I mentioned earlier, Hakyll is a static site generator written in Haskell. Hakyll sites are fleshed out using a Haskell [Embedded Domain Specific Language](http://www.haskell.org/haskellwiki/EDSL) (EDSL). This EDSL is used to declare rules for different patterns that should be searched for within the provider directory and what should be done with them.
For example, in the following Hakyll program:

~~~ {.haskell}
main :: IO ()
main = hakyll $ do
  match "images/*" $ do
    route   idRoute
    compile copyFileCompiler
~~~

`match "images/*"` is a [`Rule`](http://hackage.haskell.org/packages/archive/hakyll/latest/doc/html/Hakyll-Core-Rules.html) stating that the provider directory should be searched for all files matching the glob `images/*`, that they should be [`Route`](http://hackage.haskell.org/packages/archive/hakyll/latest/doc/html/Hakyll-Core-Routes.html)d using `idRoute`, and that they should be compiled using the [`Compiler`](http://hackage.haskell.org/packages/archive/hakyll/latest/doc/html/Hakyll-Core-Compiler.html) `copyFileCompiler`.

Routing a file, in the context of a static site generator like Hakyll, refers to the mapping between the file as it sits in the provider directory and its name/path in the compiled directory; in this case, `idRoute` keeps the same name/path in the compiled directory. Compiling a file in this context refers to the operations that should be performed on the contents of the file, for example processing through Pandoc for Markdown-to-HTML generation or, in this case, simply copying the file from the provider directory to the compiled directory.

`Compiler` is a [Monad](https://en.wikipedia.org/wiki/Monad_%28functional_programming%29), which allows for seamless chaining of operations that should be performed on any given file. For example, here is my `Rule` for regular posts:

~~~ {.haskell}
match "posts/*" $ do
  route nicePostRoute
  compile $ getResourceBody
    >>= withItemBody abbreviationFilter
    >>= pandocCompiler
    >>= loadAndApplyTemplate "templates/post.html" (tagsCtx tags <> postCtx)
    >>= loadAndApplyTemplate "templates/layout.html" postCtx
~~~

This states that the compilation process for any given post is as follows:

1. the post body (i.e. excluding post metadata) is read
2. the result is passed to an abbreviation substitution filter
3. the result is passed to my custom Pandoc compiler
4. the result is embedded into a post template with a so-called "post context"
5. the result is embedded into the page layout

A post is routed using the `nicePostRoute` function, which is largely borrowed from [Yann Esposito](http://yannesposito.com/Scratch/en/blog/Hakyll-setup/). It simply routes `posts/this-post.markdown`{.path} to `posts/this-post/index.html`{.path} so that the post can be viewed at `posts/this-post/`{.path}.

An interesting thing to note is that when templates are applied, they are supplied a [`Context`](http://hackage.haskell.org/packages/archive/hakyll/latest/doc/html/Hakyll-Web-Template-Context.html). A `Context` is simply a [Monoid](http://en.wikipedia.org/wiki/Monoid) that associates a key (i.e. a `String` identifier for the field) with a handler for an [`Item`](http://hackage.haskell.org/packages/archive/hakyll/latest/doc/html/Hakyll-Core-Item.html). During application of a template, if a field of the form `$key$` is encountered, the supplied `Context` is searched for an appropriate handler (i.e. one with the same key). If one is found, the item is passed to that handler and the result is substituted into the template. In the above `Rule` for posts, I pass a pre-crafted post `Context`, `postCtx`, and [`mappend`](http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data-Monoid.html#v:mappend) to it a special tags context, `tagsCtx`, which encapsulates tags information for that post.
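To give an idea of what building up such a `Context` looks like, here's a minimal sketch, not taken from my actual configuration, using Hakyll's `field` combinator; the `shout` key is a made-up example:

~~~ {.haskell}
import Data.Char (toUpper)
import Data.Monoid ((<>))
import Hakyll

-- A hypothetical context: $shout$ in a template would be replaced with
-- the item's body in upper case, while defaultContext provides the
-- usual fields such as $title$ and $body$.
shoutCtx :: Context String
shoutCtx = field "shout" (return . map toUpper . itemBody)
        <> defaultContext
~~~

Because `Context` is a Monoid, contexts compose with `mappend`/`<>`, which is exactly what happens above when `tagsCtx tags <> postCtx` is passed to the template.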
### SCSS

The first customization I made was to add support for [SCSS](http://sass-lang.com/). This is usually possible with a simple line:

~~~ {.haskell}
getResourceString >>= withItemBody (unixFilter "sass" ["-s", "--scss"])
~~~

This works fine in POSIX environments, and Linux is my primary development environment. However, it's very useful to me to have Windows support as well. The problem is that on Windows, Ruby gem binaries---such as `scss`---are implemented as batch file stubs. The underlying function used for creating the process in `unixFilter` is [System.Process](http://hackage.haskell.org/packages/archive/process/latest/doc/html/System-Process.html)' [`createProcess`](http://hackage.haskell.org/packages/archive/process/latest/doc/html/System-Process.html#v:createProcess), specifically with the `proc` type. On Windows, this uses the [`CreateProcess`](http://msdn.microsoft.com/en-us/library/windows/desktop/ms682425.aspx) function, which doesn't run batch files unless they are invoked explicitly with `cmd.exe /c batchfile`. The problem is that there is no simple way to find the file path of the batch file stub for `scss`.

The solution is to use the `shell` type with `createProcess` instead of `proc`. This has the effect of a `system` call, where the parameter is interpreted by the shell (in Windows' case, `cmd.exe`). As a result, the program can simply be invoked as `scss`, leaving the shell to find and run the appropriate batch file stub. To accomplish this, I had to implement what was essentially a mirror copy of [Hakyll.Core.UnixFilter](http://hackage.haskell.org/packages/archive/hakyll/latest/doc/html/Hakyll-Core-UnixFilter.html) with `proc` switched out for `shell`. I'll be suggesting a pull request upstream soon which gives the user the option and removes the duplicated code. Now I can implement an SCSS compiler like the following, though in my actual implementation I pass it a few extra parameters:

~~~ {.haskell}
getResourceString >>= withItemBody (shellFilter "sass -s --scss")
~~~

### Abbreviations {#abbreviations}

One feature I missed from [kramdown](http://kramdown.rubyforge.org/) that wasn't available in my new markdown processor, [Pandoc](http://johnmacfarlane.net/pandoc/), was abbreviation substitution. It consists of writing abbreviation definitions which are then used to turn every occurrence of the abbreviation into a proper [`abbr`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/abbr) HTML tag with a tooltip containing the definition.

I had hardly used regular expressions in Haskell before, so the way they're used was pretty confusing to me at first. There's a base package called [regex-base](http://hackage.haskell.org/package/regex-base) which exposes a common interface API, and then there are a variety of backend implementations. Hakyll happens to use [regex-tdfa](http://hackage.haskell.org/package/regex-tdfa), a fast and popular backend, so I decided to use that one instead of introducing additional dependencies.

One way of using regular expressions in Haskell is through type inference, as is described in the [Text.Regex.Base.Context](http://hackage.haskell.org/packages/archive/regex-base/latest/doc/html/Text-Regex-Base-Context.html) documentation:

> This module name is Context because they [sic] operators are context dependent: use them in a context that expects an Int and you get a count of matches, use them in a Bool context and get True if there is a match, etc.
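As a rough illustration of what the documentation means---a made-up snippet, not from my code---the very same `=~` application yields different results depending on the type expected of it:

~~~ {.haskell}
import Text.Regex.TDFA ((=~))

-- The expected result type drives what =~ computes.
matchCount :: Int
matchCount = "one two three" =~ "[a-z]+"   -- 3: the number of matches

hasDigits :: Bool
hasDigits = "one two three" =~ "[0-9]+"    -- False: no match exists

firstMatch :: String
firstMatch = "one two three" =~ "[a-z]+"   -- "one": the first match
~~~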
Keeping this in mind, I explicitly annotated the `[[String]]` type since I wanted every match and sub-match. I created a function `abbreviationReplace` that takes a `String`, removes the abbreviation definitions, and then creates `abbr` tags out of every occurrence of the abbreviation using the parsed definitions. The `abbreviationReplace` function begins like this:

~~~ {.haskell}
abbreviationReplace :: String -> String
abbreviationReplace body =
  let pat   = "^\\*\\[(.+)\\]: (.+)$" :: String
      found = body =~ pat :: [[String]]
~~~

### Git Tag

In a [previous post](/posts/commit-tag-for-jekyll/) I talked about a liquid tag I created for Jekyll which inserts the SHA of the commit on which the site was last generated. I have come to like this small feature of my site. It's not some tacky "Powered by blah" footer; it's pretty unobtrusive. It goes unnoticed by people who wouldn't understand what it's about, while those who would might immediately recognize its purpose.

**Update**: I have stopped including the git commit in the footer of every page. The problem with doing this was that, in order to have every page reflect the new commit, I had to regenerate every page before deploying. This obviously doesn't scale well as more and more pages are added to the site. Instead I have adopted a per-post commit and history link, which I believe is a lot more meaningful and meshes perfectly well with page generation: if a post is modified, there'll be a commit made for it, and since it was modified it will have to be regenerated anyway. Now I simply include social links in the footer.

One thing I forgot to update the previous post about was that I ended up switching from the Rugged git bindings for Ruby to just using straight-up git commands and reading their output. The reason for doing this was that, while everything worked perfectly fine on Linux, Rugged had problems building on Windows. Taking this approach ended up being simpler and had the added benefit of decreasing my dependencies.

The equivalent of a liquid tag in Jekyll would be a field, expressed as a `Context`. For this reason I created the `gitTag` function, which takes a desired key, such as `git`, which would be used as `$git$` in templates, and returns a `Context` that produces the formatted HTML `String`. One problem was that to do this I had to use `IO`, so I needed some way to escape the `Compiler` Monad. It turned out that Hakyll already has a function for something like this called `unsafeCompiler`, which it uses for `unixFilter`, for example. Here's what `gitTag` looks like:

~~~ {.haskell}
gitTag :: String -> Context String
gitTag key = field key $ \_ ->
  unsafeCompiler $ do
    sha     <- readProcess "git" ["log", "-1", "HEAD", "--pretty=format:%H"] []
    message <- readProcess "git" ["log", "-1", "HEAD", "--pretty=format:%s"] []
    -- wrap the truncated SHA in a link to the commit, with the
    -- commit message as the tooltip
    return ("<a href=\"https://github.com/blaenk/blaenk.github.io/commit/" ++
            sha ++ "\" title=\"" ++ message ++ "\">" ++ take 8 sha ++ "</a>")
~~~

## Pandoc

Hakyll configuration is fairly straightforward. What took longer was re-implementing some features I had in [kramdown](http://kramdown.rubyforge.org/) when I used Jekyll that weren't available in my new document processor, [Pandoc](http://johnmacfarlane.net/pandoc/).

Pandoc is a very interesting project that basically works by parsing input documents into a common intermediate form represented as an abstract syntax tree (AST). This AST can then be used to generate an output document in a variety of formats. In this spirit, I feel it's a lot like the [LLVM](http://llvm.org/) project.
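As a quick illustration of that architecture, converting Markdown to HTML with the Pandoc library is just parse-to-AST followed by render-from-AST. This is a hedged sketch against the library API as of this writing; function names and options may differ across versions:

~~~ {.haskell}
import Text.Pandoc

-- Parse Markdown into the Pandoc AST, then render that AST as HTML.
-- `def` supplies default reader/writer options.
markdownToHtml :: String -> String
markdownToHtml = writeHtmlString def . readMarkdown def
~~~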
It seems to me that Pandoc has been gaining popularity, especially from an end-user perspective (i.e. using the `pandoc` binary), where it's commonly used to write manual pages in markdown or to generate ebooks. The very nature of how Pandoc transforms input documents into an AST lends itself to straightforward AST transformations. I have created two such transformations so far: one for Pygments syntax highlighting and another for fancy table of contents generation.

One of the things I needed to implement, however, was the abbreviation substitution described above. I would have implemented it as a Pandoc customization, but Pandoc has no representation for abbreviations in its abstract syntax tree. This is why I implemented it as a Hakyll compiler instead, using simple regular expressions. There is actually work toward implementing abbreviation substitution according to the [readme](http://johnmacfarlane.net/pandoc/README.html), under the section "Extension: abbrevations" [sic], but it says:

> Note that the pandoc document model does not support abbreviations, so if this extension is enabled, abbreviation keys are simply skipped (as opposed to being parsed as paragraphs).

### Pygments {#pygments}

**Update**: This has been through two redesigns since this was written. The first involved an fs-backed caching system, but this was still too slow, since the bottleneck seemed to be the continuous spawning of new pygmentize processes. Most recently I've created a pygments server that the site launches alongside itself, with which this Pandoc AST transformer communicates through its stdin/stdout handles. It works perfectly and the site compiles a lot quicker. It also fully supports UTF-8:

```
¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯ · ₹
```

One of the first things I wanted to implement right away was syntax highlighting with [Pygments](http://pygments.org/). There are a variety of options for syntax highlighting. In fact, Pandoc comes with support for [highlighting-kate](http://johnmacfarlane.net/highlighting-kate/), a Haskell syntax highlighting package written by the author of Pandoc. However, I don't find it to be on par with Pygments.

In the past, I simply posted code to [gist](https://gist.github.com/) and then embedded it into posts. This caused unnecessary overhead and, more importantly, would break my site whenever GitHub made changes to the service. Eventually I realized that GitHub just uses Pygments underneath, so I implemented a Pandoc AST transformer that finds every [`CodeBlock`](http://hackage.haskell.org/packages/archive/pandoc-types/latest/doc/html/Text-Pandoc-Definition.html#t:Block), extracts the code within it, passes it to Pygments, and replaces the `CodeBlock` with a `RawBlock` containing the raw HTML output by Pygments. I also implemented a way to specify an optional caption which is shown under the code block. I use [blaze-html](http://jaspervdj.be/blaze/) for the parts where I need to hand-craft HTML.

Ultimately, this all means that I can write code blocks like this in markdown:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~ {lang="haskell"}
testFunction :: String -> Integer
~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Or, with a caption:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~ {lang="ruby" text="some caption"}
args.map! {|arg| arg.upcase}
~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
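To tie this into Hakyll, a transformer like this has to run between Pandoc's parse and render steps. Here's a minimal sketch---not my actual code---using Hakyll's `pandocCompilerWithTransform`, assuming the `pygments` function shown further below is in scope:

~~~ {.haskell}
import Hakyll
import Text.Pandoc.Generic (bottomUp)

-- A hypothetical custom compiler: parse with Pandoc, run the pygments
-- transformer over every Block in the AST, then render to HTML.
customPandocCompiler :: Compiler (Item String)
customPandocCompiler =
    pandocCompilerWithTransform defaultHakyllReaderOptions
                                defaultHakyllWriterOptions
                                (bottomUp pygments)
~~~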
One thing I had to do was invoke [`unsafePerformIO`](http://www.haskell.org/ghc/docs/latest/html/libraries/base/System-IO-Unsafe.html#v:unsafePerformIO) in the function I created which runs the code through `pygmentize`, an end-user binary for the Pygments library. I'm not sure if there's a better way to do this, but my justification for using it is that Pygments should return the same output for any given input. If it doesn't, then there are probably larger problems.

~~~ {.haskell}
pygmentize :: String -> String -> String
pygmentize lang contents = unsafePerformIO $ do
  -- a plausible completion (assumed flags): pipe the code through pygmentize
  readProcess "pygmentize" ["-l", lang, "-f", "html"] contents
~~~

I don't feel particularly worried about it, given my justification. It's a similar justification to the one used by Real World Haskell when [creating bindings](http://book.realworldhaskell.org/read/interfacing-with-c-the-ffi.html#id655783) for [PCRE](http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions) with the foreign function interface:

> It lets us say to the compiler, "I know what I'm doing - this code really is pure". For regular expression compilation, we know this to be the case: given the same pattern, we should get the same regular expression matcher every time. However, proving that to the compiler is beyond the Haskell type system, so we're forced to assert that this code is pure.

This is what the AST transformer I wrote looks like:

~~~ {.haskell text="Pandoc AST transformer for Pygments syntax highlighting"}
pygments :: Block -> Block
pygments (CodeBlock (_, _, namevals) contents) =
  let lang     = fromMaybe "text" $ lookup "lang" namevals
      text     = fromMaybe "" $ lookup "text" namevals
      colored  = renderHtml $ H.div ! A.class_ "code-container" $ do
                   preEscapedToHtml $ pygmentize lang contents
      caption  = if text /= ""
                   then renderHtml $ H.figcaption $ H.span $ H.toHtml text
                   else ""
      composed = renderHtml $ H.figure ! A.class_ "code" $ do
                   preEscapedToHtml $ colored ++ caption
  in RawBlock "html" composed
pygments x = x
~~~

### Table of Contents {#table-of-contents}

The more sophisticated and complex of the AST transformers I wrote for Pandoc is the one for table of contents generation. This is something kramdown had out of the box, though not as fancy. Paired with automatic id generation for headers, it meant that simply placing `{:toc}` in a page would replace it with an automatically generated table of contents based on the headers used in the page.

#### Alternatives

Pandoc actually does have support for table of contents generation using the `--toc` flag. In fact, [Julien Tanguy](http://julien.jhome.fr/posts/2013-05-14-adding-toc-to-posts.html) recently devised a way to generate a separate version of every post that only included the table of contents, then re-introduce the table of contents as a `Context` field, `$toc$`. I actually tried this approach, along with a metadata field that decided whether the table of contents should be included in a given post or page. However, I ended up deciding against it. One advantage was that it took less code on my end, and I would possibly avoid re-inventing the wheel.

One reason I didn't keep it was that there was a tiny increase in compilation time, which I fear might accumulate as the number of posts grows. The reason for this is that the table of contents is generated for every post/page, instead of only for the ones that should display it.
Another reason was that it would require me to implement the fancy section numbering in JavaScript, which I don't think would be too difficult, since in that case the table of contents already exists and I would simply need to insert my numbering. The main reason I decided against it, along with the previous two, is that there would be a noticeable delay between the moment the table of contents is shown plainly and the moment it is transformed into my custom table of contents.

#### Implementation

Implementing this involved many steps. In general terms, I had to make a pass through the document to collect all of the headers, then make another pass to find a special sentinel marker I would manually place in the document and replace it with the generated table of contents. This effectively makes table of contents generation a two-pass transformer.

Gathering all of the headers and their accompanying information (i.e. HTML `id`, text, and level) proved to be a pretty straightforward task using [`queryWith`](http://hackage.haskell.org/packages/archive/pandoc-types/latest/doc/html/Text-Pandoc-Generic.html#v:queryWith) from the [pandoc-types](http://hackage.haskell.org/package/pandoc-types) package:

~~~ {.haskell}
queryWith :: (Data a, Monoid b, Data c) => (a -> b) -> c -> b
-- Runs a query on matching a elements in a c.
-- The results of the queries are combined using mappend.
~~~

Once I collect all of the `Header` items' information, I normalize them by finding the smallest header _level_ (i.e. the biggest header) and normalizing all headers based on that level. That is, if the smallest header level is 3 (i.e. `h3`), every header gets its level subtracted by 2 so that all headers are level 1 and above. Note that I'm not actually modifying the headers in the document, just the information about them that I've collected.

Next, a [`Data.Tree`](http://hackage.haskell.org/packages/archive/containers/latest/doc/html/Data-Tree.html) is constructed out of the headers, which automatically encodes their nesting. This is done by exploiting [`groupBy`](http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-List.html#v:groupBy), passing it `<` as an "equivalence" predicate:

~~~ {.haskell}
tocTree :: [TocItem] -> Forest TocItem
tocTree = map (\(x:xs) -> Node x (tocTree xs)) . groupBy comp
  where comp (TocItem a _ _) (TocItem b _ _) = a < b
~~~

This `Tree` is finally passed to a recursive function that folds every level of the `Tree`---known as a `Forest`---into a numbered, unordered list. While that may sound like an oxymoron, the point is that I wanted to have nested numbering in my table of contents. For this reason, I create an unordered list in which each item contains a `span` holding its section number, concatenated onto the parent's section number. This function generates the HTML.

The final problem was finding a way to insert the table of contents on demand, in a location of my choosing. In kramdown, this is achieved by writing `{:toc}`, which gets substituted with the table of contents. Pandoc has no such thing, however. For this reason, I chose a list with a single item, "toc," as the placeholder for the table of contents. This means that I write the following wherever I want the table of contents to show up:

~~~
* toc
~~~

You can take a look at the beginning of this post to see what the generated table of contents looks like, especially the nested numbering I was referring to.
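For a sense of what the first pass might look like, here is a hedged sketch of collecting headers with `queryWith`. This is hypothetical code, not my actual implementation; it assumes a `TocItem` carrying the level, `id`, and text, in the shape used by `tocTree` above:

~~~ {.haskell}
import Text.Pandoc.Definition (Pandoc, Block (..))
import Text.Pandoc.Generic (queryWith)
import Text.Pandoc.Shared (stringify)

-- Hypothetical shape: header level, HTML id, and header text.
data TocItem = TocItem Int String String

-- Collect every header in the document, in document order; queryWith
-- mappends the singleton lists produced for each matching Block.
collectHeaders :: Pandoc -> [TocItem]
collectHeaders = queryWith headers
  where headers (Header level (ident, _, _) inlines) =
          [TocItem level ident (stringify inlines)]
        headers _ = []
~~~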
## Deploying

I host my site using GitHub Pages. Such sites are deployed by pushing the site to the master branch of the repository. I wrote a quick shell script that accomplishes this in a pretty straightforward manner. It creates a git-ignored directory, `deploy/`{.path}, which is itself under git control, associated with the same repository but with its master branch instead. When I deploy the site with `./site deploy`, the contents of `deploy/`{.path} are removed---except for the `.git/`{.path} directory---and all of the newly generated files are copied into it.

A commit is then created for the deployment, tagged with the SHA identifier of the commit from which the site was generated, to make it easy for me to track things down later. An eight-character, truncated SHA is used as follows:

~~~ {.bash}
COMMIT=$(git log -1 HEAD --pretty=format:%H)
SHA=${COMMIT:0:8}
git commit -m "generated from $SHA" -q
~~~

Finally, the commit is force pushed to the repository, replacing everything already there and effectively deploying the site.

## Conclusion

The preliminary migration to Hakyll was pretty quick. This included porting all of my posts, pages, and other assets to the Hakyll and Pandoc Markdown formats. The rest of the week was spent implementing the various features, some outlined above, and refining the code base.

At first I was a little rusty with my Haskell and found myself at odds with the seemingly capricious compiler, trying to find one way or another to appease it. I quickly remembered that patience prevails where Haskell is concerned, and eventually came to really enjoy reasoning out the problems and solving them in Haskell.

The site binary, which is in charge of generation, previewing, and so on, is compiled. Once you have configured Hakyll to your liking, you have a very fast binary, especially compared to other site generators, which are known not to scale well with the number of posts. The `Compiler` Monad in Hakyll takes care of dependency tracking, allowing re-generation of only those items affected by the ones that changed, instead of the whole site. But perhaps my favorite aspect of Hakyll is that it's more like a library for static site generation which you use as you see fit, and as a result, your site is entirely customizable.

*[AST]: Abstract Syntax Tree
*[EDSL]: Embedded Domain Specific Language
*[GHC]: Glasgow Haskell Compiler