Soupault 4.8.0 release

Date: 2024-01-19

Soupault 4.8.0 is available for download from my own server and from GitHub releases. It is a small release that fixes a bug with parsing HTML of page bodies that could cause weird behavior when tags like style were found in pages, makes the index entry data available to all hooks past the indexing stage, and adds a small helper function — HTML.inner_text(). However, there are big plans for the next release (whenever it is made) — read on for details.

New features and improvements

site_index variable is now available to the post-build hook.
index_entry variable (the complete site index entry for the current page) is now available to post-index, save and post-save hooks and to Lua index processors.
New options for ignoring certain paths in the site dir: settings.ignore_path_regexes and settings.ignore_directories.

New plugin API functions

HTML.inner_text() — returns the text nodes from inside a node, stripped of all HTML tags.

Bug fixes

In generator mode, page files are now parsed as HTML fragments, so <style> tags and similar no longer cause issues with a duplicate <body> tag being inserted in the page (#58, report by Delan Azabani).

Future plans

I’m still glad that most design decisions I made early in the development process stood the test of time. However, some decisions clearly did not: one of them is the idea to always process all pages sequentially.

Here’s the thing: most static site generators load all pages into memory before they do any further processing. They load all data, then process it, then save it all to disk.

Soupault, as of now, doesn’t: it loads a list of page files only, then reads every page file individually, processes it, saves the generated HTML to disk, and moves on to the next page. To provide site index data to index pages, it splits the page file list into content and index pages and processes content pages first, then gets to the index pages (typically, pages named index.*).
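The sequential workflow described above can be sketched roughly as follows. This is a toy Python illustration, not soupault's actual code (soupault is written in OCaml): the helper names, the in-memory SOURCES and OUTPUT dictionaries standing in for the disk, and the "first line is the title" metadata rule are all assumptions made for the example.

```python
from pathlib import PurePosixPath


def is_index_page(path):
    # Assumption: any file whose stem is "index" (index.html, index.md, ...)
    # counts as an index page.
    return PurePosixPath(path).stem == "index"


def build_sequentially(page_files, read_page, render, save):
    """Process one page at a time: content pages first, index pages last."""
    content_pages = [p for p in page_files if not is_index_page(p)]
    index_pages = [p for p in page_files if is_index_page(p)]

    site_index = []
    for path in content_pages:
        source = read_page(path)
        # Hypothetical metadata extraction: the first line is the title.
        site_index.append({"path": path, "title": source.splitlines()[0]})
        # Content pages are rendered and saved without any index data.
        save(path, render(source, site_index=None))

    for path in index_pages:
        # Only index pages receive the accumulated site index.
        save(path, render(read_page(path), site_index=site_index))


def render(source, site_index):
    # Toy renderer: append the titles of indexed pages when an index is given.
    if site_index is None:
        return source
    titles = ", ".join(entry["title"] for entry in site_index)
    return f"{source} [index: {titles}]"


# A toy site held in dictionaries instead of files on disk.
SOURCES = {
    "index.html": "Home",
    "blog/first.html": "First post",
    "blog/second.html": "Second post",
}
OUTPUT = {}

build_sequentially(list(SOURCES), SOURCES.__getitem__, render, OUTPUT.__setitem__)
```

Only one page source is held at a time here, which is what keeps memory use independent of site size; the price is that blog/first.html is saved before the index is complete.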

There are a few reasons why I did it that way:

It allows soupault to process websites of potentially unlimited size that do not fit in RAM — uniquely among all SSGs.
It makes the mental model of what soupault does very simple.
It naturally groups page processing logs by page file, which makes reading debug logs simpler.

However, there’s a huge compromise: content pages don’t have access to site index data by default. In SSGs that use “front matter”, it’s trivial to provide every page with its own metadata. Soupault, however, extracts metadata from HTML itself, and thus widgets can create new metadata: for this reason there’s an option to only do index extraction after certain widgets have run (extract_after_widgets).

That meant that the only way to provide content pages with access to their own metadata was to do some of the processing twice, so in soupault 4.0.0 I introduced a two-pass workflow that allows the user to do that, at the cost of increased build time and debug logs that are even harder to follow.

Initially I assumed that accessing a page’s own metadata or the entire site index from content pages was an uncommon use case. However, that assumption was obviously false, and there are lots of use cases for that: autogenerated site-wide navigation menus (like the chapter index pane in ocamlbook.org), tag clouds, and more. Moreover, lots of questions on the mailing list and the IRC channel are about that behavior, and it’s clear that making index data available to all pages by default would make soupault much more intuitive for people.

So, in the next big release, I plan to make the following changes:

The two-pass workflow will be removed.
The default mode will be to load all pages into memory first, and index.index_first = true will be the new implicit default.
It will be possible to enable the classic sequential workflow by setting that option to false in generator mode or by switching to the post-processor mode, so that larger-than-RAM websites can still be processed.
In all modes, pages may have access to their own metadata (but access to the site-wide index will require that index.index_first is not set to false).
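One possible reading of the planned default mode, sketched in Python purely for illustration: all pages are loaded first, the complete index is built, and only then is each page rendered with both its own entry and the whole index. The implementation details are explicitly not settled yet, so the function names, the dictionary-based "site", and the title extraction rule below are assumptions, and index pages are included in the index only to keep the example short.

```python
def extract_entry(path, text):
    # Hypothetical metadata extraction: the first line of a page is its title.
    return {"path": path, "title": text.splitlines()[0]}


def render(text, index_entry, site_index):
    # Toy site-wide navigation: link to every page except the current one.
    nav = " | ".join(
        entry["title"]
        for entry in site_index
        if entry["path"] != index_entry["path"]
    )
    return f"{text} (see also: {nav})"


def build_index_first(sources, extract_entry, render):
    """Build the full index from pages held in memory, then render all pages."""
    # Step 1: every page is already in memory, so the index can be complete
    # before any page is rendered.
    site_index = [extract_entry(path, text) for path, text in sources.items()]
    by_path = {entry["path"]: entry for entry in site_index}

    # Step 2: each page gets its own entry and the whole site index.
    return {
        path: render(text, by_path[path], site_index)
        for path, text in sources.items()
    }


SOURCES = {
    "index.html": "Home",
    "a.html": "Chapter A",
    "b.html": "Chapter B",
}
OUTPUT = build_index_first(SOURCES, extract_entry, render)
```

Because the whole `sources` mapping has to exist before rendering starts, memory use grows with the size of the site, which is why the classic sequential workflow stays available for larger-than-RAM websites.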

I still need to work out implementation details so it will take some time, and I have a few more plans for the next big release that also need quite a lot of work, but I’m committed to making soupault better for everyone, so make sure to check the mailing list and the Atom feed once in a while.