Sam Trenholme's webpage

HTMLDOC and Unicode

 

September 19 2025

I go over how I process blog entries.

Processing the blog

I use a hacked version of HTMLDOC to process my blog entries. The entries are written in a special version of Markdown which is then processed as follows:

Markdown → UTF-8 to Xascii converter → HTMLDOC → Lua script → HTML webpage

The steps are as follows:

  • I write the blog in Markdown

  • Since HTMLDOC doesn’t like Unicode (in theory it can accept UTF-8 input; in practice this is so buggy that it’s simply better to convert to a non-Unicode format first), I convert the Unicode into a special 7-bit format called Xascii.

  • I use a slightly hacked version of HTMLDOC 1.9.16 (the last version which easily compiles in Cygwin) to convert the Markdown into HTML

  • Since that HTML isn’t quite in a form suitable for my blog, I run it through a Lua script which uses a bunch of regular expressions to massage the HTML (real footnotes [1], pictures, splitting long words so there aren’t issues on a phone, etc.). The script uses Lua’s 8-bit regular expression engine, with some extensions to support the unusual Xascii format I use. Keeping things 8-bit is good here because then the regular expressions can process non-ASCII characters. The Lua script outputs UTF-8 HTML suitable for putting on my blog.

  • I have a final script which takes those (mostly) HTML blog entries and puts them on both of my blogs (inter-blog links are different between the two blogs)
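
To illustrate why keeping things 8-bit helps the regular expression step, here is a minimal sketch in Python rather than Lua. It is not the code my blog uses. The byte values 0x11, 0x12 and 0x14 for “á”, “é” and “ñ” come from the Xascii table later in this post. A byte-oriented engine sees each UTF-8 letter as two bytes, so a character class built from UTF-8 letters ends up matching stray bytes of other letters.

```python
import re

# A character class meaning "á or é", written once as UTF-8 bytes and once
# as single Xascii bytes (á = 0x11, é = 0x12 in the Xascii table).
utf8_class = re.compile(b"[" + "áé".encode("utf-8") + b"]")
xascii_class = re.compile(b"[\x11\x12]")

# The test input is "ñ", which is neither á nor é.
utf8_input = "ñ".encode("utf-8")   # two bytes: 0xc3 0xb1
xascii_input = b"\x14"             # one byte (ñ = 0x14 in Xascii)

# In UTF-8, á, é and ñ all start with the byte 0xc3, so the class
# wrongly matches the first byte of ñ.
utf8_hits = utf8_class.findall(utf8_input)

# In Xascii every letter is exactly one byte, so the class behaves.
xascii_hits = xascii_class.findall(xascii_input)

print(utf8_hits, xascii_hits)
```

The UTF-8 class reports a false match on the lead byte of “ñ”, while the Xascii class correctly finds nothing.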

The Xascii format

Instead of using UTF-8, which HTMLDOC doesn’t really like, I use a format called Xascii, which I made up about seven years ago with no intention of using it in the real world. At the time it was a “fantasy” encoding for an alternate universe where low-end computers would use it to encode both English and Spanish, as well as some limited non-ASCII punctuation.

Since both HTMLDOC and Lua prefer to get fixed-width 8-bit input (HTMLDOC is buggy with UTF-8, and Lua’s 8-bit regular expression engine treats each UTF-8 character as a multi-byte sequence, so those characters can’t be part of regular expression classes and whatnot), this fantasy format finally became a reality once I had a real-world use for it. The format is as follows:

  0123456789abcdef
0 .ÁÉÍÑÓÚÜ¡..—..«»
1 •áéíñóúü¿‘’.→“”©
2 .!"#$%&'()*+,-./
3 0123456789:;<=>?
4 @ABCDEFGHIJKLMNO
5 PQRSTUVWXYZ[\]^_
6 `abcdefghijklmno
7 pqrstuvwxyz{|}~♥

Here, “.” (except for 0x2e, the literal period) is a control character. Note that “—” uses the slot for the very rarely used “vertical tab”, so I have hacked HTMLDOC to not consider “vertical tab” a whitespace character, and on the Lua side I have code which massages regular expressions so that the custom class %b matches all whitespace characters except “vertical tab”.
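
As a rough sketch of how the table can be turned into a converter, here is some Python. This is not my actual converter, and the names `ROWS`, `encode_xascii`, `decode_xascii` and `split_words` are made up for this example. It assumes the control slots (tab, newline, space and so on) keep their usual ASCII meaning. The last part shows a whitespace class that leaves out vertical tab, in the same spirit as the HTMLDOC and Lua changes described above.

```python
import re

# Rows of the table above, high nibble 0 through 7. A "." marks a control
# character, except at 0x2e, where it is the literal period.
ROWS = [
    ".ÁÉÍÑÓÚÜ¡..—..«»",
    "•áéíñóúü¿‘’.→“”©",
    ".!\"#$%&'()*+,-./",
    "0123456789:;<=>?",
    "@ABCDEFGHIJKLMNO",
    "PQRSTUVWXYZ[\\]^_",
    "`abcdefghijklmno",
    "pqrstuvwxyz{|}~♥",
]

def build_tables():
    """Return (char -> byte, byte -> char) dictionaries for Xascii."""
    to_byte, to_char = {}, {}
    for high, row in enumerate(ROWS):
        for low, ch in enumerate(row):
            code = high * 16 + low
            if ch == "." and code != 0x2E:
                # Control slot: assumed to pass through unchanged.
                ch = chr(code)
            to_byte[ch] = code
            to_char[code] = ch
    return to_byte, to_char

TO_BYTE, TO_CHAR = build_tables()

def encode_xascii(text):
    """Encode a Unicode string as Xascii; raises KeyError if unsupported."""
    return bytes(TO_BYTE[ch] for ch in text)

def decode_xascii(data):
    """Decode Xascii bytes back to a Unicode string."""
    return "".join(TO_CHAR[b] for b in data)

# Whitespace without vertical tab (0x0b), since that slot holds the dash.
XASCII_SPACE = re.compile(rb"[ \t\n\f\r]+")

def split_words(data):
    """Split Xascii bytes on whitespace, keeping 0x0b as a normal character."""
    return XASCII_SPACE.split(data)

print(encode_xascii("¿Qué?").hex())
print(split_words(encode_xascii("a — b")))
```

Every encoded byte stays below 0x80, which is what makes the format 7-bit. Python’s stock `\s` class for byte patterns includes vertical tab, so it would swallow the dash as if it were a space; the explicit class avoids that.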

The fonts I use to render the webpage have essentially only the characters that are present in Xascii; this allows the font files to be really small and load very quickly, even on older networks (down here in México, 4G is nearly universal on my phone and 5G is almost never available).

This is perfect for my blog, since I only write in English and Spanish. I have all of the non-ASCII characters I need for my blogs, and can use simple, stable, long-lived tools to build the blog.

Getting the code

The code is open-source and available here:

https://github.com/samboy/blog/

Footnote

1: Like this footnote