Using Google Docs as a Web Publishing Platform

When a system becomes complex enough, documentation is unavoidable. I used to believe that our product was simple enough for anyone to understand, but with continued development comes complexity.

To start with, I created an MS Word document to serve as the manual, but this suffered from compatibility problems, and even saving it as a PDF wasn’t ideal: asking a person to turn to page 63 of a PDF when their copy was out of date led to one or two near disasters.

The second iteration was a WordPress site – not unlike this one – on which I could publish the updated help information. However, at that point, including screenshots and the like was a tedious affair: save, upload, insert… This only meant that the work was slow to update and was often out of date.

Iteration three was a progression of the Word document: it was simple enough to convert that document to a Google Doc and use a read-only shared copy as the documentation. This had a few advantages. Editing is remarkably simple, and the document can be downloaded and even printed in a sensible format if required. Importantly, I could send people a URL that would jump them to a specific heading in the document, so it was easy to refer people to the specific help that they needed.

The Google Doc also allows for intra-document linking: text can link to other headings in the document so that the user can jump straight to them. The links are web-standard #-anchors, but in a Google Doc they are a random series of characters, so you wouldn’t know where you’re jumping to if you saw the URL on its own.

That was one thing when the document was 100 pages long, but we’re now approaching 300 pages. The document loads reasonably quickly on my 100Mbps line but setting one’s browser to “3G” network emulation is painful to say the least.

G Suite has a “Publish to Web” feature which trims out all the JS required for editing and allows a much quicker download and rendering of the document. Its formatting is a bit off, but it works. The static web document is automatically updated whenever changes occur in the original. Frustratingly, it includes things like the page numbers in the table of contents. If one hits print from this web view, the page numbers don’t lead anywhere close. The images still take forever to load on a slow connection because we’re loading from a single document.

One irritation with both the Google options is that the URLs I would typically email to people are very long and very ugly and don’t provide any context to what the recipient is about to click on.

This led me to write my own script that fetches this static document, splits it up into sections and spits out a host of smaller HTML files.

Step 1: Define a bunch of constants. This will make life easier later. The important ones are the link to the documentation, the template folder and the output folder. I’ve also included a sitemap which can be generated, and this needs a URL prefix.

 define ("DOCUMENTATION_URL", "https://docs.google.com/document/d/e/XXXXXXXXXX/pub");
define ("TEMPLATE_FOLDER", __DIR__ . DIRECTORY_SEPARATOR . '..' . DIRECTORY_SEPARATOR . 'template' . DIRECTORY_SEPARATOR);
define ("OUTPUT_FOLDER", __DIR__ . DIRECTORY_SEPARATOR . '..' . DIRECTORY_SEPARATOR . 'html' . DIRECTORY_SEPARATOR);
define ("SITEMAP_URL", "https://help.adam.co.za/");

Now that we have the constants, we can fetch the HTML. We do a bit of error checking to make sure that we aren’t about to stuff up the whole site.

 $html = @file_get_contents (DOCUMENTATION_URL);

if ($html === false)
{
echo "Could not retrieve URL " . DOCUMENTATION_URL . ". Aborting.";
exit ();
}

Broadly speaking, the HTML is formatted as follows:

<html>
<head>
<title>Document Title</title>
<style>/* Some arb styling */</style>
</head>
<body>
<div id="header">Document Title</div>
<div id="contents">
<style>/* All Google's CSS styling needed */</style>
<!-- Everything here your document needs... -->
</div>
<div id="footer"><!-- Some stuff... --></div>
<script>/* some JS used to render and sanitize links */</script>
</body>
</html>

I’ll refer to this later, but for now let’s just read this into a DOMDocument object which allows for much easier manipulation of the DOM.

 $dom = new DOMDocument();
if (@$dom->loadHTML ('<?xml encoding="utf-8" ?>' . $html) === false)
{
echo "Could not parse HTML. Aborting.";
exit ();
}

Three important things are happening here. First, we prepend an encoding declaration which forces DOMDocument to interpret the characters as UTF-8. Without this, any extended characters, including curly quotes, will look horrific. Second, we are @-suppressing the warning messages. This document seems to cause problems for DOMDocument, although browsers are happy with it. The issues arise while reading the JavaScript that Google appends to the document, so the errors, in my experience, are not important. Lastly, we check that the HTML was read successfully. If it wasn’t, we abort!
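An alternative to the @ operator is to let libxml collect its parse errors internally, so that they can be inspected or logged rather than silenced. The following is only a sketch under assumptions: the markup in $html is a made-up stand-in for the published document, and nothing here is part of the script described in this post.

```php
<?php
// Hypothetical, deliberately sloppy markup standing in for the published Google document.
$html = '<html><body><div id="contents"><p>“Curly quotes” and café</p><bogus></div></body></html>';

// Collect libxml's parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);

// Inspect, log or ignore the collected errors, then clear the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

if ($loaded === false) {
    fwrite(STDERR, "Could not parse HTML. Aborting.\n");
    exit(1);
}

$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
echo count($errors), " parser message(s) collected\n";
```

The number of messages collected depends on the libxml version installed, so it is better treated as diagnostic output than as something to test against.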

 $xPath = new DOMXPath ($dom);

$style = $xPath->query ('/html/body/div[@id="contents"]/style')->item (0)->nodeValue;
$styleHash = md5 ($style);
file_put_contents (OUTPUT_FOLDER . 'default.css', $style);

Using a DOMXPath object, we navigate the DOM in order to extract the <style> tag within the div#contents tag. Its contents are stored in the $style variable. I generate a hash of them so that if the style contents change, we can signal this to browsers that might otherwise use a cached copy of the CSS file.

The CSS styles are all class based and are named c0, c1, c2 and so on. A small change in the document seems to change this numbering arbitrarily, so it is important that a browser fetching a page that depends on a new style sheet is prompted to download that style sheet rather than reuse a cached copy!

Finally, I write the styles to a CSS file in the output folder which I’ll make use of later.
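To illustrate the cache-busting idea, here is a minimal sketch. The two stylesheet strings and the cssHref helper are hypothetical; the point is only that any change to the CSS produces a different query string, so a browser treats the link as a new resource.

```php
<?php
// Hypothetical stylesheet contents standing in for Google's generated CSS.
$oldStyle = '.c0{color:#000000}.c1{font-weight:700}';
$newStyle = '.c0{font-weight:700}.c1{color:#000000}';

// Build a stylesheet URL whose query string changes whenever the CSS changes.
// (cssHref is an illustrative helper, not part of the original script.)
function cssHref(string $css, string $file = 'default.css'): string
{
    return $file . '?' . md5($css);
}

echo cssHref($oldStyle), "\n";
echo cssHref($newStyle), "\n";
```

The same CSS always yields the same URL, so unchanged stylesheets stay cached.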

 $menuchange = [];
$outline = [];
$h1id = null;
$h2id = null;
$h3id = null;
$nodes = $xPath->query ('/html/body/div[@id="contents"]/*');

One ugliness of the Google document is that its anchors are not intuitively named. I am going to store the existing anchors in the $menuchange array, generate new anchors and use this array as a lookup to replace the ugly with the readable!

The $outline variable is going to be used to store our document in a usable format. I’ll talk about this in some detail just now.

The three heading ID variables are used to keep track of which headings we are under. This is important for document hierarchy and bears some additional discussion at this point.

An HTML document is ultimately flat in its structure but we interpret that differently. An HTML document might be structured as follows:

  • <h1>Heading</h1>
  • <p>Paragraph 1</p>
  • <p>Paragraph 2</p>
  • <h2>Subheading</h2>
  • <p>Paragraph 3</p>
  • <h1>Second Heading</h1>
  • <h2>Subheading 2</h2>
  • <p>Paragraph 4</p>
  • <h3>Sub Sub Heading</h3>
  • <p>Paragraph 5</p>

The structure does not convey the hierarchy. We interpret the hierarchy as follows:

  • <h1>Heading</h1>
    • <p>Paragraph 1</p>
    • <p>Paragraph 2</p>
    • <h2>Subheading</h2>
      • <p>Paragraph 3</p>
  • <h1>Second Heading</h1>
    • <h2>Subheading 2</h2>
      • <p>Paragraph 4</p>
      • <h3>Sub Sub Heading</h3>
        • <p>Paragraph 5</p>

Each non-heading element is a child of the last heading element that came before it. Each heading becomes a child of the nearest preceding heading with a lower number (an <h3> sits under the most recent <h2>, and an <h2> under the most recent <h1>).
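The rule can be sketched in a few lines. This is an illustration only, not the script's own loop: withDepths is a hypothetical helper, the input is a simplified list of tag and text pairs, and it assumes heading levels are never skipped (no <h3> directly under an <h1>).

```php
<?php
// Flat list of [tag, text] pairs, mirroring the flat example document above.
$flat = [
    ['h1', 'Heading'],
    ['p',  'Paragraph 1'],
    ['p',  'Paragraph 2'],
    ['h2', 'Subheading'],
    ['p',  'Paragraph 3'],
    ['h1', 'Second Heading'],
    ['h2', 'Subheading 2'],
    ['p',  'Paragraph 4'],
    ['h3', 'Sub Sub Heading'],
    ['p',  'Paragraph 5'],
];

// Assign each element a depth: a heading sits at its own level (h1 = 0),
// and everything else sits one level below the most recent heading.
function withDepths(array $flat): array
{
    $result = [];
    $currentLevel = 0; // level of the most recent heading; 0 means none seen yet

    foreach ($flat as [$tag, $text]) {
        if (preg_match('/^h([1-6])$/', $tag, $m)) {
            $currentLevel = (int) $m[1];
            $depth = $currentLevel - 1;
        } else {
            $depth = $currentLevel;
        }
        $result[] = ['depth' => $depth, 'tag' => $tag, 'text' => $text];
    }

    return $result;
}

// Print the list indented by depth to make the implied hierarchy visible.
foreach (withDepths($flat) as $item) {
    echo str_repeat('  ', $item['depth']), "<{$item['tag']}>{$item['text']}</{$item['tag']}>\n";
}
```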

I want to treat <h1>s differently and for each to be in its own file to reduce the amount of data that gets downloaded.

In the Google HTML, each heading had its own anchor assigned to it. In my remapping, <h1>s would get their own file, and <h2> and <h3> tags would get the most recent <h1> file with a newly defined anchor for them.

All this leads me to the central worker of this process, where we iterate over the nodes.

 foreach ($nodes as $node)
{
if ($node->nodeName == 'h1' && $node->attributes->getNamedItem ('id') instanceof DOMNode)
{
$h1id = $node->attributes->getNamedItem ('id')->textContent;
$filename = count ($outline) == 0 ? "index.html" : sanitiseTextForLink ($node->textContent) . ".html";
  $outline [$h1id] ['heading'] = $node->textContent;
$outline [$h1id] ['newlink'] = $filename;
  $outline [$h1id] ['content'] = [];

  $menuchange [$h1id] = $outline [$h1id] ['newlink'];
  $h2id = null;
  $h3id = null;
  }
  elseif ($node->nodeName == 'h2' && $node->attributes->getNamedItem ('id') instanceof DOMNode)
  {
$h2id = $node->attributes->getNamedItem ('id')->textContent;
$outline [$h1id] [$h2id] ['heading'] = $node->textContent;
$outline [$h1id] [$h2id] ['id'] = sanitiseTextForLink ($node->textContent);
$outline [$h1id] [$h2id] ['newlink'] = $outline [$h1id] ['newlink'] . "#" . $outline [$h1id] [$h2id] ['id'];
$outline [$h1id] [$h2id] ['content'] = [];

$menuchange [$h2id] = $outline [$h1id] [$h2id] ['newlink'];
$h3id = null;
}
elseif ($node->nodeName == 'h3' && $node->attributes->getNamedItem ('id') instanceof DOMNode)
{
$h3id = $node->attributes->getNamedItem ('id')->textContent;
$outline [$h1id] [$h2id] [$h3id] ['heading'] = $node->textContent;
$outline [$h1id] [$h2id] [$h3id] ['id'] = sanitiseTextForLink ($node->textContent);
$outline [$h1id] [$h2id] [$h3id] ['newlink'] = $outline [$h1id] ['newlink'] . "#" . $outline [$h1id] [$h2id] [$h3id] ['id'];
$outline [$h1id] [$h2id] [$h3id]['content'] = [];

$menuchange [$h3id] = $outline [$h1id] [$h2id] [$h3id] ['newlink'];
}
elseif ($h1id === null)
{
// do nothing.
}
elseif ($h2id === null)
{
$outline [$h1id] ['content'] [] = $node->ownerDocument->saveHTML ($node);
}
elseif ($h3id === null)
{
$outline [$h1id] [$h2id] ['content'] [] = $node->ownerDocument->saveHTML ($node);
}
else
{
$outline [$h1id] [$h2id] [$h3id] ['content'] [] = $node->ownerDocument->saveHTML ($node);
}
}

Let me try and explain what is going on in this block. It’s a lot!

The first ‘if’ block is interested in <h1> tags with an ID. We need to remember this ID so that we can replace it if there are any other references to it in the document. We read the ID into the variable $h1id.

Because we’re dealing with an <h1> tag, we know that this is going to be in a new file. If there are no previous <h1> tags, then this one must be called “index.html” to provide a default page for our site. If it’s not the first <h1> we’ve come across, then we create a filename based on a sanitised version of its text. I’ve included the sanitising function later.

We remember the heading text, the new filename and initialise an array for any content that might appear beneath it.

The lookup for the old to the new anchor link is added to the $menuchange array.

Finally, because we have just seen an <h1> tag, we know that we are not “under” an <h2> or <h3> tag and so we set those values to null.

The next two blocks are similar and deal with <h2> and <h3> tags. There are some differences.

In both these blocks, the ‘newlink’ value is set to the <h1> filename plus the (new) anchor for the heading. This is so that a link still works when the anchor it points to lives in a different file.

The second difference is that we record, additionally, the new sanitised ID. This probably could be extracted from the “newlink” property of the array. I’ve just saved it while we had it.

The last difference is that in the <h2> block, we set the $h3id variable to not reference anything, but we don’t need to do this in the <h3> block since we aren’t worried about any heading levels lower than this.

You will, of course, have noticed the additional levels in the arrays in the latter two blocks. This is what generates the hierarchy that we wanted.

Those first three if-blocks take care of the heading tags. Every other tag can now be slotted in under them.

The fourth if-block checks to see if $h1id is null. As the comment suggests, this block does nothing, effectively discarding any content in the document that appears before the first <h1> tag, such as the cover page and table of contents.

When we reach the fifth if-block, we are at a point where we have an <h1> but not yet an <h2>. The node must therefore belong to the <h1> content. As such we add it at that level in our $outline.

Similarly, we repeat the process if $h3id is null, adding content to the current <h2> tag.

Finally, we add whatever content is left to the <h3> tag.

This other leftover content could include other heading tags (<h4>, <h5>, etc.) but these are not important to our menu structure and so we ignore them. It does mean that any links to heading 4s and 5s will not be managed, but I’m ok with that for this particular project.

Now that we have our document hierarchy, we can begin generating our HTML documents.

 $menu = getMenuStructure ($outline, "");

$sitemap = [];
$template = file_get_contents (TEMPLATE_FOLDER . 'template.html');

foreach ($outline as $h1 => $content)
{
$filename = $outline [$h1] ['newlink'];
$sitemap [] = SITEMAP_URL . $filename;

$html = str_replace ("#title", $outline [$h1] ['heading'], $template);
$html = str_replace ("#cssfile", "default.css?{$styleHash}", $html);
$html = str_replace ("#menu", $menu, $html);

// Start the page body with the <h1> heading.
$content = "\n<h1>" . $outline [$h1] ['heading'] . "</h1>\n";
foreach ($outline [$h1] ['content'] as $htmlSnip)
{
$content .= relink ($htmlSnip);
}
foreach ($outline [$h1] as $h2 => $subheadings)
{
if (substr ($h2, 0, 2) == 'h.')
{
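// Emit the <h2> with its new, readable anchor id.
$content .= "<h2 id=\"" . $outline [$h1] [$h2] ['id'] . "\">" . $outline [$h1] [$h2] ['heading'] . "</h2>\n";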
foreach ($outline [$h1] [$h2] ['content'] as $htmlSnip)
{
$content .= relink ($htmlSnip);
}
foreach ($outline [$h1] [$h2] as $h3 => $subheadings)
{
if (substr ($h3, 0, 2) == 'h.')
{
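// Emit the <h3> with its new, readable anchor id.
$content .= "<h3 id=\"" . $outline [$h1] [$h2] [$h3] ['id'] . "\">" . $outline [$h1] [$h2] [$h3] ['heading'] . "</h3>\n";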
foreach ($outline [$h1] [$h2] [$h3] ['content'] as $htmlSnip)
{
$content .= relink ($htmlSnip);
}
}
}
}
}
$html = str_replace ("#body", $content, $html);

$file = fopen (OUTPUT_FOLDER . $filename, "w");
fwrite ($file, $html);
fclose ($file);
}

We begin here by creating a menu structure. This function is described later, but essentially generates a list of links to include in the file.

We reset our sitemap and begin by reading in our template file. Then we loop through our outline.

We get the filename from our outline property and add the file, with the site URL, to the future site map.

We do a bunch of substitutions into our template: the title (which, because each <h1> has its own file, is the content of the <h1>) and the link to the Google CSS file, including the hash so that browsers get a hint when it’s time to refresh their cached version. We also substitute the menu structure into the template file.

I was considering customising the menu for each file, but I’ve relied on JavaScript instead to format the menu depending on which file it is in. It would, of course, be more efficient to trim it down since each file has the entire menu structure. This might give way to expandable menus in the future. For now, they are just hidden.

The $content variable has the <h1> tag added, followed by any <h1> content and finally by any <h2> sections, each with their own <h3> sections. There is possibly a bit of recursion that I could have employed here, since there are some fairly similar repeated processes for the three heading levels.
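As a rough idea of what that recursion might look like, here is a sketch. renderSection is a hypothetical function, the sample outline is made up, and the heading markup (including the id attribute) is an assumption about what the generated pages need. The $transform parameter stands in for relink so that the example runs on its own.

```php
<?php
// Render one outline node and, recursively, any child headings beneath it.
// Children are recognised by keys starting with "h.", as in the loop above.
// $transform stands in for relink(); by default snippets are left untouched.
function renderSection(array $section, int $level = 1, ?callable $transform = null): string
{
    $transform = $transform ?? fn (string $snippet): string => $snippet;

    // Only <h2> and <h3> entries carry an 'id' in the outline.
    $idAttr = isset($section['id']) ? " id=\"{$section['id']}\"" : '';
    $html = "<h{$level}{$idAttr}>{$section['heading']}</h{$level}>\n";

    foreach ($section['content'] as $snippet) {
        $html .= $transform($snippet);
    }

    foreach ($section as $key => $child) {
        if (is_array($child) && substr((string) $key, 0, 2) === 'h.') {
            $html .= renderSection($child, $level + 1, $transform);
        }
    }

    return $html;
}

// Hypothetical outline fragment in the same shape the main loop builds.
$section = [
    'heading' => 'Getting Started',
    'newlink' => 'index.html',
    'content' => ["<p>Intro</p>\n"],
    'h.abc123' => [
        'heading' => 'Logging In',
        'id'      => 'logging-in',
        'newlink' => 'index.html#logging-in',
        'content' => ["<p>Use your email address.</p>\n"],
    ],
];

echo renderSection($section);
```

With something like this, the three nested loops would collapse into a single call per <h1>, at the cost of a slightly less obvious control flow.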

The relink function, also described later, does some regex magic to substitute old anchors with new anchors and update any links to those anchors. In addition, it reformats the external URLs (so that they don’t go via Google’s “you’ve clicked on a link… are you sure” service). This also finds YouTube videos and replaces the paragraph they’re in with an embedded video. That’s quite smart, I thought!

We end this block of code by substituting our body content into our template and then writing the file into the output folder. Next!

 copy (TEMPLATE_FOLDER . 'custom.css', OUTPUT_FOLDER . 'custom.css');
copy (TEMPLATE_FOLDER . 'display.js', OUTPUT_FOLDER . 'display.js');
copy (TEMPLATE_FOLDER . 'logo.png', OUTPUT_FOLDER . 'logo.png');

file_put_contents (OUTPUT_FOLDER . "sitemap.txt", implode ("\n", $sitemap));

The last important step of this process is to copy the assets into the output folder, including the JavaScript file that is ultimately responsible for the menu formatting, and to write out the sitemap.

function relink ($html)
{
global $menuchange;
$matches = [];
preg_match_all ('/<a [^>]+href="#(h\.[a-zA-Z0-9]+)"/s', $html, $matches);
foreach ($matches [1] as $match)
{
if (isset ($match) && isset ($menuchange [trim ($match)]))
{
$html = str_replace ("#{$match}", $menuchange [trim ($match)], $html);
}
}

$matches = [];
preg_match_all ('/<a [^>]*href="(https?:\/\/www\.google\.com\/url\?q=([^"]*)&sa=D&ust=[0-9]+)"[^>]*>/s', $html, $matches);
foreach ($matches [0] as $key => $match)
{
if (isset ($matches [1] [$key]) && isset ($matches [2] [$key]))
{
$html = str_replace ($matches [1] [$key], $matches [2] [$key], $html);
}
}

$matches = [];
preg_match_all ('/<p[^>]*>.+?<a [^>]*href="https?:\/\/(www\.)?youtu(be\.com|\.be)\/(watch\?v=)?([a-zA-Z0-9-_]+)"[^>]*>.+?<\/p>/', $html, $matches);
foreach ($matches [0] as $key => $match)
{
if (isset ($matches [4] [$key]) && !empty ($matches [4] [$key]))
{
$html = "<iframe width=\"640\" height=\"360\" src=\"https://www.youtube.com/embed/{$matches [4] [$key]}\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>";
}
}

return $html;
}

This block of code does my link substitution. I’m a bit worried about the middle part – the substitution of Google’s links – since that could change without notice by Google. The YouTube embedding should be a little more predictable.
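One way to make the Google link handling a little less dependent on the exact parameter order is to parse the URL instead of matching it with a regex. The sketch below is an assumption-laden alternative, not what the script does: unwrapGoogleRedirect is a hypothetical helper, and it still relies on the redirect living at www.google.com/url with the destination in the q parameter.

```php
<?php
// Return the real destination of a Google redirect link, or the URL unchanged
// if it does not look like one. Assumes the https://www.google.com/url?q=...
// shape described above; Google could change it at any time.
function unwrapGoogleRedirect(string $href): string
{
    // Attribute values pulled from HTML may still have &amp; entities in them.
    $url = html_entity_decode($href, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    $parts = parse_url($url);
    if ($parts === false
        || ($parts['host'] ?? '') !== 'www.google.com'
        || ($parts['path'] ?? '') !== '/url'
        || !isset($parts['query'])) {
        return $href;
    }

    // parse_str() splits the query string and URL-decodes the values.
    parse_str($parts['query'], $query);

    return isset($query['q']) && is_string($query['q']) && $query['q'] !== ''
        ? $query['q']
        : $href;
}

// Hypothetical links for illustration.
echo unwrapGoogleRedirect('https://www.google.com/url?q=https://example.com/page?id%3D7&amp;sa=D&amp;ust=1600000000000'), "\n";
echo unwrapGoogleRedirect('https://example.com/direct'), "\n";
```

Links that do not match the expected shape are returned untouched, so a change on Google's side would leave the redirect in place rather than break the link.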

The anchor linking is done in the first part, using a global (yuck) variable to access our substitution lookup.
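The global could be avoided by passing the lookup in as an argument. A minimal sketch, assuming a hypothetical relinkAnchors function and a made-up lookup entry; it covers only the anchor substitution, not the Google or YouTube handling.

```php
<?php
// Rewrite href="#h.xxxx" anchors using a lookup passed in as an argument,
// rather than reached through a global. Unknown anchors are left untouched.
function relinkAnchors(string $html, array $menuchange): string
{
    $result = preg_replace_callback(
        '/href="#(h\.[a-zA-Z0-9]+)"/',
        function (array $m) use ($menuchange): string {
            return isset($menuchange[$m[1]])
                ? 'href="' . $menuchange[$m[1]] . '"'
                : $m[0];
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

// Hypothetical lookup and snippet for illustration.
$menuchange = ['h.abc123' => 'billing.html#invoices'];
$snippet = '<p>See <a class="c1" href="#h.abc123">Invoices</a> and <a href="#h.zzz">this</a>.</p>';

echo relinkAnchors($snippet, $menuchange), "\n";
```

Replacing the whole href attribute in one pass also means one anchor ID can never be mistaken for the prefix of a longer one.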

function getMenuStructure ($outline, $current, $level = 1)
{
if ($level > 3)
{
return '';
}

$menu = "";
foreach ($outline as $heading1 => $details)
{
if (isset ($outline [$heading1] ['heading']))
{
$menu .= '<li><a href="' . $outline [$heading1] ['newlink'] . '">' . $outline [$heading1] ['heading'] . "</a></li>\n";
}
if (is_array ($outline [$heading1]))
{
$menu .= getMenuStructure ($details, $current, $level + 1);
}
}

if ($menu == '')
{
return '';
}

$final = "<ul class='menu{$level}'>" . trim ($menu) . "</ul>\n";

return $final;
}

Here, I am looking through the outline to get any ‘heading’ elements of the current array. There is a recursive call to get headings further down the hierarchy. A nested, unordered list (<ul>) takes care of the indentations for me, which are styled using CSS.

Finally, the sanitiseTextForLink function:

function sanitiseTextForLink ($text)
{
return strtolower (str_replace ('--', '-', str_replace (' ', '-', preg_replace ('/[^a-zA-Z0-9 ]/', '', $text))));
}
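A few sample calls show what the function produces. The headings are made up for illustration, and the function is repeated so that the snippet runs on its own.

```php
<?php
// Copied from the listing above so that the example runs on its own.
function sanitiseTextForLink ($text)
{
    return strtolower (str_replace ('--', '-', str_replace (' ', '-', preg_replace ('/[^a-zA-Z0-9 ]/', '', $text))));
}

// Hypothetical headings for illustration.
echo sanitiseTextForLink ('Getting Started'), "\n"; // getting-started
echo sanitiseTextForLink ('Fees & Billing'), "\n";  // fees-billing
echo sanitiseTextForLink ('What’s New?'), "\n";     // whats-new
```

Note that two headings which reduce to the same text would end up sharing a filename or anchor, so heading text needs to stay reasonably distinct.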

I’m not sharing the CSS or JS with you for the final product since this post has gone on for long enough!

I am also aware that this is not pleasant code to look at, but it seems to work really well. Any changes that I make are updated in the help site within 10 minutes with no additional work from me.