Archival Process Part 4: Cleaning Up

I know I said at the beginning of this series that it would only be three parts. I was going to fold some additional content into Part 3 and wrap things up, but I kept running into problems that spilled over into a Part 4. Continuing the trend of increasing difficulty throughout the archival process, this part is the most difficult yet. Keep in mind that if you are only archiving sites for personal use and not re-hosting them, you shouldn’t need to follow this tutorial.

Repairing Links between Proxied Websites

If you worked through Part 3 and managed to re-host a website or two, you probably noticed that many, if not all, of the external links were broken. This is because up to this point we have only provided proxy information on a domain-by-domain basis in each virtual host file. Now we want each site to know about every other site so that they can point to each other. To do this we must add a few things to the global Apache configuration file.

eric@localhost:/etc/apache2$ sudo vi apache2.conf

The apache2.conf file controls everything that has to do with the Apache installation on the server, so it is important to be very careful when modifying it. In my installation it can be found at /etc/apache2/apache2.conf, one directory above sites-available/, which contains the virtual hosts. At the very bottom of the file add the following lines:

SSLProxyEngine On
ProxyHTMLEnable On
ProxyHTMLExtended On

You should recognize these directives from the individual virtual host files created previously. I’m pretty sure that because they are now in the apache2.conf file they are no longer needed in the virtual hosts, but I have yet to test that. Now that the easy part is done, the more tedious step begins: adding a ProxyHTMLURLMap for every single website. I am going to include two below as examples, but make sure to add one for each site you want to proxy.

ProxyHTMLURLMap https?://(www\.)?pirate101\.com/ http://www.home.p101/ Ri
ProxyHTMLURLMap https?://piratescope\.blogspot\.com/ http://www.scope.p101/ Ri

Again, the ProxyHTMLURLMap directive is not new. This time, instead of using a single “/”, the whole address is used because the global configuration file does not discern between subdomains the way virtual hosts do. The ?s and \s are there because the first argument is a regular expression, a powerful tool for finding patterns in a given text. The first statement finds all of these strings:

http://pirate101.com/, https://pirate101.com/, http://www.pirate101.com/, https://www.pirate101.com/

and replacing them with http://www.home.p101/. This happens because ? tells the parser to include strings both with and without the expression immediately before it (s and www.). Backslashes escape the “.”, which normally represents any character in the world of regular expressions. I chose not to check for the leading www. string in the second example because it is unlikely that anyone will link to www.piratescope.blogspot.com (few people use www at the third level, much less the fourth).
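If you want to check a pattern before putting it into apache2.conf, a rough sketch like the one below can help. It uses grep's extended regular expressions, which are only an approximation of the engine Apache uses, and the last URL in the list is a made-up negative case showing what the escaped “.” rules out.

```shell
#!/usr/bin/env bash
# Pattern copied from the first ProxyHTMLURLMap line above.
pattern='https?://(www\.)?pirate101\.com/'

# Succeeds when the whole URL matches the pattern, ignoring case
# (similar in spirit to the "i" flag). -x requires a full-line match.
matches() {
  printf '%s\n' "$1" | grep -Eixq -- "$pattern"
}

for url in \
  'http://pirate101.com/' \
  'https://pirate101.com/' \
  'http://www.pirate101.com/' \
  'https://www.pirate101.com/' \
  'http://pirate101xcom/'   # made-up case: no literal dot before "com"
do
  if matches "$url"; then
    echo "match:    $url"
  else
    echo "no match: $url"
  fi
done
```

The first four URLs should be reported as matches and the last one should not, since the backslash forces a literal dot.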

Finally, make sure that you have enabled the modules that provide the directives listed above if you haven’t already, and then reload Apache. It does not matter which directory these commands are executed from.

sudo a2enmod proxy
sudo a2enmod proxy_html
sudo systemctl reload apache2

Now proxied websites should have no problem linking to both other proxied domains and those that are mirrored from a download.

Repairing Links between Mirrored Websites

This step requires a process similar to the one detailed above, so I’m not going to explain it in as much detail. Again open the apache2.conf file and add the following line:

AddOutputFilterByType Substitute text/html

Just like above, this initial statement readies the server to modify HTML, but in my setup the Substitute filter is the one that handles files stored on the server. The individual site regular expressions look similar but have slightly different formatting:

Substitute s|https?://(www\.)?pirate101\.com|http://www.home.p101|i
Substitute s|https?://piratescope\.blogspot\.com|http://www.piratescope.p101|i

The main differences are the string delimiters changing to pipes “|” and the absence of the trailing R. These changes are present here because the Substitute directive takes regular expressions by default whereas it had to be specifically asked for under ProxyHTMLURLMap. Normally slashes are used as string delimiters but I didn’t want to have to escape each one with a backslash (https?:\/\/), so I opted to use pipes as delimiters instead.
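The delimiter choice is easy to try out on the command line, since sed uses the same s/pattern/replacement/ shape. This is only an illustration with sed, not Apache itself, and the sample HTML line is made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sample HTML line (made up for this demonstration).
html='<a href="https://www.pirate101.com/free_game">Play</a>'

# Slash delimiters: every "/" in the pattern and replacement needs a backslash.
with_slashes=$(printf '%s\n' "$html" |
  sed -E 's/https?:\/\/(www\.)?pirate101\.com/http:\/\/www.home.p101/')

# Pipe delimiters: the same rewrite with no escaped slashes.
# (The case-insensitive flag is left out here because its spelling
# differs between sed implementations.)
with_pipes=$(printf '%s\n' "$html" |
  sed -E 's|https?://(www\.)?pirate101\.com|http://www.home.p101|')

echo "$with_slashes"
echo "$with_pipes"
```

Both commands should print the same rewritten line; the pipe version is simply easier to read.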

Finally, don’t forget to enable Substitute and reload Apache.

sudo a2enmod substitute
sudo systemctl reload apache2

In case you’re curious, the bottom of my apache2.conf file ended up holding the global directives shown above, plus one ProxyHTMLURLMap line and one Substitute line for each site.

Fixing Absolute Links on WordPress

This step only pertains to those of you using WordPress for a network information center on a Handshake TLD (an incredibly small subset of people that as of right now probably only includes me). I noticed when I first set up this website that none of my links to other pages on the same domain worked. This is because WordPress decides for some reason to use absolute URLs instead of relative paths. For example, a file might be linked by its full address, domain included, instead of just /index.html.

No one would ever know the difference when using a legacy ICANN domain because both would resolve the same way. For a website on a Handshake TLD viewed through a gateway, though, the browser would still look for the page at the bare .p101 address instead of at (http://45.62.212.55.hns.to)/index.html. This causes a problem because not everyone can resolve .p101 domains without a gateway. The easy fix is to install a WordPress plugin that handles it for you. I use Relative URL, but any plugin for this purpose should work. I haven’t had any issues with links going to the same domain since.
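To make the idea concrete, here is a small sketch of what turning same-site absolute links into relative ones amounts to. It is not how the Relative URL plugin is implemented, and example.p101 and other.p101 are hypothetical domains used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical site address; substitute your own domain.
site='http://example.p101'

# Made-up HTML with one same-site link and one link to another domain.
html='<a href="http://example.p101/about/">About</a> <a href="http://other.p101/">Other</a>'

# Drop the scheme and host from links that point at our own site, leaving
# a path that resolves against whatever address the visitor used.
# (The dot in $site is treated as "any character" here, which is loose
# but good enough for a sketch.)
relative=$(printf '%s\n' "$html" | sed -E "s|(href=\")${site}/|\1/|g")

echo "$relative"
```

Only the first link is rewritten to /about/; the link to the other domain is left alone, which is the case the next section deals with.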

Fixing Absolute Links on Proxied/Mirrored Sites

The same issue that was fixed above for WordPress links can exist on sites that are proxied or mirrored. If someone using a Handshake gateway follows an absolute link to another domain, it will point to the original destination without the gateway. Absolute links would not be an issue if everyone natively resolved Handshake domains without a gateway, but alas that is still a long way off. This is currently not an easy fix, so I will dedicate an entire post to this single problem.
