Adjusting WinHTTrack Mirror Directory Structure – P101 Network Information Center

This post is sort of a Part 2 to the last one.

After the last post I was able to successfully access the Pirate101 Loot website at loot.p101, but there was one issue. To reach the content of the website, I had to navigate to loot.p101/pirate101loot.webs.com. This technically works, but it is ugly and hard to remember. If I didn’t fix it, it would weigh on my conscience.

The WinHTTrack program that I use to download live websites is great. It does exactly what I need, and all I have to do is paste the URL and click a few buttons. The only problem is that it saves the files in an odd directory structure. Each domain that a page gets saved from has its own folder one level under the root. This makes sense, because you wouldn’t want files from different sites conflicting in the same folder. If you are only browsing the downloaded website on your computer this is fine, but when the website gets rehosted it causes problems.

Screen cap taken at the root level of downloaded website, with each domain having its own directory in the next level of the file tree.

All I needed to do was copy every single file in the pirate101loot.webs.com folder one level higher. This would put the homepage index.html file at the root directory (the first “/”) so the user sees the homepage at loot.p101/index.html and not loot.p101/pirate101loot.webs.com/index.html. This is the easy part. The hard part is changing the links and script calls contained in the code of each webpage so that they point to the new, correct place.

This is the structure of a script call from the index.html file in the pirate101loot.webs.com folder seen above. The two periods at the beginning say to go up one folder, into the root. The root folder is the one in the screenshot above, containing a folder for each of the websites that a page was saved from. After going up to the root, the browser looks for the script in ajax.googleapis.com and works down through its sub-folders until it reaches the jquery.min.js file.
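To make the path resolution concrete, here is a small sketch using Python’s standard `urllib.parse.urljoin`, which resolves a relative reference the way a browser would. The sub-folders between ajax.googleapis.com and jquery.min.js (`ajax/libs/jquery`) are placeholders I made up for illustration, not the exact path from the page.

```python
from urllib.parse import urljoin

# Hypothetical script reference; the sub-folders are placeholders.
OLD_REF = "../ajax.googleapis.com/ajax/libs/jquery/jquery.min.js"
NEW_REF = "/ajax.googleapis.com/ajax/libs/jquery/jquery.min.js"


def resolve(page_url, ref):
    """Return the absolute URL that `ref` points to when used on `page_url`."""
    return urljoin(page_url, ref)


# Before the move: the page lives one level below the root, so ".." climbs to the root.
before = resolve("http://loot.p101/pirate101loot.webs.com/index.html", OLD_REF)

# After the move: the page is at the root and the reference starts from "/".
after = resolve("http://loot.p101/index.html", NEW_REF)

print(before)
print(after)
```

Both calls end up at the same file under loot.p101/ajax.googleapis.com, which is the whole point: the script stays where it is, and only the page that references it moves.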

This is how the updated script call should look. Because the index.html file will now sit in the root directory, the path no longer needs to go up a level first. Sounds easy enough: remove the two dots at the beginning of every file location. This is true, but it would become very tedious to do this for every call in every file by hand. So instead, I decided to write a little something in Python.
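As a rough idea of what “remove the two dots” can look like in code, here is a minimal sketch using a regular expression. It is not the script from this post; the function name `strip_leading_dots` and the choice to only touch `src` and `href` attributes are my own assumptions.

```python
import re

# Matches src="../ or href="../ (single or double quotes) and captures
# everything up to and including the opening quote. The lookahead keeps the
# slash, so "../folder/file.js" becomes "/folder/file.js".
PARENT_REF = re.compile(r'''((?:src|href)\s*=\s*["'])\.\.(?=/)''', re.IGNORECASE)


def strip_leading_dots(html):
    """Drop the first '..' from every src/href value that starts with '../'."""
    return PARENT_REF.sub(r"\1", html)


sample = '<script src="../ajax.googleapis.com/lib/jquery.min.js"></script>'
print(strip_leading_dots(sample))
```

Links that do not start with two dots, such as `href="about.html"`, are left alone. Only the first pair of dots is removed, so a reference that climbs more than one level (`../../`) would still need a look by hand.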

The script I ended up with was only 46 lines long and accomplished exactly what I needed.

On the left is an image of the script in its entirety. There were blank lines between the code blocks to make it easier to read, but I removed them in order to fit the whole thing into view without needing a scroll bar. The middle image is what the program looks like when it is first run. You are prompted to enter the root directory of the downloaded website files (in this case `Loot_copy3`). From there the terminal prints out all of the folders that contain website files and waits for you to enter the one you want to modify. After you pick this folder, the program starts fixing all of the webpages and prints out a nice little file tree as it goes. I am not sure what is causing the error at the end, but I had to manually fix the script calls for that file.
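For anyone who cannot see the images, the flow described above can be sketched roughly like this. This is my own reconstruction of the idea, not the 46-line script itself: the names `list_site_folders` and `fix_folder` are hypothetical, the interactive prompts are left out, and I chose to read and write the files as raw bytes so that the sketch makes no assumption about each page’s text encoding.

```python
import os
import re

# Same idea as removing the leading two dots by hand, applied to raw bytes.
PARENT_REF = re.compile(rb'''((?:src|href)\s*=\s*["'])\.\.(?=/)''', re.IGNORECASE)


def list_site_folders(root):
    """Return the names of the folders directly under the download root."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


def fix_folder(folder):
    """Rewrite every .html/.htm file under `folder`, printing a file tree.

    Returns the list of files that were actually changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        rel = os.path.relpath(dirpath, folder)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        print("    " * depth + os.path.basename(os.path.normpath(dirpath)) + "/")
        for name in sorted(filenames):
            print("    " * (depth + 1) + name)
            if not name.lower().endswith((".html", ".htm")):
                continue  # leave images, scripts, and stylesheets untouched
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                original = f.read()
            updated = PARENT_REF.sub(rb"\1", original)
            if updated != original:
                with open(path, "wb") as f:
                    f.write(updated)
                changed.append(path)
    return changed
```

A call such as `fix_folder(os.path.join("Loot_copy3", "pirate101loot.webs.com"))` would process the folder from this post. It is worth running this on a copy of the download first, since the files are modified in place.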

Now that the script calls in the files point to the new location, you can move the files. First move all of the files that WinHTTrack generated in the root into their own directory, so that there are no conflicts between any pages that may have the same name (usually index.html). Then just go into a file explorer and cut+paste all of the files from the website’s folder up one directory.
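If you would rather script the move than use a file explorer, something along these lines should work. It is only a sketch: `move_up_one_level` is a name I made up, and it deliberately refuses to run if anything in the parent folder has the same name, rather than overwriting it.

```python
import shutil
from pathlib import Path


def move_up_one_level(site_dir):
    """Move everything inside `site_dir` into its parent folder.

    Raises FileExistsError before moving anything if a name is already
    taken in the parent. Returns the sorted names of the moved entries.
    """
    site_dir = Path(site_dir)
    parent = site_dir.parent
    entries = list(site_dir.iterdir())

    # Check for clashes first so a failed run leaves the folders untouched.
    conflicts = sorted(e.name for e in entries if (parent / e.name).exists())
    if conflicts:
        raise FileExistsError(f"already present in {parent}: {conflicts}")

    for entry in entries:
        shutil.move(str(entry), str(parent / entry.name))
    return sorted(e.name for e in entries)
```

Calling `move_up_one_level("Loot_copy3/pirate101loot.webs.com")` would be the equivalent of the cut+paste step. If it raises because of an existing index.html in the root, that is the signal to move the WinHTTrack-generated files out of the way first.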

I realize that most of you probably do not care too much about the stuff going on in the background, but I think that it is important to document the problems that I encounter and how I was able to solve them. For anyone curious, or if you want to do the same for one of your projects, the Python script can be downloaded here.
