Healing the Web
23 07 2006
One of the reasons that I stayed on my old blogging engine (.Text) so long was that I had a huge amount of incoming links which I didn’t really want to break, not so much that I wanted to keep the traffic, but because if someone is following a link its usually because they are interested at what is at the end of it and there is nothing more frustraiting than getting a 404, or potentially even worse a DNS resolution problem.
For that reason when I started migrating my content over here to WordPress I committed to making sure that as many of the links as possible to the old blog would not return a 404, but would instead redirect to the new location of the same content on WordPress. This presented a number of challenges:
- Extracting a faithful representation of my content from .Text.
- Importing that into WordPress whilst loosing as little detail as possible.
- Introducing redirects for old post URLs.
- Contacting site owners who have linked to my blog and asking them to fix the links (with specific details).
Getting the content out of .Text was fairly straight forward. I went to the BlogML homepage at CodePlex and used the SDK (version 1.0.0) to rip it out. There were a few minor issues there because since I first submitted the code to the project in the first place, there had been a number of modifications to support later versions of .Text than I was running – that broke me. It took me about half an hour debugging the issue to remove the stuff that I didn’t need.
Once that was done I needed to transform the BlogML file into the native WXR format that WordPress supports for the bulk import of content. This was fairly straight forward and it was “fun” playing with XSLT, and in the end I had a representation that was faithful to the sample of WXR content that I had generated from WordPress – or so I thought.
WordPress Import Sagas
When I first tried importing my two megabyte content file it bombed out completely with a timeout error. After further investigation I discovered that this is probably due to a configuration on the host which stops scripts running for longer than a specified period of time.
I thought this was a bit odd since the code I used to generate the XML in the first place only took a few seconds to run and I wasn’t even being particularly careful about memory utilisation. I suspect that the LAMP platform that they are running on is probably not that fast at processing XML content based on how long it was running before it timed out (either that or the MySQL is painfully slow).
Anyway – enough LAMP-bashing. I wrote a program that takes the WXR file and breaks it down to small chunks. In the end I had to break them down into chunks of only ten items each. Each file then had to be uploaded seperately – thats right, I individually uploaded 100+ files.
Of course, before I got to that point I did quite a bit of testing to make sure the data was being imported correctly and I’m afraid that there is inconsistent date handling in there somewhere because while the dates are pretty much correct for the posts and comments, the times on the comments in particular are all 12:00am – this was odd because the input was based on the export that WordPress gave me so I would have assumed it would accept the same format inwards.
I think the issue here is that the WXR format extends RSS using XML namespaces, and the elements within that namespace add redundant information which could have been pushed into the pubDate element defined as part of RSS. The result is that with WXR, there are three dates that the WordPress processing code has to deal with – BlogML on the other hand is much more precise and there is no redundant information.
Next Steps
Its been an interesting week with WordPress. I’ve been mostly focused on getting my content across, but over the next little while I’ll start mapping the old .Text post URLs to the new WordPress generated ones.
I’ll probably do this by exporting the content from WordPress and doing a match on the post titles (although I know there are a few formatting issues that will lead to mismatches – but it will get it down to enough to handle match them). I can then put redirects into the ASP.NET configuration file at http://notgartner.com which will redirect people that land there get redirected to the new location.
Thats not the end of it though because eventually I don’t want those old URLs to be used, so using search engines (MSN Search and Google) I’ll find all the pages that link to the new URLs and see if I can get them changed.
With a bit of luck I’ll be able to turn the redirects off by the end of the year, but I will monitor the statistics on the old hosting provider.
Categories : Blogging



Recent Comments