Healing the Web

One of the reasons that I stayed on my old blogging engine (.Text) for so long was that I had a huge number of incoming links which I didn’t want to break. It was not so much that I wanted to keep the traffic; if someone is following a link, it’s usually because they are interested in what is at the end of it, and there is nothing more frustrating than getting a 404 or, potentially even worse, a DNS resolution problem.

For that reason when I started migrating my content over here to WordPress I committed to making sure that as many of the links as possible to the old blog would not return a 404, but would instead redirect to the new location of the same content on WordPress. This presented a number of challenges:

  1. Extracting a faithful representation of my content from .Text.
  2. Importing that into WordPress whilst losing as little detail as possible.
  3. Introducing redirects for old post URLs.
  4. Contacting site owners who have linked to my blog and asking them to fix the links (with specific details).

Getting the content out of .Text was fairly straightforward. I went to the BlogML homepage at CodePlex and used the SDK (version 1.0.0) to rip it out. There were a few minor issues: since I first submitted the code to the project, there had been a number of modifications to support later versions of .Text than the one I was running, and those broke the export for me. It took me about half an hour of debugging to remove the stuff that I didn’t need.

Once that was done I needed to transform the BlogML file into the native WXR format that WordPress supports for the bulk import of content. This was fairly straightforward, and it was “fun” playing with XSLT. In the end I had a representation that was faithful to the sample of WXR content that I had generated from WordPress, or so I thought.
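For readers who would rather not use XSLT, the same kind of mapping can be sketched in Python. This is not the stylesheet I used. It assumes BlogML posts are `post` elements with `title` and `content` children and a `post-url` attribute, it ignores comments, categories and dates, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS content module, used for the full post body.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("content", CONTENT_NS)


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def blogml_to_rss(blogml_text):
    """Convert BlogML posts into a bare RSS 2.0 document string.

    Assumption: posts are <post> elements carrying <title> and <content>
    children and a 'post-url' attribute. Everything else is skipped.
    """
    source = ET.fromstring(blogml_text)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")

    for post in source.iter():
        if local_name(post.tag) != "post":
            continue
        item = ET.SubElement(channel, "item")
        # Only direct children are read, so nested comment titles are ignored.
        for child in post:
            name = local_name(child.tag)
            if name == "title":
                ET.SubElement(item, "title").text = child.text
            elif name == "content":
                body = ET.SubElement(item, "{%s}encoded" % CONTENT_NS)
                body.text = child.text
        # Keep the old URL so that redirects can be generated later.
        ET.SubElement(item, "link").text = post.get("post-url", "")

    return ET.tostring(rss, encoding="unicode")
```

Matching on local element names avoids hard-coding the BlogML namespace, which may differ between SDK versions.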

WordPress Import Sagas

When I first tried importing my two megabyte content file it bombed out completely with a timeout error. After further investigation I discovered that this is probably due to a configuration on the host which stops scripts running for longer than a specified period of time.

I thought this was a bit odd, since the code I used to generate the XML in the first place took only a few seconds to run and I wasn’t even being particularly careful about memory utilisation. Based on how long the import ran before it timed out, I suspect that the LAMP platform they are running on is not that fast at processing XML content (either that or MySQL is painfully slow).

Anyway, enough LAMP-bashing. I wrote a program that takes the WXR file and breaks it down into small chunks. In the end I had to break it down into chunks of only ten items each. Each file then had to be uploaded separately. That’s right, I individually uploaded 100+ files.
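The splitting step can be sketched roughly as follows. This is a minimal Python illustration rather than the program I actually wrote. It assumes the usual RSS layout of `rss/channel/item`, and the name `split_wxr` is made up for the example.

```python
import copy
import io
import xml.etree.ElementTree as ET


def split_wxr(xml_text, chunk_size=10):
    """Split an RSS/WXR document into documents of at most chunk_size items.

    Every chunk keeps the channel metadata (title, link and so on) so that
    it is a complete feed on its own.
    """
    # Re-register the prefixes used in the source (wp:, content:, ...) so
    # that they are written back out instead of ns0:, ns1:, ...
    for _, (prefix, uri) in ET.iterparse(io.StringIO(xml_text),
                                         events=("start-ns",)):
        ET.register_namespace(prefix, uri)

    root = ET.fromstring(xml_text)
    items = root.find("channel").findall("item")

    # Template: a copy of the document with every item removed.
    template = copy.deepcopy(root)
    template_channel = template.find("channel")
    for item in template_channel.findall("item"):
        template_channel.remove(item)

    chunks = []
    for start in range(0, len(items), chunk_size):
        chunk_root = copy.deepcopy(template)
        chunk_channel = chunk_root.find("channel")
        for item in items[start:start + chunk_size]:
            chunk_channel.append(copy.deepcopy(item))
        chunks.append(ET.tostring(chunk_root, encoding="unicode"))
    return chunks
```

Each returned string can then be written to its own file and uploaded through the import screen. The chunk size of ten is simply what worked on my host; a faster host may cope with more.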

Of course, before I got to that point I did quite a bit of testing to make sure the data was being imported correctly, and I’m afraid that there is inconsistent date handling in there somewhere. While the dates are pretty much correct for the posts and comments, the times on the comments in particular are all 12:00am. This was odd because the input was based on the export that WordPress gave me, so I would have assumed it would accept the same format on the way in.

I think the issue here is that the WXR format extends RSS using XML namespaces, and the elements within that namespace add redundant information which could have been pushed into the pubDate element defined as part of RSS. The result is that with WXR, there are three dates per post that the WordPress processing code has to deal with (pubDate, wp:post_date and wp:post_date_gmt). BlogML, on the other hand, is much more precise and there is no redundant information.
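One way to keep the three values from drifting apart is to derive them all from a single timestamp. The sketch below is an illustration in Python, not something taken from the WordPress code. It assumes the `wp:` elements use a `YYYY-MM-DD HH:MM:SS` layout, and I have not confirmed that this is what caused the 12:00am comment times.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def wxr_dates(local_dt):
    """Return the three WXR date strings for one timezone-aware datetime."""
    if local_dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc_dt = local_dt.astimezone(timezone.utc)
    return {
        # RFC 822 style date, as used by plain RSS.
        "pubDate": format_datetime(utc_dt),
        # Local wall-clock time of the post (assumed layout).
        "wp:post_date": local_dt.strftime("%Y-%m-%d %H:%M:%S"),
        # The same instant expressed in GMT (assumed layout).
        "wp:post_date_gmt": utc_dt.strftime("%Y-%m-%d %H:%M:%S"),
    }


# Example: a post written at 1:19pm in a UTC+10 time zone.
example = wxr_dates(datetime(2006, 9, 5, 13, 19, 0,
                             tzinfo=timezone(timedelta(hours=10))))
```

Generating all three from one source value means a mismatch can only come from the importer, not from the exported file.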

Next Steps

It’s been an interesting week with WordPress. I’ve been mostly focused on getting my content across, but over the next little while I’ll start mapping the old .Text post URLs to the new WordPress-generated ones.

I’ll probably do this by exporting the content from WordPress and doing a match on the post titles (although I know there are a few formatting issues that will lead to mismatches, it will get the remainder down to few enough to match by hand). I can then put redirects into the ASP.NET configuration file at http://notgartner.com so that people who land there are redirected to the new location.
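The title matching could look something like the sketch below. It is written in Python purely as an illustration; the function names and the example.com URLs in the usage lines are hypothetical, and the normalisation rule (lower-case, punctuation collapsed to spaces) is just one guess at how to absorb the formatting differences.

```python
import re


def normalize_title(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def build_redirect_map(old_posts, new_posts):
    """Match (title, url) pairs from both blogs by normalized title.

    Returns (redirects, unmatched): a dict of old URL -> new URL, and the
    list of old URLs that still need to be matched by hand.
    """
    new_by_title = {}
    ambiguous = set()
    for title, url in new_posts:
        key = normalize_title(title)
        if key in new_by_title:
            # Two new posts share a title, so do not guess between them.
            ambiguous.add(key)
        new_by_title[key] = url

    redirects, unmatched = {}, []
    for title, url in old_posts:
        key = normalize_title(title)
        if key in new_by_title and key not in ambiguous:
            redirects[url] = new_by_title[key]
        else:
            unmatched.append(url)
    return redirects, unmatched


# Hypothetical usage with made-up URLs.
old = [("Healing the Web", "http://old.example.com/archive/1.aspx"),
       ("A Post That Was Renamed", "http://old.example.com/archive/2.aspx")]
new = [("Healing  the web!", "http://new.example.com/healing-the-web/")]
redirects, unmatched = build_redirect_map(old, new)
```

The `redirects` dictionary is what would be turned into redirect entries, and `unmatched` is the list left over for hand matching.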

That’s not the end of it though, because eventually I don’t want those old URLs to be used. Using search engines (MSN Search and Google), I’ll find all the pages that link to the old URLs and see if I can get them changed.

With a bit of luck I’ll be able to turn the redirects off by the end of the year, but I will monitor the statistics on the old hosting provider.

 

10 thoughts on “Healing the Web”

  1. Darren Neimke

    Nice post Mitch. My first thoughts were that you were crazy for going away from hosting on your own domain but the more I think about it the more I’m convinced that you are doing the right thing.

  2. brendan murley

    I use WordPress for my training site and love it. I don’t have the complexities that you have of course. But I have it installed on my own domain. Next time we catch up we’ll have to have a chat about it.

  3. Mitch Denny

    Albert, I’m not sure where the latest version of .Text is, it actually got rolled into Community Server which is certainly something worth looking at. But there are other options out there too such as:

    SingleUserBlog
    http://markitup.com/Posts/PostsByCategory.aspx?categoryId=bc1e0ae5-7908-452d-a936-f296ba379f23

    Subtext
    http://subtextproject.com

    Das Blog
    http://www.dasblog.net

    I would counsel you that while self-hosting seems fun, you need to consider the long-term impact on your content when you decide you want to move to a new blogging engine which breaks the URLs that others have used to link to you.

    The blogosphere is built on links so don’t break them lightly (which is why I am going to so much effort).

  4. Mitch Denny

    Darren, thanks! It’s a big change for me, but I’m actually doing it to give myself some freedom. Actually, what I think the world needs is a low-cost .NET blog hosting provider which hosts all different kinds of blogging engines. For example, if you want a hosted Community Server installation you could get:

    http://markitup.communityserver.net

    Or:

    http://markitup.singleuserblog.net

    The sites would be hosted for a very small fee which means that you could have heaps of them as a small drain on your credit card and not have to worry about the content disappearing. A really forward thinking hosting provider would figure out how to keep a blog available even after the principal owner of the site stopped using it (to make sure the web of links didn’t break).

  5. Mitch Denny

    Hi Brendan,

    I like WordPress as well, although it’s not because of the technology platform (that has given me some grief over the last week). It’s all about the way they have implemented the content management features. It’s really simple but quite powerful too, and you can discover the features at your own pace.

  6. Pingback: notgartner » Blog Archive » Blog spam be gone!

  7. danithew

    “I wrote a program that takes the WXR file and breaks it down into small chunks. In the end I had to break it down into chunks of only ten items each. Each file then had to be uploaded separately. That’s right, I individually uploaded 100+ files.”

    I am interested in this program you wrote – as I am currently trying to upload a large WXR file and the system limits are not allowing me. I can see you wrote this post a year ago. Do you still have the program you wrote and would you be willing to share it?
