Healing the Web

23 07 2006

One of the reasons that I stayed on my old blogging engine (.Text) so long was that I had a huge number of incoming links which I didn’t really want to break. It’s not so much that I wanted to keep the traffic, but if someone is following a link it’s usually because they are interested in what is at the end of it, and there is nothing more frustrating than getting a 404 or, potentially even worse, a DNS resolution problem.

For that reason when I started migrating my content over here to WordPress I committed to making sure that as many of the links as possible to the old blog would not return a 404, but would instead redirect to the new location of the same content on WordPress. This presented a number of challenges:

  1. Extracting a faithful representation of my content from .Text.
  2. Importing that into WordPress whilst losing as little detail as possible.
  3. Introducing redirects for old post URLs.
  4. Contacting site owners who have linked to my blog and asking them to fix the links (with specific details).

Getting the content out of .Text was fairly straightforward. I went to the BlogML homepage at CodePlex and used the SDK (version 1.0.0) to rip it out. There were a few minor issues: since I first submitted the code to the project, there had been a number of modifications to support later versions of .Text than the one I was running, and those broke the export for me. It took me about half an hour of debugging to remove the stuff that I didn’t need.

Once that was done I needed to transform the BlogML file into the native WXR format that WordPress supports for the bulk import of content. This was fairly straightforward and it was “fun” playing with XSLT, and in the end I had a representation that was faithful to the sample of WXR content that I had generated from WordPress – or so I thought.

WordPress Import Sagas

When I first tried importing my two megabyte content file it bombed out completely with a timeout error. After further investigation I discovered that this is probably due to a configuration on the host which stops scripts running for longer than a specified period of time.

I thought this was a bit odd since the code I used to generate the XML in the first place only took a few seconds to run, and I wasn’t even being particularly careful about memory utilisation. Based on how long the import was running before it timed out, I suspect that the LAMP platform they are running on is probably not that fast at processing XML content (either that or MySQL is painfully slow).

Anyway – enough LAMP-bashing. I wrote a program that takes the WXR file and breaks it down into small chunks. In the end I had to break it down into chunks of only ten items each. Each file then had to be uploaded separately. That’s right, I individually uploaded 100+ files.
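For illustration, here is a minimal sketch of that kind of splitter in Python. It is not the program described above: the function name `split_rss_items` is made up, and it assumes a plain RSS-style layout with `<item>` elements directly under a single `<channel>`. A real WXR file also uses XML namespaces, which would need registering so that the prefixes survive the round trip.

```python
import copy
import xml.etree.ElementTree as ET


def split_rss_items(xml_text, chunk_size=10):
    """Split an RSS-style document into documents of at most chunk_size items.

    Hypothetical helper: every chunk keeps the channel metadata
    (title, link, and so on) and receives its own slice of <item> elements.
    """
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("expected a <channel> element under the root")

    # Detach every item so that the remaining tree can serve as a template.
    items = channel.findall("item")
    for item in items:
        channel.remove(item)

    chunks = []
    for start in range(0, len(items), chunk_size):
        chunk_root = copy.deepcopy(root)  # template without any items
        chunk_root.find("channel").extend(items[start:start + chunk_size])
        chunks.append(ET.tostring(chunk_root, encoding="unicode"))
    return chunks

# Assumption: for a real WXR file, ET.register_namespace(prefix, uri) would be
# called first for each namespace the file declares. The URIs are left out
# here because they depend on the WXR version in use.
```

Each returned string could then be written to its own file and uploaded through the import screen one at a time.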

Of course, before I got to that point I did quite a bit of testing to make sure the data was being imported correctly, and I’m afraid that there is inconsistent date handling in there somewhere. While the dates are pretty much correct for the posts and comments, the times on the comments in particular are all 12:00am. This was odd because the input was based on the export that WordPress gave me, so I would have assumed it would accept the same format on the way back in.

I think the issue here is that the WXR format extends RSS using XML namespaces, and the elements within that namespace add redundant information which could have been pushed into the pubDate element defined as part of RSS. The result is that with WXR, there are three dates that the WordPress processing code has to deal with – BlogML on the other hand is much more precise and there is no redundant information.
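To make the redundancy concrete, here is a small Python sketch that derives all three date strings for a post from a single timestamp so that they cannot disagree. The element names `wp:post_date` and `wp:post_date_gmt` and their formats reflect my understanding of WXR and should be checked against a real export; the function name `wxr_dates` and the `local_offset_hours` parameter are made up for the example. It does not claim to explain why the comment times came through as 12:00am.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def wxr_dates(iso_timestamp, local_offset_hours=0):
    """Return the three date strings for one post as a dict.

    Hypothetical helper: all three values come from the same moment in time,
    so the importer sees consistent data whichever field it reads.
    """
    moment = datetime.fromisoformat(iso_timestamp)
    if moment.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        moment = moment.replace(tzinfo=timezone.utc)

    utc = moment.astimezone(timezone.utc)
    local = moment.astimezone(timezone(timedelta(hours=local_offset_hours)))

    return {
        "pubDate": format_datetime(utc),  # RFC 822 style, as used by RSS
        "wp:post_date": local.strftime("%Y-%m-%d %H:%M:%S"),
        "wp:post_date_gmt": utc.strftime("%Y-%m-%d %H:%M:%S"),
    }
```

The same approach would apply to comment timestamps, which is where a date-only value would silently turn into midnight.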

Next Steps

It’s been an interesting week with WordPress. I’ve been mostly focused on getting my content across, but over the next little while I’ll start mapping the old .Text post URLs to the new WordPress generated ones.

I’ll probably do this by exporting the content from WordPress and doing a match on the post titles (I know there are a few formatting issues that will lead to mismatches, but it should get the remainder down to few enough to match by hand). I can then put redirects into the ASP.NET configuration file at http://notgartner.com so that people who land there get redirected to the new location.
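A rough Python sketch of the title matching step follows. It is only an illustration under stated assumptions: the function names, the input shape (lists of title and URL pairs), and the normalisation rules are all made up for the example, and the redirect configuration itself is not shown.

```python
import re
import unicodedata


def normalise_title(title):
    """Reduce a title to a comparison key that ignores formatting differences."""
    text = unicodedata.normalize("NFKD", title)
    # Drop anything outside ASCII, such as curly quotes and accent marks.
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Remove punctuation and whitespace so "Don't" and "Dont" compare equal.
    return re.sub(r"[^a-z0-9]+", "", text)


def build_redirects(old_posts, new_posts):
    """Map old URLs to new URLs by title.

    Both arguments are lists of (title, url) pairs. Returns a tuple of
    (redirects, unmatched): a dict of old URL to new URL, and the old URLs
    that had no match or an ambiguous one and so need matching by hand.
    """
    by_key = {}
    for title, url in new_posts:
        by_key.setdefault(normalise_title(title), []).append(url)

    redirects, unmatched = {}, []
    for title, old_url in old_posts:
        candidates = by_key.get(normalise_title(title), [])
        if len(candidates) == 1:
            redirects[old_url] = candidates[0]
        else:
            unmatched.append(old_url)
    return redirects, unmatched
```

Titles that are reused across several posts end up in the unmatched list on purpose, since guessing between them would send readers to the wrong post.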

That’s not the end of it though, because eventually I don’t want those old URLs to be used. Using search engines (MSN Search and Google) I’ll find all the pages that link to the old URLs and see if I can get them changed.

With a bit of luck I’ll be able to turn the redirects off by the end of the year, but I will monitor the statistics on the old hosting provider.

 




Why did I choose WordPress as my future blogging engine?

19 07 2006

Good question, I’m glad you asked! Over the past couple of years I’ve quietly evaluated several blogging engines, including the large co-hosted ones like WordPress, Blogger, MSN Spaces, MySpace, and the SixApart offerings. I’ve also looked at the various self-hosted options like Community Server, dasBlog, Subtext, and SingleUserBlog.

To self-host, or not to self-host?

When I started blogging I believed that having my own domain name was the most important thing and that over time I would change which blogging engine I used. However, I have learned that it’s not that simple, especially when you want to give the inbound links a safe place to land.

With all of the URL rewriting that the ASP.NET engines use, intercepting inbound requests and redirecting them to the new content becomes increasingly difficult without breaking the system. Since I didn’t really look forward to that, and there was no way in hell that I was going to self-host a non-ASP.NET solution, I decided that I should look at the co-hosted options.

Which co-hosted solution is right for me?

Of all the co-hosted solutions that I looked at I liked WordPress the most because of the backend publishing tools, from the look and feel to the actual functionality. The ability to import from a number of existing archive formats is also useful, and I have plans to migrate my existing content across using BlogML and the WordPress WXR (WordPress eXtended RSS) format. So far I haven’t been terribly successful because the import tool times out when trying to import as much content as I have.

WordPress also has quite a bit of star power with a blogger like Robert Scoble using it, so my hope is that if something is done that affects the service then all us little people have one very squeaky wheel.

Life beyond WordPress?

I’ll be the first to admit that WordPress won’t be my final home, but I don’t think that I will be going back to a self-hosted solution. I’ll probably change my blogging engine every year or two as different features appeal to me. But one thing that I’ll never have to do is worry about my links breaking (unless WordPress gets taken down or starts charging too much for me to bother keeping it going).

Moving forward I will continue to own the notgartner.com domain and use that to refer people to my current blogging engine (with the help of some ASP.NET code), and I’ll probably keep redirecting old incoming links to their new locations here until the web heals itself. I’m looking forward to the flexibility that hosting more things “in the cloud” will give me. Heck, I might even start some other blogs to talk specifically about certain projects that I am working on.




This feed has moved!

18 07 2006

This is my first post from my new blog digs over here at WordPress. The first post at a new URL seems like such an auspicious occasion that it is a little unfair that most people have to use it up telling people where to point their feed aggregators (hint: my new feed is here).

I haven’t made this decision lightly. I’ve been wanting to change the way that I present my blog to the world for over two years, but for various reasons I’ve held back until now. Over the next couple of posts I’ll explain my overall plan for successfully migrating across to WordPress and give some background on why I am doing it.