Healing the Web
July 23, 2006
One of the reasons I stayed on my old blogging engine (.Text) for so long was that I had a huge number of incoming links that I didn’t want to break. It’s not so much that I wanted to keep the traffic; if someone follows a link, it’s usually because they are interested in what is at the end of it, and there is nothing more frustrating than getting a 404 or, potentially even worse, a DNS resolution error.
For that reason when I started migrating my content over here to WordPress I committed to making sure that as many of the links as possible to the old blog would not return a 404, but would instead redirect to the new location of the same content on WordPress. This presented a number of challenges:
- Extracting a faithful representation of my content from .Text.
- Importing that into WordPress whilst losing as little detail as possible.
- Introducing redirects for old post URLs.
- Contacting site owners who have linked to my blog and asking them to fix the links (with specific details).
Getting the content out of .Text was fairly straightforward. I went to the BlogML homepage at CodePlex and used the SDK (version 1.0.0) to rip it out. There were a few minor issues: since I first submitted the code to the project, there had been a number of modifications to support later versions of .Text than the one I was running, and those broke the export for me. It took me about half an hour of debugging to remove the stuff that I didn’t need.
Once that was done I needed to transform the BlogML file into the native WXR format that WordPress supports for the bulk import of content. This was fairly straightforward and it was “fun” playing with XSLT, and in the end I had a representation that was faithful to the sample of WXR content that I had generated from WordPress – or so I thought.
WordPress Import Sagas
When I first tried importing my two megabyte content file it bombed out completely with a timeout error. After further investigation I discovered that this is probably due to a configuration on the host which stops scripts running for longer than a specified period of time.
I thought this was a bit odd since the code I used to generate the XML in the first place only took a few seconds to run and I wasn’t even being particularly careful about memory utilisation. I suspect that the LAMP platform that they are running on is probably not that fast at processing XML content based on how long it was running before it timed out (either that or the MySQL is painfully slow).
Anyway – enough LAMP-bashing. I wrote a program that takes the WXR file and breaks it down into small chunks. In the end I had to break it into chunks of only ten items each. Each file then had to be uploaded separately – that’s right, I individually uploaded 100+ files.
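For anyone facing the same timeout, the chunking idea can be sketched in a few lines of Python. This is not the program I actually ran, just a minimal illustration that assumes the export is an RSS-style document with the items sitting directly under a single channel element; the function name split_wxr and the chunk_size parameter are made up for the example.

```python
import copy
import xml.etree.ElementTree as ET


def split_wxr(xml_text, chunk_size=10):
    """Split an RSS-style export into documents of at most chunk_size items.

    Assumes the layout <rss><channel>...<item/>...</channel></rss>.
    Every chunk keeps the channel-level header elements (title, link, etc.).
    """
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = channel.findall("item")

    # Remove every item so that what remains is the shared channel header.
    for item in items:
        channel.remove(item)

    chunks = []
    for start in range(0, len(items), chunk_size):
        # Copy the header-only document, then add this chunk's items to it.
        chunk_root = copy.deepcopy(root)
        chunk_root.find("channel").extend(items[start:start + chunk_size])
        chunks.append(ET.tostring(chunk_root, encoding="unicode"))
    return chunks
```

Each string in the returned list can be written to its own file and uploaded on its own. One caveat: ElementTree rewrites namespace prefixes when it serialises unless they are registered first with ET.register_namespace, so the prefixes and URIs from your own export would need registering before the output is fed to an importer that cares about prefix names.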
Of course, before I got to that point I did quite a bit of testing to make sure the data was being imported correctly. I’m afraid there is inconsistent date handling in there somewhere: the dates are pretty much correct for the posts and comments, but the times, on the comments in particular, are all 12:00am. This was odd because my input was modelled on the export that WordPress gave me, so I would have assumed it would accept the same format on the way back in.
I think the issue here is that the WXR format extends RSS using XML namespaces, and the elements within that namespace add redundant information which could have been pushed into the pubDate element defined as part of RSS. The result is that with WXR, there are three dates that the WordPress processing code has to deal with – BlogML on the other hand is much more precise and there is no redundant information.
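To make the redundancy concrete, here is a small Python sketch that derives all three values from a single timestamp so that they cannot disagree. I am assuming the three are the RSS pubDate plus wp:post_date (local time) and wp:post_date_gmt in the WordPress namespace, and that the latter two use a plain "YYYY-MM-DD HH:MM:SS" layout; check those names and formats against your own export before relying on them. The helper name wxr_dates is invented for the example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def wxr_dates(local_dt):
    """Return the three date strings for one post from one aware datetime.

    The element names used as keys are assumptions about the WXR layout.
    """
    if local_dt.tzinfo is None:
        raise ValueError("local_dt must carry a timezone offset")
    utc_dt = local_dt.astimezone(timezone.utc)
    return {
        # RFC 822 style date, as used by the RSS pubDate element.
        "pubDate": format_datetime(utc_dt),
        # Blog-local wall-clock time.
        "wp:post_date": local_dt.strftime("%Y-%m-%d %H:%M:%S"),
        # The same instant expressed in GMT.
        "wp:post_date_gmt": utc_dt.strftime("%Y-%m-%d %H:%M:%S"),
    }
```

Generating the values this way at least guarantees the three agree with each other, which makes it easier to tell whether a 12:00am result comes from the input or from the importer.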
Next Steps
It’s been an interesting week with WordPress. I’ve been mostly focused on getting my content across, but over the next little while I’ll start mapping the old .Text post URLs to the new WordPress generated ones.
I’ll probably do this by exporting the content from WordPress and matching on the post titles (I know there are a few formatting issues that will lead to mismatches, but it should get the remainder down to few enough to match by hand). I can then put redirects into the ASP.NET configuration file at http://notgartner.com so that people who land on an old URL get redirected to the new location.
That’s not the end of it though, because eventually I don’t want those old URLs to be used. Using search engines (MSN Search and Google) I’ll find all the pages that link to the old URLs and see if I can get them changed.
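The title matching could look something like the Python sketch below. It is only an outline of the idea, not code I have written for the migration: the normalise and match_by_title names are hypothetical, and it assumes each export has already been reduced to a dictionary of post title to URL.

```python
import re


def normalise(title):
    """Lower-case a title and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def match_by_title(old_posts, new_posts):
    """Pair old URLs with new URLs using normalised titles.

    old_posts and new_posts map post title -> URL.
    Returns (mapping, unmatched): mapping is old URL -> new URL, and
    unmatched lists the old URLs that still need matching by hand.
    """
    new_by_key = {normalise(title): url for title, url in new_posts.items()}
    mapping, unmatched = {}, []
    for title, old_url in old_posts.items():
        new_url = new_by_key.get(normalise(title))
        if new_url is None:
            unmatched.append(old_url)
        else:
            mapping[old_url] = new_url
    return mapping, unmatched
```

Normalising before comparing absorbs the small formatting differences (quotes, punctuation, spacing) that would otherwise cause mismatches. Posts that share a title would collide in this scheme, so they belong in the hand-matched pile too.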
With a bit of luck I’ll be able to turn the redirects off by the end of the year, but I will monitor the statistics on the old hosting provider.
July 23, 2006 at 5:02 pm
I have been using .Text for a very long time. I would like to get the code so I can modify stuff. What’s the latest version?
Cheers
Al
July 23, 2006 at 9:05 pm
Nice post Mitch. My first thoughts were that you were crazy for going away from hosting on your own domain but the more I think about it the more I’m convinced that you are doing the right thing.
July 23, 2006 at 10:36 pm
I use WordPress for my training site and love it. I don’t have the complexities that you have, of course, but I have it installed on my own domain. Next time we catch up we’ll have to have a chat about it.
July 23, 2006 at 11:05 pm
Albert, I’m not sure where the latest version of .Text is, it actually got rolled into Community Server which is certainly something worth looking at. But there are other options out there too such as:
SingleUserBlog
http://markitup.com/Posts/PostsByCategory.aspx?categoryId=bc1e0ae5-7908-452d-a936-f296ba379f23
Subtext
http://subtextproject.com
Das Blog
http://www.dasblog.net
I would counsel you that while self hosting seems fun, you need to consider the long term impact on your content when you decide you want to move to a new blogging engine which breaks the URLs that others have used to link to you.
The blogosphere is built on links so don’t break them lightly (which is why I am going to so much effort).
July 23, 2006 at 11:11 pm
Darren – thanks! It’s a big change for me, but I’m actually doing it to give myself some freedom. Actually – what I think the world needs is a low cost .NET blog hosting provider which hosts all different kinds of blogging engines. For example – if you want a hosted Community Server installation you could get:
http://markitup.communityserver.net
Or:
http://markitup.singleuserblog.net
The sites would be hosted for a very small fee which means that you could have heaps of them as a small drain on your credit card and not have to worry about the content disappearing. A really forward thinking hosting provider would figure out how to keep a blog available even after the principal owner of the site stopped using it (to make sure the web of links didn’t break).
July 23, 2006 at 11:13 pm
Hi Brendan,
I like WordPress as well, although it’s not because of the technology platform (that has given me some grief over the last week). It’s all about the way they have implemented the content management features – it’s really simple but quite powerful too – you can discover the features at your own pace.
August 9, 2006 at 8:52 am
[...] Today I sat down and wrote the piece of code required to match my new WordPress URLs with my old .Text URLs. I then used the matched URLs to generate an ASP.NET 2.0 Web.config file with a <urlMappings /> element. Now – when someone hits one of my old permalinks they will be redirected to the new URL up here in WordPress. This was step three in my plan to heal the web after I moved my blog. [...]
November 15, 2006 at 2:30 pm
How can I use BlogML 2.0 and write an XSL transformation for it?
November 16, 2006 at 4:41 am
Hi GISKB,
Check out the BlogML homepage (http://www.blogml.com). Cheers.
August 6, 2007 at 5:52 pm
“I wrote a program that takes the WXR file and breaks it down into small chunks. In the end I had to break it into chunks of only ten items each. Each file then had to be uploaded separately – that’s right, I individually uploaded 100+ files.”
I am interested in this program you wrote – as I am currently trying to upload a large WXR file and the system limits are not allowing me. I can see you wrote this post a year ago. Do you still have the program you wrote and would you be willing to share it?