I help administer a blog with about 2000 entries. The site was previously managed in MovableType, but earlier this year we moved it over to WordPress. At the time we migrated about 30% of the posts (the most trafficked 30%). As time has gone by we've migrated more and more, but it's a time-consuming process: we need to create the new WordPress post, put a 301 redirect in place, and then manually delete the old MovableType post. And by manual deletion I mean deleting the actual HTML file that MovableType created for that post. You see, MovableType produces a static file for each and every post, archive page, and index page. This is a blessing and a curse: it means the load on the web server isn't large on heavily trafficked sites, but it also means maintaining a legacy site is a PAIN IN THE A**. It's actually such a pain that I've given up on MT itself and now work with the static HTML files directly.
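For the 301 redirects, a one-liner in the site's .htaccess does the job for each migrated post. This is just a minimal sketch assuming Apache with mod_alias enabled; the path and permalink here are made up for illustration:

# hypothetical paths: send a migrated MT post to its new WordPress permalink
Redirect 301 /archives/2008/01/some-old-post.html http://www.new-domain.com/some-old-post/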
This week I needed an easy way of doing a search and replace across all the legacy HTML files, all 2000 or so of them. It needed to be recursive, and ideally it needed to happen without me FTPing all the files to a local computer and then FTPing them back. I have command-line access to the server the blog is on, so I checked whether Perl offered a way to do what I wanted on the files in situ. As it turns out, it's pretty simple. I wanted to find and replace all internal links that used the old domain name (did I mention this site just changed domain names?) with the new domain name. The Perl command to do this for all the files in a single folder looks like this:
perl -p -i -e 's/oldstring/newstring/g;' *.html
The -p option wraps the script in a loop that reads the input line by line and prints each line after the script has run on it, so whatever we substitute ends up back in the output. The -i option means that Perl will edit the files in place. The -e option allows Perl commands to be run straight from the command line (it doesn't look for a script file). The s at the start of the expression is Perl's substitution operator, and oldstring and newstring are regular expressions, so special characters need to be escaped appropriately. Finally, the g modifier makes the substitution global, replacing every match on a line rather than just the first. So to replace my URLs I needed something like this:
perl -p -i -e 's/www\.old\-domain\.com/www\.new\-domain\.com/g' *.html
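One note of caution: -i with no suffix overwrites the originals with no way back. If you're nervous (and with 2000 files you probably should be), you can give -i an extension and Perl will keep a backup of each file before editing it:

perl -p -i.bak -e 's/www\.old\-domain\.com/www\.new\-domain\.com/g' *.html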
The issue with that is that I needed to process sub-directories recursively to search and replace in all the HTML files. That can be done in a few different ways; the immediately obvious ones were chaining with find or grep. I chose find and ended up with a command that looks like this:
perl -p -i -e 's/www\.old\-domain\.com/www\.new\-domain\.com/g' `find ./ -name "*.html"`
I ran that from the Ubuntu command line and a thousand or more files were processed in under a second. Very cool.
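One caveat worth mentioning: with enough files, the backtick expansion can blow past the shell's argument-length limit and fail with "Argument list too long". If that ever bites, letting find invoke Perl itself avoids the problem. Something like this should do the same substitution, just feeding find's results to Perl a different way:

find ./ -name "*.html" -exec perl -p -i -e 's/www\.old\-domain\.com/www\.new\-domain\.com/g' {} +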