Recovering your MovableType database from your HTML archives…

So, Perl is a wonderful scripting language that can do magical things with text files, like helping me rebuild the database from everyone’s blogs, comments included. That’s close to a thousand entries and 400 comments.

Here is the magic that made it possible:

#!/usr/bin/perl
use Date::Manip;
# mtfix.pl - parse HTML files to import into Movable Type
# usage: mtfix.pl *html > import.mt
# Note: you _will_ need to adjust the regex
# read each file on the command line into $content in one go
# (slurp mode), line feeds and all; $/ is only undefined inside the do block
while (defined($content = do { local $/; <> })) { # for each file on the command line
# locate the fields we need using regex
# some matches may include newlines
($author) =($content =~ m|<div class="posted">\s+(.+?)\s+/|s);
($title) = ($content =~ m|<span class="title">(.+)</span>|);
($text) = ($content =~ m|<p>(.+)<a name="more">|s);
($more) = ($content =~ m|<a name="more">(.+)<span class="posted">|s);
($date) = ($content =~ m|<div class="date">(.+)</div>|);
($time) = ($content =~ m|Posted\sat\s(.+)<br />|);

# Read in all the comments as one big block of text
($comments) = ($content =~ m|</a>Comments</div>\n(.+)<div class="comments-head">|s);
# Break up each comment into its own element in an array
@comm = split (/<\/div>\n/, $comments);

# convert the date to MM/DD/YYYY hh:mm:ss
$datetime = "$date $time";
$parsed = ParseDate($datetime);
$datetime = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

# Strip out the paragraph tags, MT will add them later anyways.
$text =~ s|\<p\>||g;
$text =~ s|\</p\>||g;
$more =~ s|\<p\>||g;
$more =~ s|\</p\>||g;

# print out the fields in the proper format
print "AUTHOR: $author\n";
print "TITLE: $title\n";
print "DATE: $datetime\n";
print "-----\n";
print "BODY:\n$text\n";
print "-----\n";
print "EXTENDED BODY:\n$more\n";

foreach (@comm) { # For every comment in our array, print out the necessary formatting.
if (length $_ > 7) { # this is here to ignore the last comment record.
($CText) = ($_ =~ m|<div class="comments-body">\n(.+)\n\<span|s);
($CDate) = ($_ =~ m|</a>\son\s(.+)</span>|);
($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);
$CText =~ s|\<p\>||g;
$CText =~ s|\</p\>||g;
$parsed = ParseDate($CDate);
$CDate = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

print "-----\n";
print "COMMENT:\n";
print "AUTHOR: $CAuthor\n";
print "URL: $CURL\n";
print "DATE: $CDate\n";
print "$CText\n\n";
}
}
print "--------\n";
}

Thanks to Papa Scott for getting me started.


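In order to get this to work, you’ll need to install Perl. The script also loads the Date::Manip module, which does not ship with Perl, so you’ll need to install that from CPAN as well.

You can get Perl free from www.perl.com; there are Windows, Mac, Unix, and Linux distributions.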
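
The script loads Date::Manip for its date conversion. If that module isn’t available on your machine, here is a rough substitute using Time::Piece, which ships with Perl 5.10 and later. This is only a sketch: mt_date is a made-up helper name, and the format string assumes dates shaped like "December 18, 2003 09:43 AM", so adjust it to match yours.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;

# mt_date is a hypothetical helper, not part of mtfix.pl.
# Assumption: dates look like "December 18, 2003 09:43 AM".
sub mt_date {
    my ($text) = @_;
    # %B = full month name, %I = 12-hour clock, %p = AM/PM
    my $t = Time::Piece->strptime($text, "%B %d, %Y %I:%M %p");
    # Same MM/DD/YYYY hh:mm:ss layout the script prints for DATE:
    return $t->strftime("%m/%d/%Y %H:%M:%S");
}

print mt_date("December 18, 2003 09:43 AM"), "\n";
```

Unlike Date::Manip, strptime does not guess: it dies on text that doesn’t match the format string, so this only suits archives where every date has the same shape.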

Once you’ve installed Perl, copy that script into your favorite text editor and save it as mtfix.pl (or whatever). You’ll need to use your individual archives for the import. Take a look at one of the HTML files to figure out what you need to change in the Perl script.

The Perl script works by searching the HTML, from the start of the file, for the first place each pattern matches.

For example:

($title) = ($content =~ m|<span class="title">(.+)</span>|);

Here we are looking for the title of the entry. The ‘(.+)’ represents the text we are going to capture; my titles are sitting in a span with the class ‘title’. All you need to do to get it to work for yours is figure out what surrounds each item you are looking for.
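
To make the "figure out what surrounds it" step concrete, here is a small sketch. The h3 markup is invented for illustration; it is not from my templates. It also shows one thing to watch for: ‘(.+)’ is greedy, so if the closing tag appears twice on the same line it runs to the last one, while ‘(.+?)’ stops at the first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: a template that wraps the title in <h3 class="title">
# rather than the <span class="title"> used in this post.
my $content = qq{<h3 class="title">My first post</h3> <h3 class="date">Dec 18</h3>\n};

# Greedy (.+) runs on to the LAST </h3> on the line.
my ($greedy) = ($content =~ m|<h3 class="title">(.+)</h3>|);

# Non-greedy (.+?) stops at the FIRST </h3>.
my ($title) = ($content =~ m|<h3 class="title">(.+?)</h3>|);

print "greedy: $greedy\n";
print "title:  $title\n";
```

If your titles sit alone on their own line, the greedy version is fine, which is why it worked for my archives.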

A little more complicated is the way I got the author information out of the comments. For me, the author name was a link.

<span class="comments-post">Posted by: <a target="_blank" href="http://jason.sdf1.net">Jason</a> at December 18, 2003 09:43 AM</span>

So this was the code:

($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);

It grabs that whole line into the temp variable and then looks in the temp variable for the Name and URL. Notice how I had to use \s to represent spaces for the ‘Posted by:’ search. You may have to use a \n for new lines 😉
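
Here is the same two-step idea as a standalone sketch you can run against the sample line above. The helper name parse_comment_footer is made up for this example, and the patterns are a slightly stricter variation rather than the exact ones in mtfix.pl: ‘[^"]+’ stops at the first closing quote, so it should still work if your template puts other attributes after the href.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample comment footer from this post.
my $line = '<span class="comments-post">Posted by: <a target="_blank" '
         . 'href="http://jason.sdf1.net">Jason</a> at December 18, 2003 09:43 AM</span>';

# parse_comment_footer is a hypothetical helper, not part of mtfix.pl.
sub parse_comment_footer {
    my ($html) = @_;
    # Step 1: grab the whole author link into a temp variable
    my ($link) = ($html =~ m|Posted\sby:\s(<a\s.+?</a>)|);
    return unless defined $link;
    # Step 2: look inside the temp variable for the name and the URL
    my ($author) = ($link =~ m|>([^<]+)</a>|);
    my ($url)    = ($link =~ m|href="([^"]+)"|);
    return ($author, $url);
}

my ($author, $url) = parse_comment_footer($line);
print "AUTHOR: $author\nURL: $url\n";
```

This assumes every commenter left a URL. A comment whose author is plain text with no link would need its own pattern.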

Okay, so you think you’ve made the changes you need, and want to go ahead and try it out. Drop to a command prompt, with your saved script and the html files in the same folder and type:

perl mtfix.pl 0000001.html

Or whatever file name you choose. I recommend trying it out with just one of your files. It’s going to print the results to the screen, so you can read through and make sure that it’s finding the right text. If not, go back and figure out why. If it’s perfect, then use this:

perl mtfix.pl *.html > import.mt
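
Before importing, a quick sanity check is to count the entries in import.mt and compare that against the number of HTML files you fed in. The sketch below relies on the fact that mtfix.pl prints a line of exactly eight dashes at the end of every entry; count_entries and the file name countentries.pl are names invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_entries is a hypothetical helper: it counts lines made of exactly
# eight dashes, which mtfix.pl prints once at the end of each entry.
# The five-dash lines between fields are not counted.
sub count_entries {
    my ($text) = @_;
    my $count = () = $text =~ /^-{8}$/mg;
    return $count;
}

# Usage: perl countentries.pl import.mt
if (@ARGV) {
    local $/;          # slurp mode: read the whole file at once
    my $text = <>;
    print count_entries($text), " entries found\n";
}
```

If the count is lower than the number of files, one of the patterns probably failed on some of your pages, and those are the ones to look at.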

Hope that works out for you. If you need more help, do a Google search on Perl and regular expressions. Learn something 😀

3 Comments so far
Alison December 18th, 2003 9:43 am

Hey Jay,

Is this why I can’t sign into the blog to edit my pages anymore? The username and password are invalid

🙂
Alison

Russ December 31st, 2003 6:04 am

Nice script, and I can *almost* get it to work for me. My perl-fu is weak….

What I can’t extract is the time out of this html:

[span class="posted">Posted by Russ E. at December 1, 2003 11:28 PM
| [a href="http://www.yadda-yadda-yadda.com/MT/mt-tb.cgi?__mode=view&entry_id=001" [/span]

Any suggestions?

Cheers,
Russ

Doug Daulton (apakuni) June 19th, 2004 11:03 pm

Jason,

Thanks for a great script. It saved my backside. I took the liberty of tweaking it a bit to grab more date from the RDF portion of the page. I’ve posted the tweaked file over at the MT Forums.

http://www.movabletype.org/support/index.php?s=03fc4fd66d6462e9d2b3942287661c0d&act=ST&f=13&t=32035&st=0#entry187641