Slurping content from Livejournal with ruby

Published
2017-04-15
Tagged

Let’s talk about LiveJournal.

When I was a teenager, I was a giant nerd.1 A good deal of my social life was lived on LiveJournal. Since about 2007, when Six Apart sold the LiveJournal service to the Russian SUP Media, every visit to the site has been like visiting an elderly relative riddled with some form of incurable disease: you feel a wave of pity because you remember what this place was like.

I kept my LiveJournal around, locked down in private mode, because I figured it documented a good part of my life - the part of my life where I had the time and wherewithal to actually document stuff, as opposed to now when I’m running around trying to get ten things done all the time. Still, I was an embarrassing little brat as a teenager, and about 90% of those entries, well, the world is better for their general absence.

So when LiveJournal changed its terms of service yet again, I decided I should do something. I don’t want the best documentation of my teenage years to vanish as soon as I hit “delete” on my account (or as soon as LiveJournal closes its doors or decides to permaban me for disobeying terms of service that I cannot actually read, on account of not speaking Russian); nor do I really want to port them over to better, nicer, equivalent services like DreamWidth.2

Can we get computers to help us in this endeavour? You bet we can.

Building our base importer

We’re going to be using ruby to access LiveJournal and pull down all our entries. Specifically, we’re going to be using the livejournal gem. We can pull that into our base program (and do a quick login) easily enough:

require "livejournal"

user = LiveJournal::User.new("USERNAME", "PASSWORD")
login = LiveJournal::Request::Login.new(user)
login.run

puts "Login response:"
login.dumpresponse

puts "User's full name: #{user.fullname}"

This is just copied straight from the gem site, and gives you the sort of results you would expect. If you’re getting errors, it may be because LiveJournal wants you to sign their new Terms of Service - a quick trip to the website and a check of your messages will help with that.

OK, let’s see if we can pull the first entry of our journal. In order to do this, we must make a GetEvents request, similar to how we made our login request earlier. You can make three different types of GetEvents request:

  • ItemID, which fetches a post with a given ID (post IDs start at 1 and increment by one for each post; an invalid ItemID returns nil).
  • Recent, which returns the n most recent posts.
  • Sync, which returns all posts since a given sync point.

Obviously, we want to use ItemID here. We also need to specify strict: false. Strict mode (which defaults to true) basically means that the livejournal gem will throw an error if it sees entry metadata it doesn’t recognise. Because LiveJournal has introduced new post metadata since this library was last updated, we can’t run in strict mode.

first_post = LiveJournal::Request::GetEvents.new(
  user,
  itemid: 1,
  strict: false
)

puts first_post.run.inspect

This will give you your first post. You could either save it as-is or, if you want to be more fancy, you could create your own Markdown file from the post data. Regardless: you’ve got all that sweet post data on your own hard drive now.
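If you do go the Markdown route, the dump can be as simple as writing out a few of the entry’s attributes. Here’s a minimal sketch, assuming the object that GetEvents#run hands back is a LiveJournal::Entry that responds to subject, event, and time (have a look at the gem’s entry.rb if yours looks different); the filename is just my own choice:

# A minimal sketch: dump one post to a Markdown file.
# Assumes first_post.run returns a LiveJournal::Entry with
# #subject, #event and #time.
entry = first_post.run

File.open("post-0001.md", "w") do |f|
  f.puts "# #{entry.subject}"
  f.puts
  f.puts "Posted: #{entry.time}"
  f.puts
  f.puts entry.event
end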

Grabbing posts

Grabbing the rest of your entries should be a cakewalk: the only question is which ID to stop at. You can fetch that as well:

most_recent_posts = LiveJournal::Request::GetEvents.new(
  user,
  recent: 1,
  strict: false
).run

most_recent_post = most_recent_posts.values.first
most_recent_id = most_recent_post.itemid

You could run a quick loop at this point, and grab each entry as its own separate item, but it turns out that LiveJournal has a request limit.3 Looking through the documentation, we see we can fetch the x most recent entries. Perhaps that will help us!

entries = LiveJournal::Request::GetEvents.new(
  user,
  recent: most_recent_id,
  strict: false
).run

Except that, no, this will only fetch a maximum of 50 entries. As far as I can tell, you’ll need to fetch each entry individually, slowly, over the course of a few days to make sure you don’t overload their servers.
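Strung together, that slow one-at-a-time approach might look something like the sketch below. The four-second pause is my own guess at staying comfortably under the apparent limit, and dumping each entry with inspect is simply the laziest on-disk format that could work:

# A sketch of a one-entry-at-a-time download loop.
# Invalid or deleted IDs come back as nil, so we skip those;
# the sleep keeps us under roughly 1000 requests per hour.
(1..most_recent_id).each do |id|
  entry = LiveJournal::Request::GetEvents.new(
    user,
    itemid: id,
    strict: false
  ).run

  File.write("post-#{id}.txt", entry.inspect) unless entry.nil?
  sleep 4
end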

Grabbing comments

Let’s say you want to go one better and grab all the comments on your entries as well. Must be pretty easy, right? We just adapt our method for posts, above, to deal with comments. Right?

Well, no. Not at all. We can’t retrieve comments using the normal API - instead, we have to muck about with REST endpoints and session stores. Again, we’ll just try to grab one comment as a proof-of-concept:

require "livejournal"
require "livejournal/sync"
require "open-uri"
require "nokogiri"

user = LiveJournal::User.new("USERNAME", "PASSWORD")
session = LiveJournal::Request::SessionGenerate.new(user).run

path =
  "http://www.livejournal.com/export_comments.bml?get=comment_body&startid=1"

first_comment = open(
  path,
  "Cookie" => "ljsession=#{session}"
){ |f| Nokogiri::XML(f.read) }

Here we do three things:

  1. First, we generate a session with LiveJournal. We do this through the Request::SessionGenerate class, which lives in the livejournal/sync portion of the livejournal gem.4
  2. Next, we fetch the data from LiveJournal, using the open-uri library to automate the boring bits of web requests.
  3. Finally (and this step is munged into the block that reads the data), we convert said data from an XML string into a structured object using nokogiri.

If you inspect your comment, you’ll find that you haven’t just pulled down one comment - you’ve pulled down a whole bunch of comments. To be precise, this method will cap out at 1000 comments - so you just need to re-run the query with startid set to 1001, and so on.
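Stitched into a loop, that re-running might look like the sketch below. Two assumptions of mine: an empty batch means we’ve run out of comments, and rather than blindly adding 1000 I advance startid to one past the highest comment ID in the batch, in case the IDs aren’t densely packed. The <comment> elements it looks for are the ones in the XML shown just below.

# A sketch of paging through comment bodies 1000 at a time,
# re-using the session we generated above.
startid = 1
comment_docs = []

loop do
  path = "http://www.livejournal.com/export_comments.bml?" \
         "get=comment_body&startid=#{startid}"
  doc = open(path, "Cookie" => "ljsession=#{session}") do |f|
    Nokogiri::XML(f.read)
  end

  batch = doc.xpath("//comment")
  break if batch.empty?

  comment_docs << doc
  startid = batch.map { |c| c["id"].to_i }.max + 1
end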

Your comment file will look something like this:

<livejournal>
  <comments>
    <comment id="1" jitemid="1" posterid="598088">
      <body>Comment body goes in here</body>
      <date>2002-09-16T22:11:55Z</date>
    </comment>
    <comment id="2" jitemid="1" posterid="597677" parentid="1">
      <body>Comment body goes in here.</body>
      <date>2002-09-17T03:17:10Z</date>
    </comment>
    ...
  </comments>
</livejournal>

That’s actually much more convenient than our post-slurping code. You’ll see that along with the body and date data, we also get some comment metadata:

  • id: The id of the comment itself.
  • jitemid: The id of the post the comment is associated with.
  • posterid: The id of the poster who made the comment.
  • parentid: If this comment is a reply to another comment, the id of the comment this comment is a reply to.
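Since nokogiri is already loaded, flattening that XML into Ruby hashes keyed by comment ID is short work. A sketch, working on the first_comment document we fetched earlier (the hash layout is just my own choice):

# A sketch: index each <comment> element by its ID, keeping the
# attributes and child elements described above. Attributes that
# aren't present (e.g. parentid on top-level comments) come out as nil.
comments = {}

first_comment.xpath("//comment").each do |node|
  comments[node["id"].to_i] = {
    post_id:   node["jitemid"].to_i,
    poster_id: node["posterid"] && node["posterid"].to_i,
    parent_id: node["parentid"] && node["parentid"].to_i,
    body:      node.at_xpath("body") && node.at_xpath("body").text,
    date:      node.at_xpath("date") && node.at_xpath("date").text
  }
end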

In order to link up poster IDs with names, we need to fetch our comment metadata as well:

require "livejournal"
require "livejournal/sync"
require "open-uri"
require "nokogiri"

user = LiveJournal::User.new("USERNAME", "PASSWORD")
session = LiveJournal::Request::SessionGenerate.new(user).run

path =
  "http://www.livejournal.com/export_comments.bml?get=comment_meta&startid=1"

comment_meta = open(
  path,
  "Cookie" => "ljsession=#{session}"
){ |f| Nokogiri::XML(f.read) }

Note the change in path - now we’re grabbing comment metadata. You’ll get back a bunch of XML describing your comments (minus the body of each comment), but also, appended to that, an XML structure that looks like the following:

<usermaps>
  <usermap id='6' user='test2' />
  <usermap id='3' user='test' />
  <usermap id='2' user='xb95' />
</usermaps>

This links each ID to a username.
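And with nokogiri, turning that into a lookup table takes only a couple of lines - again a sketch, working on the comment_meta document we fetched above:

# A sketch: build an id => username hash from the usermap elements,
# which we can then use to put names against posterid values.
usernames = {}

comment_meta.xpath("//usermap").each do |node|
  usernames[node["id"].to_i] = node["user"]
end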

Conclusion

It doesn’t feel like much of a conclusion, does it? But we have the basics of a LiveJournal data-slurping machine:

  • We can grab our LiveJournal entries, one-by-one, as long as we don’t offend the LiveJournal gods by making more than 1000 requests per hour. Over the course of a day or so, we can probably grab every entry.
  • We can grab our comments in bundles of one thousand at a time. These have XML all over the place, but they contain all the data we need (if we want) to re-assemble the tree of threaded comments below each entry.
  • We can grab additional comment metadata so we know who made these comments.

We haven’t built anything to extract information from these data structures. You can just save the comment XML files to disk, since pretty much anything will read XML. Grabbing data from your LiveJournal::Entry objects is a little more difficult, but if you’re the sort of person who’s seriously considering building a ruby script to download your LiveJournal, I’m sure you can work your way around the internals of the gem’s other classes.

At this stage, there’s a big issue with my building anything nice and well-formed to show off to the public: I plan on running this script once and once only. I suspect that building a proper-looking set of well-formed threaded comments will be enough work in itself, let alone considering how to link up user names and IDs and all that jazz. But even if I just leave these files on my hard drive, unformatted, I’ve got over the biggest barrier. I’ve grabbed everything from LiveJournal that I wanted to. I’ve defeated one more silo, and I’m now free to fold, spindle, and mutilate these files in my own time.


  1. Plus ça change, am I right? 

  2. Again, if you’re not me (and therefore not emotionally attached to that stuff), it’s actively horrid to read. 

  3. Approximately 1000 requests per hour, as far as I can tell. 

  4. And before you ask, no, I don’t think that would provide us with any shortcuts. I have liberally cribbed code from this file to try to automate this whole business, however.