Automating the production of .epub files

You ever get that feeling where you think, “I wonder if I can do x?” And then you spend a good day working out how to do it, and document it all in a blog post?

No? Just me? OK.

I have a little bit of writing that I keep on this site. I vacillate between being proud of it and being ashamed of it. Perhaps over time it’ll slowly get bigger. In the meantime, wouldn’t it be nice if you could download the writing as an epub? It would, wouldn’t it?

What’s in an .epub?

So according to wikipedia, an epub consists of:

And these files are all zipped up into one package. Sounds pretty logical. Let’s take one apart!

First, we grab an epub:

You’ll do. Source

Let’s rename it to a .zip and see if we can open it:

Oops

So it seems like OS X’s default archive utility doesn’t like this zipped epub. However, when I unzip it with The Unarchiver, everything runs smoothly. No idea what that’s about.

OK, what do we see when we unzip everything?

A promising start

It looks like we have files in the following structure:

epub.zip
+ META-INF
| + container.xml
+ mimetype
+ OEBPS
  + @export@sunsite@users@gutenbackend@cache@epub@1257@1257-cover.png
  + @public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-0.htm.html
  + @public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html
  + @public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-2.htm.html
  + ...<69 more documents like this skipped>
  + 0.css
  + 1.css
  + content.opf
  + pgepub.css
  + toc.ncx
  + wrap0000.html

OK. Let’s go through each of these files and work out what’s in each of them. Combined with the wikipedia article above, that should give us an idea of what goes into the epub.

container.xml

This is a really basic file that just tells the epub reader where to find the juicy stuff. Check it out:

<?xml version='1.0' encoding='utf-8'?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>

That’s hardly anything! In fact, this is a stock file and varies very little from file to file.
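To make that concrete, here’s a small sketch of what a reader might do with this file: parse it and pull out the full-path of the rootfile. It uses REXML, which comes bundled with Ruby. The helper name rootfile_path is made up for illustration, and real readers are no doubt more careful than this.

```ruby
require "rexml/document"

# The container.xml from the epub above, inlined so the sketch runs on its own.
CONTAINER_XML = <<~XML
  <?xml version='1.0' encoding='utf-8'?>
  <container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
    <rootfiles>
      <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
    </rootfiles>
  </container>
XML

# Hypothetical helper: return the path of the package (.opf) file,
# or nil when no rootfile element is present.
def rootfile_path(container_xml)
  doc = REXML::Document.new(container_xml)
  # Walk the tree and stop at the first element whose local name is "rootfile".
  rootfile = doc.root.find_first_recursive { |el| el.name == "rootfile" }
  rootfile && rootfile.attributes["full-path"]
end

puts rootfile_path(CONTAINER_XML)
```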

mimetype

If you thought container.xml was bare, you’re in for a surprise here. This is literally just our mimetype:

application/epub+zip
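One detail the listing doesn’t show: the EPUB container spec wants mimetype to be the first entry in the zip, stored without compression, so that the file type can be read from the first few bytes of the archive. Here’s a rough Ruby sketch that checks exactly that by reading the zip’s first local file header. The helper name mimetype_entry_ok? is my own invention, and the sample bytes are built by hand rather than taken from the Gutenberg file.

```ruby
EPUB_MIMETYPE = "application/epub+zip"

# Hypothetical helper: true when the archive starts with an uncompressed
# entry named "mimetype" whose content is the EPUB media type.
def mimetype_entry_ok?(bytes)
  header = bytes.byteslice(0, 30)
  return false if header.nil? || header.bytesize < 30

  # Fixed-size part of a zip local file header (30 bytes, little-endian).
  signature, _version, _flags, compression, _time, _date,
    _crc, _packed_size, _size, name_length, extra_length =
    header.unpack("a4 v5 V3 v2")

  name = bytes.byteslice(30, name_length)
  body = bytes.byteslice(30 + name_length + extra_length, EPUB_MIMETYPE.bytesize)

  signature == "PK\x03\x04".b &&
    compression.zero? && # 0 means "stored", i.e. no compression
    name == "mimetype" &&
    body == EPUB_MIMETYPE
end

# Hand-built header for a stored, 20-byte "mimetype" entry. The CRC is left
# as 0 because this sketch doesn't check it.
sample = ["PK\x03\x04".b, 20, 0, 0, 0, 0, 0, 20, 20, 8, 0].pack("a4 v5 V3 v2") +
         "mimetype" + EPUB_MIMETYPE

puts mimetype_entry_ok?(sample)
```

To try it on a real file, something like mimetype_entry_ok?(File.binread("book.epub")) should do, where book.epub is whatever epub you have lying around.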

OEBPS

This is where the action happens. Those files starting with @export or @public definitely look like they’ve been exported by some kind of automatic publishing tool. The cover is just the PNG cover of the book, while the html files are the individual chapters of the book. For example, here’s the start of the preface, taken from the file ending in 1257-h-2.htm.html:

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta name="generator" content="HTML Tidy for HTML5 for Linux version 5.6.0"/>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
<meta http-equiv="Content-Style-Type" content="text/css"/>
<title>The Project Gutenberg eBook of The Three Musketeers, by Alexandre Dumas, Père</title>




<link href="0.css" rel="stylesheet" type="text/css"/>
<link href="1.css" rel="stylesheet" type="text/css"/>
<link href="pgepub.css" rel="stylesheet" type="text/css"/>
<meta name="generator" content="Ebookmaker 0.11.9 by Project Gutenberg"/>
</head>
<body class="x-ebookmaker"><div class="chapter" id="pgepubid00002">
<h2><a id="pref01"/>AUTHOR’S PREFACE</h2>
<p class="pfirst"><span class="dropcap c6">I</span><span class="dropspan">n</span> which it is proved that, notwithstanding their names’ ending in <i>os</i> and <i>is</i>, the heroes of the story which we are about to have the honor to relate to our readers have nothing mythological about them.</p>
<p class="p2">A short time ago, while making researches in the Royal Library for my History of Louis XIV...

What about those files at the end? We have three CSS files, plus an opf, an ncx, and one weird html file. The weird html file is nothing, and the css files are just styling info for the book. The opf and ncx are more interesting though.

content.opf

This is an Open Packaging Format file, and it’s here that we define all the interesting stuff about the book. Once again it’s XML:

<?xml version='1.0' encoding='UTF-8'?>

<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata>
    <dc:rights>Public domain in the USA.</dc:rights>
    <dc:identifier opf:scheme="URI" id="id">http://www.gutenberg.org/1257</dc:identifier>
    <dc:creator opf:file-as="Dumas, Alexandre">Alexandre Dumas</dc:creator>
    <dc:title>The Three Musketeers</dc:title>
    <dc:language xsi:type="dcterms:RFC4646">en</dc:language>
    <dc:subject>Historical fiction</dc:subject>
    <dc:subject>France -- History -- Louis XIII, 1610-1643 -- Fiction</dc:subject>
    <dc:subject>Adventure and adventurers -- Fiction</dc:subject>
    <dc:subject>Swordsmen -- Fiction</dc:subject>
    <dc:date opf:event="publication">1998-03-01</dc:date>
    <dc:date opf:event="conversion">2021-09-07T17:00:12.890029+00:00</dc:date>
    <dc:source>https://www.gutenberg.org/files/1257/1257-h/1257-h.htm</dc:source>
    <meta name="cover" content="item1"/>
  </metadata>
  <manifest>
    <!--Image: 1200 x 1800 size=41509 -->
    <item href="@export@sunsite@users@gutenbackend@cache@epub@1257@1257-cover.png" id="item1" media-type="image/png"/>
    <item href="pgepub.css" id="item2" media-type="text/css"/>
    <item href="0.css" id="item3" media-type="text/css"/>
    <item href="1.css" id="item4" media-type="text/css"/>
    <!--Chunk: size=2934 Split on div.chapter-->
    <item href="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-0.htm.html" id="item5" media-type="application/xhtml+xml"/>
    <!--Chunk: size=11517 Split on div.chapter-->
    <item href="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html" id="item6" media-type="application/xhtml+xml"/>
    <!--...plus a bunch more...-->
    <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
    <item href="wrap0000.html" id="coverpage-wrapper" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="coverpage-wrapper" linear="yes"/>
    <itemref idref="item5" linear="yes"/>
    <itemref idref="item6" linear="yes"/>
    <!--...plus a bunch more...-->
  </spine>
  <guide>
    <reference type="toc" title="CONTENTS" href="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html#pgepubid00001"/>
    <reference type="cover" title="Cover" href="wrap0000.html"/>
  </guide>
</package>

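So it looks like we have four sections: a metadata block describing the book (title, author, language, dates), a manifest listing every file in the package, a spine giving the reading order by referring to manifest ids, and a guide pointing at landmarks like the cover and the table of contents.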
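The manifest and the spine work as a pair: the spine only holds ids, and each id has to be looked up in the manifest to find the actual file. Here’s a rough sketch of that lookup using REXML. The OPF below is a cut-down stand-in with placeholder hrefs, and reading_order and child_elements are hypothetical helper names.

```ruby
require "rexml/document"

# A trimmed-down content.opf with the same shape as the one above.
# The chapter hrefs are placeholders, not the real Gutenberg file names.
OPF = <<~XML
  <?xml version='1.0' encoding='UTF-8'?>
  <package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
    <manifest>
      <item href="chapter-0.html" id="item5" media-type="application/xhtml+xml"/>
      <item href="chapter-1.html" id="item6" media-type="application/xhtml+xml"/>
      <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
      <item href="wrap0000.html" id="coverpage-wrapper" media-type="application/xhtml+xml"/>
    </manifest>
    <spine toc="ncx">
      <itemref idref="coverpage-wrapper" linear="yes"/>
      <itemref idref="item5" linear="yes"/>
      <itemref idref="item6" linear="yes"/>
    </spine>
  </package>
XML

# Direct child elements of `parent` with the given local name.
def child_elements(parent, name)
  parent.elements.select { |el| el.name == name }
end

# Hypothetical helper: list the content files in reading order.
def reading_order(opf_xml)
  package = REXML::Document.new(opf_xml).root
  manifest = child_elements(package, "manifest").first
  spine = child_elements(package, "spine").first

  # Map each manifest id to the file it names.
  href_by_id = child_elements(manifest, "item").to_h do |item|
    [item.attributes["id"], item.attributes["href"]]
  end

  # The spine lists ids in reading order, so look each one up.
  child_elements(spine, "itemref").map do |ref|
    href_by_id.fetch(ref.attributes["idref"])
  end
end

puts reading_order(OPF)
```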

toc.ncx

Wikipedia informs me that the table of contents file (an ncx file, short for Navigation Control file for XML) is “traditionally named toc.ncx”, and it looks like the following:

<?xml version='1.0' encoding='UTF-8'?>

<!DOCTYPE ncx PUBLIC '-//NISO//DTD ncx 2005-1//EN' 'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd'>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">
  <head>
    <meta name="dtb:uid" content="http://www.gutenberg.org/1257"/>
    <meta name="dtb:depth" content="1"/>
    <meta name="dtb:generator" content="Ebookmaker 0.11.9 by Project Gutenberg"/>
    <meta name="dtb:totalPageCount" content="0"/>
    <meta name="dtb:maxPageNumber" content="0"/>
  </head>
  <docTitle>
    <text>The Three Musketeers</text>
  </docTitle>
  <navMap>
    <navPoint id="np-1" playOrder="1">
      <navLabel>
        <text>The Three Musketeers</text>
      </navLabel>
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-0.htm.html#pgepubid00000"/>
    </navPoint>
    <navPoint id="np-2" playOrder="2">
      <navLabel>
        <text>CONTENTS</text>
      </navLabel>
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html#pgepubid00001"/>
    </navPoint>
    <navPoint id="np-3" playOrder="3">
      <navLabel>
        <text>AUTHOR’S PREFACE</text>
      </navLabel>
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-2.htm.html#pgepubid00002"/>
    </navPoint>
    <navPoint id="np-4" playOrder="4">
      <navLabel>
        <text>The Three Musketeers</text>
      </navLabel>
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-3.htm.html#pgepubid00003"/>
    </navPoint>
    <navPoint id="np-5" playOrder="5">
      <navLabel>
        <text>1 THE THREE PRESENTS OF D’ARTAGNAN THE ELDER</text>
      </navLabel>
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-4.htm.html#pgepubid00004"/>
    </navPoint>
    <!-- ... -->
    <navPoint id="np-72" playOrder="72">
      <navLabel>
        <text>EPILOGUE</text>
      </navLabel>
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-71.htm.html#pgepubid00071"/>
    </navPoint>
  </navMap>
</ncx>
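
For a book with more than one chapter, the navMap is just one navPoint per chapter, with playOrder counting up from 1. A rough sketch of generating that, assuming a hypothetical list of [title, file] pairs and titles that don’t need any XML escaping:

```ruby
# Hypothetical chapter list; each entry is [title, file inside OEBPS].
chapters = [
  ["CONTENTS", "contents.html"],
  ["EPILOGUE", "epilogue.html"]
]

# Build one <navPoint> per chapter, numbering them from 1.
# Titles and file names are interpolated as-is, so they are assumed XML-safe.
def nav_points(chapters)
  chapters.each_with_index.map do |(title, src), index|
    order = index + 1
    <<~XML
      <navPoint id="np-#{order}" playOrder="#{order}">
        <navLabel>
          <text>#{title}</text>
        </navLabel>
        <content src="#{src}"/>
      </navPoint>
    XML
  end.join
end

puts nav_points(chapters)
```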

Building our own epub

OK! So, we’ve gone through what makes an epub. We should therefore be able to make our own.

I’m going to integrate this with my current static site build, which uses nanoc - at this stage all I have on the site are short stories, so we can get away with a single file/chapter per epub. This should make it pretty easy to build these.

Our process will need to look like the following:

  1. Identify all the pieces of writing we have on the site.
  2. Make an epub for each:
     a. Parse the story (which could be either markdown or haml) into html.
     b. Create a basic TOC.
     c. Create a basic content.opf file.
     d. Package + zip it all up.

Let’s see how easy this will be to do!

Step 1: Add epub items

Nanoc builds pages and files in the output site by linking each input file to an output file through a set of rules. So this post I’m writing right now is a markdown file which will get converted to an html file (with the appropriate wrapper) when I render it. I still want each piece of writing to turn up on the site, so I need to create a new, duplicate item to produce the relevant epub.

Thankfully, we can do this with @items.create. I’m going to do this in the preprocess step:

preprocess do
  # ...a bunch of other stuff that my site needs...

  #---------------------------------------
  # Create epubs for each piece of writing
  @items
    .find_all("/writing/**/*.{md,haml}")
    .each do |i|

      # Don't create epubs of index files - just of others
      next if i.identifier =~ /index\.(haml|md)$/

      # /writing/a-story.md becomes /writing/a-story.epub.md
      new_identifier = i.identifier.to_s.gsub(/(haml|md)$/, "epub.\\1")
      @items.create(i.raw_content, {}, new_identifier)
    end
end

So this will take every piece of writing (at least every markdown and haml file in the writing folder) and create an epub equivalent. I’m also going to build a renderer, which looks like this:

compile '/writing/**/*.epub.{haml,md}' do
  write ext: "epub"
end

Right now all this will do is take the base text and output it to a file that has the same name as the original, but with the extension “.epub”. So for example, this will output the item /writing/a-story.epub.md to the file /writing/a-story.epub.
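
To see what that gsub does to an identifier, here’s the substitution on its own, outside nanoc. The helper name epub_identifier is just for illustration.

```ruby
# Same substitution as in the preprocess block, applied to plain strings:
# slip "epub." in front of the final haml/md extension.
def epub_identifier(identifier)
  identifier.gsub(/(haml|md)$/, "epub.\\1")
end

puts epub_identifier("/writing/a-story.md")   # prints /writing/a-story.epub.md
puts epub_identifier("/writing/a-poem.haml")  # prints /writing/a-poem.epub.haml
```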

Actually, all of the files on my site tend to be output as index.html files within folders - so this story would actually be located at /writing/a-story/index.html. So I’m going to change the output code so that the epub sits alongside the file itself:

compile '/writing/**/*.epub.{haml,md}' do

  # Write to a path which means the file goes alongside the index.html of the
  # story itself

  # Remove the haml or md extension
  output_filename = File.basename(item.identifier.without_ext)

  # Remove all the extensions - haml/md and epub
  output_path = item.identifier.without_exts

  write "#{output_path}/#{output_filename}"
end

OK! We’ve now got an epub step in our processing, and it’s outputting to the right place. It’s a bit more complex to build the epubs themselves.
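
If the without_ext / without_exts juggling is hard to follow, here’s the same path arithmetic on a plain string. This is only a stand-in: it uses File helpers instead of nanoc’s identifier methods, and epub_output_path is a name I’ve made up.

```ruby
# Plain-string stand-in for the path arithmetic in the compile rule.
# Nanoc's identifier methods are replaced here by File helpers.
def epub_output_path(identifier)
  dir = File.dirname(identifier)                     # e.g. "/writing"
  slug = File.basename(identifier).split(".").first  # e.g. "a-story"
  "#{dir}/#{slug}/#{slug}.epub"
end

puts epub_output_path("/writing/a-story.epub.md")  # prints /writing/a-story/a-story.epub
```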

Step 2: Build an epub

The usual way we convert something from one format to another in nanoc is through filters. It’s really easy to build a filter. Here’s an example that appears on Nanoc’s site:

class CensorFilter < Nanoc::Filter
  identifier :censor

  def run(content, params = {})
    content.gsub('Nanoc sucks', 'Nanoc rocks')
  end
end

So this would then allow us to use the filter in our Rules:

compile '/some/glob' do
  filter :censor
  # ...
end

This is a text-to-text filter - we take in text, and we return text. But you can also create a text-to-binary filter, which looks more like the following:

class EpubFilter < Nanoc::Filter
  identifier :epub

  type :text => :binary

  def run(content, params = {})
    # This function will receive text content as its first argument, and should
    # write its result to the file path returned by `output_filename`.
  end
end

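So what do we need to do here? We need to create each of the files we saw above (mimetype, container.xml, a table of contents, the content itself, and content.opf) and zip them all up.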

Step 2d: Zipping it all up

If you’ve been following along at home, right now you’ll be going “Now hold on Jan, this is the last step! We should start at 2a!” And you’re right. But it turns out that when you search for ruby zip gems, you quickly find yourself checking out rubyzip, and rubyzip lets us assemble our zipped folder in situ. For example, here’s some code that produces our mimetype file in a zipped folder:

require "zip"

class EpubFilter < Nanoc::Filter
  identifier :epub

  type :text => :binary

  # Receives the item's text content; the epub goes to `output_filename`
  def run(content, params = {})
    Zip::File.open(output_filename, create: true) do |zipfile|
      zipfile.get_output_stream("mimetype"){ |f| f.write "application/epub+zip" }
    end
  end
end

So that means we can fold steps 2a through 2c into 2d.

It’s nice and easy to add container.xml to this file as well:

class EpubFilter < Nanoc::Filter
  identifier :epub

  type :text => :binary

  # Receives the item's text content; the epub goes to `output_filename`
  def run(content, params = {})
    Zip::File.open(output_filename, create: true) do |zipfile|
      # mimetype
      zipfile.get_output_stream("mimetype"){ |f| f.write "application/epub+zip" }

      # META-INF/container.xml
      zipfile.get_output_stream("META-INF/container.xml") do |f|
        f.write <<~XML
          <?xml version='1.0' encoding='utf-8'?>
          <container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
            <rootfiles>
              <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
            </rootfiles>
          </container>
        XML
      end
    end
  end
end

We still need to add the table of contents, and the file itself - let’s split them out into their own functions. Here’s the final look:

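require "zip"
require "redcarpet" # used by generate_content_html for markdown stories
require "haml"      # used by generate_content_html for haml stories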

class EpubFilter < Nanoc::Filter
  identifier :epub

  type :text => :binary

  def run(content, params = {})

    Zip::File.open(output_filename, create: true) do |zipfile|
      # mimetype
      zipfile.get_output_stream("mimetype"){ |f| f.write generate_mimetype }

      # META-INF/container.xml
      zipfile.get_output_stream("META-INF/container.xml"){ |f| f.write generate_container }

      # OEBPS/toc.ncx
      zipfile.get_output_stream("OEBPS/toc.ncx"){ |f| f.write generate_toc }

      # OEBPS/content.html
      zipfile.get_output_stream("OEBPS/content.html"){ |f| f.write generate_content_html(content) }

      # OEBPS/content.opf
      zipfile.get_output_stream("OEBPS/content.opf"){ |f| f.write generate_content_opf }
    end
  end

  # Generator functions --------------------------------------------------------
  def generate_mimetype
    "application/epub+zip"
  end

  def generate_container
    return <<-end
<?xml version='1.0' encoding='utf-8'?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
    end
  end

  def generate_toc
    return <<-end
<?xml version='1.0' encoding='UTF-8'?>

<!DOCTYPE ncx PUBLIC '-//NISO//DTD ncx 2005-1//EN' 'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd'>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">
  <head>  
  </head>
  <docTitle>
    <text>#{item[:title]}</text>
  </docTitle>
  <navMap>
    <navPoint id="np-1" playOrder="1">
      <navLabel>
        <text>#{item[:title]}</text>
      </navLabel>
      <content src="content.html"/>
    </navPoint>
  </navMap>
</ncx>
    end
  end

  def generate_content_html(content)

    rendered_content = case item.identifier.ext
    when "md"
      markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML)
      markdown.render(content)
    when "haml"
      Haml::Engine.new(content).render
    else
      raise "Don't know how to convert #{item.identifier.ext}"
    end

    return <<-end
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
  <meta http-equiv="Content-Style-Type" content="text/css"/>
  <title>#{item[:title]}</title>
</head>
<body>#{rendered_content}</body>
</html>
    end
  end

  def generate_content_opf
    return <<-end
<?xml version='1.0' encoding='UTF-8'?>

<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata>
    <dc:creator opf:file-as="Ruzicka, Jan-Yves">Jan-Yves Ruzicka</dc:creator>
    <dc:title>#{item[:title]}</dc:title>
    <dc:language xsi:type="dcterms:RFC4646">en</dc:language>
    <dc:date opf:event="publication">#{item[:date]}</dc:date>
  </metadata>
  <manifest>
    <item href="content.html" id="content" media-type="application/xhtml+xml"/>
    <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="content" linear="yes"/>
  </spine>
</package>
    end
  end
end
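
One caveat about the heredoc templates: #{item[:title]} is dropped into the XML as-is, so a title containing & or < would produce a malformed file. A possible fix, sketched here with a hypothetical xml_text helper, is to escape values with Ruby’s built-in String#encode before interpolating them. The title is invented for the example.

```ruby
# Hypothetical helper: escape &, < and > so a value is safe as XML text.
def xml_text(value)
  value.to_s.encode(xml: :text)
end

title = "Salt & Vinegar <a love story>"
puts "<dc:title>#{xml_text(title)}</dc:title>"
```

Inside the filter, that would mean writing #{xml_text(item[:title])} wherever the templates currently use #{item[:title]}.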

And now we’re creating epubs as we go! Super-easy! OK, final step: adding a download link into our site.

Linking to the epub

If you visit a story page right now, it looks like the following:

We’re taking a minimalist approach here

How about we put a little button just under the title, which allows the viewer to download the epub version of the story?

Nanoc uses layouts to wrap content in boilerplate html. Right now, our writing uses a very basic “default” layout, that provides the header, sidebar, footer, all the rest. Let’s quickly make a custom layout for our writing:

# layouts/writing.haml

=render "/default.*" do

  %p
    =link_to "Download epub", "#", class: "button"

  =yield

This layout will render the default layout, as normal, but rather than just spitting out the content of the story, it’ll put a little “download” button at the top. The button won’t do anything yet, but it’ll look pretty (thanks to some default rendering).

Inside our rules, we need to set things up to ensure the stories use this layout [1]:

compile '/writing/**/*.{haml,md}' do

  # Filter
  case item.identifier.ext
  when "haml"
    filter :haml
  when "md"
    filter :redcarpet, renderer: MarkdownOptions::renderer, options: MarkdownOptions::options
  else
    raise RuntimeError, "Don't know how to render #{item.identifier}"
  end

  layout '/writing.*'
end

Now, let’s hook things up! Actually, you know what? I keep on referring to the epub’s location - why don’t we define it in the preprocessing step and make it an attribute of both the story, and its epub equivalent:

preprocess do
  # All the other stuff...

  #---------------------------------------
  # Create epubs for each piece of writing
  @items
    .find_all("/writing/**/*.{md,haml}")
    .each do |i|

      # Don't create epubs of index files - just of others
      next if i.identifier =~ /index\.(haml|md)$/

      # Identifier for the epub element
      new_identifier = i.identifier.to_s.gsub(/(haml|md)$/, "epub.\\1")

      # Final URL for the epub
      i[:epub_location] = 
        i.identifier.without_exts +
        "/" +
        File.basename(i.identifier.without_ext) +
        ".epub"

      @items.create(i.raw_content, i.attributes, new_identifier)       
    end
end

Now we’re defining the final URL where the epub should end up, and setting it to the :epub_location attribute of both items. And that means we can simplify the epub rendering step in our Rules:

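# Keep this rule above the general /writing/ rule: nanoc uses the first
# compile rule that matches, and the general glob also matches *.epub.md
compile '/writing/**/*.epub.{haml,md}' do
  filter :epub
  write item[:epub_location]
end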

And finally, we can link up that button in our layout:

# layouts/writing.haml

=render "/default.*" do

  %p
    =link_to "Download epub", item[:epub_location], class: "button"

  =yield

And now, look what we have!

Success all around.


  1. I’m using some custom RedCarpet options, which is why I have that interesting set of parameters going to the redcarpet renderer. 
