Automating the production of .epub files

Published
2021-11-10
Tagged

You ever get that feeling where you think, “I wonder if I can do x?” And then you spend a good day working out how to do it, and document it all in a blog post?

No? Just me? OK.

I have a little bit of writing that I keep on this site. I vacillate between being proud of it and being ashamed of it. Perhaps over time it’ll slowly get bigger. In the meantime, wouldn’t it be nice if you could download the writing as an epub? It would, wouldn’t it?

What’s in an .epub?

So according to wikipedia, an epub consists of:

  • XHTML or DTBook files to represent the text and structure of the document,
  • a subset of CSS to provide layout and formatting,
  • XML to create the document manifest, table of contents, and metadata.

And these files are all zipped up into one package. Sounds pretty logical. Let’s take one apart!

First, we grab an epub:

You’ll do. Source

Let’s rename it to a .zip and see if we can open it:

Oops

So it seems like OS X’s default archive utility doesn’t like this zipped epub. However, when I unzip it with The Unarchiver, everything runs smoothly. No idea what that’s about.

OK, what do we see when we unzip everything?

A promising start

It looks like we have files in the following structure:

1
epub.zip
2
+ META-INF
3
| + container.xml
4
+ mimetype
5
+ OEBPS
6
  + @export@sunsite@users@gutenbackend@cache@epub@1257@1257-cover.png
7
  + @public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-0.htm.html
8
  + @public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html
9
  + @public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-2.htm.html
10
  + ...<69 more documents like this skipped>
11
  + 0.css
12
  + 1.css
13
  + content.opf
14
  + pgepub.css
15
  + toc.ncx
16
  + wrap0000.html

OK. Let’s go through each of these files and work out what’s in each of them. Combined with the wikipedia article above, that should give us an idea of what goes into the epub.

container.xml

This is a really basic file that really just tells the epub reader where to find the juicy stuff. Check it out:

1
<?xml version='1.0' encoding='utf-8'?>
2
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
3
  <rootfiles>
4
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
5
  </rootfiles>
6
</container>

That’s hardly anything! In fact, this is a stock file and varies very little from file to file.

mimetype

If you thought container.xml was bare, you’re in for a surprise here. This is literally just our mimetype:

1
application/epub+zip

OEBPS

This is where the action happens. Those files starting with @export or @public definitely look like they’ve been automatically exported from some kind of automatic publishing tool. The cover is just the PNG cover of the book, while the html files are the individual chapters of the book. For example, here’s the start of the preface, taken from h2.htm.html:

1
<?xml version='1.0' encoding='utf-8'?>
2
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
3
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
4
<head>
5
<meta name="generator" content="HTML Tidy for HTML5 for Linux version 5.6.0"/>
6
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
7
<meta http-equiv="Content-Style-Type" content="text/css"/>
8
<title>The Project Gutenberg eBook of The Three Musketeers, by Alexandre Dumas, Père</title>
9
10
<link href="0.css" rel="stylesheet" type="text/css"/>
11
<link href="1.css" rel="stylesheet" type="text/css"/>
12
<link href="pgepub.css" rel="stylesheet" type="text/css"/>
13
<meta name="generator" content="Ebookmaker 0.11.9 by Project Gutenberg"/>
14
</head>
15
<body class="x-ebookmaker"><div class="chapter" id="pgepubid00002">
16
<h2><a id="pref01"/>AUTHOR’S PREFACE</h2>
17
<p class="pfirst"><span class="dropcap c6">I</span><span class="dropspan">n</span> which it is proved that, notwithstanding their names’ ending in <i>os</i> and <i>is</i>, the heroes of the story which we are about to have the honor to relate to our readers have nothing mythological about them.</p>
18
<p class="p2">A short time ago, while making researches in the Royal Library for my History of Louis XIV...

What about those files at the end? We have three CSS files, plus an opf, an ncx, and one weird html file. The weird html file is nothing, and the css files are just styling info for the book. The opf and ncx are more interesting though.

content.opf

This is an Open Packaging Format file, and it’s here that we define all the interesting stuff about the book. Once again it’s an xml subset:

1
<?xml version='1.0' encoding='UTF-8'?>
2
3
<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
4
  <metadata>
5
    <dc:rights>Public domain in the USA.</dc:rights>
6
    <dc:identifier opf:scheme="URI" id="id">http://www.gutenberg.org/1257</dc:identifier>
7
    <dc:creator opf:file-as="Dumas, Alexandre">Alexandre Dumas</dc:creator>
8
    <dc:title>The Three Musketeers</dc:title>
9
    <dc:language xsi:type="dcterms:RFC4646">en</dc:language>
10
    <dc:subject>Historical fiction</dc:subject>
11
    <dc:subject>France -- History -- Louis XIII, 1610-1643 -- Fiction</dc:subject>
12
    <dc:subject>Adventure and adventurers -- Fiction</dc:subject>
13
    <dc:subject>Swordsmen -- Fiction</dc:subject>
14
    <dc:date opf:event="publication">1998-03-01</dc:date>
15
    <dc:date opf:event="conversion">2021-09-07T17:00:12.890029+00:00</dc:date>
16
    <dc:source>https://www.gutenberg.org/files/1257/1257-h/1257-h.htm</dc:source>
17
    <meta name="cover" content="item1"/>
18
  </metadata>
19
  <manifest>
20
    <!--Image: 1200 x 1800 size=41509 -->
21
    <item href="@export@sunsite@users@gutenbackend@cache@epub@1257@1257-cover.png" id="item1" media-type="image/png"/>
22
    <item href="pgepub.css" id="item2" media-type="text/css"/>
23
    <item href="0.css" id="item3" media-type="text/css"/>
24
    <item href="1.css" id="item4" media-type="text/css"/>
25
    <!--Chunk: size=2934 Split on div.chapter-->
26
    <item href="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-0.htm.html" id="item5" media-type="application/xhtml+xml"/>
27
    <!--Chunk: size=11517 Split on div.chapter-->
28
    <item href="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html" id="item6" media-type="application/xhtml+xml"/>
29
    <!--...plus a bunch more...-->
30
    <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
31
    <item href="wrap0000.html" id="coverpage-wrapper" media-type="application/xhtml+xml"/>
32
  </manifest>
33
  <spine toc="ncx">
34
    <itemref idref="coverpage-wrapper" linear="yes"/>
35
    <itemref idref="item5" linear="yes"/>
36
    <itemref idref="item6" linear="yes"/>
37
    <!--...plus a bunch more...-->
38
  </spine>
39
  <guide>
40
    <reference type="toc" title="CONTENTS" href="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html#pgepubid00001"/>
41
    <reference type="cover" title="Cover" href="wrap0000.html"/>
42
  </guide>
43
</package>

So it looks like we have:

  • A bunch of metadata (spoiler alert: we’re not going to define most of this for any automatic epub bundling we do).
  • A manifest of all of the files, which are assigned IDs and media types.
  • A “spine”, which lists “all the XHTML content documents in their linear reading order”. This also has a reference to the table of contents, which we’ll get to in a bit.

toc.ncx

Wikipedia informs me that the Table of Contents file (which is a navigational control file for XML, or ncx file) is “traditionally named toc.ncx”, and it looks like the following:

1
<?xml version='1.0' encoding='UTF-8'?>
2
3
<!DOCTYPE ncx PUBLIC '-//NISO//DTD ncx 2005-1//EN' 'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd'>
4
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">
5
  <head>
6
    <meta name="dtb:uid" content="http://www.gutenberg.org/1257"/>
7
    <meta name="dtb:depth" content="1"/>
8
    <meta name="dtb:generator" content="Ebookmaker 0.11.9 by Project Gutenberg"/>
9
    <meta name="dtb:totalPageCount" content="0"/>
10
    <meta name="dtb:maxPageNumber" content="0"/>
11
  </head>
12
  <docTitle>
13
    <text>The Three Musketeers</text>
14
  </docTitle>
15
  <navMap>
16
    <navPoint id="np-1" playOrder="1">
17
      <navLabel>
18
        <text>The Three Musketeers</text>
19
      </navLabel>
20
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-0.htm.html#pgepubid00000"/>
21
    </navPoint>
22
    <navPoint id="np-2" playOrder="2">
23
      <navLabel>
24
        <text>CONTENTS</text>
25
      </navLabel>
26
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html#pgepubid00001"/>
27
    </navPoint>
28
    <navPoint id="np-3" playOrder="3">
29
      <navLabel>
30
        <text>AUTHOR’S PREFACE</text>
31
      </navLabel>
32
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-2.htm.html#pgepubid00002"/>
33
    </navPoint>
34
    <navPoint id="np-4" playOrder="4">
35
      <navLabel>
36
        <text>The Three Musketeers</text>
37
      </navLabel>
38
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-3.htm.html#pgepubid00003"/>
39
    </navPoint>
40
    <navPoint id="np-5" playOrder="5">
41
      <navLabel>
42
        <text>1 THE THREE PRESENTS OF D’ARTAGNAN THE ELDER</text>
43
      </navLabel>
44
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-4.htm.html#pgepubid00004"/>
45
    </navPoint>
46
    <!-- ... -->
47
    <navPoint id="np-72" playOrder="72">
48
      <navLabel>
49
        <text>EPILOGUE</text>
50
      </navLabel>
51
      <content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-71.htm.html#pgepubid00071"/>
52
    </navPoint>
53
  </navMap>
54
</ncx>

Building our own epub

OK! So, we’ve gone through what makes an epub. We should therefore be able to make our own.

I’m going to integrate this with my current static site build, which uses nanoc - at this stage all I have on the site are short stories, so we can get away with a single file/chapter per epub. This should make it pretty easy to build these.

Our process will need to look like the following:

  1. Identify all the pieces of writing we have on the site.
  2. Make an epub for each a. Parse the story (which could be either markdown or haml) into html. b. Create a basic TOC c. Create basic content.opf file d. Package + zip it all up

Let’s see how easy this will be to do!

Step 1: Add epub items

Nanoc builds pages and files in the output site by linking each input file to an output file through a set of rules. So this post I’m writing right now is a markdown file which will get converted to an html file (with the appropriate wrapper) when I render it. I still want each piece of writing to turn up on the site, so I need to create a new, duplicate item to produce the relevant epub.

Thankfully, we can do this with @items.create. I’m going to do this in the preprocess step:

1
preprocess do 
2
  # ...a bunch of other stuff that my site needs...
3
4
  #---------------------------------------
5
  # Create epubs for each piece of writing
6
  @items
7
    .find_all("/writing/**/*.{md,haml}")
8
    .each do |i|
9
10
      # Don't create epubs of index files - just of others
11
      next if i.identifier =~ /index.(haml|md)$/
12
13
      new_identifier = i.identifier.to_s.gsub(/(haml|md)$/, "epub.\\1")
14
      @items.create(i.raw_content, {}, new_identifier)       
15
    end

So this will take every piece of writing (at least every markdown and haml file in the writing folder) and create an epub equivalent. I’m also going to build a renderer, which looks like this:

1
compile '/writing/**/*.epub.{haml,md}' do
2
  write ext: "epub"
3
end

Right now all this will do is take the base text and output to a file that has the same name as the original, but with the extension “.epub”. So for example, this will output /writing/a-story.md.epub to the file /writing/a-story.epub.

Actually, all of the files on my site tend to be output as index.html files within folders - so this story would actually be located at /writing/a-story/index.html. So I’m going to change the output code so that the epub sits alongside the file itself:

1
compile '/writing/**/*.epub.{haml,md}' do
2
3
  # Write to a path which means the file goes alongside the index.html of the
4
  # story itself
5
6
  # Remove the haml or md extension
7
  output_filename = File.basename(item.identifier.without_ext)
8
9
  # Remove all the extensions - haml/md and epub
10
  output_path = item.identifier.without_exts
11
12
  write "#{output_path}/#{output_filename}"
13
end

OK! We’ve now got an epub step in our processing, and it’s outputting to the right place. It’s a bit more complex to build the epubs themselves.

Step 2: Build an epub

The usual way we convert something from one format to another in nanoc is through filters. It’s really easy to build a filter. Here’s an example that appears on Nanoc’s site:

1
class CensorFilter < Nanoc::Filter
2
  identifier :censor
3
4
  def run(content, params = {})
5
    content.gsub('Nanoc sucks', 'Nanoc rocks')
6
  end
7
end

So this would then allow us to use the filter in our Rules:

1
compile '/some/glob' do
2
  filter :censor
3
  # ...
4
end

This is a text-to-text filter - we take in text, and we return text. But you can also create a text-to-binary filter, which looks more like the following:

1
class EpubFilter < Nanoc::Filter
2
  identifier :epub
3
4
  type :text => :binary
5
6
  def run(content, params = {})
7
    # This function will receive text content as its first argument, and should
8
    # produce the result to the value `output_filename`.
9
  end
10
end

So what do we need to do here? We need to create to:

  • Create a temporary folder (with the relevant META-INF and OEBPS folders)
  • Create a bunch of relevant files:
    • container.xml
    • mimetype
    • content.opf
    • toc.ncx
  • Add the rendered story content
  • And then zip the whole thing up!

Step 2d: Zipping it all up

If you’ve been following along at home, right now you’ll be going “Now hold on Jan, this is the last step! We should start at 2a!” And you’re right. But it turns out that when you search for ruby zip gems, you quickly find yourself checking out rubyzip, and rubyzip lets us assemble our zipped folder in situ. For example, here’s some code that produces our metadata file in a zipped folder:

1
require "zip"
2
3
class EpubFilter < Nanoc::Filter
4
  identifier :epub
5
6
  type :text => :binary
7
8
  # This takes them item as the param
9
  def run(content, params = {})
10
    Zip::File.open(output_filename, create: true) do |zipfile|
11
      zipfile.get_output_stream("mimetype"){ |f| f.write "application/epub+zip" }
12
    end
13
  end
14
end

So that means we can fold steps 2a through 2c into 2d.

It’s nice and easy to add container.xml to this file as well:

1
class EpubFilter < Nanoc::Filter
2
  identifier :epub
3
4
  type :text => :binary
5
6
  # This takes them item as the param
7
  def run(content, params = {})
8
    Zip::File.open(output_filename, create: true) do |zipfile|
9
      # mimetype
10
      zipfile.get_output_stream("mimetype"){ |f| f.write "application/epub_zip" }
11
12
      # META-INF/container.xml
13
      zipfile.get_output_stream("META-INF/container.xml") do |f|
14
        f.write <<-end
15
<?xml version='1.0' encoding='utf-8'?>
16
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
17
  <rootfiles>
18
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
19
  </rootfiles>
20
</container>
21
end
22
23
    end
24
  end
25
end

We still need to add the table of contents, and the file itself - let’s split them out into their own functions. Here’s the final look:

1
require "zip"
2
3
class EpubFilter < Nanoc::Filter
4
  identifier :epub
5
6
  type :text => :binary
7
8
  def run(content, params = {})
9
10
    Zip::File.open(output_filename, create: true) do |zipfile|
11
      # mimetype
12
      zipfile.get_output_stream("mimetype"){ |f| f.write generate_mimetype }
13
14
      # META-INF/container.xml
15
      zipfile.get_output_stream("META-INF/container.xml"){ |f| f.write generate_container }
16
17
      # OEBPS/toc.ncx
18
      zipfile.get_output_stream("OEBPS/toc.ncx"){ |f| f.write generate_toc }
19
20
      # OEBPS/content.html
21
      zipfile.get_output_stream("OEBPS/content.html"){ |f| f.write generate_content_html(content) }
22
23
      # OEBPS/content.opf
24
      zipfile.get_output_stream("OEBPS/content.opf"){ |f| f.write generate_content_opf }
25
    end
26
  end
27
28
  # Generator functions --------------------------------------------------------
29
  def generate_mimetype
30
    "application/epub_zip"
31
  end
32
33
  def generate_container
34
    return <<-end
35
<?xml version='1.0' encoding='utf-8'?>
36
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
37
  <rootfiles>
38
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
39
  </rootfiles>
40
</container>
41
    end
42
  end
43
44
  def generate_toc
45
    return <<-end
46
<?xml version='1.0' encoding='UTF-8'?>
47
48
<!DOCTYPE ncx PUBLIC '-//NISO//DTD ncx 2005-1//EN' 'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd'>
49
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">
50
  <head>  
51
  </head>
52
  <docTitle>
53
    <text>#{item[:title]}</text>
54
  </docTitle>
55
  <navMap>
56
    <navPoint id="np-1" playOrder="1">
57
      <navLabel>
58
        <text>#{item[:title]}</text>
59
      </navLabel>
60
      <content src="content.html"/>
61
    </navPoint>
62
  </navMap>
63
</ncx>
64
    end
65
  end
66
67
  def generate_content_html(content)
68
69
    rendered_content = case item.identifier.ext
70
    when "md"
71
      markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML)
72
      markdown.render(content)
73
    when "haml"
74
      Haml::Engine.new(content).render
75
    else
76
      raise "Don't know how to convert #{item.identifier.ext}"
77
    end
78
79
    return <<-end
80
<?xml version='1.0' encoding='utf-8'?>
81
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
82
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
83
<head>
84
  <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
85
  <meta http-equiv="Content-Style-Type" content="text/css"/>
86
  <title>#{item[:title]}</title>
87
</head>
88
<body>#{rendered_content}</body>
89
</html>
90
    end
91
  end
92
93
  def generate_content_opf
94
    return <<-end
95
<?xml version='1.0' encoding='UTF-8'?>
96
97
<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
98
  <metadata>
99
    <dc:creator opf:file-as="Ruzicka, Jan-Yves">Jan-Yves Ruzicka</dc:creator>
100
    <dc:title>#{item[:title]}</dc:title>
101
    <dc:language xsi:type="dcterms:RFC4646">en</dc:language>
102
    <dc:date opf:event="publication">#{item[:date]}</dc:date>
103
  </metadata>
104
  <manifest>
105
    <item href="content.html" id="content" media-type="application/xhtml+xml"/>
106
    <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
107
  </manifest>
108
  <spine toc="ncx">
109
    <itemref idref="content" linear="yes"/>
110
  </spine>
111
</package>
112
    end
113
  end
114
end

And now we’re creating epubs as we go! Super-easy! OK, final step: adding a download link into our site.

Linking to the epub

If you visit a story page right now, it looks like the following:

We’re taking a minimalist approach here

How about we put a little button just under the title, which allows the viewer to download the epub version of the story?

Nanoc uses layouts to wrap content in boilerplate html. Right now, our writing uses a very basic “default” layout, that provides the header, sidebar, footer, all the rest. Let’s quickly make a custom layout for our writing:

1
# layouts/writing.haml
2
3
=render "/default.*" do
4
5
  %p
6
    =link_to "Download epub", "#", class: "button"
7
8
  =yield

This layout will render the default layout, as normal, but rather than just spitting out the content of the story, it’ll put a little “download” button at the top. The button won’t do anything yet, but it’ll look pretty (thanks to some default rendering).

Inside our rules, we need to set things up to ensure the stories use this layout1:

1
compile '/writing/**/*.{haml,md}' do
2
3
  # Filter
4
  case item.identifier.ext
5
  when "haml"
6
    filter :haml
7
  when "md"
8
    filter :redcarpet, renderer: MarkdownOptions::renderer, options: MarkdownOptions::options
9
  else
10
   raise RuntimeError, "Don't know how to render #{item.identifier}"
11
  end
12
13
  layout '/writing.*'
14
end

Now, let’s hook things up! Actually, you know what? I keep on referring to the epub’s location - why don’t we define it in the preprocessing step and make it an attribute of both the story, and its epub equivalent:

1
preprocess do
2
  # All the other stuff...
3
   #---------------------------------------
4
  # Create epubs for each piece of writing
5
  @items
6
    .find_all("/writing/**/*.{md,haml}")
7
    .each do |i|
8
9
      # Don't create epubs of index files - just of others
10
      next if i.identifier =~ /index.(haml|md)$/
11
12
      # Identifier for the epub element
13
      new_identifier = i.identifier.to_s.gsub(/(haml|md)$/, "epub.\\1")
14
15
      # Final URL for the epub
16
      i[:epub_location] = 
17
        i.identifier.without_exts +
18
        "/" +
19
        File.basename(i.identifier.without_ext) +
20
        ".epub"
21
22
      @items.create(i.raw_content, i.attributes, new_identifier)       
23
    end
24
end

Now we’re defining the final URL where the epub should end up, and setting it to the :epub_location attribute of both items. And that means we can simplify the epub rendering step in our Rules:

1
compile '/writing/**/*.epub.{haml,md}' do
2
  filter :epub
3
  write item[:epub_location]
4
end

And finally, we can link up that button in our layout:

1
# layouts/writing.haml
2
3
=render "/default.*" do
4
5
  %p
6
    =link_to "Download epub", item[:epub_location], class: "button"
7
8
  =yield

And now, look what we have!

Success all around.


  1. I’m using some custom RedCarpet options, which is why I have that interesting set of parameters going to the redcarpet renderer.