Calibre Python



Apr 11, 2021 Calibre is a free, open source, ebook management and conversion utility created and maintained by Kovid Goyal. It is available for Windows, Mac OS X and Linux. Calibre cannot, on its own, remove DRM from ebooks. However, it is possible to add third-party software (‘plugins’) to enhance calibre. This class teaches students how to create custom Calibre DRV scripts and batch files that can be used to analyze and manipulate layout data. This class also teaches students how to extend the Calibre DRV GUI to obtain user input, display results from running DRV scripts, and add menus and menu items to invoke the scripts they write in class. The course presents a number of practical examples. Build the calibre installers, including all dependencies from scratch - kovidgoyal/build-calibre.

The Search & replace tool in the editor supports a function mode. In this mode, you can combine regular expressions (see All about using regular expressions in calibre) with arbitrarily powerful Python functions to do all sorts of advanced text processing.

In the standard regexp mode for search and replace, you specify both a regular expression to search for as well as a template that is used to replace all found matches. In function mode, instead of using a fixed template, you specify an arbitrary function, in the Python programming language. This allows you to do lots of things that are not possible with simple templates.

Techniques for using function mode and the syntax will be described by means of examples, showing you how to create functions to perform progressively more complex tasks.

Automatically fixing the case of headings in the document

Here, we will leverage one of the builtin functions in the editor to automatically change the case of all text inside heading tags to title case:

For the function, simply choose the Title-case text (ignore tags) builtin function. This will change titles that look like: <h1>some TITLE</h1> to <h1>Some Title</h1>. It will work even if there are other HTML tags inside the heading tags.
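As a rough illustration of what such a function does, here is a minimal stand-alone sketch. It is not the editor's builtin implementation: the find expression, the helper name title_case_ignoring_tags and the use of str.title() are all assumptions made for this example, and it runs with plain Python outside the editor.

```python
import re

# Assumed find expression: any h1-h6 element, including nested tags.
FIND = r'<([Hh][1-6])[^>]*>.+?</\1>'

def title_case_ignoring_tags(match):
    # Title-case only the text between tags, leaving the tags themselves alone.
    return re.sub(r'>([^<>]+)<',
                  lambda m: '>' + m.group(1).title() + '<',
                  match.group())

html = '<h1>some <i>TITLE</i></h1><p>body text</p>'
result = re.sub(FIND, title_case_ignoring_tags, html)
print(result)
```

Note that str.title() is a crude approximation of title casing (it mishandles apostrophes, for instance), so the builtin function is the better choice inside the editor.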

Your first custom function - smartening hyphens

The real power of function mode comes from being able to create your own functions to process text in arbitrary ways. The Smarten Punctuation tool in the editor leaves individual hyphens alone, so you can use this function to replace them with em-dashes.

To create a new function, simply click the Create/edit button to create a new function and copy the Python code from below.
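A minimal sketch of such a function follows. The replace() signature matches the one given in the API section of this document; the find expression and the small run() helper, which stands in for the editor so the example can be tried with plain Python, are assumptions for illustration.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Handle double hyphens first, so '--' becomes one em-dash rather than two.
    return match.group().replace('--', '\u2014').replace('-', '\u2014')

# Assumed find expression: text between tags, never the tags themselves.
FIND = r'>[^<>]+<'

def run(html):
    # Hypothetical stand-in for the editor calling replace() on every match.
    counter = 0
    def call(m):
        nonlocal counter
        counter += 1
        return replace(m, counter, 'demo.html', None, None, {}, {})
    return re.sub(FIND, call, html)

result = run('<p class="a-b">one - two -- three</p>')
print(result)
```

The hyphen inside the class attribute is left untouched because the find expression only matches text that sits between a closing > and the next opening <.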

Every Search & replace custom function must have a unique name and consist of a Python function named replace, that accepts all the arguments shown above. For the moment, we won't worry about all the different arguments to the replace() function. Just focus on the match argument. It represents a match when running a search and replace. Its full documentation is available in the Python documentation for Match objects. match.group() simply returns all the matched text and all we do is replace hyphens in that text with em-dashes, first replacing double hyphens and then single hyphens.

Use this function with the find regular expression:

And it will replace all hyphens with em-dashes, but only in actual text and not inside HTML tag definitions.

The power of function mode - using a spelling dictionary to fix mis-hyphenated words

Often, e-books created from scans of printed books contain mis-hyphenated words, that is, words that were split at the end of the line on the printed page. We will write a simple function to automatically find and fix such words.

Use this function with the same find expression as before, namely:

And it will fix all mis-hyphenated words in the text of the book. The main trick is to use one of the useful extra arguments to the replace function, dictionaries. This refers to the dictionaries the editor itself uses to spell check text in the book. What this function does is look for words separated by a hyphen, remove the hyphen and check if the dictionary recognizes the composite word. If it does, the original words are replaced by the hyphen-free composite word.
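The approach just described can be sketched as follows. This is an approximation that runs outside the editor: FakeDictionaries is a hypothetical stand-in for the real dictionaries object (only its recognized() method is taken from this document), the find expression is assumed, and the standard library re and html modules are used in place of whatever helpers the editor provides.

```python
import re
from html import escape, unescape

class FakeDictionaries:
    """Hypothetical stand-in for the editor's dictionaries object."""
    def __init__(self, words):
        self.words = {w.lower() for w in words}

    def recognized(self, word):
        return word.lower() in self.words

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    def replace_word(wmatch):
        # Join the two halves only if the result is a known word.
        without_hyphen = wmatch.group(1) + wmatch.group(2)
        if dictionaries.recognized(without_hyphen):
            return without_hyphen
        return wmatch.group()

    text = unescape(match.group()[1:-1])  # strip '>' and '<', decode entities
    corrected = re.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text)
    return '>%s<' % escape(corrected, quote=False)  # restore required entities

dictionaries = FakeDictionaries(['example', 'well', 'known'])
html = '<p>An exam- ple of a well-known case</p>'
fixed = re.sub(r'>[^<>]+<',
               lambda m: replace(m, 1, 'demo.html', None, dictionaries, {}, {}),
               html)
print(fixed)
```

Here "exam- ple" is joined because "example" is recognized, while "well-known" is kept because "wellknown" is not.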

Note that one limitation of this technique is it will only work for mono-lingual books, because, by default, dictionaries.recognized() uses the main language of the book.

Auto numbering sections

Now we will see something a little different. Suppose your HTML file has many sections, each with a heading in an <h2> tag that looks like <h2>Some text</h2>. You can create a custom function that will automatically number these headings with consecutive section numbers, so that they look like <h2>1. Some text</h2>.

Use it with the find expression:

Place the cursor at the top of the file and click Replace all.

This function uses another of the useful extra arguments to replace(): the number argument. When doing a Replace All, number is automatically incremented for every successive match.

Another new feature is the use of replace.file_order. Setting that to 'spine' means that if this search is run on multiple HTML files, the files are processed in the order in which they appear in the book. See Choose file order when running on multiple HTML files for details.
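A minimal sketch of a numbering function of this kind is shown below. The find expression and the replace_all() helper, which imitates the editor incrementing number for each match, are assumptions made so the example runs with plain Python.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # group(1) is the opening <h2 ...> tag, group(2) the heading text plus </h2>.
    section_number = '%d. ' % number
    return match.group(1) + section_number + match.group(2)

# Process files in book order when the search spans several HTML files.
replace.file_order = 'spine'

# Assumed find expression with two capture groups.
FIND = r'(?s)(<h2[^<>]*>)(.+?</h2>)'

def replace_all(html):
    # Hypothetical stand-in for Replace all: number starts at 1 and increases.
    count = 0
    def call(m):
        nonlocal count
        count += 1
        return replace(m, count, 'demo.html', None, None, {}, {})
    return re.sub(FIND, call, html)

numbered = replace_all('<h2>Intro</h2><p>x</p><h2 class="c">Next</h2>')
print(numbered)
```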

Auto create a Table of Contents

Finally, let's try something a little more ambitious. Suppose your book has headings in h1 and h2 tags that look like <h1 id='someid'>Some Text</h1>. We will auto-generate an HTML Table of Contents based on these headings. Create the custom function below:

And use it with the find expression:

Run the search on All text files and at the end of the search, a window will pop up with “Debug output from your function” which will have the HTML Table of Contents, ready to be pasted into toc.html.

The function above is heavily commented, so it should be easy to follow. The key new feature is the use of another useful extra argument to the replace() function, the data object. The data object is a Python dict that persists between all successive invocations of replace() during a single Replace All operation.

Another new feature is the use of call_after_last_match. Setting that to True on the replace() function means that the editor will call replace() one extra time after all matches have been found. For this extra call, the match object will be None.
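The interplay of data and call_after_last_match can be sketched as below. This is a simplified stand-alone approximation, not the function from the calibre manual: build_toc(), run_replace_all() and the find expression are hypothetical helpers for this example, and the output is a bare HTML list rather than a complete toc.html.

```python
import re
from html import escape

# Assumed find expression: h1/h2 tags carrying an id attribute.
FIND = r'''<(h[12])[^<>]*\sid=['"]([^'"]+)['"][^<>]*>([^<>]+)'''

def build_toc(entries):
    # Render the collected (file_name, tag_name, anchor, text) tuples as a list.
    lines = ['<ul>']
    for file_name, tag_name, anchor, text in entries:
        lines.append('  <li class="%s"><a href="%s#%s">%s</a></li>'
                     % (tag_name, file_name, anchor, escape(text)))
    lines.append('</ul>')
    return '\n'.join(lines)

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match is None:
        # Extra call after the last match: print the finished table.
        print(build_toc(data.get('toc', [])))
        return None
    # Normal call: remember the heading and leave the text unchanged.
    data.setdefault('toc', []).append(
        (file_name, match.group(1), match.group(2), match.group(3).strip()))
    return match.group()

replace.call_after_last_match = True
replace.file_order = 'spine'

def run_replace_all(files):
    # Hypothetical stand-in for the editor's Replace all over several files.
    data, number = {}, 0
    for name, html in files:
        for m in re.finditer(FIND, html):
            number += 1
            replace(m, number, name, None, None, data, {})
    if getattr(replace, 'call_after_last_match', False):
        replace(None, number, '', None, None, data, {})
    return data

data = run_replace_all([
    ('ch1.html', "<h1 id='one'>Chapter One</h1><h2 id='sec'>A Section</h2>"),
    ('ch2.html', '<h1 id="two">Chapter Two</h1>'),
])
```

Because the same data dict is passed to every call, the entries gathered from both files are still available when the final call with match set to None arrives.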

This was just a demonstration to show you the power of function mode. If you really needed to generate a Table of Contents from headings in your book, you would be better off using the dedicated Table of Contents tool in Tools → Table of Contents.

The API for the function mode

All function mode functions must be Python functions named replace, with the following signature:

When a find/replace is run, the replace() function is called for every match that is found, and it must return the replacement string for that match. If no replacements are to be done, it should return match.group(), which is the original string. The various arguments to the replace() function are documented below.
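For reference, a do-nothing function with this signature could look like the sketch below. The argument names are the ones documented in the sections that follow; the re.sub() call is only a stand-in for the editor so the example runs on its own.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Returning the matched text unchanged means "no replacement".
    return match.group()

text = 'nothing changes here'
unchanged = re.sub(r'\w+', lambda m: replace(m, 1, '', None, None, {}, {}), text)
```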

The match argument

The match argument represents the currently found match. It is a Python Match object. Its most useful method is group(), which can be used to get the matched text corresponding to individual capture groups in the search regular expression.
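As a small illustration of capture groups, the sketch below rearranges an ISO-style date. The find expression and the sample text are made up for this example.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # group(0) or group() is the whole match; group(1..3) are the capture groups.
    year, month, day = match.group(1), match.group(2), match.group(3)
    return '%s/%s/%s' % (day, month, year)

FIND = r'(\d{4})-(\d{2})-(\d{2})'
out = re.sub(FIND, lambda m: replace(m, 1, '', None, None, {}, {}),
             'Published 2021-04-11.')
print(out)
```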

The number argument

The number argument is the number of the current match. When you run Replace All, every successive match will cause replace() to be called with an increasing number. The first match has number 1.

The file_name argument

This is the filename of the file in which the current match was found. When searching inside marked text, the file_name is empty. The file_name is in canonical form, a path relative to the root of the book, using / as the path separator.

The metadata argument

This represents the metadata of the current book, such as title, authors, language, etc. It is an object of class calibre.ebooks.metadata.book.base.Metadata. Useful attributes include title, authors (a list of authors) and language (the language code).
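A possible use of these attributes is filling in placeholders, sketched below. The [[field]] placeholder syntax is invented for this example, and a SimpleNamespace with made-up values stands in for the real Metadata object so the code runs outside the editor.

```python
import re
from types import SimpleNamespace

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    field = match.group(1)
    if field == 'authors':
        # authors is a list, so join it into readable text.
        return ' and '.join(metadata.authors)
    return getattr(metadata, field)

# Hypothetical stand-in exposing only the attributes named in the text.
metadata = SimpleNamespace(title='An Example Book',
                           authors=['A. Writer', 'B. Editor'],
                           language='eng')

FIND = r'\[\[(title|authors|language)\]\]'
out = re.sub(FIND, lambda m: replace(m, 1, '', metadata, None, {}, {}),
             '<p>[[title]] by [[authors]]</p>')
print(out)
```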

The dictionaries argument

This represents the collection of dictionaries used for spell checking the current book. Its most useful method is dictionaries.recognized(word), which will return True if the passed in word is recognized by the dictionary for the current book’s language.

The data argument

This is a simple Python dict. When you run Replace all, every successive match will cause replace() to be called with the same dict as data. You can thus use it to store arbitrary data between invocations of replace() during a Replace all operation.
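One way to use data is to remember what has already been seen, as in this sketch that marks only the first occurrence of each term. The find expression, the <dfn> markup and the shared-dict harness are assumptions for illustration.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # The same dict arrives on every call, so the set survives between matches.
    seen = data.setdefault('seen', set())
    word = match.group().lower()
    if word in seen:
        return match.group()  # later occurrences stay unchanged
    seen.add(word)
    return '<dfn>%s</dfn>' % match.group()

FIND = r'(?i)\b(calibre|epub)\b'
shared = {}  # stand-in for the dict the editor keeps during one Replace all
out = re.sub(FIND, lambda m: replace(m, 1, '', None, None, shared, {}),
             'calibre and Calibre and epub')
print(out)
```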

The functions argument

The functions argument gives you access to all other user defined functions. This is useful for code re-use. You can define utility functions in one place and re-use them in all your other functions. For example, suppose you create a function named MyFunction like this:

Then, in another function, you can access the utility() function like this:

You can also use the functions object to store persistent data that can be re-used by other functions. For example, you could have one function that, when run with Replace All, collects some data and another function that uses it when it is run afterwards. Consider the following two functions:
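The sketch below imitates both uses in one runnable file. It assumes that functions behaves like a mapping from a function's name to the names defined alongside its replace(), which is how the lookups are written here; the names "Function One", replace_one and replace_two are hypothetical, and in the editor each function would live in its own definition and be called replace.

```python
import re

# --- Contents of a function saved as "Function One" ---
persistent_data = {}

def utility(text):
    # Collapse runs of whitespace; reusable from other functions.
    return ' '.join(text.split())

def replace_one(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    persistent_data.setdefault('headings', []).append(utility(match.group(1)))
    return match.group()

# --- Contents of a function saved as "Function Two" ---
def replace_two(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    shared = functions['Function One']['persistent_data']
    tidy = functions['Function One']['utility']
    return tidy(match.group()) + ' (%d headings seen)' % len(shared.get('headings', []))

# Stand-in for the object the editor passes as `functions` (an assumption).
functions = {'Function One': {'utility': utility,
                              'persistent_data': persistent_data}}

# First run collects data, second run uses it.
re.sub(r'<h2>(.*?)</h2>',
       lambda m: replace_one(m, 1, '', None, None, {}, functions),
       '<h2> A   b </h2><h2>C</h2>')
out = re.sub(r'TOTAL',
             lambda m: replace_two(m, 1, '', None, None, {}, functions),
             'TOTAL')
print(out)
```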

Debugging your functions

You can debug the functions you create by using the standard print() function from Python. The output of print will be displayed in a popup window after the Find/replace has completed. You saw an example of using print() to output an entire table of contents above.

Choose file order when running on multiple HTML files

When you run a Replace all on multiple HTML files, the order in which the files are processed depends on what files you have open for editing. You can force the search to process files in the order in which they appear by setting the file_order attribute on your function, like this:

file_order accepts two values, spine and spine-reverse, which cause the search to process multiple files in the order they appear in the book, either forwards or backwards, respectively.

Having your function called an extra time after the last match is found

Sometimes, as in the auto generate table of contents example above, it is useful to have your function called an extra time after the last match is found. You can do this by setting the call_after_last_match attribute on your function, like this:

Appending the output from the function to marked text

When running search and replace on marked text, it is sometimes useful to append some text to the end of the marked text. You can do that by setting the append_final_output_to_marked attribute on your function (note that you also need to set call_after_last_match), like this:
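A minimal sketch of the two attributes working together follows. The attribute names come from this document; run_on_marked() is a hypothetical harness that imitates the described behavior (one extra call with match set to None, whose return value is appended), not the editor's actual code.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match is None:
        # Final extra call: this return value is what gets appended.
        return '\n<p>Replaced %d occurrence(s).</p>' % data.get('count', 0)
    data['count'] = data.get('count', 0) + 1
    return match.group().upper()

replace.call_after_last_match = True
replace.append_final_output_to_marked = True

def run_on_marked(marked, find):
    # Hypothetical stand-in for how the editor treats marked text.
    data, n = {}, 0
    def call(m):
        nonlocal n
        n += 1
        return replace(m, n, '', None, None, data, {})
    result = re.sub(find, call, marked)
    if getattr(replace, 'call_after_last_match', False):
        final = replace(None, n, '', None, None, data, {})
        if getattr(replace, 'append_final_output_to_marked', False) and final:
            result += final
    return result

out = run_on_marked('<p>note one, note two</p>', r'note')
print(out)
```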

Suppressing the result dialog when performing searches on marked text

You can also suppress the result dialog (which can slow down the repeated application of a search/replace on many blocks of text) by setting the suppress_result_dialog attribute on your function, like this:

More examples

More useful examples, contributed by calibre users, can be found in the calibre E-book editor forum.

Calibre helper scripts (ISBN guessing, DOC to RTF conversion, hanging books detection, ...).

Project description

This is a set of standalone scripts which I wrote to help manage my Calibre ebook library.

  • Script details
  • Installation and configuration
  • History

Script list

The following scripts are available:

calibre_guess_and_add_isbn

Checks books without ISBN (set in metadata) for an ISBN-like string present in the leading pages. If found, adds it to the metadata (which makes it possible to download full metadata, covers, etc).

calibre_convert_docs_to_rtf

Converts any .doc to .rtf (unless already present), using OpenOffice.

calibre_add_if_missing

Checks the given directory tree for books not yet present in calibre and adds them if found. Uses binary file comparison to check whether the file is identical (file name and metadata are not used, on purpose).

calibre_find_books_missing_in_database

Checks whether the Calibre database directory contains some unregistered files and reports them if found.

calibre_report_duplicates

Reports duplicates, adding information about which of them are surely safe to merge (because the duplicated books are identical or because formats do not overlap), and which require careful examination (because different files in the same format exist).

Script details

calibre_guess_and_add_isbn

Queries Calibre for all books without ISBN, then tries to locate an ISBN inside (by scanning a few leading pages) and updates the Calibre book metadata if an ISBN is found.

Run it without parameters:

Any ISBN numbers found will be added to the book metadata (and the script will report them). Books are scanned from the newest, so you can abort the script (Ctrl-C) once it has handled the new books.
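To make the idea of "ISBN-like strings" concrete, here is a minimal sketch of scanning text for candidates and verifying their checksums. It is not the script's actual code: the candidate regular expression and the function names are assumptions, and only the standard ISBN-10 and ISBN-13 checksum rules are relied upon.

```python
import re

def isbn10_valid(digits):
    # Weights run from 10 down to 1; a trailing 'X' counts as 10.
    total = 0
    for i, ch in enumerate(digits):
        value = 10 if ch == 'X' else int(ch)
        total += (10 - i) * value
    return total % 11 == 0

def isbn13_valid(digits):
    # Weights alternate 1, 3, 1, 3, ...; the sum must be a multiple of 10.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(digits))
    return total % 10 == 0

# Assumed candidate pattern: 10 or 13 characters, optional hyphens or spaces.
CANDIDATE = re.compile(r'\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b')

def find_isbn(text):
    # Return the first candidate whose checksum is correct, or None.
    for m in CANDIDATE.finditer(text):
        raw = re.sub(r'[- ]', '', m.group()).upper()
        if len(raw) == 10 and isbn10_valid(raw):
            return raw
        if len(raw) == 13 and raw.isdigit() and isbn13_valid(raw):
            return raw
    return None

print(find_isbn('Copyright page. ISBN 978-0-306-40615-7 printed in ...'))
```

Checking the checksum before accepting a candidate filters out most phone numbers and catalogue codes that merely look like an ISBN.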

Later on, the ISBN can be used to grab the book metadata and/or book cover inside the Calibre GUI. Just start Calibre and look for books with an ISBN set and missing metadata, for example using a query like:

(the above means: isbn contains some digit, publisher does not contain any letter). Depending on your workflow, you can then either

  • grab metadata automatically (mark all those books, right click, pick Edit Metadata Information/Download Metadata)
  • review each book individually (mark those books, right click, pick Edit Metadata Information/Edit Metadata Individually, then click Fetch Metadata on every book successively and review whether it fits).

calibre_convert_docs_to_rtf

Queries Calibre for all books which have only the .doc format, then uses OpenOffice to convert them to .rtf and adds this format as an alternative.

OpenOffice (and the pyuno libraries provided by it) are used in the process.

Run it without parameters:

Note: from time to time the script happens to crash at the end of the job (while finishing). I haven’t diagnosed the reasons (most likely the problem is in the libraries I use), but the crash is harmless and does not influence the actual conversion process.

calibre_find_books_missing_in_database

Reports the files present inside the Calibre library directory but not present in the database (and therefore not visible in the Calibre interface).

The files are reported to standard output. To add them all to calibre, pipe the output. For example:

(but, better, review everything beforehand)

Note

The problematic scenario may happen, for example, if Calibre is used from two or more machines over a synchronized or networked directory and, by mistake, two copies are run simultaneously. It may also happen after some crashes.

calibre_add_if_missing

Scans the given directory and/or specified files, and adds to calibre all books which are not yet present there.

Duplicate checking is done solely according to the file content. A file is skipped if an identical file is already present in Calibre.
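Content-based duplicate checking can be sketched as follows. How the script actually compares files is not described here, so the size pre-filter, the SHA-256 digests and the names file_digest and is_already_present are all assumptions chosen for illustration.

```python
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=65536):
    # Hash the file in chunks so large books need not fit in memory.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def is_already_present(candidate, library_files):
    # Names and metadata are ignored on purpose: only the bytes matter.
    size = os.path.getsize(candidate)
    same_size = [p for p in library_files if os.path.getsize(p) == size]
    if not same_size:
        return False
    digest = file_digest(candidate)
    return any(file_digest(p) == digest for p in same_size)

with tempfile.TemporaryDirectory() as tmp:
    def write(name, content):
        path = os.path.join(tmp, name)
        with open(path, 'wb') as f:
            f.write(content)
        return path

    library = [write('in_library.epub', b'same bytes')]
    duplicate = is_already_present(write('renamed_copy.epub', b'same bytes'), library)
    different = is_already_present(write('other.epub', b'diff bytes'), library)

print(duplicate, different)
```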

I initially wrote this script to handle the "I want to ensure everything is already imported and can be deleted" scenario, but over the years I have come to use it for most batch ebook imports.

Example:

(import any books below OldBooks which are not yet present, don’t touch this directory, which probably can be removed afterwards).

Or maybe:

(add all .epub files from ./freshly-bought, tag them with programming and web-development, move all successfully imported files to ~/ebooks-done/).

For all options, run:

calibre_report_duplicates

Analyzes the calibre database looking for likely duplicates, and reports them, adding info about which of those are surely identical, and which require examination.

Does not perform any changes, just prints a report (as text, or html, or javascript-enabled html from which reviewed items can be removed by clicking).

Example:

(text output to the console):

(HTML output redirected to file):

(also HTML, but with buttons to hide rows, handy for review).

See also:

Installation and configuration

Prerequisites

Calibre must be installed, properly configured and have some database (otherwise it does not make sense to run those scripts). The calibredb command must be in PATH (or the calibredb variable inside the .ini file must be properly set, see below).

Tools providing commands:

should be installed and present in PATH (or properly configured in .ini, or disabled in .ini, see below). On Ubuntu Linux or Debian Linux those can be installed from standard repositories, just install the following packages:

Python 2.6 or 2.7 is required (scripts are using some features introduced in 2.6 - in particular tempfile extensions, subprocess and namedtuple). Also, the lxml library must be installed. On Debian or Ubuntu just install the following packages:

For calibre_convert_docs_to_rtf to work, the ootools library must be installed. Simplest method to install it:

(on Ubuntu sudo easy_install ootools).

I develop and use those scripts on Ubuntu Linux. They should work on Windows or Mac if the necessary tools are installed, but I’ve never tried it.

Actual installation

Simple:

or:

should do (the latter requires adding ~/.local/bin to PATH). In case you don’t want to mess with your system or user directories, consider using virtualenv.

Configuration

The ~/.calibre-utils file can be used to configure some program settings. The file is created, if missing, whenever any of the scripts is run, and can be customized.

Here is the default content:

The commands section defines the location of the external tools being used. In case the commands are present in PATH, bare names can be used. Otherwise a full path can be specified. Finally, if some tool is missing, it can be defined as an empty string.

The isbn-search section specifies how many leading pages (in page-based document formats like PDF or DJVU) or lines (in the free formats like TXT or CHM) are scanned looking for ISBN-like strings.

For example, the file can be changed so:

In such a case catdoc will be used from /usr/local/bin, calibredb will be expected in /opt/calibre, pdftotext will be sought in PATH, and archmage and djvutxt will be treated as missing (so the isbn guessing script won’t be able to scan CHM and DJVU files for ISBN and will ignore them).
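The rules for the commands section (bare name, full path, or empty string for a missing tool) can be illustrated with a short sketch. The file layout below is a guess reconstructed from the description above: only the section name commands and the calibredb variable are confirmed by this document, while the other key names and the exact paths are assumptions, as is the load_commands helper.

```python
import configparser

# Hypothetical file content matching the scenario described above.
EXAMPLE = """
[commands]
catdoc = /usr/local/bin/catdoc
calibredb = /opt/calibre/calibredb
pdftotext = pdftotext
archmage =
djvutxt =
"""

def load_commands(text):
    # Map each tool to its configured command, or None when left empty.
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: (value or None) for name, value in parser['commands'].items()}

commands = load_commands(EXAMPLE)
available = sorted(name for name, value in commands.items() if value)
print(available)
```

A tool mapped to None would then simply be skipped, which corresponds to the CHM and DJVU files being ignored in the example above.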

History

(only major changes described)

1.4.1

calibre_add_if_missing: added --force-language (to set the book language attribute).

calibre_add_if_missing crashed when it was to move a file out (--move option used) and an identically named file already existed in the target directory. After the fix, the file is moved to a subdirectory of the target instead.

1.4.0

calibre_report_duplicates enhancements:

  • output format is chosen by --format=txt or --format=html (instead of --html)
  • added javascript-enabled output (--format=js) which supports clicking-out report items
  • using SimHash instead of difflib to look for similar titles. MUCH faster, provides a bit different but sensible results
  • reporting similar authors
  • new option --cache (use cached manifest to speed up reruns on large libraries)
  • new option --output (name output file)

1.3.2

Fixed errors:

  • calibre_guess_and_add_isbn crashed with a Unicode decode error while saving the isbn to a book with a non-ascii character in its title (wrong diagnostic print),
  • since calibre 1.0, calibredb catalog --sort-by=id crashes (and therefore scripts internally using this command crash too), so we sort by timestamp instead.

1.3.1

calibre_add_if_missing fixes:

  • even runs executed without --cache preserve cached metadata for a possible next run,
  • runs with --cache ignore cached data if they are more than 24 hours old,
  • in case a file found in the calibre catalog does not exist (which can happen if it was renamed or deleted while we run, or if the cache is in use and some books were removed since it was created), calibre_add_if_missing just warns, but continues its work (instead of exiting with an error).

1.3.0

Python3 compatibility work, scripts should be runnable under Python3 (note: daily I still use them under 2.7, so 3 is less tested).

calibre_add_if_missing performs some comparison of epub internals in case a possible duplicate of similar size exists. In particular, it is able to ignore calibre_bookmarks (so a duplicate epub is not added due to this file being added or modified by the viewer).

calibre_add_if_missing has --cache option (reuse cached catalog from previous run to speed up processing on large libraries).

calibre_add_if_missing has --dry-run option.

1.2.4

calibre_add_if_missing has --move option (move successfully added files to another directory - likely something trash-like).

calibre_add_if_missing has --title-from-name option (force using filename as title instead of processing metadata).

calibre_add_if_missing has --tag and --author options (force given tags and/or author instead of processing metadata).

calibre_add_if_missing copies filename as title for .doc, .docx, .rtf and .txt files. Those extremely rarely have sensible metadata.

1.2.3

calibre_find_books_missing_in_database no longer reports book subdirectories and such (reasoning: I use book subfolders to store things like source code added to the book or book sources, at the same time it is not the place where calibre would put the book by itself).

Fixed two more “UnicodeEncodeError” bugs (reported for books without files and with unicode character in names)

1.2.2

calibre_guess_and_add_isbn catches various errors, reports them, and continues to work. For any error, the information mentions the problematic book name.

Ctrl-C aborts ISBN guessing and properly cleans up.

#5 - fix for ISBN’s containing X letter.

1.2.0

calibre_add_if_missing disables Calibre’s own duplicate checking (which is title based, so too simplistic, and occasionally rejects fine books) and prints detailed info about found actual duplicates (if present).

Some calibre_report_duplicates improvements:

  • pruning some redundant matches (if a is similar to b, a is similar to c, and b is similar to c, we don’t report the latter),
  • books which have same/similar author and title are not reported as duplicates if they have the same series and different series index (so different volumes of the same book are no longer reported as possible duplicates).

calibre_guess_and_add_isbn fixes:

  • #3 - avoiding crash on latin-1 encoded chm files (during ISBN detection)
  • handling some Unicode characters in ISBN text (hard space, long dash, …)
  • verifying ISBN checksum before using it.

1.1.1

Added calibre_report_duplicates.

calibre_add_if_missing can be given individual files (initially only complete directories could be processed).

1.0.4

First serious release. Working calibre_find_books_missing_in_database, calibre_guess_and_add_isbn, calibre_convert_docs_to_rtf, calibre_add_if_missing.

Release history

1.5.0

1.4.1

1.4.0

1.3.2

1.3.1

1.3.0

1.2.3

1.2.2

1.2.1

1.2.0

1.1.1

1.1.0

1.0.4

1.0.3

1.0.2

1.0.1

1.0.0

Download files

Files for mekk.calibre, version 1.5.0
Filename: mekk.calibre-1.5.0.tar.gz (34.8 kB)
File type: Source
Python version: None

Hashes for mekk.calibre-1.5.0.tar.gz

SHA256: 24941efb964a791f13ce17b712d5fd866ffd641ce0be078b65e6a89fe26f127f
MD5: 19a52ec721021ca3252be90fb19c0b38
BLAKE2-256: 401316efebbec3efeadbd29f04dd54a9544d1b58508bd1611f5900e48c6b1cc8