Turning scanned PDFs into Ebooks
Introduction
In the Netherlands, copyright on works such as books expires 70 years after the death of the creator. To be precise: 70 years after the first of January in the year after the author passed away, so the works of an author who died in 1950 entered the public domain on 1 January 2021. Many of these books have been scanned by the national library and are made available at Delpher.nl.
It’s fun to browse these PDFs, but if you really want to read them, a computer screen is not the best way to do so. It would be better to read them on an E-reader. Strictly speaking you can read a PDF on an E-reader, but those scans are not very comfortable to read. So now, in the age of AI, it should be possible to turn those books into Ebooks automatically, right?
Well, I’ve tried. And so far we’re not there yet. Although with the right tools, it is now a lot easier to turn a scanned PDF into a proper Ebook.
Ivans
First things first. When I was young I read quite a few titles by the author Ivans, the first well-known detective writer in the Netherlands. He has a large body of work. About 15 of his titles were republished in the seventies, with the spelling modernized and the books abridged. Those are the ones I read. But he wrote many more. It is nice to read them again, especially the titles I haven’t read. It is probably even better in these times to read stories written in the days long before smartphones; in fact, even telephones were so uncommon that the stories talk about going into town to send a telegram. And as the scanned PDF files can be freely downloaded, they were perfect for seeing if I could transform them into Ebooks. The first test was a book by Ivans that actually starts in the area I grew up in: “Aan den rand van den Bosch”. An extra challenge is that the Dutch spelling in those days was quite different from modern-day Dutch.
ChatGPT?
So obviously the first step was uploading a PDF to ChatGPT and asking it to do the conversion. ChatGPT can (try to) do it for you, and it can also give you the tools or scripts to do this locally. I went down this rabbit hole for a while. I did learn quite a bit immediately. Without ChatGPT I probably would have as well, but it would have taken me a lot longer. For instance, the fact that it makes sense to first create a markdown file. The markdown file can be easily converted to an Epub, but while it is in markdown format it can be easily edited, and there are many tools available that make this an easy process. Unfortunately, this editing by hand really is necessary.
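To illustrate why markdown is such a convenient intermediate format: a tool like pandoc (one of many options; the metadata values below are just an example) can turn the cleaned-up markdown into an Epub in a single command:

```bash
# Convert a cleaned-up markdown file to an Epub.
# Title and author metadata here are illustrative; adjust to the actual book.
pandoc boek.md -o boek.epub \
  --metadata title="Aan den rand van den Bosch" \
  --metadata author="Ivans"
```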
The results can be categorised as better than manually typing the text, but there were too many issues:
- Initially it would include the page numbers of the printed book, which you don’t want in the Epub
- There were too many line breaks: it stayed too close to the layout of the book, which again you don’t want
It then gave me a method for doing the conversion on my own machine, which runs Linux. That gave me new issues:
- It would produce bash scripts that worked, but the results contained too much whitespace
- Then, after I asked for a fix, it would randomly add line breaks
All in all, if you don’t mind a lot of manual cleanup in the markdown file, it sort of works.
MinerU
Then I tried a tip I found on Mastodon: MinerU.
First things first: the recognition was definitely better. Unfortunately there were two major drawbacks. Judging by the large number of Chinese characters in the output (I assume they were Chinese, I am no expert), the software is based there. I have no problem with that, but it may explain why there were quite a few weird characters. A bigger problem was that all the spaces between the words were gone.
So: too much manual work left.
Finally a solution
Then I found Marker. Spoiler: this software worked the best, and although not without manual cleanup and some special steps, it did a good job and left me with my first proper Epub!
But there were hurdles to overcome, and ChatGPT was actually very helpful in this process. To explain, let me first describe what Marker does. It is specialised software for converting files, including PDFs, and it will do things like extract images and remove headers and footers. Just what I want. It does use AI: it runs models locally, optionally in combination with a step that takes place in an external LLM. If you enable that step it defaults to Google’s Gemini, but you can run an LLM locally as well.
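For reference, a minimal way to try Marker yourself (the package is called marker-pdf; flag names have changed between Marker versions, so treat the options below as an assumption to check against your installed version):

```bash
# Install Marker (pulls in PyTorch as a dependency)
pip install marker-pdf

# Convert a single PDF to markdown; runs fully locally by default
marker_single boek.pdf --output_dir out/

# Optionally let an external LLM (Gemini by default) polish the result
# marker_single boek.pdf --output_dir out/ --use_llm
```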
The basic dependency of this software is PyTorch. It can run entirely on a CPU, but it will perform better on a GPU. This is where it got a bit hairy: PyTorch required a lot of memory for a large PDF, and it failed with a segfault when I ran it on the CPU with 48 GB of memory. A quick test showed that when I reduced the size, by first extracting a few pages and running the software on those, it ran fine and gave the desired result.
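That quick test is easy to reproduce: extract a handful of pages and run the conversion on just those. Using qpdf, for example (qpdf is my choice here, any PDF tool that can extract page ranges works):

```bash
# Take only pages 1-10 from the scan ("." refers back to the input file)
qpdf boek.pdf --pages . 1-10 -- test.pdf

# Run the conversion on the small file to see if it survives
marker_single test.pdf --output_dir test_out/
```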
So I turned to ChatGPT and asked it to write me some scripts that first split the PDF into PDFs of just a few pages each, then ran the conversion to markdown, and finally reassembled the pieces into one large markdown file.
The scripts worked remarkably well, but it took a few iterations. What ChatGPT struggled with was extracting the images from the PDF with proper links in the markdown. This is one of the reasons I asked it to split the script into separate scripts per step, with a wrapper script; that made troubleshooting easier. I ended up running the conversion step manually to tell ChatGPT what was happening with the images, after which it adapted the conversion script, and finally it worked!
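A minimal sketch of that pipeline, assuming qpdf for the split and chunks of 10 pages (an illustration of the approach, not the actual scripts linked below):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Step 1: split the large PDF into 10-page chunks
# (qpdf zero-pads the numbers: chunk-01.pdf, chunk-02.pdf, ...)
qpdf --split-pages=10 boek.pdf chunk-%d.pdf

# Step 2: convert each chunk to markdown with Marker
for chunk in chunk-*.pdf; do
    marker_single "$chunk" --output_dir md
done

# Step 3: reassemble the per-chunk markdown files into one large file
cat md/*/*.md > boek.md
```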
A bonus is that this now actually works on my very old Nvidia GPU: a GTX 970 with just 4 GB of RAM. The scripts ran on a Ryzen 7 system, where monitoring showed that most of the work was indeed done on the GPU. A GPU that was released more than 10 years ago!
There was still a lot of cleanup. The stitching together of the pages means that in the final markdown file you need to remove some empty lines at each page boundary, and there were other small issues. But now the manual work was doable. And with Calibre I was able to turn this into a very nice Epub that reads well on my E-reader.
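Two small commands cover part of that cleanup and the final conversion: `cat -s` squeezes runs of blank lines down to one, and Calibre ships a command-line converter (whether ebook-convert accepts markdown input directly depends on your Calibre version; importing through the GUI works as well):

```bash
# Squeeze repeated blank lines left over from stitching pages together
cat -s boek.md > boek-clean.md

# Convert to Epub with Calibre's command-line tool
ebook-convert boek-clean.md boek.epub --authors "Ivans"
```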
Oh and the scripts can be found here.
On using an external LLM
I ran the conversion twice: once with the additional step where the software runs the scan through Gemini, and once without. I used diff, the tried and tested tool on Linux, to see if there were noticeable differences. There were none that I cared about, and the table of contents was actually better without the Gemini step. So in this case, there was no need to use the external LLM.
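Comparing the two runs is a one-liner (the directory names below are placeholders for the two output folders):

```bash
# Show the differences between the local-only run and the Gemini run
diff -u out-local/boek.md out-gemini/boek.md | less
```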
The result
I now have a proper Epub. The spelling is very outdated, and I did experiment with using AI to modernize it automatically, but that was unreliable. Furthermore, I noticed that you very quickly get used to the old spelling, and it adds to the charm of reading a book that is well over a hundred years old. I am playing with the idea of selling it in the Kindle store; as far as I can tell that is legally allowed. Maybe, if I make a bit of money selling the Ebook, it will allow me to buy a graphics card with more RAM. That would mean fewer pages to stitch together, so the end result would require a lot less manual work, and it would enable me to create more of these books.
Maybe I will…