Cleaning up

Cleaning up the markdown file

If you used the scripts to convert a PDF to Markdown, you are left with a Markdown file and the images that belong to it. Some cleanup is still needed afterwards, and here are some tips.

Tips for cleanup

  • Look for the $ sign. It often indicates that the converter had problems recognising the text.
  • Look for HTML tags such as <sup>. These can be valid in Markdown, but you want to verify them.
  • In Markdown the * character works well to make something italic. However, that no longer works when converting to EPUB. Use the <em> </em> tags instead. Alternatively, using the _ (underscore) twice instead of * also seems to work well.
  • Dutch: the ij character. In Dutch the ij is distinct from the y and is usually typed as ij. Strictly speaking it is one character, but typewriters and keyboards don't have that character, so it is always represented as ij. In conversion, however, it is often read as ii. Simply find and replace ii with ij, but do it step by step if the text contains Roman numerals such as VII.
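
The checks in the tips above can be partly automated before editing by hand. The following is only a minimal sketch: the function names `list_suspects` and `list_ii_words` are made up for this example, and the patterns are assumptions about what the converter leaves behind, not something the conversion scripts provide.

```shell
# list_suspects FILE (hypothetical helper)
# Print, with line numbers, every line that contains a $ sign or an
# HTML-like tag such as <sup> or </sup>, so it can be checked by hand.
list_suspects() {
  grep -nE '\$|</?[[:alpha:]][^>]*>' "$1"
}

# list_ii_words FILE (hypothetical helper)
# Print each distinct word containing "ii" together with how often it
# occurs. Reviewing this list first makes it easier to spot lowercase
# Roman numerals (iii, vii, xiii) before replacing anything.
list_ii_words() {
  grep -oE '[[:alpha:]]*ii[[:alpha:]]*' "$1" | sort | uniq -c | sort -rn
}
```

Both functions only read the file, for example `list_suspects input.md`. When doing the replacement afterwards, keep in mind that a plain `sed 's/ii/ij/g'` is case-sensitive: it leaves an uppercase VII alone but would still change a lowercase numeral such as vii.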

Audio book

If you want to create an audiobook from the Markdown, you probably want to remove all tags, images, etc. Here is a one-liner that does some of that:

sed -E '/^\*<sup>/d; /^!\[\]/d' input.md | sed -E 's/<sup>[^<]*<\/sup>//g' > output.md
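
The one-liner above deletes lines that start with `*<sup>` or with an image that has empty alt text, and then removes inline `<sup>…</sup>` markers together with their contents. As a possible extension, here is a hedged sketch that wraps the same command in a function and adds two extra substitutions. The function name `strip_for_audio` and the two extra patterns (inline images with alt text, and any remaining HTML tags) are my assumptions about what else may need to go; they are not part of the original command.

```shell
# strip_for_audio FILE (hypothetical helper)
# Writes the cleaned text to standard output.
strip_for_audio() {
  # Step 1 (from the original one-liner): drop footnote lines that start
  # with *<sup> and lines that start with an image with empty alt text.
  sed -E '/^\*<sup>/d; /^!\[\]/d' "$1" |
    # Step 2: remove <sup>...</sup> including its contents (original),
    # then inline images ![alt](path) and any leftover tags (assumed extras).
    sed -E 's/<sup>[^<]*<\/sup>//g; s/!\[[^]]*\]\([^)]*\)//g; s/<[^>]+>//g'
}
```

Usage would be `strip_for_audio input.md > output.md`. The last substitution keeps the text between tags, so `<em>word</em>` becomes `word`; check the result before feeding it to a text-to-speech tool, because removed images can leave double spaces behind.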

Updates to this page

If I have more tips I will add them to this page.

This post is licensed under CC BY 4.0 by the author.