Boost Your GenAI Results with One Simple (and Free) Tool
tldr; Use Pandoc to convert docs to markdown
AI is great at summarizing a document or a small collection of documents. When you get to larger collections, the complexity begins to grow rapidly. More complex prompts are the least of it. You need to set up RAG (retrieval-augmented generation) and the accompanying vector stores. For really large stores, this is going to be necessary regardless. Most of us work in a realm that is between massive content repositories and a manageable set of documents.
One handy helping application for this is Pandoc (https://pandoc.org/), aptly self-described as “your Swiss Army knife” for converting files between formats (without having to do “File > Open > Save As” to the point of carpal tunnel damage). Most of our files are in people-friendly formats like Word and PDF. To an LLM, these files contain mostly useless formatting instructions and metadata (yes, some metadata is useful, but most of it in these files is not going to be helpful as inputs to GenAI models). Pandoc will take those files and convert them to Markdown, which is highly readable for GenAI purposes (and humans can still parse it — and some even prefer it) and use 1/10000000th of the markup for format (confession: I pulled that number out of thin air to get your attention, but the real number is still big enough to matter).
The conversion may not be perfect, especially as the formatting of most documents is not perfect. You can see this for yourself by using the Outline view in Word. With a random document pulled from SharePoint, odds are you will find empty headings between the real ones, entire paragraphs that are marked as headings, or no headings at all because someone manually formatted text using the Normal style to make it look like a heading.
If you are only converting a few documents, you can use a text editor with regex (provided by your favorite GenAI) to do find and replace. Otherwise, leave them as is — it is already in a much more efficient format for prompting against, and the LLM will likely figure it out anyway.
You can get fancier with this by incorporating a call to Pandoc as a tool in an agentic workflow, converting the files at runtime before passing them to an LLM for analysis (and if you are a developer, managing the conversions so that they aren’t wastefully repeated). So long as you are being fancy, you can have it try to fix the minor formatting errors too, but you have already made a huge leap forward just by dumping all the formatting (that is just noise to an LLM) so that the neural network is processing what really matters: the content that is going to make you look like a prompting genius.
©2025 Scott S. Nelson
Originally published at https://theitsolutionist.com/2025/08/12/boost-your-genai-results-with-one-simple-and-free-tool/