Physicist, Emacser, Digitales Spielkind

Goodbye file-format woes: Using Pandoc to export LaTeX documents to word processors
Published on Apr 14, 2021.

I really like writing papers and complex documents in LaTeX. The results look very nice (with a little tweaking) and things tend to behave even when the text gets long. Emacs makes editing the document a smooth experience. For collaboration, I can rely on the awesome power of git.

While LaTeX is quite widespread in my discipline, some of my colleagues prefer WYSIWYG-style word processors. In that case, I still want to at least draft the document in a format that I work efficiently in before converting it to something else. So, here is how I turned a physics paper draft from LaTeX into an MS Word document, with a very satisfying end result!

The go-to tool for these purposes is pandoc. It describes itself as a Swiss-army knife for converting between markup formats and has a long list of supported formats. Many of these are bidirectional, meaning you can convert both to and from that particular format. LaTeX itself, however, can be rather complex, at least behind the scenes. That complexity is what makes the format so flexible and the typesetting so clean, but it also makes conversion from LaTeX into other formats notoriously difficult to get anywhere near right.

Pandoc does come with its own simplified LaTeX parser, which supports enough of the format to produce good-looking and complete conversions, including figures, formulas and references. You have to help it along a little, though, by removing or replacing certain unsupported packages and classes.

My main.tex LaTeX document from the Elsevier template package, which includes all header information, looked roughly like this:

 1: \documentclass[draft]{elsarticle}
 3: \usepackage{lineno,hyperref}
 4: \modulolinenumbers[5]
 6: \journal{Journal of \LaTeX\ Templates}
 7: \usepackage{fixltx2e}
 8: \usepackage[binary-units = true]{siunitx}    % Enable SI units
 9: \DeclareSIUnit\neutron{neutron}
10: \DeclareSIUnit\Bq{Bq}
11: \DeclareSIUnit\uranium{U}
12: \DeclareSIUnit\n{n}
13: \usepackage{tikz}
14: \usetikzlibrary{backgrounds,positioning,fit,decorations.pathmorphing,arrows,shapes,calc,shadows,fadings}
16: \usepackage{xfrac} % for nice (inline) fractions
17: \usepackage[utf8]{inputenc}
18: %% `Elsevier LaTeX' style
19: \bibliographystyle{elsarticle-num}
21: \begin{document}
22: %% Settings for how to display units using the SI and SIrange commands
23: \sisetup{range-phrase=-,range-units=single,product-units = power}
25: \begin{frontmatter}
27: \title{My paper's title}
29: \author[mysecondaryaddress]{Author 1}
30: \author[mymainaddress]{Author 2\corref{mycorrespondingauthor}}
31: \cortext[mycorrespondingauthor]{Corresponding author}
32: \ead{email@example.com}
34: \begin{abstract}
35: Add text here, summarizing the paper.
36: \end{abstract}
38: \begin{keyword}
39: add\sep Text\sep here\sep
40: \end{keyword}
42: \end{frontmatter}
44: \linenumbers
46: \input{article_text}
47: \bibliography{references}
49: \end{document}

In the document above, the main text of the paper lives in article_text.tex, which is pulled into the document's body by the \input on line 46. Running pandoc over this, I ran into issues mostly with the metadata (title and authors, because of the elsarticle frontmatter from line 25 onwards) and with the units (the siunitx package on line 8). siunitx helps with handling various units and typesetting them consistently. For the word-processor document that we want to create, none of this has much of a visible effect, though. So I created a separate export.tex in which I replaced the elsarticle class and most of the style settings, and set up my own, much reduced unit commands:

 1: \documentclass{article}
 3: \usepackage{tikz}
 4: \usetikzlibrary{backgrounds,positioning,fit,decorations.pathmorphing,arrows,shapes,calc,shadows,fadings}
 6: \usepackage{xfrac} % for nice (inline) fractions
 7: \usepackage[utf8]{inputenc}
 9: \bibliographystyle{unsrt}
11: \newcommand\SI[2]{#1~#2}
12: \newcommand\SIrange[3]{#1~#3 - #2~#3}
14: \newcommand\neutron{\textrm{n}}
15: \newcommand\uranium{\textrm{U}}
16: \newcommand\n{\textrm{n}}
17: \newcommand\nano{\textrm{n}}
18: \newcommand\kilo{\textrm{k}}
19: \newcommand\mega{\textrm{M}}
20: \newcommand\giga{\textrm{G}}
21: \newcommand\tera{\textrm{T}}
22: \newcommand\meter{\textrm{m}}
23: \newcommand\m{\textrm{m}}
24: \newcommand\mm{\textrm{mm}}
25: \newcommand\cm{\textrm{cm}}
26: \newcommand\volt{\textrm{V}}
27: \newcommand\micro{\mu}
28: \newcommand\second{\textrm{s}}
29: \newcommand\hour{\textrm{h}}
30: \newcommand\keV{\textrm{keV}}
31: \newcommand\MeV{\textrm{MeV}}
32: \newcommand\per{/}
33: \newcommand\ampere{\textrm{A}}
34: \newcommand\Hz{\textrm{Hz}}
35: \newcommand\hertz{\textrm{Hz}}
36: \newcommand\byte{\textrm{B}}
37: \newcommand\gray{\textrm{Gy}}
38: \newcommand\Bq{\textrm{Bq}}
39: \newcommand\sievert{\textrm{Sv}}
41: \title{My paper's title}
42: \author{Author 1, Author 2}
44: \begin{document}
45: %% Settings for how to display units using the SI and SIrange commands
47: \begin{abstract}
48: Add text here, summarizing the paper.
49: \end{abstract}
51: % \input{article_text}
52: \input{article_text}
53: \bibliography{references}
55: \end{document}

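On line 11, I define my own \SI command for typesetting a number together with its unit, and on line 12 a matching \SIrange for ranges. The unit macros they take as arguments are defined on the lines that follow. Yes, I do need that many different units in the same text… ;)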
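With that many units, it is easy to forget one of the stand-ins. A small helper along the following lines could list the unit macros that appear inside \SI and \SIrange calls but have no \newcommand yet. This is only a minimal sketch: the function names and the sample strings are made up for illustration, and it assumes calls without optional arguments and without nested braces.

```python
from __future__ import annotations

import re

# \SI{value}{unit} or \SIrange{from}{to}{unit}; assumes no optional [...]
# arguments and no nested braces inside the arguments.
SI_CALL = re.compile(r"\\SI(?:range\{[^{}]*\})?\{[^{}]*\}\{([^{}]*)\}")
# A control word such as \kilo or \MeV.
MACRO = re.compile(r"\\([A-Za-z]+)")
# \newcommand\foo{...} or \newcommand{\foo}{...}
NEWCOMMAND = re.compile(r"\\newcommand\s*\{?\\([A-Za-z]+)")


def used_unit_macros(body: str) -> set[str]:
    """Collect the macro names used in the unit argument of \\SI and \\SIrange."""
    macros: set[str] = set()
    for unit in SI_CALL.findall(body):
        macros.update(MACRO.findall(unit))
    return macros


def defined_macros(preamble: str) -> set[str]:
    """Collect the macro names introduced with \\newcommand."""
    return set(NEWCOMMAND.findall(preamble))


def missing_unit_macros(body: str, preamble: str) -> list[str]:
    """Unit macros that the text uses but the preamble does not define."""
    return sorted(used_unit_macros(body) - defined_macros(preamble))


# Hypothetical sample input, standing in for article_text.tex and export.tex.
SAMPLE_BODY = r"A flux of \SI{5}{\n\per\cm\squared\per\second} at \SIrange{1}{10}{\kilo\Hz}."
SAMPLE_PREAMBLE = r"""
\newcommand\n{\textrm{n}}
\newcommand\per{/}
\newcommand\cm{\textrm{cm}}
\newcommand\second{\textrm{s}}
\newcommand\kilo{\textrm{k}}
\newcommand\Hz{\textrm{Hz}}
"""

print(missing_unit_macros(SAMPLE_BODY, SAMPLE_PREAMBLE))  # prints ['squared']
```

For real use, the two strings would be read from article_text.tex and export.tex, for example with pathlib's Path.read_text. Also note that the reduced \SIrange repeats the unit on both ends of the range, unlike the range-units=single setting in main.tex.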

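This can now be converted with pandoc. If you are on pandoc 2.11 or newer, you will most likely also need to add --citeproc to the call, since --bibliography and --csl on their own no longer trigger citation processing there: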

pandoc -s export.tex -o export.docx --bibliography references.bib --csl elsevier-vancouver.csl

The needed citation style file (and many others) can be downloaded from https://citationstyles.org/authors/.

Voilà, we now have an export.docx document! It does not look identical to the LaTeX output (how could it?), but it contains the same information and does not look too shabby!

If you also want to have your figures and equations numbered as you would expect in LaTeX, you need an additional pandoc-filter: pandoc-xnos. It can be installed via pip:

pip install pandoc-fignos pandoc-eqnos pandoc-tablenos \
            pandoc-secnos --user

Then, call pandoc using the corresponding filter:

pandoc -s export.tex -o export.docx --filter pandoc-xnos --bibliography references.bib --csl elsevier-vancouver.csl
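If you run the conversion often, the call can be wrapped in a few lines of Python. The sketch below only assembles the same command line as above and hands it to pandoc. The names build_pandoc_command and convert are hypothetical, and the optional --citeproc switch is an assumption for pandoc 2.11 or newer that you should check against your own version.

```python
from __future__ import annotations

import shutil
import subprocess


def build_pandoc_command(
    source: str,
    output: str,
    bibliography: str,
    csl: str,
    number_references: bool = True,
    citeproc: bool = False,
) -> list[str]:
    """Assemble the pandoc call as an argument list (nothing is executed here)."""
    command = ["pandoc", "-s", source, "-o", output]
    if number_references:
        # The numbering filter is placed before any citation processing.
        command += ["--filter", "pandoc-xnos"]
    if citeproc:
        # Assumption: needed on pandoc 2.11+ to actually process citations.
        command.append("--citeproc")
    command += ["--bibliography", bibliography, "--csl", csl]
    return command


def convert(command: list[str]) -> None:
    """Run the assembled command and raise if pandoc is missing or fails."""
    if shutil.which(command[0]) is None:
        raise RuntimeError(f"{command[0]} was not found on PATH")
    subprocess.run(command, check=True)


command = build_pandoc_command(
    "export.tex", "export.docx", "references.bib", "elsevier-vancouver.csl"
)
print(" ".join(command))
```

Calling convert(command) then performs the actual conversion. Building the argument list separately from running it makes it easy to print and inspect the call first.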

Beware that labels for equations, figures and the like must not contain underscore characters ('_'); each label has to be a single word (camel case is fine, though). At least, that was the case when I tested it.
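Renaming labels by hand gets tedious in a long paper. As a rough sketch of how this could be automated, the following Python snippet finds labels containing underscores and rewrites them, together with their references, to camel case. The function names and the sample labels are invented for illustration, and only \label, \ref, \eqref, \pageref and \autoref with a single label per command are handled.

```python
from __future__ import annotations

import re

LABEL = re.compile(r"\\label\{([^{}]*)\}")
# Commands whose argument is a label name; extend this list as needed.
REF_CMD = re.compile(r"\\(label|ref|eqref|pageref|autoref)\{([^{}]*)\}")


def to_camel_case(label: str) -> str:
    """Turn fig:detector_setup into fig:detectorSetup."""
    head, *rest = label.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def underscore_labels(tex: str) -> dict[str, str]:
    """Map every label that contains an underscore to its camel-case form."""
    return {name: to_camel_case(name) for name in LABEL.findall(tex) if "_" in name}


def rename_labels(tex: str) -> str:
    """Rewrite offending labels and all references that point to them."""
    mapping = underscore_labels(tex)

    def fix(match: re.Match) -> str:
        command, name = match.group(1), match.group(2)
        return "\\" + command + "{" + mapping.get(name, name) + "}"

    return REF_CMD.sub(fix, tex)


# Hypothetical sample text, standing in for article_text.tex.
SAMPLE = r"\label{fig:detector_setup} is shown in \ref{fig:detector_setup}."
print(underscore_labels(SAMPLE))
print(rename_labels(SAMPLE))
```

Before applying this to a real document, check that no two labels collapse to the same camel-case name, and keep the original file under version control.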

Once the document is sent out, you might get back files in unwieldy proprietary formats. In that case, you can extract diffs between the version you sent out and the one you received back, or even from the track-changes feature of MS Word, using pandiff: https://github.com/davidar/pandiff

Hope this helps your workflow – and keeps your colleagues happy while allowing you to use the tools of your choice!

Update 20210610: Added info on numbering equations and figures.

Tags: latex, pandoc