Converting MS Word .doc to LaTeX by command line

This answer is specific to OS X...

Command line utility textutil

There's a nice command line utility called textutil included in OS X that will convert among common document formats:

Word docx to txt

$ textutil -convert txt worddoc.docx

txt to Word docx

$ textutil -convert docx mytextdoc.txt

txt to Word, using Times New Roman 12pt

$ textutil -convert docx -font "Times New Roman" -fontsize 12 blah.txt

Also works with html, rtf, doc, odt, and others...

word2latex and latex2word by using textutil with Pandoc

If you use Pandoc in combination with textutil, you can get a decent Word-to-LaTeX and LaTeX-to-Word round trip. For docx support you need a recent version of Pandoc (1.9+).

word2latex

$ textutil -convert html worddoc.docx -stdout | pandoc -s -f html -t latex -o latexdoc.tex

latex2word

$ pandoc -t docx -f latex -o backtoword.docx latexdoc.tex
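For more than a handful of files, the word2latex pipeline above can be wrapped in a loop. The following is only a sketch: the function name batch_word2latex is made up for illustration, and it assumes textutil and pandoc are both on your PATH and that the .tex files should be written next to the originals.

```shell
# batch_word2latex (hypothetical helper name): run the textutil | pandoc
# pipeline on every .docx file in a directory (default: current directory).
batch_word2latex() {
  dir="${1:-.}"
  for src in "$dir"/*.docx; do
    # If the glob matched nothing, "$src" is the literal pattern; skip it.
    [ -e "$src" ] || continue
    dest="${src%.docx}.tex"
    # Same commands as the single-file word2latex example.
    textutil -convert html "$src" -stdout \
      | pandoc -s -f html -t latex -o "$dest" \
      || echo "conversion failed: $src" >&2
  done
}
```

Calling `batch_word2latex ~/Documents/thesis` (the path is just an example) would leave one .tex file beside each .docx; files that fail to convert are reported on stderr and the loop carries on.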

Antiword is going to do a reasonably good job of converting .doc to .tex files. It makes every effort to preserve not only the content but the formatting as well. It is well suited to the batch processing that you want to do.

Edit: Several people asked me privately about the LaTeX switch in Antiword and about the latest version of Antiword. The latest version is indeed 0.37. As for LaTeX output, I think I mixed things up a bit. I used Antiword for formatted ASCII output. I think it is capable of PostScript output but not of LaTeX output. As Jon observed, you can use pandoc to convert well-formatted ASCII into LaTeX. However, wvWare (wv and wv2) is capable of outputting LaTeX. A word of warning: wvWare is deprecated in favor of AbiWord, but it can still be used for batch processing (I have no clue whether AbiWord can be used from the command line). It is still a somewhat younger program (dormant since 2006) than Antiword (dormant since 2004).
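The Antiword-then-pandoc route mentioned above might look roughly like this. It is a sketch under assumptions: the function name doc2latex_via_text is invented for illustration, and feeding Antiword's plain-text output to pandoc as Markdown is my guess at a workable input format, so expect to tidy up the result by hand.

```shell
# doc2latex_via_text (hypothetical helper name): extract text from a .doc
# with antiword, then have pandoc turn that text into a LaTeX document.
doc2latex_via_text() {
  src="$1"
  dest="${src%.doc}.tex"
  # antiword prints the document text on stdout by default.
  # Assumption: the plain text is read by pandoc as Markdown.
  antiword "$src" | pandoc -s -f markdown -t latex -o "$dest"
}
```

Running `doc2latex_via_text report.doc` (the file name is just an example) would write report.tex in the same directory.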

Finally, there is a tool called catdoc, which is great for batch processing but will not preserve formatting (it is great for extracting content, though, and supports the MS Excel format).


A lot depends on how complicated the Word document's formatting is. I have had very good success with rtf2latex2e, which converts RTF-formatted text to LaTeX. It offers several levels of fidelity to the RTF formatting. I have mainly used its "minimal LaTeX markup mode", which is ideal for a document that will subsequently be hand-edited (which, I understand, is not the same situation as yours).