A pure Python utility for extracting text from docx files.
The code is adapted from python-docx. In addition, it can extract text from headers, footers, and hyperlinks, and it can now also extract images.
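For background, a .docx file is a ZIP archive whose main body is stored in word/document.xml, with headers and footers in separate parts such as word/header1.xml and word/footer1.xml. The following is a minimal sketch of the general idea using only the standard library; it is not docx2txt's actual implementation, it reads the main body only, and `body_text` and `make_sample_docx` are hypothetical names introduced for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def body_text(docx_file):
    """Return paragraph text from the main body of a .docx (path or file object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds one run of visible text
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx(lines):
    """Build a tiny in-memory archive with one paragraph per line (demo helper)."""
    ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    paras = "".join(f"<w:p><w:r><w:t>{line}</w:t></w:r></w:p>" for line in lines)
    xml = f'<w:document xmlns:w="{ns}"><w:body>{paras}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(body_text(make_sample_docx(["Hello", "World"])))
```

The sample archive holds only the one part the sketch reads, so it is not a complete Word document. A fuller extractor would walk the header and footer parts the same way.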
pip install docx2txt
a. From the command line:
# extract text
docx2txt file.docx
# extract text and images
docx2txt -i /tmp/img_dir file.docx
b. From Python:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images to /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
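To process several documents at once, the two calls above can be wrapped in a small loop. This is a sketch under stated assumptions: `extract_folder`, `img_root`, and the injectable `process` argument are hypothetical names, not part of docx2txt. Each document gets its own image sub-directory, on the assumption that images from different documents may share file names.

```python
from pathlib import Path


def extract_folder(folder, img_root=None, process=None):
    """Map each .docx file name in `folder` to its extracted text."""
    if process is None:
        # Imported lazily so a stand-in callable can be supplied instead
        import docx2txt
        process = docx2txt.process
    results = {}
    for path in sorted(Path(folder).glob("*.docx")):
        if img_root is None:
            # Text only
            results[path.name] = process(str(path))
        else:
            # Text plus images, written to one sub-directory per document
            img_dir = Path(img_root) / path.stem
            img_dir.mkdir(parents=True, exist_ok=True)
            results[path.name] = process(str(path), str(img_dir))
    return results
```

For example, `extract_folder("reports", img_root="/tmp/img_dir")` would write the images of `reports/summary.docx` to `/tmp/img_dir/summary`.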