Fails to extract .emf image from the .docx document #45

sand1k · 2023-03-15T15:14:24Z

Tested on the 'Test Summary.docx' file from the following dataset: https://catalog.data.gov/dataset/test-report-for-detonation-velocity-measurements-f22cf

Extracts 4 out of 5 images. The image that is not extracted has a .emf extension.
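
To confirm which embedded files are being skipped, one option is to list the archive's media entries directly. A .docx file is a ZIP container, and embedded images normally live under word/media/. The following is a minimal sketch using only the standard library; the helper names `list_media` and `count_extensions` are hypothetical and not part of docx2txt.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def list_media(docx_path):
    """Return the names of all entries under word/media/ in a .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        ]


def count_extensions(names):
    """Count media entries by lowercase file extension."""
    return Counter(PurePosixPath(name).suffix.lower() for name in names)
```

Calling `count_extensions(list_media("Test Summary.docx"))` should show how many entries use each extension, which makes it easy to compare against the files the extractor actually wrote.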

cyy-2024 · 2024-10-17T13:55:55Z

It looks like python-docx2txt only extracts images in a fixed set of raster formats such as .png and .jpg, so the .emf file is skipped. To handle unsupported formats like .emf, you can extract the file yourself and use an external tool (like ImageMagick, provided your build can read EMF) to convert it to .png. Here's an example workaround using python-docx:
    import subprocess
    from docx import Document

    doc = Document("yourfile.docx")
    for rel in doc.part.rels.values():
        # Skip external links; only embedded parts carry image bytes.
        if rel.is_external or "image" not in rel.target_ref:
            continue
        img_name = rel.target_ref.split("/")[-1]
        # Save supported raster images and .emf files unchanged.
        if img_name.lower().endswith((".png", ".jpg", ".jpeg", ".emf")):
            with open(img_name, "wb") as img_file:
                img_file.write(rel.target_part.blob)
        # Convert .emf to .png; requires the `magick` CLI on PATH.
        if img_name.lower().endswith(".emf"):
            png_name = img_name[:-len(".emf")] + ".png"
            subprocess.run(["magick", img_name, png_name], check=True)
This way, .emf files can be detected and converted automatically.
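
If installing python-docx is not an option, the same idea can be sketched with the standard library alone by copying everything under word/media/ out of the ZIP container, whatever the extension. The function names `extract_all_media` and `convert_emf_to_png` below are hypothetical, and the conversion step assumes that a `magick` executable is on PATH and that the local ImageMagick build can read EMF, which is not guaranteed.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path, PurePosixPath


def extract_all_media(docx_path, out_dir):
    """Copy every file under word/media/ to out_dir, whatever its extension."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            # Use only the base name so entries cannot escape out_dir.
            target = out_dir / PurePosixPath(name).name
            target.write_bytes(archive.read(name))
            written.append(target)
    return written


def convert_emf_to_png(emf_path):
    """Convert one .emf file with ImageMagick; return the .png path or None.

    Assumption: `magick` is installed and its build supports EMF input.
    """
    if shutil.which("magick") is None:
        return None
    png_path = Path(emf_path).with_suffix(".png")
    result = subprocess.run(["magick", str(emf_path), str(png_path)])
    if result.returncode == 0 and png_path.exists():
        return png_path
    return None
```

A typical call would be `extract_all_media("yourfile.docx", "images")`, followed by `convert_emf_to_png(path)` for each returned path whose suffix is `.emf`. Returning `None` instead of raising keeps the extraction usable on machines where the conversion tool is missing.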
