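It looks like python-docx2txt only supports extracting .png and .jpg files. To handle unsupported formats like .emf, you can add a check for the extension and call an external tool (like ImageMagick) to convert .emf to .png. Note that ImageMagick's EMF support depends on the platform and build, so check that `magick` can read .emf on your system first. Here's an example using python-docx: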
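import subprocess
from pathlib import Path

from docx import Document

doc = Document("yourfile.docx")
for rel in doc.part.rels.values():
    # Skip non-image relationships and linked (external) images with no embedded data.
    if rel.is_external or "image" not in rel.reltype:
        continue
    img_path = Path(rel.target_ref.split("/")[-1])
    suffix = img_path.suffix.lower()  # match .EMF, .PNG, etc. as well
    if suffix not in (".png", ".jpg", ".emf"):
        continue
    # Write the embedded image bytes to the current directory.
    img_path.write_bytes(rel.target_part.blob)
    if suffix == ".emf":
        # Convert with ImageMagick; needs the "magick" CLI on PATH with EMF support.
        png_path = img_path.with_suffix(".png")
        subprocess.run(["magick", str(img_path), str(png_path)], check=True)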
This way, .emf files can be detected and converted automatically.
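If the converter may be missing on some machines, the conversion step could be wrapped so that it fails softly. The following is only a sketch: `convert_emf` is a hypothetical helper name, and it assumes the converter accepts `<input> <output>` arguments the way `magick` does.

```python
import shutil
import subprocess
from pathlib import Path


def convert_emf(emf_path, converter="magick"):
    """Convert an .emf file to a .png beside it.

    Returns the .png path on success, or None if the file is not .emf,
    the converter is not on PATH, or the conversion fails.
    """
    emf_path = Path(emf_path)
    if emf_path.suffix.lower() != ".emf":
        return None
    # shutil.which returns None when the executable cannot be found.
    if shutil.which(converter) is None:
        return None
    png_path = emf_path.with_suffix(".png")
    result = subprocess.run([converter, str(emf_path), str(png_path)])
    if result.returncode == 0 and png_path.exists():
        return png_path
    return None
```

A `None` result leaves the original .emf on disk, so the image is still extracted even when it cannot be converted.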
Tested on the 'Test Summary.docx' file from the following dataset: https://catalog.data.gov/dataset/test-report-for-detonation-velocity-measurements-f22cf
It extracts 4 out of 5 images. The image that is not extracted has a .emf extension.
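To double-check which image formats a document actually contains, the .docx can be opened as a zip archive and its media files counted by extension. This is a minimal sketch using only the standard library; `count_media_extensions` is a hypothetical helper name, and it assumes embedded images live under `word/media/`, which is the usual layout for Word documents.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_media_extensions(docx_path):
    """Count files under word/media/ in a .docx by lowercase extension."""
    with zipfile.ZipFile(docx_path) as archive:
        return Counter(
            PurePosixPath(name).suffix.lower()
            for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )


if __name__ == "__main__":
    import sys

    # Usage: python count_media.py "Test Summary.docx"
    for ext, count in sorted(count_media_extensions(sys.argv[1]).items()):
        print(f"{ext or '(no extension)'}: {count}")
```

Comparing these counts with the number of files the extractor writes shows which extensions are being skipped.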