Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug/partition_msg halts for attachmentes with UNK type #3671

Closed
S1M0N38 opened this issue Sep 27, 2024 · 7 comments · Fixed by #3694
Closed

bug/partition_msg halts for attachmentes with UNK type #3671

S1M0N38 opened this issue Sep 27, 2024 · 7 comments · Fixed by #3694
Assignees
Labels
email Related to EML and/or MSG format msg Related to partitioning Outlook MSG files

Comments

@S1M0N38
Copy link

S1M0N38 commented Sep 27, 2024

Currently, the attachment_partitioner is hardcoded to partition in the following file:

def _iter_elements(self) -> Iterator[Element]:
"""Partition the file in an `oxmsg.attachment.Attachment` into elements."""
from unstructured.partition.auto import partition
with tempfile.TemporaryDirectory() as tmp_dir_path:
# -- save attachment as file in this temporary directory --
detached_file_path = os.path.join(tmp_dir_path, self._attachment_file_name)
with open(detached_file_path, "wb") as f:
f.write(self._file_bytes)
# -- partition the attachment --
for element in partition(
detached_file_path,
metadata_filename=self._attachment_file_name,
metadata_last_modified=self._attachment_last_modified,
**self._opts.partitioning_kwargs,
):
element.metadata.attached_to_filename = self._opts.metadata_file_path
yield element

However, according to the official documentation, the partition-msg function accepts attachment_partitioner as an argument.

@S1M0N38 S1M0N38 changed the title Hardcoded attachment_partitioner for msg bug/Hardcoded attachment_partitioner for msg Sep 27, 2024
@S1M0N38
Copy link
Author

S1M0N38 commented Sep 27, 2024

#3605 PR (Draft)

@scanny
Copy link
Collaborator

scanny commented Sep 27, 2024

@S1M0N38 that was removed on purpose. @Paul-Cornell can you remove that parameter from the docs for us?

@S1M0N38
Copy link
Author

S1M0N38 commented Sep 27, 2024

When a .msg file contains an attachment of an unsupported type (UNK), the partition_msg function halts. I've implemented a custom attachment_partitioner to filter out unsupported types.

Is there another way to process the supported types and ignore the unsupported ones?

@Paul-Cornell
Copy link

@scanny I'm not quite sure how to "remove" that parameter from the docs. Searching for attachment_partitioner across the docs returns four results, some in text, and some in code. I don't have enough context to know exactly what needs to be removed here. Could you please either advise, or create a PR in the docs repo? Thanks!

@scanny
Copy link
Collaborator

scanny commented Sep 28, 2024

@S1M0N38 Ahh, okay, so that's a bug then. You shouldn't have to provide a custom partitioner for that :)

Shall we make this into a bug report for that or do you want to open a new one?

The correct behavior would be for partition_msg() to simply skip any attachments it doesn't know how to partition.

@scanny
Copy link
Collaborator

scanny commented Sep 28, 2024

@Paul-Cornell it looks like the text is about the same for partition_email() and partition_msg() on that page:


You can optionally partition e-mail attachments by setting process_attachments=True. If you set process_attachments=True, you’ll also need to pass in a partitioning function to attachment_partitioner. The following is an example of what the workflow looks like:

from unstructured.partition.auto import partition
from unstructured.partition.email import partition_email

filename = "example-docs/eml/fake-email-attachment.eml"
elements = partition_email(
  filename=filename, process_attachments=True, attachment_partitioner=partition
)

You can change it like this in each case:

You can optionally partition e-mail attachments by setting process_attachments=True. The following is an example of what the workflow looks like:

from unstructured.partition.email import partition_email

filename = "example-docs/eml/fake-email-attachment.eml"
elements = partition_email(filename=filename, process_attachments=True)

So:

  • get rid of the middle sentence
  • remove the import of partition
  • remove the attachment_partitioner=partition argument in the partition_email() call.

@S1M0N38
Copy link
Author

S1M0N38 commented Sep 28, 2024

@S1M0N38 Ahh, okay, so that's a bug then. You shouldn't have to provide a custom partitioner for that :)

Shall we make this into a bug report for that or do you want to open a new one?

The correct behavior would be for partition_msg() to simply skip any attachments it doesn't know how to partition.

let's make this issue into a bug report and modify the title accordingly.

@S1M0N38 S1M0N38 changed the title bug/Hardcoded attachment_partitioner for msg bug/partition_msg halts for attachmentes with UNK type Sep 28, 2024
@scanny scanny self-assigned this Oct 3, 2024
@scanny scanny added msg Related to partitioning Outlook MSG files email Related to EML and/or MSG format labels Oct 3, 2024
github-merge-queue bot pushed a commit that referenced this issue Oct 16, 2024
**Summary**
Initial attempts to incrementally refactor `partition_email()` into
shape to allow pluggable partitioning quickly became too complex for
ready code-review. Prepare separate rewritten module and tests and swap
them out whole.

**Additional Context**
- Uses the modern stdlib `email` module to reliably accomplish several
manual decoding steps in the legacy code.
- Remove obsolete email-specific element-types which were replaced 18
months or so ago with email-specific metadata fields for things like Cc:
addresses, subject, etc.
- Remove accepting an email as `text: str` because MIME-email is
inherently a binary format which can and often does contain multiple and
contradictory character-encodings.
- Remove `encoding` parameters as it is now unused. An email file is not
a text file and as such does not have a single overall encoding.
Character encoding is specified individually for each MIME-part within
the message and often varies from one part to another in the same
message.
- Remove the need for a caller to specify `attachment_partitioner`.
There is only one reasonable choice for this which is
`auto.partition()`, consistent with the same interface and operation in
`partition_msg()`.
- Fixes #3671 along the way by silently skipping attachments with a
file-type for which there is no partitioner.
- Substantially extend the test-suite to cover multiple
transport-encoding/charset combinations.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
email Related to EML and/or MSG format msg Related to partitioning Outlook MSG files
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants