wrong numbering on converting list of bullet contents #150

Open
ZeyuTeng96 opened this issue Dec 27, 2024 · 7 comments

@ZeyuTeng96

Hi, I was trying to convert a structured Word document into HTML, but the numbering in the result is wrong.

Here is the Word file I converted:
image

The converted html is:

<ul><li>Heading #1</li></ul><p>Some content at here. A beautiful day. A nice Friday. Balabala</p><ol><li>Friday:</li></ol><p>Today is Friday</p><ol><li>Saturday:</li></ol><p>The day comes after. Which is tomorrow.</p><ol><li>Sunday:</li></ol><p>The day after tomorrow.</p><ul><li>Heading #2</li></ul><p>How nice it is?</p>

This renders as:
image

What I want is HTML like the following:

<ul><li>Heading #1</li><p>Some content at here. A beautiful day. A nice Friday. Balabala</p><ol><li>Friday:</li><p>Today is Friday</p><li>Saturday:</li><p>The day comes after. Which is tomorrow.</p><li>Sunday:</li><p>The day after tomorrow.</p></ol><li>Heading #2</li><p>How nice it is?</p></ul>

Which looks like:
image

What causes the wrong numbering, and how can I fix it?

@mwilliamson
Owner

I can't be sure without an example document, but I believe the issue is that (for instance) the paragraphs after the first bullet point are not themselves part of a list, i.e. so far as the source document is concerned, each one is just an ordinary paragraph. A similar issue for mammoth.js contains an explanation and possible solutions.
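
To illustrate the point: in the converted HTML, each ordinary paragraph ends the list before it, so every numbered item lands in its own `<ol>` and is rendered as item 1. The following is a minimal sketch, using only the Python standard library, that counts the items in each ordered list of an HTML fragment. The names `OrderedListCounter` and `ol_item_counts` are made up for this example and are not part of mammoth.

```python
from html.parser import HTMLParser


class OrderedListCounter(HTMLParser):
    """Count the <li> items that belong to each <ol> in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open lists, as [tag, item_count]
        self.counts = []  # item counts of closed <ol> elements, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.stack.append([tag, 0])
        elif tag == "li" and self.stack:
            # Attribute the item to the innermost open list.
            self.stack[-1][1] += 1

    def handle_endtag(self, tag):
        if tag in ("ol", "ul") and self.stack and self.stack[-1][0] == tag:
            closed_tag, count = self.stack.pop()
            if closed_tag == "ol":
                self.counts.append(count)


def ol_item_counts(html):
    """Return the number of items in every <ol>, in the order the lists close."""
    parser = OrderedListCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

For the converted HTML in the first comment this returns `[1, 1, 1]` (three one-item lists), while the desired HTML gives `[3]` (one list with three items).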

@ZeyuTeng96
Author

Heading.docx

Hi there,

This is the Word document I used for testing. I also looked at the similar issue on mammoth.js: https://github.com/mwilliamson/mammoth.js/issues/121

I saw that providing a style map is one way to fix it.

However, when I applied this style map, nothing changed.

The code is:

import mammoth

# Map continuation paragraphs into the preceding ordered-list item.
style_map = """p[style-name='Ordered List 1 Continuation'] => ol > li > p:fresh"""

with open('./input_files/Heading.docx', "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=style_map)

html_content = result.value  # the generated HTML
print(html_content)

I am not sure whether I am applying the style map correctly.

@mwilliamson
Owner

To use that specific style map, you'll need to make sure that the continuation paragraphs have a style with that name applied to them, and that the preceding paragraph (i.e. the paragraph that actually has the bullet point) is a top-level unordered list.

@ZeyuTeng96
Author

Oh, so how can I check those paragraphs' styles? Do I have to inspect the XML elements?

@mwilliamson
Owner

You should be able to view the paragraph style when editing the paragraph in Microsoft Word.
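
If checking in Word is inconvenient, the paragraph styles can also be read directly from the file, since a .docx is a zip archive of XML parts. The following is a rough sketch using only the Python standard library. It assumes the usual `word/document.xml` and `word/styles.xml` parts, and the function name `paragraph_styles` is made up for this example. Paragraphs with no explicit style are reported as `None` (they use the document's default paragraph style).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by document.xml and styles.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_styles(docx):
    """Return a list of (style_name, text) pairs, one per paragraph.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        document = ET.fromstring(archive.read("word/document.xml"))
        # Paragraphs reference styles by ID; styles.xml maps each ID to its name.
        names = {}
        if "word/styles.xml" in archive.namelist():
            styles = ET.fromstring(archive.read("word/styles.xml"))
            for style in styles.findall("w:style", NS):
                name = style.find("w:name", NS)
                if name is not None:
                    names[style.get(f"{{{W}}}styleId")] = name.get(f"{{{W}}}val")

    rows = []
    for paragraph in document.iter(f"{{{W}}}p"):
        p_style = paragraph.find("w:pPr/w:pStyle", NS)
        style_id = p_style.get(f"{{{W}}}val") if p_style is not None else None
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W}}}t"))
        # Fall back to the raw ID if styles.xml has no name for it.
        rows.append((names.get(style_id, style_id), text))
    return rows
```

Calling `paragraph_styles("./input_files/Heading.docx")` and printing each pair shows which style name, if any, is applied to each continuation paragraph. That name is the value to compare against the `style-name` in the style map.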

@ZeyuTeng96
Author

Hi,

It took me a while to figure this out.

If I want specific groups of paragraphs to have specific HTML paths, as in the following image, I have to set a unique style for each of the two groups of paragraphs. Am I right?

image

Can I give those paragraphs the same style and provide two style maps for that style? Can mammoth automatically apply two different style maps to different paragraphs?

@mwilliamson
Owner

Yes, I believe you'd have to have two separate styles for the two groups of paragraphs.
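
As a sketch of what that could look like: a mammoth style map is a string with one mapping per line, so two styles need two lines. The style names `Bullet Continuation` and `Number Continuation` below are hypothetical and would have to match the names actually used in the document. The target paths are also assumptions, and need to match the lists mammoth generates for the preceding list items. `build_style_map` and `convert` are helpers made up for this example; `mammoth.convert_to_html` with `style_map` is the call used earlier in this thread.

```python
def build_style_map(mappings):
    """Join {style name: HTML path} pairs into a style map, one rule per line.

    Style names containing a single quote are not handled in this sketch.
    """
    return "\n".join(
        f"p[style-name='{name}'] => {path}" for name, path in mappings.items()
    )


# Hypothetical style names: one per group of continuation paragraphs.
STYLE_MAP = build_style_map({
    "Bullet Continuation": "ul > li > p:fresh",
    "Number Continuation": "ol > li > p:fresh",
})


def convert(docx_path):
    """Convert a .docx file to HTML using the style map above."""
    import mammoth  # third-party package, imported here so the rest runs without it

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=STYLE_MAP)
    return result.value
```

Because the rules are matched by style name, two groups of paragraphs that share a single style cannot be sent to different paths this way.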
