PDF/UA accessibility. Labeled strange. #2153

Open
marina31714 opened this issue May 9, 2024 · 3 comments
Labels
bug Existing features not working as expected

Comments

marina31714 commented May 9, 2024

Hello,
I'm trying to generate a PDF from HTML with PDF/UA, but the resulting tags look strange.
Is this tagging correct? Is there any way to modify it? This is the first time I have used your library, and I am very interested in the accessibility part.

I am using Adobe Acrobat Pro to inspect the tags.

Thank you in advance.

HTML:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
    <title>Ejemplo PDF</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 50px;
        }
        h1 {
            color: pink;
        }
    </style>
</head>
<body>
    <h1>Hello World</h1>
    <p>Lorem ipsum dolor sit amet consectetur adipiscing elit pellentesque, eros blandit porttitor primis mollis nisi in nunc, ante interdum vestibulum viverra mattis et sociosqu. Faucibus a risus laoreet posuere placerat class tempus vehicula, dignissim congue netus odio potenti phasellus malesuada sodales habitant, egestas id imperdiet sociis vitae taciti curabitur.</p>
</body>
</html>

Python (Flask):

from flask import Flask, render_template, make_response
from weasyprint import HTML

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/pdf')
def generate_pdf():
    # Convert the template file directly (no Jinja rendering) to a PDF/UA-1 file on disk.
    HTML('./templates/index.html').write_pdf(
        'test_pdf_ua.pdf', pdf_version='1.6', pdf_variant='pdf/ua-1')
    return "PDF Generated"

if __name__ == '__main__':
    app.run(debug=True)

Result:
image

Expected result:
image

Member

liZe commented May 29, 2024

Hmmm… There’s something strange in these labels. We have to check what’s wrong and try to improve this structure.

liZe added the bug (Existing features not working as expected) label May 29, 2024
Member

liZe commented Aug 13, 2024

Here’s a PDF with new labels. Could you please test how it works in your PDF reader?

ua.pdf

(By the way, what’s your PDF reader?)

Contributor

dhdaines commented Sep 10, 2024

I too wondered about this NonStruct in the structure tree generated by WeasyPrint. It is in the standard (PDF 32000-1:2008, page 584):

NonStruct (Nonstructural element): A grouping element having no inherent structural significance; it serves solely for grouping purposes. This type of element differs from a division (structure type Div) in that it shall not be interpreted or exported to other document formats; however, its descendants shall be processed normally.

But I am not really sure why it needs to be used in this case. It appears WeasyPrint is treating <html> and <body> as "nonstructural elements" wrapped around the text content. It could probably just not do that. On the other hand, it isn't necessarily incorrect to do so (after all, they are grouping elements having no inherent structural significance), just unexpected, since other PDF/UA tools (like Microsoft Word) don't do it.
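
To make the "descendants shall be processed normally" part concrete, here is a minimal sketch of how a consumer might skip NonStruct wrappers while keeping their children. It works on plain dicts shaped like the pdfplumber JSON shown further down; the function name `flatten_nonstruct` and the sample tree are made up for illustration and are not part of WeasyPrint or pdfplumber.

```python
def flatten_nonstruct(node):
    """Return the nodes that replace `node` once NonStruct wrappers are dropped."""
    children = []
    for child in node.get("children", []):
        children.extend(flatten_nonstruct(child))
    if node["type"] == "NonStruct":
        # The wrapper itself is ignored; its descendants take its place.
        return children
    flat = {key: value for key, value in node.items() if key != "children"}
    if children:
        flat["children"] = children
    return [flat]


# Hypothetical tree: Document > NonStruct (<html>) > NonStruct (<body>) > H1, P
tree = {"type": "Document", "children": [
    {"type": "NonStruct", "children": [
        {"type": "NonStruct", "children": [
            {"type": "H1", "text": ["Hello World"]},
            {"type": "P", "text": ["Lorem ipsum ..."]},
        ]},
    ]},
]}

flattened = flatten_nonstruct(tree)
print([child["type"] for child in flattened[0]["children"]])  # ['H1', 'P']
```

With the wrappers skipped, the tree reads Document > H1, P, which is probably close to what other PDF/UA tools emit directly.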

The above ua.pdf is definitely not correct though. pdfinfo -struct won't even read it:

$ pdfinfo -struct ~/Downloads/ua.pdf
Syntax Error: StructElem object is wrong type (None)
Syntax Error: StructElem object is wrong type (None)
Document

You can also look at structure trees with pdfplumber --structure-text (I'm biased because I contributed this functionality), which gives JSON and tries to be tolerant of invalid structure trees (of which there are many):

[{"type": "Document", "children": [
  {"type": "None", "page_number": 1, "children": [
    {"type": "None", "page_number": 1, "children": [
      {"type": "H1", "page_number": 1, "mcids": [0], "text": ["Hello World"]},
      {"type": "P", "page_number": 1, "mcids": [1], "text": ["Lorem ipsum dolor sit amet consectetur adipiscing elit pellentesque, erosblandit porttitor primis mollis nisi in nunc, ante interdum vestibulum viverramattis et sociosqu. Faucibus a risus laoreet posuere placerat class tempusvehicula, dignissim congue netus odio potenti phasellus malesuada sodaleshabitant, egestas id imperdiet sociis vitae taciti curabitur."]}]}]}]}]

Clearly None is not in the standard as a structure element ;-)
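
A quick way to spot this kind of problem automatically is to walk the JSON and compare each type against the standard structure types. The sketch below assumes JSON in the shape shown above; the `STANDARD_TYPES` set is my transcription of the standard structure types in PDF 32000-1:2008 (section 14.8.4), and the helper `nonstandard_types` is hypothetical. It also ignores the RoleMap, which can legitimately map custom types to standard ones, so treat a hit as a hint rather than a verdict.

```python
import json

# Standard structure types, transcribed from PDF 32000-1:2008 section 14.8.4.
STANDARD_TYPES = {
    # Grouping elements
    "Document", "Part", "Art", "Sect", "Div", "BlockQuote", "Caption",
    "TOC", "TOCI", "Index", "NonStruct", "Private",
    # Block-level elements
    "P", "H", "H1", "H2", "H3", "H4", "H5", "H6",
    "L", "LI", "Lbl", "LBody",
    "Table", "TR", "TH", "TD", "THead", "TBody", "TFoot",
    # Inline-level elements
    "Span", "Quote", "Note", "Reference", "BibEntry", "Code", "Link",
    "Annot", "Ruby", "RB", "RT", "RP", "Warichu", "WT", "WP",
    # Illustration elements
    "Figure", "Formula", "Form",
}


def nonstandard_types(nodes, path=()):
    """Yield (path, type) for each element whose type is not a standard one."""
    for node in nodes:
        here = path + (node["type"],)
        if node["type"] not in STANDARD_TYPES:
            yield "/".join(here), node["type"]
        yield from nonstandard_types(node.get("children", []), here)


# Abbreviated copy of the structure tree above.
doc = json.loads("""
[{"type": "Document", "children": [
  {"type": "None", "page_number": 1, "children": [
    {"type": "None", "page_number": 1, "children": [
      {"type": "H1", "page_number": 1, "mcids": [0]},
      {"type": "P", "page_number": 1, "mcids": [1]}]}]}]}]
""")

problems = list(nonstandard_types(doc))
for where, kind in problems:
    print(f"{where}: {kind!r} is not a standard structure type")
```

On the tree above this reports the two nested None elements and nothing else.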
