Collection of various unrelated fixes, workarounds and hacks #16

Open
wants to merge 10 commits into
base: master
from

Conversation

hheimbuerger

I am fully aware that this isn't a 'merge as seen' PR. It's more of an unsorted collection of hacks and workarounds that I applied to my local fork of slackchannel2pdf to quickly run the one-time archive I needed to complete.

Mostly, this was me running my archive tasks in a debugger until an exception was raised, then hacking something into those lines without any real comprehension of the wider context my changes operated in.

I'm sorry, but I don't have the time to clean these up and fully develop them right now. I still wanted to contribute them in some way, hoping it might help someone get started and really polish them into a releasable state. As such, I'm also not offended if you just close this PR.

I did carefully group my changes by commit, and spent some effort on the commit messages, so I recommend reading them by commit. They also contain example stack traces where applicable.

The main change is probably the fix for the rate-limiting exceptions, which I hit pretty reliably on almost every non-trivial channel. They always came from the pagination function, I believe while retrieving messages (I didn't preserve the specific stack traces). My solution isn't very elegant, but I never experienced any issues with rate limiting again.
There are probably other calls not based on `_fetch_pages` that should receive the same treatment, but I didn't run into them, so I didn't go searching either.
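
For anyone picking this up, here is a minimal sketch of the general idea of retrying a page fetch after a rate-limit error. It is not the code from the commit: `RateLimited`, `call_with_retry` and the `fetch_page` callable are hypothetical names, and the real Slack client raises its own exception type, which would have to be mapped onto this.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the client's HTTP 429 error."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_retry(fetch_page, *, max_attempts=5, sleep=time.sleep):
    """Call fetch_page(), waiting and retrying whenever it is rate limited."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait for as long as the server asked before trying again.
            sleep(exc.retry_after)
```

The `sleep` parameter exists only so the wait can be replaced in tests. In a real run the default `time.sleep` would be used.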

I hereby put all my changes under the MIT license, and you're free to use, modify, sell or ignore them in any way you see fit. 😀

No font is registered to support a combination of italic and mono. This
adds those registrations, although I don't have the required fonts
at hand, so it simply uses the non-italic versions of the mono font,
effectively ignoring all italic formatting in mono blocks.

Fixes exception:

```
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 209, in <module>
    main()
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 89, in main
    result = exporter.run(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 883, in run
    self._write_messages_to_pdf(document, messages, threads)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 538, in _write_messages_to_pdf
    last_user_id = self._parse_message_and_write_to_pdf(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 189, in _parse_message_and_write_to_pdf
    document.write_html(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\fpdf_ext.py", line 90, in write_html
    self._open_tag(tag, attributes)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\fpdf_ext.py", line 135, in _open_tag
    self.set_font(font_family, size=size, style=style)
  File "C:\Develop\slackchannel2pdf\fpdf_mod\fpdf.py", line 737, in set_font
    self.error("Undefined font: " + family + " " + style)
  File "C:\Develop\slackchannel2pdf\fpdf_mod\fpdf.py", line 286, in error
    raise RuntimeError("FPDF error: " + msg)
RuntimeError: FPDF error: Undefined font: notosansmono I
```
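
A sketch of the fallback registration described above, assuming the bundled renderer keeps the PyFPDF-style `add_font(family, style, fname, uni)` signature. The font file names are placeholders, not the files shipped with the project.

```python
def mono_font_registrations(
    family="NotoSansMono",
    regular="NotoSansMono-Regular.ttf",  # placeholder file name
    bold="NotoSansMono-Bold.ttf",        # placeholder file name
):
    """Return (family, style, file) triples for all four style combinations."""
    return [
        (family, "", regular),
        (family, "B", bold),
        (family, "I", regular),  # no italic mono font: fall back to regular
        (family, "BI", bold),    # no bold-italic mono font: fall back to bold
    ]


def register_fonts(document, registrations):
    """Register every triple on a document exposing a PyFPDF-style add_font."""
    for family, style, filename in registrations:
        document.add_font(family, style=style, fname=filename, uni=True)
```

With these registrations, `set_font` finds an entry for the `I` style and no longer raises, at the cost of italic text in mono blocks rendering upright.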
The PDF renderer for Python 3 encodes everything as latin1, which is
of course *very* unrealistic for Slack workspaces in an international
environment.

This terrible hack replaces some common Unicode modifier code points
(e.g. for ´ and ` accents), effectively ignoring them to prevent hard
crashes.

Fixes exception:
```
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 209, in <module>
    main()
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 89, in main
    result = exporter.run(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 889, in run
    document.output(str(filename_pdf))
  File "C:\Develop\slackchannel2pdf\fpdf_mod\fpdf.py", line 1325, in output
    buffer = self.buffer.encode("latin1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2028' in position 398771: ordinal not in range(256)
```
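
A more general alternative to a hand-picked replacement list could look like the sketch below. This is not what the commit does: `to_latin1_safe` is a hypothetical helper that composes accents where possible, drops leftover combining marks, and substitutes a placeholder for anything else latin-1 cannot represent.

```python
import unicodedata


def to_latin1_safe(text: str, placeholder: str = "?") -> str:
    """Return a version of text that is guaranteed to encode as latin-1."""
    # NFC merges base letter + combining accent into one code point if one exists.
    composed = unicodedata.normalize("NFC", text)
    out = []
    for ch in composed:
        if unicodedata.combining(ch):
            continue  # drop combining marks that could not be merged
        try:
            ch.encode("latin-1")
            out.append(ch)
        except UnicodeEncodeError:
            out.append(placeholder)  # e.g. U+2028 from the traceback above
    return "".join(out)
```

The output loses information, so this only trades a hard crash for visible placeholders. Proper Unicode support in the renderer would be the real fix.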
The PDF renderer raised an exception if an `<s>` tag was nested. While
it's true that that's not supposed to occur, it does. I can't quite tell
why, but it is happening a lot, and on messages that only contain one
`<s>` tag, so I assume the `_last_font` must already be set outside the
message for some reason.

This change simply allows those nested tags, effectively ignoring the
nested tag, but at least moving on with the rendering nevertheless.

Fixes exception:
```
[...]
Traceback (most recent call last):
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\fpdf_ext.py", line 90, in write_html
    self._open_tag(tag, attributes)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\fpdf_ext.py", line 112, in _open_tag
    raise HtmlConversionError("<s> tags can not be nested")
slackchannel2pdf.fpdf_ext.HtmlConversionError: <s> tags can not be nested

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\logging\__init__.py", line 1100, in emit
    msg = self.format(record)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\logging\__init__.py", line 943, in format
    return fmt.format(record)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\logging\__init__.py", line 678, in format
    record.message = record.getMessage()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\logging\__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
[...]
```
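
One way to tolerate nesting without raising is to count depth and only switch the font at the outermost tag. The sketch below illustrates that idea in isolation. `StrikeState` is a hypothetical helper and not how `fpdf_ext.py` tracks state (the commit works with `_last_font`).

```python
class StrikeState:
    """Track <s> nesting so only the outermost tag changes the font."""

    def __init__(self):
        self.depth = 0

    def open(self) -> bool:
        """Register an opening <s>; True means the caller should switch style."""
        self.depth += 1
        return self.depth == 1

    def close(self) -> bool:
        """Register a closing </s>; True means the caller should restore style."""
        if self.depth == 0:
            return False  # stray closing tag: ignore it instead of raising
        self.depth -= 1
        return self.depth == 0
```

This would not explain why the state is already set when a message starts, which still looks like the underlying bug.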
This one's nasty, because it causes another exception to occur while
handling this one.
The PDF renderer attempts to parse HTML by means of regex, which you
[cannot do](https://stackoverflow.com/a/1732454/6278). This breaks when
encountering text like `A <> B`, which it thinks is an empty HTML tag,
causing an IndexError when trying to identify whether it's a closing tag
(`if part[0] == "/"`).

This is a small workaround for this specific issue, although the bigger
problem is really the approach of parsing HTML with regex.
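
To illustrate the failure mode and the guard, here is a small stand-alone tokenizer. The regex and the `split_tags` function are assumptions for the sake of the example and are not copied from the renderer.

```python
import re


def split_tags(html: str):
    """Split a string into ("text" | "open" | "close", value) tokens."""
    # With one capture group, re.split alternates text and tag content.
    parts = re.split(r"<(.*?)>", html)
    tokens = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            if part:
                tokens.append(("text", part))
        elif not part:
            # "<>" is not a tag; part[0] would raise IndexError here.
            tokens.append(("text", "<>"))
        elif part.startswith("/"):
            tokens.append(("close", part[1:]))
        else:
            tokens.append(("open", part))
    return tokens
```

Using `part.startswith("/")` instead of `part[0] == "/"` is safe on an empty string, and the explicit empty check keeps the literal `<>` in the output.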

When running a lot of export attempts, it's convenient to be able to
see in the logs which channels were being exported.

Fetching the details of a bot commonly fails. At least log the name of
the missing bot to support further troubleshooting.

Example result:
```
Exporting channel #office from Slack...
[WARNING] Bot not found, could not fetch name: B01
[WARNING] Bot not found, could not fetch name: 49927755382
[WARNING] Bot not found, could not fetch name: 34639333695
[WARNING] Bot not found, could not fetch name: 89759792692
[WARNING] Bot not found, could not fetch name: 62872444257
written: Workspace_office.pdf
```
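
A sketch of how such a warning could be emitted. `bot_name_or_placeholder` and the `lookup` callable are hypothetical and stand in for whatever the exporter uses to resolve bot names.

```python
import logging

logger = logging.getLogger(__name__)


def bot_name_or_placeholder(bot_id: str, lookup) -> str:
    """Resolve a bot name, logging the ID when the lookup fails."""
    try:
        return lookup(bot_id)
    except KeyError:
        # Pass the ID as an argument so logging does the formatting lazily.
        logger.warning("Bot not found, could not fetch name: %s", bot_id)
        return f"unknown_bot_{bot_id}"
```

Note the `%s` placeholder in the message: passing extra arguments without a matching placeholder is what produces a `TypeError: not all arguments converted during string formatting` inside the logging module.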
In retrospect, the code for this might have already been there, just
following the exception-raising part, instead of preceding it. I've only
noticed that later and cannot comment on what the

Fixes exception:
```
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 209, in <module>
    main()
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 89, in main
    result = exporter.run(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 883, in run
    self._write_messages_to_pdf(document, messages, threads)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 538, in _write_messages_to_pdf
    last_user_id = self._parse_message_and_write_to_pdf(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 445, in _parse_message_and_write_to_pdf
    layout_block["text"]["text"],
KeyError: 'text'
```
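
Independent of how the commit resolves it, the `KeyError` above can be avoided by reading the nested field defensively. A minimal sketch, with `block_text` as a hypothetical helper:

```python
def block_text(layout_block: dict):
    """Return layout_block["text"]["text"] if present, otherwise None."""
    text_obj = layout_block.get("text")
    if isinstance(text_obj, dict):
        return text_obj.get("text")
    return None  # block has no text object, e.g. a divider or image block
```

The caller can then skip blocks for which this returns `None` instead of aborting the whole export.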
The script often encounters messages it cannot process. To aid in the
debugging of these messages and to better document the missing message
in the logs, print out a permalink to the message that can be opened
in a browser for review.
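
A sketch of building such a link locally, assuming the usual `/archives/<channel>/p<timestamp>` permalink layout. `build_permalink` is a hypothetical helper, and asking Slack itself via its `chat.getPermalink` Web API method would be the authoritative route.

```python
def build_permalink(workspace_url: str, channel_id: str, ts: str) -> str:
    """Build a browser link to a message from its channel ID and timestamp."""
    # The message ID in the URL is the timestamp without the dot, prefixed by "p".
    message_id = "p" + ts.replace(".", "")
    return f"{workspace_url.rstrip('/')}/archives/{channel_id}/{message_id}"
```

For example, `build_permalink("https://example.slack.com", "C012345", "1640995200.000100")` yields `https://example.slack.com/archives/C012345/p1640995200000100`. Replies inside a thread need an additional query parameter that this sketch does not cover.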

Some messages contain no `user` or `bot_id` field, but they do contain
a `username` field. Previously, these were dropped completely and logged
as 'can not process message'. But they're actually pretty standard
messages that can be processed like any other chat message.
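
The fallback order could be sketched like this. `message_author` and the two name dictionaries are hypothetical and only illustrate the lookup order, not the exporter's actual data structures.

```python
def message_author(msg: dict, user_names: dict, bot_names: dict):
    """Pick a display name: user first, then bot, then the raw username field."""
    if "user" in msg:
        return user_names.get(msg["user"], msg["user"])
    if "bot_id" in msg:
        return bot_names.get(msg["bot_id"], msg["bot_id"])
    if "username" in msg:
        return msg["username"]  # previously this case was dropped entirely
    return None  # genuinely unattributable message
```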