Collection of various unrelated fixes, workarounds and hacks #16
Open: hheimbuerger wants to merge 10 commits into ErikKalkoken:master from hheimbuerger:dev
No font is registered to support a combination of italic and mono. This adds those registrations, although I don't have the required fonts at hand, so it simply uses the non-italic versions of the mono font, effectively ignoring all italic formatting in mono blocks.

Fixes exception:

```
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 209, in <module>
    main()
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 89, in main
    result = exporter.run(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 883, in run
    self._write_messages_to_pdf(document, messages, threads)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 538, in _write_messages_to_pdf
    last_user_id = self._parse_message_and_write_to_pdf(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 189, in _parse_message_and_write_to_pdf
    document.write_html(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\fpdf_ext.py", line 90, in write_html
    self._open_tag(tag, attributes)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\fpdf_ext.py", line 135, in _open_tag
    self.set_font(font_family, size=size, style=style)
  File "C:\Develop\slackchannel2pdf\fpdf_mod\fpdf.py", line 737, in set_font
    self.error("Undefined font: " + family + " " + style)
  File "C:\Develop\slackchannel2pdf\fpdf_mod\fpdf.py", line 286, in error
    raise RuntimeError("FPDF error: " + msg)
RuntimeError: FPDF error: Undefined font: notosansmono I
```
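The fallback described above can be sketched as a small table of registrations in which the italic styles reuse the upright font files. This is only an illustration: the file names and the `mono_registrations` helper are hypothetical, and each tuple would still have to be passed to whatever font registration call the bundled `fpdf_mod` exposes, which was not checked here.

```python
# Hypothetical file names; the font files shipped with the project may differ.
REGULAR = "NotoSansMono-Regular.ttf"
BOLD = "NotoSansMono-Bold.ttf"


def mono_registrations(family, regular_file, bold_file):
    """Return (family, style, file) tuples covering "", "B", "I" and "BI".

    The italic styles point at the upright files, so italic formatting in
    mono blocks is silently ignored instead of raising "Undefined font".
    """
    return [
        (family, "", regular_file),
        (family, "B", bold_file),
        (family, "I", regular_file),   # italic falls back to regular
        (family, "BI", bold_file),     # bold italic falls back to bold
    ]


for family, style, fname in mono_registrations("NotoSansMono", REGULAR, BOLD):
    print(f"register {family!r} style={style!r} -> {fname}")
```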
The PDF renderer for Python 3 encodes everything as latin1, which is of course *very* unrealistic for Slack workspaces in an international environment. This terrible hack replaces some common Unicode modifier codepoints (e.g. for ´ and ` accents), at least ignoring them to prevent hard crashes.

Fixes exception:

```
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 209, in <module>
    main()
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 89, in main
    result = exporter.run(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 889, in run
    document.output(str(filename_pdf))
  File "C:\Develop\slackchannel2pdf\fpdf_mod\fpdf.py", line 1325, in output
    buffer = self.buffer.encode("latin1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2028' in position 398771: ordinal not in range(256)
```
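A more general variant of this workaround might look like the sketch below. It is not the code from this commit: `to_latin1_safe` and its `placeholder` parameter are made up for illustration, and replacing the U+2028/U+2029 separators with a newline is an assumption about what reads best in the PDF.

```python
import unicodedata


def to_latin1_safe(text: str, placeholder: str = "?") -> str:
    """Return a copy of text that can always be encoded as latin-1."""
    # Compose first, so "e" + U+0301 becomes the single code point "é",
    # which latin-1 can represent.
    composed = unicodedata.normalize("NFC", text)
    out = []
    for ch in composed:
        if unicodedata.combining(ch):
            continue  # combining mark without a precomposed form: drop it
        if ch in ("\u2028", "\u2029"):
            out.append("\n")  # Unicode line/paragraph separators
            continue
        try:
            ch.encode("latin-1")
        except UnicodeEncodeError:
            out.append(placeholder)  # anything else outside latin-1
        else:
            out.append(ch)
    return "".join(out)


print(to_latin1_safe("cafe\u0301 a\u2028b").encode("latin-1"))
```

Dropping or replacing characters loses information, so this only prevents the crash; it does not make the PDF output Unicode-capable.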
The PDF renderer raised an exception if an `<s>` tag was nested. While it's true that that's not supposed to occur, it does. I can't quite tell why, but it is happening a lot, and on messages that only contain one `<s>` tag, so I assume the `_last_font` must already be set outside the message for some reason. This change simply allows those nested tags, effectively ignoring the nested tag, but at least moving on with the rendering nevertheless.

Fixes exception:

```
[...]
Traceback (most recent call last):
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\fpdf_ext.py", line 90, in write_html
    self._open_tag(tag, attributes)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\fpdf_ext.py", line 112, in _open_tag
    raise HtmlConversionError("<s> tags can not be nested")
slackchannel2pdf.fpdf_ext.HtmlConversionError: <s> tags can not be nested

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\logging\__init__.py", line 1100, in emit
    msg = self.format(record)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\logging\__init__.py", line 943, in format
    return fmt.format(record)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\logging\__init__.py", line 678, in format
    record.message = record.getMessage()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\logging\__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
[...]
```

This one's nasty, because it causes another exception to occur while handling this one.
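The second traceback ends in `msg % self.args`, which is where the standard `logging` module applies %-style formatting to a log call's extra arguments. One common way to trigger that `TypeError` is to pass an argument without a matching placeholder in the message. Whether that is exactly what happens in slackchannel2pdf was not verified; the messages below are made up for illustration.

```python
import logging


def make_record(msg, *args):
    """Build a LogRecord the same way a logger.warning(msg, *args) call would."""
    return logging.LogRecord("demo", logging.WARNING, "demo.py", 0, msg, args, None)


# Extra argument but no placeholder: formatting fails inside getMessage().
bad = make_record("Failed to convert HTML", "some details")
try:
    bad.getMessage()
except TypeError as exc:
    print("bad call raises:", exc)

# One placeholder per argument: formatting succeeds.
good = make_record("Failed to convert HTML: %s", "some details")
print(good.getMessage())
```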
The PDF renderer attempts to parse HTML by means of regex, which you [cannot do](https://stackoverflow.com/a/1732454/6278). This breaks when encountering text like `A <> B`, which it thinks is an empty HTML tag, causing an IndexError when trying to identify whether it's a closing tag (`if part[0] == "/"`). This is a small workaround for this specific issue, although the bigger problem is really the approach of parsing HTML with regex.
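A minimal sketch of such a guard is shown below. `TAG_PATTERN` and `split_html` are hypothetical stand-ins and not the pattern or function used in `fpdf_ext.py`; the point is only that an empty match is passed through as literal text before `part[0]` is ever indexed.

```python
import re

# Hypothetical stand-in for the renderer's tag-splitting regex.
TAG_PATTERN = re.compile(r"<([^<>]*)>")


def split_html(text):
    """Split text into ("text" | "open" | "close", value) tokens."""
    tokens = []
    # With one capture group, re.split alternates: text, tag body, text, ...
    for i, part in enumerate(TAG_PATTERN.split(text)):
        if i % 2 == 0:
            if part:
                tokens.append(("text", part))
            continue
        name = part.strip()
        if not name:
            # "<>" or "< >" is not a tag: keep it as literal text instead of
            # indexing into an empty string.
            tokens.append(("text", "<" + part + ">"))
        elif name[0] == "/":
            tokens.append(("close", name[1:].strip()))
        else:
            tokens.append(("open", name))
    return tokens


print(split_html("A <> B"))
print(split_html("<b>hi</b>"))
```

This only covers the empty-tag case; text such as `a < b and c > d` would still be misread as a tag, which is the larger problem with the regex approach.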
When running a lot of export attempts, it's convenient to be able to see in the logs what channels were being exported.
Fetching the details of a bot commonly fails. At least log the name of the missing bot to support further troubleshooting.

Example result:

```
Exporting channel #office from Slack...
[WARNING] Bot not found, could not fetch name: B01
[WARNING] Bot not found, could not fetch name: 49927755382
[WARNING] Bot not found, could not fetch name: 34639333695
[WARNING] Bot not found, could not fetch name: 89759792692
[WARNING] Bot not found, could not fetch name: 62872444257
written: Workspace_office.pdf
```
In retrospect, the code for this might have already been there, just following the exception-raising part, instead of preceding it. I've only noticed that later and cannot comment on what the

Fixes exception:

```
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 209, in <module>
    main()
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\cli.py", line 89, in main
    result = exporter.run(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 883, in run
    self._write_messages_to_pdf(document, messages, threads)
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 538, in _write_messages_to_pdf
    last_user_id = self._parse_message_and_write_to_pdf(
  File "C:\Develop\slackchannel2pdf\slackchannel2pdf\channel_exporter.py", line 445, in _parse_message_and_write_to_pdf
    layout_block["text"]["text"],
KeyError: 'text'
```
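A defensive lookup that avoids this `KeyError` could look like the sketch below. The `block_text` helper is hypothetical, and which of the two `"text"` keys is missing in practice is not known from the traceback, so both are treated as optional.

```python
def block_text(layout_block):
    """Return layout_block["text"]["text"], or None if either level is missing."""
    text_obj = layout_block.get("text")
    if isinstance(text_obj, dict):
        return text_obj.get("text")
    return None


# A block with text, and a block type that carries no "text" entry at all.
print(block_text({"type": "section", "text": {"type": "mrkdwn", "text": "hello"}}))
print(block_text({"type": "divider"}))
```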
The script often encounters messages it cannot process. To aid in the debugging of these messages and to better document the missing message in the logs, print out a permalink to the message that can be opened in a browser for review.
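As a rough sketch, a permalink can be assembled from the workspace domain, the channel ID and the message's `ts` value, following the URL shape Slack uses in the browser. The `message_permalink` helper and the example values are made up; replies inside a thread carry additional query parameters that this sketch ignores.

```python
def message_permalink(workspace_domain, channel_id, ts):
    """Build a browser link to a Slack message.

    ts is the message timestamp as returned by the API, e.g. "1618033988.000200";
    the permalink uses the same digits with the dot removed, prefixed by "p".
    """
    return (
        f"https://{workspace_domain}.slack.com/archives/"
        f"{channel_id}/p{ts.replace('.', '')}"
    )


# Example values are placeholders, not real workspace data.
print(message_permalink("example", "C012AB3CD", "1618033988.000200"))
```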
Some messages contain no `user` or `bot_id` field, but they do contain a `username` field. Previously, these were dropped completely and logged as 'can not process message'. But they're actually pretty standard messages that can be processed like any other chat message.
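The fallback order described above can be sketched as follows. `resolve_sender` is a hypothetical helper, not the code in `channel_exporter.py`, and the sample message is invented.

```python
def resolve_sender(msg):
    """Return (kind, value) identifying who sent a message, or None.

    Checks "user" first, then "bot_id", and finally falls back to the plain
    "username" field instead of dropping the message.
    """
    for kind, key in (("user", "user"), ("bot", "bot_id"), ("username", "username")):
        value = msg.get(key)
        if value:
            return kind, value
    return None


print(resolve_sender({"username": "deploy-hook", "text": "Build finished"}))
```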
I am fully aware that this isn't a 'merge as seen' PR. It's more of an unsorted collection of hacks and workarounds that I applied to my local fork of slackchannel2pdf to quickly run the one-time archive I needed to complete.
Mostly, this was me running my archive tasks in a debugger until an exception was raised, then hacking something into those lines without any real comprehension of the wider context my changes operated in.
I'm sorry, but I don't have the time to clean these up and fully develop them right now. I still wanted to contribute them in some way, hoping it might help someone get started and really polish them into a releasable state. As such, I'm also not offended if you just close this PR.
I did carefully group my changes by commit, and spent some effort on the commit messages, so I recommend reading them by commit. They also contain example stack traces where applicable.
The main change is possibly the fix for the rate limiting exceptions, which I got pretty reliably for almost every non-trivial channel. They always occurred in the pagination function, somewhere while retrieving messages, I believe (I didn't preserve the specific stack traces). My solution isn't very elegant, but I never experienced any issues with rate limiting again.
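For readers who want to build on this, one generic way to handle rate limiting is to retry the failing call after the delay the server asks for. The sketch below is not the code from this PR: `RateLimited` and `call_with_retry` are hypothetical, and a real implementation would translate the Slack client's rate limit error (HTTP 429 with a `Retry-After` header) into the wait time.

```python
import time


class RateLimited(Exception):
    """Hypothetical error carrying the number of seconds to wait."""

    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_retry(func, *args, max_attempts=5, sleep=time.sleep, **kwargs):
    """Call func, sleeping and retrying whenever it raises RateLimited."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(exc.retry_after)


# Usage with a fake page fetcher that is rate limited twice before succeeding.
calls = {"count": 0}


def fetch_page():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimited(retry_after=0)
    return ["message 1", "message 2"]


print(call_with_retry(fetch_page))
```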
There's probably other, non-`_fetch_pages`-based calls that should receive the same treatment, but I didn't experience them, so I didn't go searching either.

I hereby put all my changes under the MIT license, and you're free to use, modify, sell or ignore them in any way you see fit. 😀