Performance optimization ideas #302
christianparpart started this conversation in Development
I think about it regularly, but have never written down anything.
Maybe this thread can serve as a brainstorming of everything that could possibly be improved. If we come up with something I didn't mention in this top post yet, I think it's best to add it here (with ticket numbers where they exist), so the conclusions stay easy to grasp.
Areas of user-perceived performance improvements
Maybe as a result we can derive some smaller tickets to tackle each of these separately. So let's start:
✅ [VT parser] optimize for plain-text throughput
Some people like to do cat-style throughput performance tests.
The content to optimize for usually consists of lots of printable characters with newline characters at regular intervals.
This case can be optimized by scanning the input for escape sequences and C0 characters using SIMD to greatly speed up processing. What would be the performance gain? (Answer: plain ASCII text throughput on my Ryzen 9 with 3200MHz RAM: 16 GB/s, plain ASCII with LF (linefeed) chars: 1.9 GB/s - both up from ~250 MB/s)
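A minimal sketch of what such a scan could look like, using SSE2 intrinsics (illustrative only, not the actual implementation; it only classifies bytes below 0x20 and ignores DEL and any UTF-8 handling):

```cpp
#include <emmintrin.h> // SSE2
#include <cstddef>
#include <cstdint>

// Returns the index of the first C0 control byte (< 0x20, e.g. ESC or LF),
// or `size` if the chunk is pure printable text.
inline size_t findNextControlByte(uint8_t const* data, size_t size)
{
    __m128i const limit = _mm_set1_epi8(0x1F);
    size_t i = 0;
    for (; i + 16 <= size; i += 16)
    {
        __m128i const chunk = _mm_loadu_si128(reinterpret_cast<__m128i const*>(data + i));
        // Unsigned "byte <= 0x1F" test: min(byte, 0x1F) == byte.
        __m128i const isControl = _mm_cmpeq_epi8(_mm_min_epu8(chunk, limit), chunk);
        if (int const mask = _mm_movemask_epi8(isControl); mask != 0)
            return i + static_cast<size_t>(__builtin_ctz(static_cast<unsigned>(mask))); // GCC/Clang builtin
    }
    for (; i < size; ++i) // scalar tail
        if (data[i] < 0x20)
            return i;
    return size;
}
```

Everything before the returned index can then go through a bulk plain-text path; the control byte itself falls back to the regular state machine.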
[VT parser] UTF-8 input & ranged output
Don't first convert UTF-8 to UTF-32 and then feed the parser, as only text (and C1 codes) can occupy more than one byte.
The parser should then decode UTF-8 only when it actually detects it, keeping the decoding state internal.
Input chunk lengths should therefore ideally be a multiple of 16 bytes (128 bits).
Mind that not just ground-state text is UTF-8 decoded but also the textual payload of sequences like DCS and OSC. But is that good? Sixel payloads are huge and should not be decoded for nothing.
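For illustration, a minimal sketch of an incremental UTF-8 decoder that keeps its state between calls, so the parser can consume raw bytes directly (names are made up, not Contour's API; validation of overlong forms and surrogates is omitted for brevity):

```cpp
#include <cstdint>
#include <optional>

class Utf8Decoder
{
  public:
    // Feeds one byte; returns a code point once a sequence is complete.
    std::optional<char32_t> feed(uint8_t byte)
    {
        if (pending_ == 0)
        {
            if (byte < 0x80)                              // ASCII fast path
                return char32_t(byte);
            if ((byte & 0xE0) == 0xC0) { value_ = byte & 0x1F; pending_ = 1; }
            else if ((byte & 0xF0) == 0xE0) { value_ = byte & 0x0F; pending_ = 2; }
            else if ((byte & 0xF8) == 0xF0) { value_ = byte & 0x07; pending_ = 3; }
            else                                          // invalid lead byte
                return char32_t(0xFFFD);
            return std::nullopt;
        }
        if ((byte & 0xC0) != 0x80)                        // broken continuation (byte is dropped here)
        {
            pending_ = 0;
            return char32_t(0xFFFD);
        }
        value_ = (value_ << 6) | (byte & 0x3F);
        if (--pending_ == 0)
            return char32_t(value_);
        return std::nullopt;
    }

  private:
    uint32_t value_ = 0;
    int pending_ = 0; // number of continuation bytes still expected
};
```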
[VT parser] L1/L2 cache level optimization
The VT parser currently is a table-driven state machine. It would be worth investigating how well a switch/case-based VT parser performs compared to the table/FSM-based one. A dedicated test executable would be needed to produce reliable numbers to reason about which is better.
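For illustration only, a switch/case-based dispatch for the ground state might look roughly like this (the handler names are made up; the point is that the compiler can turn the switch into a jump table or branch tree that lives in the instruction cache rather than in data tables):

```cpp
#include <cstdint>

enum class State { Ground, Escape, CsiEntry /* ... */ };

// Hypothetical per-byte dispatch for the ground state.
inline State handleGroundByte(uint8_t byte)
{
    switch (byte)
    {
        case 0x1B: return State::Escape;                  // ESC starts a sequence
        case 0x0A: /* screen.linefeed(); */ return State::Ground;
        case 0x0D: /* screen.carriageReturn(); */ return State::Ground;
        default:
            if (byte >= 0x20)
                /* screen.writeText(byte) */;
            return State::Ground;
    }
}
```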
Screen Grid's Cell
In order to maximize throughput, try to make Cell a trivial data type (see the sketch below):
- std::u32string can become a custom std::array<char32_t, N>-based string together with a size count.
- Make Cell trivially copyable.
- Optimize scrollUp(n) for the most common (full-margin) case.
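A minimal sketch of what such a trivially copyable Cell could look like (field names, the capacity N = 4, and the attribute encoding are made up for illustration):

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

struct Cell
{
    std::array<char32_t, 4> codepoints {}; // fixed-capacity grapheme cluster storage
    uint8_t codepointCount = 0;            // how many entries are in use
    uint32_t foregroundColor = 0;
    uint32_t backgroundColor = 0;
    uint16_t flags = 0;                    // bold, underline, ... as bit flags
};

static_assert(std::is_trivially_copyable_v<Cell>,
              "Cell must stay trivially copyable so rows can be copied/scrolled cheaply");
```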
Screen Grid
Also, the parser should be able to detect bigger chunks of pure LF-delimited ASCII (not containing any other control codes). With such a sequence of N "pure" lines, they could be copied directly into the grid buffer, as each line's starting offset can be pre-calculated, making it a (potentially threaded) parallelized copy.
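As a rough illustration of the idea (not tied to the actual grid API), a detected pure chunk could be split into line views whose target rows are known up front:

```cpp
#include <string_view>
#include <vector>

// Splits a chunk known to contain only printable ASCII and LF into line views.
std::vector<std::string_view> splitPureLines(std::string_view chunk)
{
    std::vector<std::string_view> lines;
    while (!chunk.empty())
    {
        auto const pos = chunk.find('\n');
        lines.push_back(chunk.substr(0, pos));
        if (pos == std::string_view::npos)
            break;
        chunk.remove_prefix(pos + 1);
    }
    return lines;
}
// Each element of the result could then be copied into its target row in parallel,
// since the destination offsets are known up front.
```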
Embrace std::vector over map/unordered_map
With the one talk @whisperity linked me once, I realized that it's not really worth using map/unordered_map in most cases. Using a vector, and maybe keeping it sorted for O(log n) lookups, is usually the better choice.
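A minimal sketch of the sorted-vector idea as a hand-rolled flat map (names are illustrative; in practice an existing flat_map implementation could be used instead):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

template <typename Key, typename Value>
class FlatMap
{
  public:
    void insert(Key key, Value value)
    {
        auto it = std::lower_bound(entries_.begin(), entries_.end(), key,
                                   [](auto const& e, Key const& k) { return e.first < k; });
        entries_.insert(it, { std::move(key), std::move(value) });
    }

    // O(log n) lookup via binary search over the contiguous, sorted storage.
    Value* find(Key const& key)
    {
        auto it = std::lower_bound(entries_.begin(), entries_.end(), key,
                                   [](auto const& e, Key const& k) { return e.first < k; });
        return (it != entries_.end() && it->first == key) ? &it->second : nullptr;
    }

  private:
    std::vector<std::pair<Key, Value>> entries_; // kept sorted by key
};
```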
[Renderer] OpenGL rendering
While this does not have a direct impact on throughput, it may affect input latency, as the speed of rendering determines how quickly input can be reflected on screen (also: input and rendering share the same (main) thread).
Passive render buffer updates
The code paths are all already there, but disabled, because for some reason the performance wasn't as good as expected. Currently the render buffers are updated on request in the render thread, which implies that the terminal thread needs to be locked while a fresh render-buffer state is fetched. This locking could be avoided by re-enabling passive render-buffer updates: they happen on the terminal-thread side, so accessing the render buffer on the renderer thread no longer blocks the terminal thread. Why this path (when enabled) is currently not more performant should be investigated; visually it manifests as a render lag, which must be avoided.
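A minimal sketch of the underlying idea, assuming a double-buffer-style exchange object (the names and the RenderBuffer contents are made up; the actual render buffer is richer): the terminal thread publishes a freshly built buffer under a very short lock, and the render thread swaps it out without ever holding the terminal lock.

```cpp
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

struct RenderBuffer
{
    std::vector<uint32_t> cells; // flattened, render-ready cell data (placeholder)
};

class RenderBufferExchange
{
  public:
    // Called on the terminal thread after it finished building a fresh buffer.
    void publish(RenderBuffer&& fresh)
    {
        std::lock_guard lock(mutex_);
        front_ = std::move(fresh);
        dirty_ = true;
    }

    // Called on the render thread; returns true if a new frame is available.
    bool fetch(RenderBuffer& out)
    {
        std::lock_guard lock(mutex_);
        if (!dirty_)
            return false;
        std::swap(out, front_);
        dirty_ = false;
        return true;
    }

  private:
    std::mutex mutex_;
    RenderBuffer front_;
    bool dirty_ = false;
};
```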
Input latency and key presses
When a key is pressed, the next rendered frame should already reflect that key press (typically by displaying the pressed character). Whether that is already the case, and if not, how to achieve it, I don't know yet. (TODO) :)
Balance
While all these are nice ideas, features that make sense to the broader user base must not be neglected just to maintain a certain level of performance, nor should code readability suffer.
Software usability and maintainability are much more of a concern than climbing up the who's-the-fastest-terminal ladder.