
Suggestion: Speed up decoder by queue-ing queueInputBuffer (previous idea: BUFFER_FLAG_PARTIAL_FRAME) #2436

Open
vosk opened this issue Oct 2, 2024 · 16 comments
Labels
enhancement New feature or request

Comments

@vosk

vosk commented Oct 2, 2024

Just an idea: I've been browsing through the client decoder stream setup, and it seems a whole frame needs to be queued before the decoder can start. Ignoring any details that I have no idea about, it could be possible to retrieve the buffer from dequeueInputBuffer and dump each chunk from the socket directly into the MediaCodec buffer. I haven't explored the details (Rust is new to me), but it seems it would save a ptr::copy_nonoverlapping, as well as prime the decoder with more data as soon as it arrives. Any thoughts on why this can't work?
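A minimal sketch of that idea, assuming a plain TCP socket and a hypothetical wrapper that has already dequeued a MediaCodec input buffer (the names are illustrative, not ALVR's actual API):

```rust
use std::io::Read;
use std::net::TcpStream;

// Read the encoded frame straight from the socket into the already-dequeued
// MediaCodec input buffer, skipping the intermediate staging buffer and the
// ptr::copy_nonoverlapping into the codec.
fn fill_input_buffer(
    socket: &mut TcpStream,
    input_buffer: &mut [u8], // slice backed by the dequeued codec buffer
    frame_size: usize,
) -> std::io::Result<()> {
    socket.read_exact(&mut input_buffer[..frame_size])?;
    Ok(())
}
```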

@vosk
Author

vosk commented Oct 3, 2024

It looks like the safest way to do this is to have VideoDecoderSink produce something like a SocketBuffer, which can be fed into SocketReader, which would eventually get a
https://doc.rust-lang.org/std/io/struct.BorrowedCursor.html
through a
https://doc.rust-lang.org/std/io/struct.BorrowedBuf.html
to do a read(socketbuffer).
Using the media_codec queue_input_buffer offset and size arguments, we can send the "video data" part of the buffer for processing.
However, both of those types are still experimental.
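For illustration, a nightly-only sketch of how BorrowedBuf/BorrowedCursor could be used to read from the socket into uninitialized memory; both APIs are unstable, and the storage slice here is just a stand-in for whatever a SocketBuffer would wrap:

```rust
#![feature(read_buf, core_io_borrowed_buf)] // nightly-only features

use std::io::{BorrowedBuf, Read, Result};
use std::mem::MaybeUninit;
use std::net::TcpStream;

// Read from the socket into uninitialized storage (e.g. memory backed by a
// MediaCodec input buffer) without zero-filling it first.
fn read_into(stream: &mut TcpStream, storage: &mut [MaybeUninit<u8>]) -> Result<usize> {
    let mut buf: BorrowedBuf<'_> = storage.into();
    stream.read_buf(buf.unfilled())?; // fills through a BorrowedCursor
    Ok(buf.len()) // number of bytes actually read
}
```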

@vosk
Author

vosk commented Oct 3, 2024

Abandoning the idea since StreamSocket.recv is used for multiple streams :)

@vosk vosk closed this as completed Oct 3, 2024
@zmerp
Member

zmerp commented Oct 3, 2024

Sorry I haven't replied sooner. The idea actually makes sense, but not in the way you think. While we cannot do transmission and decoding in parallel for parts of the same frame, we can receive and copy the frame in parallel. This could maybe save 1–2 ms. I know because we previously managed to shave some latency just by being careful not to make unnecessary copies of the data.

@zmerp zmerp reopened this Oct 3, 2024
@zmerp zmerp added the enhancement New feature or request label Oct 3, 2024
@zmerp
Member

zmerp commented Oct 3, 2024

This would actually be the first instance of intra-frame pipelining in ALVR. Currently we show statistics as a simple bar graph where it's impossible to show overlaps. I plan to rework this soon by switching to a hybrid scatter/timeline plot. Let's work on this after I've finished the statistics rewrite.

@vosk
Author

vosk commented Oct 4, 2024

Thank you for taking notice :) You are talking about StreamSocket.recv around line 563, I guess. I haven't wrapped my head around the shard logic, but I believe it would require "partial packet" logic around the whole thing.

@zmerp
Member

zmerp commented Oct 4, 2024

Actually no, I didn't mean touching the StreamSocket logic. I meant sending smaller packets and receiving smaller packets, so that StreamSocket will not split the packets at all. Doing this will not take advantage of the packet reordering logic, but it doesn't matter: we can afford to wait, because the copy logic is much faster than the network transmission.

In any case, we would still need to test this; it's not guaranteed that it will reduce latency. It will increase CPU usage, because the send, recv, and copy logic need to be repeated for each smaller packet instead of once per frame. We might need to find a compromise regarding the packet size. I'd say this improvement is a micro-optimization, since the most we can save is one video frame copy.
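Roughly, the send side of that compromise could look like the sketch below; CHUNK_SIZE is a made-up value, not something ALVR uses, and the real packet size would have to be tuned against the extra per-packet CPU cost:

```rust
// Split one encoded frame into smaller packets so the client can copy each
// piece into the decoder buffer while the rest of the frame is still in flight.
const CHUNK_SIZE: usize = 16 * 1024;

fn send_frame_in_chunks(frame: &[u8], mut send_packet: impl FnMut(&[u8])) {
    for chunk in frame.chunks(CHUNK_SIZE) {
        send_packet(chunk); // each call pays its own send/recv/copy overhead
    }
}
```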

@zmerp
Member

zmerp commented Oct 4, 2024

As suggested by @xytovl, we don't need to use BUFFER_FLAG_PARTIAL_FRAME; we can just acquire an input buffer and queue it once it has been filled. Let's keep this issue open for the copy pipelining feature.
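A sketch of that approach, with `Decoder` standing in for a hypothetical wrapper around the MediaCodec calls (not ALVR's real decoder interface): dequeue one input buffer, copy chunks into it as they arrive, and only queue it once the whole frame is in place.

```rust
trait Decoder {
    /// Borrow a dequeued MediaCodec input buffer as a writable slice.
    fn input_buffer(&mut self) -> &mut [u8];
    /// Conceptually queueInputBuffer(index, 0, filled, timestamp, 0),
    /// i.e. without BUFFER_FLAG_PARTIAL_FRAME.
    fn queue_input_buffer(&mut self, filled: usize, timestamp_ns: u64);
}

fn receive_frame(
    decoder: &mut impl Decoder,
    mut recv_chunk: impl FnMut(&mut [u8]) -> Option<usize>, // None = frame complete
    timestamp_ns: u64,
) {
    let buffer = decoder.input_buffer();
    let mut filled = 0;
    // Copying each chunk overlaps with the network transfer of the rest.
    while let Some(n) = recv_chunk(&mut buffer[filled..]) {
        filled += n;
    }
    decoder.queue_input_buffer(filled, timestamp_ns);
}
```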

@zmerp zmerp changed the title Suggestion: Speed up decoder by queue-ing queueInputBuffer using BUFFER_FLAG_PARTIAL_FRAME Suggestion: Speed up decoder by queue-ing queueInputBuffer (previous idea: BUFFER_FLAG_PARTIAL_FRAME) Oct 4, 2024
@vosk
Author

vosk commented Oct 5, 2024

I see multiple ideas overlapping, so I'll try to summarize what I understand:

  1. Smaller packets, to avoid splitting them up.
  2. Decode partial frames.
  3. Inject decoder buffers directly into the socket to avoid copies.

The obvious caveat, as I see it, is that right now it is guaranteed that we have a whole frame before push_frame_nal.


My thoughts
There are two "weak" points in the code as I understand it:
a) there is a ptr copy in push_frame_nal that copies from the socket buffer into the decoder buffer;
b) there is a resize operation inside StreamSocket.

I believe (b) is not really a problem, because shard_length is probably constant, so the resize is not that common. Not sure.

My suggested approach is two-fold:
i) Make a ReceiverData.get_shard() that can deliver partial data. This means the frame is whole, but not necessarily in a contiguous buffer, and that ReceiverData must be mutable so it can keep internal state.
Then push_frame_nal(some_object) can do

while some_object.have_more() {
    decoder.push_partial(some_object.next());
}

ii) Make subscribe_to_stream() require a supplier object that can deliver buffers to the socket logic.
Each buffer can be a shard (as it is now).

i+ii) means the decoder provides the buffers, they are filled with shards, and each can be allocated once and delivered in order as data comes through, so lost packets and skipped frames can be handled as they are now.
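A hypothetical shape for idea (i), just to make the interface concrete; none of these types exist in ALVR:

```rust
// The frame is complete but stored as one allocation per shard, and
// push_frame_nal walks the shards instead of a single contiguous buffer.
struct ShardedFrame {
    shards: Vec<Vec<u8>>, // each shard kept in the buffer it was received into
}

fn push_frame_nal(frame: &ShardedFrame, mut push_partial: impl FnMut(&[u8])) {
    for shard in &frame.shards {
        push_partial(shard.as_slice()); // decoder consumes the frame piecewise, in order
    }
}
```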

Noob question: how do I make rust-analyzer build dependencies for Android? Where do I put --platform in VS Code?

@zmerp
Member

zmerp commented Oct 5, 2024

I don't think we can do any better. We have one copy from the socket buffer to the decoder buffer. On the server/send side, on the other hand, we need to prefix the data with the packet header (at least the packet index), which means a copy. So there is nothing we can save.
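For illustration, the unavoidable send-side copy looks roughly like this (the header layout here is hypothetical, just a packet index):

```rust
// The packet header has to sit in front of the payload in the same contiguous
// send buffer, so the payload is copied at least once.
fn build_packet(packet_index: u32, payload: &[u8]) -> Vec<u8> {
    let mut packet = Vec::with_capacity(4 + payload.len());
    packet.extend_from_slice(&packet_index.to_le_bytes());
    packet.extend_from_slice(payload); // the unavoidable payload copy
    packet
}
```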

@zmerp
Member

zmerp commented Oct 5, 2024

> i) Make a ReceiverData.get_shard() that can deliver partial data. This means the frame is whole, but not necessarily in a contiguous buffer, and that ReceiverData must be mutable so it can keep internal state.
> Then push_frame_nal(some_object) can do
> while some_object.have_more() {
>     decoder.push_partial(some_object.next());
> }

This is an interesting idea, but of limited use. We can just make it its own separate thing for video.

> Noob question: how do I make rust-analyzer build dependencies for Android? Where do I put --platform in VS Code?

You should be able to change the "check" command for rust-analyzer. I did it a few times, but changing it back and forth got old. If I need to edit Android cfg-gated files, I comment out the cfg gate temporarily.
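For reference, rust-analyzer can also be pointed at an Android target through its settings rather than a --platform flag; one possible VS Code setup (the target triple here is an assumption, adjust to the device):

```jsonc
// .vscode/settings.json
{
  "rust-analyzer.cargo.target": "aarch64-linux-android"
}
```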

@vosk
Author

vosk commented Oct 6, 2024

> This is an interesting idea, but of limited use. We can just make it its own separate thing for video.

I agree. Maybe also for audio?

@zmerp
Member

zmerp commented Oct 6, 2024

I would say audio is not even a hot code path. The data size is much lower, plus we are not as sensitive to audio latency.

@vosk
Author

vosk commented Oct 8, 2024

For posterity:
I tried keeping each shard in its own buffer (to avoid resizing) and queueing the dequeued input buffers with the PARTIAL_FRAME flag, but the problem is that MediaCodec (I guess depending on the codec implementation) may run out of input buffers, which again forces merging the shards into a contiguous buffer at the decoder. Somewhere along the way I made StreamSocket have less jitter, but time-to-photon actually measures worse.

@zmerp
Member

zmerp commented Oct 8, 2024

Ah ok. I think the best thing is to compromise, by splitting the frame into a set number of packets so as not to exceed the maximum number of MediaCodec input buffers.
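Sketching that compromise, with the per-frame chunk count capped by a made-up limit (the real number would come from the codec's available input buffers):

```rust
const MAX_CHUNKS_PER_FRAME: usize = 4; // hypothetical cap, below the codec's buffer count

// Derive the chunk size from the cap instead of using a fixed chunk size, so a
// frame never needs more MediaCodec input buffers than are available.
fn chunk_size_for(frame_len: usize) -> usize {
    frame_len.div_ceil(MAX_CHUNKS_PER_FRAME)
}
```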

@vosk
Author

vosk commented Oct 10, 2024

I am having trouble with the decoder flags. If anyone is interested in testing on another device (just for the heck of it at this point), I can provide a branch to play around with. I am testing on a Xiaomi Note 7, and the decoder doesn't really know what to do with the partial frames. It almost looks like it does not support the flag and simply tries to render the partial frame. It's weird: the top of the image is perfect, and the rest looks extremely low bitrate. It's as if the first partial frame is all that gets rendered.

@zmerp
Member

zmerp commented Oct 10, 2024

That's not the first time we've seen vendors not supporting MediaCodec flags. It seems the best option is not to mess with MediaCodec at all.
