
Suggestion: Speed up decoder by queue-ing queueInputBuffer (previous idea: BUFFER_FLAG_PARTIAL_FRAME) #2436

Open
vosk opened this issue Oct 2, 2024 · 16 comments
Labels
enhancement New feature or request

Comments

@vosk

vosk commented Oct 2, 2024

Just an idea: I've been browsing through the client decoder stream setup, and it seems a whole frame needs to be queued before the decoder can start. Ignoring any details that I have no idea about, it could be possible to retrieve the buffer from dequeueInputBuffer and dump each chunk from the socket directly into the MediaCodec buffer. I haven't explored the details (Rust is new to me), but it seems it would save a ptr::copy_nonoverlapping, as well as prime the decoder with more data as soon as it arrives. Any thoughts on why this can't work?
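A minimal sketch of that idea, assuming a plain TCP socket and a hypothetical wrapper that has already dequeued a MediaCodec input buffer (the names are illustrative, not ALVR's actual API):

```rust
use std::io::Read;
use std::net::TcpStream;

// Read the encoded frame straight from the socket into the already-dequeued
// MediaCodec input buffer, skipping the intermediate staging buffer and the
// ptr::copy_nonoverlapping into the codec.
fn fill_input_buffer(
    socket: &mut TcpStream,
    input_buffer: &mut [u8], // slice backed by the dequeued codec buffer
    frame_size: usize,
) -> std::io::Result<()> {
    socket.read_exact(&mut input_buffer[..frame_size])?;
    Ok(())
}
```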

@vosk
Author

vosk commented Oct 3, 2024

It looks like the safest way to do this is to have VideoDecoderSink produce something like a SocketBuffer, which can be fed into SocketReader, which would eventually get a
https://doc.rust-lang.org/std/io/struct.BorrowedCursor.html
through a
https://doc.rust-lang.org/std/io/struct.BorrowedBuf.html
to do a read(socketbuffer).
Using the media_codec queue_input_buffer offset and size arguments, we can send the "video data" part of the buffer for processing.
However, both of those types are still experimental.
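For illustration, a nightly-only sketch of how BorrowedBuf/BorrowedCursor could be used to read from the socket into uninitialized memory; both APIs are unstable, and the storage slice here is just a stand-in for whatever a SocketBuffer would wrap:

```rust
#![feature(read_buf, core_io_borrowed_buf)] // nightly-only features

use std::io::{BorrowedBuf, Read, Result};
use std::mem::MaybeUninit;
use std::net::TcpStream;

// Read from the socket into uninitialized storage (e.g. memory backed by a
// MediaCodec input buffer) without zero-filling it first.
fn read_into(stream: &mut TcpStream, storage: &mut [MaybeUninit<u8>]) -> Result<usize> {
    let mut buf: BorrowedBuf<'_> = storage.into();
    stream.read_buf(buf.unfilled())?; // fills through a BorrowedCursor
    Ok(buf.len()) // number of bytes actually read
}
```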

@vosk
Author

vosk commented Oct 3, 2024

Abandoning the idea since StreamSocket.recv is used for multiple streams :)

@vosk vosk closed this as completed Oct 3, 2024
@zmerp
Member

zmerp commented Oct 3, 2024

Sorry I haven't replied sooner. The idea actually makes sense, but not in the way you think. While we cannot do transmission and decoding in parallel for parts of the same frame, we can receive and copy the frame in parallel. This could maybe save 1–2 ms. I know because we previously managed to shave some latency just by being careful not to make unnecessary copies of the data.

@zmerp zmerp reopened this Oct 3, 2024
@zmerp zmerp added the enhancement New feature or request label Oct 3, 2024
@zmerp
Member

zmerp commented Oct 3, 2024

This would actually be the first instance of intra-frame pipelining in ALVR. Currently we show statistics as a simple bar graph where it's impossible to show overlaps. I plan to rework this soon by switching to a hybrid scatter/timeline plot. Let's work on this after I've finished the statistics rewrite.

@vosk
Author

vosk commented Oct 4, 2024

Thank you for taking notice :) You are talking about StreamSocket.recv around line 563, I guess. I haven't wrapped my head around the shard logic, but I believe it would require "partial packet" logic around the whole thing.

@zmerp
Member

zmerp commented Oct 4, 2024

Actually no, I didn't mean touching the StreamSocket logic. I meant sending smaller packets and receiving smaller packets, so that StreamSocket will not split the packets at all. Doing this will not take advantage of the packet reordering logic, but it doesn't matter: we can afford to wait, because the copy logic is much faster than the network transmission.

In any case, we would still need to test this; it's not guaranteed that it will reduce latency. It will increase CPU usage, because the send, recv, and copy logic need to be repeated for each smaller packet instead of once per frame. We might need to find a compromise regarding the packet size. I'd say this improvement is a micro-optimization, since the most we can save is one video frame copy.
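Roughly, the send side of that compromise could look like the sketch below; CHUNK_SIZE is a made-up value, not something ALVR uses, and the real packet size would have to be tuned against the extra per-packet CPU cost:

```rust
// Split one encoded frame into smaller packets so the client can copy each
// piece into the decoder buffer while the rest of the frame is still in flight.
const CHUNK_SIZE: usize = 16 * 1024;

fn send_frame_in_chunks(frame: &[u8], mut send_packet: impl FnMut(&[u8])) {
    for chunk in frame.chunks(CHUNK_SIZE) {
        send_packet(chunk); // each call pays its own send/recv/copy overhead
    }
}
```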

@zmerp
Member

zmerp commented Oct 4, 2024

As suggested by @xytovl, we don't need to use BUFFER_FLAG_PARTIAL_FRAME; we can just acquire an input buffer and queue it once it has been filled. Let's keep this issue open for the copy pipelining feature.
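A sketch of that approach, with `Decoder` standing in for a hypothetical wrapper around the MediaCodec calls (not ALVR's real decoder interface): dequeue one input buffer, copy chunks into it as they arrive, and only queue it once the whole frame is in place.

```rust
trait Decoder {
    /// Borrow a dequeued MediaCodec input buffer as a writable slice.
    fn input_buffer(&mut self) -> &mut [u8];
    /// Conceptually queueInputBuffer(index, 0, filled, timestamp, 0),
    /// i.e. without BUFFER_FLAG_PARTIAL_FRAME.
    fn queue_input_buffer(&mut self, filled: usize, timestamp_ns: u64);
}

fn receive_frame(
    decoder: &mut impl Decoder,
    mut recv_chunk: impl FnMut(&mut [u8]) -> Option<usize>, // None = frame complete
    timestamp_ns: u64,
) {
    let buffer = decoder.input_buffer();
    let mut filled = 0;
    // Copying each chunk overlaps with the network transfer of the rest.
    while let Some(n) = recv_chunk(&mut buffer[filled..]) {
        filled += n;
    }
    decoder.queue_input_buffer(filled, timestamp_ns);
}
```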

@zmerp zmerp changed the title Suggestion: Speed up decoder by queue-ing queueInputBuffer using BUFFER_FLAG_PARTIAL_FRAME Suggestion: Speed up decoder by queue-ing queueInputBuffer (previous idea: BUFFER_FLAG_PARTIAL_FRAME) Oct 4, 2024
@vosk
Author

vosk commented Oct 5, 2024

I see multiple ideas overlapping, so I'll try to summarize what I understand:

  1. Smaller packets, to avoid splitting them up.
  2. Decode partial frames.
  3. Inject decoder buffers directly into the socket to avoid copies.

The obvious caveat, as I see it, is that right now it is guaranteed that we have a whole frame before push_frame_nal.


My thoughts
There are two "weak" points in the code as I understand it:
a) there is a ptr copy in push_frame_nal that copies from the socket buffer into the decoder buffer;
b) there is a resize operation inside StreamSocket.

I believe (b) is not really a problem, because shard_length is probably constant, so the resize is not that common. Not sure.

My suggested approach is two-fold:
i) Make a ReceiverData.get_shard() that can deliver partial data. This means the frame is whole, but not necessarily in a contiguous buffer, and that ReceiverData must be mutable so it can keep internal state.
Then push_frame_nal(some_object) can do

while some_object.have_more() {
    decoder.push_partial(some_object.next());
}

ii) Make subscribe_to_stream() require a supplier object that can deliver buffers to the socket logic.
Each buffer can be a shard (as it is now).

i+ii) means the decoder provides the buffers, they are filled with shards, and each can be allocated once and delivered in order as data comes through, so lost packets and skipped frames can be handled as they are now.
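A hypothetical shape for idea (i), just to make the interface concrete; none of these types exist in ALVR:

```rust
// The frame is complete but stored as one allocation per shard, and
// push_frame_nal walks the shards instead of a single contiguous buffer.
struct ShardedFrame {
    shards: Vec<Vec<u8>>, // each shard kept in the buffer it was received into
}

fn push_frame_nal(frame: &ShardedFrame, mut push_partial: impl FnMut(&[u8])) {
    for shard in &frame.shards {
        push_partial(shard.as_slice()); // decoder consumes the frame piecewise, in order
    }
}
```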

Noob question: how do I make rust-analyzer build dependencies for Android? Where do I put --platform in VS Code?

@zmerp
Member

zmerp commented Oct 5, 2024

I don't think we can do any better. We have one copy from the socket buffer to the decoder buffer. On the server/send side, on the other hand, we need to prefix the data with the packet header (at least the packet index), which means a copy. So there is nothing we can save.
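For illustration, the unavoidable send-side copy looks roughly like this (the header layout here is hypothetical, just a packet index):

```rust
// The packet header has to sit in front of the payload in the same contiguous
// send buffer, so the payload is copied at least once.
fn build_packet(packet_index: u32, payload: &[u8]) -> Vec<u8> {
    let mut packet = Vec::with_capacity(4 + payload.len());
    packet.extend_from_slice(&packet_index.to_le_bytes());
    packet.extend_from_slice(payload); // the unavoidable payload copy
    packet
}
```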

@zmerp
Member

zmerp commented Oct 5, 2024

> i) Make a ReceiverData.get_shard() that can deliver partial data. This means the frame is whole, but not necessarily in a contiguous buffer, and that ReceiverData must be mutable so it can keep internal state.
> Then push_frame_nal(some_object) can do
> while some_object.have_more() {
>     decoder.push_partial(some_object.next());
> }

This is an interesting idea, but of limited use. We can just make it its own separate thing for video.

> Noob question: how do I make rust-analyzer build dependencies for Android? Where do I put --platform in VS Code?

You should be able to change the "check" command for rust-analyzer. I did it a few times, but changing it back and forth got old. If I need to edit Android cfg-gated files, I comment out the cfg gate temporarily.
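For reference, rust-analyzer can also be pointed at an Android target through its settings rather than a --platform flag; one possible VS Code setup (the target triple here is an assumption, adjust to the device):

```jsonc
// .vscode/settings.json
{
  "rust-analyzer.cargo.target": "aarch64-linux-android"
}
```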

@vosk
Author

vosk commented Oct 6, 2024

> This is an interesting idea, but of limited use. We can just make it its own separate thing for video.

I agree. Maybe also for audio?

@zmerp
Member

zmerp commented Oct 6, 2024

I would say audio is not even a hot code path. The data size is much lower, plus we are not as sensitive to audio latency.

@vosk
Author

vosk commented Oct 8, 2024

For posterity:
I tried keeping each shard in its own buffer (to avoid resizing) and queueing the dequeued input buffers with the PARTIAL_FRAME flag, but the problem is that MediaCodec (I guess depending on the codec implementation) may run out of input buffers, which again forces merging the shards into a contiguous buffer at the decoder. Somewhere along the way I made StreamSocket have less jitter, but time-to-photon actually measures worse.

@zmerp
Member

zmerp commented Oct 8, 2024

Ah ok. I think the best thing is to compromise, by splitting the frame into a set number of packets so as not to exceed the maximum number of MediaCodec input buffers.
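Sketching that compromise, with the per-frame chunk count capped by a made-up limit (the real number would come from the codec's available input buffers):

```rust
const MAX_CHUNKS_PER_FRAME: usize = 4; // hypothetical cap, below the codec's buffer count

// Derive the chunk size from the cap instead of using a fixed chunk size, so a
// frame never needs more MediaCodec input buffers than are available.
fn chunk_size_for(frame_len: usize) -> usize {
    frame_len.div_ceil(MAX_CHUNKS_PER_FRAME)
}
```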

@vosk
Author

vosk commented Oct 10, 2024

I am having trouble with the decoder flags. If anyone is interested in testing on another device (just for the heck of it at this point), I can provide a branch to play around with. I am testing on a Xiaomi Note 7, and the decoder doesn't really know what to do with the partial frames. It almost looks like it does not support the flag and simply tries to render the partial frame. It's weird: the top of the image is perfect, and the rest looks extremely low bitrate. It's as if the first partial frame is all that gets rendered.

@zmerp
Member

zmerp commented Oct 10, 2024

That's not the first time we've seen vendors not supporting MediaCodec flags. It seems the best option is not to mess with MediaCodec at all.
