Allow LiveViews to be adopted #3551

Open

josevalim opened this issue Dec 4, 2024 · 8 comments

Comments

@josevalim
Member

One of the issues with LiveView is the double render when going from dead render to live render, and the fact that we lose all state on disconnection.

This issue proposes that we render a LiveView (a Phoenix.Channel, really) upfront and have it get "adopted" when necessary. In a nutshell (a rough sketch follows the list below):

  • On disconnect, we keep the LiveView alive for X seconds. Then on reconnect, we reestablish the connection to the same LiveView and just send the latest diff.

  • On dead render, we already spawn the LiveView and keep it alive until the WebSocket connection arrives and "adopts" it, so we only need to send the latest diff.
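To make the adoption flow concrete, here is a rough sketch of what a grace-period registry could look like. None of these modules or functions exist in LiveView today; AdoptionRegistry, park/3, and adopt/2 are made up for illustration.

# Rough sketch only: a registry that keeps a disconnected (or pre-rendered)
# LiveView around for a grace period and hands it back to the next transport
# that asks for it by token.
defmodule AdoptionRegistry do
  use GenServer

  @grace_period :timer.seconds(30)

  def start_link(opts), do: GenServer.start_link(__MODULE__, %{}, opts)

  # Called after the dead render or when a LiveView loses its transport.
  def park(registry, token, live_view_pid) do
    GenServer.call(registry, {:park, token, live_view_pid})
  end

  # Called by the new WebSocket connection; returns the parked pid if still alive.
  def adopt(registry, token) do
    GenServer.call(registry, {:adopt, token})
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:park, token, pid}, _from, state) do
    timer = Process.send_after(self(), {:expire, token}, @grace_period)
    {:reply, :ok, Map.put(state, token, {pid, timer})}
  end

  def handle_call({:adopt, token}, _from, state) do
    case Map.pop(state, token) do
      {{pid, timer}, rest} ->
        Process.cancel_timer(timer)
        {:reply, {:ok, pid}, rest}

      {nil, _} ->
        {:reply, :error, state}
    end
  end

  @impl true
  def handle_info({:expire, token}, state) do
    case Map.pop(state, token) do
      {{pid, _timer}, rest} ->
        # Nobody adopted this LiveView in time: shut it down.
        Process.exit(pid, :shutdown)
        {:noreply, rest}

      {nil, _} ->
        {:noreply, state}
    end
  end
end

The dead render (or a disconnect) would call park/3, and the next connection presenting the same token would call adopt/2 and reuse the process if it is still around.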

However, this has some issues:

  • If the new connection happens on the same node, it is perfect. However, if it happens on a separate node, then we can either do cluster round-trips on every payload, copy only some assigns (from assigns_new) or build a new LiveView altogether (and discard the old one).

  • This solution means we will keep state around on the server for X seconds. This could perhaps be abused for DDoS attacks or similar. It may be safer to enable this only on certain pages (for example, where authentication is required) or keep the timeout short on public pages (e.g. 5 seconds instead of 30); a hypothetical opt-in sketch follows this list.
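As a purely hypothetical illustration of that opt-in (the :adoption option does not exist), adoption could be enabled per live_session, with it disabled or shortened on public routes:

# Hypothetical router configuration; only live_session/3 itself is real.
live_session :authenticated, adoption: [timeout: :timer.seconds(30)] do
  live "/dashboard", DashboardLive
end

live_session :public, adoption: false do
  live "/", HomeLive
end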

On the other hand, this solution should be strictly better than a cache layer for a single tab: there is zero copying and smaller payloads are sent on both connected render and reconnects. However, keep in mind this is not a cache, so it doesn't share across tabs (and luckily it does not introduce any of the caching issues, such as unbounded memory usage, cache key management, etc.).

There are a few challenges to implementing this:

  • We need to add adoption functionality to Phoenix.Channel first

  • We need to make sure that an orphaned LiveView will submit the correct patch once it connects back. It may be that we cannot squash patches on the server, in which case we would need to queue them (see the sketch after this list), which can introduce other issues

  • We may need an opt-in API
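A minimal sketch of the queuing idea, assuming a hypothetical per-LiveView buffer (OrphanBuffer is made up; the real thing would live inside the channel/LiveView internals):

# Diffs produced while no transport is attached are held in order and replayed
# once a transport adopts the process, since squashing them may not be safe.
defmodule OrphanBuffer do
  defstruct pending: :queue.new(), transport: nil

  # No transport attached: hold on to the diff.
  def push_diff(%__MODULE__{transport: nil} = buf, diff) do
    %{buf | pending: :queue.in(diff, buf.pending)}
  end

  # Transport attached: send immediately.
  def push_diff(%__MODULE__{transport: pid} = buf, diff) do
    send(pid, {:diff, diff})
    buf
  end

  # Called on adoption: replay everything that accumulated, oldest first.
  def attach(%__MODULE__{} = buf, transport_pid) do
    for diff <- :queue.to_list(buf.pending), do: send(transport_pid, {:diff, diff})
    %{buf | pending: :queue.new(), transport: transport_pid}
  end
end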

While this was extracted from #3482, this solution is completely orthogonal to the one outlined there, as live_navigation is about two different LiveViews.

@elliottneilclark
Contributor

elliottneilclark commented Dec 4, 2024

After the dead view renders the HTTP response, we have all of the assigns. If we spend the cycles to start an LV process off the critical path of sending the response, the cost of the extra work is not user-facing. So when the WebSocket connection comes in, the LiveView is already started and the assigns are cached. This would trade a bit of memory and non-critical CPU usage for a faster WebSocket connect.

However, we don't want to leak memory forever, and it's possible that the WebSocket never comes, so there would need to be some eviction system in place (time, memory, etc.). A rough sketch follows.
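A minimal sketch of the time-based eviction piece, under the assumption that the parked process is a plain GenServer holding the dead render's assigns (ParkedView and the :adopt call are hypothetical names):

# The process started after the dead render holds the assigns, but shuts itself
# down if nothing adopts it within the timeout, so memory is not leaked when the
# client never connects.
defmodule ParkedView do
  use GenServer, restart: :temporary

  @idle_timeout :timer.seconds(30)

  def start_link(assigns), do: GenServer.start_link(__MODULE__, assigns)

  @impl true
  def init(assigns) do
    # If no message arrives within @idle_timeout, handle_info(:timeout, _) fires.
    {:ok, %{assigns: assigns}, @idle_timeout}
  end

  @impl true
  def handle_call(:adopt, _from, state) do
    # Hand the cached assigns to the adopting connection and keep running.
    {:reply, {:ok, state.assigns}, state}
  end

  @impl true
  def handle_info(:timeout, state) do
    # Evicted: nobody adopted us in time.
    {:stop, :normal, state}
  end
end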

I don't know the Erlang VM well enough, but is it possible to convince the VM to move ownership of structures (specifically assigns) rather than copying, if there are no live references? That would be another way to make spinning up LV processes even cheaper. Rather than copying assigns to a new LV process, we could drop all references, since we only use them after we have the bytes of the response ready.

@josevalim
Member Author

@elliottneilclark there are some tricks we could do:

  • On HTTP/1, we send Connection: close but keep the process around to be adopted as a LiveView later. No copying necessary. This may require changes to the underlying web servers.

  • On HTTP/2, each request is a separate process, so we can just ask it to stick around.

Outside of that, we do need to copy it; the VM cannot transfer it (reference counting is only used for large binaries). But we can spawn the process relatively early on. For example, if we spawn the process immediately after the router, none of the data mounted in the LiveView needs to be copied; only the assigns set in the plug pipeline that are accessed by the LiveView are copied (using a similar optimization to live_navigation).
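For reference, this is the same shape as today's assign_new/3 optimization, where only the assigns the LiveView actually asks for are reused from the dead render instead of being recomputed (MyApp.Accounts below is a placeholder for an application module):

# On the dead render, :current_user is taken from conn.assigns if the auth plug
# already set it; the anonymous function only runs when it was not set upstream.
def mount(_params, %{"user_token" => token}, socket) do
  socket =
    Phoenix.Component.assign_new(socket, :current_user, fn ->
      MyApp.Accounts.get_user_by_session_token(token)
    end)

  {:ok, socket}
end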

@elliottneilclark
Contributor

Outside of that, we do need to copy it; the VM cannot transfer it (reference counting is only used for large binaries).

That makes sense; large binaries are the equivalent of huge objects in the JVM, with different accounting.

@simoncocking

it's possible that the WebSocket never comes

Our experience is that this happens only on public pages which are exposed to bots / search engines / other automatons. So in our situation:

It may be safer to enable this only on certain pages (for example, where authentication is required)

this is exactly what we'd do.

If the new connection happens on the same node, it is perfect. However, if it happens on a separate node, then we can either do cluster round-trips on every payload, copy only some assigns (from assigns_new) or build a new LiveView altogether (and discard the old one).

We have some LiveViews that do some pretty heavy lifting on connected mount, so we'd need some way to guarantee that this work wouldn't be repeated if the LV was spawned on a different node from the one that receives the WebSocket connection.
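For context, the heavy lifting in question is typically gated behind connected?/1 today, which is exactly the work that would be repeated if a fresh LiveView were built on another node. A rough sketch, assuming the usual use MyAppWeb, :live_view imports and a placeholder load_expensive_data/1:

# The kind of connected-mount work we would not want repeated on another node.
def mount(params, _session, socket) do
  socket =
    if connected?(socket) do
      assign(socket, :data, load_expensive_data(params))
    else
      assign(socket, :data, :loading)
    end

  {:ok, socket}
end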

@josevalim
Member Author

josevalim commented Dec 5, 2024

I just realized that the reconnection approach has some complications. If the client crashes, LiveView doesn't know whether the client received the last message or not. So in order for reconnections to work, we would need to change the LiveView server to keep a copy of all responses and only delete them when the client acknowledges them. This will definitely make the protocol chattier and may affect the memory profile on the server. So for reconnection, we may want to spawn a new LiveView anyway and then transfer the assigns, similar to push_navigate.
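A sketch of what "keep a copy until acknowledged" implies for the server, assuming a hypothetical per-LiveView structure (PendingDiffs is made up) that tags each diff with a sequence number and replays unacknowledged ones on reconnect:

defmodule PendingDiffs do
  defstruct next_seq: 1, unacked: %{}

  # Retain every outgoing diff under a sequence number until the client acks it.
  def push(%__MODULE__{} = state, diff) do
    seq = state.next_seq
    {seq, %{state | next_seq: seq + 1, unacked: Map.put(state.unacked, seq, diff)}}
  end

  # Client confirmed everything up to and including `seq`.
  def ack(%__MODULE__{} = state, seq) do
    %{state | unacked: Map.reject(state.unacked, fn {s, _diff} -> s <= seq end)}
  end

  # On reconnect, resend whatever was never confirmed, in order.
  def replay(%__MODULE__{} = state) do
    state.unacked |> Enum.sort_by(fn {s, _diff} -> s end) |> Enum.map(fn {_s, diff} -> diff end)
  end
end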

This goes back to the previous argument that it may be necessary to provide different solutions for each problem, if we want to maximize their efficiency.

We have some LiveViews that do some pretty heavy lifting on connected mount, so we'd need some way to guarantee that this work wouldn't be repeated if the LV was spawned on a different node from the one that receives the WebSocket connection.

This is trivial to do if they are on the same node; it is a bit trickier for distinct nodes. For distinct nodes, you would probably need to opt in and declare that a LiveView's state is transferable, which basically says that you don't rely on local ETS tables or resources (such as dataframes) in your LiveView state.

@bcardarella
Contributor

bcardarella commented Jan 12, 2025

There may be an option here pulling from prior work, although I don't know if the solution is entirely what @josevalim has in mind.

Back in Ember, the SSR framework is called Ember FastBoot. Ember's own data library, Ember Data, introduced a double-render issue: on the server it would request the data and render the template, that template was sent to the client, the app would boot, and then it would make the same request again. The solution was called the Shoebox.

The Shoebox worked by encoding the Ember Data content during SSR and injecting a <script> tag with that data into the response sent back to the client. Ember would then load data from the Shoebox first rather than make the API request to the server again (their version of our double render).

So what if this is what LiveView did? On the dead render, it would also inject certain data from the assigns into the template. When the LiveView connection is made, that data would be sent back (for example as a query param). If it is present, skip the mount function and render a diff back to the client to hydrate the app.

The upside here is that we could skip the double render if the mount is skipped. The trade-off is that there may be data that we don't want to expose to the client.

I could see this working with something like this:

def mount(params, session, socket) do
  data = MyApp.some_data_operations()
  {:ok, shoebox_assign(socket, data: data)}
end

This way the developer explicitly accepts that they have to opt into this rendering benefit. Not assigning to the special keys, or not using the specific assign function, would keep the existing double-render behavior.

Furthermore, it would be on the developer to ensure that whatever data they embed is encodable and does not expose sensitive information.

This doesn't entirely solve the problem for every use case, but considering how difficult this problem has been to solve on the server, I do wonder if we'll ever find a "one solution for all" here.

@hubertlepicki

hubertlepicki commented Jan 23, 2025

The Shoebox worked by encoding the Ember Data content during SSR and injecting a <script> tag with that data into the response sent back to the client.

This data, I assume, could not be server-side-only data; it must not be sensitive? In LV we would load, say, a User schema there, and we don't want to expose it to the client. We would probably store the data somewhere and pass a key to that data to the client instead of the data itself, and that key would be passed back to us on reconnecting.
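A minimal sketch of that key-instead-of-data approach using Phoenix.Token (a real API); MyAppWeb.Endpoint, StateCache, and cache_key are placeholders:

# Dead render: sign a small reference instead of embedding the data itself.
token = Phoenix.Token.sign(MyAppWeb.Endpoint, "shoebox", cache_key)

# Connected mount: the client sends the token back and we look the state up,
# rejecting anything older than the grace period (here, 30 seconds).
case Phoenix.Token.verify(MyAppWeb.Endpoint, "shoebox", token, max_age: 30) do
  {:ok, cache_key} -> StateCache.fetch(cache_key)
  {:error, _reason} -> :recompute
end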

But I like the idea of LiveViews sticking around for X amount of time. I assume they could receive messages over PubSub and otherwise, and we would eliminate the race condition where events can be missed between the original mount and the first connected mount.

To add to @josevalim's options for what to do when a LiveView process connects via WebSocket, we could also:

  1. Try to always connect to the same pod, i.e. "sticky sessions". This is possible but requires load balancer configuration, and those configurations can vary widely. I suspect this would be the easiest solution, but a headache for sysadmins to turn on.

  2. We could attempt to migrate the LV process from the original node/pod to the one the WebSocket connected to, preserving its internal state. I guess that's a can of worms and opens the door to another race condition, but it should be doable. I have used Swarm and then Horde for something like that before (a rough sketch of the lookup piece follows this list).
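A minimal sketch of the cross-node lookup piece from item 2, using Erlang's :global as a stand-in for Swarm/Horde; session_token, live_view_pid, and the :handoff_assigns message are hypothetical:

# On the node that rendered the dead view: register the parked process under a
# cluster-wide name.
:yes = :global.register_name({:parked_lv, session_token}, live_view_pid)

# On whichever node the WebSocket lands: look the process up and ask it to hand
# over its state (GenServer.call works transparently across nodes).
case :global.whereis_name({:parked_lv, session_token}) do
  :undefined -> :mount_fresh
  pid -> GenServer.call(pid, :handoff_assigns)
end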

Another cool feature that LV processes that "stick around" would give us is possibly better recovery of state after connection drops. It would be the same functionality, but allowing not only form state to be restored, but also internal LV state, like a user_id that is never exposed on the form.

@bcardarella
Contributor

@hubertlepicki as I said in my comment:

Furthermore, it would be on the developer to ensure that whatever data they embed is encodable and does not expose sensitive information.
