Don't pause processing when send_local_response fails #423

Closed

Conversation

krinkinmu

For context, see the Envoy issue envoyproxy/envoy#28826. Here is a shorter summary:

  1. A wasm plugin calls proxy_send_local_response from both onRequestHeaders and onResponseHeaders
  2. When proxy_send_local_response is called from onRequestHeaders, it triggers a local reply, and that reply goes through the filter chain in Envoy
  3. The same plugin is called again as part of the filter chain processing, but this time onResponseHeaders is called
  4. onResponseHeaders calls proxy_send_local_response, which ultimately does not generate a local reply but does stop filter chain processing.

As a result, we end up with a stuck connection in Envoy: no local reply, and processing is stopped.

I think that proxy-wasm plugins shouldn't use proxy_send_local_response this way, so ultimately whoever created such a plugin shot themselves in the foot. That being said, I think there are a few improvements that could be made on the Envoy/proxy-wasm side to handle this situation somewhat better:

  1. We can avoid stopping processing in such cases to prevent stuck connections in Envoy
  2. We can return errors from proxy_send_local_response instead of silently ignoring them.

Currently, Envoy's implementation of sendLocalResponse can detect when a second local response is requested and returns an error in that case without actually trying to send one.
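
Conceptually, the Envoy-side guard looks something like this (a hedged sketch; the flag name and signature are approximations, not Envoy's actual code):

// Hedged sketch of Envoy's detection of a second local response
// (approximate names; not the actual Envoy code).
WasmResult Context::sendLocalResponse(uint32_t response_code, std::string_view body,
                                      Pairs additional_headers, uint32_t grpc_status,
                                      std::string_view details) {
  if (local_reply_sent_) {
    // A local response was already requested for this stream:
    // report an error instead of trying to send another one.
    return WasmResult::BadArgument;
  }
  local_reply_sent_ = true;
  // ... schedule the local reply to be sent through the filter chain ...
  return WasmResult::Ok;
}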

However, even though Envoy reports an error, send_local_response ignores the result of the host-specific sendLocalResponse implementation, stops processing, and returns success to the wasm plugin.

With this change, send_local_response will check the result of the host-specific implementation of sendLocalResponse. When sendLocalResponse fails, it will just propagate the error to the caller and do nothing else (in particular, it will not stop processing).

I think this behavior makes sense in principle: on the one hand, we don't ignore the failure from sendLocalResponse; on the other hand, when the failure happens, we don't trigger any of the side effects expected from a successful proxy_send_local_response call.
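
To make that concrete, here is a minimal sketch of the new control flow in the send_local_response hostcall; the names approximate the proxy-wasm-cpp-host code, and this is not the exact diff:

// Hedged sketch: propagate the host result and stop iteration only on success.
WasmResult result = context->sendLocalResponse(response_code, body,
                                               std::move(additional_headers),
                                               grpc_status, details);
if (result != WasmResult::Ok) {
  return result;  // propagate the error; no side effects on failure
}
context->wasm()->stopNextIteration(true);  // stop processing only on success
return WasmResult::Ok;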

NOTE: Even though I do think that this is a more reasonable behavior, it's still a change from the previous behavior and it might break existing proxy-wasm plugins. Specifically:

  1. C++ plugins that proactively check the result of proxy_send_local_response will observe a behavior change (previously, proxy_send_local_response failed silently; see the example below)
  2. Rust plugins, due to the way the Rust SDK handles errors from proxy_send_local_response, will crash at runtime in this case.
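
To illustrate the first point, here is a hypothetical C++ plugin that checks the result (mirroring the C++ SDK calls used elsewhere in this thread); with this change it would see the failure:

// Hypothetical plugin code; previously this call reported success even on failure.
FilterHeadersStatus ExampleContext::onResponseHeaders(uint32_t, bool) {
  WasmResult result = sendLocalResponse(403, "", "", {});
  if (result != WasmResult::Ok) {
    logWarn("sendLocalResponse failed");  // the failure is now visible here
  }
  return FilterHeadersStatus::StopAllIterationAndWatermark;
}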

On the bright side, the plugins affected by this change currently just cause stuck connections in Envoy, so we are trading one undesirable behavior for another, more explicit one.

A couple of additional notes for reviewers:

  • If there is no disagreement with the overall approach but you don't want to change user-visible behavior when sendLocalResponse fails, I can revert to silencing the error, though that would not be my first preference;
  • I created a utility function for unit tests to stringify a list of arguments, but I'm pretty sure similar functions already exist in libraries like Abseil; if reviewers are in favor of including Abseil in proxy-wasm-cpp-host, I can spend some time working out how to make that happen without breaking Envoy in the meantime.

@krinkinmu
Author

+cc @keithmattix

@keithmattix
Contributor

Seems reasonable to me! The change seems worth the risk IMO; from what I understand, the affected population is plugin authors who are already pausing Envoy.

@krinkinmu
Author

Seems reasonable to me! The change seems worth the risk IMO; from what I understand, the affected population is plugin authors who are already pausing Envoy.

Yes, that's correct. Plugins affected by this change are already affected, but in a different way.

@PiotrSikora
Member

Propagating the returned status is definitely a good thing to do. But please note that proxy_send_local_response was originally infallible, and the error codepath was added in envoyproxy/envoy#23049 to address the "double send local response" issue (breaking the ABI contract in the process, which is why the Rust SDK panics when this happens).

Having said that:

  1. A plugin returning multiple local responses is clearly buggy, so why are we adding a workaround for it if it's not crashing hosts? We don't add workarounds for plugins that consistently return Pause and never make any progress, which results in exactly the same "stuck" behavior.
  2. What should be the fallback for failed proxy_send_local_response (which in Envoy is pretty much exclusively used for generating short error responses) that plugins should implement? This adds extra complexity to all plugins in order to address a broken use case.

Also, and that's a topic for a separate issue, but I question whether we should be calling the plugin for response processing of a response it generated itself.

@PiotrSikora
Member

  1. What should be the fallback for failed proxy_send_local_response (which in Envoy is pretty much exclusively used for generating short error responses) that plugins should implement? This adds extra complexity to all plugins in order to address a broken use case.

For example, in the Rust SDK's example HTTP authorization plugin, which relies on this behavior (like every other authorization plugin), what should the behavior be when this call fails?

The only solution that comes to mind is returning Pause and adding checks to make sure that Pause is also returned in all other callbacks, which is basically reimplementing the existing logic and "stuck" behavior on the plugin side in a much more error-prone way, and at a much higher cost.

What am I missing?

@krinkinmu
Author

@PiotrSikora you're correct that the plugins that do that are buggy, and it's definitely not the intent here to create workarounds for them.

That being said, I think Envoy can do better in the presence of a buggy plugin and not leave a stuck connection behind. And a buggy plugin could benefit from a signal telling it that it is doing something wrong.

As for the fallback, my thinking here is that plugins shouldn't need a fallback in this case: they should just stop calling proxy_send_local_response when processing a local response. In a way, proxy_send_local_response remains infallible as long as it is used correctly.

Rather than returning an error, I can try to find a way to "crash" the plugin (i.e., the Rust SDK would just panic in this case, and I can see if I can do the same for C++). I can also skip returning an error altogether, though I don't think it's the best way forward, because plugins that do this basically get no clear indication that they are doing something wrong.

I also agree with you that the whole situation where a plugin is processing its own local reply is confusing and seems to be causing subtle issues. So before moving forward with the review of this PR, let me see if I can come up with a change on the Envoy side to address that behavior; maybe the Envoy folks will be receptive to it.

@PiotrSikora
Member

@krinkinmu the existing plugins currently have code that looks like this:

// authorization declined
send_http_response(...);
return Action::Pause;

However, after you start propagating error codes, any well-written plugin should check it, so that code becomes:

// authorization declined
if send_http_response(...) == Status::OK {
   return Action::Pause;
} else {
   // what to do here???
   return Action::Pause;
}

which will result in exactly the same "stuck" behavior that you're trying to fix in this PR.

@krinkinmu
Author

krinkinmu commented Oct 19, 2024 via email

@PiotrSikora
Member

I think well-written plugins shouldn't call send_http_response in cases where it will return an error.

But now that the error is propagated, plugins should check the return value, and plugin authors must think about what to do in case send_http_response fails.

So the code might be like this:

assert(send_http_response(...) == WasmResult::Ok);

assert() or panic() are definitely not going to unstick anything; they're going to bring the whole WasmVM down, so I don't see how that's any improvement.

I also don't quite follow why the fallback for incorrect use of the API would be pausing in the first place. If you pause processing in case of errors, what exactly is supposed to resume processing?

That's exactly my point. What should the fallback behavior be? And why should processing be resumed? If an authorization or anti-exfiltration plugin decides that a request/response should not be sent to the client, but send_http_response fails for some reason, then I suspect that resuming processing and forwarding the request/response (effectively bypassing the security mechanism in place) isn't the most desirable solution.

@krinkinmu
Author

krinkinmu commented Oct 19, 2024 via email

@krinkinmu
Author

krinkinmu commented Oct 19, 2024 via email

@PiotrSikora
Member

Actually, let me step back, because I don't seem to understand the concern here.

I think you might be looking at this with a very narrow focus (i.e., what to do when send_http_response is called again by the same plugin for a locally generated response), but you're covering a lot more with this change.

Errors are usually propagated up the stack to be acted upon, so from the plugin's point of view, if it receives an error, then it should have an alternative code path for it. The "fallback for something that plugins shouldn't do in the first place" isn't really a thing, because no one intentionally writes bad code.

Also, I suspect that this can happen even when send_http_response is called for a local response generated by another native/Lua/Wasm extension in the filter chain, i.e. when the Proxy-Wasm plugin itself is perfectly fine.

If that's the case, then that's a bug in Envoy. It would be good to verify it.

If I keep the patch the same but stop returning an error to the plugin, would you find it problematic? From the plugin's point of view, it will be exactly as it currently works: they call proxy_send_local_response and it silently fails. The only difference is that processing will continue.

But why should the processing be resumed? And what does that entail? If we go back to the authorization plugin example, what do you want to happen here? Do you want the response to be forwarded to the client?

@krinkinmu
Author

krinkinmu commented Oct 19, 2024 via email

@krinkinmu
Author

krinkinmu commented Oct 19, 2024 via email

@PiotrSikora
Member

Yes, I expect the response to be forwarded to the client, given that it was a local response triggered by the plugin in the first place.

As far as I can tell, the changes in this PR aren't limited to that scenario.

And in the case when multiple plugins generate a local response while processing the same request or response, I would expect at least one of those to actually complete, rather than none of them.

Assume the following deployment: "client -> plugin A (anti-exfiltration) -> plugin B -> origin".

"plugin B" generates a local response; "plugin A" decides it should be blocked and calls send_http_response, which fails.

What should happen now? Resuming processing (i.e., sending the response from "plugin B") would completely bypass the security mechanism supposedly provided by "plugin A (anti-exfiltration)".

@krinkinmu
Author

krinkinmu commented Oct 19, 2024 via email

@krinkinmu
Author

krinkinmu commented Oct 19, 2024 via email

@PiotrSikora
Member

In this case plugin A has a few things that it could do:

  1. It could scrape the response to prevent exfiltration
  2. It could intentionally pause processing, if somebody decides that's a desirable behavior, but to do that it shouldn't call proxy_send_local_response in the first place; it should just return Pause directly.

The anti-exfiltration HTTP body example in the Rust SDK uses send_http_response for that purpose, so it would be negatively affected by this change.
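
For reference, the shape of that example, transcribed as a hedged C++ SDK sketch (the original is in Rust; the buffer handling and the "secret" marker here are approximations):

// Approximate C++ rendering of the anti-exfiltration body example's shape.
FilterDataStatus ExampleContext::onResponseBody(size_t body_size, bool end_of_stream) {
  if (!end_of_stream) {
    return FilterDataStatus::StopIterationAndBuffer;  // buffer the full body first
  }
  auto body = getBufferBytes(WasmBufferType::HttpResponseBody, 0, body_size);
  if (body->view().find("secret") != std::string_view::npos) {
    // Replace the leaking response with a locally generated error response.
    sendLocalResponse(403, "", "Access forbidden.\n", {});
    return FilterDataStatus::StopIterationNoBuffer;
  }
  return FilterDataStatus::Continue;
}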

@krinkinmu
Author

krinkinmu commented Oct 19, 2024 via email

@PiotrSikora
Member

I see; would it be fair to summarize your argument as this: proxy_send_local_response could be intentionally used to pause processing in some cases, and that's a use case that we want to support?

Yes, proxy_send_local_response is pretty much always used to stop further processing and to generate an error response describing why (again: authorization and anti-exfiltration plugins).

The "forever pause" in the case of a local response being generated twice is not intentional (and possibly an error in Envoy that's not exclusive to Wasm), but it's actually not a bad fallback.

Ideally though, instead of pausing processing and leaving a hanging connection, IMO, a better behavior would be to close the connection.

I agree that closing the connection might be a better solution.

@krinkinmu
Author

krinkinmu commented Oct 19, 2024 via email

@PiotrSikora
Member

One more question, if you don't mind: if we close the connection when we detect multiple local responses (by the same plugin or different plugins, it doesn't matter), would that be problematic for proxy-wasm extensions?

Returning to the example above, let's say the exfiltration-prevention plugin checks the regular (non-local) response and calls proxy_send_local_response; in this case we just send the local response back to the client.

However, if the same plugin calls proxy_send_local_response while processing another local response, we just close the connection completely without sending anything back.

I think that would be fine, since both effectively stop processing.

Also, note that this is very much an Envoy-specific issue. In principle, it shouldn't make any difference whether the response came directly from the upstream or was generated locally, either by the plugin itself or by another extension.

My thinking here is that when multiple local responses are involved, there are two behaviors that make sense:

  1. Don't allow multiple local responses, and close the connection when we detect the second local response
  2. Allow multiple local responses, but when that happens, only the latest generated response actually gets sent back to the client.
    I'm not completely sure that the second option is feasible yet, but I will look into it in more detail. In the meantime, the first option seems somewhat simpler, so I want to see if there are counter-points to such behavior.

The second option seems reasonable... other than in the case of a buggy plugin that rewrites its own response (which, ironically, is what you're trying to address with this PR).

Also, it looks like Envoy has a few ways to generate a LocalReply nowadays: prepareLocalReplyViaFilterChain() and sendDirectLocalReply(). Perhaps one of those works better?

@krinkinmu
Author

Thank you! I'm closing this PR, and will see if I can make a broader-scoped change to the general local-reply processing.

@krinkinmu krinkinmu closed this Oct 22, 2024
@krinkinmu
Author

@PiotrSikora I dug through the code and tested a bit with both wasm and native Envoy extensions, and wanted to share what I found here.

TL;DR: it looks like envoyproxy/envoy#23049 can basically be reverted now: it will not cause the use-after-free issue, and it will at the same time prevent the stuck connections left behind by proxy-wasm. I thought you might be interested in this because it would remove the failure case that PR introduced into sendLocalResponse in Envoy.

I will spend a bit more time experimenting, but I'm thinking of preparing a PR that just reverts the change in envoyproxy/envoy#23049; it would be nice if you could weigh in on that change once it's ready.

I also wanted to bring up the hypothetical example you shared in #423 (comment):

Assume the following deployment: "client -> plugin A (anti-exfiltration) -> plugin B -> origin".

"plugin B" generates a local response; "plugin A" decides it should be blocked and calls send_http_response, which fails.

What should happen now? Resuming processing (i.e., sending the response from "plugin B") would completely bypass the security mechanism supposedly provided by "plugin A (anti-exfiltration)".

Reading through the code and testing a bit, I don't think plugin A will work as expected in this scenario, regardless of how proxy-wasm is implemented in Envoy. The problem is that Envoy does not send a locally generated reply through the filter chain in all cases; sometimes it just sends it directly to the client, bypassing the filter chain, and in such cases plugin A wouldn't be called at all. I thought it was important to mention this in case somebody relies on it.

Also, the issue with stuck connections is proxy-wasm-specific; native extensions don't have the same issue. To be precise, if we compare two extensions:

  1. a native Envoy extension with logic like this:
Http::FilterHeadersStatus Filter::decodeHeaders(Http::RequestHeaderMap&, bool) {
  decoder_callbacks_->sendLocalReply(
      Http::Code::BadRequest, "", &modifyHeaders, absl::nullopt, "");
  return Http::FilterHeadersStatus::StopAllIterationAndWatermark;
}

Http::FilterHeadersStatus Filter::encodeHeaders(Http::ResponseHeaderMap&, bool) {
  decoder_callbacks_->sendLocalReply(
      Http::Code::BadRequest, "", &modifyHeaders, absl::nullopt, "");
  return Http::FilterHeadersStatus::StopAllIterationAndWatermark;
}
  2. a proxy-wasm extension with logic like this:
FilterHeadersStatus ExampleContext::onRequestHeaders(uint32_t, bool) {
  sendLocalResponse(400, "", "", {});
  return FilterHeadersStatus::StopAllIterationAndWatermark;
}

FilterHeadersStatus ExampleContext::onResponseHeaders(uint32_t, bool) {
  sendLocalResponse(400, "", "", {});
  return FilterHeadersStatus::StopAllIterationAndWatermark;
}

They behave differently, even though they look similar. The native extension will end up sending the response from encodeHeaders back to the client, while the proxy-wasm extension will end up with a stuck connection. The difference is that the native extension actually calls sendLocalReply before stopping iteration, while the proxy-wasm extension ends up stopping iteration without calling sendLocalReply from onResponseHeaders (because of the changes introduced in envoyproxy/envoy#23049 to deal with the use-after-free issue).

Finally, it seems that sendLocalReply in Envoy already tries to close the connection when the local reply cannot be sent to the client, though in all cases I tried it also triggers an assert and kills Envoy, but that's a separate issue.

@PiotrSikora
Member

@krinkinmu great job, thank you for investigating this!

Yeah, if that PR can be safely reverted (retaining the test cases), then that's great.

cc @johnlanni @wbpcode @mpwarres @martijneken

They behave differently, even though they look similar. The native extension will end up sending the response from encodeHeaders back to the client, while the proxy-wasm extension will end up with a stuck connection. The difference is that the native extension actually calls sendLocalReply before stopping iteration, while the proxy-wasm extension ends up stopping iteration without calling sendLocalReply from onResponseHeaders (because of the changes introduced in envoyproxy/envoy#23049 to deal with the use-after-free issue).

There are a lot of deferred actions we do in Wasm to avoid re-entrance from callbacks.

Coincidentally, stopNextIteration(), which was added to stop plugins from returning Continue after sending a local response, might help here by deferring StopAllIterationAndWatermark to after the deferred local response? Would you mind checking whether returning FilterHeadersStatus::Continue instead in your examples changes anything? Thanks!

@johnlanni

johnlanni commented Oct 23, 2024

@krinkinmu I looked at the latest version of the Envoy code. If this PR is reverted, the issue mentioned here should still exist: envoyproxy/envoy#23047 (comment)

  1. The wasm function proxy_on_request_body calls the host function proxy_send_local_response twice, registering the sendLocalReply calls via addAfterVmCallAction
  2. doAfterVmCallActions only pops the first function in the queue.
  3. The first sendLocalReply happens before onRequestBody returns, and it will trigger the wasm filter's encodeHeaders and enter the wasm function proxy_on_response_header
  4. The second sendLocalReply happens before onResponseHeaders returns, and it will re-enter decoder_callbacks_->sendLocalReply.
  5. The second call of sendLocalReply will trigger sendDirectLocalReply, which will override the response headers, and the reference held by the first sendLocalReply will be released.
  6. The cors filter's encodeHeaders will be called after the wasm filter's encodeHeaders during the first sendLocalReply, and it will access the freed memory through the reference to the old response headers.

@krinkinmu
Author

@johnlanni compared to envoyproxy/envoy#23049, when proxy-wasm now calls into sendLocalReply during request processing, it will end up calling prepareLocalReplyViaFilterChain instead of sendLocalReplyViaFilterChain. The difference between the two is that prepareLocalReplyViaFilterChain just saves the local reply and returns without triggering it; instead, it gets triggered later by executeLocalReplyIfPrepared.

So with this change in mind, if we return to the example you had in envoyproxy/envoy#23047 (comment), we will end up with the following sequence of events:

  1. proxy_on_request_body calls proxy_send_local_response twice and registers two delayed functions that call sendLocalReply
  2. before returning to Envoy, doAfterVmCallActions will pop the first one and execute it
  3. It will end up in prepareLocalReplyViaFilterChain, which does not start the filter chain and does not call the wasm filter's encodeHeaders; it returns after saving a copy of the generated local response to be sent later
  4. doAfterVmCallActions will pop and execute the second function
  5. The second call will end up in prepareLocalReplyViaFilterChain again, but this time it will see that there is already a prepared local reply, do nothing, and just return (see the sketch below).
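
A hedged sketch of the guard described in step 5; this approximates the behavior of Envoy's FilterManager, and the type and member names are hypothetical, not Envoy's literal code:

// Approximate shape of the idempotency guard (hypothetical member names).
void FilterManager::prepareLocalReplyViaFilterChain(const LocalReplyData& data) {
  if (prepared_local_reply_) {
    return;  // a local reply is already prepared; this call is a no-op
  }
  // Save the reply instead of sending it; executeLocalReplyIfPrepared()
  // triggers it later, outside of the plugin callback.
  prepared_local_reply_ = prepareLocalReply(data);
}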

To make sure, I also did an ASAN build of Envoy and ran an example similar to the one in envoyproxy/envoy#23047 (comment). Here is my Envoy config:

static_resources:
  listeners:
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 5678
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                codec_type: AUTO
                route_config:
                  name: local_routes
                  virtual_hosts:
                    - name: local_service
                      domains: "*"
                      routes:
                      - match: { path: "/" }
                        name: test
                        route:
                          cluster: some_service
                          cors:
                            allow_methods: "GET,POST,PUT,DELETE,HEAD,OPTIONS,PATCH"
                            allow_headers: "*"
                            expose_headers: "*"
                            max_age: "86400"
                            allow_credentials: true
                            filter_enabled:
                              default_value:
                                numerator: 100
                            allow_origin_string_match:
                              safe_regex:
                                google_re2: {}
                                regex: ".*"
                http_filters:
                  - name: envoy.filters.http.cors
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors
                  - name: envoy.filters.http.wasm
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm
                      config:
                        name: "http_headers"
                        vm_config:
                          code:
                            local:
                              filename: "/mount/data/ws/proxy-wasm/proxy-wasm-cpp-sdk/bazel-bin/example/http_wasm_example.wasm"
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: some_service
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: some_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 5678

I didn't use the Go SDK though, and instead created what I think is an equivalent C++ SDK filter:

class ExampleContext : public Context {
public:
  explicit ExampleContext(uint32_t id, RootContext *root) : Context(id, root) {}

  FilterDataStatus onRequestBody(size_t headers, bool end_of_stream) override;
  FilterHeadersStatus onResponseHeaders(uint32_t headers, bool end_of_stream) override;
};

static RegisterContextFactory register_ExampleContext(CONTEXT_FACTORY(ExampleContext));

FilterDataStatus ExampleContext::onRequestBody(size_t, bool) {
  logDebug("onRequestBody");
  sendLocalResponse(400, "", "", {});
  sendLocalResponse(401, "", "", {});
  return FilterDataStatus::StopIterationAndWatermark;
}

FilterHeadersStatus ExampleContext::onResponseHeaders(uint32_t, bool) {
  logDebug("onResponseHeaders");
  return FilterHeadersStatus::Continue;
}

And I tested it like this:

$ curl -F "hello=hello" localhost:5678 -vvv -i
*   Trying 127.0.0.1:5678...
* Connected to localhost (127.0.0.1) port 5678 (#0)
> POST / HTTP/1.1
> Host: localhost:5678
> User-Agent: curl/7.81.0
> Accept: */*
> Content-Length: 145
> Content-Type: multipart/form-data; boundary=------------------------9a0454caa399f23e
> 
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
HTTP/1.1 400 Bad Request
< date: Wed, 23 Oct 2024 09:53:19 GMT
date: Wed, 23 Oct 2024 09:53:19 GMT
< server: envoy
server: envoy
< connection: close
connection: close
< content-length: 0
content-length: 0

< 
* Closing connection 0

ASAN does not complain and I also stepped through the code under gdb to see that my understanding of the code flow matches what actually happens in practice.

Could you share more about why you think the issue still exists? Maybe there is a different example that triggers the use-after-free that I'm overlooking?

@johnlanni

@krinkinmu Thank you for your detailed analysis. I didn't notice the handling of prepareLocalReplyViaFilterChain; your analysis is correct.

@krinkinmu
Author

@krinkinmu great job, thank you for investigating this!

Yeah, if that PR can be safely reverted (retaining the test cases), then that's great.

cc @johnlanni @wbpcode @mpwarres @martijneken

They behave differently, even though they look similar. The native extension will end up sending the response from encodeHeaders back to the client, while the proxy-wasm extension will end up with a stuck connection. The difference is that the native extension actually calls sendLocalReply before stopping iteration, while the proxy-wasm extension ends up stopping iteration without calling sendLocalReply from onResponseHeaders (because of the changes introduced in envoyproxy/envoy#23049 to deal with the use-after-free issue).

There are a lot of deferred actions we do in Wasm to avoid re-entrance from callbacks.

Coincidentally, stopNextIteration(), which was added to stop plugins from returning Continue after sending a local response, might help here by deferring StopAllIterationAndWatermark to after the deferred local response? Would you mind checking whether returning FilterHeadersStatus::Continue instead in your examples changes anything? Thanks!

@PiotrSikora I did check, and it seems that returning Continue from either the proxy-wasm plugin or the native plugin does not actually change anything.

For proxy-wasm, that's what I expected: regardless of what the plugin returns, proxy-wasm will override the return status when proxy_send_local_response is called. For native plugins I had to read through the code a bit, and basically something very similar happens:

  1. DownstreamFilterManager::sendLocalReply sets state_.decoder_filter_chain_aborted_ = true here
  2. It then checks that flag and overrides the status here (sketched below)
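
In other words, something like this hedged approximation of the FilterManager logic (not Envoy's literal code):

// Approximate shape of the override described in the two steps above.
if (state_.decoder_filter_chain_aborted_) {
  // sendLocalReply() aborted the decoder filter chain; whatever status the
  // filter returned (Continue, StopAllIterationAndWatermark, ...) is overridden.
  status = FilterHeadersStatus::StopIteration;
}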

Which makes me wonder whether we still need to defer local responses in proxy-wasm at all. It looks like envoyproxy/envoy#24367 addressed the issue, at least for local replies, in a generic way that works for all plugins, not just proxy-wasm.

krinkinmu added a commit to krinkinmu/envoy that referenced this pull request Oct 24, 2024
kyessenov pushed a commit to envoyproxy/envoy that referenced this pull request Oct 31, 2024
…6809)

Commit Message:
This change fixes #28826. Some
additional discussions for context can be found in
proxy-wasm/proxy-wasm-cpp-host#423.

The issue reported in #28826
happens when a proxy-wasm plugin calls proxy_send_local_response during
both HTTP request processing and HTTP response processing.

This happens because, in an attempt to mitigate a use-after-free issue (see
#23049), we added logic to
proxy-wasm that avoids calling sendLocalReply multiple times.

So now when a proxy-wasm plugin calls proxy_send_local_response, only the
first call will result in sendLocalReply, while all subsequent calls
will get ignored. At the same time, when proxy-wasm plugins call
proxy_send_local_response, because it's used to report an error in the
plugin, proxy-wasm also stops iteration.

During HTTP request processing this leads to the following chain of
events:

1. During request proxy-wasm plugin calls proxy_send_local_response
2. proxy_send_local_response calls sendLocalReply, which schedules the
local reply to be processed later through the filter chain
3. The request processing filter chain gets aborted and Envoy sends the
previously created local reply through the filter chain
4. Proxy-wasm plugin gets called to process the response it generated
and it calls proxy_send_local_response
5. proxy_send_local_response **does not** call sendLocalReply, because
proxy-wasm prevents multiple calls to sendLocalReply currently
6. proxy-wasm stops iteration

So in the end, the filter chain iteration is stopped for the response, and
because proxy_send_local_response does not actually call sendLocalReply,
we don't send another locally generated response either.

I think we can do slightly better and close the connection in this case.
This change includes the following parts:

1. Partial rollback of #23049
2. Tests covering this case and some others, using the actual
FilterManager.

The most important question is: why is rolling back
#23049 safe now?

The reason it's safe is that since the introduction of
prepareLocalReplyViaFilterChain in
#24367, calling sendLocalReply
multiple times is safe; that PR basically addresses the issue in a
generic way for all plugins, so a proxy-wasm-specific fix is not
needed anymore.

On top of being safe, there are additional benefits to making this
change:

1. We don't end up with a stuck connection in case of errors, which is
slightly better
2. We remove a failure mode from proxy_send_local_response that was
introduced in #23049, which is
good, because proxy-wasm plugins don't have a good fallback when
proxy_send_local_response fails.

Finally, why replace the current mocks with a real FilterManager?

A mock implementation of sendLocalReply works fine for tests that just
need to assert that sendLocalReply gets called. However, in this case we
rely on the fact that it's safe to call sendLocalReply multiple times
and that it will do the right thing, and we want to assert that the
connection gets closed in the end; that cannot be tested by just checking
that sendLocalReply gets called or by relying on a simplistic mock
implementation of sendLocalReply.

Additional Description:
Risk Level: low
Testing: Manually, by reproducing the case reported in
#28826. I also added new unit
tests and verified that they pass and aren't flaky:

```
bazel test --runs_per_test=1000 //test/extensions/common/wasm:all --config=docker-clang
```

Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
Fixes #28826

---------

Signed-off-by: Mikhail Krinkin <[email protected]>