Restarted host is not picking up newly-deployed assemblies #2636

brettsam · 2018-04-05T20:17:52Z

Spawning this from #1690, as it is a separate issue.

While investigating #1690, I noticed that a customer was seeing ~20 simultaneous requests to shutdown the host, yet the process never recycled. The newly deployed assembly was never loaded so the site had to be restarted to pick up the changes.

Details on finding site name:

| where PreciseTimeStamp between (datetime(2018-03-08 11:00) .. datetime(2018-03-08 12:10))
| where Summary contains "c67d3f04-1134-4b2d-a4c3-266878710784"

And then for seeing the full timeframe of this occurrence:

| where PreciseTimeStamp between (datetime(2018-03-08 11:00) .. datetime(2018-03-08 12:10))
| where AppName == "{sitenamefoundabove}"
| project PreciseTimeStamp, Pid, Details, Summary, FunctionName

You'll see that an environment restart was requested around 11:05, yet the process id never changed. That didn't happen until the next request around 12:05, at which point the host shuts down and the process id changes.

@simonness, FYI

The text was updated successfully, but these errors were encountered:

brettsam · 2018-04-05T20:18:19Z

@fabiocav -- FYI, this is what we were chatting about earlier.

paulbatum · 2018-04-18T17:27:58Z

@brettsam This issue is relevant to V1 only, right?

brettsam · 2018-04-18T17:34:16Z

Yes -- v2 exits the process itself rather than signalling for an environment shutdown.

brettsam · 2018-04-18T17:59:28Z

I'm seeing more and more cases of this -- another just happened where the Health Monitor is trying to shut down the host over-and-over. We see the log that "Environment shutdown has been triggered," but the Pid never changes. Repeat forever.

paulbatum · 2018-04-18T18:57:27Z

I do not think the process ID is supposed to change in this case - the app domain should recycle and that should be sufficient. Now we're supposed to log the app domain during host start. Can you check the log to see if the app domain was changing after each HostingEnvironment.Shutdown ?

cc @pragnagopa

brettsam · 2018-04-18T19:52:34Z

Yeah, you're right. I see the AppDomain Id changing in this case, yet the site still fails the very first Host Health check, before any functions are executed. I think this may be a separate issue, then.

paulbatum · 2018-04-18T21:19:12Z

Can you send @mathewc a kusto query so that he can take a look.

brettsam · 2018-04-19T13:54:30Z

I'm re-titling this as I misunderstood how that API worked. It looks like the shutdown is happening -- but rather than recycling the process, it recycles the AppDomain. But that doesn't explain why we see apps occasionally using old assemblies after an AppDomain recycle.

We've had one internal report of this as well, and there are likely others that just don't realize it.

There's something fishy going on there.

paulbatum · 2018-04-25T18:20:45Z

I have a suspicion that this is due to VSTS default usage of the old kudu ZIP api. Things to check:

does this issue ever occur if webdeploy is enabled
does the old zip API block on full extraction, or does it return early? If it returns early, that could explain a test seeing the old assembly after deployment.

simonness · 2018-07-09T20:40:39Z

Hi @brettsam @paulbatum thanks for your help on this one so far. Was there a resolution?

I've seen something similar again today, this time 2 different versions of dlls running at the same time for quite a while after a deployment. I've raised #3119 for it. Much appreciated if you could take a look!

brettsam · 2018-07-10T15:46:29Z

@mathewc / @fabiocav -- I think we've discussed this in the past (#3119 is another instance).

In this case, it was a Web Deploy. I can see the host restarts and recycles the AppDomain, yet the previous file version is loaded. We can't see that in the logs, but the customer is reporting it. Could there be some file caching issue upon restart?

fabiocav · 2018-11-06T19:44:01Z

Going through some old threads and noticed this was waiting for a response. Is this still an issue?

cgillum · 2018-11-19T23:09:29Z

FYI: We have a customer seeing a similar issue here: Azure/azure-functions-durable-extension#512

ghills · 2018-11-20T03:31:08Z

Beyond the issue described in Azure/azure-functions-durable-extension#512, I think I saw it happen again today in a slightly different way.

I deployed updated code via web deploy to a pre-compiled c# function app. The deployment output showed the expected dll & pdb update being pushed and a new json file being uploaded for an added function. However, while trying to test out the new functionality and looking at some trace output in AI it appeared that my added durable function activities were not being called at all. In the Azure portal the added function name showed up in the list but when I selected it I got an error about the function name being invalid (I don't have the exact verbage anymore) but based on my searching it appeared to be an error when the function could not be found in the dll.

In this case restarting the app service didn't do the trick. What I did was redeploy the exact same binaries again, via web deploy again the same way, and the output here showed as I expected with no updated or new files but it did the synchronizing step. After this, my new durable function activities were being called and working like expected and the function was selectable without error in the Azure portal.

I'd be happy to provide any details or ID values regarding the environment this happened in if it would be helpful in debugging.

function invocation from environment (West US):
[2018-11-20T03:21:00.014Z] Function started (Id=bf366cf6-50aa-4da4-a10c-ae931b991003)

First deployment time: 19 Nov 13:25 PST
Second deployment time: 19 Nov 14:25 PST

Update
Here is the wording of the error I saw in the portal:

Function (WriteAppDocumentStatus) Error: The function type name 'Puller.Pipeline.Activities.WriteAppDocumentStatusActivity' is invalid.

paulbatum · 2018-11-22T00:37:02Z

@ghills You mentioned that issuing a restart did not resolve this, but when did you issue the restart? The reason I ask is that I'm looking at our logs and I can't see your process ID changing during this sequence, and a restart should absolutely result in a new process ID.

In general I'm having trouble matching the timestamps in my logs up with your timeline. What I see roughly (all times UTC):

2018-11-19 17:57:48.8073066 - The app starts seeing filesystem updates and starts shutting down the current host instance and initializing a new one.

2018-11-19 17:57:51.4153313 - New host starts executing functions.

2018-11-19 17:57:51.5755555 - New host sees another file change come in, triggers same process as above

2018-11-19 17:57:52.4594099 - Second new host starts

2018-11-19 17:57:52.8923307 - this host hits an error

Failed to load type 'Puller.Pipeline.Activities.WriteAppDocumentStatusActivity' from 'D:\home\site\wwwroot\bin\Puller.dll'

2018-11-19 17:58:23.5163803 - this host emits the error you noted above:

The following 1 functions are in error:
WriteAppDocumentStatus: The function type name 'Puller.Pipeline.Activities.WriteAppDocumentStatusActivity' is invalid.

The same host keeps running executing your other functions but not your new one.

The fact that two new hosts started so close together is likely to represent a problem. It looks like your deployments are slow enough that you are getting host restarts that land in a partial state. I still can't explain why we failed to load your new function after the last host restart. I suspect there is a very subtle race condition here.

My best recommendation right now is to move to run from package deployments. VS has native support for these deployments now and they basically eliminate this type of problem.

ghills · 2018-11-22T04:38:28Z

@paulbatum
Sorry about the time discrepancies - apparently I was looking at another deployment from later in the day when I transcribed the times. Here are the corrected times (based on the duration reported by AppVeyor, unfortunately just to the minute accuracy):

Times in UTC
first deployment
start: 2018-11-19 17:57
duration: 26 seconds

second deployment (no changes to files)
start: 2018-11-19 21:25
duration: 17 seconds

The first deployment when the binaries were updated to the version with the new function lines up with what you saw in the logs.

Regardless your explanation makes sense given the timeline you provided. Deploying this project has averaged 20-30 seconds to do the deployment so that leaves a large enough window to explain the second new host starting up while it was in progress.

I tried restarting the app from the portal well after I deployed the updated binaries the first time but before I re-deployed the binaries in the second one. Something like 20:00 UTC probably. Now that I think about it though, this was after I had hit the error in the portal trying to select the added function and after that error parts of the portal seemed to be failing to load for me so I wonder if I actually triggered the restart to happen or not.

Either way it sounds like I will be looking at moving away from web deploy as we've had this happen a couple of times now and I would prefer to avoid whatever issue is causing the behavior. I just need to figure out how to integrate our CD workflow to run from package!

Thanks! And thank you to @cgillum for getting me redirected here as I didn't know initially where the issue was stemming from.

paulbatum · 2018-11-27T00:55:28Z

Yeah I don't think your restart went through. I can't find a record of it in our logs. I think this deployment shows that our 500ms "wait for more files" logic needs to be relaxed slightly. I filed a separate issue for that:
#3818

paulbatum · 2018-11-27T00:56:41Z

@fabiocav I think the only recent activity on this issue is a different case to what is being described (I confirmed that no restart went through). Perhaps we should close this one?

andrewdmoreno · 2019-10-22T21:15:36Z

I apologize for stirring up an old issue, but I did come across this the last few days and was struggling to figure out why the code in my updated assemblies was not executing. Finally decided to do a manual restart and, voila, after a brief cold start the next runs of my durable function executed with the new code.

I'm using the Azure Functions deploy task in Azure DevOps, and below is the output from the last execution:

2019-10-22T18:47:59.4143684Z ##[section]Starting: Deploy Function App
2019-10-22T18:47:59.4266170Z ==============================================================================
2019-10-22T18:47:59.4266279Z Task         : Azure Functions
2019-10-22T18:47:59.4266370Z Description  : Update a function app with .NET, Python, JavaScript, PowerShell, Java based web applications
2019-10-22T18:47:59.4266443Z Version      : 1.158.0
2019-10-22T18:47:59.4266516Z Author       : Microsoft Corporation
2019-10-22T18:47:59.4266587Z Help         : https://docs.microsoft.com/azure/devops/pipelines/tasks
2019-10-22T18:47:59.4266680Z ==============================================================================
2019-10-22T18:48:00.4286092Z Got service connection details for Azure App Service:'xxxxxx'
2019-10-22T18:48:12.3255470Z Updating App Service Application settings. Data: {"WEBSITE_RUN_FROM_PACKAGE":"1"}
2019-10-22T18:48:12.3255867Z Deleting App Service Application settings. Data: ["WEBSITE_RUN_FROM_ZIP"]
2019-10-22T18:48:12.6355088Z Updated App Service Application settings and Kudu Application settings.
2019-10-22T18:48:21.4600716Z Package deployment using ZIP Deploy initiated.
2019-10-22T18:48:37.3461167Z Successfully deployed web package to App Service.
2019-10-22T18:48:37.3471374Z Updating App Service Application settings. Data: {xxxx}
2019-10-22T18:48:37.3473221Z Updated App Service Application settings and Kudu Application settings.
2019-10-22T18:48:39.6308614Z Successfully added release annotation to the Application Insight : xxxxxx
2019-10-22T18:48:40.1181500Z App Service Application URL: http://xxxxx.azurewebsites.net
2019-10-22T18:48:40.1417639Z ##[section]Finishing: Deploy Function App

Is there any additional information I can provide to help in the diagnosis? Or perhaps there is something blatantly obvious in those logs that would explain why it may be occurring?

andrewdmoreno · 2019-10-28T15:50:47Z

As an update to my last comment, this seems to be happening pretty consistently on our end (function deployment requires a manual restart in order for new code to run).

Just a heads up that we will probably be submitting a formal support request for this today to see if we can get some traction as perhaps it's something we are doing wrong since no one else appears to be having this issue.

jimmcslim · 2019-11-12T12:06:48Z

This may be something I have experienced as well. With Pulumi's addition of .NET Core support to their infrastructure-as-code product I have been experimenting with some of their samples, one of which deploys a basic Azure Functions app...

https://github.com/pulumi/examples/tree/master/azure-cs-functions-raw

I successfully build and publish the functions app in the functions directory of that sample.
I successfully deploy the app to Azure via pulumi up.
I successfully call the app from a browser at the endpoint provided by the deployment process.
I make a minor change to the app... changing the salutation from 'Hello!' to something less salutary...
I run dotnet publish functions again to change the content of the app locally.
I run pulumi up to ultimately have the new content published to Azure as a zip.
While this is running I refresh a browser and briefly see the new message get served up... but then reverts to the old message?
Leave it for some time (e.g. the functions app goes 'cold') and come back, the new code is now running.

paulbatum · 2019-11-12T19:39:07Z

@jimmcslim do you know how pulumi up is deploying the updated zip? If it is deploying it to a blob in azure storage, then a restart is required to force the function app to pick up the changes.

jimmcslim · 2019-11-13T10:48:13Z

I'm trying to get a more detailed set of logs out of it, but I believe the steps it follows are:

Create a zip file locally, having been pointed at the output of dotnet publish.
Upload the zip file to Azure blob storage into a known container, with a unique name.
Update the properties of the Azure Functions app to point at the new zip file in blob storage.
I don't believe there is an explicit restart command sent.

So I guess I kind of thought that step 3 would be the prompt for Azure Functions to reload the new code. And it seems like sometimes it actually does; if I madly reload the page in the browser I see the new response come through... other times it seems to see the new code for a few requests and then goes back to the old code, in which case I do need to restart the app.

I appreciate that putting an explicit restart step in is probably for the best - but just trying to understand if there is a reasonable suggestion for the behaviour I was seeing.

So that said, what API should I be calling to trigger the restart? I can click the button in the portal but obviously that's not infrastructure-as-code.

fabiocav · 2019-11-13T17:33:46Z

@jimmcslim could you please open a new issue with those details? It would be good to track this separately as it is targeting V2 and the root cause is not related to the what this issue is originally tracking.

jimmcslim · 2019-11-13T21:38:08Z

Actually I had realised that the Pulumi sample creates a functions app with the v3 preview host.

I can remove all the Pulumi stuff from an issue.; ultimately I would be publishing a new version of an existing functions app to a zip in Azure Storage, updating the WEBSITE_RUN_FROM_PACKAGE configuration setting of the existing functions app to point at the URL of the new zip, and expecting the host to reload the code from the new URL.... without having to do an explicit restart? But maybe that is not supported behaviour?

paulbatum · 2019-11-13T21:59:33Z

sorry I wasn't clear with my previous answer. If pulumi is overwriting an existing zip then an explicit restart is required. If its creating a new zip, then as you said, updating the appsettings to point to that new zip should be sufficient. Anyway as fabio mentioned, I think you should be discussing this behavior in a different issue to this one.

jcd79 · 2020-02-18T13:05:10Z

As an update to my last comment, this seems to be happening pretty consistently on our end (function deployment requires a manual restart in order for new code to run).

Just a heads up that we will probably be submitting a formal support request for this today to see if we can get some traction as perhaps it's something we are doing wrong since no one else appears to be having this issue.

Hi, we are experience the exact same thing with DevOps releases. Did you find/get a way to resolve this issue?

andrewdmoreno · 2020-02-18T14:35:32Z

@jcd79 Unfortunately no. It stopped occurring for a time, but the issue has resurfaced for us and we are struggling with it again. I haven’t yet had time to open a new issue as requested above by @fabiocav since the original issue in this thread was not having anything to do with V2. If you have an opportunity to provide the details of what you’re seeing in a new issue I can have some of my team members gather details as well.

petro2050 · 2020-04-18T14:23:50Z

Over the last 2 years, I've had several instances of a similar issue where deployed files were actually not being deployed even though VS publish was successful. For me, the solution was to restart the app. The worst thing happened only recently. I made several changes in a single publish and some were deployed while others were not.

Azure/Azure-Functions#1555

paulbatum · 2020-04-21T17:26:02Z

@petro2050 I don't know what the underlying cause was of your partial deployment, but the possibility of this can be avoided completely by using run-from-package:
https://docs.microsoft.com/en-us/azure/azure-functions/run-functions-from-deployment-package

JaimeHu · 2020-04-21T19:23:28Z

My team has seen this occasionally over the last few months and restarting the app usually fixes it. We're using Azure Functions v2, WEBSITE_RUN_FROM_PACKAGE is set to 1, and we're using the Azure DevOps task Azure Functions App Deploy. The ADO Task deployment is set to "Autodetect" and the logs show it's using Zip Deploy and succeeding.

We're investigating an issue today that started last night and is behaving like an old version of our code is running but I'm having trouble verifying that's the case. Restarting the app did not fix it. I'm trying to figure out how to examine the content of the package that is running but haven't figured out how to get a look at yet for more information.

JaimeHu · 2020-04-21T20:03:32Z

I figured out how to download the bits from the server and verified it did have our latest by comparing timestamps on the dlls. They lined up with our latest deployment from this morning.

AartBluestoke · 2020-04-22T00:06:46Z

i have seen this issue in my functions, where the running code was different from the "successfully" devops deployed zip,
This was traced to a corrupt host.json. There may be other scenarios where the uploaded zip is unable to be used, so the old code continues to run.

petro2050 · 2020-04-22T03:36:39Z

@paulbatum I'm on V3 and already running from package. As others mentioned, there are occasional differences between what is supposedly deployed and what is actually deployed. I'll be going live this year and worry about this biting me in the future, especially considering multiple active instances. The process to deploy changes to every worker has to be robust.

paulbatum · 2020-04-22T21:11:55Z

@petro2050 Can you give me more details about what you saw when you said "I made several changes in a single publish and some were deployed while others were not." If you are running from a package, you might have had some instances running the older package, and some instances running the newer package, but I'm not aware of any possible condition where only some of the changes in your new package could be active.

If you give me an execution ID, region and UTC timestamp, I should be able to determine which package that execution was running on.

petro2050 · 2020-04-24T23:26:55Z

@paulbatum How can I know how many active workers are running in a given time? I normally look at Kudu's Process Explorer tab. Since I'm still in DEV, I only see one active worker (which is fine). I have yet to see two or more.

What happened with me is that I have a common helper class that saves records to an Azure Table. I changed the partition key logic to save a long instead of a Guid. After deployment and waiting for at least 10 mins, I started seeing mixed results. Some of my API calls were saving the new PartitionKey type while others kept saving the old one. The workaround for me was restarting the function app.

That said, I don't have the problem anymore, and I'm not sure if the logs are there to be useful for you - not to mention I did not save the execution ID and other information at the time. You might be right in that another instance was running the old package, but as I said before, I'm not sure if it's possible for me to be assigned multiple workers in DEV with little workload. Can you comment on this?

paulbatum · 2020-04-28T23:29:50Z

@petro2050 Thanks for this additional info. It is possible for multiple workers to be assigned, even when load is low. This sometimes happens when we are running a platform upgrade. We start running your app on another machine so that we can unload it from the existing machine, and we do this with overlap so there is no downtime. I think what was happening here is as you said, an older version of the app running in parallel with the new version.

The fact that even after waiting 10 minutes, the old version was still running and you had to do a restart is suspicious. It sounds like something did not work right. But this is different to the problem I was worried about that can happen very occasionally with deployments that are not based on a package, where the app picks up only some of the file changes.

ltalirz · 2024-07-20T05:32:45Z

I see this issue is 4 years old by now, but it still affects me today (runtime version 4.34.2.2).

I deployed a new version of the app via the Azure Functions VS Code extension (reports deployment completed). Then I restarted the app from the Azure Portal. Even after that, when testing the function via "Test/Run" in the Azure portal, the old code remained active.

Only after 'some time' the request went to a different worker (according to worker id shown in the test/run logs in the Azure portal) and the new code took effect.

What is the recommended way to check whether a deployment of an Azure function has actually taken effect for users of the function?

erfannariman · 2025-01-20T12:19:51Z

This just cost us a full day, not understanding why our code changes were not working in the function while deployment was successful. What solved the issue at our end was to use a different GitHub Actions deployment, this one: https://learn.microsoft.com/en-us/azure/azure-functions/functions-how-to-github-actions?tabs=linux%2Cpython&pivots=method-template#example-workflow-configuration-file

This deployment yaml is different then the one which is generated through deployment center when you link your GitHub repo.

brettsam added the 1.x label Apr 18, 2018

brettsam changed the title ~~HostingEnvironment.InitiateShutdown() not always shutting down host~~ Restarted host is not picking up newly-deployed assemblies Apr 19, 2018

paulbatum added this to the Unknown milestone Apr 25, 2018

cgillum mentioned this issue Nov 19, 2018

Failed orchestrations appearing to be from inconsistent versions after web deploy Azure/azure-functions-durable-extension#512

Closed

simonness mentioned this issue Jan 10, 2019

Old code still running intermittently after a deployment #3119

Closed

Restarted host is not picking up newly-deployed assemblies #2636

Restarted host is not picking up newly-deployed assemblies #2636

Comments

brettsam commented Apr 5, 2018

brettsam commented Apr 5, 2018

paulbatum commented Apr 18, 2018

brettsam commented Apr 18, 2018

brettsam commented Apr 18, 2018

paulbatum commented Apr 18, 2018

brettsam commented Apr 18, 2018

paulbatum commented Apr 18, 2018

brettsam commented Apr 19, 2018

paulbatum commented Apr 25, 2018

simonness commented Jul 9, 2018

brettsam commented Jul 10, 2018

fabiocav commented Nov 6, 2018

cgillum commented Nov 19, 2018

ghills commented Nov 20, 2018 • edited Loading

paulbatum commented Nov 22, 2018

ghills commented Nov 22, 2018 • edited Loading

paulbatum commented Nov 27, 2018

paulbatum commented Nov 27, 2018

andrewdmoreno commented Oct 22, 2019

andrewdmoreno commented Oct 28, 2019

jimmcslim commented Nov 12, 2019

paulbatum commented Nov 12, 2019

jimmcslim commented Nov 13, 2019 • edited Loading

fabiocav commented Nov 13, 2019

jimmcslim commented Nov 13, 2019

paulbatum commented Nov 13, 2019

jcd79 commented Feb 18, 2020

andrewdmoreno commented Feb 18, 2020

petro2050 commented Apr 18, 2020

paulbatum commented Apr 21, 2020

JaimeHu commented Apr 21, 2020

JaimeHu commented Apr 21, 2020

AartBluestoke commented Apr 22, 2020

petro2050 commented Apr 22, 2020

paulbatum commented Apr 22, 2020

petro2050 commented Apr 24, 2020

paulbatum commented Apr 28, 2020

ltalirz commented Jul 20, 2024 • edited Loading

erfannariman commented Jan 20, 2025

ghills commented Nov 20, 2018 •

edited

Loading

ghills commented Nov 22, 2018 •

edited

Loading

jimmcslim commented Nov 13, 2019 •

edited

Loading

ltalirz commented Jul 20, 2024 •

edited

Loading