OrleansMessageRejection exception and Orleans stream messages stuck in azure storage queue #8540
I have been able to make further observations on another incident of delayed processing of Orleans stream messages. Today, I observed that 1 of 18 silos was unable to activate grains to process pubsub messages from Orleans streams (Azure storage queue). From the time the silo started until it was shut down (we scale on a schedule), it was not able to process a single Orleans stream message. The other 17 silos processed stream messages; only this one could not. This silo was still able to process Orleans grain messages coming from other silos or clients. Once this silo was shut down, another silo was able to process the stream messages. I'm attaching logs of the OrleansMessageRejection exceptions. Again, I'm running Orleans in Azure Container Apps. Here's a sample message for one of the OrleansMessageRejection errors:
|
Are you able to share your configuration? |
Thanks @ReubenBond ! Here's my Orleans configuration:

```csharp
public static class OrleansStartup
{
public static IHostBuilder UseOrleans2(this IHostBuilder hostBuilder)
{
return hostBuilder.UseOrleans((ctx, siloBuilder) =>
{
string AZURE_STORAGE_CONNECTION_STRING = ctx.Configuration["AZURE_STORAGE_CONNECTION_STRING"];
Log.Logger.Debug($"Configure Orleans with azure storage account: {AZURE_STORAGE_CONNECTION_STRING}");
siloBuilder.Services.AddSerializer(b => b.CommonSiteDocsSerialization());
var collectionAgeMinutes = GetCollectionAgeMinutes(ctx);
Log.Logger.Information("Setting grain collection age to {CollectionAgeMinutes} seconds", collectionAgeMinutes);
siloBuilder.Configure<GrainCollectionOptions>(options =>
{
// docs here: http://sergeybykov.github.io/orleans/Documentation/clusters_and_clients/configuration_guide/activation_garbage_collection.html
// set the value of CollectionAge for all grains
options.CollectionAge = TimeSpan.FromSeconds(collectionAgeMinutes);
});
SetupOrleansStreams(siloBuilder, AZURE_STORAGE_CONNECTION_STRING, ctx.Configuration);
SetupCommonOrleans(siloBuilder, AZURE_STORAGE_CONNECTION_STRING);
if (!ctx.HostingEnvironment.IsProduction())
{
var testIp = ctx.Configuration["testIp"];
SetupOrleansDevelopment(siloBuilder, testIp, ctx.Configuration);
}
else
{
Log.Logger.Information("Initializing Orleans in non-development mode");
siloBuilder
.Configure<ClusterOptions>(options =>
{
options.ClusterId = ctx.Configuration["OrleansClusterId"] ?? "Cluster";
options.ServiceId = "Service";
Log.Logger.Information("Initializing cluster with {ServiceId} and {ClusterId}", options.ServiceId, options.ClusterId);
})
.ConfigureEndpoints(
hostname: Dns.GetHostName(),
siloPort: 11_111,
gatewayPort: 30_000)
.ConfigureLogging(logging =>
{
logging.AddConsole(options => options.LogToStandardErrorThreshold = LogLevel.Error);
logging.AddSerilog();
});
}
});
}
private static int GetCollectionAgeMinutes(HostBuilderContext ctx)
{
// The minimum must be > 60 seconds according to Orleans docs.
var collectionAge = ctx.Configuration["GrainCollectionAgeSeconds"] ?? "300";
if (int.TryParse(collectionAge, out int collectionAgeSeconds))
{
return collectionAgeSeconds > 60 ? collectionAgeSeconds : 65;
}
return 300; // default to 5 minutes
}
private static void SetupCommonOrleans(ISiloBuilder siloBuilder, string AZURE_STORAGE_CONNECTION_STRING)
{
siloBuilder
.UseAzureStorageClustering(opts => opts.ConfigureTableServiceClient(AZURE_STORAGE_CONNECTION_STRING))
.AddAzureBlobGrainStorageAsDefault(options =>
{
options.ConfigureBlobServiceClient(AZURE_STORAGE_CONNECTION_STRING);
})
.AddAzureBlobGrainStorage(name: "documentIntegrationState",
configureOptions: options =>
{
options.ConfigureBlobServiceClient(AZURE_STORAGE_CONNECTION_STRING);
})
.AddAzureBlobGrainStorage(name: "projectionState",
configureOptions: options =>
{
options.ConfigureBlobServiceClient(AZURE_STORAGE_CONNECTION_STRING);
})
.AddLogStorageBasedLogConsistencyProvider("LogStorage");
}
private static void SetupOrleansDevelopment(ISiloBuilder siloBuilder, string testIp, IConfiguration config)
{
Log.Logger.Information("Initializing Orleans in development mode");
siloBuilder
.Configure<ClusterOptions>(options =>
{
options.ClusterId = config["OrleansClusterId"] ?? "Cluster";
options.ServiceId = "Service";
Log.Logger.Information("Initializing cluster with {ServiceId} and {ClusterId}", options.ServiceId, options.ClusterId);
})
.Configure<EndpointOptions>(options =>
{
// Port to use for Silo-to-Silo
options.SiloPort = 11111;
// Port to use for the gateway
options.GatewayPort = 30000;
// IP Address to advertise in the cluster
options.AdvertisedIPAddress = IPAddress.Parse(testIp);
// The socket used by the gateway will bind to this endpoint
options.GatewayListeningEndpoint = new IPEndPoint(IPAddress.Any, 30000);
// The socket used for silo-to-silo will bind to this endpoint
options.SiloListeningEndpoint = new IPEndPoint(IPAddress.Any, 11111);
});
}
private static void SetupOrleansStreams(ISiloBuilder siloBuilder, string AZURE_STORAGE_CONNECTION_STRING, IConfiguration config)
{
siloBuilder.AddAzureQueueStreams("AzureQueueProvider", configurator =>
{
configurator.ConfigureAzureQueue(builder => builder.Configure(options =>
{
options.CommonStreamQueueOptions(AZURE_STORAGE_CONNECTION_STRING, config);
}));
configurator.ConfigureCacheSize(1024);
configurator.ConfigurePullingAgent(ob => ob.Configure(options =>
{
options.GetQueueMsgsTimerPeriod = TimeSpan.FromMilliseconds(200);
options.BatchContainerBatchSize = 256;
}));
});
siloBuilder.AddAzureBlobGrainStorage("PubSubStore",
options => options.ConfigureBlobServiceClient(AZURE_STORAGE_CONNECTION_STRING));
}
}
```
|
Nothing jumps out at me; maybe @benjaminpetit will have an idea when he gets back. Do you know why DeactivateOnIdle is being called during activation (referring to the original message)? The serialization exception looks like it occurred after the silo was shut down. |
Thanks @ReubenBond ! I'm not sure why that DeactivateOnIdle was called; however, we have not seen it again. We continue to see the message rejection exceptions, and they seem to only impact Orleans streams (Azure storage stream provider). |
It seems that one silo was still trying to get information on another silo that wasn't alive anymore. Directory cache poisoning? |
Hi @benjaminpetit, to add further context: these message rejection exceptions are only happening for some of the Orleans stream messages, not for the standard grain messages from a cluster client or from other grains. I'm trying to determine where the error is triggered. It may be our grain code, but the grains do activate and handle the stream messages after the cluster scales back in to 6 silos, from 18 at the peak. The message rejection exception we are seeing most often is:
|
To clarify my earlier comment: the streaming infrastructure uses some internal grains, called PubSubRendezvousGrain. It would be interesting to see if you have more directory-related logs. Also, when you scale your cluster up, do you see some silos dying in the meantime? |
Thanks @benjaminpetit. Where would we find more directory logs? What setting would we need to see those? I'm assuming they are at debug level? We do not see silos dying while scaling up, but it does take time for the nodes to be initialized. |
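In case it helps anyone looking for the same setting: the directory traces can usually be surfaced through the standard Microsoft.Extensions.Logging filters. A minimal sketch; the "Orleans.Runtime.GrainDirectory" category prefix is an assumption about how those components are named, so the broad "Orleans" filter at Debug is the safe fallback (just noisier):

```csharp
siloBuilder.ConfigureLogging(logging =>
{
    // Keep general Orleans output at its usual level...
    logging.AddFilter("Orleans", LogLevel.Information);

    // ...and turn on verbose logging for the grain directory components.
    // The category prefix is an assumption; filtering "Orleans" at
    // LogLevel.Debug also works, it is just much noisier.
    logging.AddFilter("Orleans.Runtime.GrainDirectory", LogLevel.Debug);
});
```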
Hi @benjaminpetit, we continue to see these message rejection exceptions, and they cause delays in processing mission-critical messages. We are developing an alternative solution that migrates the processes depending on Orleans streams to Azure Functions, but we're hoping we can find a solution to this. Can you provide any guidance? Do you know of other users that have Orleans streams reliability issues with clusters that periodically scale up and down? Thank you very much! |
I believe I just saw this in one of our clusters as well. We did not have an auto-scale event; this happened out of the blue, with a sudden spike in silo failures/restarts and the message mentioned by @iamsamcoder being logged repeatedly. I saw no clear reason why, and I didn't know about this issue, so I did a full restart of all deployments involved in the cluster, which seems to have remediated the issue for us. Looking at our telemetry afterwards, it looks like there was a problem with a single queue; we run SQS streaming with 4 queues for the affected stream provider, and we only saw one of those queues get backlogged during the incident. Once restarted, all remaining messages in that queue were delivered. |
After some digging today, we are definitely bitten by this. We see quite often that streams start failing with this kind of error, unrelated to scale-ups/downs or rolling deployments. I don't know right now if we have had any other types of silo failures around that time causing a silo restart; I'll continue digging tomorrow. What is consistent is that once a stream starts logging this, it can never recover without restarting the silo that owns the stream/queue. That always fixes it; the silo that takes over the responsibility chugs along just fine. Errors we typically see when this starts happening are:
As it currently stands, I can't find a good way to detect from within the silo that this has happened so I could terminate it and have it restart, so we are relying on manual monitoring for unblocking queues that provide important data to our grains. We could consider swapping to a Redis grain directory here if this is suspected to be a grain directory issue, though looking at #8632 that seems to not be straightforward, since - as far as I understand - you need to be able to annotate the grain class itself. It's also worth noting that we have millions of concurrent grains, and the only grain for which we ever really see "Failed to register activation in grain directory." outside of a silo crash/rolling restart - and then only as transient errors where retries succeed - is the PubSubRendezvous grain.
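As a stopgap while this is investigated, one option is to watch queue depth from outside the streaming infrastructure and alert (or fail a health probe) when a queue stays backlogged. A rough sketch of that idea using the AWS SDK for SQS; the queue URLs, threshold, and wiring are placeholders, not taken from any real deployment:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public sealed class StreamQueueBacklogMonitor : BackgroundService
{
    private readonly IAmazonSQS _sqs;
    private readonly ILogger<StreamQueueBacklogMonitor> _logger;
    private readonly string[] _queueUrls; // placeholder: the stream provider's queue URLs

    public StreamQueueBacklogMonitor(IAmazonSQS sqs, ILogger<StreamQueueBacklogMonitor> logger, string[] queueUrls)
    {
        _sqs = sqs;
        _logger = logger;
        _queueUrls = queueUrls;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            foreach (var url in _queueUrls)
            {
                // Ask SQS for the approximate backlog on this stream queue.
                var response = await _sqs.GetQueueAttributesAsync(
                    new GetQueueAttributesRequest
                    {
                        QueueUrl = url,
                        AttributeNames = new List<string> { "ApproximateNumberOfMessages" }
                    },
                    stoppingToken);

                // Arbitrary threshold; tune to the expected steady-state backlog.
                if (response.ApproximateNumberOfMessages > 1000)
                {
                    _logger.LogWarning(
                        "Stream queue {QueueUrl} has {Count} pending messages; a pulling agent may be stuck",
                        url, response.ApproximateNumberOfMessages);
                }
            }

            await Task.Delay(TimeSpan.FromMinutes(1), stoppingToken);
        }
    }
}
```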
I don't want to prematurely celebrate, but I think this problem went away when I changed the grain directory of the PubSubRendezvousGrain. Due to #8632 this wasn't super straightforward, but I managed to solve it by adding a named Redis grain directory and registering a custom IGrainDirectoryResolver:

```csharp
public class PubSubRendezvousGrainUseRedisGrainDirectoryResolver : IGrainDirectoryResolver
{
    private readonly IServiceProvider _services;

    public PubSubRendezvousGrainUseRedisGrainDirectoryResolver(IServiceProvider services)
    {
        _services = services;
    }

    public bool TryResolveGrainDirectory(GrainType grainType, GrainProperties properties, out IGrainDirectory grainDirectory)
    {
        if (grainType.Value.ToString() != "pubsubrendezvous")
        {
            grainDirectory = default!;
            return false;
        }

        grainDirectory = _services.GetRequiredServiceByName<IGrainDirectory>("redis-grain-directory");
        return true;
    }
}
```

The resolver is registered in the silo's service collection:

```csharp
services.AddSingleton<IGrainDirectoryResolver, PubSubRendezvousGrainUseRedisGrainDirectoryResolver>();
```

I guess this points to it being a grain directory issue and not directly a streaming issue, but as already mentioned, the only grain in our cluster having this issue was the PubSubRendezvousGrain. |
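For completeness, the named "redis-grain-directory" referenced by the resolver is registered on the silo roughly like this. A sketch only: it assumes the Microsoft.Orleans.GrainDirectory.Redis package, the option shape may differ between Orleans versions, and the connection string is a placeholder:

```csharp
siloBuilder.AddRedisGrainDirectory(
    "redis-grain-directory",
    options =>
    {
        // Placeholder endpoint - point this at your actual Redis instance.
        options.ConfigurationOptions = ConfigurationOptions.Parse("localhost:6379");
    });
```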
@tanordheim That is great to hear! I hope it has resolved the issue. We didn't try a different directory for the pubsub grains; that is interesting. We were in prod and this issue was causing significant delays, so we migrated off Orleans streams to Service Bus and Azure Function apps. This is helpful to know for the future; perhaps we can try Orleans streams again. Thank you for sharing! |
Just an aside, but this looks suspicious... @iamsamcoder

```csharp
siloBuilder.Configure<GrainCollectionOptions>(options =>
{
    // docs here: http://sergeybykov.github.io/orleans/Documentation/clusters_and_clients/configuration_guide/activation_garbage_collection.html
    // set the value of CollectionAge for all grains
    options.CollectionAge = TimeSpan.FromSeconds(collectionAgeMinutes);
})
```

FromSeconds against collectionAgeMinutes? This could contribute to a very spammy DHT |
Good observation @oising ! The naming should have been refactored. I've confirmed that my default of 5 minutes (300 s) is being used. Would that still be a concern for a spammy DHT? We lowered the collection age because memory was becoming an issue for us; however, we plan to increase our machine resources. |
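If memory pressure is the main motivation for the short collection age, one alternative worth considering is keeping the global default higher and overriding it only for the memory-heavy grain types. A sketch, assuming GrainCollectionOptions.ClassSpecificCollectionAge (keyed by the grain class's full type name) is available in your Orleans version; "MyProject.Grains.DocumentGrain" is a hypothetical grain class:

```csharp
siloBuilder.Configure<GrainCollectionOptions>(options =>
{
    // Keep the global default comfortably high...
    options.CollectionAge = TimeSpan.FromMinutes(15);

    // ...and collect only the memory-heavy grain type more aggressively.
    // "MyProject.Grains.DocumentGrain" is a hypothetical grain class name.
    options.ClassSpecificCollectionAge["MyProject.Grains.DocumentGrain"] = TimeSpan.FromMinutes(2);
});
```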
The v7.2.3 release, which aims to fix this, is now available, so I will close this issue. Please open a new issue and reference this one if you still encounter the problem: https://github.com/dotnet/orleans/releases/tag/v7.2.3 |
I've encountered this a couple of times in the last 1.5 weeks. I'll deploy a new revision of my Orleans application, and within a couple of days silos will become unavailable and messages will be undeliverable on some instances. The problematic silos will not recover, and I have to restart the cluster to resolve the issue.
When 1 or more of the 9 silos get into this state where grain messages can't be delivered, Orleans stream messages pushed to the queue also get stuck until I restart the cluster (Container Apps environment). The issue may have started shortly after the last deployment; the last few times it occurred, it seemed to follow shortly after the new release.
I'd appreciate some further guidance on tracking down the issue here.
Here are some further observations:
- 2023-07-07T20:24:08Z: 7.1.2
- 2023-07-10T05:32:23.4393447Z: Container silo failed liveness probe, will be restarted
- 2023-07-10T05:34:30.094215Z: Container 'silo' was terminated with exit code '1'
- 2023-07-10T05:34:20.2993575Z: Orleans.Runtime.OrleansMessageRejectionException
- 2023-07-10T05:34:23.1327681Z: System.ObjectDisposedException at Orleans.Serialization.Serializers.CodecProvider.GetServiceOrCreateInstance
- 2023-07-10T05:37:08.2696465Z: System.InvalidOperationException at Orleans.Runtime.ActivationData.StartDeactivating
- As of 2023-07-10T20:48:15.9391397Z: still seeing Orleans.Runtime.OrleansMessageRejectionException, and there are 34k Orleans messages stuck in queue-1.