
[QUERY] Azure Communication Services Chat slow/long response times #47463

Open
ErikAndreas opened this issue Dec 9, 2024 · 25 comments
Labels: Client, Communication - Chat, Communication - Identity, customer-reported, needs-team-attention, question, Service Attention

Comments


ErikAndreas commented Dec 9, 2024

Library name and version

Azure.Communication.Chat 1.3.1

Query/Question

We're consistently seeing slower and slower response times from various Chat SDK endpoints (we use the Chat SDK from Azure Functions). Most calls take 200+ ms, and sometimes we get timeouts (status code 499?).

This makes a rather "standard" use case, e.g. listing chat threads (GetAccessToken, GetMessages + GetReadReceipts) plus some info about their contents, almost unbearable from a UX perspective when each call takes this long.
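For context, the per-request flow is roughly the following (illustrative sketch, not our actual code; the endpoint and token are placeholders):

```csharp
// Illustrative sketch of the listing flow described above (not our actual code);
// endpoint and user access token are placeholders.
using Azure.Communication;
using Azure.Communication.Chat;

string userAccessToken = "<token from GetAccessToken>";
var chatClient = new ChatClient(
    new Uri("https://<acsInstanceName>.europe.communication.azure.com"),
    new CommunicationTokenCredential(userAccessToken));

await foreach (ChatThreadItem thread in chatClient.GetChatThreadsAsync())
{
    ChatThreadClient threadClient = chatClient.GetChatThreadClient(thread.Id);

    // One round trip per thread for read receipts...
    await foreach (ChatMessageReadReceipt receipt in threadClient.GetReadReceiptsAsync())
    {
        // collect receipt info
    }

    // ...and another for messages; at 200+ ms per call this adds up quickly.
    await foreach (ChatMessage message in threadClient.GetMessagesAsync())
    {
        break; // we only need the most recent message
    }
}
```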

InvocationId f04bd15e-a07b-4bf1-8c5b-5c3c43f9b1f5 makes 11 requests to Chat SDK endpoints, with response times in the range of 101-352 ms.

We're also seeing cases where GetAccessToken (POST /identities/<ACS UserId>/:issueAccessToken) completely times out:
b5b1845e-fa3c-4000-8632-315256c01552
3206e6ee-1623-482c-8f8c-609fafc80d80
3a121576-8b85-4a35-a3c9-598e95e67f9e

Environment

Azure Functions, .NET 8 isolated worker process

github-actions bot added the Client, Communication - Chat, customer-reported, needs-team-triage, and question labels on Dec 9, 2024

ErikAndreas commented Dec 9, 2024

Stats for dependency calls to (acsInstanceName).europe.communication.azure.com over the last 24 hrs:
[screenshot: App Insights dependency duration stats]


gabbsmo commented Dec 9, 2024

Also seeing similar behaviour with Communication Services Email: connections close or time out at around 1 minute under peak load that should be well within our sending quota.

jsquire added the Service Attention and Communication labels and removed the needs-team-triage label on Dec 9, 2024

github-actions bot commented Dec 9, 2024

Thanks for the feedback! We are routing this to the appropriate team for follow-up.

github-actions bot added the needs-team-attention label on Dec 9, 2024
@ErikAndreas (Author)

GetAccessToken (POST /identities/<ACS UserId>/:issueAccessToken) has very long response times
InvocationId e292e400-e496-4d71-9fbc-8a97d2405960, ClientRequestId 3a4472f4-96db-4c3a-87d6-c45192ab9562: 37.2 secs
InvocationId 6140e077-5f38-497e-8abd-a7cf7d101b5c, ClientRequestId 56634e3c-4d46-4c10-9bae-356ef7423627: 24.7 secs
InvocationId 04ecde28-b625-4ea5-bb6d-2e8afe9ebb06, ClientRequestId 6fca2aed-07c9-4d97-b95a-51ee28822547: 12.6 secs

This can't be expected, or within the range of what should be considered normal?

@ErikAndreas (Author)

Some more slow invocations, some with timeouts:
bf7f49d6-6350-4999-baf7-37120a39bb0c ClientRequestId: f988cbc0-6b78-4761-8ef7-a3eed666184d (timeout, 4 retries)
cb830c6f-c101-4a08-8f4c-07051fbd0e3f ClientRequestId: 02f4db24-9a6b-45b2-a2c5-8df56e7c7399 (1.5 mins)
345b39ba-a6b8-46da-bf10-4d134007fb0b ClientRequestId: 2724f05f-0699-42fe-9b12-3de880c2c6e2 (one timeout at 100 sec, 23.2 secs on 2nd retry)
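Side note: the 100-sec timeout and the retry counts look to me like the Azure.Core defaults (NetworkTimeout 100 s, MaxRetries 3); we haven't overridden anything. A sketch of what I assume tightening them would look like via ChatClientOptions, so a hung call fails faster:

```csharp
// Assumption on my side: these are the Azure.Core defaults at work
// (NetworkTimeout = 100 s, MaxRetries = 3). A sketch of tightening them.
using Azure.Communication;
using Azure.Communication.Chat;
using Azure.Core;

var options = new ChatClientOptions();
options.Retry.NetworkTimeout = TimeSpan.FromSeconds(10); // per-attempt timeout
options.Retry.MaxRetries = 2;
options.Retry.Mode = RetryMode.Exponential;

var chatClient = new ChatClient(
    new Uri("https://<acsInstanceName>.europe.communication.azure.com"),
    new CommunicationTokenCredential("<user access token>"),
    options);
```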

@jiminwen-msft (Member)

Hi @ErikAndreas

Can you provide the MS-CV values for the requests as well? They are in the response headers. Please also include the timestamps of the requests.


ErikAndreas commented Dec 17, 2024

Hi @jiminwen-msft
we're not logging that much actively, and since we're using the Chat SDK most info is whatever App Insights captures by default; I don't think response headers are being logged?
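If we do need MS-CV going forward, I assume something like the following would surface it (Azure.Core redacts headers unless they are explicitly allow-listed); we don't run this today:

```csharp
// Not something we run today - a sketch, assuming Azure.Core's DiagnosticsOptions
// and AzureEventSourceListener, of how MS-CV could be surfaced in our logs.
using Azure.Communication;
using Azure.Communication.Chat;
using Azure.Core.Diagnostics;

// Forward Azure SDK HTTP logs (including allow-listed headers) to the console.
using var listener = AzureEventSourceListener.CreateConsoleLogger();

var options = new ChatClientOptions();
options.Diagnostics.LoggedHeaderNames.Add("MS-CV"); // otherwise logged as REDACTED

var chatClient = new ChatClient(
    new Uri("https://<acsInstanceName>.europe.communication.azure.com"),
    new CommunicationTokenCredential("<user access token>"),
    options);
```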

a few other samples with timestamps (UTC) though:
invocationid 1de4eece-2524-4cc1-aa8d-48f14beafab4:
POST /identities/<acsId redacted>/:issueAccessToken 2024-12-17T08:11:44.0279616Z clientreqid 1fe1beeb-045c-4b1e-bf62-738ca4b669eb TIMEOUT
retry at 2024-12-17T08:13:24.9790344Z 1.2 mins

GET /chat/threads/<threadId redacted>/readReceipts 2024-12-17T08:14:34.7000022Z clientreqid 969cfaad-3230-4f3e-bc2b-7a639e181246 417.4 ms

GET /chat/threads/<threadId redacted>/readReceipts 2024-12-17T08:14:34.7010059Z clientreqid 41c44da6-c44f-4eff-b483-1e96cc9cead8 889.4 ms

GET /chat/threads/<threadId redacted>/messages 2024-12-17T08:14:34.9426991Z clientreqid ec0a1283-ebc4-425b-9cf1-6f08935f7857 458.6 ms

GET /chat/threads/<threadId redacted>/messages 2024-12-17T08:14:35.5906255Z clientreqid ecac4057-a5b8-428a-ad4e-9fb327925c9b 354.4 ms

As you can understand, the end-user experience for this one request to our API is not that great (there are more requests made from this function invocation to ACS, but the above were some of the slowest logged responses, as measured by "duration" in App Insights)...

@ErikAndreas (Author)

other endpoints as well:
invocationid f31997fe-778c-44bf-913e-0449d3d56d3a POST /chat/threads 2024-12-17T09:51:26.0435891Z clientreqid 97294611-c5f4-4b33-ad58-880412daa29d 449.0 ms

invocationid 46aa3e10-dde6-4396-98c5-dca89c4ddc33 POST /chat/threads/<threadId redacted>/messages 2024-12-17T09:52:49.1113092Z clientreqid e8ba7331-5f9e-4238-9758-4754e398bcae 1.1 s


jiminwen-msft commented Dec 18, 2024

@ErikAndreas
Do you have the resource IDs or thread IDs that show the long-latency problems? Is the issue still there?

jiminwen-msft self-assigned this on Dec 18, 2024

ErikAndreas commented Dec 18, 2024

Yes, the issue is still there. The slowest endpoint, and the one with the most timeouts, is POST /identities/<acsId redacted>/:issueAccessToken.

A few slow entries just now:
2024-12-18T08:01:19.0936347Z
ClientRequestId 42bb38f6-caf8-43a4-92df-a530ad77bda9
Duration 833.6 ms
Name GET /chat/threads/19%3AacsV1_M5ALS9KsBG71bs9PhSYDabxdxn2bjvD40j_dJa8ltck1%40thread.v2/readReceipts

2024-12-18T08:01:19.3342841Z
ClientRequestId 0a25de7c-9774-4609-87d9-29dc39ab66e3
Duration 1.1 s
Name GET /chat/threads/19%3AacsV1_5pOyYsZCOyAdADqQRjJ4HeOhxnmQ-C3X5v7CywMy51k1%40thread.v2/messages

2024-12-18T08:01:19.9186785Z
ClientRequestId cbbf8f14-472d-4812-a694-aaa7bb1d46d9
Duration 866.8 ms
Name GET /chat/threads/19%3AacsV1_qE3HKwRNkYNIvNx1u77M_wbxzlJjAKzcDIDN-jproiM1%40thread.v2/messages

Looking at the performance blade of App Insights, filtered on target containing '(our acs instance).europe.communication.azure.com' for the last 24 hrs, we're still at roughly the same numbers as in my previous screenshot: 50th percentile at 710 ms, 99th at 2.4 s. Rather slow for an API, I'd say...

@jiminwen-msft (Member)

@ErikAndreas

Since the slowest responses are auth-related, I suggest resolving that first.

@jiminwen-msft (Member)

I cannot find the requests by querying the thread IDs on our service; the requests may not have arrived on our end. Can you share the resource ID? It is available on the overview page of your ACS resource. @ErikAndreas


ErikAndreas commented Dec 19, 2024

> Since the slowest responses are auth-related, I suggest resolving that first.

Not following: the GetToken(Async) method (which I assume calls the underlying REST endpoint /identities/<acsId>/:issueAccessToken) is part of the (same) SDK? But that call/endpoint alone can't account for the response times of all the others, you can't mean that, right?
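To be concrete, this is the call I mean (CommunicationIdentityClient from Azure.Communication.Identity; connection string and user id are placeholders):

```csharp
// The token issuance path I'm referring to; connection string and user id are
// placeholders. As far as I understand this backs POST /identities/{id}/:issueAccessToken.
using Azure;
using Azure.Communication;
using Azure.Communication.Identity;
using Azure.Core;

var identityClient = new CommunicationIdentityClient("<ACS connection string>");
var user = new CommunicationUserIdentifier("<ACS UserId>");

Response<AccessToken> tokenResponse =
    await identityClient.GetTokenAsync(user, new[] { CommunicationTokenScope.Chat });

string userAccessToken = tokenResponse.Value.Token;       // handed to ChatClient
DateTimeOffset expiresOn = tokenResponse.Value.ExpiresOn; // token lifetime
```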

> I cannot find the requests by querying the thread IDs on our service; the requests may not have arrived on our end. Can you share the resource ID? It is available on the overview page of your ACS resource.

Can I somehow provide you the resource ID privately (I can't post it publicly)? And you can't be suggesting the requests aren't arriving? We are getting replies/data, just very slowly...

jiminwen-msft removed their assignment on Dec 19, 2024

jiminwen-msft commented Dec 19, 2024

@ErikAndreas
I am able to find the resource based on the thread IDs you provided. The latency for ACS Chat operations seems normal on the chat service side; the p95 can go up to 600 ms during peak traffic.
[screenshot: service-side latency metrics]

Does the issue apply only to a small percentage of requests (p95, p99)?

@ErikAndreas (Author)

No; as previously stated and shown in my App Insights screenshot, we're fairly constant at 600+ ms for the 50th percentile.

I strongly disagree that any of the numbers on your graph (lowest at 350 ms) should be considered "normal"; I'd say such response times are way above what you'd expect from a modern API. Consider that any real feature requires multiple calls to the service; again, the UX of such a feature is not great.

Our experience, though, is that it wasn't this bad when we started using the service, but we don't have numbers to back that up.

Can we get an "official" statement that those numbers are "normal", i.e. what we should expect from ACS (Chat)?

@jiminwen-msft (Member)

[screenshot: service-side latency metrics for the list_messages operation]

Service-side metrics show p50 at around 100 ms and p95 at around 350 ms.

GetAccessToken seems to be a contributor to the high latency. It is managed by another team; I am engaging them to see if they have insights on the GetAccessToken latency.

@jiminwen-msft (Member)

@ErikAndreas The graph above is the list_messages operation on the service side. Does it match the latency for this particular operation on your side? If it matches, the latency is expected. If the latency on your side is significantly higher, it could be a network issue or something in the SDK.

@ErikAndreas (Author)

So your graph/numbers exclude GetAccessToken? That might make sense. Please do check with the other team. I will be off for the holidays now and will check back later.


anny-hiq commented Jan 7, 2025

@jiminwen-msft I'd say we're seeing higher response times than indicated by your service-side metrics, but that's from an SDK-usage perspective, and we have reason to believe the SDK does not handle paging very well, see #45816 (page size seems to be fixed at six messages). Any feedback on that?

@jiminwen-msft (Member)

Did you try using the Azure CLI to invoke the target endpoints with the same requests? If the high-latency issues also show up with the Azure CLI, they are likely caused by network settings. The SDK itself does not do much extra processing after getting responses from the service side.
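If the CLI is not convenient, timing the raw REST call gives the same signal. A rough sketch (the api-version below is only an example value; use whatever your SDK version targets):

```csharp
// Sketch: time the raw REST call to separate network/service latency from SDK
// overhead. Endpoint, thread id, token and api-version are example values.
using System.Diagnostics;
using System.Net.Http.Headers;

using var http = new HttpClient
{
    BaseAddress = new Uri("https://<acsInstanceName>.europe.communication.azure.com")
};
http.DefaultRequestHeaders.Authorization =
    new AuthenticationHeaderValue("Bearer", "<user access token>");

var sw = Stopwatch.StartNew();
HttpResponseMessage response = await http.GetAsync(
    "/chat/threads/<threadId>/messages?api-version=2021-09-07");
sw.Stop();

string msCv = response.Headers.TryGetValues("MS-CV", out var values)
    ? string.Join(",", values)
    : "n/a";
Console.WriteLine($"{(int)response.StatusCode} in {sw.ElapsedMilliseconds} ms, MS-CV: {msCv}");
```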

@jiminwen-msft (Member)

I think max_page_size can be set as a parameter when listing messages. Can you try setting a higher value?
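In the .NET SDK I believe that translates to a page-size hint via AsPages, roughly (endpoint, token and thread id are placeholders):

```csharp
// Sketch (.NET SDK): request a larger page via the pageSizeHint on AsPages;
// endpoint, token and thread id are placeholders.
using Azure;
using Azure.Communication;
using Azure.Communication.Chat;

var chatClient = new ChatClient(
    new Uri("https://<acsInstanceName>.europe.communication.azure.com"),
    new CommunicationTokenCredential("<user access token>"));
ChatThreadClient threadClient = chatClient.GetChatThreadClient("<threadId>");

await foreach (Page<ChatMessage> page in threadClient.GetMessagesAsync().AsPages(pageSizeHint: 50))
{
    Console.WriteLine($"Got {page.Values.Count} messages in this page");
    break; // first page only, if that's all you need
}
```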

@ErikAndreas (Author)

I did not try the Azure CLI. I just looked at some stats in App Insights again; we're now caching access tokens for their full lifetime, so getting a token no longer contributes to the slowness. Here is a sample, which must be said to be a fairly typical use case for listing threads (and their meta content): it fetches read receipts and only the first page of messages (we stop after the first "page" since we only need the last message, so any paging-related issues are out of the way). Timestamp 2025-01-16T12:12:55.2323098Z:
[screenshot: App Insights dependency calls for this invocation]

This adds up to an API call to our backend taking almost a second to complete. For a fairly typical, standard use case, that is not good enough?!

I should add that we do GetChatThreadClient, get read receipts, and get messages (per thread) asynchronously in an unawaited loop and then await Task.WhenAll (roughly as in the sketch below), so I don't think we can be much more performant on our end?
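For reference, the fan-out looks roughly like this (simplified; the real code maps results into our own DTOs, and the endpoint/token are placeholders):

```csharp
// Simplified version of our fan-out; endpoint and cached token are placeholders,
// and the real code maps the results into our own DTOs.
using Azure;
using Azure.Communication;
using Azure.Communication.Chat;
using System.Linq;

var chatClient = new ChatClient(
    new Uri("https://<acsInstanceName>.europe.communication.azure.com"),
    new CommunicationTokenCredential("<cached user access token>"));

var tasks = new List<Task>();
await foreach (ChatThreadItem thread in chatClient.GetChatThreadsAsync())
{
    ChatThreadClient threadClient = chatClient.GetChatThreadClient(thread.Id);
    tasks.Add(LoadThreadSummaryAsync(threadClient)); // not awaited here
}
await Task.WhenAll(tasks); // all per-thread calls in flight concurrently

static async Task LoadThreadSummaryAsync(ChatThreadClient threadClient)
{
    // Read receipts for the thread...
    await foreach (ChatMessageReadReceipt receipt in threadClient.GetReadReceiptsAsync())
    {
        // collect receipt info
    }

    // ...and only the first page of messages, since we just need the latest one.
    await foreach (Page<ChatMessage> page in threadClient.GetMessagesAsync().AsPages())
    {
        ChatMessage? latest = page.Values.FirstOrDefault(); // what we surface to the UI
        break;
    }
}
```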


gabbsmo commented Jan 23, 2025

@ErikAndreas Are you running your app in Azure WebJobs or Functions by any chance? I am having a lot of trouble with throughput in WebJobs, and QueueListeners creating a lot of outgoing TCP connections seems to be a factor. See Azure/azure-webjobs-sdk#3116

@ErikAndreas (Author)

@gabbsmo As per the original issue description, we're using Azure Functions .NET 8 isolated (both Consumption and Premium plans, varying by environment, but I'd say we see the same slowness regardless).
