Performance and Subscription Issues During Operation in Kubernetes Using Custom Helm Chart #1237
Comments
Hi @michelbarnich, Many thanks for this detailed report and very sorry for the late reply. First, a general question: why are you using QuantumLeap? It was typically used with NGSIv2 context brokers, which had no temporal API, but it is generally not needed with NGSI-LD context brokers, which have a native temporal API. I had a quick look at the QuantumLeap repository on GitHub and I am not sure the NGSI-LD support is complete (nothing really new for NGSI-LD since 2021, which is a really long time for NGSI-LD!). We did something a bit similar with Apache NiFi to export denormalized representations of entities, easily usable in a BI tool like Apache Superset, and it is quite complex to support the many features of NGSI-LD...
I don't think it will change the numbers a lot, but you could use the Batch Entity Merge endpoint instead: Batch Entity Update will replace the attributes found in the payload, while Batch Entity Merge will merge them.
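For illustration, here is a minimal sketch of switching from the batch update endpoint to the batch merge one. The endpoint path follows the NGSI-LD API convention (`/ngsi-ld/v1/entityOperations/merge`); the host, entity id, and attribute names are placeholders, not values from this issue.

```python
import json
from urllib import request

def build_batch_merge(base_url: str, entities: list) -> request.Request:
    """Build a Batch Entity Merge request (endpoint path per the NGSI-LD API)."""
    return request.Request(
        url=f"{base_url}/ngsi-ld/v1/entityOperations/merge",
        data=json.dumps(entities).encode("utf-8"),
        headers={"Content-Type": "application/ld+json"},
        method="POST",
    )

# Hypothetical entity fragment: with merge, only the attributes to change
# need to be present; other attributes of the entity are left untouched.
fragment = [{
    "id": "urn:ngsi-ld:Device:device-1",
    "type": "Device",
    "@context": "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld",
    "temperature": {"type": "Property", "value": 21.5},
}]

req = build_batch_merge("http://stellio.example.org", fragment)
print(req.full_url)  # http://stellio.example.org/ngsi-ld/v1/entityOperations/merge
```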
The results are surprising to me. We ran a performance campaign at the beginning of 2024 and achieved considerably better results on a single "standard" VM. You can see some numbers at https://stellio.readthedocs.io/en/latest/admin/performance.html (we'll run them again after the fix to the DB connection pool, see the end of my answer). The main difference between the two configurations is that we were running Stellio with docker-compose while you are deploying it in a Kubernetes cluster. Do you mind if I use your Helm charts? I'd like to reproduce your environment as closely as possible and analyze the cause of the performance issues.
The total RAM used is what we typically see in our deployments. The components using most of it are typically PostgreSQL and Kafka, followed by the search and subscription services. The IOPS are very high, but this is not a problem: it only means that you have good storage performance :)
I don't know how QuantumLeap works, but there is one thing to keep in mind. As explained in https://stellio.readthedocs.io/en/latest/user/internal_event_model.html, if you update two attributes in an entity, Stellio will internally trigger two events, which may end up as two notifications with the same content being sent (depending on the subscription). This behavior may cause some duplications if not properly handled.
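Since two attribute updates can yield two notifications with identical content, a notification consumer can defend itself by dropping exact duplicates. A minimal sketch (the field names `data` and `notifiedAt` follow the NGSI-LD notification format; the dedup-by-hash strategy is an illustration, not part of Stellio or QuantumLeap):

```python
import hashlib
import json

_seen: set = set()

def is_duplicate(notification: dict) -> bool:
    """Return True if an identical notification body was already processed.

    Keyed on a hash of the 'data' part only, so fields like 'notifiedAt'
    (which differ between the two internal events) don't defeat the check.
    """
    digest = hashlib.sha256(
        json.dumps(notification.get("data"), sort_keys=True).encode()
    ).hexdigest()
    if digest in _seen:
        return True
    _seen.add(digest)
    return False

# Two internal events carrying the same entity state:
n1 = {"notifiedAt": "2024-05-01T10:00:00Z", "data": [{"id": "urn:ngsi-ld:Device:1"}]}
n2 = {"notifiedAt": "2024-05-01T10:00:01Z", "data": [{"id": "urn:ngsi-ld:Device:1"}]}
print(is_duplicate(n1), is_duplicate(n2))  # False True
```

In production the `_seen` set would need a bound (e.g. an LRU or time window), but the idea is the same.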
What was the problem? Subscription service not able to send the data or QuantumLeap not able to handle the rate?
Among other things (https://stellio.readthedocs.io/en/latest/user/internal_event_model.html), Kafka is used to decouple the communication between the search and subscription services. If it is only used for the communication between these two services, you can safely set a low retention time in Kafka (IIRC, Kafka has a 7-day retention period by default).
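As an example, the broker-wide default retention can be lowered in Kafka's `server.properties` (these are standard Kafka broker properties; the values shown are illustrative, not a Stellio recommendation):

```properties
# server.properties — lower the broker-wide default from Kafka's 7-day default
log.retention.hours=24
# optionally also cap retained log size per partition (1 GiB here)
log.retention.bytes=1073741824
```

Retention can also be set per topic (`retention.ms`) if other topics need to keep the default.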
Yes, you can follow this to change the log level of a module: https://stellio.readthedocs.io/en/latest/admin/misc_configuration_tips.html#change-the-log-level-of-a-library-namespace (you can use
Indeed, you spotted a big issue here! There was a problem in the configuration of the connection pool and it was not working properly. I created a PR (#1241) which will be internally reviewed today. Once it is validated, I will publish a fix release of Stellio. From the tests I've done, it should also fix the problem with the subscription service struggling to reconnect to the DB.
Hello, Thank you for your answer. Of course you can use our Helm chart for testing. During the day, I will come back with answers to your questions and suggestions. Thank you very much!
Hi @bobeal, I wanted to provide some additional context and updates regarding our setup and results.
General question: we use QuantumLeap for a denormalized representation of entities.
1. Test results: after reviewing your test results, we suspect that the performance degradation might be due to larger entities in our system.
2. We have had the problem you describe, and we therefore only subscribe to one field in the entity which we know will always be updated, i.e. "dateObserved". The behavior can therefore not be caused by the workings of the internal event model. time_index is an attribute set by QuantumLeap: when QuantumLeap gets a notification, it checks all observed_at properties and uses the newest one to set the time_index.
3. Subscription service: from our analysis, it seems that the subscription service might struggle to check subscription triggers correctly, though this is still a suspicion at this stage.
Thanks for your suggestions and the pull request. We'll test it out and report back.
Hi @michelbarnich,
Ok, so similar to what we are doing with NiFi :)
I am currently running our load test suite to get fresh new numbers. I'll add some tests with an entity similar to the one you are using and see how it behaves.
OK, let me think and run some tests on this one.
Hi there, I hope you're doing well! I wanted to check in and see if there have been any updates on the performance issue. We're planning to run some tests soon to choose a broker for a lighthouse project, and would love to know if there's anything new we should be aware of before proceeding. Best regards,
Hi @michelbarnich, The problem with the DB connection pool has been fixed and is part of the recently released version 2.17.1. We took this opportunity to run our load test suite, and the results were indeed better. We have since been busy on other topics, but in the next two weeks we are planning to:
At the beginning of November, we should have some time to use your Helm charts and do some testing in a k8s environment. I am also keeping in mind the problem with the subscription service (I was wondering whether it was also related to the DB connection pool...). By the way, were you able to get more info about the API-Gateway container crashing during the load tests? Regards, Benoit.
Hello @michelbarnich, We just ran some tests with a larger entity having the same number of attributes (and the same "topology") as the one you provided in the issue. Using the exact same hardware configuration as in our previous load tests, we noticed:
At first sight, this seems consistent with our previous results. We have some ideas to improve the creation time and will work on them soon. I'll let you know when we have some progress on the other topics.
Description:
We are currently deploying Stellio in a Kubernetes environment using our own Helm Chart.
We are using the following versions for the different components:
During performance tests, several key issues and observations were identified regarding resource usage, data insertion, and subscription behavior. Below is a detailed summary of our findings:
Performance Tests Overview
The data is sent to the following endpoint:
http://<stellio API service>/ngsi-ld/v1/entityOperations/upsert?options=update
This is our subscription:
Issues and Observations
1. Inserting/Updating Entities One-by-One:
Graphs for CPU and IOPS during tests:
Is there a way for us to improve the resource usage?
2. Subscriptions and Insertion Behavior:
Example of duplicated/wrong updates (query results):
The timestamp and/or entity_id ends up equal to other entries, even though the entity_id and timestamp were different when the messages were originally sent to Stellio. Stellio appears to accidentally merge multiple messages together, resulting in wrong entries for certain entities or timestamps.
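The suspect rows described above can be spotted by grouping the stored records by their key columns. A small sketch, assuming rows have already been fetched as `(entity_id, time_index)` tuples (the column names follow QuantumLeap's convention; the values are made up):

```python
from collections import Counter

def duplicate_keys(rows):
    """Given (entity_id, time_index) tuples from the QuantumLeap table,
    return the keys that appear more than once, i.e. suspect merged rows."""
    counts = Counter(rows)
    return [key for key, n in counts.items() if n > 1]

rows = [
    ("urn:ngsi-ld:Device:1", "2024-05-01T10:00:00Z"),
    ("urn:ngsi-ld:Device:2", "2024-05-01T10:00:00Z"),
    ("urn:ngsi-ld:Device:1", "2024-05-01T10:00:00Z"),  # unexpected duplicate
]
print(duplicate_keys(rows))  # [('urn:ngsi-ld:Device:1', '2024-05-01T10:00:00Z')]
```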
3. Batch Insertion Performance:
Graph showing improved resource usage with batches:
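The batching that gave the improved resource usage can be sketched as a simple chunking step on the sender side before calling `entityOperations/upsert`. The batch size of 100 is an arbitrary example value, not a Stellio recommendation:

```python
def chunked(entities, size=100):
    """Split a list of entities into fixed-size batches, so that one
    entityOperations/upsert call carries `size` entities instead of one."""
    for i in range(0, len(entities), size):
        yield entities[i:i + size]

# 250 hypothetical entities -> 3 upsert calls instead of 250
batches = list(chunked([{"id": f"urn:ngsi-ld:Device:{n}"} for n in range(250)], 100))
print([len(b) for b in batches])  # [100, 100, 50]
```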
4. Kafka Configuration:
5. Subscription Component Behavior:
6. API-Gateway Container Crashes in Load Tests (Other Environments):
7. Postgres max connections:
Stellio doesn't seem to use one (or a few) long-lived connections to its DB, but rather opens a new connection for each message it receives. Under high load, this leads to a problem in the database:
High message rates trigger the error "remaining connection slots are reserved for non-replication superuser connections".
This could be handled in two ways: pooling connections/transactions in Stellio, or putting PgBouncer in front of PostgreSQL.
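The difference between connection-per-message and pooling can be illustrated with a toy pool. This is only a conceptual sketch, not Stellio's actual (R2DBC-based) pool; `connect` stands in for a real driver's connect call:

```python
import queue

class ConnectionPool:
    """Toy fixed-size pool: connections are created once and reused,
    instead of opening a new one per message."""
    def __init__(self, connect, size=5):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self):
        return self._pool.get()   # blocks when all connections are in use

    def release(self, conn):
        self._pool.put(conn)

opened = []
pool = ConnectionPool(lambda: opened.append(object()) or opened[-1], size=2)
for _ in range(10):               # 10 "messages", but only 2 connections ever opened
    conn = pool.acquire()
    pool.release(conn)
print(len(opened))  # 2
```

Capping the pool size below PostgreSQL's `max_connections` is what prevents the "remaining connection slots" error; PgBouncer applies the same idea outside the application.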
We hope this feedback is helpful, and we’d appreciate any insights or recommendations on addressing these issues, particularly around Kafka configurations, PostgreSQL reconnections, subscriptions during batch inserts, and the API-Gateway crashes.