[Bug]: After the message queue component's pod is killed and recovered, all Milvus flush operations time out. #39197

Status: Open · opened by zhuwenxing (Contributor) on Jan 13, 2025 · 2 comments
Labels: feature/streaming node · kind/bug · triage/accepted
Milestone: 2.6.0
Assignee: chyezh

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20250110-e5eb1159-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):kafka/pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2025-01-11T07:17:31.906Z] platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0 -- /usr/bin/python3

[2025-01-11T07:17:31.906Z] cachedir: .pytest_cache

[2025-01-11T07:17:31.906Z] metadata: {'Python': '3.10.12', 'Platform': 'Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'asyncio': '0.24.0', 'Faker': '19.2.0', 'allure-pytest': '2.7.0', 'assume': '2.4.3', 'cov': '2.8.1', 'forked': '1.6.0', 'html': '3.1.1', 'level': '0.1.1', 'metadata': '3.1.1', 'parallel': '0.1.1', 'print': '0.2.1', 'random-order': '1.1.1', 'repeat': '0.8.0', 'rerunfailures': '14.0', 'sugar': '0.9.5', 'tags': '1.8.1', 'timeout': '1.3.3', 'xdist': '2.5.0'}, 'CI': 'true', 'BUILD_NUMBER': '19115', 'BUILD_ID': '19115', 'BUILD_URL': 'https://qa-jenkins.milvus.io/job/chaos-test-kafka-cron/19115/', 'NODE_NAME': 'chaos-test-kafka-cron-19115-wszms-x9m3v-ng6b7', 'JOB_NAME': 'chaos-test-kafka-cron', 'BUILD_TAG': 'jenkins-chaos-test-kafka-cron-19115', 'EXECUTOR_NUMBER': '0', 'JENKINS_URL': 'https://qa-jenkins.milvus.io/', 'WORKSPACE': '/home/jenkins/agent/workspace', 'GIT_COMMIT': 'f2ce474eca96d53d42cff453ab752ce66719ccfe', 'GIT_URL': '[email protected]:zhuwenxing/test-jobs.git', 'GIT_BRANCH': 'origin/main'}

[2025-01-11T07:17:31.906Z] Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>

[2025-01-11T07:17:31.906Z] tags: all

[2025-01-11T07:17:31.906Z] rootdir: /home/jenkins/agent/workspace/tests/python_client

[2025-01-11T07:17:31.906Z] configfile: pytest.ini

[2025-01-11T07:17:31.906Z] plugins: asyncio-0.24.0, Faker-19.2.0, allure-pytest-2.7.0, assume-2.4.3, cov-2.8.1, forked-1.6.0, html-3.1.1, level-0.1.1, metadata-3.1.1, parallel-0.1.1, print-0.2.1, random-order-1.1.1, repeat-0.8.0, rerunfailures-14.0, sugar-0.9.5, tags-1.8.1, timeout-1.3.3, xdist-2.5.0

[2025-01-11T07:17:31.906Z] asyncio: mode=strict, default_loop_scope=function

[2025-01-11T07:17:31.906Z] collecting ... collected 2 items

[2025-01-11T07:17:31.906Z] selected 2 items

[2025-01-11T07:17:31.906Z] 

[2025-01-11T07:17:31.906Z] testcases/test_data_persistence.py::TestDataPersistence::test_milvus_default[default] 

[2025-01-11T07:17:31.906Z] -------------------------------- live log setup --------------------------------

[2025-01-11T07:17:31.906Z] [2025-01-11 07:17:31 - INFO - ci_test]: ################################################################################ (conftest.py:232)

[2025-01-11T07:17:31.906Z] [2025-01-11 07:17:31 - INFO - ci_test]: [initialize_milvus] Log cleaned up, start testing... (conftest.py:233)

[2025-01-11T07:17:31.906Z] [2025-01-11 07:17:31 - INFO - ci_test]: [setup_class] Start setup class... (client_base.py:41)

[2025-01-11T07:17:31.906Z] [2025-01-11 07:17:31 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:47)

[2025-01-11T07:17:31.906Z] [2025-01-11 07:17:31 - INFO - ci_test]: pymilvus version: 2.6.0rc44 (client_base.py:48)

[2025-01-11T07:17:31.906Z] [2025-01-11 07:17:31 - INFO - ci_test]: [setup_method] Start setup test case test_milvus_default. (client_base.py:49)

[2025-01-11T07:17:31.906Z] -------------------------------- live log call ---------------------------------

[2025-01-11T07:17:31.906Z] [2025-01-11 07:17:31 - INFO - ci_test]: server version: e5eb115 (client_base.py:166)

[2025-01-11T07:17:31.906Z] [2025-01-11 07:17:31 - INFO - ci_test]: all database: ['prod', 'default'] (test_data_persistence.py:26)

[2025-01-11T07:20:39.359Z] 2025-01-11 07:20:31,627 [WARNING][handler]: Retry timeout: 180s (decorators.py:106)

[2025-01-11T07:20:39.360Z] 2025-01-11 07:20:31,628 [ERROR][handler]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Hello_Milvus, flusht_ts: 455233988598431755)>, <Time:{'RPC start': '2025-01-11 07:17:31.573124', 'RPC error': '2025-01-11 07:20:31.627967'}> (decorators.py:140)

[2025-01-11T07:20:39.360Z] [2025-01-11 07:20:31 - ERROR - ci_test]: Traceback (most recent call last):

[2025-01-11T07:20:39.360Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 22, in inner_wrapper

[2025-01-11T07:20:39.360Z]     res = func(*args, **_kwargs)

[2025-01-11T07:20:39.360Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 53, in api_request

[2025-01-11T07:20:39.360Z]     return func(*arg, **kwargs)

[2025-01-11T07:20:39.360Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/orm/collection.py", line 318, in flush

[2025-01-11T07:20:39.360Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2025-01-11T07:20:39.360Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 141, in handler

[2025-01-11T07:20:39.360Z]     raise e from e

[2025-01-11T07:20:39.360Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 137, in handler

[2025-01-11T07:20:39.360Z]     return func(*args, **kwargs)

[2025-01-11T07:20:39.360Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 176, in handler

[2025-01-11T07:20:39.360Z]     return func(self, *args, **kwargs)

[2025-01-11T07:20:39.360Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 107, in handler

[2025-01-11T07:20:39.360Z]     raise MilvusException(

[2025-01-11T07:20:39.360Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Hello_Milvus, flusht_ts: 455233988598431755)>

[2025-01-11T07:20:39.360Z]  (api_request.py:35)

[2025-01-11T07:20:39.360Z] [2025-01-11 07:20:31 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Hello_Milvus, flusht_ts: 455233988598431755)> (api_request.py:36)

[2025-01-11T07:20:39.360Z] FAILED

[2025-01-11T07:20:39.360Z] ------------------------------ live log teardown -------------------------------

[2025-01-11T07:20:39.360Z] [2025-01-11 07:20:31 - INFO - ci_test]: *********************************** teardown *********************************** (test_data_persistence.py:15)

[2025-01-11T07:20:39.360Z] [2025-01-11 07:20:31 - INFO - ci_test]: [teardown_method] Start teardown test case test_milvus_default... (test_data_persistence.py:16)

[2025-01-11T07:20:39.360Z] [2025-01-11 07:20:31 - INFO - ci_test]: skip drop collection (test_data_persistence.py:18)

[2025-01-11T07:20:39.360Z] 

[2025-01-11T07:20:39.360Z] testcases/test_data_persistence.py::TestDataPersistence::test_milvus_default[prod] 

[2025-01-11T07:20:39.360Z] -------------------------------- live log setup --------------------------------

[2025-01-11T07:20:39.360Z] [2025-01-11 07:20:31 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:47)

[2025-01-11T07:20:39.360Z] [2025-01-11 07:20:31 - INFO - ci_test]: pymilvus version: 2.6.0rc44 (client_base.py:48)

[2025-01-11T07:20:39.360Z] [2025-01-11 07:20:31 - INFO - ci_test]: [setup_method] Start setup test case test_milvus_default. (client_base.py:49)

[2025-01-11T07:20:39.360Z] -------------------------------- live log call ---------------------------------

[2025-01-11T07:20:39.360Z] [2025-01-11 07:20:31 - INFO - ci_test]: server version: e5eb115 (client_base.py:166)

[2025-01-11T07:20:39.360Z] [2025-01-11 07:20:31 - INFO - ci_test]: all database: ['default', 'prod'] (test_data_persistence.py:26)

[2025-01-11T07:23:45.771Z] 2025-01-11 07:23:32,179 [WARNING][handler]: Retry timeout: 180s (decorators.py:106)

[2025-01-11T07:23:45.771Z] 2025-01-11 07:23:32,179 [ERROR][handler]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Hello_Milvus, flusht_ts: 455234035837042700)>, <Time:{'RPC start': '2025-01-11 07:20:31.790322', 'RPC error': '2025-01-11 07:23:32.179439'}> (decorators.py:140)

[2025-01-11T07:23:45.771Z] [2025-01-11 07:23:32 - ERROR - ci_test]: Traceback (most recent call last):

[2025-01-11T07:23:45.771Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 22, in inner_wrapper

[2025-01-11T07:23:45.771Z]     res = func(*args, **_kwargs)

[2025-01-11T07:23:45.771Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 53, in api_request

[2025-01-11T07:23:45.771Z]     return func(*arg, **kwargs)

[2025-01-11T07:23:45.771Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/orm/collection.py", line 318, in flush

[2025-01-11T07:23:45.771Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2025-01-11T07:23:45.771Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 141, in handler

[2025-01-11T07:23:45.771Z]     raise e from e

[2025-01-11T07:23:45.771Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 137, in handler

[2025-01-11T07:23:45.771Z]     return func(*args, **kwargs)

[2025-01-11T07:23:45.771Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 176, in handler

[2025-01-11T07:23:45.771Z]     return func(self, *args, **kwargs)

[2025-01-11T07:23:45.771Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 107, in handler

[2025-01-11T07:23:45.771Z]     raise MilvusException(

[2025-01-11T07:23:45.771Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Hello_Milvus, flusht_ts: 455234035837042700)>

[2025-01-11T07:23:45.771Z]  (api_request.py:35)

[2025-01-11T07:23:45.771Z] [2025-01-11 07:23:32 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Hello_Milvus, flusht_ts: 455234035837042700)> (api_request.py:36)

[2025-01-11T07:23:45.771Z] FAILED

[2025-01-11T07:23:45.771Z] ------------------------------ live log teardown -------------------------------

[2025-01-11T07:23:45.771Z] [2025-01-11 07:23:32 - INFO - ci_test]: *********************************** teardown *********************************** (test_data_persistence.py:15)

[2025-01-11T07:23:45.771Z] [2025-01-11 07:23:32 - INFO - ci_test]: [teardown_method] Start teardown test case test_milvus_default... (test_data_persistence.py:16)

[2025-01-11T07:23:45.771Z] [2025-01-11 07:23:32 - INFO - ci_test]: skip drop collection (test_data_persistence.py:18)

[2025-01-11T07:23:45.771Z] [2025-01-11 07:23:32 - INFO - ci_test]: [teardown_class] Start teardown class... (client_base.py:44)

[2025-01-11T07:23:45.771Z] 

[2025-01-11T07:23:45.771Z] 

[2025-01-11T07:23:45.771Z] =================================== FAILURES ===================================

[2025-01-11T07:23:45.771Z] _______________ TestDataPersistence.test_milvus_default[default] _______________

[2025-01-11T07:23:45.771Z] 

[2025-01-11T07:23:45.771Z] self = <test_data_persistence.TestDataPersistence object at 0x7fcc028688b0>

[2025-01-11T07:23:45.771Z] db_name = 'default'

[2025-01-11T07:23:45.771Z] 

[2025-01-11T07:23:45.771Z]     @pytest.mark.tags(CaseLabel.L3)

[2025-01-11T07:23:45.771Z]     @pytest.mark.parametrize("db_name", ["default", "prod"])

[2025-01-11T07:23:45.771Z]     def test_milvus_default(self, db_name):

[2025-01-11T07:23:45.771Z]         self._connect()

[2025-01-11T07:23:45.771Z]         # create database if not exist

[2025-01-11T07:23:45.771Z]         dbs, _ = self.database_wrap.list_database()

[2025-01-11T07:23:45.771Z]         log.info(f"all database: {dbs}")

[2025-01-11T07:23:45.771Z]         if db_name not in dbs:

[2025-01-11T07:23:45.772Z]             log.info(f"create database {db_name}")

[2025-01-11T07:23:45.772Z]             self.database_wrap.create_database(db_name)

[2025-01-11T07:23:45.772Z]         self.database_wrap.using_database(db_name)

[2025-01-11T07:23:45.772Z]         # create collection

[2025-01-11T07:23:45.772Z]         name = "Hello_Milvus"

[2025-01-11T07:23:45.772Z]         t0 = time.time()

[2025-01-11T07:23:45.772Z]         collection_w = self.init_collection_wrap(name=name, active_trace=True)

[2025-01-11T07:23:45.772Z]         tt = time.time() - t0

[2025-01-11T07:23:45.772Z]         assert collection_w.name == name

[2025-01-11T07:23:45.772Z] >       entities = collection_w.num_entities

[2025-01-11T07:23:45.772Z] 

[2025-01-11T07:23:45.772Z] testcases/test_data_persistence.py:37: 

[2025-01-11T07:23:45.772Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[2025-01-11T07:23:45.772Z] ../base/collection_wrapper.py:67: in num_entities

[2025-01-11T07:23:45.772Z]     self.flush()

[2025-01-11T07:23:45.772Z] ../utils/wrapper.py:18: in inner_wrapper

[2025-01-11T07:23:45.772Z]     res, result = func(*args, **kwargs)

[2025-01-11T07:23:45.772Z] ../base/collection_wrapper.py:164: in flush

[2025-01-11T07:23:45.772Z]     check_items, check, **kwargs).run()

[2025-01-11T07:23:45.772Z] ../check/func_check.py:45: in run

[2025-01-11T07:23:45.772Z]     result = self.assert_succ(self.succ, True)

[2025-01-11T07:23:45.772Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[2025-01-11T07:23:45.772Z] 

[2025-01-11T07:23:45.772Z] self = <check.func_check.ResponseChecker object at 0x7fcc026df160>

[2025-01-11T07:23:45.772Z] actual = False, expect = True

[2025-01-11T07:23:45.772Z] 

[2025-01-11T07:23:45.772Z]     def assert_succ(self, actual, expect):

[2025-01-11T07:23:45.772Z] >       assert actual is expect, f"Response of API {self.func_name} expect {expect}, but got {actual}"

[2025-01-11T07:23:45.772Z] E       AssertionError: Response of API flush expect True, but got False

[2025-01-11T07:23:45.772Z] 

[2025-01-11T07:23:45.772Z] ../check/func_check.py:130: AssertionError

[2025-01-11T07:23:45.772Z] ------------------------------ Captured log setup ------------------------------

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - INFO - ci_test]: ################################################################################ (conftest.py:232)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - INFO - ci_test]: [initialize_milvus] Log cleaned up, start testing... (conftest.py:233)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - INFO - ci_test]: [setup_class] Start setup class... (client_base.py:41)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:47)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - INFO - ci_test]: pymilvus version: 2.6.0rc44 (client_base.py:48)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - INFO - ci_test]: [setup_method] Start setup test case test_milvus_default. (client_base.py:49)

[2025-01-11T07:23:45.772Z] ------------------------------ Captured log call -------------------------------

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': '10.255.20.172', 'port': 19530} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : None  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - INFO - ci_test]: server version: e5eb115 (client_base.py:166)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [list_database] args: ['default', None], kwargs: {} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : ['prod', 'default']  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - INFO - ci_test]: all database: ['prod', 'default'] (test_data_persistence.py:26)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [using_database] args: ['default', 'default'], kwargs: {} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : None  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [FieldSchema] args: ['int64', <DataType.INT64: 5>, ''], kwargs: {'is_primary': False, 'is_partition_key': False, 'nullable': False} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : {'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>}  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [FieldSchema] args: ['varchar', <DataType.VARCHAR: 21>, ''], kwargs: {'max_length': 65535, 'is_primary': False, 'is_partition_key': False, 'nullable': False} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : {'name': 'varchar', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [FieldSchema] args: ['float_vector', <DataType.FLOAT_VECTOR: 101>, ''], kwargs: {'dim': 128, 'is_primary': False, 'nullable': False} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 128}}  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [FieldSchema] args: ['float', <DataType.FLOAT: 10>, ''], kwargs: {'is_primary': False, 'nullable': False} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [FieldSchema] args: ['json_field', <DataType.JSON: 23>, ''], kwargs: {'is_primary': False, 'nullable': False} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : {'name': 'json_field', 'description': '', 'type': <DataType.JSON: 23>}  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [CollectionSchema] args: [[{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}, {'name': 'json_field', 'description': '', 'type': <DataTyp......, kwargs: {'primary_field': 'int64', 'auto_id': False, 'enable_dynamic_field': False} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params......  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [Connections.has_connection] args: ['default'], kwargs: {} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : True  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['Hello_Milvus', {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar', 'description': '', 'type': <DataType.VARC......, kwargs: {'consistency_level': 'Strong'} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_response) : <Collection>:

[2025-01-11T07:23:45.772Z] -------------

[2025-01-11T07:23:45.772Z] <name>: Hello_Milvus

[2025-01-11T07:23:45.772Z] <description>: 

[2025-01-11T07:23:45.772Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'n......  (api_request.py:27)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:17:31 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:52)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:20:31 - ERROR - ci_test]: Traceback (most recent call last):

[2025-01-11T07:23:45.772Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 22, in inner_wrapper

[2025-01-11T07:23:45.772Z]     res = func(*args, **_kwargs)

[2025-01-11T07:23:45.772Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 53, in api_request

[2025-01-11T07:23:45.772Z]     return func(*arg, **kwargs)

[2025-01-11T07:23:45.772Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/orm/collection.py", line 318, in flush

[2025-01-11T07:23:45.772Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2025-01-11T07:23:45.772Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 141, in handler

[2025-01-11T07:23:45.772Z]     raise e from e

[2025-01-11T07:23:45.772Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 137, in handler

[2025-01-11T07:23:45.772Z]     return func(*args, **kwargs)

[2025-01-11T07:23:45.772Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 176, in handler

[2025-01-11T07:23:45.772Z]     return func(self, *args, **kwargs)

[2025-01-11T07:23:45.772Z]   File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 107, in handler

[2025-01-11T07:23:45.772Z]     raise MilvusException(

[2025-01-11T07:23:45.772Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Hello_Milvus, flusht_ts: 455233988598431755)>

[2025-01-11T07:23:45.772Z]  (api_request.py:35)

[2025-01-11T07:23:45.772Z] [2025-01-11 07:20:31 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Hello_Milvus, flusht_ts: 455233988598431755)> (api_request.py:36)

Expected Behavior

No response

Steps To Reproduce

No response
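
No explicit steps were provided, but the failing call can be reconstructed from the CI log above. The following is a minimal, hypothetical sketch: the schema, collection name, host address, and 180 s timeout are taken from the log, while the chaos step itself (killing the Kafka/Pulsar pods and letting them recover while the cluster is running) is injected externally by the chaos framework and is not shown.

```python
# Minimal sketch reconstructed from the CI log above (not the original test).
# Assumes a Milvus cluster with the streaming node enabled, reachable at HOST,
# and that the MQ pods (Kafka or Pulsar) were killed and recovered beforehand.
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema,
    MilvusException, connections,
)

HOST = "10.255.20.172"  # proxy address from the log; replace with your own

connections.connect(host=HOST, port=19530)

fields = [
    FieldSchema("int64", DataType.INT64, is_primary=True),
    FieldSchema("float", DataType.FLOAT),
    FieldSchema("varchar", DataType.VARCHAR, max_length=65535),
    FieldSchema("json_field", DataType.JSON),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection(
    "Hello_Milvus",
    CollectionSchema(fields, auto_id=False),
    consistency_level="Strong",
)

try:
    # After the MQ pod kill/recovery this call never completes; the client-side
    # retry gives up after 180 s with "wait for flush timeout" (see log above).
    collection.flush(timeout=180)
except MilvusException as e:
    print(f"flush failed: {e}")
```

Note that the test triggers the flush indirectly through `collection.num_entities`, which calls `flush()` under the hood, as the traceback above shows.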

Milvus Log

Both failing tests below were run with the streaming node enabled.

kafka pod kill chaos test
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/19115/pipeline
log:
artifacts-kafka-pod-kill-19115-server-logs.tar.gz

pod info

[2025-01-11T07:17:23.022Z] + grep kafka-pod-kill-19115

[2025-01-11T07:17:23.023Z] + kubectl get pods -o wide

[2025-01-11T07:17:23.023Z] kafka-pod-kill-19115-0                                           2/2     Running                  0                9m22s   10.104.32.181   4am-node39   <none>           <none>

[2025-01-11T07:17:23.023Z] kafka-pod-kill-19115-1                                           2/2     Running                  1 (9m12s ago)    9m22s   10.104.19.140   4am-node28   <none>           <none>

[2025-01-11T07:17:23.023Z] kafka-pod-kill-19115-2                                           2/2     Running                  1 (9m11s ago)    9m22s   10.104.30.35    4am-node38   <none>           <none>

[2025-01-11T07:17:23.023Z] kafka-pod-kill-19115-etcd-0                                      1/1     Running                  0                33m     10.104.19.107   4am-node28   <none>           <none>

[2025-01-11T07:17:23.023Z] kafka-pod-kill-19115-etcd-1                                      1/1     Running                  0                33m     10.104.32.163   4am-node39   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-etcd-2                                      1/1     Running                  0                33m     10.104.30.22    4am-node38   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-exporter-bf45b8f56-gszgk                    1/1     Running                  0                9m22s   10.104.9.16     4am-node14   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-datanode-647c98d9dc-g5prt            1/1     Running                  3 (32m ago)      33m     10.104.23.193   4am-node27   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-datanode-647c98d9dc-n7bm8            1/1     Running                  3 (32m ago)      33m     10.104.16.41    4am-node21   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-indexnode-5b858b848-j54p7            1/1     Running                  3 (32m ago)      33m     10.104.23.192   4am-node27   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-indexnode-5b858b848-rdskq            1/1     Running                  3 (32m ago)      33m     10.104.32.155   4am-node39   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-indexnode-5b858b848-w246g            1/1     Running                  3 (32m ago)      33m     10.104.27.140   4am-node31   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-mixcoord-76c4f99656-8x89h            1/1     Running                  3 (32m ago)      33m     10.104.26.73    4am-node32   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-proxy-8655b57684-m7cnw               1/1     Running                  3 (32m ago)      33m     10.104.27.139   4am-node31   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-querynode-7fbfb796f4-bmrnn           1/1     Running                  3 (32m ago)      33m     10.104.34.17    4am-node37   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-querynode-7fbfb796f4-djpks           1/1     Running                  3 (32m ago)      33m     10.104.14.7     4am-node18   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-querynode-7fbfb796f4-v9jnl           1/1     Running                  3 (32m ago)      33m     10.104.26.74    4am-node32   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-milvus-streamingnode-999cc87b4-ldm5z        1/1     Running                  3 (32m ago)      33m     10.104.23.191   4am-node27   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-minio-0                                     1/1     Running                  0                33m     10.104.32.164   4am-node39   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-minio-1                                     1/1     Running                  0                33m     10.104.19.106   4am-node28   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-minio-2                                     1/1     Running                  0                33m     10.104.15.248   4am-node20   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-minio-3                                     1/1     Running                  0                33m     10.104.30.23    4am-node38   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-zookeeper-0                                 1/1     Running                  0                33m     10.104.32.162   4am-node39   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-zookeeper-1                                 1/1     Running                  0                33m     10.104.19.110   4am-node28   <none>           <none>

[2025-01-11T07:17:23.024Z] kafka-pod-kill-19115-zookeeper-2                                 1/1     Running                  0                33m     10.104.15.249   4am-node20   <none>           <none>

pulsar pod kill chaos test
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/20120/pipeline
log:

artifacts-pulsar-pod-failure-20120-server-logs.tar.gz

Pulsar has an additional issue: after the pod kill, 2 of the 3 bookies keep restarting and never become ready again (0/1 Running with 8 restarts in the listing below).
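
As a side note, the readiness check behind the `kubectl get pods` listing under "pod info" below can be scripted. Here is a small, hypothetical helper using the official `kubernetes` Python client; the namespace name and the "bookie" name filter are assumptions based on the pod names in the listing:

```python
# Hypothetical readiness check mirroring `kubectl get pods | grep bookie`.
# Requires the official client: pip install kubernetes. The namespace is an
# assumption; adjust it to wherever the chaos test deploys Pulsar.
from kubernetes import client, config

def unready_bookies(namespace: str = "chaos-testing") -> list[str]:
    """Return bookie pods whose containers are running but not ready (0/1)."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    unready = []
    for pod in v1.list_namespaced_pod(namespace).items:
        name = pod.metadata.name
        if "bookie" not in name or "init" in name:  # skip the bookie-init job
            continue
        statuses = pod.status.container_statuses or []
        if not statuses or not all(s.ready for s in statuses):
            unready.append(name)
    return unready

if __name__ == "__main__":
    # For the run below this would report bookie-1 and bookie-2 (both 0/1).
    print(unready_bookies())
```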

pod info

[2025-01-10T09:15:33.354Z] + kubectl get pods -o wide

[2025-01-10T09:15:33.355Z] + grep pulsar-pod-failure-20120

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-etcd-0                                   1/1     Running                  0                34m     10.104.32.46    4am-node39   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-etcd-1                                   1/1     Running                  0                34m     10.104.19.203   4am-node28   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-etcd-2                                   1/1     Running                  0                34m     10.104.26.119   4am-node32   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-datanode-bd87c6f86-fbpfv          1/1     Running                  2 (33m ago)      34m     10.104.30.151   4am-node38   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-datanode-bd87c6f86-nrw4k          1/1     Running                  2 (33m ago)      34m     10.104.33.236   4am-node36   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-indexnode-86c56b45bc-4cm8f        1/1     Running                  2 (33m ago)      34m     10.104.34.145   4am-node37   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-indexnode-86c56b45bc-7chwr        1/1     Running                  2 (33m ago)      34m     10.104.14.147   4am-node18   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-indexnode-86c56b45bc-sxvrq        1/1     Running                  2 (33m ago)      34m     10.104.23.10    4am-node27   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-mixcoord-5475cb7979-kvlq9         1/1     Running                  2 (33m ago)      34m     10.104.23.9     4am-node27   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-proxy-77c6d556df-x5px9            1/1     Running                  2 (33m ago)      34m     10.104.33.235   4am-node36   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-querynode-69959fd6fc-cq6ww        1/1     Running                  2 (33m ago)      34m     10.104.15.101   4am-node20   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-querynode-69959fd6fc-dvh2x        1/1     Running                  2 (33m ago)      34m     10.104.33.237   4am-node36   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-querynode-69959fd6fc-gjrnh        1/1     Running                  2 (33m ago)      34m     10.104.16.9     4am-node21   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-milvus-streamingnode-548f5d78f5-vqwdd    1/1     Running                  2 (33m ago)      34m     10.104.23.11    4am-node27   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-minio-0                                  1/1     Running                  0                34m     10.104.32.50    4am-node39   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-minio-1                                  1/1     Running                  0                34m     10.104.19.206   4am-node28   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-minio-2                                  1/1     Running                  0                34m     10.104.26.114   4am-node32   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-minio-3                                  1/1     Running                  0                34m     10.104.24.129   4am-node29   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-pulsarv3-bookie-0                        1/1     Running                  8 (10m ago)      34m     10.104.19.204   4am-node28   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-pulsarv3-bookie-1                        0/1     Running                  8 (10m ago)      34m     10.104.32.51    4am-node39   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-pulsarv3-bookie-2                        0/1     Running                  8 (10m ago)      34m     10.104.26.120   4am-node32   <none>           <none>

[2025-01-10T09:15:33.615Z] pulsar-pod-failure-20120-pulsarv3-bookie-init-r4k45               0/1     Completed                0                34m     10.104.19.197   4am-node28   <none>           <none>

[2025-01-10T09:15:33.616Z] pulsar-pod-failure-20120-pulsarv3-broker-0                        1/1     Running                  8 (9m55s ago)    34m     10.104.6.41     4am-node13   <none>           <none>

[2025-01-10T09:15:33.616Z] pulsar-pod-failure-20120-pulsarv3-broker-1                        1/1     Running                  8 (10m ago)      34m     10.104.13.144   4am-node16   <none>           <none>

[2025-01-10T09:15:33.616Z] pulsar-pod-failure-20120-pulsarv3-proxy-0                         1/1     Running                  8 (10m ago)      34m     10.104.32.42    4am-node39   <none>           <none>

[2025-01-10T09:15:33.616Z] pulsar-pod-failure-20120-pulsarv3-proxy-1                         1/1     Running                  8 (10m ago)      34m     10.104.13.140   4am-node16   <none>           <none>

[2025-01-10T09:15:33.616Z] pulsar-pod-failure-20120-pulsarv3-pulsar-init-5rd6x               0/1     Completed                0                34m     10.104.6.40     4am-node13   <none>           <none>

[2025-01-10T09:15:33.616Z] pulsar-pod-failure-20120-pulsarv3-recovery-0                      1/1     Running                  8 (10m ago)      34m     10.104.13.143   4am-node16   <none>           <none>

[2025-01-10T09:15:33.616Z] pulsar-pod-failure-20120-pulsarv3-zookeeper-0                     1/1     Running                  8 (10m ago)      34m     10.104.19.205   4am-node28   <none>           <none>

[2025-01-10T09:15:33.616Z] pulsar-pod-failure-20120-pulsarv3-zookeeper-1                     1/1     Running                  8 (10m ago)      34m     10.104.32.47    4am-node39   <none>           <none>

[2025-01-10T09:15:33.616Z] pulsar-pod-failure-20120-pulsarv3-zookeeper-2                     1/1     Running                  8 (10m ago)      34m     10.104.26.113   4am-node32   <none>           <none>

Anything else?

No response

zhuwenxing added the kind/bug and needs-triage labels on Jan 13, 2025.
zhuwenxing (Contributor, Author) commented:

/assign @chyezh

PTAL

zhuwenxing (Contributor, Author) commented:

The Kafka pod kill chaos test passes when the streaming node is not enabled.
https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/19093/pipeline


chyezh added the feature/streaming node label on Jan 13, 2025.
yanliang567 added the triage/accepted label, removed needs-triage, removed their own assignment, and added this issue to the 2.6.0 milestone on Jan 13, 2025.