CloudWatch logs have not completed ingestion within 1 minute #10

Open
omus opened this issue Mar 4, 2021 · 1 comment

omus commented Mar 4, 2021

I've noticed this failure show up a few times with the online tests:

 Num workers (10): Error During Test at /home/runner/work/AWSClusterManagers.jl/AWSClusterManagers.jl/test/batch_online.jl:122
  Got exception outside of a @test
  CloudWatch logs have not completed ingestion within 1 minute
  Stacktrace:
   [1] error(::String) at ./error.jl:33
   [2] run_batch_job(::String, ::Int64; timeout::Minute, should_fail::Bool) at /home/runner/work/AWSClusterManagers.jl/AWSClusterManagers.jl/test/batch_online.jl:109

The logs for the manager:

Manager accepting worker connections via: 10.0.12.53:32768
--
Found previously registered job definition: "arn:aws:batch:us-east-1:134847318362:job-definition/AWSClusterManagers-jl:18"
Submitted array job "AWSClusterManagers-jl-n10" (76e81a62-65ae-488a-a6b9-1bbc8af886cf, n=10)
Spawning array job: 76e81a62-65ae-488a-a6b9-1bbc8af886cf (n=10)
NumProcs: 11
Worker container 2: c6205854a2d2ed54efe54d38572d0ce3848262ea8475f1b9a8cef1ebf92927c8
Worker job 2: 76e81a62-65ae-488a-a6b9-1bbc8af886cf:7
Worker container 3: 9a3945505cd55d1a54818fd5184b08a96f082dc89fcdb739cfe1f1de840cb0b4
Worker job 3: 76e81a62-65ae-488a-a6b9-1bbc8af886cf:0
Worker container 4: 04dd75c4e042186eb2e31c6eda96a2ae24279ea260a1bc5eafabfa8bc9616cb6
Worker job 4: 76e81a62-65ae-488a-a6b9-1bbc8af886cf:2
Worker container 5: 19253f8ed3bd9388773750efcdc9287f5ea67468800e02844ecf4d8ba1c72c7b
Worker job 5: 76e81a62-65ae-488a-a6b9-1bbc8af886cf:1
Worker container 6: c9e538218ac792ddfe549abbf6e08ae4decc5f6f1d67d8441f3af62758a4770b
Worker job 6: 76e81a62-65ae-488a-a6b9-1bbc8af886cf:6
Worker container 7: a2c02e7da471ce44e7d16ff3db89613482a810860ac6f28764c3c2bd89890613
Worker job 7: 76e81a62-65ae-488a-a6b9-1bbc8af886cf:8
Worker container 8: 65e7248374413e35edb107f29c0892f211631ed94fafd4745b22b50d7522cb52
Worker job 8: 76e81a62-65ae-488a-a6b9-1bbc8af886cf:5
Worker container 9: c0d5f47a071e852939bbb95d0f6beaf19dd4b896caaa4208be6d502a48bf32c1
Worker job 9: 76e81a62-65ae-488a-a6b9-1bbc8af886cf:9
Worker container 10: 4346c26d27e19f0f751bfbafc1cdc5700920416fdb675ae529c13133f7563edb
Worker job 10: 76e81a62-65ae-488a-a6b9-1bbc8af886cf:3
Worker container 11: 66244fbcc7b297cbc9b92cd60e536591d119bed5a12afa8cd95efe798af4da19
Worker job 11: 76e81a62-65ae-488a-a6b9-1bbc8af886cf:4
Manager Complete
┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [3, 4, 7, 8, 9, 10, 11] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1234
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1030

omus commented Mar 4, 2021

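The issue is that the check assumes the "Manager Complete" message will be the last thing written to the logs. In the manager logs above, the `rmprocs` warnings from Distributed are written after "Manager Complete", so that assumption does not hold.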
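
One possible way to relax the check is to look for the "Manager Complete" message anywhere in the ingested log rather than only in the final position. The following is a minimal sketch under stated assumptions, not the actual code in `test/batch_online.jl`: the names `manager_completed`, `wait_for_manager_complete`, and `fetch_messages` are hypothetical, and `fetch_messages` stands in for whatever call returns the log messages ingested so far.

```julia
# Hypothetical check: succeed once the marker appears anywhere in the log,
# instead of requiring it to be the final message.
function manager_completed(messages; marker="Manager Complete")
    return any(msg -> strip(msg) == marker, messages)
end

# Hypothetical polling loop. `fetch_messages` is an assumed zero-argument
# function returning the log messages ingested so far, in order.
# `timeout` and `interval` are in seconds.
function wait_for_manager_complete(fetch_messages; timeout=60, interval=5)
    deadline = time() + timeout
    while true
        manager_completed(fetch_messages()) && return true
        time() >= deadline && return false
        sleep(interval)
    end
end

# Example using a tail like the one in the manager logs above, where
# warnings follow the marker.
logs = [
    "NumProcs: 11",
    "Manager Complete",
    "┌ Warning: Forcibly interrupting busy workers",
    "┌ Warning: rmprocs: process 1 not removed",
]
found = wait_for_manager_complete(() -> logs; timeout=1, interval=0.1)
```

With this approach, trailing warnings no longer cause the check to time out. The trade-off is that the check alone no longer guarantees every later log line has been ingested, so a test that needs those lines would have to wait for them separately.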
