How can I set DDL_LOCK_TIMEOUT before resetPassword? #35
Hey @yaooqinn, Thanks a lot for using these images for the Apache Spark integration tests! Yeah, I have no problem increasing the timeout (already saw the PR, thanks!) but I'm curious as to why this error is happening.
Thank you @gvenzl. Spark's docker IT uses
Would establishing the connection within an interval of 1 second be considered pushy and cause the DDL LOCK problem?
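For what it's worth, this kind of readiness polling can be sketched as follows. This is a minimal illustration of the pattern, not Spark's actual code; `try_connect` is a stand-in for whatever JDBC connection attempt the suite makes:

```python
import time

def wait_for_db(try_connect, timeout_s=300, interval_s=1.0):
    """Poll until `try_connect` succeeds or the timeout elapses.

    try_connect: zero-argument callable that raises on failure
    (standing in for a JDBC/driver connection attempt).
    Returns the number of attempts made.
    """
    deadline = time.monotonic() + timeout_s
    attempts = 0
    last_error = None
    while time.monotonic() < deadline:
        attempts += 1
        try:
            try_connect()
            return attempts
        except Exception as exc:  # real code would catch the driver's error class
            last_error = exc
            time.sleep(interval_s)
    raise TimeoutError(f"database not reachable: {last_error}")
```

Note that polling like this only opens and closes sessions; it issues no DDL, which is consistent with the point that a 1-second retry interval by itself should not cause a DDL lock error.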
Hey @yaooqinn, Thanks a lot for the background, this is interesting.
No, an interval of 1 second shouldn't be a problem at all, and actually, it cannot cause this error.
Hm, even more peculiar. The more I think about this, the more I think we have found some sort of race condition or bug here. Is there any way for me to see what the regression tests execute, and what's done inside the container setup?
Thank you @gvenzl. Here is a recent example I have found. To reproduce this in GitHub Actions, you can fork the spark repo and re-run the corresponding workflow.
To reproduce locally, here is a detailed guide you may refer to: https://github.com/apache/spark/tree/master/connector/docker-integration-tests#readme FYI, the code for docker container initialization can be found at:
When the GitHub Actions job fails, you can find the error log on the
Thanks a lot, @yaooqinn! Let me give this a spin and take a look at what's going on here.
Thank you very much for the help @gvenzl
Ok, I got it running locally. However, I have looked through a bunch of failed tests and seemed to find a lot more runs that simply time out with:
I'll keep digging.
Besides the console log, much richer information, including logs from the Docker container, can be found at
Thanks a lot, @yaooqinn! Btw, I don't think that the workaround is an actual workaround, going by the parameter's documentation.
Yet, the original error occurred during the second
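For reference, DDL_LOCK_TIMEOUT specifies how many seconds a DDL statement waits for a needed exclusive lock before failing with ORA-04021 (default 0, i.e. fail immediately; maximum 1,000,000). It can be changed at either scope; the value below is illustrative:

```sql
-- Wait up to 30 seconds for DDL locks in this session only
ALTER SESSION SET DDL_LOCK_TIMEOUT = 30;

-- Or instance-wide (requires ALTER SYSTEM privilege)
ALTER SYSTEM SET DDL_LOCK_TIMEOUT = 30;
```

Crucially, the parameter only lengthens the wait; it does not remove whatever is holding the library cache lock in the first place.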
Hey @yaooqinn, Is there any way to see or extract the container logs in cases such as this one:
@gvenzl. As you mentioned earlier, the workaround on the Spark side was not helpful at all.
Additional logs can be found in
Thanks @yaooqinn. Unfortunately, neither of these is very helpful. My latest suspicion is that the runner simply runs out of either horsepower, i.e. CPU, or memory. From what I've found out, these runners come with only 4 CPUs and 16 GB of memory, which ought to be enough to run these tests, unless some are executed in parallel. It seems like running these tests eats up about 3-3.5 GB of RAM. Here is the output from a run with
With these tests taking about 3-3.5 GB of memory and Oracle DB taking another 1.5-2 GB, an active run would, worst case, take about 5.5 GB, which in itself shouldn't be an issue, unless other stuff runs in parallel on the machine. Same with the CPU: the run queue is somewhere between nothing and 4/5-ish; again, not a biggie, unless other things run in parallel, perhaps. What would be really interesting is if we could inject these resource verification steps while the tests are running.
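One low-tech way to inject such verification is to sample load and available memory in a background thread while the suite runs. A minimal sketch (Linux-only, standard library; the interval and output handling are placeholders):

```python
import os
import threading
import time

def sample_resources(stop_event, out, interval_s=5.0):
    """Append (timestamp, 1-min load average, available MB) tuples to `out`."""
    while not stop_event.is_set():
        load1, _, _ = os.getloadavg()  # Unix only
        # sysconf gives a rough available-page count; good enough for trend spotting
        avail_mb = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") // (1024 * 1024)
        out.append((time.time(), load1, avail_mb))
        stop_event.wait(interval_s)

# usage: start before the tests, stop and dump `samples` afterwards
stop = threading.Event()
samples: list = []
t = threading.Thread(target=sample_resources, args=(stop, samples, 0.1), daemon=True)
t.start()
time.sleep(0.3)
stop.set()
t.join()
```

Dumping `samples` on test failure would show whether memory or the run queue spikes at the moment the DDL lock error appears.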
I forgot to add that so far I was unable to reproduce this locally on my VM, another reason why I'm suspicious that this is an issue of too many things running in parallel during the GHA executions.
Thank you for the detailed information.
Unfortunately, I don't know.
There are no parallel test jobs being executed. Based on my observations, the error in GHA always/coincidentally occurred on the second
The issue can be reproduced in my local Apple M2/colima Docker context, so it should not be GHA-specific.
If the load is the root cause, changing the DDL lock timeout seems to be a positive change, doesn't it? https://github.com/gvenzl/oci-oracle-free/pull/37
I have discovered that we can include a health check to ensure that we achieve this goal. https://github.com/docker-java/docker-java/pull/2244/files |
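For comparison, the same idea expressed with plain Docker CLI flags might look roughly like this. The `healthcheck.sh` command and image tag are assumptions about what the image ships; substitute whatever check it actually provides:

```shell
docker run -d --name oracle-free \
  -e ORACLE_PASSWORD=secret \
  --health-cmd "healthcheck.sh" \
  --health-interval 10s \
  --health-start-period 60s \
  --health-retries 10 \
  gvenzl/oracle-free

# block until Docker reports the container healthy
until [ "$(docker inspect --format '{{.State.Health.Status}}' oracle-free)" = "healthy" ]; do
  sleep 2
done
```

Waiting on the health status rather than a raw TCP connect would rule out connecting to a listener that is up before the database itself is open.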
Hey @yaooqinn, Thanks, I wasn't aware that you can replicate it in your local environment. When you encounter this error locally, could you please provide me with the log from the currently running Oracle Database container? I have still not managed to reproduce this issue at my end, but will continue to try.
Hm, I'm not sure what exactly that addition does, but it sounds like it only provides a start-interval delay for running the healthcheck. What we really need is the container's log from when that failure occurs; do you think this will provide us with that? Thanks,
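In the meantime, one way to make sure the container log survives the failure is to stream it to a file for the whole run, so nothing is lost if the container is torn down afterwards (the container name is a placeholder):

```shell
# stream logs in the background for the duration of the test run
docker logs -f <container-id-or-name> > oracle-container.log 2>&1 &

# or, after a failure but before the container is removed:
docker logs --tail 500 <container-id-or-name>
```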
It seems that ORA-12541 can be ignored; the normal runs also include such errors.
Ok, although I saw test failures on the official Spark repository actions with this error. In any case, if you could provide me with the container log, it would already help tremendously.
The Java API was used to collect the logs above, which are expected to be the same as the output of the
Thanks a lot, @yaooqinn. Is there any reason why you are resetting the
Sorry, I just found it here. That also confirms again that the workaround of increasing the DDL_LOCK_TIMEOUT to a higher number won't remedy the situation either, as the
Hi @gvenzl, I have witnessed failures even with DDL_LOCK_TIMEOUT=30, indicating its ineffectiveness. I have no idea how to fix this :(
Hey @yaooqinn, I chatted about this with my colleagues a bit, and there is no easy way to diagnose this without getting trace files from the database to know where this is happening. I can help craft such an init script but, fair warning, it will probably require a couple of round trips. However, before we go there: Oracle just released a new version of Oracle Database Free (23.4) last Thursday, and I updated my images last night.
Thank you @gvenzl I have seen a PR apache/spark#46399 for bumping up to 23.4 |
I recommend removing the code that sets the password in override def beforeContainerStart. As @gvenzl pointed out, this step is redundant because setting the environment variable ORACLE_PASSWORD already configures the SYSTEM password as well. I've conducted a preliminary test and confirmed that the functionality remains intact without this segment of code. If there are no objections, I'm ready to submit a pull request to Apache Spark to implement this change.
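As a sanity check, the image reads the password from the environment at startup, so a plain run like the following (tag and password are illustrative) should leave SYS/SYSTEM usable without any follow-up ALTER USER:

```shell
docker run -d --name oracle-free \
  -e ORACLE_PASSWORD=MyTestPassword1 \
  -p 1521:1521 \
  gvenzl/oracle-free:23.4
```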
I have created a PR for this; see https://issues.apache.org/jira/browse/SPARK-48289, basically reverting the extra password change and DDL_LOCK_TIMEOUT. It does look like we hit a bug / unwanted feature ;) in Oracle (maybe related to PDB?) that cannot be easily worked around with the attempted setting of DDL_LOCK_TIMEOUT (good try BTW). So my view is that the extra code can be removed, as it creates extra space for issues. Our hope at this stage is that the issue is not there in Oracle Free 23.4. Otherwise we can keep investigating.
…dundant SYSTEM password reset

### What changes were proposed in this pull request?
This pull request improves the Oracle JDBC tests by skipping the redundant SYSTEM password reset.

### Why are the changes needed?
These changes are necessary to clean up the Oracle JDBC tests. This pull request effectively reverts the modifications introduced in [SPARK-46592](https://issues.apache.org/jira/browse/SPARK-46592) and [PR #44594](#44594), which attempted to work around the sporadic occurrence of ORA-65048 and ORA-04021 errors by setting the Oracle parameter DDL_LOCK_TIMEOUT. As discussed in [issue #35](gvenzl/oci-oracle-free#35), setting DDL_LOCK_TIMEOUT did not resolve the issue. The root cause appears to be an Oracle bug or unwanted behavior related to the use of Pluggable Database (PDB) rather than the expected functionality of Oracle itself. Additionally, with [SPARK-48141](https://issues.apache.org/jira/browse/SPARK-48141), we have upgraded the Oracle version used in the tests to Oracle Free 23ai, version 23.4. This upgrade should help address some of the issues observed with the previous Oracle version.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This patch was tested using the existing test suite, with a particular focus on Oracle JDBC tests. The following steps were executed:

```
export ENABLE_DOCKER_INTEGRATION_TESTS=1
./build/sbt -Pdocker-integration-tests "docker-integration-tests/testOnly org.apache.spark.sql.jdbc.OracleIntegrationSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46598 from LucaCanali/fixOracleIntegrationTests.

Lead-authored-by: Kent Yao <[email protected]>
Co-authored-by: Luca Canali <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
Hey @LucaCanali, thanks for your help on this! If this issue reoccurs, it would be great to have a reproducible test case, ideally outside of the Spark test runs via a "pure" Docker + scripts environment, as that would make it easier for me to debug. But hopefully, this is solved now.
Thanks @gvenzl and @LucaCanali
Oops, the error is still there, making the test flaky: https://github.com/yaooqinn/spark/actions/runs/9155473477/job/25167935142
I can see the error:
This issue might be related to port forwarding. Based on the error message, it appears that IP:PORT 10.1.0.151:46019 should be forwarded to the container's port 1521, where the listener runs. If this forwarding is not functioning correctly, it could explain the ORA-12541 error. Another possibility is that the listener did not start up properly within the container, though this seems less likely to me, unless there is high load on the system, which could significantly delay the Oracle startup process.
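The port-forwarding hypothesis is easy to check independently of the Oracle client: ORA-12541 only means nothing accepted the TCP connection, so a raw socket probe against the mapped host:port distinguishes "listener down or port not forwarded" from database-level errors. A minimal probe (the host and port are whatever the test maps, e.g. 10.1.0.151:46019):

```python
import socket

def listener_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Running this in a loop right before the failing test would show whether the forwarded port ever became reachable.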
@LucaCanali It's not a port forwarding issue; it's the DDL_LOCK_TIMEOUT-related error again. You can find more information about the issue by referring to the discussion above.
@yaooqinn, I reviewed the logs from the following job run: https://github.com/yaooqinn/spark/actions/runs/9155473477/job/25167935142
The error identified in the logs is ORA-12541: TNS:no listener, indicating that the client cannot connect to the Oracle service on host 10.1.0.151 at port 46019. This error seems to be unrelated to the previously reported errors ORA-65048 and ORA-04021. ORA-04021, as previously discussed in this thread, is usually associated with a timeout waiting for a resource, such as a library cache lock (when changing a password, for example). ORA-65048 typically relates to attempting to access an invalid or non-existent container within a pluggable database environment. These errors suggest different root causes compared to the current connectivity issue.
Additionally, could we gain insights into the current frequency of failures of the OracleIntegrationSuite? Are there statistics or logs available that detail these occurrences? Moreover, is there a way to assess whether the testing infrastructure might be experiencing high CPU load, potentially leading to slow container image start times? This information could be crucial for identifying recurring patterns or persistent issues. Thank you!
Hi @LucaCanali. I am sorry, I attached the wrong link, as I was filing multiple PRs to the spark repo at that time; it should be this one: https://github.com/yaooqinn/spark/actions/runs/9155585688/job/25169246357. It's still ORA-65048 with ORA-04021 as the root cause.
As Spark runs specific CI jobs by identifying related code changes, it depends on how often we change the JDBC connectors. I have been working on an umbrella ticket, https://issues.apache.org/jira/browse/SPARK-47361, that touches this part frequently, so based on my observation, the failure rate is very high.
Unfortunately, no.
ORA-04021 can be reproduced on local machines; it does not necessarily have to be on testing infrastructure like GitHub Actions. I'm not sure whether ORA-04021 is related to high CPU load or not.
Thank you for the additional info.
Oh, I see. You only viewed the console log but not the
Thank you for the explanation. I can see that you had ORA-04021 now in https://github.com/yaooqinn/spark/actions/runs/9155585688/job/25169246357
I am now trying to reproduce the issue using GitHub Actions.
@LucaCanali Feel free to ping me here or with a PR in the Spark repo if you make any progress. Thanks.
I have been attempting to reproduce the issue using GitHub Actions. However, the OracleIntegrationSuite has been running successfully there as well, so far. |
Hi @yaooqinn, @LucaCanali, is this still an issue at your end?
Yes. @gvenzl
Hi Everyone, |
Hi @gvenzl
The Apache Spark Docker integration tests sometimes fail because the Docker container fails to start. The reason behind this seems to be ORA-04021. To fix the problem, we can try to increase DDL_LOCK_TIMEOUT. Can we set this value using environment variables?