Starting new Zowe results in multiple S0C4 (U4088) abends then terminates #736

Open
bobbydixon opened this issue Oct 31, 2024 · 5 comments
Labels
bug Something isn't working new not yet triaged severity-high A bug for which there may be workaround but limits the usage of the Zowe for major use cases

Comments

@bobbydixon

Describe the bug
Client installed Zowe V2.18 to test. After completing the installation steps, they started the two Zowe STCs, ZWESISTC and ZWESLSTC. ZWESLSTC issued multiple S0C4 dumps and terminated.

Steps to Reproduce

  1. Install Zowe V2.18
  2. Start the two Zowe STCs, ZWESISTC & ZWESLSTC

Expected behavior
Expect the two Zowe STCs to start, initialize, and stay up until stopped.

Screenshots (if needed)

Logs

Describe your environment

  • Zowe version number (Check the Desktop login screen, or manifest.json in the Zowe install folder): V2.18
  • Install method (pax, smpe, kubernetes, github clone): pax
  • Operating system (z/OS, kubernetes, etc) and OS version: z/OS
  • Node.js version number (Shown in logs, or via node --version): v20.16.0
  • Java version number (Shown in logs, or via java -version): Java 1.8.0_411
  • z/OSMF version: running z/OS V2.5, not sure how this translates to the z/OSMF version
  • What is the output of log message ZWES1014I: do not see this message
  • Environment variables in use:

Additional context
This seems similar to GitHub issue #600 which closed due to inactivity.

I had Howard, one of our LE experts review the dump, and he found:

As noted previously the u4088-75 is happening because the XPLINK backchain pointer at 33150830 is not in storage.   I don't see any sign of a prior freemain/release but I do see many prior program checks and RCVY systrace entries.  The customer is specifying TERMTHDACT(MSG...), which is not optimal for this.  Please ask them to change their TERMTHDACT RTO from MSG to UADUMP so we can get a u4039 dump of the first failure.

They have:

PARMLIB(CEEPRMML) OVR TERMTHDACT(MSG,CESE,00000096)

Change it to TERMTHDACT(UADUMP,CESE,00000096)

They can add it to their existing _CEE_RUNOPTS statement.

The dump shows many DSAs using addresses within the page at 33150000, but that page is not in storage. Systrace also shows many recovery retries pointing to SDWAs that are also no longer in storage. Also, TERMTHDACT is set to UADUMP now.

Updates noted; I am continuing with the u4088 dump. The u4088 is getting issued out of CEEVXPAL because the calculated previous DSA location is not within the current downstack segment. In CEEVXPAL we have register 4 set to the value passed to CEEVXPAL, 33150830. We do some calculations to see if this value is greater than SMCB_DSBOS, the highest stack address, or less than the stack floor value, 3325080. It is less than the stack floor, so we take the u4088-75 abend. The value in register 4 is passed to CEEVXPAL by the caller. We save the caller's registers in the SMCB. Register 7 is set to x'B2780DD0', which is x'418' bytes into JsonToJS1, so this code needs to be reviewed to see why it is passing this value.
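To make the range check described above concrete, here is a minimal C sketch. It is not the CEEVXPAL code: the `StackSegment` type, the `backchain_in_segment` function, and the segment bounds in the usage note are hypothetical and exist only to illustrate why a back chain pointer that falls below the segment floor is rejected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor for one stack segment (illustration only). */
typedef struct {
    uintptr_t floor; /* lowest valid address in the segment  */
    uintptr_t top;   /* highest valid address in the segment */
} StackSegment;

/*
 * Returns true when the previous-DSA (back chain) address lies inside
 * the current segment. An address outside this range is the condition
 * the analysis describes as leading to the u4088-75 abend.
 */
bool backchain_in_segment(const StackSegment *seg, uintptr_t prev_dsa) {
    return prev_dsa >= seg->floor && prev_dsa <= seg->top;
}
```

With an assumed segment running from 0x33250000 to 0x33260000, a call with the reported register 4 value 0x33150830 returns false, because the address sits below the floor.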

 I also reviewed the prior RCVY/pgm chk calls in systrace.  They all involve CEEVHPSO, LE's HP linkage stack overflow processor.  These are telling me the stack segments are overflowing and a new segment is needed.  This happens when the stack size is too small and a new segment is needed.  I see this happening about 242 times between 14:45:27.404950835 and 14:45:48.968235664 before the u4088, which looks excessive.  The Stack RTO is set to 

OVERRIDE OVR STACK(0000012272,0000004080,ANY ,FREE,
0000012272,0000004080)

This may be too small. I would recommend setting the RPTSTG(ON) option and then reviewing the report to get a better idea of what to use for the stack setting.
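As a rough illustration of why a small STACK increment can produce many overflow events, the C sketch below counts how often a chain of equal-sized frames would need a new segment. This is a deliberately simplified model, not how LE accounts for storage: the function name `count_segment_overflows`, the fixed frame size, and the assumption that frames never span segments are all hypothetical. The RPTSTG(ON) report remains the real source for sizing.

```c
#include <stddef.h>

/*
 * Hypothetical model: how many times does a call chain of `frames`
 * frames, each `frame_size` bytes, run out of room and need another
 * segment, given an initial segment size and an increment size?
 */
size_t count_segment_overflows(size_t frames, size_t frame_size,
                               size_t initial, size_t increment) {
    if (frame_size == 0) {
        return 0; /* nothing to place on the stack */
    }
    size_t per_initial = initial / frame_size; /* frames in first segment */
    if (frames <= per_initial) {
        return 0; /* everything fits without an overflow */
    }
    size_t remaining = frames - per_initial;
    size_t per_increment = increment / frame_size;
    if (per_increment == 0) {
        per_increment = 1; /* oversized frame: one segment per frame */
    }
    /* round up: each extra segment is obtained by one overflow event */
    return (remaining + per_increment - 1) / per_increment;
}
```

For an assumed chain of 100 frames of 512 bytes each, this model gives 11 overflows with STACK(12272,4080,...) and 2 with an initial size of 24576 and an increment of 16384.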

 But JsonToJS1 needs to be reviewed in addition to this.

It is also probable that the stack overflow is exhausting stack storage, driving jsonToJS1 to request stack storage that is now outside of the current segment. The traceback also hints at a possible loop in jsonToJS1 at +48C.

Traceback:
DSA  Entry                  E Offset   Statement  Load Mod  Program Unit  Service  Stat

1    jsonToJS1              +00000000             *PATHNAM                         Call
2    jsonToJS1              +0000048C             *PATHNAM                         Call
3    jsonToJS1              +0000048C             *PATHNAM                         Call
4    jsonToJS1              +0000048C             *PATHNAM                         Call
5    ejsJsonToJS_internal   +00000022             *PATHNAM                         Call
6    evaluationVisitor      +00000196             *PATHNAM                         Call
7    visitJSON              +00000124             *PATHNAM                         Call
8    visitJSON              +00000166             *PATHNAM                         Call
9    visitJSON              +00000166             *PATHNAM                         Call
10   evaluateJsonTemplates  +00000090             *PATHNAM                         Call
11   cfgLoadConfiguration2  +00000166             *PATHNAM                         Call
12   cfgLoadConfiguration   +00000022             *PATHNAM                         Call
13   CEEVHPFR               +00000002             CEEPLPKA                         Call
14   main                   +000005E4             *PATHNAM                         Call

After discussing this issue in our Zowe meeting, Andrew suggested starting ZSS in 64-bit mode by adding 64bit: true to the zowe.yaml file as follows, and then recycling the Zowe STCs:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

zss:
  enabled: true
  port: 7557
  crossMemoryServerName: ZWESIS_STD
  tls: true
  agent:
    64bit: true
    jwt:
      fallback: true

@1000TurquoisePogs
Member

Hi, I'm still discussing this with others, but I was able to reproduce this and resolve it by changing the STACK setting in the CEE runopts:

_CEE_RUNOPTS=STACK(24576,16384,ANYWHERE,KEEP,524288,13107)
Works.

This may be covering up a bug, so it's still being investigated but it can be a workaround at the moment.

@bobbydixon
Author

Hi, thanks for the update.

Are you saying that adding "64bit: true" to zss in the zowe.yaml file is a workaround (for now), or adding "_CEE_RUNOPTS=STACK(24576,16384,ANYWHERE,KEEP,524288,13107)" to the CEE runopts?

Kind regards,
Bobby

ifakhrutdinov added a commit to zowe/zowe-common-c that referenced this issue Nov 1, 2024
When an ABEND occurs and there is a user-defined ESTAEX in
an LE application, the language environment must be notified
via a call to CEE3ERP; that way LE has a chance to handle
things like hitting a stack guard page. If we don't call
CEE3ERP, things can go terribly wrong.

At some point, the ZSS 31-bit build was changed to use
XPLINK and the CEE3ERP call in the recovery facility was
erroneously limited to non-XPLINK 31-bit LE environments.

This commit changes the code to call the CEE3ERP routine
in XPLINK 31-bit LE applications.

Fixes:
* zowe/zss#600
* zowe/zss#736

Signed-off-by: Irek Fakhrutdinov <[email protected]>
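The effect of the guard change described in this commit message can be sketched as plain boolean logic in C. This is not the zowe-common-c code: both function names are hypothetical, the model covers 31-bit builds only, and the real recovery facility decides this at build time rather than through a runtime flag.

```c
#include <stdbool.h>

/*
 * Hypothetical model of a 31-bit LE build. `is_xplink` stands in for
 * whatever build-time condition distinguishes XPLINK from non-XPLINK.
 */

/* Before the fix: LE was notified (CEE3ERP called) only without XPLINK. */
bool le_notified_before_fix(bool is_xplink) {
    return !is_xplink;
}

/* After the fix: LE is notified in both 31-bit linkage variants. */
bool le_notified_after_fix(bool is_xplink) {
    (void)is_xplink; /* linkage no longer affects the decision */
    return true;
}
```

In this model a 31-bit XPLINK build, which is what the commit message says ZSS had become, is the one case where the two functions disagree.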
@ifakhrutdinov
Contributor

@bobbydixon, yes, those should be valid workarounds.

ifakhrutdinov added a commit to zowe/zowe-common-c that referenced this issue Nov 29, 2024
@ifakhrutdinov
Contributor

@bobbydixon , we've fixed a bug, which may have contributed to this issue, in our code; but we've also found an issue in the compiler and that's probably the root cause. We've opened a case with IBM. We'll let you know if there are any additional workarounds you could use.

@JoeNemo
Contributor

JoeNemo commented Dec 4, 2024

As far as we can tell, the 31-bit XPLINK version of the compiler will not fix the variable-length array problem, so we need to fix this by migrating or removing their use.
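One common way to remove a variable-length array is to move the buffer to the heap. The C sketch below shows that pattern under stated assumptions: `copy_with_heap_buffer` is a hypothetical function, not one from ZSS or zowe-common-c, and the thread does not say which replacement the maintainers chose.

```c
#include <stdlib.h>
#include <string.h>

/*
 * A VLA version of this routine would declare the buffer on the stack:
 *     char buffer[len + 1];
 * The heap version below avoids a stack frame whose size depends on
 * runtime input.
 */
char *copy_with_heap_buffer(const char *src, size_t len) {
    char *buffer = malloc(len + 1);
    if (buffer == NULL) {
        return NULL; /* allocation failed; caller must handle this */
    }
    memcpy(buffer, src, len); /* src must hold at least len bytes */
    buffer[len] = '\0';
    return buffer; /* caller owns the buffer and must free() it */
}
```

The trade-off is an explicit allocation failure path and a `free` obligation for the caller, in exchange for stack frames of fixed size.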
