Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Display in browsers and tools for addresses of functions and instructions #990

Closed
dschuff opened this issue Feb 15, 2017 · 21 comments
Closed

Comments

@dschuff
Copy link
Member

dschuff commented Feb 15, 2017

Currently if you have a trap or validation error in your wasm binary, V8 gives you error output like
Compiling WASM function #26:<?> failed:Result = expected 1 elements on the stack for fallthru to @3 @+58 or at (<WASM>[3]+156).
SpiderMonkey says something like CompileError: wasm validation error: at offset 4214: type mismatch: expression has type void but expected i32

Meanwhile if you want to look at your binary with WABT's wasmdump tool, you will see something like

000015 func[0]:
 000017: 0c 00                      | br 0
 000019: 02 7f                      | block i32

WABT shows functions as indices in the (import+function) function index space, and all the "addresses" as hexadecimal file offsets.
If I understand correctly, V8 is giving you function index (in same function index space) + decimal function offset in the stack representation and... something else... in the validation error text.
SpiderMonkey gives you no function name or index, and a (file?) offset.

So to debug anything you might have to look at the source code of your favorite browser engine to determine what information you are getting, and then use your favorite programmers' calculator to convert between file and function offset and number bases. Needless to say this is a suboptimal user experience.

#814 had some discussion of normalizing the content of Error.stack, and I understand that's somewhat fraught. My goal here is simultaneously more ambitious (in that we should include tools as well as browsers) and less ambitious (see below).

Here are possible things we might agree on (where the definition of "agree" is also negotiable).

  1. Representation of function ID (name and/or index)
  2. Representation of function and instruction addresses (offset relative to file, section, or function; hex or decimal)
  3. Representation of section ID (may or may not be needed in browsers, but will be used in tools)
  4. Representation of module ID (stack traces can contain frames from 2 or more different wasm instances dynamically linked together)
  5. Exact or partial format of some string that is emitted in stack traces, exception messages (e.g. LinkError and CompileError text), and even dev tools.
  6. Whether exceptions are precise
  7. ???

I think getting 2 and 4 is the absolute bare minimum, so a user can dump their binary and find (an approximation of) the faulting address or (the exact, hopefully) the invalid address, without having to do base conversions and arithmetic. 1 and 3 would be nice too, and seems like they ought to be pretty easy to agree on and implement. 5 and 6 would in principle be nice too, but as long as the other conditions are met, the incremental benefit is fairly small and the cost might be higher. I'm fine if we want to punt those.

As to the definition of "agree", there is:

  1. Normative text in the most-formal relevant spec (I guess that would be JS.md for browser behavior for now)
  2. Some suggested/non-normative text in JS.md, or elsewhere in the design repo (which would give it some kind of official guideline or weight)
  3. Some text somewhere else (perhaps tool-conventions) which would be a non-source-code description, but less official.
  4. Nothing written anywhere, the browsers and tools written by us just all do the same thing.

Honestly even 4 would be fine. Whatever we do, other tools will probably follow for their own convenience, and users (who aren't going to look for specs unless they are confused) will be happy.

Thoughts?

@kripken
Copy link
Member

kripken commented Feb 15, 2017

I agree 2 and 4 are basic things that would be great to agree on. Regarding 2, offsets relative to the entire binary file seem most natural and simple - there is no ambiguity, no offsets to add when looking in the binary, etc. And regarding hex vs. dec, no strong feelings but JS line numbers are dec, so maybe nice to be consistent with that.

@dschuff
Copy link
Member Author

dschuff commented Feb 15, 2017

And now, in which I suggest a color for the bikeshed and tradeoffs:

I think I'd prefer to see function index (and name if present) as well as an absolute offset (instead of an offset from the function start). Using a section offset would be more conventional (for the tools) compared to existing tools like objdump, but I agree that in the browser it would require printing yet another number or name, which wouldn't be great. I don't have a strong opinion on that. Also wrt base: hex is more common for tools, but decimal is more common for browsers. Naturally I prefer everything to look as much like unix tools from the 60s as possible 😁

@RyanLamansky
Copy link

I'm usually very opinionated on things, but in this particular issue I don't have strong feelings. I'd like to give the browser vendors some time to experiment--maybe try things we're not thinking of--in a competitive effort to provide a great debugging solution.

@titzer
Copy link

titzer commented Feb 16, 2017

We wanted to take a pass over all the error messages that V8 generates and make them a bit nicer. Happy to converge with others here. My thoughts:

  • Don't really have a strong opinion on hex vs decimal offsets
  • Would prefer to use function names (from the names section) if they are available and function id's (in the function index space)
  • Shouldn't be radically different than what we report for WebAssembly.RuntimeError stacktraces, e.g. we use decimal byte offsets from function-body start

@dschuff
Copy link
Member Author

dschuff commented Feb 16, 2017

Shouldn't be radically different than what we report for WebAssembly.RuntimeError stacktraces, e.g. we use decimal byte offsets from function-body start

I think the intent is for this agreement to cover those stacktraces, as well as the compile-time errors. Was there some particular reason you chose function-offset rather than file-offset for that? I think my preference would be for file or section offset but honestly that preference is based more on what it would look like in tooling rather than browsers.

@titzer
Copy link

titzer commented Feb 16, 2017

The function offset is preferable for stack traces, since they are more stable. E.g. I am thinking of tools that use stack traces of exceptions to track different kinds of software defects (like my previous project). It's nicer to have a more stable function-relative offset for those use cases.

That said it is also good to see the byte offset in the module when debugging producers. So that's why the V8 --trace-wasm-decoder flag outputs both.

@dschuff
Copy link
Member Author

dschuff commented Feb 16, 2017

I'm not sure I buy that function offsets are more stable. If you recompile a binary, both function offsets and file offsets are subject to arbitrary changes (not to mention that function index values or names may change or functions may disappear entirely). If you want to make any sense of a stack trace you need a copy of the exact binary in any case, no?

@titzer
Copy link

titzer commented Feb 16, 2017

It depends on the optimization level, of course. But for a given small change to a C++ function, unless that function is inlined everywhere, or somehow otherwise triggers different inlining or optimization decisions in other functions, those other functions' bytecode should be stable.

@binji
Copy link
Member

binji commented Feb 17, 2017

I like hex over dec. I also like file offsets; you can use the same offset-space for any error. IIRC, @sbc100 modeled wasmdump after objdump, which appears to use file offsets, not section offsets:

$ objdump d8 -h
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
...
 10 .init         00000018  000000000001c608  000000000001c608  0001c608  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE

...
$ objdump d8 -d
Disassembly of section .init:

000000000001c608 <_init>:
   1c608:       48 83 ec 08             sub    $0x8,%rsp
   1c60c:       e8 47 17 00 00          callq  1dd58 <call_gmon_start>
   1c611:       e8 ea 17 00 00          callq  1de00 <frame_dummy>
   1c616:       e8 b5 40 04 00          callq  606d0 <__do_global_ctors_aux>
   1c61b:       48 83 c4 08             add    $0x8,%rsp
   1c61f:       c3                      retq   

It looks like label+offset is used as well:

000000000001c630 <__libc_start_main@plt>:
   1c630:       ff 25 7a df 05 00       jmpq   *0x5df7a(%rip)        # 7a5b0 <_GLOBAL_OFFSET_TABLE_+0x18>
   1c636:       68 00 00 00 00          pushq  $0x0
   1c63b:       e9 e0 ff ff ff          jmpq   1c620 <_init+0x18>

@dschuff
Copy link
Member Author

dschuff commented Feb 17, 2017

OK, let's not consider section offsets then. I still have a preference for file/module offsets over function offsets, but I could be convinced. Do engine implementors think that one or the other of those would carry extra cost (e.g. memory or bookkeeping) in the engine? It's pretty easy for the offline tools to just do whatever. WRT hex vs dec split it might make sense to use hex if we do file offsets (they are like "addresses" and correspond more directly to how addresses are used elsewhere) and decimal if we do function offsets, just to signal that this is not exactly an address.

@titzer
Copy link

titzer commented Feb 20, 2017

For the --trace-wasm-decoder output, I think it makes sense to do what @binji has done and follow the objdump output format. For decoding error messages, function index + function name + function offset + module offset would probably be the most useful. For stacktraces, I think function index + function name + function offset is sufficient, as I don't see how the module offset would be particularly useful for runtime errors. WDYT?

@rossberg
Copy link
Member

rossberg commented Feb 20, 2017 via email

@kripken
Copy link
Member

kripken commented Feb 20, 2017

@titzer: I don't see the distinction between error messages and runtime stacktraces in that context, the module offset seems very useful in both.

(I personally don't see much value in the function offset, but am not opposed to having it as well.)

@MikeHolman
Copy link
Member

I am against 6 (mandating precise exception reporting). Mostly because this makes bce much harder on 32 bit platforms. I also think whatever we do should fall short of normative text on error.stack output.

I wonder too if it's even desirable to fully converge. Inevitably that means people parsing the text. And then for compat, the suggested non-normative text may become de facto standard (and unchangeable).

@lukewagner
Copy link
Member

lukewagner commented Feb 21, 2017

I definitely like the idea of standardizing on a set of fields (that get embedded in the string) and their meaning. So for a given callstack entry, I can see these fields all being meaningful:

  • name: optional function name, coming from the names section if present
  • pcOffset: offset of the trapping instruction (for the faulting frame) or the callsite (for calling frames)
  • funcIndex: function index
  • url: with the Response-taking functions (Adding Response based versions of validate, compile, and instantiate. #991), some Modules will have a URL. It's also possible to synthesize a URL, as engines currently do for eval, that uses the scripted caller, on a best-effort basis (since there's not always a scripted caller).

For pcOffset, I agree with @dschuff, @kripken and @rossberg-chromium that module-relative seems more useful when someone cares at all. No strong opinion on dec vs. hex for pcOffset; hex (I assume with the 0x prefix) might be a nice contrast to the decimal funcIndex.

Given the existing JS name@url:line:column, at name (url:line:column) and other formats for JS stack entries, it seems the placement of the wasm name field would naturally want to go in the same place for each respective engine.

For the remaining three fields, it seems like we have the opportunity to define a shared string format to replace url:line:column. How about:

${url}:wasm-function[${funcIndex}]:${pcOffset}

? Some reasons:

  • reserves spot for URL, even if it's empty
  • avoids shouting "WASM"
  • makes it clear that funcIndex is a function index and also allows us to have non-function-indexed-things in callstacks in the future
  • colon before pcOffset seems more appropriate if offset is module-relative
  • somewhat symmetric with url:line:column but it's unambiguously wasm

Thoughts?

@lukewagner
Copy link
Member

lukewagner commented Feb 21, 2017

@MikeHolman Agreed on precise offsets. I think we actually already agreed to this in the middle of #814.

But w.r.t best-effort stack strings, I think server-side crash reporting/triage is a valid use case for Error.stack so it'll definitely be helping our users to converge as much as we can.

@MikeHolman
Copy link
Member

Fair point. Error bucketing (e.g. Watson) proves very helpful for me, so it would be bad to neglect it.

How do people do this in js?

@lukewagner
Copy link
Member

IIUC, they just grab an Error.stack string and send it back to the server where they do UA-specific parsing to pull out file/line and work from there.

@dschuff
Copy link
Member Author

dschuff commented Apr 3, 2017

OK, time to push this forward a little more.

I like @lukewagner's suggestion, so carrying that out a little more. In browsers, text representations of locations would be something like
${name}@${location} (currently it looks like SM and JSC use something like this) or at ${name} (${location}) (V8)
For JS, name is the JS function name and location is ${URL}:${line}:${column}.

For wasm, name is the name from the names section or.. <wasm>? an empty string?
location would then be
${url}:wasm-function[${funcIndex}]:0x${pcOffset}
where funcIndex is the index in the function-index space, and pcOffset is the module/file offset of the beginning of the instruction, printed in hexadecimal, with lowercase digits.

For tools such as wasmdump (and in some browser use cases such as dev tools) it won't always make sense to have that whole location string together. For example if you are dumping a file, the filename would replace the URL but it wouldn't make sense to print it over and over again. You might have a format that includes a subset, formatted in the same format and order that it appears elsewhere. Likewise in devtools UI.
Browsers presumably won't print section IDs; we can come up with a convention in the tools, maybe fork that to a separate issue.

This covers the goals:

  1. function index is always displayed in the specified format, with name where applicable
  2. module-relative offsets in hex
  3. section IDs not specified here
  4. module ID is URL where applicable.
  5. the string that I called location is specified. if we want we can also cover the name@location bit.
  6. precise traps not specified.

A few open questions:

  1. Do we want to specify the name@location part where browsers currently differ?
  2. Do we want to specify what we print when a function has no name?
  3. I specified hex for module offsets here. There seems to be more support for that than not-support, among those who care?
  4. Do we want to specify the module IDs of modules created from ArrayBuffers?

@lukewagner
Copy link
Member

  1. Given that the wasm frames are embedded in an overall JS callstack with JS frames (whose format we probably can't change), I think it'd probably be less friction overall if we had wasm match the engine's JS format, at least within Error.stack. In other contexts, it'd make sense to standardize.
  2. When no name is specified, leaving name as the empty string makes sense to me: you're basically saying that there isn't a name. The frame is still unamiguously wasm, though, due to the wasm-function later.
  3. Hex is fine with me. Maybe it's even preferable b/c a decimal number might suggest to people that it's the line number in the rendered text format which it's not.
  4. For ArrayBuffer, what I was suggesting above is doing what eval does: include the location of the JS caller that called instantiate/compile. So, e.g., in SM we'll use test.js line 2 > eval as the URL of a script eval'd at line 2 of test.js. (V8 seems to do something symmetric with more at.)

@dschuff
Copy link
Member Author

dschuff commented May 22, 2017

The intent of this issue is addressed in #1053. I also filed #1071 for the question of byte offset vs other ideas for referring to instructions.

@dschuff dschuff closed this as completed May 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants