Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First draft of DebugInfomationFormat.md. #1

Closed
wants to merge 2 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
220 changes: 220 additions & 0 deletions DebugInfomationFormat.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,220 @@
#Debug Symbol Integration

This is a design proposal for providing debug information within WebAssembly
that would allow source-level debugging of code compiled into WebAssembly.

## Goals and Principles

The immediate goal is to implement enough debugging functionality in WebAssembly
MVP to let users perform at least rudimentary debugging on their source.
Another goal is not to bake in any limitations that would prevent reasonable
future evolution of the standard.

Specifically, we are currently not focused on identifying a language-independent
subset of the debug format -- this proposal bundles language-dependent and
language-independent parts together.

META: We need an evolution strategy that allows new front-end/debugger pairs to
use the format in the future to transfer information currently unanticipated.
Examples: a) dynamic scoping in Lisp; b) full DWARF 4 equivalence.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full DWARF 4 seems pretty huge? Isn't that a Turing-complete language? :)
Would you have a specific DWARF feature to point at instead?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These examples are just to jog one's imagination for where future evolution may take us. They're not meant to suggest that we intend to go in that direction, just that we don't want to gratuitously prevent it. Another example could be "enough of DWARF to allow stock gdb to run on wasm modules".

Anyway, see the discussion below -- we are moving to a different, more flexible design that allows any debug-info format to be used. I'll mothball this PR soon and create another one with the new design description.


## Debug Info Goes into Extra Sections

To allow easy stripping of debug info, all of it will go into a separate
section. In the binary format, this will be an unknown section inserted into
the wasm file. (Note that this new info doesn't make the current
[name section](BinaryEncoding.md#name-section) redundant, because that section
contains WebAssembly names, not source-code names.)

META: How do we distinguish the debug-containing unknown section from any other
unknown section?

In the text format, this will be an extra child of the **module** element named
**debug**.

META: If there's a better name than **debug**, we can certainly use it. Also,
currently undecided about allowing multiple **debug** children, like **func**.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you proposing that there be a single section, and a single format, that will suffice for all HLL debugging use cases? And perhaps further, that we attempt to build in the flexibility to grow to encompass everything DWARF can do, and everything needed to describe LISP, and anything else that comes up, all in this one format?

Copy link
Owner Author

@dekimir dekimir Jun 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment just says that I'm undecided whether we should allow multiple debug children under module. I'm leaning towards allowing multiple debugs as a syntactic convenience; their union will constitute the module's debug info.

As for the larger question of flexibility, I'd like this design to cover all debugging use cases supported in the MVP. Beyond MVP, I envision extending the format in backward-compatible ways for each new use case. But I don't pretend to know which use cases (and in which order) that will be -- LISP and DWARF are just some examples that I threw in to illustrate the possible variety of what could be coming one day.

Copy link

@sunfishcode sunfishcode Jun 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At an initial read, this sounds like a plan to make a format that has the eventual goal of reimplementing the functionality of DWARF. It's not explicitly spelled out here, but I assume the purpose of standardizing such a format is to allow browsers to consume it and provide their own HLL debugging experiences. Is this the vision here?

If so, I'd like to have a discussion about the merits of that vision compared with the merits of another one:

Another vision is that WebAssembly just provides '@' notation and byte offsets for identifying locations in WebAssembly code (as you describe), browser APIs for debugging primitives like reading out memory, setting breakpoints, and so on (in a security-appropriate manner), and browser APIs for reading the contents of unknown sections from wasm files. These make it possible for content to implement debugging functionality itself, such as in the form of a library that gets linked in, or possibly in the form of code that the user might request be loaded alongside the main content of a site.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if that's the impression, I didn't mean to take other options off the table. I'm happy to have the discussion about different visions. I'll reach out on IRC.


## AST Gets Unique Identifiers

Since all the debug info is in a separate section, there must be a way for it to
reference AST nodes. For that, we introduce a mechanism to tag any AST node
with a unique identifier: just add an atom to the node's s-expression. That
atom must begin with the character '@' and be unique across the entire module
(for instance: `(get_local $x @abc123)` or `(param $y f64 @def456)`).

This extension is not visible in the binary format. In binary, the debug
section can refer to the wasm code by its byte address.

## Debug Functions

The **debug** child of the **module** will contain invocations of special
functions defined in this section. These functions are intended to convey the
debug info about the AST.

The functions' syntax is given using conventions borrowed from
[here](https://www.gnu.org/software/emacs/manual/html_node/elisp/A-Sample-Function-Description.html#A-Sample-Function-Description).
Any _symbolID_ argument must be unique across the module.

### File

Function: **file** _symbolID_ _stringFilepath_

Describes a file that can be referenced by _symbolID_. The file contents can be
read by accessing _stringFilepath_ in a manner described elsewhere. No two
**file** invocations may have the same _stringFilepath_.

META: Specify how exactly the browser may obtain the file content.

### Source Location

Function: **source\_location** _symbolID_ _symbolFile_ _integerLine_ _integerColumn_

Describes a source location that can be referenced by _symbolID_. _symbolFile_
refers to a file described by the **file** function. The last two arguments are
line and column in that file.

Function: **has\_source\_location** _symbolLocation_ &rest _@nodes_

Declares that _@nodes_ (an arbitrary number of @ references to the AST)
correspond to the source location _symbolLocation_.

### Lexical Context

Function: **context** _symbolID_ _symbolStart_ _symbolEnd_

Describes a set of consecutive lines in a file, starting at location
_symbolStart_ and endint at location _symbolEnd_. Both locations must be in the
same file, and the end location must come lexically after the start.

This can be used to describe lexical scopes in C-like program, with the start
location pointing to the opening brace and the end location pointing to the
closing brace.

### Type

Function: **type** _symbolID_ _symbolLocation_

TODO: Devise a convention to represent types. OK if it's specific to C++.

### Variable

Function: **var** _symbolID_ _stringName_ _symbolContext_ _symbolType_ _integerMemoryAddress_

Describes a source-code variable that can be referenced by _symbolID_. The
variable name (as used in the source) is captured in _stringName_, its lexical
scope in _symbolContext_ (refers to a _symbolID_ of a **context** invocation),
and its type in _symbolType_ (refers to a _symbolID_ of a **type** invocation).
_integerMemoryAddress_ is the address of module's memory that holds the value of
this source variable.

TODO: Capture the source location where the variable is declared? Isn't the
context redundant then?

### Function

Function: **func** _symbolID_ _stringName_ _symbolType_ _symbolStart_ _symbolEnd_

Describes a source-code function or method whose mangled name is _stringName_,
whose type is the **type** referenced by _symbolType_, and whose start/end
locations are referenced by **source\_location** references _symbolStart_ and
_symbolEnd_.

## How to Perform Debugging Actions

Here's how to perform some basic debugging actions based on this debug-info
format:

### Set a Breakpoint on a Specific Line of a Specific File

1. Find the **file** invocation whose _stringFilepath_ argument matches the
specified file. If none exist, abort with an error "no such file".
2. Find all **source\_location** invocations whose _symbolFile_ equals the
above's _symbolID_. Among them, find invocations with the minimal
_integerLine_ equal or larger than the specified line. If none exist, abort
with an error "no debuggable code on that line".
3. Find all the **has\_source\_location** invocations whose _symbolLocation_ is
among the _symbolID_ arguments to the above invocations. If none exist,
abort with an error "no debuggable code on that line".
4. Find the earliest (in evaluation order) AST node in the union of the above
invocations' _@nodes_. This is where the breakpoint goes.

### Set a Breakpoint on a Specific Function

1. Find the **func** invocation whose _stringName_ matches (META: describe the
matching process, including mangling) specified function name. If none
exist, abort with an error "unknown function". If multiple invocations
match, ask the user to choose one among them.
2. Find the **source\_location** invocation whose _symbolID_ equals the
_symbolStart_ of the above **func**. If none exists, abort with an error
"incomplete debug info".
3. Find the **file** invocation whose _symbolID_ equals the _symbolFile_ of the
above.
4. Proceed with setting a breakpoint on the file and line derived from the
above's _stringFilepath_ and the source location's _integerLine_. The
procedure is described in the previous section.

### Set a Breakpoint Inside a C++ Template Definition

1. Find all instantiations of the template in question.
2. For each instantiation, repeat the above procedure for setting a breakpoint.

### Show which Source Line is Being Executed

1. Find the first **has\_source\_location** invocation whose _@nodes_ contains
the current AST node's @ identifier. (META: should we disallow multiple
locations for the same AST node?) If none exists, the result is nil.
2. Find the **source\_location** invocation whose _symbolID_ equals the
_symbolLocation_ of the above. If none exists, the result is nil.
3. Find the **file** invocation whose _symbolID_ equals the _symbolFile_ of the
above. If none exists, abort with an error "incomplete debug info".
4. The result is _integerLine_ of the **source\_location** from 2. above, plus
the _stringFilepath_ of the above.

### Step Into the Next Line (Following Function Calls)

1. Record the current source line as described in the previous section.
2. Execute the current AST node, then consider the next in evaluation order.
Until the calculated line/file changes, repeat from 1. above.

### Step Over the Next Line (Skipping Function Calls)

Same as the above section, except finish evaluating entire **invoke** nodes
before recalculating the current line and file.

### Show a Specified Variable's Value

1. Record the currently executed source line as described in a prior section.
2. Find all **var** invocations whose _stringName_ matches the (mangled)
specified variable name. If none exist, abort with an error "unknown
variable".
3. For each above invocation, find the **context** whose _symbolID_ equals the
_symbolContext_ of the **var**. Keep only those **var** invocations whose
**context** envelopes the current line (ie, same file and current line is
between the context's start and end).
4. If multiple **var** invocations remain after the above step, ask the user to
choose one among them.
5. Find the **type** whose _symbolID_ equals _symbolType_ from the above.
6. Inspect the memory contents at the _integerMemoryAddress_, in accordance with
the variable's type.

### Show a Specified Object's Field Value

Like the above section, but use the type structure information to find the
field's offset in memory.

### Show a Variable's Type

Like showing the variable's value, but show the **type** info instead.

### Set a Specified Variable's Value

Like showing the variable's value, but set the memory instead.

### Set the Current Function's Return Value

META: Describe this.

### Display Call Stack

META: Describe this.