[wip] pointer schema

ethdebug · Feb 6, 2024 · ac5cda7 · ac5cda7
1 parent 04490ed
commit ac5cda7
Showing 44 changed files with 1,745 additions and 0 deletions.
diff --git a/packages/tests/schemas/examples.test.ts b/packages/tests/schemas/examples.test.ts
@@ -11,6 +11,8 @@ const idsOfSchemasAllowedToOmitExamples = new Set([
   "schema:ethdebug/format/type",
   "schema:ethdebug/format/type/complex",
   "schema:ethdebug/format/type/elementary",
+  "schema:ethdebug/format/pointer/region",
+  "schema:ethdebug/format/pointer/collection",
 ]);
 
 describe("Examples", () => {

diff --git a/packages/web/spec/pointer/_category_.json b/packages/web/spec/pointer/_category_.json
@@ -0,0 +1,8 @@
+{
+  "label": "ethdebug/format/pointer",
+  "position": 4,
+  "link": {
+    "type": "generated-index",
+    "description": "Work-in-progress formal schema for ethdebug format"
+  }
+}
diff --git a/packages/web/spec/pointer/collection/_category_.json b/packages/web/spec/pointer/collection/_category_.json
@@ -0,0 +1,8 @@
+{
+  "label": "Collections",
+  "position": 5,
+  "link": {
+    "type": "generated-index",
+    "description": "Work-in-progress formal schema for ethdebug format"
+  }
+}
diff --git a/packages/web/spec/pointer/collection/collection.mdx b/packages/web/spec/pointer/collection/collection.mdx
@@ -0,0 +1,11 @@
+---
+sidebar_position: 1
+---
+
+import SchemaViewer from "@site/src/components/SchemaViewer";
+
+# Collection schema
+
+<SchemaViewer
+  schema={{ id: "schema:ethdebug/format/pointer/collection" }}
+  />
diff --git a/packages/web/spec/pointer/collection/conditional.mdx b/packages/web/spec/pointer/collection/conditional.mdx
@@ -0,0 +1,11 @@
+---
+sidebar_position: 4
+---
+
+import SchemaViewer from "@site/src/components/SchemaViewer";
+
+# Conditional
+
+<SchemaViewer
+  schema={{ id: "schema:ethdebug/format/pointer/collection/conditional" }}
+  />
diff --git a/packages/web/spec/pointer/collection/group.mdx b/packages/web/spec/pointer/collection/group.mdx
@@ -0,0 +1,11 @@
+---
+sidebar_position: 2
+---
+
+import SchemaViewer from "@site/src/components/SchemaViewer";
+
+# Group
+
+<SchemaViewer
+  schema={{ id: "schema:ethdebug/format/pointer/collection/group" }}
+  />
diff --git a/packages/web/spec/pointer/collection/list.mdx b/packages/web/spec/pointer/collection/list.mdx
@@ -0,0 +1,11 @@
+---
+sidebar_position: 3
+---
+
+import SchemaViewer from "@site/src/components/SchemaViewer";
+
+# List
+
+<SchemaViewer
+  schema={{ id: "schema:ethdebug/format/pointer/collection/list" }}
+  />
diff --git a/packages/web/spec/pointer/concepts.mdx b/packages/web/spec/pointer/concepts.mdx
@@ -0,0 +1,309 @@
+---
+sidebar_position: 2
+---
+
+import SchemaViewer from "@site/src/components/SchemaViewer";
+
+# Key concepts
+
+## A **pointer** is a region or a collection of other pointers
+
+High-level languages allow programmers to describe and manipulate conceptual
+ideas as succinct, individual building blocks of machine execution.
+Nowadays, thanks to decades of compiler research, resulting low-level machine
+states are often reliably indecipherable, bearing no resemblance at all to
+even basic data abstractions.
+
+The **ethdebug/format/pointer** schema provides a tree-based syntax for
+representing complete (and often minutely detailed) address information for
+finding a particular high-level data object. (For instance: a compiler may need
+to inform a debugger about where to find a particular array in memory.)
+As such, this schema is specified recursively: a pointer is either a single,
+continuous sequence of byte addresses (a **region**), or it aggregates other
+pointers (a **collection**).
+
+## A **region** is a single continuous range of byte addresses
+
+For simple allocations (like those that fit into a single word), the
+**ethdebug/format/pointer** representation is also quite simple: just a single,
+optionally named, continuous chunk of bytes in the machine state.
+
+<details>
+<summary>
+**Example**: Pointing to the first 32 bytes of memory
+</summary>
+
+```json
+{
+  "name": "memory-start",
+  "location": "memory",
+  "offset": "0x0000000000000000000000000000000000000000000000000000000000000000",
+  "length": 32
+}
+```
+
+</details>
+
+This schema defines the concept of a **region** to be the representation
+of the addressing details for a particular block of continuous bytes. Different
+data locations use different, location-specific schemas for regions
+(since, e.g., stack regions are very different from storage regions). The
+**ethdebug/format/pointer/region** schema aggregates these using the
+`"location"` field as a polymorphic discriminator.
+
+## A **collection** aggregates other pointers
+
+Other allocations are not so cleanly represented by a single continuous block
+of bytes anywhere. In these situations, the **ethdebug/format/pointer**
+representation can describe the aggregation of other composed pointers.
+
+<details>
+<summary>
+**Example**: Solidity `struct` in memory
+</summary>
+
+Consider the `struct` definition:
+```solidity
+struct Record {
+    uint256 x;
+    uint256 y;
+}
+```
+
+A minimal way to represent one possible memory allocation for this type is
+to `"group"` together two single-region pointers for each of the two struct
+members.
+
+```json
+{
+  "group": [{
+    "location": "memory",
+    "offset": "0x40",
+    "length": 32
+  }, {
+    "location": "memory",
+    "offset": "0x60",
+    "length": 32
+  }]
+}
+```
+
+</details>
+
+This kind of pointer is defined to be a **collection** of other pointers.
+This schema includes several sub-schemas for different kinds of collections
+(e.g., since the allocation of struct types is very different from the
+allocation of array types, etc.). The **ethdebug/format/pointer/collection**
+schema aggregates these.
+
+
+## Pointers allow describing a value as a complex **expression**
+
+The examples above all use literal values for `"offset"` and `"length"`, but
+these will very often be impossible to predict at compile-time. This format
+provides a JSON-based **expression** syntax for relating the data components
+involved in a complex allocation.
+
+Besides representing byte offsets, word addresses, byte range lengths, etc.
+using just unsigned integer literals, this schema also allows representing
+addressing details via numeric operations, references to other regions by
+name, explicit EVM lookup, and a few builtin operations.
+
+<details>
+<summary>
+**Example**: Region defined using an arithmetic operation
+</summary>
+
+```json
+{
+  "location": "memory",
+  "offset": {
+    "$sum": [1, 1]
+  },
+  "length": 1
+}
+```
+
+</details>
+
+More expressively, regions whose representations include a
+`"name": "<identifier>"` property allow the use of this `<identifier>` in
+references to this region elsewhere in the pointer representation.
+
+This can be used, for example, to indicate a storage slot whose address is
+known at some point in execution to be the value at the top of the stack.
+
+<details>
+<summary>
+**Example**: Naming a region and reading this region's data from machine state
+</summary>
+
+```json
+{
+  "group": [{
+    "name": "pointer-to-storage-slot",
+    "location": "stack",
+    "slot": 0
+  }, {
+    "name": "storage-slot",
+    "location": "storage",
+    "slot": {
+      "$read": "pointer-to-storage-slot"
+    }
+  }]
+}
+```
+
+</details>
+
+Regions can also be referenced for the purposes of copying fields (to
+avoid duplicating constants, etc.). This is useful, e.g., with structs, whose
+members often are allocated one after the next with no gap. In this example,
+the sub-pointer corresponding to `y` is positioned based on `x`'s offset
+and length.
+
+<details>
+<summary>
+**Example**: Defining regions in terms of other regions
+</summary>
+
+```json
+{
+  "group": [{
+    "name": "record-pointer",
+    "location": "stack",
+    "slot": 0
+  }, {
+    "name": "record-x",
+    "location": "memory",
+    "offset": {
+      "$read": "record-pointer"
+    },
+    "length": 32
+  }, {
+    "name": "record-y",
+    "location": "memory",
+    "offset": {
+      "$sum": [{ ".offset": "record-x" }, { ".length": "record-x" }]
+    },
+    "length": 32
+  }]
+}
+```
+
+</details>
+
+This schema is designed to allow compilers maximal expressiveness in producing
+self-contained representations that are completely knowable at compile-time.
+
+## Collections can be dynamic
+
+For collections whose cardinality or configuration is unpredictable at
+compile-time (e.g., `uint256[]` or `string storage` allocations, respectively),
+and for collections whose static representation would simply be too cumbersome
+(e.g., `bytes32[3200][5600][111]` allocations),
+**ethdebug/format/pointer/collection** provides sub-schemas for describing the
+full set of inter-related data addresses in dynamic terms.
+
+<details>
+<summary>
+**Example**: Representing a dynamically-sized array allocation
+</summary>
+
+The following represents an allocation where the array's item count is stored
+as the leading word-sized sequence of bytes, and each item in the array has
+the same fixed size and appears in memory sequentially following that with no
+gaps.
+
+Notice how `"list": { ... }` expects an object with an expression for the
+value of `"count"`, the name of the scalar variable (`"each"`) that represents
+an item's index in the list, and the underlying pointer (`"is"`) for the item
+itself.
+
+```json
+{
+  "group": [
+    {
+      "name": "array-count",
+      "location": "memory",
+      "offset": "0x40",
+      "length": 32
+    },
+    {
+      "list": {
+        "count": { "$read": "array-count" },
+        "each": "item-index",
+        "is": {
+          "name": "array-item",
+          "location": "memory",
+          "offset": {
+            "$sum": [
+              { ".offset": "array-count" },
+              { ".length": "array-count" },
+              {
+                "$product": [
+                  "item-index",
+                  { ".length": "array-item" }
+                ]
+              }
+            ]
+          },
+          "length": 32
+        }
+      }
+    }
+  ]
+}
+```
+
+</details>
+
+## A region is specified in terms of an **addressing scheme**
+
+The EVM models its various data locations in a couple of different ways based on
+how bytes are defined to be arranged in each location: e.g., storage is
+arranged in slots, but memory is just one long byte array.
+
+This pointer schema does not make any attempt to unify these different access
+abstractions, but instead it defines the concept of an **addressing scheme**.
+
+Each location-specific region schema is defined as an extension of the schema
+for a particular addressing scheme.
+
+This format currently defines two such addressing schemes:
+**ethdebug/format/pointer/scheme/slice** and
+**ethdebug/format/pointer/scheme/segment**, for addressing ranges within a
+single continuous byte array and addressing a slot or collection of slots
+in a word-arranged location (respectively).
+
+<details>
+<summary>
+**Example**: Slice-based region
+</summary>
+
+This example addresses the 32 bytes starting at offset `0x3334` in calldata:
+
+```json
+{
+  "location": "calldata",
+  "offset": "0x3334",
+  "length": "0x20"
+}
+```
+
+</details>
+
+<details>
+<summary>
+**Example**: Segment-based region
+</summary>
+
+This example addresses slot 0 of transient storage.
+
+```json
+{
+  "location": "transient",
+  "slot": 0
+}
+```
+
+</details>
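
Note on the expression syntax in `concepts.mdx`: the following is a minimal sketch of how a debugger might evaluate the expressions used in the examples above (`$sum`, `$product`, `$read`, `.offset`, `.length`). It is not part of this commit and not the ethdebug reference implementation. The names `Expression`, `Context`, `evaluate`, and the `read` callback are hypothetical, and two simplifications are assumed: values fit in a JavaScript `number` (real EVM words are 256 bits wide), and any string that is not a hex literal is treated as a variable name such as a list's `"each"` index.

```typescript
// Hypothetical, simplified subset of the expression syntax from the examples.
type Expression =
  | number
  | string // hex literal like "0x40", or a variable name like "item-index"
  | { $sum: Expression[] }
  | { $product: Expression[] }
  | { $read: string }
  | { ".offset": string }
  | { ".length": string };

interface ResolvedRegion {
  offset: number;
  length: number;
}

interface Context {
  regions: Record<string, ResolvedRegion>; // named regions resolved so far
  variables: Record<string, number>; // e.g. the current list index
  read: (regionName: string) => number; // stand-in for machine state access
}

function lookup(ctx: Context, name: string): ResolvedRegion {
  const region = ctx.regions[name];
  if (!region) throw new Error("Unknown region: " + name);
  return region;
}

function evaluate(expr: Expression, ctx: Context): number {
  if (typeof expr === "number") return expr;
  if (typeof expr === "string") {
    // Assumption: hex strings are literals, anything else is a variable.
    if (/^0x[0-9a-fA-F]+$/.test(expr)) return parseInt(expr, 16);
    const value = ctx.variables[expr];
    if (value === undefined) throw new Error("Unknown variable: " + expr);
    return value;
  }
  if ("$sum" in expr) {
    return expr.$sum.map((e) => evaluate(e, ctx)).reduce((a, b) => a + b, 0);
  }
  if ("$product" in expr) {
    return expr.$product.map((e) => evaluate(e, ctx)).reduce((a, b) => a * b, 1);
  }
  if ("$read" in expr) return ctx.read(expr.$read);
  if (".offset" in expr) return lookup(ctx, expr[".offset"]).offset;
  return lookup(ctx, expr[".length"]).length;
}

// Struct example: "record-y" starts where "record-x" ends.
const structCtx: Context = {
  regions: {},
  variables: {},
  read: () => 0x80, // stub: the stack slot holds the memory address 0x80
};
structCtx.regions["record-x"] = {
  offset: evaluate({ $read: "record-pointer" }, structCtx),
  length: 32,
};
const recordYOffset = evaluate(
  { $sum: [{ ".offset": "record-x" }, { ".length": "record-x" }] },
  structCtx
);

// List example: offset of the item at index 3, after a 32-byte count word.
const listCtx: Context = {
  regions: {
    "array-count": { offset: 0x40, length: 32 },
    "array-item": { offset: 0, length: 32 }, // only the length is used here
  },
  variables: { "item-index": 3 },
  read: () => 0,
};
const itemOffset = evaluate(
  {
    $sum: [
      { ".offset": "array-count" },
      { ".length": "array-count" },
      { $product: ["item-index", { ".length": "array-item" }] },
    ],
  },
  listCtx
);

console.log(recordYOffset, itemOffset);
```

With the stubbed values, `recordYOffset` is `0x80 + 32` and `itemOffset` is `0x40 + 32 + 3 * 32`. Because every expression resolves through the shared `Context`, a region can refer to earlier regions by name without repeating their constants.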