From 91168b4effe8ebd08a6465992c209fcf960a403a Mon Sep 17 00:00:00 2001 From: Jon Ross-Perkins Date: Tue, 3 Sep 2024 16:13:49 -0700 Subject: [PATCH] Move toolchain architecture to markdown (#4242) Note I'm mostly trying to capture [the docs](https://docs.google.com/document/d/1RRYMm42osyqhI2LyjrjockYCutQ5dOf8Abu50kTrkX0/edit?resourcekey=0-kHyqOESbOHmzZphUbtLrTw&tab=t.0) as they exist today, not fixing issues with the docs. I think the doc itself hasn't changed much lately (i.e., for months). Trying to organize it a little better though, particularly so that it shows up reasonably when looking in github or the website. --------- Co-authored-by: Geoff Romer --- toolchain/README.md | 4 +- toolchain/docs/README.md | 94 ++++ toolchain/docs/adding_features.md | 433 ++++++++++++++++ toolchain/docs/check.md | 616 +++++++++++++++++++++++ toolchain/docs/check.svg | 1 + toolchain/docs/diagnostics.md | 230 +++++++++ toolchain/docs/driver.md | 22 + toolchain/docs/idioms.md | 424 ++++++++++++++++ toolchain/docs/lex.md | 44 ++ toolchain/docs/lower.md | 25 + toolchain/docs/parse.md | 802 ++++++++++++++++++++++++++++++ toolchain/docs/parse.svg | 1 + website/prebuild.py | 4 +- 13 files changed, 2696 insertions(+), 4 deletions(-) create mode 100644 toolchain/docs/README.md create mode 100644 toolchain/docs/adding_features.md create mode 100644 toolchain/docs/check.md create mode 100644 toolchain/docs/check.svg create mode 100644 toolchain/docs/diagnostics.md create mode 100644 toolchain/docs/driver.md create mode 100644 toolchain/docs/idioms.md create mode 100644 toolchain/docs/lex.md create mode 100644 toolchain/docs/lower.md create mode 100644 toolchain/docs/parse.md create mode 100644 toolchain/docs/parse.svg diff --git a/toolchain/README.md b/toolchain/README.md index 33c36780d7631..43150b9641188 100644 --- a/toolchain/README.md +++ b/toolchain/README.md @@ -6,6 +6,4 @@ Exceptions. See /LICENSE for license information. 
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception --> -A design is currently maintained in -[Google Drive](https://docs.google.com/document/d/1RRYMm42osyqhI2LyjrjockYCutQ5dOf8Abu50kTrkX0/edit?resourcekey=0-kHyqOESbOHmzZphUbtLrTw). -It'll be migrated to markdown once we are confident in its stability. +See [docs](docs/). diff --git a/toolchain/docs/README.md b/toolchain/docs/README.md new file mode 100644 index 0000000000000..fcb9426b69c9d --- /dev/null +++ b/toolchain/docs/README.md @@ -0,0 +1,94 @@ +# Toolchain architecture + + + + + +## Table of contents + +- [Goals](#goals) +- [High-level architecture](#high-level-architecture) + - [Design patterns](#design-patterns) +- [Adding features](#adding-features) + + + +## Goals + +The toolchain represents the production portion of Carbon. At a high level, the +toolchain's top priorities are: + +- Correctness. +- Quality of generated code, including performance. +- Compilation performance. +- Quality of diagnostics for incorrect or questionable code. + +TODO: Add an expanded document that details the goals and priorities and link to +it here. + +## High-level architecture + +The main components are: + +- [Driver](driver.md): Provides commands and ties together compilation flow. +- [Diagnostics](diagnostics.md): Produces diagnostic output. +- Compilation flow: + + 1. Source: Load the file into a + [SourceBuffer](/toolchain/source/source_buffer.h). + 2. [Lex](lex.md): Transform a SourceBuffer into a + [Lex::TokenizedBuffer](/toolchain/lex/tokenized_buffer.h). + 3. [Parse](parse.md): Transform a TokenizedBuffer into a + [Parse::Tree](/toolchain/parse/tree.h). + 4. [Check](check.md): Transform a Tree to produce + [SemIR::File](/toolchain/sem_ir/file.h). + 5. [Lower](lower.md): Transform the SemIR to an + [LLVM Module](https://llvm.org/doxygen/classllvm_1_1Module.html). + 6. CodeGen: Transform the LLVM Module into an Object File. 
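The flow above can be pictured as a chain of functions, each consuming the previous step's output structure and producing its own. The sketch below is only an illustration under assumed names: `DemoSource`, `DemoTokens`, `DemoTree`, and the `Demo*` functions are hypothetical stand-ins, not the toolchain's `SourceBuffer`, `Lex::TokenizedBuffer`, or `Parse::Tree` APIs, and the "lexer" merely splits on whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real buffers; names are illustrative only.
struct DemoSource {
  std::string text;
};

struct DemoTokens {
  std::vector<std::string> tokens;
};

struct DemoTree {
  // One node per token, stored in a flat vector rather than as heap nodes.
  std::vector<std::int32_t> token_indices;
};

// Step 1: source text -> tokens. This toy lexer splits on whitespace only.
inline auto DemoLex(const DemoSource& source) -> DemoTokens {
  DemoTokens result;
  std::istringstream stream(source.text);
  std::string token;
  while (stream >> token) {
    result.tokens.push_back(token);
  }
  return result;
}

// Step 2: tokens -> tree. The step reads only the previous step's output.
inline auto DemoParse(const DemoTokens& tokens) -> DemoTree {
  DemoTree tree;
  for (std::size_t i = 0; i < tokens.tokens.size(); ++i) {
    tree.token_indices.push_back(static_cast<std::int32_t>(i));
  }
  return tree;
}

// Chains the steps; no callbacks pass data between the structures.
inline auto DemoLexAndParse(const DemoSource& source) -> DemoTree {
  DemoTokens tokens = DemoLex(source);
  DemoTree tree = DemoParse(tokens);
  assert(tree.token_indices.size() == tokens.tokens.size());
  return tree;
}
```

Because every step returns a complete structure, each one can be dumped or tested in isolation, which is the property the later phases of the real flow are also meant to have.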
+ +### Design patterns + +A few common design patterns are: + +- Distinct steps: Each step of processing produces an output structure, + avoiding callbacks passing data between structures. + + - For example, the parser takes a `Lex::TokenizedBuffer` as input and + produces a `Parse::Tree` as output. + + - Performance: It should yield better locality versus a callback approach. + + - Understandability: Each step has a clear input and output, versus + callbacks which obscure the flow of data. + +- Vectorized storage: Data is stored in vectors and flyweights are passed + around, avoiding more typical heap allocation with pointers. + + - For example, the parse tree is stored as a + `llvm::SmallVector` indexed by `Parse::Node` + which wraps an `int32_t`. + + - Performance: Vectorization both minimizes memory allocation overhead and + enables better read caching because adjacent entries will be cached + together. + +- Iterative processing: We rely on state stacks and iterative loops for + parsing, avoiding recursive function calls. + + - For example, the parser has a `Parse::State` enum tracked in + `state_stack_`, and loops in `Parse::Tree::Parse`. + + - Scalability: Complex code must not cause recursion issues. We have + experience in Clang seeing stack frame recursion limits being hit in + unexpected ways, and non-recursive approaches largely avoid that risk. + +See also [Idioms](idioms.md) for abbreviations and more implementation +techniques. + +## Adding features + +We have a [walkthrough for adding features](adding_features.md). 
diff --git a/toolchain/docs/adding_features.md b/toolchain/docs/adding_features.md new file mode 100644 index 0000000000000..abe341316beb3 --- /dev/null +++ b/toolchain/docs/adding_features.md @@ -0,0 +1,433 @@ +# Adding features + + + + + +## Table of contents + +- [Lex](#lex) +- [Parse](#parse) + - [Typed parse node metadata implementation](#typed-parse-node-metadata-implementation) +- [Check](#check) + - [SemIR typed instruction metadata implementation](#semir-typed-instruction-metadata-implementation) +- [Lower](#lower) +- [Tests and debugging](#tests-and-debugging) + - [Running tests](#running-tests) + - [Updating tests](#updating-tests) + - [Reviewing test deltas](#reviewing-test-deltas) + - [Verbose output](#verbose-output) + - [Stack traces](#stack-traces) + + + +## Lex + +New lexed tokens must be added to +[token_kind.def](/toolchain/lex/token_kind.def). `CARBON_SYMBOL_TOKEN` and +`CARBON_KEYWORD_TOKEN` both provide some built-in lexing logic, while +`CARBON_TOKEN` requires custom lexing support. + +[TokenizedBuffer::Lex](/toolchain/lex/tokenized_buffer.h) is the main dispatch +for lexing, and calls that need to do custom lexing will be dispatched there. + +## Parse + +A parser feature will have state transitions that produce new parse nodes. + +The resulting parse nodes are in +[parse/node_kind.def](/toolchain/parse/node_kind.def) and +[typed_nodes.h](/toolchain/parse/typed_nodes.h). When choosing node structure, +consider how semantics will process it in post-order; this will rule out some +designs. Adding a parse node kind will also require a handler in the `Check` +step. + +The state transitions are in [parse/state.def](/toolchain/parse/state.def). Each +`CARBON_PARSER_STATE` defines a distinct state and has comments for state +transitions. If several states should share handling, name them +`FeatureAsVariant`. + +Adding a state requires adding a `Handle` function in an appropriate +`parse/handle_*.cpp` file, possibly a new file. 
The macros are used to generate +declarations in the header, so only extra helper functions should be added +there. Every state handler pops the state from the stack before any other +processing. + +### Typed parse node metadata implementation + +As of [#3534](https://github.com/carbon-language/carbon-lang/pull/3534): + +![parse](parse.svg) + +> TODO: Convert this chart to Mermaid. + +- [common/enum_base.h](/common/enum_base.h) defines the `EnumBase` + [CRTP](idioms.md#crtp-or-curiously-recurring-template-pattern) class + extending `Printable` from [common/ostream.h](/common/ostream.h), along with + `CARBON_ENUM` macros for making enumerations + +- [parse/node_kind.h](/toolchain/parse/node_kind.h) includes + [common/enum_base.h](/common/enum_base.h) and defines an enumeration + `NodeKind`, along with bitmask enum `NodeCategory`. + + - The `NodeKind` enumeration is populated with the list of all parse node + kinds using [parse/node_kind.def](/toolchain/parse/node_kind.def) (using + [the .def file idiom](idioms.md#def-files)) _declared_ in this file + using a macro from [common/enum_base.h](/common/enum_base.h) + + - `NodeKind` has a member type `NodeKind::Definition` that extends + `NodeKind` and adds a `NodeCategory` field (and others in the future). + + - `NodeKind` has a method `Define` for creating a `NodeKind::Definition` + with the same enumerant value, plus values for the other fields. + + - `HasKindMember` at the bottom of + [parse/node_kind.h](/toolchain/parse/node_kind.h) uses + [field detection](idioms.md#field-detection) to determine if the type + `T` has a `NodeKind::Definition Kind` static constant member. + + - Note: both the type and name of these fields must match exactly. + + - Note that additional information is needed to define the `category()` + method (and other methods in the future) of `NodeKind`. This information + comes from the typed parse node definitions in + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) (described below). 
+ +- [parse/node_ids.h](/toolchain/parse/node_ids.h) defines a number of types + that store a _node id_ that identifies a node in the parse tree + + - `NodeId` stores a node id with no restrictions + + - `NodeIdForKind<Kind>` inherits from `NodeId` and stores the id of a node + that must have the specified `NodeKind` "`Kind`". Note that this is not + used directly, instead aliases `FooId` for + `NodeIdForKind<NodeKind::Foo>` are defined for every node kind using + [parse/node_kind.def](/toolchain/parse/node_kind.def) (using + [the .def file idiom](idioms.md#def-files)). + + - `NodeIdInCategory<Category>` inherits from `NodeId` and stores the id of + a node that must overlap the specified `NodeCategory` "`Category`". Note + that this is not typically used directly, instead this file defines + aliases `AnyDeclId`, `AnyExprId`, ..., `AnyStatementId`. + + - Similarly `NodeIdOneOf<T, U>` and `NodeIdNot<V>` inherit from `NodeId` + and store the id of a node restricted to either matching `T::Kind` or + `U::Kind` or not matching `V::Kind`. + - In addition to the node id type definitions above, the struct + `NodeForId<T>` is declared but not defined. + +- [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) defines a typed parse + node struct type for each kind of parse node. + + - Each one defines a static constant named `Kind` that is set using a call + to `Define()` on the corresponding enumerant member of `NodeKind` from + [parse/node_kind.h](/toolchain/parse/node_kind.h) (which is included by + this file). + - The fields of these types specify the children of the parse node using + the types from [parse/node_ids.h](/toolchain/parse/node_ids.h). + + - The struct `NodeForId<T>` that is declared in + [parse/node_ids.h](/toolchain/parse/node_ids.h) is defined in this file + such that `NodeForId<FooId>::TypedNode` is the `Foo` typed parse node + struct type. 
+ + - This file will fail to compile unless every kind of parse node kind + defined in [parse/node_kind.def](/toolchain/parse/node_kind.def) has a + corresponding struct type in this file. + +- [parse/node_kind.cpp](/toolchain/parse/node_kind.cpp) includes both + [parse/node_kind.h](/toolchain/parse/node_kind.h) and + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) + + - Uses the macro from [common/enum_base.h](/common/enum_base.h), the + enumerants of `NodeKind` are _defined_ using the list of parse node + kinds from [parse/node_kind.def](/toolchain/parse/node_kind.def) (using + [the .def file idiom](idioms.md#def-files)). + + - `NodeKind::definition()` is defined. It has a static table of + `const NodeKind::Definition*` indexed by the enum value, populated by + taking the address of the `Kind` member of each typed parse node struct + type, using the list from + [parse/node_kind.def](/toolchain/parse/node_kind.def). + + - `NodeKind::category()` is defined using `NodeKind::definition()`. + + - Tested assumption: the tables built in this file are indexed by the enum + values. We rely on the fact that we get the parse node kinds in the same + order by consistently using + [parse/node_kind.def](/toolchain/parse/node_kind.def). + +- [parse/tree.h](/toolchain/parse/tree.h) includes + [parse/node_ids.h](/toolchain/parse/node_ids.h). It does not depend on + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) to reduce compilation + time in those files that don't use the typed parse node struct types. + + - Defines `Tree::Extract`... functions that take a node id and return a + typed parse node struct type from + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h). + + - Uses `HasKindMember` to restrict calling `ExtractAs` except on typed + nodes defined in [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h). 
+ + - `Tree::Extract` uses `NodeForId` to get the corresponding typed parse + node struct type for a `FooId` type defined in + [parse/node_ids.h](/toolchain/parse/node_ids.h). + + - Note that this is done without a dependency on the typed parse node + struct types by using the forward declaration of `NodeForId` from + [parse/node_ids.h](/toolchain/parse/node_ids.h). + + - The `Tree::Extract`... functions ultimately call + `Tree::TryExtractNodeFromChildren`, which is a templated function + only declared in this file. Its definition is in + [parse/extract.cpp](/toolchain/parse/extract.cpp). + +- [parse/extract.cpp](/toolchain/parse/extract.cpp) includes + [parse/tree.h](/toolchain/parse/tree.h) and + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) + + - Defines struct `Extractable` that defines how to extract a field of + type `T` from a `Tree::SiblingIterator` pointing at the corresponding + child node. + + - `Extractable` is defined for the node id types defined in + [parse/node_ids.h](/toolchain/parse/node_ids.h). + + - In addition, `Extractable` is defined for standard types + `std::optional` and `llvm::SmallVector`, to support optional and + repeated children. + + - Uses [struct reflection](idioms.md#struct-reflection) to support + aggregate struct types containing extractable fields. This is used to + support typed parse node struct types as well as struct fields that they + contain. + + - Uses `HasKindMember` to detect accidental uses of a parse node type + directly as fields of typed parse node struct types -- in those places + `FooId` should be used instead. + + - Defines `Tree::TryExtractNodeFromChildren` and explicitly + instantiates it for every typed parse node struct type defined in + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) using + [parse/node_kind.def](/toolchain/parse/node_kind.def) (using + [the .def file idiom](idioms.md#def-files)). 
By explicitly instantiating + this function only in this file, we avoid redundant compilation work, + which reduces build times, and allow us to keep all the extraction + machinery as a private implementation detail of this file. + +- [parse/typed_nodes_test.cpp](/toolchain/parse/typed_nodes_test.cpp) + validates that each typed parse node struct type has a static `Kind` member + that defines the correct corresponding `NodeKind`, and that the `category()` + function agrees between the `NodeKind` and `NodeKind::Definition`. + +Note: this is broadly similar to +[SemIR typed instruction metadata implementation](#semir-typed-instruction-metadata-implementation). + +## Check + +Each parse node kind requires adding a `Handle` function in a +`check/handle_*.cpp` file. + +If the resulting SemIR needs a new instruction: + +- add a new kind to [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def) + - Add a `CARBON_SEM_IR_INST_KIND(NewInstKindName)` line in alphabetical + order +- a new struct definition to + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h), such as: + + ```cpp + struct NewInstKindName { + static constexpr auto Kind = InstKind::NewInstKindName.Define( + // the name used in textual IR + "new_inst_kind_name" + // Optional: , TerminatorKind::KindOfTerminator + ); + + // Optional: omit if not associated with a parse node. + Parse::Node parse_node; + + // Optional: omit if this sem_ir instruction does not produce a value. + TypeId type_id; + + // 0-2 id fields, with types from sem_ir/ids.h or sem_ir/builtin_kind.h + // For example, fields would look like: + StringId name_id; + InstId value_id; + }; + ``` + +Adding an instruction will also require a handler in the Lower step. + +Most new instructions will automatically be formatted reasonably by the SemIR +formatter. + +If the resulting SemIR needs a new built-in, add it to +[builtin_inst_kind.def](/toolchain/sem_ir/builtin_inst_kind.def). 
+ +### SemIR typed instruction metadata implementation + +How does this work? As of +[#3310](https://github.com/carbon-language/carbon-lang/pull/3310): + +![check](check.svg) + +> TODO: Convert this chart to Mermaid. + +- [common/enum_base.h](/common/enum_base.h) defines the `EnumBase` + [CRTP](idioms.md#crtp-or-curiously-recurring-template-pattern) class + extending `Printable` from [common/ostream.h](/common/ostream.h), along with + `CARBON_ENUM` macros for making enumerations + +- [sem_ir/inst_kind.h](/toolchain/sem_ir/inst_kind.h) includes + [common/enum_base.h](/common/enum_base.h) and defines an enumeration + `InstKind`, along with `InstValueKind` and `TerminatorKind`. + + - The `InstKind` enumeration is populated with the list of all instruction + kinds using [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def) + (using [the .def file idiom](idioms.md#def-files)) _declared_ in this + file using a macro from [common/enum_base.h](/common/enum_base.h) + + - `InstKind` has a member type `InstKind::Definition` that extends + `InstKind` and adds the `ir_name` string field, and a `TerminatorKind` + field. + + - `InstKind` has a method `Define` for creating a `InstKind::Definition` + with the same enumerant value, plus values for the other fields. + +- Note that additional information is needed to define the `ir_name()`, + `value_kind()`, and `terminator_kind()` methods of `InstKind`. This + information comes from the typed instruction definitions in + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h). + +- [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h) defines a typed + instruction struct type for each kind of SemIR instruction, as described + above. + + - Each one defines a static constant named `Kind` that is set using a call + to `Define()` on the corresponding enumerant member of `InstKind` from + [sem_ir/inst_kind.h](/toolchain/sem_ir/inst_kind.h) (which is included + by this file). 
+ +- `HasParseNodeMember` and `HasTypeIdMember` at the + bottom of [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h) use + [field detection](idioms.md#field-detection) to determine if `TypedInst` has + a `Parse::Node parse_node` or a `TypeId type_id` field respectively. + + - Note: both the type and name of these fields must match exactly. + +- [sem_ir/inst_kind.cpp](/toolchain/sem_ir/inst_kind.cpp) includes both + [sem_ir/inst_kind.h](/toolchain/sem_ir/inst_kind.h) and + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h) + + - Uses the macro from [common/enum_base.h](/common/enum_base.h), the + enumerants of `InstKind` are _defined_ using the list of instruction + kinds from [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def) + (using [the .def file idiom](idioms.md#def-files)) + + - `InstKind::value_kind()` is defined. It has a static table of + `InstValueKind` values indexed by the enum value, populated by applying + `HasTypeIdMember` from + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h) to every + instruction kind by using the list from + [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def). + - `InstKind::definition()` is defined. It has a static table of + `const InstKind::Definition*` indexed by the enum value, populated by + taking the address of the `Kind` member of each `TypedInst`, using the + list from [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def). + + - `InstKind::ir_name()` and `InstKind::terminator_kind()` are defined + using `InstKind::definition()`. + - Tested assumption: the tables built in this file are indexed by the enum + values. We rely on the fact that we get the instruction kinds in the + same order by consistently using + [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def). 
+ + - This file will fail to compile unless every kind of SemIR instruction + defined in [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def) has a + corresponding struct type in + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h). + +- `TypedInstArgsInfo` defined in + [sem_ir/inst.h](/toolchain/sem_ir/inst.h) uses + [struct reflection](idioms.md#struct-reflection) to determine the other + fields from `TypedInst`. It skips the `parse_node` and `type_id` fields + using `HasParseNodeMember` and `HasTypeIdMember`. + + - Tested assumption: the `parse_node` and `type_id` are the first fields + in `TypedInst`, and there are at most two more fields. + +- [sem_ir/inst.h](/toolchain/sem_ir/inst.h) defines templated conversions + between `Inst` and each of the typed instruction structs: + + - Uses `TypedInstArgsInfo`, `HasParseNodeMember`, + and `HasTypeIdMember`, and + [local lambda](idioms.md#local-lambdas-to-reduce-duplicate-code). + + - Defines a templated `ToRaw` function that converts the various id field + types to an `int32_t`. + - Defines a templated `FromRaw` function that converts an `int32_t` to + `T` to perform the opposite conversion. + - Tested assumption: The `parse_node` field is first, when present, and + the `type_id` is next, when present, in each `TypedInst` struct type. + +- The "tested assumptions" above are all tested by + [sem_ir/typed_insts_test.cpp](/toolchain/sem_ir/typed_insts_test.cpp) + +## Lower + +Each SemIR instruction requires adding a `Handle` function in a +`lower/handle_*.cpp` file. + +## Tests and debugging + +### Running tests + +Tests are run in bulk as `bazel test //toolchain/...`. Many tests are using the +file_test infrastructure; see +[testing/file_test/README.md](/testing/file_test/README.md) for information. + +There are several supported ways to run Carbon on a given test file. 
For +example, with `toolchain/parse/testdata/basics/empty.carbon`: + +- `bazel test //toolchain/testing:file_test --test_arg=--file_tests=toolchain/parse/testdata/basics/empty.carbon` + - Executes an individual test. +- `bazel run //toolchain/parse:testdata/basics/empty.carbon.run` + - Runs `carbon` on the file with standard arguments, printing output to + console. + - This form will often be most useful when iterating over a specific test. +- `bazel run //toolchain/parse:testdata/basics/empty.carbon.verbose` + - Similar to the previous command, but with the `-v` flag implied. +- `bazel run //toolchain/driver:carbon -- compile --phase=parse --dump-parse-tree toolchain/parse/testdata/basics/empty.carbon` + - Explicitly runs `carbon` with the provided arguments. +- `bazel-bin/toolchain/driver/carbon compile --phase=parse --dump-parse-tree toolchain/parse/testdata/basics/empty.carbon` + - Similar to the previous command, but without using `bazel`. + +### Updating tests + +The `toolchain/autoupdate_testdata.py` script can be used to update output. It +invokes the `file_test` autoupdate support. See +[testing/file_test/README.md](/testing/file_test/README.md) for file syntax. + +#### Reviewing test deltas + +Using `autoupdate_testdata.py` can be useful to produce deltas during the +development process because it allows `git status` and `git diff` to be used to +examine what changed. + +### Verbose output + +The `-v` flag can be passed to trace state, and should be specified before the +subcommand name: `carbon -v compile ...`. `CARBON_VLOG` is used to print output +in this mode. There is currently no control over the degree of verbosity. + +### Stack traces + +While the iterative processing pattern means function stack traces will have +minimal context for how the current function is reached, we use LLVM's +`PrettyStackTrace` to include details about the state stack. The state stack +will be above the function stack in crash output. 
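To illustrate why a state stack carries more context than a function stack, here is a minimal sketch that uses only the standard library. It does not use LLVM's `PrettyStackTrace`; `DemoState`, `DumpStateStack`, and `DemoCheckNesting` are hypothetical names, and the "parser" only checks bracket nesting.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical states; the toolchain's `Parse::State` enum is much larger.
enum class DemoState { File, Paren, Brace };

inline auto DemoStateName(DemoState state) -> const char* {
  switch (state) {
    case DemoState::File:
      return "File";
    case DemoState::Paren:
      return "Paren";
    case DemoState::Brace:
      return "Brace";
  }
  return "?";
}

// Renders the state stack, outermost state first, as crash context could.
inline auto DumpStateStack(const std::vector<DemoState>& stack)
    -> std::string {
  std::string out = "State stack:";
  for (DemoState state : stack) {
    out += " ";
    out += DemoStateName(state);
  }
  return out;
}

// Checks bracket nesting with a loop and an explicit stack, not recursion.
// On failure, `context` receives the rendered stack at the point of failure.
inline auto DemoCheckNesting(const std::string& text, std::string& context)
    -> bool {
  std::vector<DemoState> state_stack = {DemoState::File};
  for (char c : text) {
    if (c == '(') {
      state_stack.push_back(DemoState::Paren);
    } else if (c == '{') {
      state_stack.push_back(DemoState::Brace);
    } else if (c == ')' || c == '}') {
      DemoState expected = (c == ')') ? DemoState::Paren : DemoState::Brace;
      assert(!state_stack.empty());
      if (state_stack.back() != expected) {
        context = DumpStateStack(state_stack);
        return false;
      }
      state_stack.pop_back();
    }
  }
  if (state_stack.size() != 1) {
    context = DumpStateStack(state_stack);
    return false;
  }
  return true;
}
```

The function stack at the failure point would show only `DemoCheckNesting`, whereas the rendered state stack records how the input nested to reach that point.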
diff --git a/toolchain/docs/check.md b/toolchain/docs/check.md new file mode 100644 index 0000000000000..4eafd63644fa1 --- /dev/null +++ b/toolchain/docs/check.md @@ -0,0 +1,616 @@ +# Check + + + + + +## Table of contents + +- [Overview](#overview) +- [Postorder processing](#postorder-processing) +- [Key IR concepts](#key-ir-concepts) + - [Parameters and arguments](#parameters-and-arguments) +- [SemIR textual format](#semir-textual-format) + - [Raw form](#raw-form) + - [Formatted IR](#formatted-ir) + - [Instructions](#instructions) + - [Top-level entities](#top-level-entities) +- [Core loop](#core-loop) + - [Node stack](#node-stack) + - [Delayed evaluation (not yet implemented)](#delayed-evaluation-not-yet-implemented) + - [Templates (not yet implemented)](#templates-not-yet-implemented) + - [Rewrites](#rewrites) +- [Types](#types) + - [Type printing (not yet implemented)](#type-printing-not-yet-implemented) +- [Expression categories](#expression-categories) + - [ExprCategory::NotExpression](#exprcategorynotexpression) + - [ExprCategory::Value](#exprcategoryvalue) + - [ExprCategory::DurableReference and ExprCategory::EphemeralReference](#exprcategorydurablereference-and-exprcategoryephemeralreference) + - [ExprCategory::Initializing](#exprcategoryinitializing) + - [ExprCategory::Mixed](#exprcategorymixed) + - [Value bindings](#value-bindings) +- [Handling Parse::Tree errors (not yet implemented)](#handling-parsetree-errors-not-yet-implemented) +- [Alternatives considered](#alternatives-considered) + - [Using a traditional AST representation](#using-a-traditional-ast-representation) + + + +## Overview + +Check takes the parse tree and generates a semantic intermediate representation, +or SemIR. This will look closer to a series of instructions, in preparation for +transformation to LLVM IR. Semantic analysis and type checking occurs during the +production of SemIR. It also does any validation that requires context. 
+ +## Postorder processing + +The checking step is oriented on postorder processing on the `Parse::Tree` to +iterate through the `Parse::NodeImpl` vectorized storage once, in order, as much +as possible. This is primarily for performance, but also relies on the +[information accumulation principle](/docs/project/principles/information_accumulation.md): +that is, when that principle applies, we should be able to generate IR +immediately because we can rely on the principle that when a line is processed, +the information necessary to semantically check that line is already available. + +Indirectly, what this really means is that we should be able to go from a +Parse::Tree (which cannot be used for name lookups) to a SemIR with name lookups +completed in a single pass. The SemIR should not need to be re-processed to add +more information outside of templates. By doing this, we avoid an additional +processing pass with associated storage needs. + +This single-pass approach also means that the checking step does not make use of +the tree structure of the `Parse::Tree`. In cases where the actions performed +for a parse tree node depend on the context in which that node appears, a node +that is visited earlier in the postorder traversal, such as a bracketing node, +needs to establish the necessary context. In this respect, the sequence of +`Parse::Node`s can be thought of as a byte code input that the check step +interprets to build the `SemIR`. + +## Key IR concepts + +A `SemIR::Inst` is the basic building block that represents a simple +instruction, such as an operator or declaring a literal. For each kind of +instruction, a typedef for that specific kind of instruction is provided in the +`SemIR` namespace. For example, `SemIR::Assign` represents an assignment +instruction, and `SemIR::PointerType` represents a pointer type instruction. 
+ +Each instruction class has up to four public data members describing the +instruction, as described in +[sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h) (also see +[adding features for Check](adding_features.md#check)): + +- A `Parse::Node parse_node;` member that tracks its location is present on + almost all instructions, except instructions like `SemIR::Builtin` that + don't have an associated location. + +- A `SemIR::TypeId type_id;` member that describes the type of the instruction + is present on all instructions that produce a value. This includes namespace + instructions, which are modeled as producing a value of "namespace" type, + even though they can't be used as a first-class value in Carbon expressions. + +- Up to two additional, kind-specific members. For example `SemIR::Assign` has + members `InstId lhs_id` and `InstId rhs_id`. + +Instructions are stored as type-erased `SemIR::Inst` objects, which store the +instruction kind and the (up to) four fields described above. This balances the +size of `SemIR::Inst` against the overhead of indirection. + +A `SemIR::InstBlock` can represent a code block. However, it can also be created +when a series of instructions needs to be closely associated, such as a +parameter list. + +A `SemIR::Builtin` represents a language built-in, such as the unconstrained +facet type `type`. We will also have built-in functions which would need to form +the implementation of some library types, such as `i32`. Built-ins are in a +stable index across `SemIR` instances. + +### Parameters and arguments + +Parameters and arguments will be stored as two `SemIR::InstBlock`s each. The +first will contain the full IR, while the second will contain references to the +last instruction for each parameter or argument. The references block will have +a size equal to the number of parameters or arguments, allowing for quick size +comparisons and indexed access. 
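The two-block layout described above can be sketched as follows. This is an illustration under assumed names (`DemoInstId`, `DemoParamBlocks`, and the helper functions are hypothetical), not the toolchain's `SemIR::InstBlock` API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flyweight id wrapping an index into instruction storage.
struct DemoInstId {
  std::int32_t index;
};

// Hypothetical pairing of the two blocks kept for parameters or arguments.
struct DemoParamBlocks {
  // Every instruction needed to form the parameters or arguments, in order.
  std::vector<DemoInstId> full_ir;
  // One entry per parameter or argument: the id of its last instruction.
  std::vector<DemoInstId> refs;
};

// Arity check: only the sizes of the reference blocks are compared.
inline auto DemoArityMatches(const DemoParamBlocks& params,
                             const DemoParamBlocks& args) -> bool {
  return params.refs.size() == args.refs.size();
}

// Indexed access to the instruction that produces the i-th argument.
inline auto DemoArgInst(const DemoParamBlocks& args, std::size_t i)
    -> DemoInstId {
  assert(i < args.refs.size());
  return args.refs[i];
}
```

Here an argument that takes several instructions to compute still contributes exactly one entry to `refs`, so the arity comparison never has to walk `full_ir`.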
+ +## SemIR textual format + +There are two textual ways to view `SemIR`. + +### Raw form + +The raw form of SemIR shows the details of the representation, such as numeric +instruction and block IDs. The representation is intended to very closely match +the `SemIR::File` and `SemIR::Inst` representations. This can be useful when +debugging low-level issues with the `SemIR` representation. + +The driver will print this when passed `--dump-raw-sem-ir`. + +### Formatted IR + +In addition to the raw form, there is a higher-level formatted IR that aims to +be human readable. This is used in most `check` tests to validate the output, +and also expected to be used regularly by toolchain developers to inspect the +result of checking the parse tree. + +The driver will print this when passed `--dump-sem-ir`. + +Unlike the raw form, certain representational choices in the `SemIR` data may +not be visible in this form. However, it is intended to be possible to parse the +`SemIR` output and form an equivalent – but not necessarily identical – `SemIR` +representation, although no such parser currently exists. 
+ +As an example, given the program: + +```carbon +fn Cond() -> bool; +fn Run() -> i32 { return if Cond() then 1 else 2; } +``` + +The formatted IR is currently: + +``` +constants { + %.1: i32 = int_literal 1 [template] + %.2: i32 = int_literal 2 [template] +} + +file { + package: = namespace [template] { + .Cond = %Cond + .Run = %Run + } + %Cond: = fn_decl @Cond [template] { + %return.var.loc1: ref bool = var + } + %Run: = fn_decl @Run [template] { + %return.var.loc2: ref i32 = var + } +} + +fn @Cond() -> bool; + +fn @Run() -> i32 { +!entry: + %Cond.ref: = name_ref Cond, file.%Cond [template = file.%Cond] + %.loc2_33.1: init bool = call %Cond.ref() + %.loc2_26.1: bool = value_of_initializer %.loc2_33.1 + %.loc2_33.2: bool = converted %.loc2_33.1, %.loc2_26.1 + if %.loc2_33.2 br !if.expr.then else br !if.expr.else + +!if.expr.then: + %.loc2_41: i32 = int_literal 1 [template = constants.%.1] + br !if.expr.result(%.loc2_41) + +!if.expr.else: + %.loc2_48: i32 = int_literal 2 [template = constants.%.2] + br !if.expr.result(%.loc2_48) + +!if.expr.result: + %.loc2_26.2: i32 = block_arg !if.expr.result + return %.loc2_26.2 +} +``` + +There are three kinds of names in formatted IR, which are distinguished by their +leading sigils: + +- `%name` denotes a value produced by an instruction. These names are + introduced by a line of the form `%name: <category> <type> = <instruction>`, + and are scoped to the enclosing top-level entity. `<category>` describes the + [expression category](#expression-categories), which is `init` for an + initializing expression, `ref` for a reference expression, or omitted for a + value expression. Typically, values can only be referenced by instructions + that their introduction + [dominates](), but + some kinds of instruction might have other rules. Names in the `file` block + can be referenced as `file.%<name>`. + +- `!name` denotes a label, and `!name:` appears as a prefix of each + `InstBlock` in a `Function`. 
These names are scoped to their enclosing + function, and can be referenced anywhere in that function, but not outside. + +- `@name` denotes a top-level entity, such as a function, class, or interface. + The SemIR view of these entities is flattened, so member functions are + treated as top-level entities. + +Names in formatted IR are all invented by the formatter, and generally are of +the form `[.loc[_[.]]]` where `` and +`` describe the location of the instruction, and `` is used as a +disambiguator if multiple instructions appear at the same location. Trailing +name components are only included if they are necessary to disambiguate the +name. `` is a guessed good name for the instruction, often derived +from source-level identifiers, and is empty if no guess was made. + +#### Instructions + +There is usually one line in a `InstBlock` for each `Inst`. You can find the +documentation for the different kinds of instructions in +[toolchain/sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h). For example, +given a formatted SemIR line like: + +``` +%N: i32 = assoc_const_decl N [template] +``` + +you would look for a `struct` definition that uses `"assoc_const_decl"` as its +`ir_name`. In this case, this is the `AssociatedConstantDecl` instruction: + +```cpp +// An associated constant declaration in an interface, such as `let T:! type;`. +struct AssociatedConstantDecl { + static constexpr auto Kind = + InstKind::AssociatedConstantDecl.Define( + {.ir_name = "assoc_const_decl", .is_lowered = false}); + + TypeId type_id; + NameId name_id; +}; +``` + +Since this instruction produces a value, it has a `TypeId type_id` field, which +corresponds to the type written between the `:` and the `=`. In the example +above, that type is `i32`. The other arguments to the instruction are written +after the `ir_name` -- in this example the `name_id` is `N`. From this we find +that the instruction corresponds to an associated constant declaration in an +interface like `let N:! 
i32;`. + +Instructions producing a constant value, like `assoc_const_decl` above, are +followed by their phase, either `[symbolic]` or `[template]`, and then `=` the +value if it is the value of a different instruction. + +Instructions that do not produce a value, such as the `br` and `return` +instructions above, omit the leading `%name: ... =` prefix, as they cannot be +named by other instructions. These instructions do not have a `TypeId type_id` +field, like the `AdaptDecl` instruction: + +```cpp +// An adapted type declaration in a class, of the form `adapt T;`. +struct AdaptDecl { + static constexpr auto Kind = InstKind::AdaptDecl.Define( + {.ir_name = "adapt_decl", .is_lowered = false}); + + // No type_id; this is not a value. + TypeId adapted_type_id; +}; +``` + +An `adapt SomeClass;` declaration would have the corresponding SemIR formatted +as: + +``` +adapt_decl %SomeClass +``` + +Some instructions have special argument handling. For example, some invalid +arguments will be omitted. Or an `InstBlockId` argument will be rendered inline, +commonly enclosed in braces `{`...`}` or parens `(`...`)`. In other cases, the +formatter will combine instructions together to make the IR more readable: + +- A terminator sequence in a block, comprising a sequence of `BranchIf` + instructions followed by a `Branch` or `BranchWithArg` instruction, is + collapsed into a single + `if %cond br !label1 else if ... else br !labelN(%arg)` line. +- A struct type, formed by a sequence of `StructTypeField` instructions + followed by a `StructType` instruction, is collapsed into a single + `struct_type{.field1: %value1, ..., .fieldN: %valueN}` line. + +These exceptions may be found in +[toolchain/sem_ir/formatter.cpp](/toolchain/sem_ir/formatter.cpp). + +#### Top-level entities + +**Question:** Are these too in flux to document at this time? + +- `constants`: TODO +- `imports`: TODO +- `file`: TODO +- entities + - TODO: may be preceded by `extern`. 
+ - TODO: may be preceded by `generic`. + - These may have an optional `!definition:` section containing the + generic's `definition_block_id`. + - `fn`: TODO; followed by `= "`...`"` for builtins + - `class`: TODO + - `interface`: TODO + - `impl`: TODO +- `specific`: TODO + - body in braces `{`...`}` has a bunch of + `` => ` assignment lines + - The first lines of the body describe the declaration + - If there is a valid definition, there are additional definition + assignments after a `!definition:` line. + +## Core loop + +The core loop is `Check::CheckParseTree`. This loops through the `Parse::Tree` +and calls a `Handle`... function corresponding to the `NodeKind` of each node. +Communication between these functions for different nodes working together is +through the `Context` object defined in +[check/context.h](/toolchain/check/context.h), which stores things in a +collection of stacks. The common pattern is that the children of a node are +processed first. They produce information that is then consumed when processing +the parent node. + +One example of this pattern is expressions. Each subexpression outputs SemIR +instructions to compute the value of that subexpression to the current +instruction block, added to the top of the `InstBlockStack` stored in the +`Context` object. It leaves an instruction id on the top of the +[node stack](#node-stack) pointing to the instruction that produces the value of +that subexpression. Those are consumed by parent operations, like an +[RPN](https://en.wikipedia.org/wiki/Reverse_Polish_notation) calculator. 
For +example, the expression `1 * 2 + 3` corresponds to this parse tree: + +```yaml + {kind: 'IntegerLiteral', text: '1'}, + {kind: 'IntegerLiteral', text: '2'}, + {kind: 'InfixOperator', text: '*', subtree_size: 3}, + {kind: 'IntegerLiteral', text: '3'}, +{kind: 'InfixOperator', text: '+', subtree_size: 5}, +``` + +This parse tree is processed by one call to a `Handle` function per node: + +- The first node is an integer literal, so the core loop calls + `HandleIntegerLiteral`. + + - It calls `context::AddInstAndPush` to output a `SemIR::IntegerLiteral` + instruction to the current instruction block, and pushes the parse node + along with the instruction id to the [node stack](#node-stack). + +- The second node is also an integer literal, which outputs a second + instruction and pushes another entry onto the node stack. + +- `HandleInfixOperator` pops the two entries off of the node stack, outputs + any conversion instructions that are needed, and uses + `context::AddInstAndPush` to create and push the instruction id representing + the output of a multiplication instruction. That multiplication instruction + takes the instruction ids it popped off the stack at the beginning as + arguments. + +- Another integer literal instruction is created for `3` and pushed onto the + stack. + +- `HandleInfixOperator` is called again. It pops the two instruction ids off + the stack to use as the arguments to the addition instruction it + creates and pushes. + +In this way, the handle functions coordinate producing their output using the +instruction block stack and node stack from the context. + +A similar pattern uses bracketing nodes to support parent nodes that can have a +variable number of children.
For example, a `return` statement can produce parse +trees following a few different patterns: + +- `return;` + + ```yaml + {kind: 'ReturnStatementStart', text: 'return'}, + {kind: 'ReturnStatement', text: ';', subtree_size: 2}, + ``` + +- `return x;` + + ```yaml + {kind: 'ReturnStatementStart', text: 'return'}, + {kind: 'NameExpr', text: 'x'}, + {kind: 'ReturnStatement', text: ';', subtree_size: 3}, + ``` + +- `return var;` + + ```yaml + {kind: 'ReturnStatementStart', text: 'return'}, + {kind: 'ReturnVarModifier', text: 'var'}, + {kind: 'ReturnStatement', text: ';', subtree_size: 3}, + ``` + +In all three cases, the introducer node `ReturnStatementStart` pushes an entry +on the [node stack](#node-stack) with just the parse node and no id, called a +_solo parse node_. The handler for the parent `ReturnStatement` node can pop and +process entries from the node stack until it finds that solo parse node from +`ReturnStatementStart` that indicates it is done. + +Another pattern that arises is that state is set up by an introducer node, updated by +its siblings, and then consumed by the bracketing parent node. FIXME: example + +### Node stack + +The node stack, defined in [check/node_stack.h](/toolchain/check/node_stack.h), +stores pairs of a `Parse::Node` and an id. The type of the id is determined by +the `NodeKind` of the parse node. It is the default, general-purpose stack used +by `Handle`... functions in the check stage. Using a single stack is beneficial +since it improves locality of reference and reduces allocations. However, +additional stacks are used to ensure we never need to search through the stack +to find data -- we always want to be operating on the top of the stack (or a +fixed offset). + +The node stack contains any state pushed by siblings of the current +`Parse::Node` at the top, and state pushed by siblings of ancestors below.
The +boundaries between what is a sibling of the current `Parse::Node` versus what is +a sibling of an ancestor are not explicitly determined. Instead, the handler for +the parent node knows how many nodes it must pop from the stack based either on +knowing the fixed number of children for that node kind or popping nodes until +it reaches a bracketing node. The arity or bracketing node kind for each parent +node is documented in [parse/node_kind.def](/toolchain/parse/node_kind.def). + +When each `Parse::Node` is evaluated, the SemIR for it is typically immediately +generated as `SemIR::Inst`s. To help generate the IR to an appropriate context, +scopes have separate `SemIR::InstBlock`s. + +### Delayed evaluation (not yet implemented) + +Sometimes, nodes will need to have delayed evaluation; for example, an inline +definition of a class member function needs to be evaluated after the class is +fully declared. The `SemIR::Inst`s cannot be immediately generated because they +may include name references to the class. We're likely to store a reference to +the relevant `Parse::Node` for each definition for re-evaluation after the class +scope completes. This means that nodes in a definition would be traversed twice, +once while determining that they're inline and without full checking or IR +generation, then again with full checking and IR generation. + +### Templates (not yet implemented) + +Templates need to have partial semantic checking when declared, but can't be +fully implemented before they're instantiated against a specific type. + +We are likely to generate a partial IR for templates, allowing for checking with +the incomplete information in the IR. Instantiation will likely use that IR and +fill in the missing information, but it could also reevaluate the original +`Parse::Node`s with the known template state. 
+ +### Rewrites + +Carbon relies on rewrites of code, such as rewriting the destination of an +initializer to a specific target object once that object is known. + +We have two ways to achieve this. One is to track the IR location of a +placeholder instruction and, if it needs updating, replace it with a "rewrite" +`SemIR::Inst` that points to a new `SemIR::InstBlock` containing the required IR +and specifying which value is the result of that rewrite. This is expressed in +SemIR as a `splice_block` instruction. Another is to track the list of +instructions to be created separately from the node block stack, and merge those +instructions into the current block once we have decided on their contents. + +## Types + +Type expressions are treated like any other expression, and are modeled as +`SemIR::Inst`s. The types computed by type expressions are deduplicated, +resulting in a canonical `SemIR::TypeId` for each distinct type. + +### Type printing (not yet implemented) + +The `TypeId` preserves only the identity of the type, not its spelling, and so +printing it will produce a fully-resolved type name, which isn't a great user +experience as it doesn't reflect how the type was written in the source code. + +Instead, when printing a type name for use in a diagnostic, we will start with +one of two `InstId`s: + +- A `InstId` for a type expression that describes the way the type was + computed. +- A `InstId` for an expression that has the given type. + +In the former case, the type is pretty-printed by walking the type expression +and printing it. In the latter case, the type of the expression is reconstructed +based on the form of the expression: for example, to print the type of `&x`, we +print the type of `x` and append a `*`, being careful to take potential +precedence issues into account. 
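The second case above (reconstructing a type from the form of an expression) can be illustrated with a small sketch. Since type printing is not yet implemented, everything here is an assumption made for illustration: `SketchExpr`, `PrintedType`, and `PrintTypeOf` are hypothetical names, and the `needs_parens` flag is just one possible way to model the precedence concern.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, heavily simplified expression node for this sketch only.
struct SketchExpr {
  enum class Kind { Name, AddressOf };
  Kind kind;
  // For Name: the type as spelled at the declaration, for example "i32".
  std::string declared_type;
  // For Name: true if the spelling must be parenthesized before a postfix
  // operator such as `*` is appended.
  bool needs_parens;
  // For AddressOf: the operand expression.
  std::shared_ptr<SketchExpr> operand;
};

// A printed type plus whether it needs parentheses as a postfix operand.
struct PrintedType {
  std::string text;
  bool needs_parens;
};

// Reconstructs a type spelling from the form of the expression.
auto PrintTypeOf(const SketchExpr& expr) -> PrintedType {
  switch (expr.kind) {
    case SketchExpr::Kind::Name:
      return {expr.declared_type, expr.needs_parens};
    case SketchExpr::Kind::AddressOf: {
      assert(expr.operand && "AddressOf requires an operand");
      // Print the operand's type, then append `*`, adding parentheses when
      // the operand's spelling would otherwise bind incorrectly.
      PrintedType inner = PrintTypeOf(*expr.operand);
      std::string text =
          inner.needs_parens ? "(" + inner.text + ")" : inner.text;
      return {text + "*", /*needs_parens=*/false};
    }
  }
  return {"<unknown>", false};
}
```

In this sketch, `&x` with `x` declared as `i32` prints as `i32*`, while an operand whose type spelling is marked as needing parentheses (an illustrative `A & B`) prints as `(A & B)*`.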
+ +TODO: This requires being able to print the type of, for example, +`x.foo[0].bar`, by printing only the desired portion of the type of `x`, and +similarly may require handling the case where the type of an expression involves +generic parameters whose arguments are specified by that expression. In effect, +the type computation performed when checking an operation is duplicated into the +type printing logic, but is simpler because errors don't need to be detected. + +This approach means we don't need to preserve a fully-sugared type for each +expression instruction. Instead, we compute that type when we need to print it. + +## Expression categories + +Each `SemIR::Inst` that has an associated type also has an expression category, +which describes how it produces a value of that type. These +`SemIR::ExprCategory` values correspond to the Carbon expression categories +defined in proposal +[#2006](https://github.com/carbon-language/carbon-lang/pull/2006): + +### ExprCategory::NotExpression + +This instruction is not an expression instruction, and doesn't have an +expression category. This is used for namespaces, control flow instructions, and +other constructs that represent some non-expression-level semantics. + +### ExprCategory::Value + +This instruction produces a value using the type's value representation. +Lowering the instruction will produce an LLVM value using that value +representation. + +### ExprCategory::DurableReference and ExprCategory::EphemeralReference + +This instruction produces a reference to an object. Lowering will produce a +pointer to an object representation. + +### ExprCategory::Initializing + +This instruction represents the initialization of an object. Depending on the +initializing representation for the type, the initializing expression +instruction will do one of the following: + +- For an in-place initializing representation, the instruction will store a + value to the target of the initialization. 
+ +- For a by-copy initializing representation, the instruction will produce an + object representation by value that can be stored into the target. This is + currently only used in cases where the object representation and the value + representation are the same. + +- For a type with no initializing representation, such as an empty struct or + tuple, it does neither of the above things. + +Regardless of the initializing representation, an initializing expression should +be consumed by another instruction that finishes the initialization. For a +by-copy initialization, this final instruction represents the store into the +target, whereas in the other cases it is only used to track in SemIR how the +initialization was used. When an in-place initializer uses a by-copy initializer +as a subexpression, an `initialize_from` instruction is inserted to perform this +final store. + +### ExprCategory::Mixed + +This instruction represents a language construct that doesn't have a single +expression category. This is used for struct and tuple literals, where the +elements of the literal can have different expression categories. Instructions +with a mixed expression category are treated as a special case in conversion, +which recurses into the elements of those instructions before performing +conversions. + +### Value bindings + +A value binding represents a conversion from a reference expression to the value +stored in that expression. There are three important cases here: + +- For types with a by-copy value representation, such as `i32`, a value + binding represents a load from the address indicated by the reference + expression. + +- For types with a by-pointer value representation, such as arrays and large + structs and tuples, a value binding implicitly takes the address of the + reference expression. 
+ +- For structs and tuples, the value representation is a struct or tuple of the + elements' value representations, which is not necessarily the same as a + struct or tuple of the elements' object representations. In the case where + the value representation is not a copy of, or pointer to, the object + representation, `value_binding` instructions are not used, and a + `tuple_value` or `struct_value` instruction is used to construct a value + representation instead. `value_binding` should still be used in the case + where the value and object representation are the same, but this is not yet + implemented. + +## Handling Parse::Tree errors (not yet implemented) + +`Parse::Tree` errors will typically indicate that checking would error for a +given context. We'll want to be careful about how this is handled, but we'll +likely want to generate diagnostics for valid child nodes, then reduce +diagnostics once invalid nodes are encountered. We should be able to reasonably +abandon generated IR of the valid children when we encounter an invalid parent, +without severe effects on surrounding checks. + +For example, an invalid line of code in a function might generate some +incomplete IR in the function's `SemIR::InstBlock`, but that IR won't negatively +interfere with checking later valid lines in the same function. + +## Alternatives considered + +### Using a traditional AST representation + +Clang creates an AST as part of compilation. In Carbon, it's something we could +do as a step between parsing and checking, possibly replacing the SemIR. It's +likely that doing so would be simpler, amongst other possible trade-offs. +However, we think the SemIR approach is going to yield higher performance, +enough so that it's the chosen approach. 
diff --git a/toolchain/docs/check.svg b/toolchain/docs/check.svg new file mode 100644 index 0000000000000..22d9ef3da9fad --- /dev/null +++ b/toolchain/docs/check.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/toolchain/docs/diagnostics.md b/toolchain/docs/diagnostics.md new file mode 100644 index 0000000000000..19fc5d9c8d05a --- /dev/null +++ b/toolchain/docs/diagnostics.md @@ -0,0 +1,230 @@ +# Diagnostics + + + + + +## Table of contents + +- [Overview](#overview) +- [DiagnosticEmitter](#diagnosticemitter) +- [DiagnosticConsumers](#diagnosticconsumers) +- [Producing diagnostics](#producing-diagnostics) +- [Diagnostic registry](#diagnostic-registry) +- [CARBON_DIAGNOSTIC placement](#carbon_diagnostic-placement) +- [Diagnostic context](#diagnostic-context) +- [Diagnostic parameter types](#diagnostic-parameter-types) +- [Diagnostic message style guide](#diagnostic-message-style-guide) + + + +## Overview + +The diagnostic code is used by the toolchain to produce output. + +## DiagnosticEmitter + +[DiagnosticEmitters](/toolchain/diagnostics/diagnostic_emitter.h) handle the +main formatting of a message. It's parameterized on a location type, for which a +DiagnosticLocationTranslator must be provided that can translate the location +type into a standardized DiagnosticLocation of file, line, and column. + +When emitting, the resulting formatted message is passed to a +DiagnosticConsumer. + +## DiagnosticConsumers + +DiagnosticConsumers handle output of diagnostic messages after they've been +formatted by an Emitter. Important consumers are: + +- [ConsoleDiagnosticConsumer](/toolchain/diagnostics/diagnostic_emitter.h): + prints diagnostics to console. + +- [ErrorTrackingDiagnosticConsumer](/toolchain/diagnostics/diagnostic_emitter.h): + counts the number of errors produced, particularly so that it can be + determined whether any errors were encountered. 
+ +- [SortingDiagnosticConsumer](/toolchain/diagnostics/sorting_diagnostic_consumer.h): + sorts diagnostics by line so that diagnostics are seen in the terminal based on + their order in the file rather than the order they were produced. + +- [NullDiagnosticConsumer](/toolchain/diagnostics/null_diagnostics.h): + suppresses diagnostics, particularly for tests. + +Note that `SortingDiagnosticConsumer` is used by default by `carbon compile`. In +cases where one error leads to another error at an earlier location, for example +if an error in a function call argument leads to an error in the function call, +this can result in confusing diagnostic output where a consequence of the error +is reported before the cause. Usually this should be handled by tracking that an +error occurred and suppressing the follow-on diagnostic. During toolchain +development, it can be useful to disable the sorting so that the diagnostic +order matches the order in which the file was processed. This can be done using +`carbon compile --stream-errors`. + +## Producing diagnostics + +Diagnostics are used to surface issues from compilation. A simple diagnostic +looks like: + +```cpp +CARBON_DIAGNOSTIC(InvalidCode, Error, "Code is invalid"); +emitter.Emit(location, InvalidCode); +``` + +Here, `CARBON_DIAGNOSTIC` defines a static instance of a diagnostic named +`InvalidCode` with the associated severity (`Error` or `Warning`). + +The `Emit` call produces a single instance of the diagnostic. When emitted, +`"Code is invalid"` will be the message used. The type of `location` depends on +the `DiagnosticEmitter`. + +A diagnostic with an argument looks like: + +```cpp +CARBON_DIAGNOSTIC(InvalidCharacter, Error, "Invalid character {0}.", char); +emitter.Emit(location, InvalidCharacter, invalid_char); +``` + +Here, the additional `char` argument to `CARBON_DIAGNOSTIC` specifies the type +of an argument to expect for message formatting. The `invalid_char` argument to +`Emit` provides the matching value.
It's then passed along with the diagnostic +message format to `llvm::formatv` to produce the final diagnostic message. + +## Diagnostic registry + +There is a [registry](/toolchain/diagnostics/diagnostic_kind.def) which all +diagnostics must be added to. Each diagnostic has a line like: + +```cpp +CARBON_DIAGNOSTIC_KIND(InvalidCode) +``` + +This produces a central enumeration of all diagnostics. The eventual intent is +to require tests for every diagnostic that can be produced, but that isn't +currently implemented. + +## CARBON_DIAGNOSTIC placement + +Idiomatically, `CARBON_DIAGNOSTIC` will be adjacent to the `Emit` call. However, +this is only because many diagnostics can only be produced in one code location. +If they can be produced in multiple locations, they will be at a higher scope so +that multiple `Emit` calls can reference them. When in a function, +`CARBON_DIAGNOSTIC` should be placed as close as possible to the usage so that +it's easier to see the associated output. + +## Diagnostic context + +Diagnostics can provide additional context for errors by attaching notes, which +have their own location information. A diagnostic with a note looks like: + +```cpp +CARBON_DIAGNOSTIC(CallArgCountMismatch, Error, + "{0} argument(s) passed to function expecting " + "{1} argument(s).", + int, int); +CARBON_DIAGNOSTIC(InCallToFunction, Note, + "Calling function declared here."); +context.emitter() + .Build(call_parse_node, CallArgCountMismatch, arg_refs.size(), + param_refs.size()) + .Note(param_parse_node, InCallToFunction) + .Emit(); +``` + +The error and the note are registered as two separate diagnostics, but a single +overall diagnostic object is built and emitted, so that the error and the note +can be treated as a single unit. + +Diagnostic context information can also be registered in a scope, so that all +diagnostics produced in that scope attach a specific note. 
For example: + +```cpp +DiagnosticAnnotationScope annotate_diagnostics( + &context.emitter(), [&](auto& builder) { + CARBON_DIAGNOSTIC( + InCallToFunctionParam, Note, + "Initializing parameter {0} of function declared here.", int); + builder.Note(param_parse_node, InCallToFunctionParam, + diag_param_index + 1); + }); +``` + +This is useful when delegating to another part of Check that may produce many +different kinds of diagnostic. + +## Diagnostic parameter types + +Here are some types you might consider for the parameters to a diagnostic: + +- `llvm::StringLiteral`. Note that we don't use `llvm::StringRef` to avoid + lifetime issues. +- `std::string` +- Carbon types `T` that implement `llvm::format_provider` like: + - `Lex::TokenKind` + - `Lex::NumericLiteral::Radix` + - `Parse::RelativeLocation` +- integer types: `int`, `uint64_t`, `int64_t`, `size_t` +- `char` +- Other + [types supported by llvm::formatv](https://llvm.org/doxygen/FormatVariadic_8h_source.html) + +## Diagnostic message style guide + +In order to provide a consistent experience, Carbon diagnostics should be +written in the following style: + +- Start diagnostics with a capital letter or quoted code, and end them with a + period. + +- Quoted code should be enclosed in backticks, for example: + ``"`{0}` is bad."`` + +- Phrase diagnostics as bullet points rather than full sentences. Leave out + articles unless they're necessary for clarity. + +- Diagnostics should describe the situation the toolchain observed and the + language rule that was violated, although either can be omitted if it's + clear from the other. For example: + + - `"Redeclaration of X."` describes the situation and implies that + redeclarations are not permitted. + + - ``"`self` can only be declared in an implicit parameter list."`` + describes the language rule and implies that you declared `self` + somewhere else. 
+ + - It's OK for a diagnostic to guess at the developer's intent and provide + a hint after explaining the situation and the rule, but not as a + substitute for that. For example, + ``"Add an `as String` cast to format this integer as a string."`` is not + sufficient as an error message, but + ``"Cannot add i32 to String. Add an `as String` cast to format this integer as a string."`` + could be acceptable. + +- TODO: Should diagnostics be atemporal and non-sequential ("multiple + declarations of X", "additional declaration here"), present tense but + sequential ("redeclaration of X", "previous declaration is here"), or + temporal ("redeclaration of X", "previous declaration was here")? We could + try to sidestep difference between the latter two by avoiding verbs with + tense ("previously declared here", "Y declared here", with no is/was). + +- TODO: Word choices: + + - For disallowed constructs, do we say they're not permitted / not allowed + / not valid / not legal / illegal / ill-formed / disallowed? Do we say + "X cannot be Y" or "X may not be Y" or "X must not be Y" or "X shall not + be Y"? + +- TODO: Is structuring diagnostics such that inputs can be parsed without + string parsing important? that is, when is passing strings in as part of the + message templating okay? + +- TODO: When do we put identifiers or expressions in diagnostics, versus + requiring notes pointing at relevant code? Is it only avoided for values, or + only allowed for types? + +- TODO: Lots more things to decide, give examples. diff --git a/toolchain/docs/driver.md b/toolchain/docs/driver.md new file mode 100644 index 0000000000000..01744b623b5d4 --- /dev/null +++ b/toolchain/docs/driver.md @@ -0,0 +1,22 @@ +# Driver + + + + + +## Table of contents + +- [Overview](#overview) + + + +## Overview + +The driver provides commands and ties together the toolchain's flow. Running a +command such as `carbon compile --phase=lower ` will run through the flow +and print output. 
Several dump flags, such as `--dump-parse-tree`, print output +in YAML format for easier parsing. diff --git a/toolchain/docs/idioms.md b/toolchain/docs/idioms.md new file mode 100644 index 0000000000000..a8d6fb9bbe4b9 --- /dev/null +++ b/toolchain/docs/idioms.md @@ -0,0 +1,424 @@ +# Idioms + + + + + +## Table of contents + +- [Overview](#overview) +- [C++ dialect](#c-dialect) +- [Abbreviations used in the code (AKA Carbon abbreviation decoder ring)](#abbreviations-used-in-the-code-aka-carbon-abbreviation-decoder-ring) +- [`.def` files](#def-files) + - [EnumBase types](#enumbase-types) +- [Index types](#index-types) +- [ValueStore](#valuestore) +- [Template metaprogramming](#template-metaprogramming) + - [Struct reflection](#struct-reflection) + - [Field detection](#field-detection) +- [Local lambdas to reduce duplicate code](#local-lambdas-to-reduce-duplicate-code) +- [Immediately invoked function expressions (IIFE)](#immediately-invoked-function-expressions-iife) +- [Declarations in conditions](#declarations-in-conditions) +- [CRTP or "Curiously recurring template pattern"](#crtp-or-curiously-recurring-template-pattern) +- [Multiple inheritance](#multiple-inheritance) +- [Defining constants usable in constexpr contexts](#defining-constants-usable-in-constexpr-contexts) + + + +## Overview + +The toolchain implementation uses some implementation techniques that may not be +commonly found in typical C++ code. 
+ +## C++ dialect + +The toolchain implementation does not use some C++ features, following +[Google's C++ style guide](https://google.github.io/styleguide/cppguide.html): + +- [Exceptions](https://google.github.io/styleguide/cppguide.html#Exceptions) +- [Virtual base classes](https://google.github.io/styleguide/cppguide.html#Inheritance) +- [RTTI](https://google.github.io/styleguide/cppguide.html#Run-Time_Type_Information__RTTI_) + +## Abbreviations used in the code (AKA Carbon abbreviation decoder ring) + +Note that abbreviations are typically only used in code, not comments (except +when referring to an entity from the code). + +- **Addr**: "address" +- **Arg**: "argument" +- **Decl**: "declaration" +- **Expr**: "expression" + - **SubExpr**: "subexpression" +- **Float**: "floating point" +- **Init**: "initialization" +- **Inst**: "instruction" +- **Int**: "integer" +- **Loc**: "location" +- **Param**: "parameter" +- **Paren**: "parenthesis" +- **Ref**: "reference" + - **Deref**: "dereference" +- **Subst**: "substitute" + +Phrase abbreviations (where we have an abbreviation for a phrase, where we +wouldn't perform all of the abbreviations of those words individually): + +- **InitRepr**: "initializing representation" +- **ObjectRepr**: "object representation" +- **SemIR**: "semantics intermediate representation" +- **ValueRepr**: "value representation" + +## `.def` files + +The Carbon toolchain uses a technique related to +[X-macros](https://en.wikipedia.org/wiki/X_macro) to generate code that operates +over a collection of types, enumerators, or another similar list of names. This +works as follows: + +- A `.def` file is provided, that is intended to be repeatedly included by way + of `#include`. +- The user of the `.def` defines a macro, with a name and a form specified by + the `.def` file, for example + `#define CARBON_EACH_WIDGET(Name) Scope::Name,`. +- A `#include` of the `.def` file expands to `CARBON_EACH_WIDGET(Name1)`, + `CARBON_EACH_WIDGET(Name2)`, ... 
for each widget name, and then `#undef`s + the `CARBON_EACH_WIDGET` macro. + +For example: + +```cpp +enum Widgets { +#define CARBON_EACH_WIDGET(Name) Name, +#include "widgets.def" +}; +``` + +... would expand to an enumeration definition with one enumerator per widget +name. + +### EnumBase types + +Most `.def` files will have a corresponding [EnumBase](/common/enum_base.h) +child class (if `widgets.def` has X-macros, `widgets.h` and `widgets.cpp` have +the `EnumBase` child class). These work similarly to an `enum class`, with the +addition of a `name()` function and `<<` stream operator support. Many also have +further utility functions for information related to the enum value. + +In code, these types and values can be used directly in a `switch`. They will +convert to an internal _actual_ `enum class` for the `switch`, and receive +corresponding compiler safety checks that all enum values are handled. + +## Index types + +Carbon makes frequent use of +[IndexBase and IdBase](/toolchain/base/index_base.h). The `IndexBase` and +`IdBase` types are small wrappers around `int32_t` to provide a measure of +type-checking when passing around indices to vector-like storage types. The only +difference is that `IndexBase` supports all comparison operators, whereas +`IdBase` only supports equality comparison. + +Variable naming will often have `_id` at the end to indicate that it corresponds +to an `IdBase`. This may include the full type, as in `operand_inst_id` being an +`InstId` for an operand. + +A block is an array of ids. These will be indicated with either a `_block` +suffix or pluralization (for example, `param_refs` pluralizing `refs`). + +The `ref` concept in a name means that there is an underlying instruction block, +but only a subset of instructions are present in the `refs` block. For example, +function parameters have a sequence, and also have a `refs` block with one entry +per parameter.
The `refs` block allows parameters to be counted and accessed +directly, rather than through vector iteration. + +## ValueStore + +Many of Carbon's data types are stored in a +[ValueStore](/toolchain/base/value_store.h) or related type with similar +semantics (`sem_ir` has [several such classes](/toolchain/base/value_store.h)). +`ValueStore` links an indexing type to a value type with vector-like storage. +The indices typically use `IdBase`. + +`ValueStore`'s APIs follow the shape of simple array access and mutation: + +- `Add`, which takes a value and returns the index. +- `Set`, which takes a value and an index to modify. +- `Get`, which takes an index and returns a reference to the value (possibly a + constant reference). +- Other vector-like functionality, including `size` and `Reserve`. + +ValueStores should be named after the type they contain. The index type used on +the value store should have a `using ValueType...` which indicates the stored +type. When storing the result of one of these functions, it's common to use +`auto` and rely on the name of the storage type to imply the returned type. + +Some name mirroring examples are: + +- `ints` is a `ValueStore`, which has an index type of `IntId` and a + value type of `llvm::APInt`. + +- `functions` is a `ValueStore`, which has an index type of + `SemIR::FunctionId` and a value type of `SemIR::Function`. + +- `strings` is a `ValueStore`, which has an index type of + `StringId`, but for copy-related reasons, uses `llvm::StringRef` for values. + +A fairly complete list of `ValueStore` uses should be available on +[checking's Context class](https://github.com/search?q=repository%3Acarbon-language%2Fcarbon-lang%20path%3Acheck%2Fcontext.h%20symbol%3Aidentifiers&type=code). 
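
To make the relationship between an index type and its store more concrete, here is a minimal, self-contained sketch. It is not the toolchain's implementation: `ToyStringId` and `ToyValueStore` are hypothetical names invented for illustration, and only the `Add`/`Set`/`Get`/`size` shape and the `using ValueType` convention come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical id type in the spirit of `IdBase`: a thin wrapper around a
// 32-bit integer that supports only equality comparison.
struct ToyStringId {
  // Names the type stored under this id, mirroring `using ValueType...`.
  using ValueType = std::string;

  std::int32_t index;

  friend auto operator==(ToyStringId lhs, ToyStringId rhs) -> bool {
    return lhs.index == rhs.index;
  }
  friend auto operator!=(ToyStringId lhs, ToyStringId rhs) -> bool {
    return !(lhs == rhs);
  }
};

// Hypothetical store linking an id type to vector-like storage.
template <typename IdT>
class ToyValueStore {
 public:
  using ValueType = typename IdT::ValueType;

  // Stores `value` and returns the id that now refers to it.
  auto Add(ValueType value) -> IdT {
    IdT id{static_cast<std::int32_t>(values_.size())};
    values_.push_back(std::move(value));
    return id;
  }

  // Overwrites the value that `id` refers to.
  auto Set(IdT id, ValueType value) -> void {
    values_[CheckedIndex(id)] = std::move(value);
  }

  // Returns a constant reference to the value that `id` refers to.
  auto Get(IdT id) const -> const ValueType& {
    return values_[CheckedIndex(id)];
  }

  auto size() const -> std::size_t { return values_.size(); }

 private:
  // Converts an id to a vector index, checking the range in debug builds.
  auto CheckedIndex(IdT id) const -> std::size_t {
    assert(id.index >= 0 &&
           static_cast<std::size_t>(id.index) < values_.size());
    return static_cast<std::size_t>(id.index);
  }

  std::vector<ValueType> values_;
};
```

Because the store is parameterized on the id type, passing an id that belongs to a different store is a compile-time type error rather than a silent out-of-range access, which is the measure of type-checking the index types are meant to provide.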
+ +## Template metaprogramming + +FIXME: show example patterns + +- TypedInstArgsInfo from toolchain/sem_ir/inst.h +- templated using +- std::declval +- decltype +- static_assert +- if constexpr +- template specialization, for example `Inst::FromRaw` (maybe also type + traits?) + +### Struct reflection + +The toolchain uses a primitive form of struct reflection to operate generically +over the fields in a typed `SemIR` instruction. This is implemented in +`common/struct_reflection.h`, and the interface to the functionality is +`StructReflection::AsTuple(your_struct)`, which converts the given struct into a +`std::tuple` containing the same fields in the same order. + +### Field detection + +The presence of specific fields in a struct with a specified type is detected +using the following idiom: + +```cpp +template +constexpr bool HasField = false; +template +constexpr bool HasField = true; +``` + +This is intended to check the same property as the following concept, which we +can't use because we currently need to compile in C++17 mode: + +```cpp +template concept HasField = requires (T x) { + { x.field } -> std::same_as; +}; +``` + +To detect a field with a specific name with a type derived from a specified base +type, use this idiom: + +```cpp +// HasField is true if T has a `U field` field, +// where `U` extends `BaseClass`. +template +inline constexpr bool HasField = false; +template +inline constexpr bool HasField< + T, bool(std::is_base_of_v)> = true; +``` + +The equivalent concept is: + +```cpp +template concept HasField = requires (T x) { + { x.field } -> std::derived_from; +}; +``` + +## Local lambdas to reduce duplicate code + +Sometimes code that would be repeated in a function is factored into a local +variable containing a +[lambda](https://en.cppreference.com/w/cpp/language/lambda): + +```cpp +auto common_code = [&](AType param1, AnotherType param2) { + // code that would otherwise be repeated + ... 
+}; +if (something) { + common_code(...); +} +if (something_else) { + common_code(...); +} +``` + +Compared to defining a new function, a local lambda has the advantage that it +can be declared in context and can access the local variables of the enclosing +function. + +## Immediately invoked function expressions (IIFE) + +Instead of creating a separate function with its own name that will be called +once to produce the initial value for a variable, the function can be declared +inline and then immediately called. + +This can be used for complex initialization, as in: + +```cpp +// variable declaration +static const llvm::ArrayRef entropy_bytes = +// initializer starts with a lambda + []() -> llvm::ArrayRef { + static llvm::SmallVector bytes; + + // a bunch of code + + // return the value to initialize the variable with + return bytes; + +// finish defining the lambda, and then immediately invoke it +}(); +``` + +It can also be used inside a `CARBON_DCHECK` to avoid computation that is only +needed in debug builds: + +```cpp +CARBON_DCHECK([&] { + // a bunch of code + + // condition that will be tested by CARBON_DCHECK + return complicated && multiple_parts; + +// finish defining the lambda, and then immediately invoke it +}()) << "Complicated things went wrong"; +``` + +See a description of this technique on +[Wikipedia](https://en.wikipedia.org/wiki/Immediately_invoked_function_expression). 
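
The two uses above are fragments, so here is a small complete sketch that can be compiled on its own. It assumes nothing from the toolchain: `PopCountTable` and `PopCount` are hypothetical names, and the standard `assert` macro stands in for `CARBON_DCHECK`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical table: the number of set bits in each possible byte value.
// The lambda is defined and invoked in one expression, so it runs exactly
// once, when `PopCountTable` is initialized.
static const std::array<int, 256> PopCountTable = [] {
  std::array<int, 256> table = {};
  for (std::size_t byte = 0; byte < table.size(); ++byte) {
    for (std::size_t bits = byte; bits != 0; bits >>= 1) {
      table[byte] += static_cast<int>(bits & 1);
    }
  }
  return table;
}();

auto PopCount(unsigned char byte) -> int {
  // Debug-only cross-check written as an IIFE. When `NDEBUG` is defined the
  // whole lambda disappears along with the `assert`.
  assert([&] {
    int expected = 0;
    for (unsigned bits = byte; bits != 0; bits >>= 1) {
      expected += static_cast<int>(bits & 1U);
    }
    return PopCountTable[byte] == expected;
  }());
  return PopCountTable[byte];
}
```

One caveat with the debug-check form: the lambda is a macro argument, so a comma that is not nested inside parentheses would be parsed as an argument separator. Keeping such commas inside parentheses avoids the problem.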
+ +## Declarations in conditions + +The condition part of an `if` statement may contain a declaration with an +initializer followed by a semicolon (`;`) and then the proper boolean condition +expression, as in: + +```cpp +if (auto verify = tree.Verify(); !verify.ok()) { +``` + +The condition can be replaced by a declaration entirely, as in: + +```cpp +if (auto equals = context.ConsumeIf(Lex::TokenKind::Equal)) { +// Equivalent to: +if (auto equals = context.ConsumeIf(Lex::TokenKind::Equal); equals) { +``` + +or + +```cpp +if (auto literal = bound_inst.TryAs()) { +// Equivalent to: +if (auto literal = bound_inst.TryAs(); literal) { +``` + +This is a common way of handling a function that returns an optional value. + +See +[https://en.cppreference.com/w/cpp/language/if](https://en.cppreference.com/w/cpp/language/if) + +## CRTP or "Curiously recurring template pattern" + +[Curiously Recurring Template Pattern - cppreference.com](https://en.cppreference.com/w/cpp/language/crtp) + +[Curiously recurring template pattern - Wikipedia](https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern) + +[Google search](https://www.google.com/search?q=crtp+c%2B%2B) + +Examples: + +- `template ` in [enum_base.h](/common/enum_base.h) +- `template ` in [ostream.h](/common/ostream.h) + +## Multiple inheritance + +We use multiple inheritance to support uses of +[CRTP](#crtp-or-curiously-recurring-template-pattern). + +Example: + +```cpp +struct NameScopeId : public IndexBase, public Printable { +``` + +## Defining constants usable in constexpr contexts + +To declare a constant usable at compile time in `constexpr` contexts as a static +class member, we use this pattern: + +Declaration: + +```cpp +class Foo { + // ... + static const std::array MyTable; + static constexpr auto ComputeMyTable() + -> std::array { ... 
} +}; +``` + +Definition: + +```cpp +constexpr std::array + Foo::MyTable = Foo::ComputeMyTable(); +``` + +Note the `const` on the declaration does not match the `constexpr` on +definition, and that the definition is outside of the class body. This allows +the initializer to depend on the definition of the class. + +Further note that this only works with static members of classes, not static +variables in functions. + +Due to [a Clang bug](https://github.com/llvm/llvm-project/issues/85461), this +technique does not work in a class template. The following pattern can be used +instead: + +```cpp +template +class Foo { + // ... + template + static constexpr auto MyValueImpl = Self(); + static constexpr const Foo& MyValue = MyValueImpl<>; + // ... +}; +``` + +The parameters of the variable template can be chosen to allow reuse of the same +variable template for multiple static data members. + +Examples: + +- `NodeStack::IdKindTable` in + [check/node_stack.h](/toolchain/check/node_stack.h) +- `BuiltinKind::ValidCount` in + [sem_ir/builtin_inst_kind.h](/toolchain/sem_ir/builtin_inst_kind.h) + +A global constant may use a single definition without a separate declaration: + +```cpp +static constexpr std::array IsIdStartByteTable = [] { + std::array table = {}; + // ... + return table; +}(); +``` + +Note this example is using an +[immediately invoked function expression](#immediately-invoked-function-expressions-iife) +to compute the initial value, which is common. 
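
As a complete illustration of the class-member pattern from earlier in this section (a `const` declaration inside the class and a `constexpr` definition after it), the following sketch compiles in C++17 mode. The class `Squares` and its members are hypothetical names chosen for the example, not toolchain code.

```cpp
#include <array>
#include <cstddef>

class Squares {
 public:
  static constexpr std::size_t Size = 8;

  // Declared `const` here. The `constexpr` definition follows the class.
  static const std::array<int, Size> Table;

  // Computes the table contents at compile time.
  static constexpr auto ComputeTable() -> std::array<int, Size> {
    std::array<int, Size> table = {};
    for (std::size_t i = 0; i < Size; ++i) {
      table[i] = static_cast<int>(i * i);
    }
    return table;
  }
};

// Defined outside the class body, so the initializer can rely on the
// complete definition of `Squares`.
constexpr std::array<int, Squares::Size> Squares::Table =
    Squares::ComputeTable();

// The member is usable in constant expressions after its definition.
static_assert(Squares::Table[3] == 9);
```

The `static_assert` only works after the out-of-class definition, since the in-class declaration alone does not give the member a value.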
+ +Examples: + +- [lex/lex.cpp](/toolchain/lex/lex.cpp) diff --git a/toolchain/docs/lex.md b/toolchain/docs/lex.md new file mode 100644 index 0000000000000..32925aeee19eb --- /dev/null +++ b/toolchain/docs/lex.md @@ -0,0 +1,44 @@ +# Lex + + + + + +## Table of contents + +- [Overview](#overview) +- [Bracket matching](#bracket-matching) +- [Alternatives considered](#alternatives-considered) + - [Bracket matching in parser](#bracket-matching-in-parser) + + + +## Overview + +Lexing converts input source code into tokenized output. Literals, such as +string literals, have their value parsed and form a single token at this stage. + +## Bracket matching + +The lexer handles matching for `()`, `[]`, and `{}`. When a bracket lacks a +match, it will insert a "recovery" token to produce a match. As a consequence, +the lexer's output should always have matched brackets, even with invalid code. + +While bracket matching could use hints such as contextual clues from +indentation, that is not yet implemented. + +## Alternatives considered + +### Bracket matching in parser + +Bracket matching could have also been implemented in the parser, with some +awareness of parse state. However, that would shift some of the complexity of +recovery in other error situations, such as where the parser searches for the +next comma in a list. That needs to skip over bracketed ranges. We don't think +the trade-offs would yield a net benefit, so any change in this direction would +need to show concrete improvement, for example better diagnostics for common +issues. diff --git a/toolchain/docs/lower.md b/toolchain/docs/lower.md new file mode 100644 index 0000000000000..4574952c7b080 --- /dev/null +++ b/toolchain/docs/lower.md @@ -0,0 +1,25 @@ +# Lower + + + + + +## Table of contents + +- [Overview](#overview) + + + +## Overview + +Lowering takes the SemIR and produces LLVM IR. 
At present, this is done in a +single pass, although it's possible we may need to do a second pass so that we +can first generate type information for function arguments. + +Lowering is done per `SemIR::InstBlock`. This minimizes changes to the +`IRBuilder` insertion point, something that is both expensive and potentially +fragile. diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md new file mode 100644 index 0000000000000..397fc208739b3 --- /dev/null +++ b/toolchain/docs/parse.md @@ -0,0 +1,802 @@ +# Parse + + + + + +## Table of contents + +- [Overview](#overview) +- [Parse stack](#parse-stack) +- [Postorder tree](#postorder-tree) +- [Bracketing inside the tree](#bracketing-inside-the-tree) +- [Visual example](#visual-example) +- [Handling invalid parses](#handling-invalid-parses) +- [How is this accomplished?](#how-is-this-accomplished) + - [Introducer](#introducer) + - [Optional modifiers before an introducer](#optional-modifiers-before-an-introducer) + - [Something required in context](#something-required-in-context) + - [Optional clauses](#optional-clauses) + - [Case 1: introducer to optional clause is used as parent node](#case-1-introducer-to-optional-clause-is-used-as-parent-node) + - [Case 2: parent node is required token after optional clause, with different parent node kinds for different options](#case-2-parent-node-is-required-token-after-optional-clause-with-different-parent-node-kinds-for-different-options) + - [Case 3: optional sibling](#case-3-optional-sibling) + - [Operators](#operators) + + + +## Overview + +Parsing uses tokens to produce a parse tree that faithfully represents the tree +structure of the source program, interpreted according to the Carbon grammar. No +semantics are associated with the tree structure at this level, and no name +lookup is performed. + +The parse tree's structure corresponds to the grammar of the Carbon language. On +valid input, there will be a 1:1 correspondence between parse tree nodes and +tokens. 
+ +A parse tree is considered _structurally valid_ if all nodes have the number of +children that their node kind requires. On invalid input, nodes may be added +that don't correspond to a token to maintain a structurally valid parse tree. +When a parse tree node is marked as having an error, it will still be +structurally valid, but its children may not match a valid grammar. Code trying +to handle children of erroneous nodes must be prepared to handle atypical +structures, but it may still be helpful for tools such as syntax highlighters or +refactoring tools. + +In general, we favor doing the checking for whether something is allowed _in a +particular context_ in [the check stage](check.md) instead of the parse stage, +unless the context is very local. This is for a few reasons: + +- We anticipate that the parse stage will be used to operate on invalid code + while still preserving as much of the intent of the author as possible, for + example in an IDE or a code formatter. +- To keep as much code out of the parse stage as possible, so it is simple and + fast. +- We are building all the infrastructure to keep track of context in the check + stage. + +These reasons explain what local context is okay: where we already have the +contextual information at hand so there is no performance cost, and we can +output a parse tree that still captures faithfully what the user wrote. +Examples: + +- All declaration modifiers are allowed in any order on any declaration in the + parse stage. Diagnosing duplicated modifiers, modifiers that conflict with + other modifiers, or modifiers that can't be used on a particular declaration + is postponed until the check stage. +- Rejecting a keyword after `fn` where a name is expected is done at the parse + stage. + +## Parse stack + +The core parser loop is `Parse::Tree::Parse`. In the loop, it pops the next +state off the stack, and dispatches to the appropriate `Handle` function. 
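
A toy version of such a loop is sketched below for a single `return 42;` statement. Everything in it is hypothetical (`ToyState`, `ToyContext`, and the handler names are invented here, and there is no error handling); it only illustrates the shape of a stack-driven parser whose handlers pop their own state and push follow-up states.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class ToyState { Statement, Expr, StatementFinish };

struct ToyContext {
  std::vector<std::string> tokens;
  std::size_t position = 0;
  std::vector<ToyState> state_stack;
  std::vector<std::string> nodes;  // Output nodes, as "Kind:text".

  // Adds a node for the current token and advances past it.
  auto AddNodeAndConsume(const std::string& kind) -> void {
    assert(position < tokens.size());
    nodes.push_back(kind + ":" + tokens[position]);
    ++position;
  }
};

auto HandleStatement(ToyContext& context) -> void {
  context.state_stack.pop_back();
  context.AddNodeAndConsume("ReturnStatementStart");
  // Pushed first, so it is handled last.
  context.state_stack.push_back(ToyState::StatementFinish);
  // Pushed last, so it is the next state to be handled.
  context.state_stack.push_back(ToyState::Expr);
}

auto HandleExpr(ToyContext& context) -> void {
  context.state_stack.pop_back();
  context.AddNodeAndConsume("Literal");
}

auto HandleStatementFinish(ToyContext& context) -> void {
  context.state_stack.pop_back();
  context.AddNodeAndConsume("ReturnStatement");
}

// The core loop: look at the state on top of the stack and dispatch to its
// handler until no states remain.
auto ToyParse(std::vector<std::string> tokens) -> std::vector<std::string> {
  ToyContext context;
  context.tokens = std::move(tokens);
  context.state_stack.push_back(ToyState::Statement);
  while (!context.state_stack.empty()) {
    switch (context.state_stack.back()) {
      case ToyState::Statement:
        HandleStatement(context);
        break;
      case ToyState::Expr:
        HandleExpr(context);
        break;
      case ToyState::StatementFinish:
        HandleStatementFinish(context);
        break;
    }
  }
  return context.nodes;
}
```

Calling `ToyParse({"return", "42", ";"})` yields the three nodes in the order they were added, with the statement's closing node last.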
+ +A typical handler function pops the state first, leaving the stack ready for the +next state. It may add nodes to the parse tree, based on the current code. If it +needs to trigger other states, it will push them onto the stack; because it's a +stack, the _next_ state is always pushed _last_. + +Operator expressions store information about current operator precedence in the +stack as well. While this isn't necessary for most parser states, and could be +stored separately, it's currently together because it has no impact on the size +of a stack entry and is thus more efficient to store in one place. + +## Postorder tree + +The parse tree's storage layout is in postorder. For example, given the code: + +```carbon +fn foo() -> f64 { + return 42; +} +``` + +The node order is (with indentation to indicate nesting): + + + + +```yaml +[ + {kind: 'FileStart', text: ''}, + {kind: 'FunctionIntroducer', text: 'fn'}, + {kind: 'Name', text: 'foo'}, + {kind: 'ParamListStart', text: '('}, + {kind: 'ParamList', text: ')', subtree_size: 2}, + {kind: 'Literal', text: 'f64'}, + {kind: 'ReturnType', text: '->', subtree_size: 2}, + {kind: 'FunctionDefinitionStart', text: '{', subtree_size: 7}, + {kind: 'ReturnStatementStart', text: 'return'}, + {kind: 'Literal', text: '42'}, + {kind: 'ReturnStatement', text: ';', subtree_size: 3}, + {kind: 'FunctionDefinition', text: '}', subtree_size: 11}, + {kind: 'FileEnd', text: ''}, +] +``` + + + +In this example, `FileStart`, `FunctionDefinition`, and `FileEnd` are "root" +nodes for the tree. Function components are children of `FunctionDefinition`. + +It's produced in this way because it's an efficient layout to produce with +vectorized storage, requiring little context to be maintained during parsing. +Because it's stored in postorder, it's also most efficient to process the parsed +output in postorder; this affects checking. 
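
To show how a flat postorder layout still encodes the tree, the sketch below finds the direct children of a node by stepping backward over whole subtrees. It is illustrative only: `SizedNode`, `FunctionExample`, and `ChildrenOf` are hypothetical names, and it assumes that `subtree_size` counts the node itself, so the leaves that omit it in the listing above have size 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct SizedNode {
  std::string kind;
  int subtree_size;  // Counts the node itself, so leaves have size 1.
};

// The node list from the `fn foo() -> f64` example, in storage order.
auto FunctionExample() -> std::vector<SizedNode> {
  return {
      {"FileStart", 1},
      {"FunctionIntroducer", 1},
      {"Name", 1},
      {"ParamListStart", 1},
      {"ParamList", 2},
      {"Literal", 1},
      {"ReturnType", 2},
      {"FunctionDefinitionStart", 7},
      {"ReturnStatementStart", 1},
      {"Literal", 1},
      {"ReturnStatement", 3},
      {"FunctionDefinition", 11},
      {"FileEnd", 1},
  };
}

// Returns the indices of the direct children of `nodes[parent]`, in source
// order. Each child is the last node of its own subtree, so skipping back by
// its `subtree_size` lands on the previous sibling.
auto ChildrenOf(const std::vector<SizedNode>& nodes, int parent)
    -> std::vector<int> {
  assert(parent >= 0 && static_cast<std::size_t>(parent) < nodes.size());
  auto size_of = [&](int index) {
    return nodes[static_cast<std::size_t>(index)].subtree_size;
  };
  int first = parent - size_of(parent) + 1;  // First node of the subtree.
  std::vector<int> children;
  for (int child = parent - 1; child >= first; child -= size_of(child)) {
    children.push_back(child);
  }
  std::reverse(children.begin(), children.end());
  return children;
}
```

With the example data, the children of `FunctionDefinition` (index 11) are indices 7 and 10, that is, `FunctionDefinitionStart` and `ReturnStatement`.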
+ +The parse tree is printed in postorder by default because it matches how the +parse tree is expected to be processed within the toolchain , and so can make it +easier to reason about. However, the `--preorder` flag may be used in contexts +where a preorder representation would be easier to handle. + +## Bracketing inside the tree + +The parse tree is designed to be walked in postorder by checking, allowing +checking to be more efficient. To support this, checking sometimes requires +context on the meaning of a node when it is encountered. + +Each `ParseNodeKind` has either a bracketing node, or a specific child count. +This helps document and enforce the expected tree structure. + +When a bracketing node is indicated, it is the opening bracket: it will always +be the first child of the parent, and that will be the only time it occurs in +the parent's children (it may still occur in children of children). When +checking encounters the opening bracket, this means it can make contextual +decisions for the later children of the node. + +Nodes can also have a specific child count, for example, infix operators always +have two children: the lhs and rhs expressions. Many nodes have a child count of +0; this just means they're leaf nodes, and will never have children. + +Because the tree structure is always valid, these are treated as contracts. Some +nodes exist only to be used to construct valid tree structures for invalid +input, such as `StructFieldUnknown`. + +Although each subtree's size is also tracked as part of the node, we're +currently trying to avoid relying on it and may eliminate it if it turns out to +be unnecessary and a meaningful cost for the compiler. + +## Visual example + +To try to explain the transition from code to Parse Tree, consider the +statement: + +```carbon +var x: i32 = y + 1; +``` + +Lexing creates distinct tokens for each syntactic element, which will form the +basis of the parse tree: + +
+Tokens:
+
++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
+| var | |  x  | |  :  | | i32 | |  =  | |  y  | |  +  | |  1  | |  ;  |
++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
+
+ +First the `var` keyword is used as a "bracketing" node (VariableIntroducer). +When this is seen in a postorder traversal, it tells us to expect the basics of +a variable declaration structure. + +
+Tokens:
+
+        +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
+        |  x  | |  :  | | i32 | |  =  | |  y  | |  +  | |  1  | |  ;  |
+        +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
+
+Parse tree:
+
+
+
+
+
+
+
++-----+
+| var |
++-----+
+
+
+
+
+
+
+
+ +Next, we can consider the pattern binding. Here, `x` is the identifier and `i32` +is the type expression. The `:` provides a parent node that must always contain +two children, the name and type expression. Because it always has two direct +children, it doesn't need to be bracketed. + +
+Tokens:
+
+                                +-----+ +-----+ +-----+ +-----+ +-----+
+                                |  =  | |  y  | |  +  | |  1  | |  ;  |
+                                +-----+ +-----+ +-----+ +-----+ +-----+
+
+Parse tree:
+
+        +-----+ +-----+
+        |  x  | | i32 |
+        +-----+ +-----+
+           |       |
+           +-------+-------+
+                           |
++-----+                 +-----+
+| var |                 |  :  |
++-----+                 +-----+
+
+
+
+
+
+
+
+ +We use the `=` as a separator (instead of a node with children like `:`) to help +indicate the transition from binding to assignment expression, which is +important for expression parsing during checking. + +
+Tokens:
+
+                                        +-----+ +-----+ +-----+ +-----+
+                                        |  y  | |  +  | |  1  | |  ;  |
+                                        +-----+ +-----+ +-----+ +-----+
+
+Parse tree:
+
+        +-----+ +-----+
+        |  x  | | i32 |
+        +-----+ +-----+
+           |       |
+           +-------+-------+
+                           |
++-----+                 +-----+ +-----+
+| var |                 |  :  | |  =  |
++-----+                 +-----+ +-----+
+
+
+
+
+
+
+
+ +The expression is a subtree with `+` as the parent, and the two operands as +child nodes. + +
+Tokens:
+
+                                                                +-----+
+                                                                |  ;  |
+                                                                +-----+
+
+Parse tree:
+
+        +-----+ +-----+                 +-----+ +-----+
+        |  x  | | i32 |                 |  y  | |  1  |
+        +-----+ +-----+                 +-----+ +-----+
+           |       |                       |       |
+           +-------+-------+               +-------+-------+
+                           |                               |
++-----+                 +-----+ +-----+                 +-----+
+| var |                 |  :  | |  =  |                 |  +  |
++-----+                 +-----+ +-----+                 +-----+
+
+
+
+
+
+
+
+ +Finally, the `;` is used as the "root" of the variable declaration. It's +explicitly tracked as the `;` for a variable declaration so that it's +unambiguously bracketed by `var`. + +
+Tokens:
+
+
+
+
+
+Parse tree:
+
+        +-----+ +-----+                 +-----+ +-----+
+        |  x  | | i32 |                 |  y  | |  1  |
+        +-----+ +-----+                 +-----+ +-----+
+           |       |                       |       |
+           +-------+-------+               +-------+-------+
+                           |                               |
++-----+                 +-----+ +-----+                 +-----+
+| var |                 |  :  | |  =  |                 |  +  |
++-----+                 +-----+ +-----+                 +-----+
+   |                       |       |                       |
+   +-----------------------+-------+-----------------------+-------+
+                                                                   |
+                                                                +-----+
+                                                                |  ;  |
+                                                                +-----+
+
+ +This is the completed parse tree. + +In storage, this tree will be flat and in postorder. Because the order hasn't +changed much from the original code, we can do the reordering for postorder with +a minimal number of nodes being delayed for later output: it will be linear with +respect to the depth of the parse tree. + +
+Tokens:
+
++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
+| var | |  x  | |  :  | | i32 | |  =  | |  y  | |  +  | |  1  | |  ;  |
++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
+
+Parse tree:
+
+        +-----+ +-----+                 +-----+ +-----+
+        |  x  | | i32 |                 |  y  | |  1  |
+        +-----+ +-----+                 +-----+ +-----+
+           |       |                       |       |
+           +-------+-------+               +-------+-------+
+                           |                               |
++-----+                 +-----+ +-----+                 +-----+
+| var |                 |  :  | |  =  |                 |  +  |
++-----+                 +-----+ +-----+                 +-----+
+   |                       |       |                       |
+   +-----------------------+-------+-----------------------+-------+
+                                                                   |
+                                                                +-----+
+                                                                |  ;  |
+                                                                +-----+
+
+Flattened for storage:
+
++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
+| var | |  x  | | i32 | |  :  | |  =  | |  y  | |  1  | |  +  | |  ;  |
++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
+
+ +The structural concepts of bracketing nodes (`var` and `;`) and parent nodes +with a known child count (`:` and `+` with 2 children, but also `=` with 0 +children) will allow checking to reconstruct the tree as it encounters nodes +during the postorder. + +There are other structures that could have been used here, such as `=` being +parent of the `var` and pattern nodes, and `;` being the parent of the `=` and +assignment expression nodes. In that example alternative, the storage order +would be the same; it would only change the tree representation. The current +structure is influenced by choices in checking. + +## Handling invalid parses + +On an invalid parse, the output tree should still try to mirror the intended +tree structure when possible. There's a balance here, and it's not expected to +try too hard to make things correct, but outputting nodes is preferred. There +are `InvalidParse` nodes which may be used to provide a node when the planned +node kind is too difficult to get correct child counts (bracketed subtrees may +not need an `InvalidParse` node). + +When marking a child node with `has_error=true`, parent nodes may also be marked +with `has_error=true`, but try to be conservative about this. As a rule of +thumb, if checking could continue on a parent node without needing the child +node to be fully checked (possibly with incomplete information), then the parent +node should not be marked as `has_error=true`. The goal remains providing +something similar to a well-formed parse tree. + +In general, a parent node must have the immediate children described in +[parse/typed_nodes.h](/toolchain/parse/typed_nodes.h), unless it is marked +`has_error=true`. If this is violated for a particular parse tree, an error will +be raised in `Tree::Verify`. Note that an `InvalidParse` node is allowed as a +declaration or expression, and an `InvalidParseSubtree` is allowed as a +declaration. These invalid nodes can be added to more node categories as needed. 
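
The flattened order can be turned back into a tree in a single postorder pass, using only each node's fixed child count or its bracketing node. The sketch below is a hypothetical illustration, not checking's actual code: `BracketedNode`, `VarDeclPostorder`, and `ReconstructChildren` are invented names, and nodes are identified by their token text for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct BracketedNode {
  std::string text;
  // Text of the opening bracket node, or empty when `child_count` applies.
  std::string bracketed_by;
  // Fixed number of children. Ignored when `bracketed_by` is non-empty.
  int child_count;
};

// `var x: i32 = y + 1;` in storage order.
auto VarDeclPostorder() -> std::vector<BracketedNode> {
  return {
      {"var", "", 0}, {"x", "", 0}, {"i32", "", 0},
      {":", "", 2},   {"=", "", 0}, {"y", "", 0},
      {"1", "", 0},   {"+", "", 2}, {";", "var", 0},
  };
}

// Returns, for each node, the text of its direct children in source order.
auto ReconstructChildren(const std::vector<BracketedNode>& postorder)
    -> std::vector<std::vector<std::string>> {
  std::vector<std::vector<std::string>> children(postorder.size());
  // Roots of completed subtrees that no parent has claimed yet.
  std::vector<std::size_t> pending;
  for (std::size_t i = 0; i < postorder.size(); ++i) {
    const BracketedNode& node = postorder[i];
    std::vector<std::string> reversed;
    if (!node.bracketed_by.empty()) {
      // Claim subtrees back to, and including, the opening bracket.
      bool found_bracket = false;
      while (!found_bracket) {
        assert(!pending.empty());
        const std::string& text = postorder[pending.back()].text;
        found_bracket = (text == node.bracketed_by);
        reversed.push_back(text);
        pending.pop_back();
      }
    } else {
      // Claim exactly `child_count` subtrees.
      for (int n = 0; n < node.child_count; ++n) {
        assert(!pending.empty());
        reversed.push_back(postorder[pending.back()].text);
        pending.pop_back();
      }
    }
    children[i].assign(reversed.rbegin(), reversed.rend());
    pending.push_back(i);
  }
  return children;
}
```

For the `var` declaration, `:` claims `x` and `i32`, `+` claims `y` and `1`, and `;` claims everything back to `var`, which matches the tree drawn above.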
+ +Child states may indicate an error to their parent using `ReturnErrorOnState`. +This is particularly intended for when a child state emits a diagnostic, to +prevent the parent state from emitting redundant diagnostics; for example, an +invalid expression might have more invalid tokens following it, and the parent +might skip those without emitting diagnostics. + +## How is this accomplished? + +The specific approach to producing the desired tree depends on the kind of +grammar rule being implemented, as well as the desired output tree structure. + +### Introducer + +**Example:** `if (c) { ... }` + +Here `if` is the introducer. Many other possible introducers could occur in that +position, such as `while` or `var`, and we want to dispatch based on which token +is present. See +[parse/handle_statement.cpp](/toolchain/parse/handle_statement.cpp). + +The first step is to identify the introducer token, typically using a `switch` +or `if` on the `Lex::TokenKind` at the current position: + +```cpp +switch (context.PositionKind()) { + case Lex::TokenKind::___: { + ... + break; + } + ... +} +``` + +There should be a `default:` (or `else`) case so every kind of token is handled. +This may be an error, in which case: + +- A [diagnostic](diagnostics.md) should be emitted. + +- An invalid parse node should be added, using something like: + + ```cpp + context.AddLeafNode(NodeKind::InvalidParse, context.Consume(), + /*has_error=*/true); + ``` + +- At least one node should be consumed, particularly if it will continue with + this state at this position, to avoid an infinite loop. + +The default case may also be delegated to another state. For example, in the +state where a statement is expected, if no keyword introducer is recognized, it +switches to the expression-statement state. + +Depending on the introducer, different actions can be taken. 
The most common +case is to: + +- Call `context.PushState(State::___);` to mark the beginning of the statement + or declaration and indicate the state that will handle the tokens after the + introducer. + +- Call `context.AddLeafNode(NodeKind::___, context.Consume());` to output a + bracketing node for this introducer. + +The next state can then add sibling nodes until it gets to the end of the +declaration or statement. The last token, often a semicolon `;`, is used as a +parent node to match the bracketing node of the introducer. + +If the introducer token won't be used as a bracketing node, it can be +temporarily skipped after `context.PushState` by calling +`context.ConsumeAndDiscard()` instead of `context.AddLeafNode`. It must be added +to the output tree as a node by some later state, unless an error occurs. For +example, a `for` statement uses the `for` token as the root of the tree -- it +doesn't need a bracketing node since it has a fixed child count. Note that the +token was saved when the state was pushed, and can be retrieved when adding a +node as in this example: + +```cpp +auto state = context.PopState(); +context.AddNode(NodeKind::ForStatement, state.token, state.subtree_start, + state.has_error); +``` + +If this state is for an element of a scope like the statements in a code block, +most introducer tokens indicate that the current state should be repeated, to +handle the next statement, but some other token, like a close curly brace (`}`) +means that the state should be exited. + +### Optional modifiers before an introducer + +**Example:** `virtual fn Foo();` + +Here `fn` is the introducer, and `virtual` is an optional modifier that appears +before. See +[parse/handle_decl_scope_loop.cpp](/toolchain/parse/handle_decl_scope_loop.cpp). + +Use this pattern when the goal is to produce a subtree that starts with the +introducer as a bracketing node, as in the previous case, followed by nodes for +any modifiers. 
Note that bracketing is needed here, since the optional modifier +nodes mean that there is not a fixed child count for the parent node. This means +shuffling the introducer node before an unknown number of modifier nodes. This +is accomplished by emitting a placeholder node for the introducer, processing +all the modifiers until reaching the introducer, filling in the placeholder with +the information about the introducer, and then finishing the rest of the +declaration or statement. + +- **Step 1**: Save the current value of `context.tree().size()`. This could be + accomplished by calling `context.PushState()`, which saves that value in the + `subtree_start` field of `Context::StateStackEntry`; or by constructing a + `Context::StateStackEntry` value directly, as is done in + [parse/handle_decl_scope_loop.cpp](/toolchain/parse/handle_decl_scope_loop.cpp). + This marks the position of the placeholder node we are going to replace, as + well as the beginning of the subtree we are eventually going to emit for + this declaration or statement. + +- **Step 2**: Emit the placeholder node using + `context.AddLeafNode(NodeKind::Placeholder, *context.position());`. The + `NodeKind` and `Lex::TokenIndex` values will be overwritten later. + +- **Step 3**: Process tokens until we hit the introducer. All of the nodes we + emit at this point will appear as siblings after the introducer token in the + output tree. + +- **Step 4 - success**: If an introducer token is found, replace the + placeholder node using something like: + + ```cpp + context.ReplacePlaceholderNode(state.subtree_start, introducer_kind, + context.Consume()); + ``` + + - `state.subtree_start` is the value of `context.tree().size()` saved in + step 1, which marks the position of the placeholder node in the output + parse tree. 
+ + - `introducer_kind` is the `NodeKind` for the introducer of this + declaration or statement, a leaf node that will act as a bracketing node + at the beginning of the subtree for this declaration or statement. + +- **Step 4 - error**: If we run into something other than a modifier or + introducer before finding an introducer, we need to do error handling: + + ```cpp + context.ReplacePlaceholderNode(subtree_start, NodeKind::InvalidParseStart, + *context.position(), /*has_error=*/true); + ``` + + - Emit a [diagnostic](diagnostics.md). + + - Replace the placeholder node (similar to the success case of step 4) + with an `InvalidParseStart` node. It will be associated with the + unexpected token that triggered this error. + + - Consume input tokens up to the likely end of the current + statement or declaration. For example, we might consume up to a `;` or a + token at a lesser indent level using `context.SkipPastLikelyEnd(...)`. + It is important that we consume at least one token in the error case; + otherwise, we could have an infinite loop of generating the same error on + the same token. + + - Emit an `InvalidParseSubtree` node. This will be the parent of any + emitted modifier nodes, and will be bracketed by the `InvalidParseStart` + node emitted above. It should be associated with the last token + consumed. + + ```cpp + // Set `iter` to the last token consumed, one before the current position. + auto iter = context.position(); + --iter; + context.AddNode(NodeKind::InvalidParseSubtree, *iter, subtree_start, + /*has_error=*/true); + ``` + +- **Step 5**: (If success at step 4) Push whatever states are to be used to + parse the rest of the declaration. The first state pushed (the last state to + be processed) will handle the end of this declaration. That pushed state + should have a `subtree_start` field set to the value of + `context.tree().size()` saved in step 1. 
+
+-   **Step 6**: When handling the state for the end of the declaration, emit the
+    root node of the subtree:
+
+    ```cpp
+    auto state = context.PopState();
+    context.AddNode(NodeKind::___, context.Consume(),
+                    state.subtree_start, state.has_error);
+    ```
+
+    -   This `state.subtree_start` will mark everything since the bracketing
+        introducer node as the children of this node.
+
+### Something required in context
+
+FIXME
+
+Example: name after introducer
+[parse/handle_decl_name_and_params.cpp](/toolchain/parse/handle_decl_name_and_params.cpp)
+
+Example: "`[` _implicit parameter list_ `]`" after `impl forall`
+[parse/handle_impl.cpp](/toolchain/parse/handle_impl.cpp)
+
+### Optional clauses
+
+#### Case 1: introducer to optional clause is used as parent node
+
+**Example:** The optional `->` return type in a function signature
+uses this pattern, so `fn foo() -> u32;` is transformed to:
+
+```yaml
+  {kind: 'FunctionIntroducer', text: 'fn'},
+  {kind: 'IdentifierName', text: 'foo'},
+    {kind: 'TuplePatternStart', text: '('},
+  {kind: 'TuplePattern', text: ')', subtree_size: 2},
+    {kind: 'UnsignedIntTypeLiteral', text: 'u32'},
+  {kind: 'ReturnType', text: '->', subtree_size: 2},
+{kind: 'FunctionDecl', text: ';', subtree_size: 7},
+```
+
+Note how the `->` token becomes a `ReturnType` node in the output tree, and is
+moved after the `u32` type expression that becomes its child. Compare with the
+parse tree output for `fn foo();` which has no `ReturnType` node:
+
+```yaml
+  {kind: 'FunctionIntroducer', text: 'fn'},
+  {kind: 'IdentifierName', text: 'foo'},
+    {kind: 'TuplePatternStart', text: '('},
+  {kind: 'TuplePattern', text: ')', subtree_size: 2},
+{kind: 'FunctionDecl', text: ';', subtree_size: 5},
+```
+
+Here is the code from
+[parse/handle_function.cpp](/toolchain/parse/handle_function.cpp) that does
+this:
+
+```cpp
+auto HandleFunctionAfterParams(Context& context) -> void {
+  ...
+  // If there is a return type, parse the expression before adding the return
+  // type node.
+  if (context.PositionIs(Lex::TokenKind::MinusGreater)) {
+    context.PushState(State::FunctionReturnTypeFinish);
+    context.ConsumeAndDiscard();
+    context.PushStateForExpr(PrecedenceGroup::ForType());
+  }
+}
+
+auto HandleFunctionReturnTypeFinish(Context& context) -> void {
+  auto state = context.PopState();
+
+  context.AddNode(NodeKind::ReturnType, state.token, state.subtree_start,
+                  state.has_error);
+}
+```
+
+The `->` token is saved by `context.PushState(`...`)`, so it is available as
+`state.token` when calling
+`context.AddNode(NodeKind::ReturnType, state.token,`...`)` later in
+`HandleFunctionReturnTypeFinish`.
+
+Also see how the optional initializer is handled on `var`, treating the `=` as
+its introducer in `HandleVarAfterPattern` and `HandleVarInitializer` in
+[parse/handle_var.cpp](/toolchain/parse/handle_var.cpp).
+
+#### Case 2: parent node is required token after optional clause, with different parent node kinds for different options
+
+**Example:** The optional type expression before `as` in `impl as` is
+represented by producing two different output parse nodes for `as`. It outputs a
+`DefaultSelfImplAs` node with no children when the type expression is absent,
+and otherwise a `TypeImplAs` parse node with the type expression as its child.
+
+So `impl bool as Interface;` is transformed to:
+
+```yaml
+  {kind: 'ImplIntroducer', text: 'impl'},
+    {kind: 'BoolTypeLiteral', text: 'bool'},
+  {kind: 'TypeImplAs', text: 'as', subtree_size: 2},
+  {kind: 'IdentifierNameExpr', text: 'Interface'},
+{kind: 'ImplDecl', text: ';', subtree_size: 5},
+```
+
+while `impl as Interface;` is transformed to:
+
+```yaml
+  {kind: 'ImplIntroducer', text: 'impl'},
+  {kind: 'DefaultSelfImplAs', text: 'as'},
+  {kind: 'IdentifierNameExpr', text: 'Interface'},
+{kind: 'ImplDecl', text: ';', subtree_size: 4},
+```
+
+This is handled by the `ExpectAsOrTypeExpression` code from
+[parse/handle_impl.cpp](/toolchain/parse/handle_impl.cpp):
+
+```cpp
+if (context.PositionIs(Lex::TokenKind::As)) {
+  // No type expression: `as` ...
+  context.AddLeafNode(NodeKind::DefaultSelfImplAs, context.Consume());
+  context.PushState(State::Expr);
+} else {
+  // Type expression first, then `as` ...
+  context.PushState(State::ImplBeforeAs);
+  context.PushStateForExpr(PrecedenceGroup::ForImplAs());
+}
+```
+
+and then `HandleImplBeforeAs` creates the parent node in the second case:
+
+```cpp
+auto state = context.PopState();
+if (auto as = context.ConsumeIf(Lex::TokenKind::As)) {
+  context.AddNode(NodeKind::TypeImplAs, *as, state.subtree_start,
+                  state.has_error);
+  context.PushState(State::Expr);
+} else {
+  if (!state.has_error) {
+    CARBON_DIAGNOSTIC(ImplExpectedAs, Error,
+                      "Expected `as` in `impl` declaration.");
+    context.emitter().Emit(*context.position(), ImplExpectedAs);
+  }
+  context.ReturnErrorOnState();
+}
+```
+
+Note (1) that the `state.subtree_start` value comes from the
+`context.PushState(State::ImplBeforeAs);` before parsing the type expression,
+and that is how that type expression ends up as the child of the created
+`TypeImplAs` node. Unlike
+[the previous case 1](#case-1-introducer-to-optional-clause-is-used-as-parent-node),
+though, the parent node uses the token after the optional expression, rather
+than an introducer token for the optional clause.
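
The role of `subtree_start` in Note (1) can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the toolchain's implementation: `ToyNode`, `ToyTree`, and `BuildImplBoolAsInterface` are hypothetical names, and the model assumes a parent's `subtree_size` is the number of nodes emitted since `subtree_start`, plus one for the parent itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parse node; not the toolchain's real type.
struct ToyNode {
  std::string kind;
  std::string text;
  int subtree_size;
};

// Hypothetical postorder tree: children are always emitted before parents.
class ToyTree {
 public:
  // Plays the role of the value saved as `subtree_start`.
  auto size() const -> std::size_t { return nodes_.size(); }

  // A leaf is a subtree of size 1.
  auto AddLeafNode(std::string kind, std::string text) -> void {
    nodes_.push_back({std::move(kind), std::move(text), 1});
  }

  // Every node emitted since `subtree_start` joins the new node's subtree
  // (an assumption of this sketch).
  auto AddNode(std::string kind, std::string text, std::size_t subtree_start)
      -> void {
    assert(subtree_start <= nodes_.size());
    int subtree_size = static_cast<int>(nodes_.size() - subtree_start) + 1;
    nodes_.push_back({std::move(kind), std::move(text), subtree_size});
  }

  auto nodes() const -> const std::vector<ToyNode>& { return nodes_; }

 private:
  std::vector<ToyNode> nodes_;
};

// Mimics the node order for `impl bool as Interface;`.
auto BuildImplBoolAsInterface() -> ToyTree {
  ToyTree tree;
  std::size_t decl_start = tree.size();
  tree.AddLeafNode("ImplIntroducer", "impl");
  // Saved before the type expression, like `PushState(State::ImplBeforeAs)`.
  std::size_t before_as_start = tree.size();
  tree.AddLeafNode("BoolTypeLiteral", "bool");
  // `as` is consumed after the type expression, yet becomes its parent.
  tree.AddNode("TypeImplAs", "as", before_as_start);
  tree.AddLeafNode("IdentifierNameExpr", "Interface");
  tree.AddNode("ImplDecl", ";", decl_start);
  return tree;
}
```

In this model the `TypeImplAs` node ends up with a `subtree_size` of 2 and `ImplDecl` with 5, which lines up with the parse tree output shown for `impl bool as Interface;`.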
+
+Note (2) how `HandleImplBeforeAs` handles three cases of errors:
+
+-   `as` present but an error in the child type expression -> error on the
+    output `TypeImplAs` node, but not propagated to the parent.
+-   Error from no `as` present but the type expression was okay -> create a new
+    error.
+-   There was an error from the child type expression and no `as` present -> no
+    new diagnostic; we suppress errors once one is emitted, until we can
+    recover.
+
+If there is no `as` token, we don't output either a `TypeImplAs` or a
+`DefaultSelfImplAs` node, as required by the parent node, so in those cases we
+mark the parent as having an error.
+
+#### Case 3: optional sibling
+
+> TODO: This was changed by
+> [#3678](https://github.com/carbon-language/carbon-lang/pull/3678) and needs to
+> be updated.
+
+**Example:** The optional type expression before `as` in `impl as` is output as
+an optional sibling subtree between the `ImplIntroducer` node for the `impl`
+introducer and the `ImplAs` node for the required `as` keyword.
+
+`impl bool as Interface;` is transformed to:
+
+```yaml
+  {kind: 'ImplIntroducer', text: 'impl'},
+  {kind: 'BoolTypeLiteral', text: 'bool'},
+  {kind: 'ImplAs', text: 'as'},
+  {kind: 'IdentifierNameExpr', text: 'Interface'},
+{kind: 'ImplDecl', text: ';', subtree_size: 5},
+```
+
+while `impl as Interface;` is transformed to:
+
+```yaml
+  {kind: 'ImplIntroducer', text: 'impl'},
+  {kind: 'ImplAs', text: 'as'},
+  {kind: 'IdentifierNameExpr', text: 'Interface'},
+{kind: 'ImplDecl', text: ';', subtree_size: 4},
+```
+
+This is handled by the `ExpectAsOrTypeExpression` code from
+[parse/handle_impl.cpp](/toolchain/parse/handle_impl.cpp):
+
+```cpp
+if (context.PositionIs(Lex::TokenKind::As)) {
+  // No type expression: `as` ...
+  context.AddLeafNode(NodeKind::ImplAs, context.Consume());
+  context.PushState(State::Expr);
+} else {
+  // Type expression first, then `as` ...
+  context.PushState(State::ImplBeforeAs);
+  context.PushStateForExpr(PrecedenceGroup::ForImplAs());
+}
+```
+
+and then `HandleImplBeforeAs` follows
+[the "something required in context" pattern](#something-required-in-context) to
+deal with the `as` that follows when the type expression is present.
+
+### Operators
+
+FIXME
+
+An independent description of our approach:
+["Better operator precedence" on scattered-thoughts.net](https://www.scattered-thoughts.net/writing/better-operator-precedence/)
diff --git a/toolchain/docs/parse.svg b/toolchain/docs/parse.svg
new file mode 100644
index 0000000000000..6576b352f8f39
--- /dev/null
+++ b/toolchain/docs/parse.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/website/prebuild.py b/website/prebuild.py
index c0a3829ac145c..a7d7457863e19 100755
--- a/website/prebuild.py
+++ b/website/prebuild.py
@@ -189,7 +189,9 @@ def next(nav_order: list[int]) -> int:
     # Reset the order for the implementation children.
     nav_order[0] = 0
-    label_subdir("toolchain", next(nav_order), parent_title="Implementation")
+    label_subdir(
+        "toolchain/docs", next(nav_order), parent_title="Implementation"
+    )
     label_subdir("explorer", next(nav_order), parent_title="Implementation")
     label_subdir("testing", next(nav_order), parent_title="Implementation")