introduce Language DSL v2 (#610)

## About DSL v2

This introduces DSL v2, which describes the grammar in terms of AST types, not EBNF rules:

- The new model types in [types.rs](https://github.com/OmarTawfik-forks/slang/blob/dsl-v2/crates/codegen/language/definition/src/model/types.rs) are much more accurate/restrictive in describing the semantics of different productions, which removes the need for much of the validation that earlier grammars required.
- The DSL inside a Rust macro invocation in [definition.rs](https://github.com/OmarTawfik-forks/slang/blob/dsl-v2/crates/solidity/inputs/language/src/definition.rs) is automatically validated on every `cargo check`, is formatted by `rustfmt`, and has definition/references/rename IDE support because of the backing types emitted in [emitter.rs](https://github.com/OmarTawfik-forks/slang/blob/dsl-v2/crates/codegen/language/definition/src/compiler/emitter.rs).

## Behind the Scenes

The magic happens because of the new `codegen_language_internal_macros` crate, which reads the model types and generates implementations for three things:

1. `Spanned`, which rewrites all fields into `Spanned<T>` types that preserve the input token spans for validation.
2. `ParseInputTokens`, which uses the `syn` crate to parse the tokens into the backing Rust types.
3. `WriteOutputTokens`, which serializes the grammar `Spanned` types into a Rust expression that generates the definition using the original types (without `Spanned<T>`), so that they can be used by client crates.

## Next Steps

We unfortunately now have three sources of truth for the language, which we are keeping in sync manually for now:

1. The YAML grammar ([here](https://github.com/OmarTawfik-forks/slang/blob/dsl-v2/crates/solidity/inputs/language/definition/manifest.yml)), which is used to produce the HTML spec.
2. The DSL v1 grammar ([here](https://github.com/OmarTawfik-forks/slang/blob/dsl-v2/crates/solidity/inputs/language/src/dsl.rs)), which is used to produce the parser.
3. The DSL v2 grammar ([here](https://github.com/OmarTawfik-forks/slang/blob/dsl-v2/crates/solidity/inputs/language/src/definition.rs)), which is introduced in this PR and not used by anything yet.

I will start by deleting the YAML grammar and moving `codegen_spec` to DSL v2, while in parallel starting a discussion about removing DSL v1, as that is a bigger chunk of work that requires coordinating with other ongoing parser/AST work.

## Areas of Improvement

Using `serde` to serialize/deserialize was not originally possible, because its data model does not support token spans, which cannot be serialized, recreated, or persisted outside the context of macro invocations. So I moved to `syn` for parsing for now. It is working well, although it has a few caveats:

1. I needed to implement a parser, along with custom implementations for `Rc`/`IndexMap`/`Box` and other data structures that `serde` handles by default.
2. The parser is type-driven, which means it is strict and expects fields to be defined in exactly the same order as in the backing Rust types. `struct X { a: u8, b: u8 }` has to be declared as `X(a = 1, b = 2)`, not `X(b = 2, a = 1)`.

I think I found a solution/workaround to the `serde` data model limitation that will let me remove all these extra implementations and replace them with a `serde` deserializer, but I will look into this in a later iteration, since it is not blocking us right now.

Additionally, validation can still be tightened in a few places, but mostly to keep the DSL lean/readable rather than for correctness, so I will also defer this work.
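To make the field-ordering caveat concrete, here is a minimal std-only sketch of a type-driven parser for the `X(a = 1, b = 2)` example. It is not the PR's implementation: the real code parses macro token streams with `syn`, while this sketch scans a plain string. The `Cursor` helper, the `parse_x` function, and this simplified `Spanned<T>` (a value plus a byte range) are hypothetical stand-ins introduced only for illustration.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a `Spanned<T>` wrapper: a parsed value plus
/// the byte range of the input it came from (the real type keeps token spans).
#[derive(Debug, PartialEq)]
struct Spanned<T> {
    value: T,
    span: Range<usize>,
}

/// Backing type from the example above, with every field wrapped in `Spanned`.
#[derive(Debug, PartialEq)]
struct X {
    a: Spanned<u8>,
    b: Spanned<u8>,
}

/// Minimal cursor over the input text (illustrative helper, not from the PR).
struct Cursor<'a> {
    input: &'a str,
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn new(input: &'a str) -> Self {
        Cursor { input, pos: 0 }
    }

    /// Advance past any leading whitespace.
    fn skip_ws(&mut self) {
        let rest = &self.input[self.pos..];
        self.pos += rest.len() - rest.trim_start().len();
    }

    /// Consume an exact literal, or report what was expected and where.
    fn expect(&mut self, lit: &str) -> Result<(), String> {
        self.skip_ws();
        if self.input[self.pos..].starts_with(lit) {
            self.pos += lit.len();
            Ok(())
        } else {
            Err(format!("expected `{}` at byte {}", lit, self.pos))
        }
    }

    /// Parse `name = <u8>`. The caller dictates which field name must come
    /// next, which is what makes the parser order-sensitive.
    fn field_u8(&mut self, name: &str) -> Result<Spanned<u8>, String> {
        self.expect(name)?;
        self.expect("=")?;
        self.skip_ws();
        let start = self.pos;
        let rest = &self.input[start..];
        let len = rest.chars().take_while(|c| c.is_ascii_digit()).count();
        let value = rest[..len]
            .parse::<u8>()
            .map_err(|e| format!("invalid u8 at byte {}: {}", start, e))?;
        self.pos += len;
        Ok(Spanned { value, span: start..self.pos })
    }
}

/// Type-driven parse of `X(a = ..., b = ...)`: fields are read in the same
/// order as they are declared on `X`, with no lookup by name.
fn parse_x(input: &str) -> Result<X, String> {
    let mut c = Cursor::new(input);
    c.expect("X")?;
    c.expect("(")?;
    let a = c.field_u8("a")?;
    c.expect(",")?;
    let b = c.field_u8("b")?;
    c.expect(")")?;
    Ok(X { a, b })
}
```

With this sketch, `parse_x("X(a = 1, b = 2)")` succeeds and records where each value sits in the input, while `parse_x("X(b = 2, a = 1)")` fails at the first field because the parser asks for `a` before it looks at anything else. Keeping the span next to each value is what lets a later validation pass point an error at the exact location in the definition.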
NomicFoundation · Oct 23, 2023 · f17c0f8
1 parent 7004bb5
Showing 84 changed files with 8,269 additions and 93 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -9,6 +9,10 @@ resolver = "2"
 members = [
     "crates/codegen/ebnf",
     "crates/codegen/grammar",
+    "crates/codegen/language/definition",
+    "crates/codegen/language/internal_macros",
+    "crates/codegen/language/macros",
+    "crates/codegen/language/tests",
     "crates/codegen/parser/generator",
     "crates/codegen/parser/runtime",
     "crates/codegen/schema",
@@ -35,6 +39,10 @@ members = [
 #
 codegen_ebnf = { path = "crates/codegen/ebnf" }
 codegen_grammar = { path = "crates/codegen/grammar" }
+codegen_language_definition = { path = "crates/codegen/language/definition" }
+codegen_language_internal_macros = { path = "crates/codegen/language/internal_macros" }
+codegen_language_macros = { path = "crates/codegen/language/macros" }
+codegen_language_tests = { path = "crates/codegen/language/tests" }
 codegen_parser_generator = { path = "crates/codegen/parser/generator" }
 codegen_parser_runtime = { path = "crates/codegen/parser/runtime" }
 codegen_schema = { path = "crates/codegen/schema" }
@@ -86,9 +94,17 @@ serde_yaml = { version = "0.9.19" }
 similar-asserts = { version = "1.4.2" }
 strum = { version = "0.24.0" }
 strum_macros = { version = "0.24.0" }
+syn = { version = "2.0.29", features = [
+    "fold",
+    "full",
+    "extra-traits",
+    "parsing",
+    "printing",
+] }
 tera = { version = "1.19.0" }
 terminal_size = { version = "0.2.6" }
 thiserror = { version = "1.0.40" }
+trybuild = { version = "1.0.85" }
 toml = { version = "0.7.6" }
 typed-arena = { version = "2.0.2" }
 url = { version = "2.3.1" }

diff --git a/crates/codegen/language/definition/Cargo.toml b/crates/codegen/language/definition/Cargo.toml
@@ -0,0 +1,21 @@
+[package]
+name = "codegen_language_definition"
+version.workspace = true
+rust-version.workspace = true
+edition.workspace = true
+publish = false
+
+[dependencies]
+codegen_language_internal_macros = { workspace = true }
+indexmap = { workspace = true }
+Inflector = { workspace = true }
+infra_utils = { workspace = true }
+itertools = { workspace = true }
+proc-macro2 = { workspace = true }
+quote = { workspace = true }
+semver = { workspace = true }
+serde = { workspace = true }
+strum = { workspace = true }
+strum_macros = { workspace = true }
+syn = { workspace = true }
+thiserror = { workspace = true }