introduce Language DSL v2 #610

Merged
merged 8 commits into from
Oct 23, 2023

@OmarTawfik OmarTawfik commented Oct 16, 2023

About DSL v2

This introduces DSL v2 that describes the grammar in terms of AST types, not EBNF rules:

  • The new model types in types.rs describe the semantics of each production much more accurately and restrictively, which removes the need for much of the validation we had for earlier grammars.
  • The DSL, written inside a Rust macro invocation in definition.rs, is automatically validated on every cargo check, is formatted by rustfmt, and has go-to-definition/find-references/rename IDE support because of the backing types emitted in emitter.rs.

Behind the Scenes

The magic happens because of the new codegen_language_internal_macros crate, which reads the model types, and generates implementations for three things:

  1. Spanned, which rewrites all fields to use Spanned<T> types that preserve the input token spans for validation.
  2. ParseInputTokens, which uses the syn crate to parse the tokens into the backing Rust types.
  3. WriteOutputTokens, which serializes the grammar's Spanned types into a Rust expression that generates the definition using the original types (without Spanned<T>), so that they can be used by client crates.
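To make the span-preserving idea above more concrete, here is a minimal sketch of a model type and its spanned counterpart. It is only an illustration under stated assumptions: the Span struct with byte offsets, the Field/SpannedField types, and into_unspanned are hypothetical stand-ins, while the real crate works with macro token spans and generated code.

```rust
/// Hypothetical stand-in for a macro token span (not the real span type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Pairs a value with the span of the input tokens it was parsed from.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Spanned<T> {
    span: Span,
    value: T,
}

impl<T> Spanned<T> {
    pub fn new(span: Span, value: T) -> Self {
        Self { span, value }
    }

    /// Where the value came from, e.g. for pointing validation errors at it.
    pub fn span(&self) -> Span {
        self.span
    }

    /// Drops the span and returns the plain value.
    pub fn into_value(self) -> T {
        self.value
    }
}

/// Hypothetical model type as a client crate would see it (no spans).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Field {
    pub name: String,
    pub optional: bool,
}

/// The same type as the macro would see it: every field wrapped in `Spanned<T>`.
#[derive(Clone, Debug)]
pub struct SpannedField {
    pub name: Spanned<String>,
    pub optional: Spanned<bool>,
}

impl SpannedField {
    /// Strips the spans, roughly the transformation the output step needs
    /// before the definition is handed to client crates.
    pub fn into_unspanned(self) -> Field {
        Field {
            name: self.name.into_value(),
            optional: self.optional.into_value(),
        }
    }
}
```

In this sketch, validation would read span() to report errors at the right tokens, and only the unspanned Field would ever leave the macro.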

Next Steps

We unfortunately now have three sources of truth for the language, which we are keeping in sync manually for now:

  1. The YAML grammar (here), which is used to produce the HTML spec.
  2. The DSL v1 grammar (here), which is used to produce the parser.
  3. The DSL v2 grammar (here), which is introduced in this PR and not used by anything yet.

I will start to delete the YAML grammar, and move codegen_spec to use DSL v2, while in parallel starting a discussion about removing DSL v1, as it is a bigger chunk of work that requires coordinating with other ongoing parser/AST work.

Areas of Improvement

Using serde to serialize/deserialize was not originally possible, because its data model does not support token spans, which cannot be serialized, recreated, or persisted outside the context of macro invocations. So I moved to using syn for parsing for now. It is working well, although it has a few caveats:

  1. I needed to implement a parser, along with custom implementations for Rc/IndexMap/Box and other data structures, which serde handles by default.
  2. The parser is type driven, which means it is strict and expects fields to be defined in exactly the same order as in the backing Rust types. For example, struct X { a: u8, b: u8 } has to be declared as X(a = 1, b = 2), not X(b = 2, a = 1).
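The ordering restriction in the second caveat can be illustrated with a small string-based sketch. This is not the PR's parser (which works on token streams via syn); parse_x and parse_field are hypothetical helpers that only mimic the "consume fields in declaration order" behavior.

```rust
/// The backing type from the example above.
#[derive(Debug, PartialEq)]
pub struct X {
    pub a: u8,
    pub b: u8,
}

/// Parses `name = <u8>` and insists that `name` is exactly `expected`.
fn parse_field(part: &str, expected: &str) -> Result<u8, String> {
    let (name, value) = part
        .split_once('=')
        .ok_or_else(|| format!("expected `{expected} = <value>`, found `{}`", part.trim()))?;
    if name.trim() != expected {
        return Err(format!("expected field `{expected}`, found `{}`", name.trim()));
    }
    value
        .trim()
        .parse::<u8>()
        .map_err(|e| format!("invalid value for `{expected}`: {e}"))
}

/// Hypothetical type-driven parser: fields are read in declaration order,
/// with no lookup by name, so a reordered input is rejected.
pub fn parse_x(input: &str) -> Result<X, String> {
    let inner = input
        .trim()
        .strip_prefix("X(")
        .and_then(|rest| rest.strip_suffix(')'))
        .ok_or_else(|| "expected `X(...)`".to_string())?;
    let mut parts = inner.split(',');
    let a = parse_field(parts.next().ok_or("missing field `a`")?, "a")?;
    let b = parse_field(parts.next().ok_or("missing field `b`")?, "b")?;
    if parts.next().is_some() {
        return Err("unexpected extra field".to_string());
    }
    Ok(X { a, b })
}
```

With this sketch, parse_x("X(a = 1, b = 2)") succeeds, while parse_x("X(b = 2, a = 1)") fails on the first field because the parser expects a there.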

I think I found a solution/workaround to the serde data model limitation that will let me remove all these extra implementations and replace them with a serde deserializer, but I will look into this in a later iteration, since it is not blocking us right now.

Additionally, validation can still be tightened in a few places, but that is mostly about keeping the DSL lean/readable rather than about correctness, so I will also defer this work.

@changeset-bot

changeset-bot bot commented Oct 16, 2023

⚠️ No Changeset found

Latest commit: 17c2a55

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@OmarTawfik OmarTawfik marked this pull request as ready for review October 16, 2023 20:18
@OmarTawfik OmarTawfik requested a review from a team as a code owner October 16, 2023 20:18
@OmarTawfik OmarTawfik enabled auto-merge October 16, 2023 20:25
@OmarTawfik OmarTawfik linked an issue Oct 16, 2023 that may be closed by this pull request
Contributor

@AntonyBlakey AntonyBlakey left a comment

Seems to be a few drive-by decisions taken that aren't related to the purpose of the PR

Struct(
name = PragmaDirective,
fields = (
pragma_keyword = Required(Terminal([PragmaKeyword])),
Contributor

This seems overly verbose - can `Required` be the default, and can single-element arrays be replaced by just the element?

Contributor Author

We should be able to have these fine controls once we move to serde, using things like #[serde(flatten)] and #[serde(default)], but for now, I suggest keeping the hand-written parser simple/straightforward. It is expensive to add more capabilities to it only to throw it away later.

scanner = Choice([
Atom("0"),
Sequence([
Range(inclusive_start = '1', exclusive_end = '9'),
Contributor

These exclusive_ends should all be inclusive_end

Contributor Author

Good catch. Fixed!
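For reference, the difference matters because an exclusive end drops the last character of the range. A minimal sketch using Rust's own range types, assuming the DSL's inclusive/exclusive semantics mirror them (in_exclusive and in_inclusive are hypothetical helper names):

```rust
/// Range with an exclusive end: `end` itself is not matched.
pub fn in_exclusive(start: char, end: char, c: char) -> bool {
    (start..end).contains(&c)
}

/// Range with an inclusive end: `end` itself is matched.
pub fn in_inclusive(start: char, end: char, c: char) -> bool {
    (start..=end).contains(&c)
}
```

Under that assumption, Range(inclusive_start = '1', exclusive_end = '9') would accept '8' but reject the digit '9', which is why inclusive_end is the intended field here.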

Optional(reference = NonTerminal(EventParametersDeclaration)),
anonymous_keyword =
Optional(reference = Terminal([AnonymousKeyword])),
semicolon = Required(Terminal([Semicolon]))
Contributor

I prefer this to be a TerminatedBy construction so that the error recovery can pick it up.

Contributor Author

Discussing offline.

Contributor Author

Added error_recovery field.

close_paren = Required(Terminal([CloseParen]))
)
),
Separated(
Contributor

SeparatedBy?

Contributor Author

I suggest using Separated since it describes the item being separated (higher importance to describe/annotate). The term SeparatedBy refers to the separator itself (lower importance). This is also reflected in the backing Rust type, which is SeparatedItem rather than SeparatedByItem.

For example in Parameters, the focus here is on the Parameter being separated, not the Comma. This is the thing we would typically extract/use in further passes.

body = Required(NonTerminal(FunctionBody))
)
),
Struct(
Contributor

I prefer this to be an explicit DelimitedBy construct

Contributor Author

Discussing offline.

Contributor Author

Added error_recovery field.

@OmarTawfik OmarTawfik disabled auto-merge October 17, 2023 16:50
@OmarTawfik OmarTawfik force-pushed the dsl-v2 branch 2 times, most recently from 0a4c5c6 to f11542a Compare October 18, 2023 23:13
Contributor

@Xanewok Xanewok left a comment

Looks good! Left some nitpicks.

I'm slightly nervous that we're de facto writing a compiler/analysis for the compiler, introducing even more machinery for the language definition itself (despite knowing it's a good decision). I think our immediate priority should be to streamline this ASAP, as it hurts code navigation (which language definition framework is this file part of, again?) and increases the mental capacity required to move around and work inside the code.

}

fn collect_top_level_items(&mut self) {
for item in self.language.clone().items() {
Contributor

The clone is redundant - we should probably enable Clippy soon to catch stuff like this

Contributor Author

@OmarTawfik OmarTawfik Oct 20, 2023

Unfortunately it is not, as analysis is otherwise borrowed as mutable. However, that should go away in the next iteration, with better validation that doesn't need to pass these around.

Contributor

Hm, it was redundant in f11542a and now we can remove it with

diff --git a/crates/codegen/language/definition/src/compiler/analysis/definitions.rs b/crates/codegen/language/definition/src/compiler/analysis/definitions.rs
index 430d8587..6c30e6fc 100644
--- a/crates/codegen/language/definition/src/compiler/analysis/definitions.rs
+++ b/crates/codegen/language/definition/src/compiler/analysis/definitions.rs
@@ -18,7 +18,7 @@ pub fn analyze_definitions(analysis: &mut Analysis) {
 }
 
 fn collect_top_level_items(analysis: &mut Analysis) {
-    for item in analysis.language.clone().items() {
+    for item in analysis.language.items() {
         let name = get_item_name(item);
         let defined_in = calculate_defined_in(analysis, item);
 
@@ -111,7 +111,7 @@ fn get_item_name(item: &Item) -> &Spanned<Identifier> {
     }
 }
 
-fn calculate_defined_in(analysis: &mut Analysis, item: &Item) -> VersionSet {
+fn calculate_defined_in(analysis: &Analysis, item: &Item) -> VersionSet {
     return match item {
         Item::Struct { item } => VersionSet::from_range(calculate_enablement(
             analysis,
@@ -157,7 +157,7 @@ fn calculate_defined_in(analysis: &mut Analysis, item: &Item) -> VersionSet {
 }
 
 fn calculate_enablement(
-    analysis: &mut Analysis,
+    analysis: &Analysis,
     enabled_in: &Option<Spanned<Version>>,
     disabled_in: &Option<Spanned<Version>>,
 ) -> VersionRange {

but it's a minuscule thing
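The borrowing argument can be reproduced in isolation. The sketch below uses hypothetical stand-ins (Language, Analysis with a collected field, and position_of) rather than the crate's real types; it only shows why a read-only helper taking &Analysis lets the loop iterate without cloning.

```rust
/// Hypothetical stand-in for the language definition.
pub struct Language {
    pub items: Vec<String>,
}

impl Language {
    pub fn items(&self) -> impl Iterator<Item = &String> {
        self.items.iter()
    }
}

/// Hypothetical stand-in for the analysis state.
pub struct Analysis {
    pub language: Language,
    pub collected: Vec<(String, Option<usize>)>,
}

/// Read-only helper: because it takes `&Analysis`, it can be called while
/// `analysis.language` is being iterated. If it took `&mut Analysis`, the
/// call below would conflict with the iterator's shared borrow, and cloning
/// `language` first would be one way to work around that.
fn position_of(analysis: &Analysis, item: &str) -> Option<usize> {
    analysis.language.items().position(|other| other.as_str() == item)
}

pub fn collect_top_level_items(analysis: &mut Analysis) {
    let mut collected = Vec::new();
    for item in analysis.language.items() {
        // Shared reborrow of `analysis`; no clone of `language` is needed.
        let position = position_of(analysis, item);
        collected.push((item.clone(), position));
    }
    // The iterator's borrow has ended, so mutating `analysis` is fine here.
    analysis.collected = collected;
}
```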

}

self.metadata.insert(
(**name).to_owned(),
Contributor

nitpick:

Deref can be convenient but sometimes hurts legibility: this calls a potentially costly to_owned, and it took me a second to decipher that it derefs to Identifier (given that Identifier also has another Deref impl) and clones it. Maybe it'd be better to have Spanned::<T>::{value, span} accessors and call name.(as_?)value().to_owned() here?

Using name.value() above would also make it more obvious that the self.metadata doesn't care about spans but inner values
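A rough sketch of the two spellings side by side, to show they produce the same key. Everything here is a simplified assumption: the tuple span, the BTreeMap<Identifier, u32> metadata map, and the function names are hypothetical, and only the double-Deref shape follows the code under review.

```rust
use std::collections::BTreeMap;
use std::ops::Deref;

#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Identifier {
    value: String,
}

impl Deref for Identifier {
    type Target = str;
    fn deref(&self) -> &str {
        &self.value
    }
}

/// Simplified `Spanned<T>`; the span is a hypothetical (start, end) pair.
pub struct Spanned<T> {
    span: (usize, usize),
    value: T,
}

impl<T> Spanned<T> {
    pub fn new(span: (usize, usize), value: T) -> Self {
        Self { span, value }
    }
    /// Explicit accessor for the inner value.
    pub fn value(&self) -> &T {
        &self.value
    }
    /// Explicit accessor for the span.
    pub fn span(&self) -> (usize, usize) {
        self.span
    }
}

impl<T> Deref for Spanned<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.value
    }
}

/// Current style: `**name` goes `&Spanned<Identifier>` -> `Identifier`,
/// and `to_owned` then clones the `Identifier`.
pub fn insert_via_deref(metadata: &mut BTreeMap<Identifier, u32>, name: &Spanned<Identifier>) {
    metadata.insert((**name).to_owned(), 0);
}

/// Suggested style: the accessor makes it visible that the span is dropped.
pub fn insert_via_value(metadata: &mut BTreeMap<Identifier, u32>, name: &Spanned<Identifier>) {
    metadata.insert(name.value().to_owned(), 0);
}
```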

Contributor Author

@OmarTawfik OmarTawfik Oct 20, 2023

I see. I was under the impression that Deref here is more idiomatic, but happy to change it. I'm not a big fan of &** everywhere either.

Can do that in the next iteration if you don't mind, since this is only specific to this crate, and shouldn't affect other work in other areas/crates.

}
}

fn check_reference(
Contributor

This, peppered here along with ReferenceFilter::apply, took some time for me to decipher. Could we add more comments on why it's this way? Maybe it'd improve legibility if we combined the "get reference by ident for a given namespace/kind or error out" steps?

@@ -0,0 +1,503 @@
use crate::{
Contributor

This could use module documentation, or at least documentation for analyze_references, since the check_* names are vague and it seems we also account for versions in addition to just checking whether the references are valid. Maybe implementing a folder/visitor would be simpler, since it's mostly calling update_enablement and check_reference while visiting the items?

return self.ranges.is_empty();
}

pub fn add(&mut self, range: &VersionRange) {
Contributor

Is there a reason why this is mutable, while difference is immutable and creates a new value?

Contributor Author

Not always. It sometimes serves as an accumulator, instead of cloning self when it is not changed.
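To illustrate the two styles being contrasted, here is a minimal sketch of a version set with an in-place add and a value-returning difference. It is an assumption-laden illustration, not the crate's implementation: versions are plain u32 values, ranges are half-open, and the merging logic is my own.

```rust
/// Hypothetical half-open range of versions: `start..end`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct VersionRange {
    pub start: u32,
    pub end: u32,
}

/// Sorted, non-overlapping, non-adjacent ranges.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct VersionSet {
    ranges: Vec<VersionRange>,
}

impl VersionSet {
    pub fn is_empty(&self) -> bool {
        self.ranges.is_empty()
    }

    /// Accumulator style: mutates in place, merging overlapping or adjacent ranges.
    pub fn add(&mut self, range: &VersionRange) {
        if range.start >= range.end {
            return; // empty range, nothing to add
        }
        let mut merged = *range;
        let mut result = Vec::with_capacity(self.ranges.len() + 1);
        for existing in &self.ranges {
            if existing.end < merged.start || merged.end < existing.start {
                result.push(*existing);
            } else {
                merged.start = merged.start.min(existing.start);
                merged.end = merged.end.max(existing.end);
            }
        }
        result.push(merged);
        result.sort_by_key(|r| r.start);
        self.ranges = result;
    }

    /// Value style: leaves both operands untouched and returns a new set.
    pub fn difference(&self, other: &VersionSet) -> VersionSet {
        let mut result = VersionSet::default();
        for range in &self.ranges {
            let mut start = range.start;
            for cut in &other.ranges {
                if cut.end <= start || cut.start >= range.end {
                    continue; // this cut does not touch the remaining part
                }
                if cut.start > start {
                    result.add(&VersionRange { start, end: cut.start });
                }
                start = start.max(cut.end);
            }
            if start < range.end {
                result.add(&VersionRange { start, end: range.end });
            }
        }
        result
    }
}
```

In this sketch, add is convenient when building up one set in a loop, while difference fits call sites that still need the original operands afterwards.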


/// A wrapper type to make sure the DSL token is written as an identifier instead of a string literal.
#[derive(Clone, Debug, Eq, Hash, Ord, PartialEq, PartialOrd)]
pub struct Identifier {
Contributor

I'm going to risk a dumb question - could we use proc_macro2/syn::Ident here directly, since we only seem to use it with the macro and we almost always want it with a span?

Contributor Author

@OmarTawfik OmarTawfik Oct 20, 2023

Valid question!
We cannot use it instead of Identifier, as we cannot implement other traits for this external type.
We cannot wrap it inside Identifier either, as we would have to special-case it in internal_macros in order for it to not be wrapped unnecessarily in Spanned<> again.
Also, it is not serde serializable (Span is not), at least not without workarounds, which we need for the entire model.
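As a rough illustration of the local-wrapper approach, the sketch below shows a crate-owned Identifier implementing both a standard-library trait and a crate-local one. Rust's orphan rule rejects implementing a foreign trait for a foreign type, which a local newtype sidesteps. WriteDsl is a hypothetical trait invented for this example, and which traits the real crate needs is an assumption on my part.

```rust
use std::fmt;
use std::ops::Deref;

/// Hypothetical crate-local trait: how a value is written back into DSL source.
pub trait WriteDsl {
    fn write_dsl(&self) -> String;
}

/// Local wrapper type, so this crate is free to implement any trait for it.
#[derive(Clone, Debug, Eq, Hash, Ord, PartialEq, PartialOrd)]
pub struct Identifier {
    value: String,
}

impl From<&str> for Identifier {
    fn from(value: &str) -> Self {
        Self { value: value.to_owned() }
    }
}

impl Deref for Identifier {
    type Target = str;
    fn deref(&self) -> &str {
        &self.value
    }
}

/// Foreign trait on a local type: allowed by the orphan rule.
impl fmt::Display for Identifier {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.value)
    }
}

/// Identifiers are written bare ...
impl WriteDsl for Identifier {
    fn write_dsl(&self) -> String {
        self.value.clone()
    }
}

/// ... while plain strings are written as quoted literals.
impl WriteDsl for String {
    fn write_dsl(&self) -> String {
        format!("{:?}", self)
    }
}
```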

Contributor

Got it, thanks! Maybe it'd be worthwhile to add a comment explaining why we're not using proc_macro2::Ident directly?

Contributor

@Xanewok Xanewok left a comment

I'd like for us to address #610 (comment), otherwise it's good to go and we can tackle other comments in a follow-up.

@OmarTawfik OmarTawfik requested a review from Xanewok October 21, 2023 09:18
@OmarTawfik OmarTawfik disabled auto-merge October 22, 2023 03:57
Contributor

@Xanewok Xanewok left a comment

Looks good, thanks a lot for the work on the next version of the DSL! Let's merge this once the CI is green 🎉

Merged via the queue into NomicFoundation:main with commit f17c0f8 Oct 23, 2023
1 check passed
@OmarTawfik OmarTawfik deleted the dsl-v2 branch October 23, 2023 09:31
OmarTawfik added a commit to OmarTawfik-forks/slang that referenced this pull request Dec 2, 2023
Makes the grammar a lot simpler, since it is no longer needed after enums were simplified in NomicFoundation#610
OmarTawfik added a commit to OmarTawfik-forks/slang that referenced this pull request Dec 3, 2023
Makes the grammar a lot simpler, since it is no longer needed after enums were simplified in NomicFoundation#610
OmarTawfik added a commit to OmarTawfik-forks/slang that referenced this pull request Dec 4, 2023
Makes the grammar a lot simpler, since it is no longer needed after enums were simplified in NomicFoundation#610
OmarTawfik added a commit to OmarTawfik-forks/slang that referenced this pull request Dec 4, 2023
Since it is no longer needed after enums were simplified in NomicFoundation#610
github-merge-queue bot pushed a commit that referenced this pull request Dec 4, 2023
Makes the grammar a lot simpler, since it is no longer needed after
enums were simplified in #610
OmarTawfik added a commit to OmarTawfik-forks/slang that referenced this pull request Dec 5, 2023
Since it is no longer needed after enums were simplified in NomicFoundation#610.
We can just use the `reference` item name directly.
github-merge-queue bot pushed a commit that referenced this pull request Dec 6, 2023
Since it is no longer needed after enums were simplified in #610.
We can just use the referenced item name directly.
Development

Successfully merging this pull request may close these issues.

Exploration: Rust Types as Source of Truth
3 participants