Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse mode that returns all errors and an as-valid-as-possible Document #371

Open
Kixunil opened this issue Nov 21, 2022 · 12 comments
Open
Labels
A-parse Area: Parsing TOML C-enhancement Category: Raise on the bar on expectations

Comments

@Kixunil
Copy link

Kixunil commented Nov 21, 2022

Background

In configure_me I recently added support for very nice error messages using codespan-reporting. One of the nice features is ability to report multiple errors. However the requirement to not duplicate keys degrades the experience. Note that configure_me input is even more strict: it requires that keys are not duplicate among multiple items. This is detected manually and produces output like this:

error: the option `foo` appears more than once
   ┌─ unknown file:4:26
   │
 3 │         conf_file_param = "foo"
   │                           ----- the option was first defined here
 4 │         conf_dir_param = "foo"
   │                          ^^^^^ the option is repeated here
   ·
13 │         [param.foo]
   │                ^^^ ... and here
   ·
16 │         [switch.foo]
   │                 ^^^ ... and here

As you can see, this is very programmer-friendly. Note that this also outputs other errors such as invalid names of parameters. However if you duplicate a key within a single table you get a message like this:

   ┌─ unknown file:16:9
   │
16 │         [switch.foo]
   │         ^ redefinition of table `switch.foo` for key `switch.foo` at line 16 column 9

Not only does it limit the output to a single error, it's also missing information about which other place defines the same key and span points to a weird place as well.

Proposed solution

The easiest for me would be adding set_allow_duplicate_keys() method similar to the existing methods. Then the existing code would work just fine. The method should probably have documentation that users must reject duplicates themselves to adhere to Toml specification and the method is only provided for custom error handling.

Alternative

I would find it understandable if you think that despite warnings in the doc people would abuse it to implement "robustness principle" which is actually quite bad. If you'd rather not provide an abuse-able tool then at least improving the error type to contain enough information to show all duplicated fields within a single map would be nice.

@ordian
Copy link
Member

ordian commented Nov 21, 2022

If I understand correctly, and the proposed solution is to add support for a non-standard flavor of Toml which is disabled by default. And the alternative solution is to improve the error message on duplicate keys. I'd be in favor of the latter.

@epage
Copy link
Member

epage commented Nov 21, 2022

I'd rather we explore any other option before going down the route of set_allow_duplicate_keys.

We already need to add spans for fields. Once we have that, we could easily look up the first span and report both.

We could then collect errors as we go, accumulating them, rather than stopping immediately. Some questions or concerns I have that would need to be addressed

  • How much does it impact performance of the success case?
  • Which errors do we treat as recoverable / accumulatable? Only non-syntactic ones like duplicate keys?
  • How do we track this information as we go?

@Kixunil
Copy link
Author

Kixunil commented Nov 21, 2022

How much does it impact performance of the success case?

For my projects it certainly won't be enough for me to care. Parsing would have to be maybe 1000x slower to be a problem. If it has to be behind a feature flag so that perf-sensitive people are not harmed, I will happily turn on the feature flag.

Only non-syntactic ones like duplicate keys?

I believe so. I really don't like how rustc tries to recover syntactic errors causing printing of more errors that were caused by the syntactic ones.

How do we track this information as we go?

In configure_me I used a special error type for duplicates and I track the spans of all occurrences in Vec<Span>. I don't see anything wrong with it.

@epage
Copy link
Member

epage commented Nov 21, 2022

For my projects it certainly won't be enough for me to care. Parsing would have to be maybe 1000x slower to be a problem. If it has to be behind a feature flag so that perf-sensitive people are not harmed, I will happily turn on the feature flag.

Cargo is the main performance-sensitive project that I'm aware of and we need to be sensitive to as it has to parse hundreds of toml files per invocation. I have been trying to nerd snipe the performance team into looking at this and seeing about doing a binary cache for immutable manifests (ie from a registry)

@epage
Copy link
Member

epage commented Nov 21, 2022

Also, I'll note that my main toml-related priority right now is moving toml-rs to be on top of toml_edit. Anything related to this comes after for me, including reviewing people making changes to toml-rs as that takes both review time and is throwaway.

@Kixunil
Copy link
Author

Kixunil commented Dec 27, 2022

Just noticed I forgot to say that returning error will prevent me from reporting all duplicates across different parts of the toml. So it's better than the current situation but worse than the ideal situation. And I guess adding logic to toml to handle this would be too much? I can't even imagine what it would look like.

@epage
Copy link
Member

epage commented Dec 27, 2022

Something I've been considering is to expose error recovery to the caller. meaning we'd have something like parser_recovery(data: &str) -> (Document, Vec<Errors>).

@Kixunil
Copy link
Author

Kixunil commented Dec 27, 2022

Sounds interesting, what would the Document type provide?

@epage
Copy link
Member

epage commented Dec 27, 2022

Document would be what toml_edit::Document we were able to load, despite the errors.

@Kixunil
Copy link
Author

Kixunil commented Dec 28, 2022

That sounds quite good, any reason to not return T: Deserialize instead?

@epage
Copy link
Member

epage commented Dec 28, 2022

I'm assuming you are referring to the toml crate. In #340 we'll be swapping out the parser for toml_edit and the lowest level place we'll need to handle error recovery will be in toml_edit. We'll then need to decide how much to automatically expose up the stack. Keep in mind that I'm hoping to see full error recovery, including from syntax errors. This will greatly diminish the value of directly supporting T: Deserialize as syntax errors will likely lead to schema errors. Most likely, you'll want to check the errors and decide if you want to still deserialize or report the errors as-is.

@epage
Copy link
Member

epage commented Dec 28, 2022

And to clarify, with us swapping tomls parser in #340, I'd like to avoid throw-away work before then, like implementing or reviewing someone else implementing error recovery in the existing toml crate.

@epage epage added C-enhancement Category: Raise on the bar on expectations A-parse Area: Parsing TOML labels Jan 18, 2023
@epage epage changed the title Make it possible to allow duplicate keys Parse mode that returns all errors and an as-valid-as-possible Document Jan 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-parse Area: Parsing TOML C-enhancement Category: Raise on the bar on expectations
Projects
None yet
Development

No branches or pull requests

3 participants