Canonicalize `dsc schema` command #538

michaeltlombardi · 2024-09-05T22:01:24Z

Summary of the new feature / enhancement

User Stories

As a user, I want the schemas returned from the dsc schema command to be canonical, returning the appropriate and fully realized schema for the requested type, so that regardless of whether I obtained the schema from the command or the schema URI, the schema describes the data in the same way.

As a documentarian, I want to keep the schema definitions more closely aligned to the code without having to do research/update work asynchronously to changes that affect the schema.

As a developer, I want to ensure that my data types are correctly represented in the canonical schema so I don't have to debug behaviors caused by differences between the published schemas and those used internally.

As a schema maintainer, I want to generate canonical schemas as close to the source code as possible to ensure they are testable and correctly integrated.

As an integrating developer, I want to be able to rely on canonical schemas so my own code can use them for processing and validation.

Current context

Currently, we use serde (v1.0.0), schemars (v0.8.12) and jsonschema (v0.18.0) for serializing, deserializing, generating, and validating JSON schemas and the associated data. The code currently derives the schemas for the data types with default settings and without additional schema validation. This generates Draft 07 schemas with minimal validation beyond the basic types.

Publishing

The currently published JSON Schemas (served as static files from this repository) are hand-crafted and include documentation, examples, full validation, and VS Code keywords to enable an improved authoring experience. These schemas are published in three formats:

Format	Description
Canonical	The schema doesn't bundle references or include VS Code keywords.
Bundled	The schema bundles references into the `$defs` keyword by ID and doesn't include VS Code keywords.
VS Code	The schema bundles references into the `definitions` keyword by ID and includes VS Code keywords.

The Canonical schemas define discrete schemas for the data models in DSC. They may use the $ref keyword to reference other canonical schemas by their $id, which is always the URI to the canonical schema file. These schemas are strictly draft 2020-12 compatible and don't include the VS Code keywords.

The Bundled schemas are a convenience. Each of them represents a top-level schema with references, and includes those referenced schemas in the $defs keyword, reusing their $id as the definition name. Per the JSON Schema spec, this enables consumers to retrieve and process a single document containing that schema without having to fetch the references. This enables us to define smaller discrete canonical schemas while still ensuring users can operate on a single JSON document for validation and inspection.

The VS Code schemas are another convenience. They leverage an extended dialect for JSON Schema that VS Code recognizes in its JSON language server (and that the Red Hat YAML language server also recognizes). These keywords enable us to provide extended descriptions in Markdown, per-enum-value descriptions, custom error messages for invalid data, snippets, deprecation notices, and more. The keywords are all annotation keywords and so can be used without errors (assuming compliant client implementations), but it's best practice to strip them from the canonical schemas to reduce both processing time and data size.
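The differences between the three published forms can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: `SchemaForm`, `bundle_keyword`, and `retained_keywords` are hypothetical names that don't exist in DSC, and `VSCODE_KEYWORDS` is an illustrative subset of the VS Code annotation keywords rather than the full list the published schemas use.

```rust
/// The three published forms described above (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchemaForm {
    Canonical,
    Bundled,
    VsCode,
}

impl SchemaForm {
    /// Keyword that holds the bundled references, if this form bundles them.
    pub fn bundle_keyword(self) -> Option<&'static str> {
        match self {
            SchemaForm::Canonical => None,
            SchemaForm::Bundled => Some("$defs"),
            SchemaForm::VsCode => Some("definitions"),
        }
    }

    /// Only the VS Code form keeps the editor-specific annotation keywords.
    pub fn keeps_vscode_keywords(self) -> bool {
        matches!(self, SchemaForm::VsCode)
    }
}

/// Illustrative subset of the VS Code annotation keywords (assumption:
/// the real schemas may use more or fewer of these).
const VSCODE_KEYWORDS: &[&str] = &[
    "markdownDescription",
    "markdownEnumDescriptions",
    "errorMessage",
    "patternErrorMessage",
    "defaultSnippets",
    "deprecationMessage",
];

/// Returns the keywords of a schema object that survive in the given form.
pub fn retained_keywords<'a>(form: SchemaForm, keywords: &[&'a str]) -> Vec<&'a str> {
    keywords
        .iter()
        .copied()
        .filter(|k| form.keeps_vscode_keywords() || !VSCODE_KEYWORDS.contains(k))
        .collect()
}
```

Calling `retained_keywords(SchemaForm::Canonical, &["type", "markdownDescription"])` keeps only `type`, while the same call with `SchemaForm::VsCode` keeps both keywords.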

Process

When we're making changes that involve updates to the schemas, the informal process is:

1. In the issue or PR where we learn that an implementation or change will require modifying one or more schemas, we apply the Schema-impact label.
2. The engineer implements code as needed, often in conversation with the schema maintainer to ensure compatibility/behavior.
3. The implementation code is merged.
4. The schema maintainer reviews the implementation/changes and drafts an update to the composable schema source files, sending them for review.
5. Engineers provide feedback about the schema updates and in-schema documentation.
6. If the schema update breaks backwards compatibility, the schema maintainer defines a new schema version prefix and adds a commit to their PR to define the new version.
7. The schema maintainer regenerates the canonical, bundled, and VS Code schemas, using the new version if needed.
8. The documentarian ensures that the markdown documents for the schema reference documentation are updated to reflect the schema changes.
9. The PR with changes to the schema and documentation is given final review and approval prior to merge.

This process, while functional, is manual, error-prone, and requires continual feedback loops because the published schemas are functionally separate from the implementation code.

Proposed technical implementation details (optional)

Proposal 1: Generate schemas from code attributes

While the current stable version of schemars doesn't support 2020-12 and doesn't have a great way to define custom keywords, the v1-alpha releases do.

We could (almost?) entirely define the schema behaviors with attributes, now that schemars has the #[schemars(extend())] attribute. The only remaining question is around VS Code keywords and whether we can conditionally present those schemas, but that seems doable.

If we can't migrate to those because they're not stable enough at this time, it's probably worth using the #[schemars(schema_with = "some::function")] attribute to reference functions that define the schemas for the data types.

In either case, while this would require a one-time uplift of defining the schemas in the Rust code, it would also:

- Allow us to define the schema more usefully in the code
- Prepare us for migrating from Draft 07 to 2019-09¹ (if we don't switch to v1, which is 2020-12 by default)
- Ensure that the implementation for a version of the CLI is the same as its canonical schema
- Incorporate schema behavior into testing
- Enable investigating generating documentation from code
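The core idea of this proposal, that the schema lives next to the type and the CLI returns exactly that schema, can be sketched without any dependencies. The sketch below deliberately does not use schemars: the hand-written `CanonicalSchema` trait stands in for whatever the derive and attributes would generate, and the `ExitCode` type, its `$id` URI, and `schema_command_output` are hypothetical names chosen for illustration.

```rust
/// Hypothetical stand-in for what a schemars derive would provide.
pub trait CanonicalSchema {
    /// URI of the published canonical schema for the type.
    const ID: &'static str;

    /// Type-specific keywords as a JSON fragment (without the outer braces).
    fn keywords() -> String;

    /// The full schema document as JSON text, always declaring 2020-12.
    fn schema_json() -> String {
        format!(
            "{{\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"$id\":\"{}\",{}}}",
            Self::ID,
            Self::keywords()
        )
    }
}

/// Hypothetical data type used only for this sketch.
pub struct ExitCode(pub i32);

impl CanonicalSchema for ExitCode {
    // Placeholder URI, not a real published schema location.
    const ID: &'static str = "https://example.com/schemas/exitCode.json";

    fn keywords() -> String {
        "\"title\":\"Exit code\",\"type\":\"integer\"".to_string()
    }
}

/// What a schema command would print for a type: the same document that
/// the code itself uses, so the two can't drift apart.
pub fn schema_command_output<T: CanonicalSchema>() -> String {
    T::schema_json()
}
```

Because the command output and the in-code schema come from the same function, a unit test can assert on the schema directly, which is the kind of testing integration the list above describes.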

Proposal 2: Consume schemas in code

Instead of using schemars to derive schemas from the types, we could use the `include_str!` macro to embed the composed schema documents at compile time, parse them from JSON, and then use them for validation / pass them to the users when calling dsc schema.

This would still ensure that the code actually uses the schemas, but would keep the synchronization problem (or at least, require code changes that modify the schema to first regenerate the schema). It would obviate migrating schemas to different draft dialects.
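A minimal sketch of this approach follows, assuming one embedded document per schema type. In a real crate each constant would be produced by `include_str!` pointing at a composed schema file; inline literals stand in here so the sketch compiles without any files. The type names, `$id` URIs, and the `embedded_schema` function are hypothetical, and JSON parsing and validation are left out because they would need an external crate.

```rust
// In a real crate these would be `include_str!("<path to composed schema>")`.
// Inline literals with placeholder URIs stand in for the file contents.
const CONFIG_SCHEMA: &str =
    r#"{"$id":"https://example.com/schemas/config.json","type":"object"}"#;
const MANIFEST_SCHEMA: &str =
    r#"{"$id":"https://example.com/schemas/manifest.json","type":"object"}"#;

/// Looks up the embedded schema text for a requested type name.
/// The type names are hypothetical, not the actual `dsc schema` arguments.
pub fn embedded_schema(type_name: &str) -> Option<&'static str> {
    match type_name {
        "configuration" => Some(CONFIG_SCHEMA),
        "resource-manifest" => Some(MANIFEST_SCHEMA),
        _ => None,
    }
}
```

The same text returned by `embedded_schema` would be handed to both the validator and the user, which is what keeps the command output identical to the published document.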

Other options

The following options have been considered but determined to be too complex or fragile to strongly propose:

- Implement custom macros to generate schemas - functionally this would require a lot of exploratory up-front work to define macros that can be used as attributes on our types to help generate the JSON Schema for each type and rework the schema command to recognize and surface the schemas as desired.
- Abandon the enhanced schemas and only use the generated schemas as-is - This would strongly degrade the UX for users and DevX for integrating developers, who would have to rely on the simplified schemas instead of semantically accurate ones.

Footnotes

1. The largest differences between JSON Schema draft 2019-09 and 2020-12 are:
- 2020-12 replaces items/additionalItems with prefixItems/items - this applies mostly to how you represent tuples - when items is an object instead of an array, the behavior is consistent for both dialects.
- 2020-12 replaces $recursiveRef/$recursiveAnchor with $dynamicRef/$dynamicAnchor - neither of which we seem to use in our current schemas, but which has a large impact on defining custom vocabularies, which we may find desirable in the future.
- 2020-12 updates the interaction between the contains and unevaluatedItems keywords, but we don't use contains.
For our purposes, that means that exporting our schemas in the 2019-09 dialect will have no noticeable impact vs 2020-12.
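The tuple keyword change in the first bullet can be sketched as a small renaming rule. This is an illustration only: `to_2020_12_keyword` is a hypothetical helper, and it covers just the items/additionalItems rename, not the reference or contains changes.

```rust
/// Maps a 2019-09 array keyword to its 2020-12 equivalent.
///
/// `items_is_array` says whether the schema's `items` value is an array
/// (the tuple form). When `items` is a single schema object, nothing is
/// renamed because both dialects treat it the same way.
pub fn to_2020_12_keyword(keyword: &str, items_is_array: bool) -> &str {
    match (keyword, items_is_array) {
        // Tuple form: the positional schemas move to `prefixItems`...
        ("items", true) => "prefixItems",
        // ...and the schema for the remaining elements becomes `items`.
        ("additionalItems", true) => "items",
        // Everything else is left unchanged.
        _ => keyword,
    }
}
```

For a schema whose `items` value is an object, every keyword passes through unchanged, which matches the observation that such schemas behave consistently in both dialects.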


michaeltlombardi added the Issue-Enhancement and Needs Triage labels Sep 5, 2024

SteveL-MSFT added this to the 3.0-RC milestone Sep 18, 2024

SteveL-MSFT added the Need-Review label Oct 2, 2024
