Stop fetching spec info from Specref #1668

Merged 1 commit on Jan 24, 2025
30 changes: 18 additions & 12 deletions README.md
@@ -8,6 +8,11 @@ The list is used in a variety of ways, which include:
by tools such as [ReSpec](https://respec.org/docs/) and
[Bikeshed](https://speced.github.io/bikeshed/) to create terminology
and reference links between Web specifications.
+* [BCD](https://github.com/mdn/browser-compat-data) and
+[web-features](https://github.com/web-platform-dx/web-features) to validate
+specification URLs
+* [Specref](https://www.specref.org/) to complete the list of specifications
+that can be referenced.
* Analyzers of browser technologies to create reports on test coverage,
WebIDL, and specification quality.

@@ -186,10 +191,11 @@ The `shortname` property is always set.

### `title`

-The title of the spec. The title is either retrieved from the
-[W3C API](https://w3c.github.io/w3c-api/) for W3C specs,
-[Specref](https://www.specref.org/) or from the spec itself. The
-[`source`](#source) property details the actual provenance.
+The title of the spec. The title is either retrieved from an official source
+(the [W3C API](https://w3c.github.io/w3c-api/) for W3C specs, the
+[workstreams database](https://github.com/whatwg/sg/blob/main/db.json) for
+WHATWG specs, etc.), or from the spec itself. The [`source`](#source) property
+details the actual provenance.

The `title` property is always set.

@@ -485,11 +491,12 @@ available.

The URL of the latest Editor's Draft or of the living standard.

-The URL is either retrieved from the [W3C API](https://w3c.github.io/w3c-api/)
-for W3C specs, or [Specref](https://www.specref.org/). The document at the
-versioned URL is considered to be the latest Editor's Draft if the spec does
-neither exist in the W3C API nor in Specref. The [`source`](#source) property
-details the actual provenance.
+The URL is either retrieved from an official source (the
+[W3C API](https://w3c.github.io/w3c-api/) for W3C specs, the
+[workstreams database](https://github.com/whatwg/sg/blob/main/db.json) for
+WHATWG specs, etc.) when possible. The document at the versioned URL is
+considered to be the latest Editor's Draft otherwise. The [`source`](#source)
+property details the actual provenance.

The URL should be relatively stable but may still change over time. See
[Spec identifiers](#spec-identifiers) for details.
@@ -552,8 +559,7 @@ The `pages` property is only set for specs identified as multipage specs.
The URL of the repository that contains the source of the Editor's Draft or of
the living standard.

-The URL is either retrieved from the [Specref](https://www.specref.org/) or
-computed from `nightly.url`.
+The URL is computed from `nightly.url`.

The `repository` property is always set except for IETF specs where such a repo does not always exist.
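
As a rough illustration of what "computed from `nightly.url`" could look like for specs hosted on GitHub Pages, here is a minimal sketch. The `guessRepository` name and the `https://<owner>.github.io/<repo>/` convention it relies on are assumptions made for this example; the project's actual logic is not part of this diff and likely handles more hosts.

```javascript
// Hypothetical helper, not the project's actual code: derive a repository URL
// from a nightly URL hosted on GitHub Pages (https://<owner>.github.io/<repo>/).
function guessRepository(nightlyUrl) {
  const url = new URL(nightlyUrl);
  const match = url.hostname.match(/^([^.]+)\.github\.io$/);
  if (!match) {
    // Not a GitHub Pages URL: this sketch gives up instead of guessing
    return null;
  }
  // The first non-empty path segment is assumed to be the repository name
  const [repo] = url.pathname.split("/").filter(Boolean);
  return repo ? `https://github.com/${match[1]}/${repo}` : null;
}
```

Returning `null` for unknown hosts keeps the sketch honest about what it cannot infer, which matches the README note that the property is not always set for IETF specs.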

@@ -621,7 +627,7 @@ The `excludePaths` property is seldom set.

The provenance for the `title` and `nightly` property values. Can be one of:
- `w3c`: information retrieved from the [W3C API](https://w3c.github.io/w3c-api/)
-- `specref`: information retrieved from [Specref](https://www.specref.org/)
+- `whatwg`: information retrieved from [WHATWG](https://spec.whatwg.org/)
- `ietf`: information retrieved from the [IETF datatracker](https://datatracker.ietf.org)
- `spec`: information retrieved from the spec itself

2 changes: 1 addition & 1 deletion schema/definitions.json
@@ -60,7 +60,7 @@

"source": {
"type": "string",
-"enum": ["w3c", "specref", "spec", "ietf", "whatwg"]
+"enum": ["w3c", "spec", "ietf", "whatwg"]
},

"nightly": {
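
Since `specref` is dropped from the `source` enum, any entry that still carries that value would no longer validate. The following is a small sketch of how such entries could be spotted before running full schema validation. The `findInvalidSources` helper is hypothetical and only mirrors the enum shown in `schema/definitions.json`; it does not replace a real JSON Schema validator.

```javascript
// Allowed values after this change, copied from schema/definitions.json
const SOURCES = ["w3c", "spec", "ietf", "whatwg"];

// Hypothetical check: return the shortnames of entries whose "source"
// value is not in the allowed list.
function findInvalidSources(specs) {
  return specs
    .filter(spec => !SOURCES.includes(spec.source))
    .map(spec => spec.shortname);
}
```

Called on a list that contains an entry with `source: "specref"`, the helper returns that entry's shortname, which makes stale data easy to locate.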
120 changes: 7 additions & 113 deletions src/fetch-info.js
@@ -2,8 +2,8 @@
* Module that exports a function that takes an array of specifications objects
* that each have at least a "url" and a "short" property. The function returns
* an object indexed by specification "shortname" with additional information
-* about the specification fetched from the W3C API, Specref, or from the spec
-* itself. Object returned for each specification contains the following
+* about the specification fetched from the W3C API, WHATWG, IETF or from the
+* spec itself. Object returned for each specification contains the following
* properties:
*
* - "nightly": an object that describes the nightly version. The object will
@@ -15,8 +15,8 @@
* feature the URL of the TR document for W3C specs when it exists, and is not
* present for specs that don't have release versions (WHATWG specs, CG drafts).
* - "title": the title of the specification. Always set.
-* - "source": one of "w3c", "specref", "spec", depending on how the information
-* was determined.
+* - "source": one of "w3c", "ietf", "whatwg", "spec", depending on how the
+* information was determined.
*
* The function throws when something goes wrong, e.g. if the given spec object
* describes a /TR/ specification but the specification has actually not been
@@ -25,17 +25,14 @@
*
* The function will start by querying the W3C API, using the given "shortname"
* properties. For specifications where this fails, the function will query
-* SpecRef, using the given "shortname" as well. If that too fails, the function
-* assumes that the given "url" is the URL of the Editor's Draft, and will fetch
-* it to determine the title.
+* IETF, then WHATWG, using the given "shortname" as well. If that too fails,
+* the function assumes that the given "url" is the URL of the Editor's Draft,
+* and will fetch it to determine the title.
*
* If the function needs to retrieve the spec itself, note that it will parse
* the HTTP response body as a string, applying regular expressions to extract
* the title. It will not parse it as HTML in particular. This means that the
* function will fail if the title cannot easily be extracted for some reason.
-*
-* Note: the function operates on a list of specs and not only on one spec to
-* bundle requests to Specref.
*/

import puppeteer from "puppeteer";
@@ -45,17 +42,6 @@ import Octokit from "./octokit.js";
import ThrottledQueue from "./throttled-queue.js";
import fetchJSON from "./fetch-json.js";

-// Map spec statuses returned by Specref to those used in specs
-// Note we typically won't get /TR statuses from Specref, since all /TR URLs
-// are handled through the W3C API. Also, "Proposal for a CSS module" entries
-// were probably manually hardcoded in Specref, they are really just Editor's
-// Drafts in practice.
-const specrefStatusMapping = {
-"ED": "Editor's Draft",
-"Proposal for a CSS module": "Editor's Draft",
-"cg-draft": "Draft Community Group Report"
-};
-
async function useLastInfoForDiscontinuedSpecs(specs) {
const results = {};
for (const spec of specs) {
@@ -215,97 +201,6 @@ async function fetchInfoFromWHATWG(specs, options) {
return specInfo;
}

-async function fetchInfoFromSpecref(specs, options) {
-function chunkArray(arr, len) {
-let chunks = [];
-let i = 0;
-let n = arr.length;
-while (i < n) {
-chunks.push(arr.slice(i, i += len));
-}
-return chunks;
-}
-
-// Browser-specs contributes specs to Specref. By definition, we cannot rely
-// on information from Specref about these specs. Unfortunately, the Specref
-// API does not return the "source" field, so we need to retrieve the list
-// ourselves from Specref's GitHub repository.
-const specrefBrowserspecsUrl = "https://raw.githubusercontent.com/tobie/specref/main/refs/browser-specs.json";
-const browserSpecs = await fetchJSON(specrefBrowserspecsUrl, options);
-specs = specs.filter(spec => !browserSpecs[spec.shortname.toUpperCase()]);
-
-// Browser-specs now acts as source for Specref for the WICG specs and W3C
-// Editor's Drafts that have not yet been published to /TR. Let's filter out
-// these specs to avoid a catch-22 where the info in browser-specs gets stuck
-// to the that in Specref.
-const filteredSpecs = specs.filter(spec =>
-!spec.url.match(/\/\/(wicg|w3c)\.github\.io\//) &&
-!spec.url.match(/\/\/www\.w3\.org\//) &&
-!spec.url.match(/\/\/drafts\.csswg\.org\//));
-
-const chunks = chunkArray(filteredSpecs, 50);
-const chunksRes = await Promise.all(chunks.map(async chunk => {
-let specrefUrl = "https://api.specref.org/bibrefs?refs=" +
-chunk.map(spec => spec.shortname).join(',');
-return fetchJSON(specrefUrl, options);
-}));
-
-const results = {};
-chunksRes.forEach(chunkRes => {
-
-// Specref manages aliases, let's follow the chain to the final spec
-function resolveAlias(name, counter) {
-counter = counter || 0;
-if (counter > 100) {
-throw "Too many aliases returned by Respec";
-}
-if (chunkRes[name].aliasOf) {
-return resolveAlias(chunkRes[name].aliasOf, counter + 1);
-}
-else {
-return name;
-}
-}
-Object.keys(chunkRes).forEach(name => {
-if (specs.find(spec => spec.shortname === name)) {
-const info = chunkRes[resolveAlias(name)];
-if (info.edDraft?.startsWith('http:')) {
-console.warn(`[warning] force HTTPS for nightly of ` +
-`"${spec.shortname}", Specref returned "${info.edDraft}"`);
-}
-if (info.href?.startsWith('http:')) {
-console.warn(`[warning] force HTTPS for nightly of ` +
-`"${spec.shortname}", Specref returned "${info.href}"`);
-}
-const nightly =
-info.edDraft?.replace(/^http:/, 'https:') ??
-info.href?.replace(/^http:/, 'https:') ??
-null;
-const status =
-specrefStatusMapping[info.status] ??
-info.status ??
-"Editor's Draft";
-if (nightly?.startsWith("https://www.iso.org/")) {
-// The URL is to a page that describes the spec, not to the spec
-// itself (ISO specs are not public).
-results[name] = {
-title: info.title
-}
-}
-else {
-results[name] = {
-nightly: { url: nightly, status },
-title: info.title
-};
-}
-}
-});
-});
-
-return results;
-}
-
-
async function fetchInfoFromIETF(specs, options) {
async function fetchRFCName(docUrl) {
const body = await fetchJSON(docUrl, options);
@@ -611,7 +506,6 @@ async function fetchInfo(specs, options) {
{ name: 'w3c', fn: fetchInfoFromW3CApi },
{ name: 'ietf', fn: fetchInfoFromIETF },
{ name: 'whatwg', fn: fetchInfoFromWHATWG },
-{ name: 'specref', fn: fetchInfoFromSpecref },
{ name: 'spec', fn: fetchInfoFromSpecs }
];
let remainingSpecs = specs;
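
The ordered list of steps above suggests a fallback chain: each source is only asked about the specs that earlier sources could not resolve. The loop that follows `remainingSpecs` is not visible in this diff, so the sketch below is an assumption about its general shape, not the module's actual code. `fetchWithFallback` is a hypothetical name, and the fetchers are passed in as parameters so the example runs without network access.

```javascript
// Sketch of a source fallback chain. Each step is { name, fn }, where fn takes
// the list of specs still unresolved and returns (a promise of) an object
// indexed by shortname. The steps are stand-ins, not the real fetchers.
async function fetchWithFallback(specs, steps) {
  const results = {};
  let remainingSpecs = specs;
  for (const step of steps) {
    if (remainingSpecs.length === 0) {
      break; // Every spec already has info, no need to query further sources
    }
    const info = await step.fn(remainingSpecs);
    for (const [shortname, value] of Object.entries(info)) {
      // Record which source provided the information
      results[shortname] = { ...value, source: step.name };
    }
    // Only specs without info are handed over to the next source
    remainingSpecs = remainingSpecs.filter(spec => !results[spec.shortname]);
  }
  return results;
}
```

With this shape, removing a source such as `specref` amounts to deleting one entry from the steps array: specs that it used to resolve simply fall through to the next step.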
54 changes: 15 additions & 39 deletions test/fetch-info.js
@@ -35,45 +35,6 @@ describe("fetch-info module", function () {
});
});

-describe("fetch from Specref", () => {
-it("works on an ISO spec", async () => {
-const spec = {
-url: "https://www.iso.org/standard/85253.html",
-shortname: "iso18181-2"
-};
-const info = await fetchInfo([spec]);
-assert.ok(info[spec.shortname]);
-assert.equal(info[spec.shortname].source, "specref");
-assert.equal(info[spec.shortname].title, "Information technology — JPEG XL image coding system — Part 2: File format");
-assert.equal(info[spec.shortname].nightly, undefined);
-});
-
-it("can operate on multiple specs at once", async () => {
-const spec = getW3CSpec("presentation-api");
-const other = getW3CSpec("hr-time-2");
-const info = await fetchInfo([spec, other]);
-assert.ok(info[spec.shortname]);
-assert.equal(info[spec.shortname].source, "w3c");
-assert.equal(info[spec.shortname].nightly.url, "https://w3c.github.io/presentation-api/");
-assert.equal(info[spec.shortname].title, "Presentation API");
-
-assert.ok(info[other.shortname]);
-assert.equal(info[other.shortname].source, "w3c");
-assert.equal(info[other.shortname].nightly.url, "https://w3c.github.io/hr-time/");
-assert.equal(info[other.shortname].title, "High Resolution Time Level 2");
-});
-
-it("does not retrieve info from a spec that got contributed to Specref", async () => {
-const spec = {
-url: "https://registry.khronos.org/webgl/extensions/ANGLE_instanced_arrays/",
-shortname: "ANGLE_instanced_arrays"
-};
-const info = await fetchInfo([spec]);
-assert.ok(info[spec.shortname]);
-assert.equal(info[spec.shortname].source, "spec");
-});
-});
-
describe("fetch from IETF datatracker", () => {
it("fetches info about RFCs from datatracker", async () => {
const spec = {
@@ -337,6 +298,21 @@ describe("fetch-info module", function () {
fetchInfo([spec]),
/^Error: W3C API redirects "webaudio" to "webaudio-.*"/);
});

it("can operate on multiple specs at once", async () => {
const spec = getW3CSpec("presentation-api");
const other = getW3CSpec("hr-time-2");
const info = await fetchInfo([spec, other]);
assert.ok(info[spec.shortname]);
assert.equal(info[spec.shortname].source, "w3c");
assert.equal(info[spec.shortname].nightly.url, "https://w3c.github.io/presentation-api/");
assert.equal(info[spec.shortname].title, "Presentation API");

assert.ok(info[other.shortname]);
assert.equal(info[other.shortname].source, "w3c");
assert.equal(info[other.shortname].nightly.url, "https://w3c.github.io/hr-time/");
assert.equal(info[other.shortname].title, "High Resolution Time Level 2");
});
});

describe("fetch from all sources", () => {