The objective of this pilot is to develop a reusable proof of concept that converts existing Word-based NIFO factsheets into structured data following the Resource Description Framework (RDF). The pilot uses existing vocabularies to describe the information within the factsheets, including:
- ISA Core Vocabularies
- Dublin Core Terms
- European Legislation Identifier Ontology
- DBpedia Ontology
- Schema.org
- RDF Data Cube Vocabulary
This pilot requires Node.js v14.16.1 or above and uses the following packages:
- cheerio v1.0.0-rc.2
- cli-progress v1.8.0
- graph-rdfa-processor v1.3.0
- jsdom v11.9.0
- jsesc v2.5.1
- jsonld-request v0.2.0
- ldtr v0.2.3
- mammoth v1.4.19
- rdf-translator v2.0.0
- rdfa-parser v1.0.1
- request v2.85.0
- sync-request v6.0.0
- xml2js v0.4.19
Important note: the project was tested with Node.js v14.16.1, so installing this version is recommended. To install it, follow the link for v14.16.1, download the “Windows 64-bit Installer” and run it.
- Documents placed in the /docx folder must be in docx format.
- Images used in the documents must be in JPEG format.
- Document names should preferably follow a consistent format. For example, a file previously named xxxEU_editorxxx is now named xxxEU_v3.00xxx.
- Clone or download this repository.
- From the project's root folder, run npm install from the command line to install the dependencies.
Copy the Word documents (in docx format) that you want to convert to HTML into the /docx folder and run node docxtohtml.js from the command line.
All documents in the input folder will be transformed into HTML and stored in the /html folder. Any images the Word documents contain will be stored in the /html/img folder.
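A minimal sketch of how the input folder could be checked before conversion, assuming that each .docx file yields an HTML file with the same base name (this README does not state the output naming). The helper name planConversion and the temporary demo folder are hypothetical and not part of the pilot's scripts; the actual conversion is done by docxtohtml.js.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: split the files in the input folder into convertible
// .docx documents and everything else, and propose a target HTML path.
function planConversion(inputDir, outputDir) {
  const jobs = [];
  const skipped = [];
  for (const name of fs.readdirSync(inputDir).sort()) {
    const ext = path.extname(name);
    if (ext.toLowerCase() === '.docx') {
      // Assumption: the HTML file keeps the base name of the Word document.
      const htmlName = path.basename(name, ext) + '.html';
      jobs.push({
        source: path.join(inputDir, name),
        target: path.join(outputDir, htmlName)
      });
    } else {
      skipped.push(name);
    }
  }
  return { jobs, skipped };
}

// Demo on a throwaway folder so the sketch runs without real factsheets.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'nifo-'));
fs.writeFileSync(path.join(demoDir, 'xxxEU_v3.00xxx.docx'), '');
fs.writeFileSync(path.join(demoDir, 'notes.txt'), '');

const plan = planConversion(demoDir, path.join(demoDir, 'html'));
console.log(plan.jobs.map((job) => path.basename(job.target)), plan.skipped);
```

Anything reported in skipped is a file that does not meet the docx requirement listed earlier and would need to be converted or removed first.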
On a Windows-based machine, it is also possible to run the convert.bat file from the project's root folder. This automatically starts the Word to HTML conversion, followed by the HTML to RDFa and RDF conversion.
To annotate the HTML files with RDFa, run node htmltordf.js from the command line. The annotated HTML documents will be stored in the /rdfa folder. These can then be copy-pasted into a WYSIWYG text editor or uploaded directly to a triple store.
The documents are then converted to RDF (JSON-LD and Turtle) and stored in the /rdf folder.
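For machines without support for batch files, the two steps could be chained from Node.js itself. The following is a sketch only: the contents of convert.bat are not shown in this README, so the assumption is simply that it runs docxtohtml.js followed by htmltordf.js. The function name runPipeline and its run parameter are hypothetical.

```javascript
const { execFileSync } = require('child_process');

// The two scripts named in this README, in the order they need to run.
const STEPS = ['docxtohtml.js', 'htmltordf.js'];

// Default runner: start each script with the current Node.js binary.
// execFileSync throws if the script exits with a non-zero status.
function runWithNode(script) {
  execFileSync(process.execPath, [script], { stdio: 'inherit' });
}

// Hypothetical stand-in for convert.bat. The `run` parameter exists only so
// the sketch can be exercised without the real scripts being present.
function runPipeline(run = runWithNode) {
  const completed = [];
  for (const script of STEPS) {
    run(script); // a failure here stops the pipeline before the next step
    completed.push(script);
  }
  return completed;
}

// Dry run with a stub runner that only prints what would be executed.
const dryRun = runPipeline((script) => console.log(`would run: node ${script}`));
```

Calling runPipeline() without an argument from the project's root folder would execute the real scripts.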
The config.json file allows users to customise the transformation script and the mappings to existing RDF vocabularies.
- Configure document metadata, including the date issued, the applicable licence, and the HTML tags used to identify the main sections, subsections and sub-subsections in the document:
{
  "issued": "2021-09",
  "licence": "https://creativecommons.org/licenses/by/4.0/",
  "section_header": "h1",
  "subsection_header": "h2",
  "subsubsection_header": "h3",
- Configure the prefixes that are used to generate URIs for new entities discovered within the document:
  "prefix": {
    "nifo": "http://data.europa.eu/nifo/factsheet/",
    "currency": "http://publications.europa.eu/resource/authority/currency/",
    "measure": "http://example.org/nifo/MeasureProperty/",
    "datastructure": "http://example.org/nifo/structure/",
    "dataset": "http://example.org/nifo/dataset/",
    "legalframework": "http://example.org/nifo/legalframework/",
    "role": "http://example.org/nifo/role/Role-",
    "post": "http://example.org/nifo/post/Post-",
    "service": "http://example.org/nifo/publicservice/"
  },
- Set the prefix used for the RDF properties and classes, as well as the mapping of different terms to properties and classes in existing vocabularies:
  "prefixes": "dct: http://purl.org/dc/terms/ dbo: http://dbpedia.org/ontology/ dbp: http://dbpedia.org/property/ qb: http://purl.org/linked-data/cube# rdfs: http://www.w3.org/2000/01/rdf-schema# cpsv: http://purl.org/vocab/cpsv# eli: http://data.europa.eu/eli/ontology# foaf: http://xmlns.com/foaf/0.1/ org: http://www.w3.org/ns/org# schema: http://schema.org/",
  "prop": {
    "issued": "dct:issued",
    "licence": "dct:license",
    "population": "dbo:populationTotal",
    "gdpnominal": "dbp:gdpNominal",
    "gdppercapita": "dbp:gdpPppPerCapita",
    "area": "dbo:areaTotal",
    "capital": "dbo:capital",
    "language": "dct:language",
    "currency": "dbo:currency",
    "source": "dct:source",
    "leader": "dbo:leader",
    "title": "dct:title",
    "label": "rdfs:label",
    "component": "qb:component",
    "structure": "qb:structure",
    "relation": "dct:relation",
    "name": "foaf:name",
    "holds": "org:holds",
    "telephone": "schema:telephone",
    "fax": "schema:faxNumber",
    "email": "schema:email",
    "url": "schema:url",
    "contact": "schema:contactPoint",
    "description": "dct:description",
    "competent": "http://data.europa.eu/m8g/hasCompetentAuthority"
  },
  "class": {
    "measure": "qb:MeasureProperty",
    "datastructure": "qb:DataStructureDefinition",
    "dataset": "qb:DataSet",
    "framework": "cpsv:FormalFramework",
    "legalresource": "eli:LegalResource",
    "person": "foaf:Person",
    "role": "org:Role",
    "post": "org:Post",
    "contact": "schema:ContactPoint"
  },
- Configure the text strings used to identify certain properties such as currency, head of state and head of government:
  "text_identifier": {
    "currency": "Currency: ",
    "headofstate": "Head of State: ",
    "headofgovernment": "Head of Government: "
  },
- Determine the keywords that are used to derive links to legal documents from the text.
  "type_framework": {
    "act": "act",
    "reg": "regulation",
    "dir": "directive",
    "law": "law",
    "con": "constitution"
  }
}
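To illustrate how the short names in prop and class relate to full IRIs, the sketch below parses a prefixes string and expands prefixed names. It is an illustration under assumptions, not the logic of htmltordf.js, which is not shown here: the function names parsePrefixes and expand are hypothetical, and the config object is a shortened excerpt of the configuration above.

```javascript
// Shortened excerpt of config.json; only the keys this sketch needs.
const config = {
  prefixes: 'dct: http://purl.org/dc/terms/ dbo: http://dbpedia.org/ontology/ schema: http://schema.org/',
  prop: {
    issued: 'dct:issued',
    population: 'dbo:populationTotal',
    competent: 'http://data.europa.eu/m8g/hasCompetentAuthority'
  }
};

// Turn "dct: http://... dbo: http://..." into { dct: 'http://...', ... }.
function parsePrefixes(prefixString) {
  const tokens = prefixString.trim().split(/\s+/);
  const map = {};
  for (let i = 0; i + 1 < tokens.length; i += 2) {
    const prefix = tokens[i].replace(/:$/, ''); // drop the trailing colon
    map[prefix] = tokens[i + 1];
  }
  return map;
}

// Expand a prefixed name. Values whose prefix is not declared, such as
// absolute IRIs starting with "http:", are returned unchanged.
function expand(term, prefixMap) {
  const colon = term.indexOf(':');
  if (colon === -1) return term;
  const prefix = term.slice(0, colon);
  const local = term.slice(colon + 1);
  const known = Object.prototype.hasOwnProperty.call(prefixMap, prefix);
  return known ? prefixMap[prefix] + local : term;
}

const prefixMap = parsePrefixes(config.prefixes);
const expanded = Object.fromEntries(
  Object.entries(config.prop).map(([key, value]) => [key, expand(value, prefixMap)])
);
console.log(expanded);
```

A mapping that uses a prefix missing from the prefixes string stays unexpanded, which makes such omissions easy to spot when reviewing a customised configuration.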
More detailed customisation of the annotations can be achieved by modifying the htmltordf.js code that is applied to the relevant section or subsection in the document.
The different sections are identified based on their title. To apply the annotations, we refer to the methods provided by the Cheerio module.
switch (content) {
  case "Country Profile":
    // Custom code here
    break;
  case "Digital Public Administration Highlights":
    // Custom code here
    break;
  case "Digital Public Administration Political Communications":
    // Custom code here
    break;
  case "Digital Public Administration Legislation":
    // Custom code here
    break;
  case "Digital Public Administration Governance":
    // Custom code here
    break;
  case "Digital Public Administration Infrastructure":
    // Custom code here
    break;
  case "Cross Border Digital Public Administration Services for Citizens and Business":
    // Custom code here
    break;
}
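As a rough idea of what the custom code for a section might do, the sketch below works on plain strings instead of Cheerio elements so that it runs without dependencies. Everything in it is an assumption for illustration: the function names, the idea that the currency appears in the "Country Profile" section, the whole-word keyword matching, and the sample text are made up and do not come from htmltordf.js.

```javascript
// Shortened excerpt of config.json; only the keys used here.
const config = {
  prop: { currency: 'dbo:currency' },
  text_identifier: { currency: 'Currency: ' },
  type_framework: { act: 'act', reg: 'regulation', dir: 'directive', law: 'law', con: 'constitution' }
};

// Return the text that follows an identifier, up to the end of its line.
function valueAfter(text, identifier) {
  const start = text.indexOf(identifier);
  if (start === -1) return null;
  return text.slice(start + identifier.length).split('\n')[0].trim();
}

// Return the type_framework keys whose keyword occurs as a whole word.
// The keywords are plain words, so they are used in the pattern unescaped.
function frameworkTypes(text, keywords) {
  const lower = text.toLowerCase();
  return Object.keys(keywords).filter((key) =>
    new RegExp(`\\b${keywords[key]}\\b`).test(lower)
  );
}

// Hypothetical handler: maps a section title and its plain text to findings.
function annotate(title, text) {
  switch (title) {
    case 'Country Profile':
      return { [config.prop.currency]: valueAfter(text, config.text_identifier.currency) };
    case 'Digital Public Administration Legislation':
      return { types: frameworkTypes(text, config.type_framework) };
    default:
      return {};
  }
}

// Made-up sample text, not taken from a real factsheet.
const profile = annotate('Country Profile', 'Area: sample value\nCurrency: Euro\n');
const legislation = annotate(
  'Digital Public Administration Legislation',
  'The Data Protection Act implements the Directive.'
);
console.log(profile, legislation);
```

In the real script the extracted values would be written back into the HTML as RDFa attributes through Cheerio rather than returned as plain objects.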
Licensed under the EUROPEAN UNION PUBLIC LICENCE v.1.2
Authors: Jens Scheerlinck (PwC EU Services), Emidio Stani (PwC EU Services)