SEMICeu/NIFO_pilot

NIFO Pilot

Project description

The objective of this pilot is to develop a reusable proof of concept that converts existing Word-based NIFO factsheets into structured data following the Resource Description Framework (RDF). The pilot uses existing vocabularies to describe the information within the factsheets.

Requirements and dependencies

This pilot requires Node.js v14.16.1 or above and uses the following packages:

  • cheerio v1.0.0-rc.2
  • cli-progress v1.8.0
  • graph-rdfa-processor v1.3.0
  • jsdom v11.9.0
  • jsesc v2.5.1
  • jsonld-request v0.2.0
  • ldtr v0.2.3
  • mammoth v1.4.19
  • rdf-translator v2.0.0
  • rdfa-parser v1.0.1
  • request v2.85.0
  • sync-request v6.0.0
  • xml2js v0.4.19

Important: the project was tested with Node.js v14.16.1, so installing this version is recommended. To install Node.js v14.16.1 on Windows, follow the link in v14.16.1, download the "Windows 64-bit Installer" and run it.
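The version requirement above can be checked before running the scripts. The following is a minimal sketch, not part of the repository; the names MIN_NODE_VERSION, compareVersions and isSupportedNode are hypothetical.

```javascript
// Minimum version stated in this README; adjust if the requirement changes.
const MIN_NODE_VERSION = "14.16.1";

// Compare two dotted version strings numerically.
// Returns a negative number, zero, or a positive number.
function compareVersions(a, b) {
  const partsA = a.split(".").map(Number);
  const partsB = b.split(".").map(Number);
  for (let i = 0; i < Math.max(partsA.length, partsB.length); i++) {
    const diff = (partsA[i] || 0) - (partsB[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// process.versions.node holds the running version without a leading "v".
function isSupportedNode(version = process.versions.node) {
  return compareVersions(version, MIN_NODE_VERSION) >= 0;
}

if (!isSupportedNode()) {
  console.warn(
    `Node.js ${process.versions.node} is older than ${MIN_NODE_VERSION}, the version this pilot was tested with.`
  );
}
```

The comparison is numeric per component, so 14.2.0 is correctly treated as older than 14.16.1.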

Requirements for input files

  • Documents placed in the docx folder must be in docx format.
  • Images used in the documents must be in jpeg format.
  • Document names should preferably follow a consistent format. For example, a file that was previously named xxxEU_editorxxx is now named xxxEU_v3.00xxx.
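The docx requirement above can be verified before a conversion run. This is a minimal sketch under stated assumptions: checkInputFiles is a hypothetical helper, the folder is assumed to be docx under the current working directory, and skipping Word's "~$" lock files is an extra precaution not mentioned in this README.

```javascript
const fs = require("fs");
const path = require("path");

// Split file names into those that satisfy the docx requirement and those that do not.
function checkInputFiles(fileNames) {
  const accepted = [];
  const rejected = [];
  for (const name of fileNames) {
    const isDocx = path.extname(name).toLowerCase() === ".docx";
    // Assumption: Word leaves temporary lock files starting with "~$" while a document is open.
    const isLockFile = name.startsWith("~$");
    (isDocx && !isLockFile ? accepted : rejected).push(name);
  }
  return { accepted, rejected };
}

// Inspect the input folder if it exists (assumed to be ./docx).
const inputDir = path.join(process.cwd(), "docx");
if (fs.existsSync(inputDir)) {
  const { rejected } = checkInputFiles(fs.readdirSync(inputDir));
  rejected.forEach((name) => console.warn(`Unsupported input file: ${name}`));
}
```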

Architecture

[Figure: architecture diagram]

Installation

  1. Clone or download this repository.
  2. From the project's root folder, run npm install from the command line to install the dependencies.

How to use

Converting Word documents to HTML

Copy the Word documents (in docx format) that you want to convert into the /docx folder, then run node docxtohtml.js from the command line. All documents in the /docx folder will be transformed into HTML and stored in the /html folder. Any images contained in the Word documents will be stored in the /html/img folder.
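For readers who want to see how such a conversion can be built on the mammoth package, here is a minimal sketch. It is not the repository's docxtohtml.js; the function names htmlPathFor and convertFolder and the image-N file naming are assumptions made for illustration.

```javascript
const fs = require("fs");
const path = require("path");

// Map an input .docx path to the .html path it would be written to.
function htmlPathFor(docxPath, htmlDir) {
  const baseName = path.basename(docxPath, path.extname(docxPath));
  return path.join(htmlDir, baseName + ".html");
}

// Convert every .docx file in docxDir to HTML in htmlDir, saving images to htmlDir/img.
async function convertFolder(docxDir, htmlDir) {
  const mammoth = require("mammoth"); // loaded lazily; installed by npm install
  const imgDir = path.join(htmlDir, "img");
  fs.mkdirSync(imgDir, { recursive: true });
  let imageCount = 0;

  const options = {
    // Write each embedded image to disk and reference it by a relative path.
    convertImage: mammoth.images.imgElement(async (image) => {
      const extension = image.contentType.split("/")[1]; // e.g. "jpeg"
      const fileName = `image-${++imageCount}.${extension}`;
      const buffer = await image.read(); // no encoding given, so a Buffer is returned
      fs.writeFileSync(path.join(imgDir, fileName), buffer);
      return { src: `img/${fileName}` };
    }),
  };

  for (const name of fs.readdirSync(docxDir)) {
    if (path.extname(name).toLowerCase() !== ".docx") continue;
    const source = path.join(docxDir, name);
    const result = await mammoth.convertToHtml({ path: source }, options);
    fs.writeFileSync(htmlPathFor(source, htmlDir), result.value);
    // mammoth reports conversion warnings as messages rather than throwing.
    result.messages.forEach((m) => console.warn(`${name}: ${m.message}`));
  }
}

// Usage, from the project root:
// convertFolder("docx", "html").catch(console.error);
```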

On a Windows-based machine, it is also possible to simply execute the convert.bat file from the project's root folder. This automatically runs the Word to HTML conversion followed by the HTML to RDFa and RDF conversion.

Annotating HTML with RDFa and converting to RDF

To annotate the HTML files with RDFa, run node htmltordf.js from the command line. The annotated HTML documents will be stored in the /rdfa folder. These can then be copy-pasted into a WYSIWYG text editor or uploaded directly to a triple store.

The documents are then converted to RDF (JSON-LD and Turtle) and stored in the /rdf folder.
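On platforms where convert.bat is not available, the two steps can be chained from Node.js. The following is a minimal sketch under the assumption that it is saved in the project's root folder; runPipeline and STEPS are hypothetical names, and the script is not part of the repository.

```javascript
const { execFileSync } = require("child_process");

// The two scripts named in this README, in the order they need to run.
const STEPS = ["docxtohtml.js", "htmltordf.js"];

// Run each script with the current Node.js binary.
// execFileSync throws if a step exits with a non-zero code, which stops the pipeline.
function runPipeline(run = execFileSync) {
  for (const script of STEPS) {
    console.log(`Running ${script} ...`);
    run(process.execPath, [script], { stdio: "inherit" });
  }
}

// Usage, from the project root:
// runPipeline();
```

The run parameter exists only so the sequence can be exercised without starting real processes.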

The config.json file allows users to customise the transformation script and mappings to existing RDF vocabularies.

  • Configure document metadata, including the date issued, the applicable licence, the HTML tag used to identify the main sections in the document and the HTML tag used to identify the subsections in the document:
{
    "issued" : "2021-09",
    "licence" : "https://creativecommons.org/licenses/by/4.0/",
    "section_header" : "h1",
    "subsection_header" : "h2",
    "subsubsection_header" : "h3",
  • Configure the prefixes that are used to generate URIs for new entities discovered within the document:
    "prefix": {
        "nifo" : "http://data.europa.eu/nifo/factsheet/",
        "currency": "http://publications.europa.eu/resource/authority/currency/",
        "measure" : "http://example.org/nifo/MeasureProperty/",
        "datastructure" : "http://example.org/nifo/structure/",
        "dataset" : "http://example.org/nifo/dataset/",
        "legalframework" : "http://example.org/nifo/legalframework/",
        "role" : "http://example.org/nifo/role/Role-",
        "post" : "http://example.org/nifo/post/Post-",
        "service" : "http://example.org/nifo/publicservice/"
    },
  • Set the prefix used for the RDF properties and classes, as well as the mapping of different terms to properties and classes in existing vocabularies:
    "prefixes" : "dct: http://purl.org/dc/terms/ dbo: http://dbpedia.org/ontology/ dbp: http://dbpedia.org/property/ qb: http://purl.org/linked-data/cube# rdfs: http://www.w3.org/2000/01/rdf-schema# cpsv: http://purl.org/vocab/cpsv# eli: http://data.europa.eu/eli/ontology# foaf: http://xmlns.com/foaf/0.1/ org: https://www.w3.org/ns/org# schema: http://schema.org/",
    "prop" : {
        "issued" : "dct:issued",
        "licence" : "dct:license",
        "population" : "dbo:populationTotal",
        "gdpnominal" : "dbp:gdpNominal",
        "gdppercapita" : "dbp:gdpPppPerCapita",
        "area" : "dbo:areaTotal",
        "capital" : "dbo:capital",
        "language" : "dct:language",
        "currency" : "dbo:currency",
        "source" : "dct:source",
        "leader" : "dbo:leader",
        "title" : "dct:title",
        "label" : "rdfs:label",
        "component" : "qb:component",
        "structure" : "qb:structure",
        "relation" : "dct:relation",
        "name" : "foaf:name",
        "holds" : "org:holds",
        "telephone" : "schema:telephone",
        "fax" : "schema:faxNumber",
        "email" : "schema:email",
        "url" : "schema:url",
        "contact" : "schema:contactPoint",
        "description" : "dct:description",
        "competent" : "http://data.europa.eu/m8g/hasCompetentAuthority"
    },
    "class" : {
        "measure" : "qb:MeasureProperty",
        "datastructure" :"qb:DataStructureDefinition",
        "dataset" : "qb:DataSet",
        "framework": "cpsv:FormalFramework",
        "legalresource": "eli:LegalResource",
        "person" : "foaf:Person",
        "role" : "org:Role",
        "post" : "org:Post",
        "contact": "schema:ContactPoint"
    },
  • Configure the text strings used to identify certain properties such as currency, head of state and head of government:
    "text_identifier" : {
        "currency" : "Currency: ",
        "headofstate" : "Head of State: ",
        "headofgovernment" : "Head of Government: "
    },
  • Determine the keywords that are used to derive links to legal documents from the text.
    "type_framework" : {
        "act" : "act",
        "reg" : "regulation",
        "dir" : "directive",
        "law" : "law",
        "con" : "constitution"
    }
}
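To illustrate how the "prefixes" string and the compact terms in "prop" and "class" relate, the sketch below splits the string into a lookup table and expands a term such as dct:issued into a full IRI. This is an assumption about how the values can be interpreted, not the repository's own code; parsePrefixes and expandTerm are hypothetical helpers, and the config object is a trimmed-down copy of the excerpt above.

```javascript
// Trimmed-down copy of the configuration shown above.
const config = {
  prefixes: "dct: http://purl.org/dc/terms/ qb: http://purl.org/linked-data/cube#",
  prop: {
    issued: "dct:issued",
    competent: "http://data.europa.eu/m8g/hasCompetentAuthority",
  },
  class: { dataset: "qb:DataSet" },
};

// Turn "dct: http://... qb: http://..." into { dct: "http://...", qb: "http://..." }.
function parsePrefixes(prefixString) {
  const tokens = prefixString.trim().split(/\s+/);
  const map = {};
  for (let i = 0; i + 1 < tokens.length; i += 2) {
    map[tokens[i].replace(/:$/, "")] = tokens[i + 1];
  }
  return map;
}

// Expand a compact term such as "dct:issued".
// Terms whose prefix is not declared (including full IRIs) are returned unchanged.
function expandTerm(term, prefixMap) {
  const separator = term.indexOf(":");
  if (separator === -1) return term;
  const prefix = term.slice(0, separator);
  if (!Object.prototype.hasOwnProperty.call(prefixMap, prefix)) return term;
  return prefixMap[prefix] + term.slice(separator + 1);
}

const prefixMap = parsePrefixes(config.prefixes);
console.log(expandTerm(config.prop.issued, prefixMap));
console.log(expandTerm(config.class.dataset, prefixMap));
```

A full IRI such as the "competent" value passes through untouched because "http" is not a declared prefix.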

More detailed customisation of the annotations can be achieved by modifying the htmltordf.js code that is applied to the relevant section or subsection of the document. The sections are identified by their title. The annotations are applied using the methods provided by the Cheerio module.

switch(content){
    case "Country Profile":
        //Custom code here
        break;
    case "Digital Public Administration Highlights":
        //Custom code here
        break;
    case "Digital Public Administration Political Communications":
        //Custom code here
        break;
    case "Digital Public Administration Legislation":
        //Custom code here
        break;
    case "Digital Public Administration Governance":
        //Custom code here
        break;
    case "Digital Public Administration Infrastructure":
        //Custom code here
        break;
    case "Cross Border Digital Public Administration Services for Citizens and Business":
        //Custom code here
        break;
}
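As an illustration of what the custom code in a case such as "Country Profile" might look like, the sketch below finds paragraphs that start with the configured currency identifier and adds RDFa attributes to them with Cheerio. It is a hypothetical example, not the code in htmltordf.js: valueAfterIdentifier and annotateCurrency are invented names, and annotating the paragraph with a literal content value is an assumption about the desired markup.

```javascript
// Values copied from the config.json excerpt above.
const config = {
  prop: { currency: "dbo:currency" },
  text_identifier: { currency: "Currency: " },
};

// Return the text that follows an identifier such as "Currency: ",
// or null if the text does not start with that identifier.
function valueAfterIdentifier(text, identifier) {
  const trimmed = text.trim();
  if (!trimmed.startsWith(identifier)) return null;
  return trimmed.slice(identifier.length).trim();
}

// Annotate matching paragraphs in place. `$` is a document loaded with cheerio.load().
function annotateCurrency($, cfg) {
  $("p").each((index, element) => {
    const value = valueAfterIdentifier($(element).text(), cfg.text_identifier.currency);
    if (value !== null) {
      $(element).attr("property", cfg.prop.currency);
      $(element).attr("content", value);
    }
  });
}

// Usage (requires the cheerio package installed by npm install):
// const cheerio = require("cheerio");
// const $ = cheerio.load("<p>Currency: Euro</p>");
// annotateCurrency($, config);
// console.log($.html());
```

Keeping the text matching in a separate function makes it possible to check the identifier logic without loading an HTML document.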

Licence

Licensed under the EUROPEAN UNION PUBLIC LICENCE v.1.2

Authors: Jens Scheerlinck (PwC EU Services), Emidio Stani (PwC EU Services)