Port GrandTheatreQuebec Huginn to Ruby #1

saumier · 2023-10-23T20:08:42Z

GrandTheatreQuebec already has a Planet. The goal of this issue is to retire the crawl still running on Huginn. The Huginn workflow has an extra step when crawling each page: it scrapes the HTML of each event page for keywords. The keywords are missing from the JSON-LD, so the workflow adds them to the JSON-LD and then maps them to the GrandTheatreQuebec event type SKOS.

If needed, I can give you access to Huginn.

So I propose working in steps (each step can be loaded into Artsdata for review):

1. Normal crawl using Artsdata Pipeline Action to get JSON-LD of each webpage into Artsdata
2. Custom scrape to extract keywords from event webpages
3. Mapping keywords to GrandTheatreQuebec event type SKOS
4. Add specific SPARQL transforms (review Huginn's list of SPARQLs with Gregory, some may not be needed)
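
To make the keyword scrape and mapping steps concrete, here is a minimal sketch of what they do to a single event. It assumes the keywords have already been scraped from the page. The mapping table, its example.org URIs and the function name `add_keywords` are hypothetical placeholders, not the real GrandTheatreQuebec event type SKOS.

```python
# Hypothetical keyword -> event type concept table (placeholder URIs).
KEYWORD_TO_EVENT_TYPE = {
    "danse": "http://example.org/gtq-event-type#Dance",
    "musique": "http://example.org/gtq-event-type#Music",
}


def add_keywords(event, keywords, mapping=KEYWORD_TO_EVENT_TYPE):
    """Return a copy of the JSON-LD event with keywords and mapped types added."""
    enriched = dict(event)
    enriched["keywords"] = list(keywords)
    # Keep only the keywords that have a known event type concept.
    types = [mapping[k.lower()] for k in keywords if k.lower() in mapping]
    if types:
        enriched["additionalType"] = types
    return enriched


event = add_keywords({"@type": "Event", "name": "Example"}, ["Danse", "Famille"])
```

Keywords without a mapping stay in `keywords` but add no `additionalType`, so unmapped values remain visible for review.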

saumier · 2024-09-05T22:20:31Z

@dev-aravind Only a couple of Huginn scenarios left to migrate ;-)

dev-aravind · 2024-11-04T13:17:49Z

@saumier will add the huginn crawling details here.

saumier · 2024-11-04T23:46:21Z

@dev-aravind Here is the agent from Huginn. Instead of a CSS class it uses "xpath": "//article[@class=\"show\"]//a" to get the list of @href for the events.

{
  "expected_update_period_in_days": "100",
  "url": [
    "https://grandtheatre.qc.ca/programmation/"
  ],
  "type": "html",
  "mode": "all",
  "extract": {
    "url": {
      "xpath": "//article[@class=\"show\"]//a",
      "value": "concat(\"https://grandtheatre.qc.ca\",@href)"
    }
  },
  "template": {
    "graph_name": "{{graph_name}}"
  }
}
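
For reference, the link extraction this agent performs can be approximated outside Huginn. The sketch below uses only the Python standard library and mimics the XPath by collecting every `<a>` nested in an `<article class="show">`. The class name `ShowLinkParser` is made up for illustration. Unlike the XPath, it skips anchors that have no href.

```python
from html.parser import HTMLParser

BASE_URL = "https://grandtheatre.qc.ca"


class ShowLinkParser(HTMLParser):
    """Collect absolute URLs of links found inside <article class="show">."""

    def __init__(self):
        super().__init__()
        # One flag per open <article>: True when its class is exactly "show".
        self.article_stack = []
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self.article_stack.append(attrs.get("class") == "show")
        elif tag == "a" and any(self.article_stack):
            href = attrs.get("href")
            if href:
                # Same effect as concat("https://grandtheatre.qc.ca", @href).
                self.urls.append(BASE_URL + href)

    def handle_endtag(self, tag):
        if tag == "article" and self.article_stack:
            self.article_stack.pop()


parser = ShowLinkParser()
parser.feed('<article class="show"><a href="/programmation/example/">x</a></article>')
```

After `feed`, `parser.urls` holds the event page URLs to crawl.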

dev-aravind · 2024-11-07T12:37:35Z

@saumier The workflow stalled because the replace-blank-nodes SPARQL takes longer than expected to execute. The crawler's initial output, before any transformation, is a 400,000-line JSON-LD file with 16,000+ xhv:role entities. These roles are top-level blank nodes, so the SPARQL tries to assign a temporary URI to each of them. What do you suggest?

dev-aravind · 2024-11-07T13:18:46Z

Task for @dev-aravind: remove the vocab role types and reorder the SPARQL run so that the replace-blank-nodes SPARQL runs last.
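
The effect of that ordering can be sketched on a flattened list of JSON-LD nodes: drop the role nodes first, then give the remaining blank nodes temporary URIs. This is only an illustration in Python, not the SPARQL used in the pipeline. The `xhv:role` key shape, the example.org base URI and both function names are assumptions.

```python
from itertools import count


def drop_role_nodes(nodes):
    """Remove nodes that carry an xhv:role property (assumed key shape)."""
    return [n for n in nodes if "xhv:role" not in n]


def replace_blank_nodes(nodes, base="http://example.org/temp/"):
    """Give blank nodes temporary URIs and keep references to them consistent."""
    mapping = {}
    counter = count(1)

    def new_uri(old_id):
        if old_id not in mapping:
            mapping[old_id] = f"{base}{next(counter)}"
        return mapping[old_id]

    def rewrite(value, top_level=False):
        if isinstance(value, list):
            return [rewrite(v) for v in value]
        if isinstance(value, dict):
            out = {k: rewrite(v) for k, v in value.items()}
            node_id = out.get("@id")
            if isinstance(node_id, str) and node_id.startswith("_:"):
                out["@id"] = new_uri(node_id)
            elif node_id is None and top_level:
                # Top-level node with no identifier at all.
                out["@id"] = f"{base}{next(counter)}"
            return out
        return value

    return [rewrite(n, top_level=True) for n in nodes]


nodes = [
    {"@id": "_:b0", "xhv:role": "navigation"},
    {"@type": "Event", "location": {"@id": "_:b1"}},
    {"@id": "_:b1", "@type": "Place"},
]
# Filtering first means the expensive replacement step sees fewer nodes.
result = replace_blank_nodes(drop_role_nodes(nodes))
```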

saumier · 2024-11-08T13:38:27Z

@dev-aravind Here is the list of the SPARQLs created for GTQ. Please look at each one and decide whether it can be added to our pipeline for all crawls. Some are already covered in our GitHub Artsdata Pipeline Action (like specific/lavitrine/fix-schemaorg-date-datatype), and some will need to be added (like specific/lavitrine/fix-isni).

The question to ask: Can this SPARQL apply to all data feeds and improve the data quality for everyone? If yes then it should be added to the Action for everyone.

This is an important step and I will want to review each individual SPARQL in the list to check that we are using the best approach for maintainability. We need to make it as easy as we can for someone else to understand what each SPARQL does.

The paths are relative to this folder https://github.com/culturecreates/sparql-library/tree/master/artsdata/ETL/huginn

Things to consider in SPARQL

Several SPARQLs from the Huginn pipeline use graph <graph_name_placeholder> {... }, which should be removed from the SPARQLs added to the Artsdata Pipeline Action, unless the graph is inside a federated part using an external endpoint (like service <http://db.artsdata.ca/repositories/artsdata>).
The Huginn pipeline runs the SPARQLs in GraphDB with inferencing turned on. The Artsdata pipeline does not (yet) use inferencing, so special attention should be given to checking that inferencing is not needed. For example, with inferencing select * {?s a schema:Event} will include all subtypes such as schema:MusicEvent, schema:DanceEvent, etc.; without it, only entities explicitly typed as schema:Event match.
Several SPARQLs use federated queries to get Artsdata URIs. The SPARQL specific/lavitrine/add-keywords-additional-type-mapping is used to add the mapping of keyword to event type (using additionalType) without having to download the mapping file. However, for maintainability, I think this can be replaced by loading the event type mapping file gtq-event-type-mapping.ttl into the graph data uploaded to Artsdata so everything is in a single data feed. This way the workflow will not depend on the mapping file already being loaded into Artsdata.
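
To illustrate the inferencing point, one possible way to keep a query working without inferencing is to list the subtypes explicitly. The sketch below builds a VALUES clause from a small, hand-written subclass table. The table is deliberately incomplete, and the helper names are hypothetical; it is not taken from the sparql-library.

```python
# Partial, hand-written subclass table (child -> parent), for illustration only.
SUBCLASS_OF = {
    "schema:MusicEvent": "schema:Event",
    "schema:DanceEvent": "schema:Event",
}


def with_subtypes(target, subclass_of=SUBCLASS_OF):
    """Return the target class plus all of its transitive subclasses."""
    result = {target}
    changed = True
    while changed:
        changed = False
        for child, parent in subclass_of.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def values_clause(var, classes):
    """Render a SPARQL VALUES clause listing the classes in sorted order."""
    return f"VALUES ?{var} {{ {' '.join(sorted(classes))} }}"


QUERY = f"""SELECT * WHERE {{
  {values_clause('type', with_subtypes('schema:Event'))}
  ?s a ?type .
}}"""
```

An alternative would be a property path such as `rdfs:subClassOf*`, but that only works if the schema.org class hierarchy is loaded alongside the data.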

dev-aravind · 2024-11-13T11:39:57Z

@saumier I added a set of SPARQLs to our pipeline and the PR can be found here. Please approve this if you find everything to be okay.

I was not able to run the add-artsdata-uri-using-wikidata-bridge and add-artsdata-uri-using-isni-bridge because of this error:
SERVICE operator not implemented (NotImplementedError). Have you encountered this previously?

saumier · 2025-01-04T21:09:45Z

@dev-aravind I am a bit confused about the status. The repo artsdata-planet-gtq should have a workflow that can be triggered manually to crawl the GTQ website using the Artsdata Pipeline Action, apply some custom transforms, and publish to the Artsdata Databus as artifact derived-grandtheatre-qc-ca. Once the generic SPARQL transforms are moved to the Artsdata Pipeline Action, the remaining specific SPARQL transforms and the keyword mapping will run in this repo. But I don't see any SPARQL transforms in this repo. The 2 specific SPARQLs related to event series should be removed from the Artsdata Pipeline Action and placed in this artsdata-planet-gtq repo.

I merged the PR in Artsdata Pipeline Action, but the 2 extra SPARQLs for event series still need to be removed from it and added here, and you will need to create a new release of the Action to use it for GTQ.

Please load a second artifact of GTQ events from this repo so it will be in the graph http://kg.artsdata.ca/culture-creates/artsdata-planet-gtq/derived-grandtheatre-qc-ca and compare it to the current graph http://kg.artsdata.ca/culture-creates/huginn/derived-grandtheatre-qc-ca.

dev-aravind · 2025-01-06T13:04:13Z

@saumier I updated the PR to include the event series SPARQLs and also did a release for artsdata-pipeline-action. Please review it and let me know if you need any more changes.

saumier · 2025-01-06T23:53:50Z

@dev-aravind I ran the workflow but there were errors. Take a look here. The add-concepts should wait for the fetch-data to complete.

Another suggestion for clarity is to not include the SPARQL transforms under the same task in the workflow called add-concepts.

dev-aravind · 2025-01-07T09:13:01Z

@saumier The data is now up in Artsdata and can be found here. However, I was not able to find http://kg.artsdata.ca/culture-creates/huginn/derived-grandtheatre-qc-ca in the Nebula interface. Please review this and let me know if you need any more changes.

saumier · 2025-01-07T13:18:08Z

@dev-aravind

First, please rename the artifact to derived-grandtheatre-qc-ca instead of derived-grandtheatrequebec-ca. The general file-naming convention is to use the domain of the website and replace dots (.) with dashes (-). In this case we are also adding keywords and eventSeries, which is why I follow the convention of adding the derived prefix to the artifact name: it indicates that we are doing more than just processing the JSON-LD on the pages.
Please rename the generated URI for Event Series, currently https://www.grandtheatre.qc.ca/programmation/coucou-passe-partout-spectacle/#EventSeries, to be consistent with the other event URIs (remove the 'www'). This is set in the SPARQL that creates the event series.
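
Both requests are simple string rules, sketched here in Python for clarity. The function names `artifact_name` and `strip_www` are hypothetical. In the actual workflow the artifact name is set in the workflow configuration and the URI is built in the event series SPARQL.

```python
from urllib.parse import urlsplit, urlunsplit


def artifact_name(domain, derived=False):
    """Replace dots with dashes; prefix with 'derived-' when data is enriched."""
    name = domain.replace(".", "-")
    return f"derived-{name}" if derived else name


def strip_www(uri):
    """Drop a leading 'www.' from the host, leaving path and fragment intact."""
    parts = urlsplit(uri)
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return urlunsplit(parts._replace(netloc=host))


name = artifact_name("grandtheatre.qc.ca", derived=True)
series_uri = strip_www(
    "https://www.grandtheatre.qc.ca/programmation/coucou-passe-partout-spectacle/#EventSeries"
)
```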

Please compare the data for a couple of events (including an event series), and check specifically:

additionalType
creation of Event Series (sparql)

The last step after the above is for me to add 2 additional SPARQLs in the Artsdata platform (after loading from the Databus) :

add-artsdata-uri-using-wikidata-bridge.sparql - this will add an Artsdata ID :sameAs to entities that have a Wikidata ID. For example, the place Grand Théâtre de Québec has the Wikidata ID Q3114610, so after loading the JSON-LD into Artsdata this SPARQL will add a :sameAs to its Artsdata ID.
add-artsdata-uri-using-isni-bridge.sparql - this will add an Artsdata ID :sameAs to entities that have an ISNI ID that is also mapped to an Artsdata ID.
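
The effect of these bridge SPARQLs can be pictured with a small in-memory sketch. It is not the SPARQL itself, which runs inside Artsdata. The bridge table, the placeholder Artsdata URI and the function name are assumptions made for illustration; only the Wikidata ID comes from the example above.

```python
# Hypothetical bridge: external identifier URI -> Artsdata URI (placeholder value).
WIKIDATA_TO_ARTSDATA = {
    "http://www.wikidata.org/entity/Q3114610": "http://example.org/artsdata/PLACEHOLDER-ID",
}


def add_artsdata_same_as(entity, bridge=WIKIDATA_TO_ARTSDATA):
    """Return the entity with a sameAs link added for each bridged identifier."""
    existing = entity.get("sameAs", [])
    if isinstance(existing, str):
        existing = [existing]
    # Add the Artsdata URI only if it is known and not already present.
    added = [bridge[u] for u in existing if u in bridge and bridge[u] not in existing]
    if not added:
        return entity
    return {**entity, "sameAs": existing + added}


place = add_artsdata_same_as({
    "@type": "Place",
    "name": "Grand Théâtre de Québec",
    "sameAs": "http://www.wikidata.org/entity/Q3114610",
})
```

The ISNI bridge follows the same pattern with a table keyed by ISNI identifiers.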

dev-aravind · 2025-01-08T12:46:43Z

@saumier The data seems to be consistent from my testing, and the changes you requested are also fixed.

saumier changed the title ~~Crawl GrandTheatreQuebec Huginn to Planet~~ Move GrandTheatreQuebec Huginn to Planet Dec 19, 2023

saumier changed the title ~~Move GrandTheatreQuebec Huginn to Planet~~ Port GrandTheatreQuebec Huginn to Ruby Jan 9, 2024

saumier assigned dev-aravind Jan 10, 2024

saumier transferred this issue from culturecreates/nebula Jan 10, 2024

saumier removed the status in Artsdata Jan 23, 2024

saumier unassigned dev-aravind Jun 3, 2024

saumier moved this to Todo in Artsdata Sep 5, 2024

saumier assigned dev-aravind Sep 5, 2024

dev-aravind assigned saumier and unassigned dev-aravind Nov 4, 2024

dev-aravind moved this from In Progress to In Review in Artsdata Nov 4, 2024

saumier assigned dev-aravind and unassigned saumier Nov 4, 2024

dev-aravind assigned saumier and unassigned dev-aravind Nov 7, 2024

dev-aravind moved this from In Progress to In Review in Artsdata Nov 7, 2024

dev-aravind assigned dev-aravind and unassigned saumier Nov 7, 2024

dev-aravind moved this from In Review to Todo in Artsdata Nov 7, 2024

dev-aravind assigned saumier and unassigned dev-aravind Nov 13, 2024

dev-aravind moved this from In Progress to In Review in Artsdata Nov 13, 2024

dev-aravind added the question Further information is requested label Nov 13, 2024

dev-aravind moved this from Todo to In Progress in Artsdata Dec 30, 2024

dev-aravind assigned saumier and unassigned dev-aravind Jan 2, 2025

dev-aravind moved this from In Progress to Todo in Artsdata Jan 2, 2025

saumier assigned dev-aravind and unassigned saumier Jan 4, 2025

dev-aravind moved this from Todo to In Progress in Artsdata Jan 6, 2025

dev-aravind assigned saumier and unassigned dev-aravind Jan 6, 2025

dev-aravind moved this from In Progress to In Review in Artsdata Jan 6, 2025

saumier assigned dev-aravind and unassigned saumier Jan 6, 2025

saumier moved this from In Review to Todo in Artsdata Jan 6, 2025

dev-aravind moved this from Todo to In Progress in Artsdata Jan 7, 2025

dev-aravind assigned saumier and unassigned dev-aravind Jan 7, 2025

dev-aravind moved this from In Progress to In Review in Artsdata Jan 7, 2025

saumier assigned dev-aravind and unassigned saumier Jan 7, 2025

saumier moved this from In Review to Todo in Artsdata Jan 7, 2025

dev-aravind moved this from Todo to In Progress in Artsdata Jan 8, 2025

dev-aravind assigned saumier and unassigned dev-aravind Jan 8, 2025

dev-aravind moved this from In Progress to In Review in Artsdata Jan 8, 2025
