Port GrandTheatreQuebec Huginn to Ruby #1
@dev-aravind Only a couple of Huginn scenarios left to migrate ;-)

@saumier will add the Huginn crawling details here.

@dev-aravind Here is the agent from Huginn. Instead of a CSS class it uses
@saumier The workflow stalled because the replace-blank-nodes SPARQL is taking much longer than expected to execute. The crawler's initial output, before any transformation, is a 400,000-line JSON-LD file with 16,000+ xhv:role entities. These roles are blank top-level nodes, so the SPARQL tries to assign a temporary URI to each of them. What do you suggest?
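For context, the expensive step boils down to minting an identifier for every top-level node that lacks one. A minimal Ruby sketch of that operation (the method name and the temporary base URI are assumptions, not the actual pipeline code):

```ruby
require "json"
require "securerandom"

# Assign a temporary URI to every top-level node in a JSON-LD @graph that
# has no @id, which is what the replace-blank-nodes SPARQL effectively does
# for the 16,000+ blank xhv:role nodes. Hypothetical helper for illustration.
def mint_blank_node_ids(doc, base: "http://kg.artsdata.ca/temp/")
  graph = doc["@graph"] || []
  graph.each do |node|
    # Only nodes without an existing @id get a minted URI.
    node["@id"] ||= base + SecureRandom.uuid
  end
  doc
end

doc = JSON.parse('{"@graph":[{"@type":"xhv:role"},{"@id":"http://example.org/e1"}]}')
result = mint_blank_node_ids(doc)
```

Doing this once in the crawler, before the SPARQL pipeline runs, would sidestep the slow SPARQL pass entirely.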
Task for @dev-aravind: remove the vocab role types and reorder the SPARQL run so that the replace-blank-nodes step runs last.
@dev-aravind Here is the list of the SPARQLs created for GTQ. Please look at each one and decide whether it can be added to our pipeline for all crawls. Some are already covered in our GitHub Artsdata Pipeline Action (like specific/lavitrine/fix-schemaorg-date-datatype); some will need to be added (like specific/lavitrine/fix-isni). The question to ask: can this SPARQL apply to all data feeds and improve the data quality for everyone? If yes, it should be added to the Action for everyone.

This is an important step, and I want to review each individual SPARQL in the list to check that we are using the best approach for maintainability. We need to make it as easy as we can for someone else to understand what each SPARQL does. The paths are relative to this folder: https://github.com/culturecreates/sparql-library/tree/master/artsdata/ETL/huginn
Things to consider in SPARQL
@saumier I added a set of SPARQLs to our pipeline; the PR can be found here. Please approve it if everything looks okay. I was not able to run add-artsdata-uri-using-wikidata-bridge and add-artsdata-uri-using-isni-bridge because of this error:
@dev-aravind I am a bit confused about the status. The repo artsdata-planet-gtq should have a workflow that can be triggered manually to crawl the GTQ website using the Artsdata Pipeline Action, apply some custom transforms, and publish an artifact on the Artsdata Databus. I merged the PR in the Artsdata Pipeline Action, but the two extra SPARQLs for event series still need to be removed there and added here, and you will need to create a new release of the Action to use it for GTQ. Please load a second artifact of GTQ events from this repo so it will be in the graph http://kg.artsdata.ca/culture-creates/artsdata-planet-gtq/derived-grandtheatre-qc-ca, and compare it to the current graph http://kg.artsdata.ca/culture-creates/huginn/derived-grandtheatre-qc-ca.
@dev-aravind I ran the workflow but there were errors; take a look here. The add-concepts job should wait for fetch-data to complete. Another suggestion, for clarity: do not include the SPARQL transforms under the same workflow task as add-concepts.
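The two requested changes (job ordering and separating the transforms) could look roughly like this in the GitHub Actions workflow. This is a hedged sketch: the job names, runner, and `run` commands are placeholders, not the repo's actual workflow.

```yaml
# Hypothetical workflow fragment: `needs:` enforces the ordering, and the
# SPARQL transforms get their own job instead of living inside add-concepts.
jobs:
  fetch-data:
    runs-on: ubuntu-latest
    steps:
      - name: Crawl GTQ website
        run: echo "fetch"          # placeholder for the crawl step
  add-concepts:
    needs: fetch-data              # waits for fetch-data to complete
    runs-on: ubuntu-latest
    steps:
      - name: Add SKOS concepts
        run: echo "concepts"       # placeholder
  sparql-transforms:
    needs: add-concepts            # transforms run as a separate job
    runs-on: ubuntu-latest
    steps:
      - name: Run SPARQL transforms
        run: echo "transform"      # placeholder
```

With `needs:` in place, a failure in fetch-data also skips the downstream jobs instead of letting add-concepts run against missing data.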
@saumier The data is now up in Artsdata and can be found here. However, I was not able to find http://kg.artsdata.ca/culture-creates/huginn/derived-grandtheatre-qc-ca in the Nebula interface. Please review this and let me know if you need any more changes.
Please compare the data from a couple of events (including an event series), and check specifically:
The last step after the above is for me to add 2 additional SPARQLs in the Artsdata platform (after loading from the Databus):
@saumier The data is consistent in my testing, and the changes you requested have been made.
GrandTheatreQuebec already has a Planet; this task is to remove the crawling still happening on Huginn. The Huginn workflow has an extra step when crawling each page: it scrapes the HTML of each event page for the keywords. The keywords are missing from the JSON-LD, so the workflow adds them to the JSON-LD and then maps them to the GrandTheatreQuebec event-type SKOS.
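That extra step can be sketched in Ruby roughly as follows. This is illustrative only: the method name is made up, and reading the keywords from a `<meta name="keywords">` tag is an assumption about where GTQ's pages expose them.

```ruby
require "json"

# Hypothetical sketch of the Huginn extra step: scrape the keywords out of an
# event page's HTML (assumed to be in a <meta name="keywords"> tag) and merge
# them into the event's JSON-LD under "keywords" before the SKOS mapping runs.
def add_keywords(event_jsonld, html)
  match = html.match(/<meta\s+name="keywords"\s+content="([^"]*)"/i)
  return event_jsonld unless match          # no keywords tag: leave JSON-LD untouched
  event = JSON.parse(event_jsonld)
  event["keywords"] = match[1].split(",").map(&:strip)
  JSON.generate(event)
end

html  = '<head><meta name="keywords" content="danse, musique"></head>'
event = '{"@type":"Event","name":"Example"}'
puts add_keywords(event, html)
```

The enriched `keywords` values are then what gets matched against the GrandTheatreQuebec event-type SKOS concepts.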
If needed, I can give you access to Huginn.

So I propose working in steps (each step can be loaded into Artsdata for review):