
Dbpedia History #745

Open
wants to merge 54 commits into base: history-extraction
Changes from 13 commits
Commits
54 commits
5342469
Manage REST API answer link shape
Sep 14, 2022
7174c71
Not cleaning HTML before getJsoupDoc and clean it inside for managing…
Sep 14, 2022
d00cf7a
Create a WikipediaNifExtractor extension for REST API answer
Sep 14, 2022
89c6d5c
change connector
Sep 14, 2022
6461bce
add possibility to choose connector
Sep 14, 2022
9cdeb79
deprecate class
Sep 14, 2022
29329c0
Create a MediaWikiConnector Abstract class for gathering common params
Sep 14, 2022
06202d5
Create a new connector for the REST API
Sep 14, 2022
053f0ab
Create a new connector for the REST API
Sep 14, 2022
2050875
script for creating custom dump sample
Sep 14, 2022
7ad20ae
script for generating Minidump from uri list generated by create_cust…
Sep 14, 2022
f55d803
script for creating uri list randomly from id list
Sep 14, 2022
f7686ec
adapt property files to new possible APIS
Sep 14, 2022
782214b
add new param for MWC api
Sep 14, 2022
984b5d4
new Test for abstract benchmark
Sep 14, 2022
021ca01
Add new properties for API connectors
Sep 14, 2022
7001bcd
adapt for extension
Sep 14, 2022
78d91d6
Update core/src/main/scala/org/dbpedia/extraction/util/MediawikiConne…
datalogism Sep 16, 2022
d7929da
Update dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTest…
datalogism Sep 16, 2022
167b342
Update dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTest…
datalogism Sep 16, 2022
6112b94
Update dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTest…
datalogism Sep 16, 2022
e97dd88
Update core/src/main/scala/org/dbpedia/extraction/config/Config.scala
datalogism Sep 16, 2022
969f3b4
Update core/src/main/scala/org/dbpedia/extraction/config/Config.scala
datalogism Sep 16, 2022
e89f813
clear comments of API config and fix plain abstract API urls
Sep 19, 2022
2246f21
snake case to camel case
Sep 19, 2022
da5f135
first dev on historic
Nov 4, 2022
eb68385
add last dev
Nov 21, 2022
cb076dd
ADD FIRST HISTORY PROTOTYPE
Nov 27, 2022
9f870bb
ADD final version of History prototype
Dec 6, 2022
837e402
clean
Dec 6, 2022
cb1526d
clean2
Dec 6, 2022
0f85985
Update history/ReadMe.md
datalogism Dec 7, 2022
bbf64f4
Update history/ReadMe.md
datalogism Dec 7, 2022
37376c1
Update history/src/main/scala/org/dbpedia/extraction/dump/extract/Ser…
datalogism Dec 7, 2022
9ff3818
Update dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTest…
datalogism Dec 8, 2022
8932d97
Update dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTest…
datalogism Dec 8, 2022
367dfe9
Update history/ReadMe.md
datalogism Dec 8, 2022
a8c7736
Update history/ReadMe.md
datalogism Dec 8, 2022
716a780
Update history/ReadMe.md
datalogism Dec 8, 2022
fb87924
Update history/src/main/scala/org/dbpedia/extraction/dump/extract/Ext…
datalogism Dec 8, 2022
a3a8063
Update history/src/main/scala/org/dbpedia/extraction/dump/extract/Ext…
datalogism Dec 8, 2022
376d1cc
Update history/ReadMe.md
datalogism Dec 8, 2022
b247260
Update history/ReadMe.md
datalogism Dec 8, 2022
d9c12f7
Update history/src/main/scala/org/dbpedia/extraction/dump/extract/Con…
datalogism Dec 8, 2022
4221c0f
Update history/ReadMe.md
datalogism Dec 8, 2022
a755007
Update history/ReadMe.md
datalogism Dec 8, 2022
67312b1
Update ReadMe.md
datalogism Dec 8, 2022
b08442d
Update ReadMe.md
datalogism Dec 9, 2022
4286753
Update ReadMe.md
datalogism Jan 5, 2023
a6ebbc5
Update ReadMe.md
datalogism Jan 5, 2023
f87066b
Update ReadMe.md
datalogism Jan 5, 2023
447bc7a
Update history/ReadMe.md
datalogism Jan 6, 2023
e119c8b
Update history/ReadMe.md
datalogism Jan 6, 2023
f97dafa
Update history/ReadMe.md
datalogism Jan 6, 2023
@@ -3,14 +3,26 @@
designed for testing abstracts extractors
## Before all

* Delete tag @DoNotDiscover of ExtractionTestAbstract
* add the tag @DoNotDiscover to other test class
* Delete the tag `@DoNotDiscover` from `ExtractionTestAbstract`
* Add the tag `@DoNotDiscover` to the other test classes (as illustrated below)
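For reference, a minimal sketch of how a suite is excluded from discovery, assuming a recent ScalaTest where suites extend `AnyFunSuite` (the class name below is hypothetical; the framework's actual test classes may extend a different base):

```scala
import org.scalatest.DoNotDiscover
import org.scalatest.funsuite.AnyFunSuite

// A suite annotated with @DoNotDiscover is skipped by ScalaTest's automatic
// discovery, so `mvn test` will not pick it up unless it is run explicitly.
@DoNotDiscover
class SomeOtherExtractionTest extends AnyFunSuite {
  test("kept out of the default run") { assert(true) }
}
```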

## Procedure :
## Procedure
1. Clean your target directory with `mvn clean` in the root directory of DIEF
1. Go to bash scripts via `cd /dump/src/test/bash`
1. OPTIONAL: Create a new Wikipedia minidump sample with `bash create_custom_sample.sh -n $numberOfPage -l $lang -d $optionalDate`
1. Process sample of Wikipedia pages `bash Minidump_custom_sample.sh -f $filename/lst`
1. Go to bash scripts via
```shell
cd /dump/src/test/bash
```
1. OPTIONAL: Create a new Wikipedia minidump sample with
```shell
bash create_custom_sample.sh -n $numberOfPage -l $lang -d $optionalDate
```
1. Process sample of Wikipedia pages
```shell
bash Minidump_custom_sample.sh -f $filename/lst
```
1. Update the extraction language parameter for your minidump sample in [`extraction.nif.abstracts.properties`](https://github.com/datalogism/extraction-framework/blob/gsoc-celian/dump/src/test/resources/extraction-configs/extraction.nif.abstracts.properties) and in [`extraction.plain.abstracts.properties`](https://github.com/datalogism/extraction-framework/blob/gsoc-celian/dump/src/test/resources/extraction-configs/extraction.plain.abstracts.properties)
1. Change the name of your log in the [`ExtractionTestAbstract.scala`](https://github.com/datalogism/extraction-framework/blob/gsoc-celian/dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTestAbstract.scala) file
1. Rebuild the app with `mvn install`, or just test it with `mvn test -Dtest="ExtractionTestAbstract2"`
1. Rebuild the app with `mvn install`, or just test it with
```shell
mvn test -Dtest="ExtractionTestAbstract2"
```
28 changes: 14 additions & 14 deletions history/ReadMe.md
@@ -5,7 +5,7 @@ DBpedia History enables the history of a Wikipedia chapter to be extracted into

## Previous work

This DBpedia App is a scala/java version of the first work conducted by the French Chapter : https://github.com/dbpedia/Historic/
This DBpedia App is a Scala/Java version of the first work conducted by the French Chapter, <https://github.com/dbpedia/Historic/>.

Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin. Wikipedia editing history in DBpedia: extracting and publishing the encyclopedia editing activity as linked data. IEEE/WIC/ACM International Joint Conference on Web Intelligence (WI' 16), Oct 2016, Omaha, United States. <hal-01359575>
https://hal.inria.fr/hal-01359575
@@ -15,26 +15,26 @@ https://hal.inria.fr/hal-01359583

## A first working prototype

This prototype is not optimized, during its development of it we were faced with the WikiPage type checking constraints that are checked in almost every module of the DBpedia pipeline.
We hardly copy/paste and renamed all the classes and objects we needed for running the extractors.
This conception could be easily improved by making WikiPage and WikiPageWithRevision objects inherit from the same abstract object.
But as a first step, we wanted to touch the less possible DBpedia core module.
This prototype is not optimized. During its development, we were faced with the WikiPage type-checking constraints that are checked in almost every module of the DBpedia pipeline.
We basically copy/pasted and renamed all the classes and objects we needed for running the extractors.
This design could easily be improved by making `WikiPage` and `WikiPageWithRevision` inherit from the same abstract class (a minimal sketch of this idea is given below).
But as a first step, we didn't want to impact the core module.
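As a minimal sketch of that refactoring idea, with simplified, hypothetical names and fields (the real `WikiPage` carries many more):

```scala
// Both page types extend one abstract base, so extractors can accept the
// base type instead of being duplicated for the history prototype.
abstract class AbstractWikiPage(val title: String, val source: String)

// Plain page, as used by the existing extraction pipeline.
class SimpleWikiPage(title: String, source: String)
  extends AbstractWikiPage(title, source)

// Page carrying its full revision list, as used by the history prototype.
case class Revision(id: Long, timestamp: String, contributor: String)

class RevisionWikiPage(title: String, source: String, val revisions: List[Revision])
  extends AbstractWikiPage(title, source)

// An extractor written against the base type works for both variants.
trait PageExtractor {
  def extract(page: AbstractWikiPage): Seq[String]
}
```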

Some other improvements that could be conducted:
Some other improvements that could be made:
* Scala version
* Being able to use a historic namespace taking into account the DBpedia chapter language
* Being able to follow if a revision impacts an infobox content
* Enabling use of a historic namespace, taking into account the DBpedia chapter language
* Enabling detection of whether a revision impacts the content of an infobox (a rough sketch follows below)
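A rough sketch of that last idea as a standalone helper (the object, its name, and the naive regex are hypothetical illustrations, not part of the prototype):

```scala
// Hypothetical helper illustrating the "did this revision touch an infobox?"
// improvement. It compares the {{Infobox ...}} spans of two wikitext versions;
// a real implementation would need proper template parsing to handle nesting.
object InfoboxChangeDetector {
  private val InfoboxPattern = """(?s)\{\{\s*Infobox.*?\}\}""".r

  private def infoboxes(wikitext: String): Seq[String] =
    InfoboxPattern.findAllIn(wikitext).toSeq

  def revisionTouchesInfobox(oldText: String, newText: String): Boolean =
    infoboxes(oldText) != infoboxes(newText)
}
```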

## Main Classes

* [WikipediaDumpParserHistory.java](src/main/java/org/dbpedia/extraction/sources/WikipediaDumpParserHistory.java) : for parsing of the history dumps
* [RevisionNode.scala](src/main/scala/org/dbpedia/extraction/wikiparser/RevisionNode.scala) : define revision node object
* [WikiPageWithRevision](src/main/scala/org/dbpedia/extraction/wikiparser/WikiPageWithRevisions.scala) : define wikipage with revision list object
* [WikipediaDumpParserHistory.java](src/main/java/org/dbpedia/extraction/sources/WikipediaDumpParserHistory.java): parses the history dumps
* [RevisionNode.scala](src/main/scala/org/dbpedia/extraction/wikiparser/RevisionNode.scala): defines the revision node object
* [WikiPageWithRevision](src/main/scala/org/dbpedia/extraction/wikiparser/WikiPageWithRevisions.scala): defines the wiki page object that carries a revision list

## Extractors

* [HistoryPageExtractor.scala](src/main/scala/org/dbpedia/extraction/mappings/HistoryPageExtractor.scala): Extract all the revision of every wikipedia pages
* [HistoryStatsExtractor.scala](src/main/scala/org/dbpedia/extraction/mappings/HistoryStatsExtractor.scala) : Extract statistics about the revision activity for every page of Wikipedia
* [HistoryPageExtractor.scala](src/main/scala/org/dbpedia/extraction/mappings/HistoryPageExtractor.scala): extracts all revisions of every Wikipedia page
* [HistoryStatsExtractor.scala](src/main/scala/org/dbpedia/extraction/mappings/HistoryStatsExtractor.scala): extracts statistics about the revision activity of every Wikipedia page (an illustrative skeleton is sketched below)
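A purely illustrative skeleton of what such a revision extractor does; the page and revision types, the `ex:` predicates, and the plain-tuple output are assumptions chosen for readability, not the prototype's actual `Quad`-based API:

```scala
// Stand-in data types for the sketch.
case class SimpleRevision(id: Long, timestamp: String, contributor: String)
case class SimplePage(uri: String, revisions: Seq[SimpleRevision])

object RevisionTripleSketch {
  // Iterate over the revisions attached to a page and emit one
  // (subject, predicate, object) statement per piece of revision metadata.
  // The `ex:` predicates are placeholders, not the properties the prototype emits.
  def extract(page: SimplePage): Seq[(String, String, String)] =
    page.revisions.flatMap { rev =>
      val revUri = s"${page.uri}?oldid=${rev.id}"
      Seq(
        (revUri, "ex:revisionOf", page.uri),
        (revUri, "ex:timestamp", rev.timestamp),
        (revUri, "ex:contributor", rev.contributor)
      )
    }
}
```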

## How to run it?

@@ -48,4 +48,4 @@ Some other improvements that could be conducted:
* configure the [extraction.properties](extraction.properties) file
* and run ```../run run extraction.properties```

* Test it with : mvn test (need to have a containing file frwiki-[YYYYMMDD]-download-complete empty flag file into the base-dir defined into the extraction-properties file )
* Test it with `mvn test` (requires an empty flag file named `frwiki-[YYYYMMDD]-download-complete` in the `base-dir` configured in the extraction properties file)
@@ -280,7 +280,7 @@ class ConfigLoader2(config: Config2)
/**
* Loads the configuration and creates extraction jobs for all configured languages.
*
* @return Non-strict Traversable over all configured extraction jobs i.e. an extractions job will not be created until it is explicitly requested.
* @return Non-strict Traversable over all configured extraction jobs, i.e., an extraction job will not be created until it is explicitly requested.
*/
def getExtractionJobs: Traversable[ExtractionJob2] =
{
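A standalone Scala illustration of the non-strict contract documented above, using stand-in types rather than the framework's `ExtractionJob2`: wrapping the input in a `view` delays job construction until each element is actually traversed.

```scala
// Standalone illustration of "non-strict": nothing is built until traversed.
object NonStrictJobsSketch {
  final case class DummyJob(language: String)

  def makeJob(language: String): DummyJob = {
    println(s"building job for $language") // side effect to make laziness visible
    DummyJob(language)
  }

  def getJobs(languages: Seq[String]): Traversable[DummyJob] =
    languages.view.map(makeJob) // no job is created yet

  def main(args: Array[String]): Unit = {
    val jobs = getJobs(Seq("en", "fr"))
    println("jobs requested, none built yet")
    jobs.foreach(job => println(s"running ${job.language}")) // jobs built here, one by one
  }
}
```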
@@ -14,7 +14,7 @@ import org.dbpedia.extraction.wikiparser.{Namespace, WikiPage, WikiPageWithRevis
* @param source The extraction source
* @param namespaces Only extract pages in these namespaces
* @param destination The extraction destination. Will be closed after the extraction has been finished.
* @param language the language of this extraction.
* @param language The language of this extraction.
*/
class ExtractionJob2(
extractor: WikiPageWithRevisionsExtractor,
@@ -44,7 +44,7 @@ class ExtractionJob2(
println(graph.toString())
destination.write(graph)
}
//if the internal extraction process of this extractor yielded extraction records (e.g. non critical errors etc.), those will be forwarded to the ExtractionRecorder, else a new record is produced
//if the internal extraction process of this extractor yielded extraction records (e.g., non-critical errors, etc.), those will be forwarded to the ExtractionRecorder; else, a new record is produced
val records = page.getExtractionRecords() match{
case seq :Seq[RecordEntry2[WikiPageWithRevisions]] if seq.nonEmpty => seq
case _ => Seq(new RecordEntry2[WikiPageWithRevisions](page, page.uri, RecordSeverity.Info, page.title.language))