
ArrayIndexOutOfBoundsException in make sql-dump-parts #18

Open
alvaromorales opened this issue Jun 25, 2015 · 11 comments

Comments

@alvaromorales
Member

I'm installing the 2015-06-02 dumps and got an error in the make sql-dump-parts step. Parts 1-26 completed successfully, but the 27th file did not. I'm opening an issue, as the error message instructs.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
The errored article is     <title>File:Trinity   <page> ( /scratch/wikipedia-mirror/drafts/errored_articles ). Fixing... (time: Wed Jun 24 23:24:31 EDT 2015)
Will remove article '    <title>File:Trinity   <page>' from file /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml (size: 43533083690)
        Method: (blank is ok)
        search term: <title>    <title>File:Trinity   <page></title>
        file: /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml
        title offset:
Found '' Grep-ing (cat  /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml | grep -b -m 1 -F "<title>    <title>File:Trinity   <page></title>" | cut -d: -f1)
XML parse script failed. This is serous. report this at
        http://github.com/fakedrake/wikipedia-mirror/issues
@fakedrake
Member

Looks like the xml file is malformed or misread. Could you post the output of

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.xml

and also the contents of /scratch/wikipedia-mirror/drafts/errored_articles

EDIT: also the outputs of

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

and

tail -40 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

Might be useful.

@alvaromorales
Member Author

Thanks for following up. I've included the output you requested. It is pretty noisy, so let me know if you want me to be more specific with grep.

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015

https://paste.ee/r/uQtk9

the contents of /scratch/wikipedia-mirror/drafts/errored_articles

Ronald J. Rabago
Grażyna (poem)
Wikipedia:WikiProject Spam/LinkReports/firmenpresse.de
Wikipedia:WikiProject Spam/LinkReports/9sky.com
Wikipedia:Reference desk/Archives/Humanities/2011 December 30
    <title>File:Trinity   <page>
    <title>File:Trinity   <page>
    <title>File:Trinity   <page>
Wikipedia:WikiProject Spam/LinkReports/9sky.com
Wikipedia:Reference desk/Archives/Humanities/2011 December 30

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

https://paste.ee/r/R1TZH

tail -40 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

https://paste.ee/r/XbVso

@fakedrake
Member

Just so this is documented and you are not completely in the dark:

Due to a (probable) bug in mwdumper, when it is fed the XML and asked to produce SQL, xerces (the XML parser it uses) sometimes throws the exception you saw, namely:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

I found that removing the offending <page>...</page> entry fixes the problem, at the cost of losing that page.

The way I went about implementing this: the downloaded XML is parsed into another XML file by mwdumper. If that pass fails, we look backwards through the output file for the last <title> tag, remove the corresponding page from the source, and try again until mwdumper is happy. mwdumper then parses the "corrected" XML into SQL. All articles removed this way are logged to drafts/errored_articles.
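The remove-a-page step above can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual code: the function name is made up, and a regex over an in-memory string is only workable for small inputs (the real dump parts are tens of GB, which is why the scripts use byte offsets from grep -b instead). Note too that the recovered titles can themselves be garbled, as the log above shows (`<title>File:Trinity   <page>`), so matching on them is trickier in practice.

```python
import re

def remove_page(xml_text: str, bad_title: str) -> str:
    """Drop the <page>...</page> block whose <title> equals bad_title.

    The negative lookahead keeps the match from crossing an earlier
    page's closing tag, so only the offending page is removed.
    """
    pattern = re.compile(
        r"<page>(?:(?!</page>).)*?<title>"   # a page not yet closed
        + re.escape(bad_title)
        + r"</title>.*?</page>\s*",          # through its closing tag
        re.DOTALL,
    )
    return pattern.sub("", xml_text, count=1)
```

For example, remove_page(xml, "Bad") on a document with a "Good" page followed by a "Bad" page leaves only the "Good" page behind.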

The problem is almost definitely with my code, I will take a look at it shortly.

@dldharma

dldharma commented Sep 29, 2016

We have encountered the same error.
Please advise if there is a fix in place.
Regards,
D

@fakedrake
Member

@dldharma just so I don't have to download everything from scratch, do you have it on an infolab machine?

@dldharma

Unfortunately, it's not on an infolab machine.
We have it on our cloud server on AWS.


@michaelsilver
Member

@dldharma this project is now mostly defunct -- we had a lot of problems setting up a full mirror of Wikipedia. Check out WikipediaBase, a virtual database that combines local data obtained from the Wikipedia dumps with data fetched live from the Wikipedia API.

@dldharma

Thanks a lot for your prompt reply, much appreciated.
WikipediaBase looks great and I'm trying it right now.
START is an amazing initiative -- I tried it and it works great!

Regards,
Dileep


@dldharma

dldharma commented Oct 6, 2016

@michaelsilver the import process completed successfully. It populated the Articles, Classes, and Article-Classes mappings. Thanks once again for sharing WikipediaBase.

On reviewing the data, I found that article categories are present only in Article.markup.
To my surprise, the category tables (the parent/child category relations) and the category-to-article relations are completely missing.
I spiked the code and will need to enhance the mechanism to support them.
Any ideas or alternatives for this category challenge?

Regards,
Dileep

@michaelsilver
Member

@dldharma why don't you make an issue in WikipediaBase and we can discuss further there. When you create the issue, please provide a printout of the tables you have populated (\d in postgres) and describe what you mean by "article categories". By category, do you mean which type of infobox the article has?

@dldharma

dldharma commented Oct 8, 2016

@michaelsilver agreed. Created issue 277. Thanks once again for your prompt replies, much appreciated!
