
ArrayIndexOutOfBoundsException in make sql-dump-parts #18

Open
alvaromorales opened this issue Jun 25, 2015 · 11 comments

Comments

@alvaromorales
Member

I'm installing the 2015-06-02 dumps and got an error in the make sql-dump-parts step. Parts 1-26 completed successfully, but the 27th file did not. I'm opening an issue, as the error message instructs.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
The errored article is     <title>File:Trinity   <page> ( /scratch/wikipedia-mirror/drafts/errored_articles ). Fixing... (time: Wed Jun 24 23:24:31 EDT 2015)
Will remove article '    <title>File:Trinity   <page>' from file /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml (size: 43533083690)
        Method: (blank is ok)
        search term: <title>    <title>File:Trinity   <page></title>
        file: /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml
        title offset:
Found '' Grep-ing (cat  /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml | grep -b -m 1 -F "<title>    <title>File:Trinity   <page></title>" | cut -d: -f1)
XML parse script failed. This is serous. report this at
        http://github.com/fakedrake/wikipedia-mirror/issues
@fakedrake
Member

Looks like the xml file is malformed or misread. Could you post the output of

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.xml

and also the contents of /scratch/wikipedia-mirror/drafts/errored_articles

EDIT: also the outputs of

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

and

tail -40 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

Might be useful.

@alvaromorales
Member Author

Thanks for following up. I've included the output you requested. It is pretty noisy, so let me know if you want me to be more specific with grep.

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015

https://paste.ee/r/uQtk9

the contents of /scratch/wikipedia-mirror/drafts/errored_articles

Ronald J. Rabago
Grażyna (poem)
Wikipedia:WikiProject Spam/LinkReports/firmenpresse.de
Wikipedia:WikiProject Spam/LinkReports/9sky.com
Wikipedia:Reference desk/Archives/Humanities/2011 December 30
    <title>File:Trinity   <page>
    <title>File:Trinity   <page>
    <title>File:Trinity   <page>
Wikipedia:WikiProject Spam/LinkReports/9sky.com
Wikipedia:Reference desk/Archives/Humanities/2011 December 30

grep -F "File:Trinity" -B 10 -A 10 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

https://paste.ee/r/R1TZH

tail -40 /scratch/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20150602-pages-meta-current27.xml-p029625001p046872015.fix.xml

https://paste.ee/r/XbVso

@fakedrake
Member

Just so this is documented and you are not completely in the dark:

Due to a (probable) bug in mwdumper, when it is fed the XML and asked to produce SQL, xerces (the XML parser it uses) sometimes throws the exception you saw, namely:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

I found that removing the offending <page>...</page> entry fixes the problem, at the cost of losing that page.

The way I went about implementing this: the downloaded XML is parsed into another XML file by mwdumper. If that pass fails, we look backwards through the output file for the last <title> tag, remove the corresponding page from the source, and try again until mwdumper is happy. mwdumper then parses the "corrected" XML into SQL. All articles removed this way are logged to drafts/errored_articles.
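The remove-a-page step above can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual code: the function name is made up, and a regex over an in-memory string is only workable for small inputs (the real dump parts are tens of GB, which is why the scripts use byte offsets from grep -b instead). Note too that the recovered titles can themselves be garbled, as the log above shows (`<title>File:Trinity   <page>`), so matching on them is trickier in practice.

```python
import re

def remove_page(xml_text: str, bad_title: str) -> str:
    """Drop the <page>...</page> block whose <title> equals bad_title.

    The negative lookahead keeps the match from crossing an earlier
    page's closing tag, so only the offending page is removed.
    """
    pattern = re.compile(
        r"<page>(?:(?!</page>).)*?<title>"   # a page not yet closed
        + re.escape(bad_title)
        + r"</title>.*?</page>\s*",          # through its closing tag
        re.DOTALL,
    )
    return pattern.sub("", xml_text, count=1)
```

For example, remove_page(xml, "Bad") on a document with a "Good" page followed by a "Bad" page leaves only the "Good" page behind.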

The problem is almost definitely with my code, I will take a look at it shortly.

@dldharma

dldharma commented Sep 29, 2016

We have encountered the same error.
Please advise if there is a fix in place.
Regards,
D

@fakedrake
Member

@dldharma just so I don't have to download everything from scratch, do you have it on an infolab machine?

@dldharma

Unfortunately, it's not on an infolab machine.
We have it on our cloud server on AWS.


@michaelsilver
Member

@dldharma this project is now mostly defunct -- we had a lot of problems setting up a full mirror of Wikipedia. Check out WikipediaBase, a virtual database that combines local data obtained from the Wikipedia dumps with data fetched live from the Wikipedia API.

@dldharma

Thanks a lot for your prompt reply, much appreciated.
WikipediaBase looks great and I'm trying it right now.
START is an amazing initiative -- I tried it and it works great!

Regards,
Dileep


@dldharma

dldharma commented Oct 6, 2016

@michaelsilver the import process completed successfully. It populated the Articles, Classes, and Article-Classes mappings. Thanks once again for sharing WikipediaBase.

On reviewing the data, I found that article categories are present only in Article.markup.
To my surprise, the category tables (the parent/child category relations) and the category-to-article relations are completely missing.
I spiked the code and will need to enhance the mechanism to support them.
Any ideas or alternatives for this category challenge?

Regards,
Dileep

@michaelsilver
Member

@dldharma why don't you make an issue in WikipediaBase and we can discuss further there. When you create the issue, please provide a printout of the tables you have populated (\d in postgres) and describe what you mean by "article categories". By category, do you mean which type of infobox the article has?

@dldharma

dldharma commented Oct 8, 2016

@michaelsilver agreed. Created issue 277. Thanks once again for your prompt replies, much appreciated!
