QuiXDM is a ubiquitous open-source data model for processing the following formats in a streaming fashion:
- XML (via XQuery and XPath Data Model (XDM))
- JSON
- YAML
- RDF triples
- RDF quads
- CSV
- TSV
- HTML
To install it:
- Check out this code. It is compatible with Java 1.8+
- Get access to Saxon 9.7: http://saxon.sourceforge.net/
- Get access to Jackson Core 2.7.4: https://github.com/FasterXML/jackson-core
- Get a few other dependencies (see pom.xml)
SAX, StAX, DOM, Jackson, Jena, CSVParser, and HTMLParser already exist for processing data. Here is how some of them compare with QuiXDM:
Feature\API | SAX | StAX | DOM | Jackson | QuiXDM |
---|---|---|---|---|---|
in memory/streaming | streaming | streaming | in memory | streaming | streaming |
push/pull | push | pull | -- | pull | pull |
data model | low level XML | low level XML | low level XML | low level JSON | XPath Data Model |
handle sequence | no | no | no | no | yes |
handle json/yaml | no | no | no | yes | yes |
handle rdf | no | no | no | no | yes |
handle csv | no | no | no | no | yes |
handle html | no | no | no | no | yes |
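To make the push/pull row concrete, here is a minimal sketch of the pull style using the JDK's own StAX API (javax.xml.stream), not QuiXDM. The class name `PullDemo` and the event labels are illustrative choices; the point is only that the caller asks for the next event instead of receiving callbacks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PullDemo {
    // Pulls events one by one from a StAX reader and records a short label for each.
    static List<String> events(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the parser to deliver adjacent character data as a single event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        out.add("START_ELEMENT " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        out.add("TEXT " + reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        out.add("END_ELEMENT " + reader.getLocalName());
                        break;
                    default:
                        break; // other event types are ignored in this sketch
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
        return out;
    }

    public static void main(String[] args) {
        events("<a><b>hi</b></a>").forEach(System.out::println);
    }
}
```

With a push API such as SAX, the same labels would be produced inside handler callbacks invoked by the parser, so the consumer cannot simply stop asking for events.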
QuiXDM uses a consistent data model to represent all of this content as a stream of events.
// Here is the grammar of events
sequence := START_SEQUENCE, (document|json_yaml|table|semantic)*, END_SEQUENCE
document := START_DOCUMENT, (PROCESSING-INSTRUCTION|COMMENT)*, element, (PROCESSING-INSTRUCTION|COMMENT)*, END_DOCUMENT
json_yaml := START_JSON, object, END_JSON
table := START_TABLE, header*, array_of_array, END_TABLE
semantic := START_RDF, statement*, END_RDF
element := START_ELEMENT, (NAMESPACE|ATTRIBUTE)*, (TEXT|element|PROCESSING-INSTRUCTION|COMMENT)*, END_ELEMENT
object := START_OBJECT, (KEY_NAME, value)*, END_OBJECT
value := object|array|flat_value
flat_value := VALUE_FALSE|VALUE_TRUE|VALUE_NUMBER|VALUE_NULL|VALUE_STRING
array := START_ARRAY, value*, END_ARRAY
array_of_array := START_ARRAY, flat_array+, END_ARRAY
flat_array := START_ARRAY, flat_value*, END_ARRAY
statement := START_PREDICATE, SUBJECT, OBJECT, GRAPH?, END_PREDICATE
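As an illustration of how the grammar reads, the following sketch checks a list of events against the `json_yaml`, `object`, `array` and `value` productions with a small recursive-descent recognizer. The `Tok` enum is a hypothetical stand-in whose names are copied from the grammar; the real token type is in QuiXToken.java and may differ.

```java
import java.util.Arrays;
import java.util.List;

class JsonEventGrammar {
    // Hypothetical token names copied from the grammar above (not the real QuiXToken type).
    enum Tok { START_JSON, END_JSON, START_OBJECT, END_OBJECT, START_ARRAY, END_ARRAY,
               KEY_NAME, VALUE_FALSE, VALUE_TRUE, VALUE_NUMBER, VALUE_NULL, VALUE_STRING }

    private final List<Tok> toks;
    private int pos = 0;

    private JsonEventGrammar(List<Tok> toks) { this.toks = toks; }

    // json_yaml := START_JSON, object, END_JSON (and nothing left over)
    static boolean isValid(List<Tok> toks) {
        JsonEventGrammar g = new JsonEventGrammar(toks);
        return g.eat(Tok.START_JSON) && g.object() && g.eat(Tok.END_JSON)
                && g.pos == toks.size();
    }

    // Consumes the next token only if it is the expected one.
    private boolean eat(Tok t) {
        if (pos < toks.size() && toks.get(pos) == t) { pos++; return true; }
        return false;
    }

    // object := START_OBJECT, (KEY_NAME, value)*, END_OBJECT
    private boolean object() {
        if (!eat(Tok.START_OBJECT)) return false;
        while (eat(Tok.KEY_NAME)) {
            if (!value()) return false;
        }
        return eat(Tok.END_OBJECT);
    }

    // array := START_ARRAY, value*, END_ARRAY
    private boolean array() {
        if (!eat(Tok.START_ARRAY)) return false;
        while (pos < toks.size() && toks.get(pos) != Tok.END_ARRAY) {
            if (!value()) return false;
        }
        return eat(Tok.END_ARRAY);
    }

    // value := object|array|flat_value
    private boolean value() {
        if (pos >= toks.size()) return false;
        switch (toks.get(pos)) {
            case START_OBJECT: return object();
            case START_ARRAY:  return array();
            case VALUE_FALSE: case VALUE_TRUE: case VALUE_NUMBER:
            case VALUE_NULL: case VALUE_STRING:
                pos++;
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // Events for the JSON text {"a":[1,true]}
        System.out.println(isValid(Arrays.asList(
            Tok.START_JSON, Tok.START_OBJECT, Tok.KEY_NAME, Tok.START_ARRAY,
            Tok.VALUE_NUMBER, Tok.VALUE_TRUE, Tok.END_ARRAY, Tok.END_OBJECT, Tok.END_JSON)));
    }
}
```

The other productions (`document`, `table`, `semantic`) could be recognized the same way, one method per rule.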
For the event types, look mostly at QuiXToken.java.
With Object creation (à la javax.xml.stream.XMLEventReader)
The simplest way to use it is to instantiate innovimax.quixproc.datamodel.in.QuiXEventStreamReader:
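import java.util.Arrays;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import innovimax.quixproc.datamodel.in.QuiXEventStreamReader;
// Assumption: Source is javax.xml.transform.Source and each path is wrapped in a StreamSource;
// adjust this if the QuiXEventStreamReader constructor expects another Source type.
Iterable<Source> sources = Arrays.asList(
    new StreamSource("/tmp/file/file_aaa.xml"),
    new StreamSource("/tmp/file/file_aab.json"),
    new StreamSource("/tmp/file/file_aac.csv"),
    new StreamSource("/tmp/file/file_aad.yml"),
    new StreamSource("/tmp/file/file_aae.n3")
);
// Pull the events one by one and print them
QuiXEventStreamReader qesr = new QuiXEventStreamReader(sources);
while (qesr.hasNext()) {
    System.out.println(qesr.next());
}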
Lightweight iterator without Object creation (à la javax.xml.stream.XMLStreamReader)
TODO
Why QuiXCharStream and QuiXQName?
A streaming interface for XML should really be streaming, yet Java offers no character-streaming interface that fits:
- String is definitely not streamable and is limited to 2^31 - 1 characters
- CharSequence, which could have been streamable, is not either, because it exposes length()
- CharIterator doesn't exist in the JDK (but you can find it here)
- CharSequence.chars() returns an IntStream, because Java 8 did not add a primitive CharStream
- A Java 8 Stream<Character> boxes every char, which is highly inefficient
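To illustrate the gap described in this list, here is a minimal sketch of a forward-only, unboxed character source that never exposes a length. `CharSource` and `ReaderCharSource` are hypothetical names invented for this example and are not the QuiXCharStream API; the IntStream variant is shown for comparison and needs the whole text as a CharSequence.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.NoSuchElementException;

class CharSourceSketch {
    // Hypothetical interface: primitive chars, forward only, no length().
    interface CharSource {
        boolean hasNext();
        char next();
    }

    // Adapts any java.io.Reader, reading one char ahead to answer hasNext().
    static final class ReaderCharSource implements CharSource {
        private final Reader reader;
        private int lookahead;

        ReaderCharSource(Reader reader) {
            this.reader = reader;
            this.lookahead = read();
        }

        private int read() {
            try {
                return reader.read(); // -1 signals end of input
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override public boolean hasNext() { return lookahead != -1; }

        @Override public char next() {
            if (lookahead == -1) throw new NoSuchElementException();
            char c = (char) lookahead;
            lookahead = read();
            return c;
        }
    }

    // Counts occurrences of target without ever holding the full text in memory.
    static long count(CharSource src, char target) {
        long n = 0;
        while (src.hasNext()) {
            if (src.next() == target) n++;
        }
        return n;
    }

    // Same count through CharSequence.chars(), which yields ints rather than chars.
    static long countViaIntStream(CharSequence text, char target) {
        return text.chars().filter(c -> c == target).count();
    }

    public static void main(String[] args) {
        System.out.println(count(new ReaderCharSource(new StringReader("banana")), 'a'));
        System.out.println(countViaIntStream("banana", 'a'));
    }
}
```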
Given this context, QuiXCharStream and QuiXQName were introduced in order to:
- address the TEXT recombination issue (in XDM, two text() nodes cannot be adjacent siblings)
- stream even corner-case XML:
  - huge strings
  - huge names
  - huge namespace URIs
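The TEXT recombination issue can be sketched with a naive coalescer that merges every run of adjacent text chunks into one. The string encoding of events (`"TEXT:..."` for a chunk, anything else for another event) is a hypothetical convention for this example only and is not how QuiXDM represents events.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TextCoalescer {
    private static final String TEXT_PREFIX = "TEXT:";

    // Merges each run of consecutive TEXT chunks into a single TEXT event.
    static List<String> coalesce(List<String> events) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = null; // text accumulated for the current run, if any
        for (String e : events) {
            if (e.startsWith(TEXT_PREFIX)) {
                if (pending == null) pending = new StringBuilder();
                pending.append(e.substring(TEXT_PREFIX.length()));
            } else {
                if (pending != null) {
                    out.add(TEXT_PREFIX + pending);
                    pending = null;
                }
                out.add(e);
            }
        }
        if (pending != null) out.add(TEXT_PREFIX + pending);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(coalesce(Arrays.asList(
            "START_ELEMENT", "TEXT:foo", "TEXT:bar", "END_ELEMENT")));
    }
}
```

Note that this approach buffers the whole text run in memory, so it does not scale to the huge strings listed above.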
Innovimax is contributing to this work
QuiXDM can be used standalone
This is the data model of QuiXPath and QuiXProc
It is part of two bigger projects:
- QuiXProc: https://github.com/innovimax/quixproc
- QuiXPath: https://github.com/innovimax/quixpath