Skip to content

Latest commit

 

History

History
236 lines (186 loc) · 10.2 KB

README.adoc

File metadata and controls

236 lines (186 loc) · 10.2 KB

yakety-csv

The yakety-csv or Yet-Another-Know-Everything-Tentative-Yap-CSV Java library.
It should work with any field character separator, so ,, ; and tabs; even custom separators.
At this point it only works with single character line break: \n, \r or custom single character. Soon support for Windows \r\n will be added.

It is meant to be configured and only require little coding for usage. Also, the main feature is to parse the file lazily; keeping memory usage low even while parsing GB long files.

There are plenty implementations out there, please look for them, this is an exercise.

Features

  • Fully functional, Java Stream based API

  • Configuration over code:
    “library user defines rules not process”

    • Columns are defined for reuse (Recommendation to use an enum)

  • At some later point, full support to RFC4180 both strict and non-strict

  • Support for different locales

  • Support for quoted values

  • On parsing:

    • Step 1 list of ordered string values

    • Step 2 list of string values by key

    • Step 3 nullable values

    • Step 4 basic type coercion from string

    • Step 5 Boxed variation on basic types

    • Step 6 Support for date, time, datetime (timezone or not)

    • Step 7 assembled value object

Bucket list

  • Caching of parsed results (denormalized files tend to repeat a lot, memoized field parsing may save time in some cases).

  • Support for custom line breaks that are more than one character long

  • in case there is a coercion error, return a marker that point to file, line, field where problem happened.

  • replace apache validation library with own parser and include a formatter parameter that is not recreated on every call

Release Plans

  • v0.0.X - empty project that compiles

  • v0.1.X - stream String value of fields from a csv using strict RFC4180

  • v0.2.X - Enable column configuration

  • v0.3.X?- Proper handle of quotes

  • v1.0.X - stream Map.Entry of String values by column (RFC4180)

  • v1.1.X - nullable columns (map stream cannot take null values)

  • v1.2.X - proper limited parsing of int and double

  • v1.3.X - proper parsing of long and float

  • v1.4.X - proper parsing to Boxed numbers

  • v1.5.X - proper parsing of Date, Time and DateTime

  • v1.6.X - proper parsing of BigInteger and BigDecimal

  • v1.7.X - support to domain values (enum) parsing

  • v2.0.0 - Populate a java bean from the textual field by column map, includes:

    • safe parsing of standard nullable types: Integer, Long, Float, Double, BigInteger, BigDecimal, LocalDate, LocalDateTime, LocalTime.

    • parsing of domain-based values backed by enums;

    • API for custom field parsing;

    • API that joins field parser and map’s key lookup;

    • any field that fails to be parsed is null;

    • all API avoids exceptions, so the stream is never broken;

    • API for a transformer from Map<K, String> to java bean.

    • API for indexed/id’ed beans

  • v3.0.0 - Adding feature to identify parsing problems, includes:

    • Removal of all deprecated API;

    • implementation of Either;

    • a new parser and bean assembly that uses Either as output;

    • a new interface to represent a problem parsing a field, which can be applied along with ColumnDefinition;

    • introduction of result expectation vs actual result;

    • marker pointing to location where parsing failed.

    • generate a digest based unique id for the source file, use it as identifier

  • v4.0.0 - first API stable version, includes:

    • optimizations

    • extra logging

    • integration test comments and documentation

    • package publication

Tasks

  1. setup project:

    • ✓ gradle

    • ✓ spock tests

    • ✓ spock integration tests

    • ✓ git ignores

  2. functionalities:

    • ✓ simple csv to stream of fields

    • ✓ configurable parser

    • ✓ file format configuration

    • ✓ column definition interface

    • ✓ configurable csv columns to stream of String fieldByColumnName maps

    • ✓ indexed row value as field in map

    • ✓ use dynamic programming to check if line break is within quotes, ignore it if it is. should consume large files without blowing up the stack.

    • ✓ parser localization

    • ✓ column definition map to expected type (string for now)

    • ✓ from the map result apply identity type coercion to bean

    • ❏ add coercion checks with bad results as separate dataset from raw values

    • ❏ add null constraints

    • ❏ configurable csv columns with type coercion (all types)

    • ❏ configurable csv columns with type coercion to list of objects

How to build

Environment setup requirements

Java 16+ is needed, get it with SDKMan Gradle configuration recommended, ~/.gradle/gradle.properties:

org.gradle.parallel=true
org.gradle.jvmargs=-Xmx2048M
org.gradle.caching=true
org.gradle.daemon.idletimeout=1800000
org.gradle.java.home=/home/user/.sdkman/candidates/java/16.0.1-open # (1)
  1. your own path for the JDK 16+

TL;DR

./gradlew

How to use

The concept usage is that you are either: - exploring data from a file you do not know the format or - parsing well known CSV format multiple times from different files.

Simple CSV to Java Stream<List<String>>

import org.shimomoto.yakety.csv.config.FileFormatConfiguration;
import org.shimomoto.yakety.csv.CsvParserFactory;

final FileFormatConfiguration config =
        FileFormatConfiguration.builder().build()
final CsvParser textParser =
        CsvParserFactory.toText(config)

final Stream<List<Stream>> textResults =
    textParser.parse(new File("that_data.csv"))

Simple CSV to Stream<Map<String,String>>

With added field for the line index, starting at 1 (headers were zero). The field name must not clash with a column name.

It is purely positional (does not check if first field matches first header column name), if you mess up the fields order, you mess up the mapping.

import org.shimomoto.yakety.csv.config.FileFormatConfiguration;
import org.shimomoto.yakety.csv.CsvParserFactory;

final FileFormatConfiguration config =
        FileFormatConfiguration.builder()
            .indexColumn("#")
            .columns(List.of("colA","colB","colC"))
            .build()
final CsvParser indexedMapParser =
        CsvParserFactory.toRowIndexedTextMap(config)

final Stream<Map<String,String>> textResults =
    indexedMapParser.parse(new File("that_data.csv"))

Simple CSV to Stream<?> where ? is a POJO

It builds upon the fields by column map with a dynamic index, those results are used to build a Java Bean.

A transformer from Stream<Map<? extends ColumnDefinition,String>> to whatever aggregate is to be used is needed.

import org.shimomoto.yakety.csv.config.FileFormatConfiguration;
import org.shimomoto.yakety.csv.CsvParserFactory;

final FileFormatConfiguration config =
        FileFormatConfiguration.builder()
            .indexColumn(MyVirtualColumns.INDX) // (1)
            .columns(MyColumns.values()) // (2)
            .build()

final BeanAssembly<MyColumns, MyAggregate> transformer =
    new MyTransformer(Locale.EN)  // (3)

final CsvParser beanParser =
        CsvParserFactory.toBeans(config)

final Stream<MyAggregate> aggregates =
        beanParser.parse(new File("that_data.csv"))
  1. Must implement org.shimomoto.yakety.csv.api.ColumnDefinition and have a common interface with the columns bellow.

  2. Must implement org.shimomoto.yakety.csv.api.ColumnDefinition and have a common interface with the index above.

  3. Must implement BeanAssembly interface; it is responsible for coercing the values from String and assign to POJO field, all done without throwing exceptions.

TL;DR

Snippets are not real life examples?! Ok, read the contents of MarvelIT.groovy, it is creating multiple parsers using different approaches and it does process some data for close inspection.

If you just want to read from the test results:

./gradlew integrationTest

then open build/reports/spock-reports/integrationTest/index.html, these are the integration tests results

Learning notes

  1. Scanner discards empty elements at beginning or end, which works ok when splitting lines, also being lazy is a must; String.split(/pattern/, -1) works correctly (empty fields show up) but takes a String instead of Pattern; the Pattern.split(/string/, -1) works when the number of fields is unknown; when the number of fields is known just pass the number instead of a negative.

  2. Regular expressions with matches and groups take more processing power, the lookahead doesn’t and works as would the index based string walk.
    The regular expression break by line blows up the stack; the solution I can think of is to consume lines, then check if there is an open quote, consume another line until all open quotes are closed, then it would be better to just already consume fields while at that.

  3. Java Pattern class cannot be used on hash or equals 🤷.

  4. Apache validation library recreates Formatter on every call, so it does what is needed (no exception while parsing) but may degrade performance on large volumes.

Questions

  1. Should the ColumnDefinition be enforced at API level? That would force split for String columns.
    Perhaps it should be enforced when types are to be used…​