
Did 2019 hackathon identify any new potential checks? #25

Open
mobb opened this issue Jul 31, 2019 · 5 comments

Comments


mobb commented Jul 31, 2019

ask them to comment on this issue.


atn38 commented Aug 2, 2019

(This might overlap with planned or existing checks; I'm coming in hot with little context here.) Primarily, we found that congruence between attributes as listed in the metadata and in the data is crucial for writing tools that leverage EML. Otherwise, code needs to go an extra mile to match up the two sources of information and may not be reliable. @clnsmth concurs here.

These need to be at least WARNs if not ERRORs:

  • The same number of attributes listed in EML as columns in the data
  • The set of attributeNames in EML matches the set of column names in the data table
  • The order of attributeNames in EML (first to last in the attributeList) matches the column names left to right
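The three proposed congruence checks could be sketched roughly as below. This is an illustrative sketch only, not an ECC implementation: `congruence_report` is a hypothetical name, and it assumes a single-line CSV header and an EML document without namespace prefixes (real EML files usually need namespace handling).

```python
import csv
import xml.etree.ElementTree as ET

def congruence_report(eml_path, csv_path):
    """Compare attributeNames in an EML attributeList against a CSV header.

    Returns the outcome of the three proposed checks: same count,
    same set, and same order of names.
    """
    # Collect attributeName elements in document order.
    tree = ET.parse(eml_path)
    attrs = [e.text for e in tree.iter("attributeName")]

    # Read only the first line of the data table as the header.
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))

    return {
        "sameCount": len(attrs) == len(header),
        "sameSet": set(attrs) == set(header),
        "sameOrder": attrs == header,
    }
```

Note that `sameOrder` implies the other two, which is why an order check is the strongest (and strictest) of the three.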


mobb commented Aug 2, 2019

Thanks so much for these ideas! It's great to see more people attempt systematic, programmatic access to data; EML was designed for that. These suggestions have come up before, with lots of interesting discussion, and the fact that the hackathon group is working hard on visualization makes them very current.

Fair warning - the below is a bit of a monologue.

For a refresher on the ECC (and a history lesson for the new DMs), see
http://dx.doi.org/10.1016/j.ecoinf.2016.08.001

A few comments below on the current state of checks related to these issues:

  • Same number of attributes as columns in data
  • There are already several checks which attempt to ensure this. In any of your reports, see the checks called tooFewFields and tooManyFields.

There is an edge case where that situation isn't caught, so if you have examples of datasets where the number of columns does not match the number of attributes, please add IDs to this thread. We actually used this check as an example of something seemingly simple that is not so simple to implement. In the paper above, the discussion is right beneath Table 1.

  • Set of attributeNames in EML match set of column names in data table
  • Order matters; i.e., the attributeName is not a key to a column somewhere in the table, so we would probably never simply compare the two sets of strings.
  • Order of attributeNames in EML (first to last in attributeList) match column names left to right

Because the goal is that the attribute description can actually be used to read the data for analysis, checks have considered some of the other aspects of "matching" between metadata and data. In pie-in-the-sky discussions we've come up with a need to ensure:

  • order, uniqueness, typing, precision, range, quantity, unit, and even semantic meaning.

Some of the easier ones have been addressed in checks; see attributeNamesUnique, and dataLoadStatus (which uses Postgres to check typing).
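The two easier cases mentioned above can be sketched in a few lines. These are illustrative stand-ins, not the ECC's actual code: a duplicate-name scan in place of attributeNamesUnique, and a crude in-memory coercion test in place of the Postgres-backed typing check in dataLoadStatus.

```python
def attribute_names_unique(names):
    """Report any attributeName that appears more than once
    (a stand-in for the attributeNamesUnique check)."""
    seen, dupes = set(), []
    for n in names:
        if n in seen:
            dupes.append(n)
        seen.add(n)
    return dupes

def coerces_to_float(values):
    """Crude stand-in for a database load test: return the values
    that cannot be read as the declared numeric type."""
    bad = []
    for v in values:
        try:
            float(v)
        except ValueError:
            bad.append(v)
    return bad
```

A real database load enforces much more than this (precision, nullability, date parsing), which is part of why the ECC delegates typing to Postgres rather than reimplementing it.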

And we have considered a check like this one, to ensure that:

attributeName (in order) matches column header (in order)

There are some complications, among them:

  1. There are no standard formats for text tables, which makes an 'acceptable table' difficult to define.
  2. Headers can be any number of lines. If there is more than one, does one hold the names of the attributes? If so, how do we identify it?

So the logic gets a bit tricky; the first attempt at that check was simply to display both the attributeNames and the header for a user to compare manually. See this check: headerRowAttributeNames.
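A display-only check of this kind might look like the sketch below. The function name and signature are hypothetical, not the headerRowAttributeNames implementation; the number of header lines would come from the EML physical description, and no automatic match is attempted because any of the header lines (or none of them) might hold the attribute names.

```python
def header_row_display(csv_lines, num_header_lines, attribute_names):
    """Show each header line next to the metadata attributeNames so a
    human can compare them; makes no pass/fail judgment itself."""
    report = ["attributeNames: " + ", ".join(attribute_names)]
    # Echo every declared header line, since any one might hold the names.
    for i, line in enumerate(csv_lines[:num_header_lines], start=1):
        report.append(f"header line {i}: {line.rstrip()}")
    return "\n".join(report)
```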

But now that PASTA has been recording that in reports (for approximately the last year or two), we could start to analyze report content and see if reasonable logic might be developed to do more.

Warn vs error:
The ECC committee will not create an unnecessarily high bar for acceptance, which means only "unusable data" is rejected (gets an error). And for now, programmatic access is not the norm; humans can usually figure out what to do (e.g., by reading a table into R and applying manual examination, interpretation, and plotting). So until programmatic access becomes the norm, these sorts of checks will generate only a warn.

But again - thanks for pushing this community forward! We'll do our best to keep up. The ECC committee is a great group, and welcomes new members.


atn38 commented Aug 2, 2019 via email


gastil commented Aug 2, 2019

An,

The tooFewFields and tooManyFields checks use test INSERT statements to Postgres as part of their logic. (Notice the databaseTableCreated check shows the CREATE TABLE generated from the attributes' metadata.) In knb-lter-ble.1.5, the dataTables that pass the database load test do get the tooManyFields and tooFewFields checks; the others fail the load test, so they do not get the column-number check.

For date- and time-related columns that do not fit an ISO-8601 formatString, you have two choices: accept warns and miss out on the column-number checks, or give up using dateTime as the measurementType. A hard choice. You can see why the dateTime check was the most complicated one the ECC working group tackled.

Similar to a "profile", restricting formatStrings to ISO-8601 made the coding more practical in scope. Given infinite resources, if a different standard than 8601 were also available, and specified in the metadata, then a wider range of formatStrings could be handled.
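The shape of such a restricted check can be sketched as below. The mapping covers only a few illustrative formatStrings, not the full list the ECC accepts, and the function name is hypothetical; the idea is simply that an unrecognized formatString is rejected outright, while a recognized one is translated to a concrete parsing pattern.

```python
from datetime import datetime

# Illustrative subset only: a few EML formatStrings mapped to
# strptime patterns. The real check accepts a specific, longer
# list of ISO-8601 forms.
ISO_FORMATS = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "YYYY-MM-DDThh:mm:ss": "%Y-%m-%dT%H:%M:%S",
    "hh:mm:ss": "%H:%M:%S",
}

def validate_datetime_column(values, format_string):
    """Return the values that fail to parse under the declared
    formatString; reject formatStrings outside the accepted set."""
    pattern = ISO_FORMATS.get(format_string)
    if pattern is None:
        raise ValueError(
            f"formatString not in accepted ISO-8601 set: {format_string}"
        )
    bad = []
    for v in values:
        try:
            datetime.strptime(v, pattern)
        except ValueError:
            bad.append(v)
    return bad
```

This illustrates the trade-off Gastil describes: a column declared with a non-accepted formatString (e.g. "MM/DD/YYYY") cannot be validated at all under this scheme, no matter how consistent its values are.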


mobb commented Aug 2, 2019

Clearly Gastil has a better memory than I do.
On your other issue:

More on the semantic/expanding-EML side, but the hackathon also brought up the need to identify which data columns contain contextualizing information (spatial, temporal, possibly taxonomic) and which ones contain measurements. Knowing that would help immensely. For example, how does a program identify if a table contains spatial coordinate columns, and if yes, which columns, and which is x, y, z?

That is a good example of where semantic annotation can help.
I'll need to hunt down some examples for you, of what that would look like.
