Did 2019 hackathon identify any new potential checks? #25
(This might overlap with planned or existing checks; coming in hot with little context here.) Primarily, we found that congruence between the attributes as listed in the metadata and those in the data is crucial for writing tools that leverage EML. Otherwise, code needs to go an extra mile to match up the two sources of information and may not be reliable. @clnsmth concurs here. These need to be at least WARNs, if not ERRORs.
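To make the congruence point concrete, here is a minimal sketch of the comparison a client tool ends up writing. The function name and the findings dict are hypothetical illustrations (not ECC checks), covering the three kinds of congruence discussed in this thread: column count, name set, and name order.

```python
import csv
import io

def compare_attributes(attribute_names, csv_text):
    """Compare EML attributeNames against the first row of a CSV table.

    Hypothetical helper for illustration only; returns simple congruence
    findings a client tool could act on before trusting the metadata.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "same_count": len(header) == len(attribute_names),  # column-count match
        "same_set": set(header) == set(attribute_names),    # names match, order ignored
        "same_order": header == list(attribute_names),      # strict left-to-right match
    }

findings = compare_attributes(
    ["site", "date", "temp_c"],
    "site,temp_c,date\n1,12.5,2019-08-02\n",
)
# same_count and same_set are True here, but same_order is False
```

Even this toy version shows why the checks below distinguish count, set, and order: each failure mode calls for a different remedy.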
Thanks so much for these ideas! It's great to see more people attempt systematic, programmatic access to data; EML was designed for that. These suggestions have come up before, with lots of interesting discussion, and the fact that the hackathon group is working hard on viz makes them very current. Fair warning: the below is a bit of a monologue.

For a refresher on the ECC (and a history lesson for the new DMs), see http://dx.doi.org/10.1016/j.ecoinf.2016.08.001

A few comments below on the current state of checks related to these issues:

- Same number of attributes as columns in data
  - There are already several checks which attempt to ensure this; in any of your reports, see the checks called tooFewFields and tooManyFields.
  - There is an edge case where that situation isn't caught, so if you have examples of datasets where the number of columns does not match the number of attributes, please add IDs to this thread. We actually used this check as an example of something seemingly simple that is not so simple to implement; in the paper above, the discussion is right beneath Table 1.
- Set of attributeNames in EML matches set of column names in data table
  - Order matters; i.e., the attributeName is not a key to a column somewhere in the table. So we would probably never simply compare the two sets of strings.
- Order of attributeNames in EML (first to last in attributeList) matches column names left to right

Because the goal is that the attribute description can actually be used to read the data for analysis, checks have considered some of the other aspects of "matching" between metadata and data. In pie-in-the-sky discussions we've come up with a need to ensure order, uniqueness, typing, precision, range, quantity, unit, and even semantic meaning. Some of the easier ones have been addressed in checks; see attributeNamesUnique, and dataLoadStatus (which uses postgres to check typing).

And we have considered a check like this one, to ensure that attributeName (in order) matches column header (in order). There are some complications, among them:

1. There are no standard formats for text tables, which makes an 'acceptable table' difficult to define.
2. Headers can be any number of lines. If there is more than one, does one of them hold the names of the attributes? If so, how do we identify it?

So the logic gets a bit tricky. The first attempt at that check was simply to display both the attributeNames and the header for a user to compare manually; see the check headerRowAttributeNames. But now that PASTA is recording that in reports (for approximately the last year or two), we could start to analyze report content and see whether reasonable logic might be developed to do more.

Warn vs. error: the ECC committee will not create an unnecessarily high bar for acceptance, which means only "unusable data" is rejected (gets an error). And for now, programmatic access is not the norm; humans can usually figure out what to do (e.g., by reading a table into R and applying manual examination, interpretation, and plotting). So until programmatic access becomes the norm, these sorts of checks will generate only a warn.

But again, thanks for pushing this community forward! We'll do our best to keep up. The ECC committee is a great group, and it welcomes new members.

Margaret,

Thanks for the history lesson and the context! I see that it's not always straightforward to implement these checks, as much as they make sense. By the bye, I looked at one BLE dataset's ECC report here <https://portal.lternet.edu/nis/reportviewer?packageid=knb-lter-ble.1.5> and found that only the second entity out of four total (Elson 2015 spatial survey) has "tooFewFields" and "tooManyFields" listed as executed checks. Do checks ever run silently and not show up in reports?

More on the semantic/expanding-EML side, but the hackathon also brought up the need to identify which data columns contain contextualizing information (spatial, temporal, possibly taxonomic) and which ones contain measurements. Knowing that would help immensely. For example, how does a program identify whether a table contains spatial-coordinate columns, and if so, which columns they are and which is x, y, z? Date/times are a bit easier to approach but also not completely straightforward.
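On identifying contextualizing columns: absent semantic annotation, a tool today can only fall back on fragile name-based heuristics like the sketch below. The keyword lists and function name are my own illustration, not anything EML or PASTA defines, and the sketch's brittleness is exactly why annotation would help.

```python
def guess_column_role(name):
    """Crude, name-based guess at a column's role (hypothetical heuristic).

    Substring matching misfires easily (e.g. 'datetime_logger_id', or
    locale-specific labels), so this is a stand-in for real semantics.
    """
    n = name.lower()
    spatial = ("lat", "lon", "x_", "y_", "depth", "elev", "easting", "northing")
    temporal = ("date", "time", "year", "month", "day")
    if any(k in n for k in spatial):
        return "spatial"
    if any(k in n for k in temporal):
        return "temporal"
    return "measurement"

roles = {c: guess_column_role(c) for c in ["latitude", "sample_date", "chlorophyll"]}
# {'latitude': 'spatial', 'sample_date': 'temporal', 'chlorophyll': 'measurement'}
```

Note that nothing here can tell which of several spatial columns is x, y, or z; that ordering question needs metadata, not guessing.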
Ah, the tooFewFields and tooManyFields checks use the test of INSERT statements to postgres as part of their logic. For date- and time-related columns that do not fit an ISO-8601 formatString, you have two choices: accept warns and miss out on column-number checks, or give up on using dateTime as the measurementType. A hard choice. You can see why the dateTime check was the most complicated one the ECC WG tackled. Similar to a "profile", by restricting formatStrings to ISO-8601 we made the coding more practical in scope. Given infinite resources, if a different standard than 8601 were also available, and specified in the metadata, then a wider range of formatStrings could be handled.
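A small sketch of why the ISO-8601 restriction makes the dateTime check tractable: a closed mapping from formatString to a parser covers the allowed patterns, whereas arbitrary formatStrings would each need bespoke handling. The mapping and function below are illustrative only, not the ECC's actual implementation.

```python
from datetime import datetime

# Illustrative subset of ISO-8601 formatStrings (not the ECC's actual table).
FORMATS = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "YYYY-MM-DDThh:mm:ss": "%Y-%m-%dT%H:%M:%S",
    "YYYY": "%Y",
}

def values_match_format(values, format_string):
    """Return True if every value parses under the declared formatString."""
    fmt = FORMATS.get(format_string)
    if fmt is None:
        return False  # formatString outside the supported profile -> warn
    try:
        for v in values:
            datetime.strptime(v, fmt)
    except ValueError:
        return False
    return True

ok = values_match_format(["2019-08-02", "2019-08-03"], "YYYY-MM-DD")
bad = values_match_format(["08/02/2019"], "YYYY-MM-DD")
# ok is True, bad is False
```

Supporting a second date standard would mean adding a whole second mapping plus a way to declare which standard the metadata uses, which is the "infinite resources" scenario above.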
Clearly Gastil has a better memory than I do.
That is a good example of where semantic annotation can help.
ask them to comment on this issue.