v0.2.0 (2024-11-13)
Build
- build: integrate ontogpt for SPIRES-based annotation
Add the ontogpt Python package to enable annotation capabilities using
the SPIRES method (Structured Prompt Interrogation and Recursive
Extraction of Semantics). (235924e)
- build: upgrade soso dependency to latest version
Upgrade the soso dependency to its latest version in the environment
files. This update incorporates significant changes affecting the
Schema.org representation of EML metadata and generally enhances the
connections between metadata nodes and vocabulary terms. (58b282a)
- build: integrate soso package for EML conversion
Add the soso package as a dependency to facilitate the conversion of
EML metadata to Science-On-Schema.Org JSON-LD format. This
JSON-LD representation serves as the foundation for the knowledge
graph. (d368f92)
Documentation
- docs: provide QUDT annotation demonstration
Add a demonstration of the QUDT unit annotation process for EML records,
providing a practical guide for users. (f52a448)
- docs: remove outdated parameter description
Remove the outdated description for the annotator parameter from the
documentation for the add_env_medium_annotations_to_workbook
function. (c745d2d)
- docs: clarify annotation refresh process
Update the documentation explaining how to force a re-annotation of
specific elements for a particular predicate. (41176b3)
- docs: clarify annotation reuse and skipping logic
Update code comments to clarify the distinction between reusing existing
annotations from the workbook and skipping re-annotation for elements
that have already been annotated.
Related to commit: 7287f31 (5cfaa20)
- docs: document main interface for clarity
Add detailed documentation for the main interface to enhance
understanding and usage. (4febaf6)
- docs: remove obsolete function reference
Remove the reference to the deprecated elements_to_df function to improve clarity. (54589dc)
- docs: correct GitHub username references
Update GitHub username references in documentation URLs to ensure
accuracy. (0922168)
Feature
- feat: enhance annotate_eml flexibility with object/path input/output
Expand the annotate_eml function to accept and return either an object
or a file path, increasing its versatility and ease of use in
various scenarios.
Update call signatures in functions that use annotate_eml. (e5a96a5)
- feat: implement duplicate annotation removal in workbook annotators
Integrate the delete_duplicate_annotations function into workbook
annotators to proactively remove duplicate annotations, ensuring data
quality and consistency. (7f388e3)
- feat: remove empty workbook rows for clarity
Add a utility function to optionally remove empty rows from the
workbook, enhancing readability. This is possible due to recent changes
that enable direct annotation from EML content, eliminating the need for
pre-populated rows for workbook annotators to reference.
Note, to preserve potential human annotation opportunities, the empty
row removal function is not applied within existing workbook annotators
or the annotate_workbook wrapper. (89ae337)
- feat: annotate methods with OntoGPT
Add a function to annotate the methods of measurement using the
OntoGPT package to be more precise and accurate than currently possible
using the BioPortal annotator. (d72647e)
- feat: enhance get_description to include methods
Extend the get_description function to retrieve methods information,
providing a unified approach for accessing this data. (adad111)
- feat: annotate research topic with OntoGPT
Add a function to annotate the research topic using the OntoGPT
package to be more precise and accurate than currently possible using
the BioPortal annotator. (42dbf77)
- feat: annotate environmental medium with OntoGPT
Add a function to annotate the environmental medium using the OntoGPT
package to be more precise and accurate than currently possible using
the BioPortal annotator. (48b3246)
- feat: annotate local environmental context with OntoGPT
Add a function to annotate the local scale environmental context using
the OntoGPT package to be more precise and accurate than currently
possible using the BioPortal annotator. (2335413)
- feat: annotate broad environmental context with OntoGPT
Add a function to annotate the broad scale environmental context using
the OntoGPT package to be more precise and accurate than currently
possible using the BioPortal annotator. (1c825bb)
- feat: annotate processes with OntoGPT
Add a function to annotate environmental (and other) processes using
the OntoGPT package to be more precise and accurate than currently
possible using the BioPortal annotator. (aaefbff)
- feat: annotate measurements with OntoGPT
Update the add_measurement_type_annotations_to_workbook function to
enable annotation of measurement types using the OntoGPT package to be
more precise and accurate than currently possible using the BioPortal
annotator. (b91c65c)
- feat: implement SPIRES-based annotation with ontogpt
Add a get_ontogpt_annotation function to leverage the ontogpt package
for SPIRES-based annotation. This new approach complements existing
lexical-based methods, offering a more sophisticated semantic
understanding of text.
Add ontogpt templates to support this functionality. (62cc039)
- feat: introduce add_measurement_type_annotations_to_workbook function
Create a dedicated add_measurement_type_annotations_to_workbook
function to streamline the process of adding measurement type
annotations to annotation workbook files. This encapsulates existing
annotation logic, promoting code modularity and maintainability. (ad04013)
- feat: introduce add_dataset_annotation_to_workbook function
Create a dedicated add_dataset_annotation_to_workbook function to
streamline the process of adding dataset annotations to annotation
workbook files. This encapsulates existing annotation logic, promoting
code modularity and maintainability. (a0c1ccb)
- feat: implement write_eml for standardized output
Introduce the write_eml function to standardize the output format and
avoid unintended data loss. (eda07a5)
- feat: implement write_workbook for standardized output
Introduce the write_workbook function to standardize the output format
and avoid unintended data loss. (3eeb606)
- feat: add delete_annotations for workbook cleanup
Add a delete_annotations function to remove annotations from a
workbook based on various criteria. This flexible approach allows for
targeted removal of annotations, enhancing workbook maintenance and
organization. (410d666)
- feat: remove duplicate annotations from workbook
Introduce a utility function to identify and remove duplicate
annotations within EML elements. Duplicates are rows in which the
element_xpath, object, and object_id values match. We prioritize
the most recent annotations based on the date field to allow
improvements to other fields set by the annotator. (5b1e913)
- feat: add utility functions for workbook field population
Create utility functions to populate workbook fields, promoting code
reuse and consistency across various annotators. (9dac35e)
- feat: introduce workbook row initialization
Create a function to initialize an empty workbook row to be subsequently
filled with content. This provides a foundation for independent
annotation operations without relying on an existing workbook,
addressing scenarios where a new workbook row needs to be created from
scratch.
Currently, annotators create rows of annotation data by copying from an
existing workbook and then modifying the rows' contents. (4cd6858)
- feat: implement QUDT annotation for workbooks
Introduce a function to add QUDT annotations to existing workbooks.
This enables users to apply QUDT annotations during the default
annotation process or to existing annotated workbooks. (b06ae95)
- feat: integrate EML unit conversion to QUDT annotations
Develop an annotator function to convert EML standard and custom units
into QUDT representations, leveraging the LTER/EDI-developed EML-to-QUDT
web service. This enhances metadata interoperability and semantic
richness. (1afe98c)
- feat: convert string literals to URI refs
During graph creation, convert string literals to URI references of SOSO
documents for links between URI strings and vocabulary term IDs. This
fallback measure creates linked data when the URI value cannot be
elevated to an @id for a node.
Apply this approach to specific locations:
- keyword/DefinedTerm/url
- variableMeasured/PropertyValue/propertyID
- variableMeasured/PropertyValue/measurementTechnique
- variableMeasured/PropertyValue/unitCode
- license
Avoid conversion for non-URL/URI text strings by checking for a URL
pattern first.
Add an EML and SOSO file pair that contains these graph properties for
testing purposes. (0be6551)
- feat: batch process for creating shadow metadata
Introduce the create_shadow_eml_files wrapper function to convert EML
files into shadow EML in bulk, aligning with existing workflow functions
in the main module. (30d28b6)
- feat: introduce create_shadow_eml wrapper
Create a create_shadow_eml wrapper function to streamline the process
of applying shadow metadata enrichment functions to individual EML
documents. (31d1d01)
- feat: ensure EML userId is a URL for linked data compatibility
Modify the EML userId element to be a URL, whenever possible,
facilitating linked data compatibility. If the current value isn't a
URL, it's converted using the directory attribute as a base URL.
This function addresses a practice, previously recommended by EDI, of
setting the base URL as the directory attribute and the remaining
identifier as the element value. This practice has been recently
deprecated. (8c2f8dc)
- feat: introduce module for shadow metadata generation
Create a new silhouette module to encapsulate functions involved in
generating shadow metadata. This module focuses on refining "raw" or
"level-0" metadata to enable specific applications that are not feasible
with the original metadata.
Shadow metadata is a nascent concept we implement here for prototyping
purposes only. (49e81cf)
- feat: add @id to Dataset type in SOSO files
Enhance the main.create_soso_files function to include the @id
property for the Dataset type. Use the data package landing page URL
instead of the DOI to provide a more transparent and accessible
identifier.
This update also leverages improvements in the soso package v0.2.0 to
enhance downstream linkages between knowledge graph URIs. (9c19ec2)
- feat: implement heuristic URI validation utility
Add a utility function to heuristically determine if a given string is a
URI, enhancing validation and quality control capabilities. This
function aims to differentiate URIs from unstructured text descriptions
without guaranteeing absolute certainty. (872ef5e)
- feat: introduce create_kgraph function
Create a new function, main.create_kgraph, to encapsulate the
process of building the knowledge graph using graph.load_graph.
This function provides a centralized entry point for future graph
enhancements. (691d926)
- feat: integrate metadata and vocabulary loading into load_graph
Create the load_graph function to handle both metadata and vocabulary
loading in a unified manner. This allows for flexible usage scenarios
where only metadata or vocabulary files are provided.
This approach addresses the ConjunctiveGraph issue outlined in rdflib
version 7.0.0 documentation. (6589ac0)
- feat: load vocabularies into a graph
Introduce a function to load target vocabularies into a graph. (cca7cce)
- feat: establish main module for codebase
Create the foundational main module for implementing the codebase. (9ee3a00)
- feat: integrate empty tag removal in workbook.create
Incorporate the delete_empty_tags functionality into workbook.create
to prevent errors caused by processing empty XML tags. (6847f91)
- feat: prevent processing errors from empty XML tags
Create a helper function to remove empty XML tags to avoid unexpected
behavior when processing elements like keywords in
workbook.get_description. (574970e)
- feat: implement EML annotation from worksheet
Introduce a function to annotate EML files based on their associated
worksheets. (281c098)
- feat: implement automatic workbook annotation
Add functionality to automatically annotate workbooks using an
annotator.
Note: Annotation quality may be limited. Consider these as
recommendations rather than definitive annotations. (c384477)
- feat: integrate BioPortal annotator for term recommendation
Implement functionality to access the BioPortal annotator API. This
enables:
- Text input analysis
- Relevant class retrieval
- Recommendations for data authors and curators (9822406)
- feat: introduce configuration file for spinneret
Add support for a configuration file to store API keys and other
parameters, enhancing flexibility and security. (1231973)
- feat: add element descriptions to workbook
Add the corresponding element description directly within the workbook
to streamline the annotation process and reduce potential errors. This
eliminates the need for manual navigation to the data package landing
page to verify element details. (9d6da08)
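The duplicate-removal rule described in this section (rows are duplicates when element_xpath, object, and object_id match, and the most recent date wins) can be sketched in plain Python. This is an illustrative sketch only, not the package's delete_duplicate_annotations implementation: the helper name drop_duplicate_rows, the list-of-dicts workbook model, the ISO 8601 date strings, and the EX: identifiers are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Fields that define a duplicate, as listed in the changelog entry.
KEY_FIELDS: Tuple[str, ...] = ("element_xpath", "object", "object_id")


def drop_duplicate_rows(
    rows: Iterable[Dict[str, str]],
    key_fields: Tuple[str, ...] = KEY_FIELDS,
) -> List[Dict[str, str]]:
    """Keep one row per key, preferring the row with the latest date.

    Hypothetical helper: assumes each row is a dict with a "date" value in
    ISO 8601 format, so that string comparison orders dates correctly.
    """
    newest: Dict[Tuple[str, ...], Dict[str, str]] = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        kept = newest.get(key)
        # Replace the kept row only when this one is at least as recent.
        if kept is None or row["date"] >= kept["date"]:
            newest[key] = row
    return list(newest.values())


# Made-up rows; the identifiers are placeholders, not real vocabulary terms.
rows = [
    {"element_xpath": "/eml:eml/dataset", "object": "lake",
     "object_id": "EX:001", "date": "2024-01-01", "author": "a"},
    {"element_xpath": "/eml:eml/dataset", "object": "lake",
     "object_id": "EX:001", "date": "2024-06-01", "author": "b"},
    {"element_xpath": "/eml:eml/dataset", "object": "river",
     "object_id": "EX:002", "date": "2024-01-01", "author": "a"},
]
deduplicated = drop_duplicate_rows(rows)
```

Passing a longer key_fields tuple that also includes predicate and predicate_id would approximate the stricter criteria described under Refactor.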
Fix
- fix: preserve case sensitivity in local_model arguments
Modify the handling of local_model arguments to maintain case
sensitivity, preventing potential errors when calling models that are
case-sensitive. (2fd7fa3)
- fix: ensure correct file extension for OntoGPT output
Correct the OntoGPT output file extension to .json to align with the
expected file format and prevent potential downstream processing issues. (677dd0a)
- fix: gracefully handle missing OntoGPT output files
Implement error handling to gracefully handle cases where the OntoGPT
process fails to produce an output file. This prevents downstream
processing errors. (3636672)
- fix: handle optional methods element gracefully
Modify the add_methods_annotations_to_workbook function to gracefully
handle the optional methods element in EML files, preventing errors when
processing EML documents that lack this element. (591ed67)
- fix: handle ungrounded IDs gracefully in CURIE expansion
Modify the CURIE expansion process to gracefully handle ungrounded IDs,
preventing errors and allowing for downstream processing. (b493340)
- fix: remove empty XML tags during EML loading
Remove empty XML tags during the EML loading process to prevent
potential errors and inconsistencies in subsequent processing steps.
This was in the code base prior to commit 6847f91, but not included in the
refactor commit 44468ea. (4c42c93)
- fix: enhance author attribution in add_qudt_annotations_to_workbook
Modify the add_qudt_annotations_to_workbook function to use a more
descriptive author value, providing better provenance. The full module
path of the Python function is now used to identify the source of the
annotation. (3047953)
- fix: ensure correct annotation placement in EML
Adjust annotation placement to precede optional elements, ensuring
schema-compliant EML generation. (346281c)
- fix: make get_description more resilient to missing elements
Modify workbook.get_description to gracefully handle missing optional
elements (abstract, keywordSet), preventing unnecessary failures. (87a66d0)
- fix: correct BioPortal annotator parameter format
Adjust BioPortal annotator parameters to use lowercase boolean
strings for compatibility with the service's URL encoding. (b95961f)
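The boolean-format fix above addresses a common pitfall: Python renders True as "True" when building a query string, whereas a web service may expect lowercase "true". The sketch below shows one way to normalize booleans before encoding. The helper name encode_params and the parameter names are illustrative assumptions, not the package's code or a statement of the service's full parameter list.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def encode_params(params: Dict[str, Any]) -> str:
    """URL-encode parameters, writing booleans as "true" or "false"."""
    normalized = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return urlencode(normalized)


# Without normalization, the boolean is rendered with a capital letter.
naive_query = urlencode({"longest_only": True})

# Parameter names here are illustrative only.
query = encode_params({"text": "lake water", "longest_only": True})
```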
Performance
- perf: skip annotation if element is annotated
Implement a mechanism to skip annotating elements that already have
annotations for a specific predicate, improving performance and reducing
redundant processing. (a508e9b)
- perf: extend annotation with caching to all workbook annotators
Extend the attribute element annotation caching strategy to all
annotators, improving performance and reducing redundant processing,
especially for large-scale annotation tasks. (7a9fa7c)
- perf: optimize attribute annotation with caching
Implement cache retrieval for attribute annotations to significantly
reduce processing time and minimize variance in results, especially
for annotators like OntoGPT that can exhibit non-deterministic
behavior. (7287f31)
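The skip-and-reuse behavior described in this section can be sketched as a lookup that runs before the annotator is called. This is a minimal sketch under stated assumptions, not the package's implementation: the workbook is modeled as a list of row dicts, and the names annotate_with_reuse and fake_annotator, the predicate value, and the EX: identifiers are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def annotate_with_reuse(
    workbook: List[Row],
    element_xpath: str,
    predicate: str,
    annotator: Callable[[str], List[Row]],
) -> List[Row]:
    """Return existing annotations when present, else call the annotator.

    Hypothetical helper: rows are dicts with "element_xpath", "predicate",
    and "object_id" keys.
    """
    existing = [
        row for row in workbook
        if row["element_xpath"] == element_xpath
        and row["predicate"] == predicate
        and row.get("object_id")  # an empty object_id means not annotated
    ]
    if existing:
        return existing  # reuse stored rows and skip the expensive call
    return annotator(element_xpath)


calls: List[str] = []


def fake_annotator(xpath: str) -> List[Row]:
    """Stand-in for a slow or non-deterministic annotator."""
    calls.append(xpath)
    return [{"element_xpath": xpath, "predicate": "example predicate",
             "object_id": "EX:002"}]


workbook = [{"element_xpath": "/a", "predicate": "example predicate",
             "object_id": "EX:001"}]
reused = annotate_with_reuse(workbook, "/a", "example predicate", fake_annotator)
fresh = annotate_with_reuse(workbook, "/b", "example predicate", fake_annotator)
```

Reusing stored rows also keeps results stable across runs, which matters when the annotator's output can vary from one call to the next.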
Refactor
- refactor: enhance duplicate annotation detection criteria
Expand the criteria for identifying and removing duplicate annotations
in the workbook to include predicate and predicate_id in addition to
element_xpath, object, and object_id. This more comprehensive
approach ensures the removal of truly redundant annotations. (31067f8)
- refactor: remove extraneous commented code
Remove unnecessary commented code block that was intended to be removed
in commit 7287f31. (83ed3d5)
- refactor: prevent ungrounded annotations from being added to EML
Implement a filter to prevent ungrounded annotations from being added to
EML documents. This helps maintain data integrity and avoids potential
confusion or errors. (e9af9cb)
- refactor: modularize workbook annotation for improved accuracy
Refactor the workbook annotation process to use modular annotators,
enhancing accuracy and precision. The BioPortal annotator is retained as
an option for specific use cases. (05222c3)
- refactor: remove tentative annotation skipping logic
Remove the tentative logic for skipping annotations based on existing
annotations. This simplifies the annotation process and allows for
multiple annotations per element, which is useful for experimentation
and handling potential annotation conflicts. (33e2731)
- refactor: annotate workbook by predicate
Restructure the annotate_workbook function to organize annotation
processes by predicate instead of element. This enables more modular
addition of different annotation types to the same element, improving
code flexibility and maintainability. (cbae977)
- refactor: pass EML to annotate_workbook subroutines for context
Add the EML path to the annotate_workbook function to enable context
for annotation subroutines like add_dataset_annotations_to_workbook.
This provides access to the original EML file, enabling the use of
descriptive text beyond the workbook's description field.
This is crucial for annotating elements like units, which require the
raw unit name from the EML, information that is not listed in the
workbook's description field.
This also enables annotators to source descriptive text directly from
the authoritative source (i.e., the EML). The workbook descriptions
could fall out of sync with the corresponding EML document, and this
refactor addresses that risk.
Note, the description field provides context to human annotators, who
would otherwise have to browse data package landing pages, which is
arguably much less efficient. So we keep this information in the
workbook for now. (fc3dd16)
- refactor: integrate write_eml
Integrate the write_eml function into the codebase to ensure
consistent and accurate EML output. (390da39)
- refactor: integrate write_workbook
Integrate the write_workbook function into the codebase to ensure
consistent and accurate workbook output. (3e90f64)
- refactor: integrate EML and workbook loading utilities
Integrate the load_workbook and load_eml utility functions to
streamline data loading and improve code maintainability. Additionally,
address errors introduced by the standardization process, requiring
adjustments to both the utility functions and their usage throughout the
codebase. (44468ea)
- refactor: centralize EML and workbook loading for code clarity
Introduce utility functions to streamline the loading of EML and
workbook data from various input formats (e.g., file paths, DataFrames).
This centralizes common data loading operations, improving code
readability and maintainability. (f0f2e5e)
- refactor: streamline workbook row creation in annotators
Leverage workbook row initializer utility functions to simplify the
process of creating new annotation rows within workbook annotators. This
approach eliminates the need for error-prone copy-and-paste row creation
and promotes code consistency. (962863c)
- refactor: integrate delete_annotations into workbook annotation
Integrate the delete_annotations function into workbook annotators to
streamline the code and improve maintainability. (e74495a)
- refactor: standardize workbook annotator return types
Modify workbook annotators to consistently return DataFrames, regardless
of input arguments. This simplifies code logic and improves function
clarity. (400cc53)
- refactor: enable flexible I/O for workbook annotators
Enable workbook annotators to accept both file paths and DataFrames as
input and output. This flexibility reduces the need for redundant file
operations, streamlining the annotation process. (71f2a3e)
- refactor: utilize existing id attribute values in EML
Leverage the existing id attribute values from EML elements to
populate the workbook's element_id field whenever possible. This
eliminates the need for arbitrary UUID assignment, reducing confusion
and simplifying the process. (92f10b4)
- refactor: integrate workbook row creation
Incorporate the newly created workbook row generation function into the
workbook.create function to streamline the process and reduce code
duplication. (d7a7f7c)
- refactor: create soso files from shadow EML files
Process shadow EML files by passing them as input to
main.create_soso_files. Rename the soso file directory to raw to
imply different processing levels for future use. (e4c34c9)
- refactor: rename is_uri function for accuracy
Rename the is_uri function to is_url for accuracy, as the function
specifically checks for URLs, not URIs. (ded2748)
- refactor: rename silhouette module to shadow for clarity
Rename the silhouette module to shadow to align with terminology
commonly used within EDI, making the language more accessible and
familiar to users. (badea82)
- refactor: rename load_graph to create_graph for clarity
Rename the function to create_graph to better reflect its primary
purpose of constructing the knowledge graph from various data
sources, rather than solely loading an existing graph. (f9b5f43)
- refactor: deprecate existing load functions in favor of load_graph
Deprecate the load_metadata and load_vocabularies functions in
favor of the load_graph function. This unified approach preserves
blank node references, addressing limitations of the ConjunctiveGraph
object as outlined in rdflib version 7.0.0 documentation. (52ffe8e)
- refactor: rename combine_jsonld_files to load_metadata
Rename the function to load_metadata to more accurately reflect its
purpose of loading multiple metadata files into a graph. (9eae7f5)
- refactor: output one workbook per EML file
Optimize performance and maintainability by generating individual
workbooks for each EML file instead of a single large workbook. This
addresses potential issues with:
- Performance degradation due to increased data volume.
- Complexity and difficulty in managing a large, complex table.
- Data corruption risks associated with a single, interconnected table. (b6e7c92)
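The distinction behind the is_uri to is_url rename in this section can be illustrated with a small heuristic check. The sketch below is not the package's is_url function. The name looks_like_url and the specific rules (no whitespace, an http or https scheme, a non-empty network location) are assumptions chosen for illustration.

```python
from urllib.parse import urlparse


def looks_like_url(text: str) -> bool:
    """Heuristically decide whether text is an HTTP(S) URL.

    Hypothetical helper: accepts only strings without internal whitespace
    that parse with an http or https scheme and a network location.
    """
    candidate = text.strip()
    if not candidate or any(char.isspace() for char in candidate):
        return False
    try:
        parsed = urlparse(candidate)
    except ValueError:
        # urlparse can reject malformed input such as unbalanced brackets.
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


is_term_url = looks_like_url("https://example.org/term/1")
is_sentence_url = looks_like_url("Measured at http://example.org weekly")
is_urn_url = looks_like_url("urn:isbn:0451450523")
```

A URN such as the last input is a valid URI but not a URL, so a check of this kind rejects it, which is why the narrower function name is the more accurate one.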
Test
- test: refine workbook annotator test clarity
Remove unnecessary commentary from workbook annotator tests to improve
clarity and focus on the core testing logic. (cbe98ae)
- test: integrate has_annotations utility
Integrate the has_annotations utility function into relevant test
cases to ensure accurate annotation verification. (987c5b0)
- test: implement has_annotations utility for testing
Create a has_annotations utility function to verify the presence or
absence of annotations in a workbook, improving test efficiency. (2831176)
- test: address Pandas warnings in workbook annotation
Resolve Pandas warnings in the test_annotate_workbook module to
prevent unexpected behavior, specifically:
- Correct chained assignment issues by modifying a copy of the row
then reassigning.
- Specify data types for columns in the annotation workbook to avoid
type-related warnings. (8fc3263)
- test: mock API calls to annotator with manual option
Mock API calls in the test_annotator module to enable offline testing.
However, provide a manual option for testing against the real API to
ensure ongoing functionality. (9eb93c7)
- test: include missing annotated workbook for testing
Add the annotated workbook version for edi.3.9 to the test suite,
addressing an oversight from commit c384477. (b4d4374)