v0.2.0

github-actions released this 13 Nov 01:56

v0.2.0 (2024-11-13)

Build

  • build: integrate ontogpt for SPIRES-based annotation

Add the ontogpt Python package to enable annotation capabilities using
the SPIRES method (Structured Prompt Interrogation and Recursive
Extraction of Semantics). (235924e)

  • build: upgrade soso dependency to latest version

Upgrade the soso dependency to its latest version in the environment
files. This update incorporates significant changes affecting the
Schema.org representation of EML metadata and generally enhances the
connections between metadata nodes and vocabulary terms. (58b282a)

  • build: integrate soso package for EML conversion

Add the soso package as a dependency to facilitate the conversion of
EML metadata to Science-On-Schema.Org JSON-LD format. This
JSON-LD representation serves as the foundation for the knowledge
graph. (d368f92)

Documentation

  • docs: provide QUDT annotation demonstration

Add a demonstration of the QUDT unit annotation process for EML records,
providing a practical guide for users. (f52a448)

  • docs: remove outdated parameter description

Remove the outdated description for the annotator parameter from the
documentation for the add_env_medium_annotations_to_workbook function. (c745d2d)

  • docs: clarify annotation refresh process

Update the documentation explaining how to force a re-annotation of
specific elements for a particular predicate. (41176b3)

  • docs: clarify annotation reuse and skipping logic

Update code comments to clarify the distinction between reusing existing
annotations from the workbook and skipping re-annotation for elements
that have already been annotated.

Related to commit: 7287f31 (5cfaa20)

  • docs: document main interface for clarity

Add detailed documentation for the main interface to enhance
understanding and usage. (4febaf6)

  • docs: remove obsolete function reference

Remove the reference to the deprecated function elements_to_df to improve clarity. (54589dc)

  • docs: correct GitHub username references

Update GitHub username references in documentation URLs to ensure
accuracy. (0922168)

Feature

  • feat: enhance annotate_eml flexibility with object/path input/output

Expand the annotate_eml function to accept either an object or a file
path as input and to return either as output, increasing its
versatility and ease of use in various scenarios.

Update call signatures in functions that use annotate_eml. (e5a96a5)

  • feat: implement duplicate annotation removal in workbook annotators

Integrate the delete_duplicate_annotations function into workbook
annotators to proactively remove duplicate annotations, ensuring data
quality and consistency. (7f388e3)

  • feat: remove empty workbook rows for clarity

Add a utility function to optionally remove empty rows from the
workbook, enhancing readability. This is possible due to recent changes
that enable direct annotation from EML content, eliminating the need for
pre-populated rows for workbook annotators to reference.

Note, to preserve potential human annotation opportunities, the empty
row removal function is not applied within existing workbook annotators
or the annotate_workbook wrapper. (89ae337)

  • feat: annotate methods with OntoGPT

Add a function to annotate the methods of measurement using the
OntoGPT package to be more precise and accurate than currently possible
using the BioPortal annotator. (d72647e)

  • feat: enhance get_description to include methods

Extend the get_description function to retrieve methods information,
providing a unified approach for accessing this data. (adad111)

  • feat: annotate research topic with OntoGPT

Add a function to annotate the research topic using the OntoGPT
package to be more precise and accurate than currently possible using
the BioPortal annotator. (42dbf77)

  • feat: annotate environmental medium with OntoGPT

Add a function to annotate the environmental medium using the OntoGPT
package to be more precise and accurate than currently possible using
the BioPortal annotator. (48b3246)

  • feat: annotate local environmental context with OntoGPT

Add a function to annotate the local scale environmental context using
the OntoGPT package to be more precise and accurate than currently
possible using the BioPortal annotator. (2335413)

  • feat: annotate broad environmental context with OntoGPT

Add a function to annotate the broad scale environmental context using
the OntoGPT package to be more precise and accurate than currently
possible using the BioPortal annotator. (1c825bb)

  • feat: annotate processes with OntoGPT

Add a function to annotate environmental (and other) processes using
the OntoGPT package to be more precise and accurate than currently
possible using the BioPortal annotator. (aaefbff)

  • feat: annotate measurements with OntoGPT

Update the add_measurement_type_annotations_to_workbook function to
enable annotation of measurement types using the OntoGPT package, which
is more precise and accurate than currently possible using the
BioPortal annotator. (b91c65c)

  • feat: implement SPIRES-based annotation with ontogpt

Add a get_ontogpt_annotation function to leverage the ontogpt package
for SPIRES-based annotation. This new approach complements existing
lexical-based methods, offering a more sophisticated semantic
understanding of text.

Add ontogpt templates to support this functionality. (62cc039)

  • feat: introduce add_measurement_type_annotations_to_workbook function

Create a dedicated add_measurement_type_annotations_to_workbook
function to streamline the process of adding measurement type
annotations to annotation workbook files. This encapsulates existing
annotation logic, promoting code modularity and maintainability. (ad04013)

  • feat: introduce add_dataset_annotation_to_workbook function

Create a dedicated add_dataset_annotation_to_workbook function to
streamline the process of adding dataset annotations to annotation
workbook files. This encapsulates existing annotation logic, promoting
code modularity and maintainability. (a0c1ccb)

  • feat: implement write_eml for standardized output

Introduce the write_eml function to standardize the output format and
avoid unintended data loss. (eda07a5)

  • feat: implement write_workbook for standardized output

Introduce the write_workbook function to standardize the output format
and avoid unintended data loss. (3eeb606)

  • feat: add delete_annotations for workbook cleanup

Add a delete_annotations function to remove annotations from a
workbook based on various criteria. This flexible approach allows for
targeted removal of annotations, enhancing workbook maintenance and
organization. (410d666)

  • feat: remove duplicate annotations from workbook

Introduce a utility function to identify and remove duplicate
annotations within EML elements. Rows are considered duplicates when
their element_xpath, object, and object_id values match. We keep the
most recent annotation, based on the date field, so that improvements
the annotator makes to other fields are retained. (5b1e913)
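
As a rough illustration of the rule described above, the sketch below
applies it to plain dictionaries rather than to the workbook itself. The
function name drop_duplicate_rows, the sample rows, and the EX: identifiers
are hypothetical; only the field names element_xpath, object, object_id,
and date come from this entry.

```python
def drop_duplicate_rows(rows):
    """Keep only the most recent row per (element_xpath, object, object_id)."""
    latest = {}
    for row in rows:
        key = (row["element_xpath"], row["object"], row["object_id"])
        # Assumes ISO 8601 date strings, which sort chronologically as text.
        if key not in latest or row["date"] > latest[key]["date"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical workbook rows; the EX: identifiers are placeholders.
rows = [
    {"element_xpath": "/eml:eml/dataset", "object": "lake",
     "object_id": "EX:0001", "date": "2024-01-01", "author": "a"},
    {"element_xpath": "/eml:eml/dataset", "object": "lake",
     "object_id": "EX:0001", "date": "2024-06-01", "author": "b"},
    {"element_xpath": "/eml:eml/dataset", "object": "river",
     "object_id": "EX:0002", "date": "2024-02-01", "author": "a"},
]
deduplicated = drop_duplicate_rows(rows)
```

Here the two "lake" rows collapse into the later one, so the newer author
value survives.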

  • feat: add utility functions for workbook field population

Create utility functions to populate workbook fields, promoting code
reuse and consistency across various annotators. (9dac35e)

  • feat: introduce workbook row initialization

Create a function to initialize an empty workbook row to be subsequently
filled with content. This provides a foundation for independent
annotation operations without relying on an existing workbook,
addressing scenarios where a new workbook row needs to be created from
scratch.

Currently, annotators create rows of annotation data by copying from an
existing workbook, then modifying the rows' contents. (4cd6858)

  • feat: implement QUDT annotation for workbooks

Introduce a function to add QUDT annotations to existing workbooks.
This enables users to apply QUDT annotations during the default
annotation process or to existing annotated workbooks. (b06ae95)

  • feat: integrate EML unit conversion to QUDT annotations

Develop an annotator function to convert EML standard and custom units
into QUDT representations, leveraging the LTER/EDI-developed EML-to-QUDT
web service. This enhances metadata interoperability and semantic
richness. (1afe98c)

  • feat: convert string literals to URI refs

During graph creation, convert string literals in SOSO documents to URI
references so that URI strings link to vocabulary term IDs. This
fallback measure creates linked data when the URI value cannot be
elevated to an @id for a node.

Apply this approach to specific locations:

  • keyword/DefinedTerm/url
  • variableMeasured/PropertyValue/propertyID
  • variableMeasured/PropertyValue/measurementTechnique
  • variableMeasured/PropertyValue/unitCode
  • license

Avoid conversion for non-URL/URI text strings by checking for a URL
pattern first.

Add an EML and SOSO file pair that contains these graph properties for
testing purposes. (0be6551)

  • feat: batch process for creating shadow metadata

Introduce the create_shadow_eml_files wrapper function to convert EML
files into shadow EML in bulk, aligning with existing workflow functions
in the main module. (30d28b6)

  • feat: introduce create_shadow_eml wrapper

Create a create_shadow_eml wrapper function to streamline the process
of applying shadow metadata enrichment functions to individual EML
documents. (31d1d01)

  • feat: ensure EML userId is a URL for linked data compatibility

Modify the EML userId element to be a URL, whenever possible,
facilitating linked data compatibility. If the current value isn't a
URL, it's converted using the directory attribute as a base URL.

This function addresses a practice, previously recommended by EDI, of
setting the base URL as the directory attribute and the remaining
identifier as the element value. This practice has been recently
deprecated. (8c2f8dc)
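
The conversion described above might look roughly like the following
sketch, which uses the standard library's ElementTree. The function name
user_id_to_url and the placeholder identifier are hypothetical, and the
actual implementation may treat edge cases differently.

```python
import xml.etree.ElementTree as ET


def user_id_to_url(user_id):
    """Rewrite a userId element's text as a URL when it is not one already."""
    value = (user_id.text or "").strip()
    directory = (user_id.get("directory") or "").strip()
    if not value or value.startswith(("http://", "https://")):
        return user_id  # nothing to convert, or already a URL
    if directory.startswith(("http://", "https://")):
        # Use the directory attribute as the base URL for the identifier.
        user_id.text = directory.rstrip("/") + "/" + value
    return user_id


# Placeholder identifier, for illustration only.
element = ET.fromstring(
    '<userId directory="https://orcid.org">0000-0000-0000-0000</userId>'
)
user_id_to_url(element)
```

When the directory attribute is not a URL, this sketch leaves the value
untouched, in keeping with the "whenever possible" wording above.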

  • feat: introduce module for shadow metadata generation

Create a new silhouette module to encapsulate functions involved in
generating shadow metadata. This module focuses on refining "raw" or
"level-0" metadata to enable specific applications that are not feasible
with the original metadata.

Shadow metadata is a nascent concept we implement here for prototyping
purposes only. (49e81cf)

  • feat: add @id to Dataset type in SOSO files

Enhance the main.create_soso_files function to include the @id
property for the Dataset type. Use the data package landing page URL
instead of the DOI to provide a more transparent and accessible
identifier.

This update also leverages improvements in the soso package v0.2.0 to
enhance downstream linkages between knowledge graph URIs. (9c19ec2)

  • feat: implement heuristic URI validation utility

Add a utility function to heuristically determine if a given string is a
URI, enhancing validation and quality control capabilities. This
function aims to differentiate URIs from unstructured text descriptions
without guaranteeing absolute certainty. (872ef5e)
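
One way to approximate such a heuristic with the standard library is
sketched below. The function name looks_like_url and the specific checks
(no whitespace, an http or https scheme, a network location) are
assumptions made for illustration, not the package's actual rules.

```python
from urllib.parse import urlparse


def looks_like_url(text):
    """Heuristically decide whether a string is a URL rather than free text."""
    if not isinstance(text, str):
        return False
    text = text.strip()
    if not text or any(ch.isspace() for ch in text):
        return False  # free-text descriptions usually contain whitespace
    parsed = urlparse(text)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A check like this accepts "https://example.org/term/1" and rejects
"water temperature", but it cannot confirm that the address resolves.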

  • feat: introduce create_kgraph function

Create a new function, main.create_kgraph, to encapsulate the
process of building the knowledge graph using graph.load_graph.
This function provides a centralized entry point for future graph
enhancements. (691d926)

  • feat: integrate metadata and vocabulary loading into load_graph

Create the load_graph function to handle both metadata and vocabulary
loading in a unified manner. This allows for flexible usage scenarios
where only metadata or vocabulary files are provided.

This approach addresses the ConjunctiveGraph issue outlined in rdflib
version 7.0.0 documentation. (6589ac0)

  • feat: load vocabularies into a graph

Introduce a function to load target vocabularies into a graph. (cca7cce)

  • feat: establish main module for codebase

Create the foundational main module for implementing the codebase. (9ee3a00)

  • feat: integrate empty tag removal in workbook.create

Incorporate the delete_empty_tags functionality into workbook.create
to prevent errors caused by processing empty XML tags. (6847f91)

  • feat: prevent processing errors from empty XML tags

Create a helper function to remove empty XML tags to avoid unexpected
behavior when processing elements like keywords in
workbook.get_description. (574970e)
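
A minimal sketch of this kind of helper, written against the standard
library's ElementTree, is shown below. The function name
remove_empty_tags and the definition of "empty" used here (no text, no
attributes, no children) are assumptions; the package's helper may apply
different criteria.

```python
import xml.etree.ElementTree as ET


def remove_empty_tags(root):
    """Remove elements that have no text, no attributes, and no children."""
    removed = True
    while removed:  # repeat, since removing a child can leave its parent empty
        removed = False
        for parent in list(root.iter()):
            for child in list(parent):
                is_empty = (
                    len(child) == 0
                    and not (child.text or "").strip()
                    and not child.attrib
                )
                if is_empty:
                    parent.remove(child)
                    removed = True
    return root


root = ET.fromstring(
    "<dataset><title>Lake data</title><keywordSet><keyword/></keywordSet></dataset>"
)
remove_empty_tags(root)
```

In this example the empty keyword is removed first, which in turn leaves
keywordSet empty, so it is removed on the next pass.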

  • feat: implement EML annotation from worksheet

Introduce a function to annotate EML files based on their associated
worksheets. (281c098)

  • feat: implement automatic workbook annotation

Add functionality to automatically annotate workbooks using an
annotator.

Note: Annotation quality may be limited. Consider these as
recommendations rather than definitive annotations. (c384477)

  • feat: integrate BioPortal annotator for term recommendation

Implement functionality to access the BioPortal annotator API. This
enables:

  • Text input analysis
  • Relevant class retrieval
  • Recommendations for data authors and curators (9822406)

  • feat: introduce configuration file for spinneret

Add support for a configuration file to store API keys and other
parameters, enhancing flexibility and security. (1231973)

  • feat: add element descriptions to workbook

Add the corresponding element description directly within the workbook
to streamline the annotation process and reduce potential errors. This
eliminates the need for manual navigation to the data package landing
page to verify element details. (9d6da08)

Fix

  • fix: preserve case sensitivity in local_model arguments

Modified the handling of local_model arguments to maintain case
sensitivity, preventing potential errors when calling models that are
case-sensitive. (2fd7fa3)

  • fix: ensure correct file extension for OntoGPT output

Correct the OntoGPT output file extension to .json to align with the
expected file format and prevent potential downstream processing issues. (677dd0a)

  • fix: gracefully handle missing OntoGPT output files

Implement error handling to gracefully handle cases where the OntoGPT
process fails to produce an output file. This prevents downstream
processing errors. (3636672)
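
The general pattern can be sketched as follows. The function name
read_output is hypothetical, and returning None for a missing or
unparseable file is an assumption about one reasonable behavior, not a
description of the package's code.

```python
import json
from pathlib import Path


def read_output(path):
    """Return parsed JSON from an output file, or None if missing or invalid."""
    path = Path(path)
    if not path.is_file():
        return None  # the upstream process produced no output file
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None  # a file exists but does not contain valid JSON
```

Callers can then test for None and skip the element rather than fail.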

  • fix: handle optional methods element gracefully

Modify the add_methods_annotations_to_workbook function to gracefully
handle the optional methods element in EML files, preventing errors when
processing EML documents that lack this element. (591ed67)

  • fix: handle ungrounded IDs gracefully in CURIE expansion

Modify the CURIE expansion process to gracefully handle ungrounded IDs,
preventing errors and allowing for downstream processing. (b493340)
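
To illustrate the idea, the sketch below expands a CURIE against a small
prefix map and returns None when the prefix is unknown. It assumes an
ungrounded ID is one whose prefix is absent from the map (the "AUTO:"
example is an assumption); the function name expand_curie and the
one-entry prefix map are illustrative only.

```python
# Illustrative prefix map; a real map would cover every target vocabulary.
PREFIXES = {"ENVO": "http://purl.obolibrary.org/obo/ENVO_"}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a CURIE to a URI, or return None when it cannot be grounded."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or prefix not in prefixes:
        return None  # e.g. an ungrounded ID such as "AUTO:lake%20water"
    return prefixes[prefix] + local_id
```

Returning None rather than raising lets downstream steps filter out
ungrounded values and continue.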

  • fix: remove empty XML tags during EML loading

Remove empty XML tags during the EML loading process to prevent
potential errors and inconsistencies in subsequent processing steps.

This was in the code base prior to commit 6847f91, but not included in
the refactor commit 44468ea. (4c42c93)

  • fix: enhance author attribution in add_qudt_annotations_to_workbook

Modify the add_qudt_annotations_to_workbook function to use a more
descriptive author value, providing better provenance. The full module
path of the Python function is now used to identify the source of the
annotation. (3047953)

  • fix: ensure correct annotation placement in EML

Adjust annotation placement to precede optional elements, ensuring
schema-compliant EML generation. (346281c)

  • fix: make get_description more resilient to missing elements

Modify workbook.get_description to gracefully handle missing optional
elements (abstract, keywordSet), preventing unnecessary failures. (87a66d0)

  • fix: correct BioPortal annotator parameter format

Adjust BioPortal annotator parameters to use lowercase boolean
strings for compatibility with the service's URL encoding. (b95961f)
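
The underlying issue is that Python renders booleans as "True" and
"False" when they are URL-encoded. A small sketch of the normalization
follows; the function name encode_params is hypothetical and the
parameter names in the example are illustrative.

```python
from urllib.parse import urlencode


def encode_params(params):
    """URL-encode parameters, writing booleans as 'true' or 'false'."""
    normalized = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return urlencode(normalized)


# Without normalization, urlencode would emit "longest_only=True".
query = encode_params(
    {"text": "lake water", "longest_only": True, "exclude_numbers": False}
)
```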

Performance

  • perf: skip annotation if element is annotated

Implement a mechanism to skip annotating elements that already have
annotations for a specific predicate, improving performance and reducing
redundant processing. (a508e9b)
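
The check described above could be sketched as follows, using plain
dictionaries in place of workbook rows. The function name is_annotated,
the sample rows, and the placeholder identifier are hypothetical; the
field names element_id, predicate, and object_id appear elsewhere in
these notes.

```python
def is_annotated(rows, element_id, predicate):
    """Return True if the element already has an annotation for the predicate."""
    return any(
        row["element_id"] == element_id
        and row["predicate"] == predicate
        and row.get("object_id")  # an empty object_id means no annotation yet
        for row in rows
    )


# Hypothetical rows; "EX:0001" is a placeholder identifier.
rows = [
    {"element_id": "attr-1", "predicate": "contains measurements of type",
     "object_id": "EX:0001"},
    {"element_id": "attr-2", "predicate": "contains measurements of type",
     "object_id": ""},
]
to_annotate = [
    row["element_id"]
    for row in rows
    if not is_annotated(rows, row["element_id"], row["predicate"])
]
```

Only attr-2 is left for the annotator, so no request is made for attr-1.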

  • perf: extend annotation with caching to all workbook annotators

Extend the attribute element annotation caching strategy to all
annotators, improving performance and reducing redundant processing,
especially for large-scale annotation tasks. (7a9fa7c)

  • perf: optimize attribute annotation with caching

Implement cache retrieval for attribute annotations to significantly
reduce processing time and minimize variance in results, especially
for annotators like OntoGPT that can exhibit non-deterministic
behavior. (7287f31)

Refactor

  • refactor: enhance duplicate annotation detection criteria

Expand the criteria for identifying and removing duplicate annotations
in the workbook to include predicate and predicate_id in addition to
element_xpath, object, and object_id. This more comprehensive
approach ensures the removal of truly redundant annotations. (31067f8)

  • refactor: remove extraneous commented code

Remove unnecessary commented code block that was intended to be removed
in commit 7287f31. (83ed3d5)

  • refactor: prevent ungrounded annotations from being added to EML

Implement a filter to prevent ungrounded annotations from being added to
EML documents. This helps maintain data integrity and avoids potential
confusion or errors. (e9af9cb)

  • refactor: modularize workbook annotation for improved accuracy

Refactor the workbook annotation process to use modular annotators,
enhancing accuracy and precision. The BioPortal annotator is retained as
an option for specific use cases. (05222c3)

  • refactor: remove tentative annotation skipping logic

Remove the tentative logic for skipping annotations based on existing
annotations. This simplifies the annotation process and allows for
multiple annotations per element, which is useful for experimentation
and handling potential annotation conflicts. (33e2731)

  • refactor: annotate workbook by predicate

Restructure the annotate_workbook function to organize annotation
processes by predicate instead of element. This enables more modular
addition of different annotation types to the same element, improving
code flexibility and maintainability. (cbae977)

  • refactor: pass EML to annotate_workbook subroutines for context

Add the EML path to the annotate_workbook function to enable context
for annotation subroutines like add_dataset_annotations_to_workbook.
This provides access to the original EML file, enabling the use of
descriptive text beyond the workbook's description field.

This is crucial for annotating elements like units, which require the
EML's raw unit name, information that is not listed in the workbook's
description field.

This also enables annotators to source descriptive text directly from
the authoritative source (i.e., the EML). The workbook descriptions
could fall out of sync with the corresponding EML document, and this
refactor addresses that risk.

Note, the description field provides context to human annotators, who
would otherwise have to browse data package landing pages, which is
arguably much less efficient. So we keep this information in the
workbook for now. (fc3dd16)

  • refactor: integrate write_eml

Integrate the write_eml function into the codebase to ensure
consistent and accurate EML output. (390da39)

  • refactor: integrate write_workbook

Integrate the write_workbook function into the codebase to ensure
consistent and accurate workbook output. (3e90f64)

  • refactor: integrate EML and workbook loading utilities

Integrate the load_workbook and load_eml utility functions to
streamline data loading and improve code maintainability. Additionally,
addressed errors introduced by the standardization process, requiring
adjustments to both the utility functions and their usage throughout the
codebase. (44468ea)

  • refactor: centralize EML and workbook loading for code clarity

Introduce utility functions to streamline the loading of EML and
workbook data from various input formats (e.g., file paths, DataFrames).
This centralizes common data loading operations, improving code
readability and maintainability. (f0f2e5e)

  • refactor: streamline workbook row creation in annotators

Leverage workbook row initializer utility functions to simplify the
process of creating new annotation rows within workbook annotators. This
approach eliminates the need for error-prone copy and paste row creation
and promotes code consistency. (962863c)

  • refactor: integrate delete_annotations to workbook annotation

Integrate the delete_annotations function into workbook annotators to
streamline the code and improve maintainability. (e74495a)

  • refactor: standardize workbook annotator return types

Modify workbook annotators to consistently return DataFrames, regardless
of input arguments. This simplifies code logic and improves function
clarity. (400cc53)

  • refactor: enable flexible I/O for workbook annotators

Enable workbook annotators to accept both file paths and DataFrames as
input and output. This flexibility reduces the need for redundant file
operations, streamlining the annotation process. (71f2a3e)

  • refactor: utilize existing id attribute values in EML

Leverage the existing id attribute values from EML elements to
populate the workbook's element_id field whenever possible. This
eliminates the need for arbitrary UUID assignment, reducing confusion
and simplifying the process. (92f10b4)
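
In outline, the fallback could look like the sketch below. The function
name get_element_id is hypothetical, and falling back to a generated
UUID when no id attribute exists is an assumption based on the wording
"whenever possible".

```python
import uuid
import xml.etree.ElementTree as ET


def get_element_id(element):
    """Prefer the element's own id attribute; otherwise generate a UUID."""
    return element.get("id") or str(uuid.uuid4())


with_id = ET.fromstring('<attribute id="attr-1"/>')
without_id = ET.fromstring("<attribute/>")
ids = [get_element_id(with_id), get_element_id(without_id)]
```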

  • refactor: integrate workbook row creation

Incorporate the newly created workbook row generation function into the
workbook.create function to streamline the process and reduce code
duplication. (d7a7f7c)

  • refactor: create soso files from shadow EML files

Process shadow EML files by passing them as input to
main.create_soso_files. Rename the soso file directory to raw to
imply different processing levels for future use. (e4c34c9)

  • refactor: rename is_uri function for accuracy

Rename the is_uri function to is_url for accuracy, as the function
specifically checks for URLs, not URIs. (ded2748)

  • refactor: rename silhouette module to shadow for clarity

Rename the silhouette module to shadow to align with terminology
commonly used within EDI, making the language more accessible and
familiar to users. (badea82)

  • refactor: rename load_graph to create_graph for clarity

Rename the function to create_graph to better reflect its primary
purpose of constructing the knowledge graph from various data
sources, rather than solely loading an existing graph. (f9b5f43)

  • refactor: deprecate existing load functions in favor of load_graph

Deprecated the load_metadata and load_vocabularies functions in
favor of the load_graph function. This unified approach preserves
blank node references, addressing limitations of the ConjunctiveGraph
object as outlined in rdflib version 7.0.0 documentation. (52ffe8e)

  • refactor: rename combine_jsonld_files to load_metadata

Rename the function to load_metadata to more accurately reflect its
purpose of loading multiple metadata files into a graph. (9eae7f5)

  • refactor: output one workbook per EML file

Optimize performance and maintainability by generating individual
workbooks for each EML file instead of a single large workbook. This
addresses potential issues with:

  • Performance degradation due to increased data volume.
  • Complexity and difficulty in managing a large, complex table.
  • Data corruption risks associated with a single, interconnected table. (b6e7c92)

Test

  • test: refine workbook annotator test clarity

Remove unnecessary commentary from workbook annotator tests to improve
clarity and focus on the core testing logic. (cbe98ae)

  • test: integrate has_annotations utility

Integrate the has_annotations utility function into relevant test
cases to ensure accurate annotation verification. (987c5b0)

  • test: implement has_annotations utility for testing

Create a has_annotations utility function to verify the presence or
absence of annotations in a workbook, improving test efficiency. (2831176)

  • test: address Pandas warnings in workbook annotation

Resolve Pandas warnings in the test_annotate_workbook module to
prevent unexpected behavior, specifically:

  • Correct chained assignment issues by modifying a copy of the row
    then reassigning.
  • Specify data types for columns in the annotation workbook to avoid
    type-related warnings. (8fc3263)

  • test: mock API calls to annotator with manual option

Mock API calls in the test_annotator module to enable offline testing.
However, provide a manual option for testing against the real API to
ensure ongoing functionality. (9eb93c7)

  • test: include missing annotated workbook for testing

Add the annotated workbook version for edi.3.9 to the test suite,
addressing an oversight from commit c384477. (b4d4374)