README update, wrote comparison section on Rust vs Python API #1
proycon committed Feb 6, 2023
1 parent b21ff5c commit 410ba73
Showing 2 changed files with 110 additions and 2 deletions.
111 changes: 109 additions & 2 deletions README.md
@@ -24,7 +24,10 @@ Loading a STAM JSON file containing an annotation store:
store = stam.AnnotationStore(file="example.stam.json")
```

The annotation store is your workspace: it holds all resources, annotation sets
(i.e. keys and annotation data) and of course the actual annotations. It is a
memory-based store and you can load as much as you like into it (as long as it
fits in memory).

Retrieving anything by ID:

@@ -46,9 +49,113 @@ for annotation in store.annotations():
    except stam.StamError:
        text = "n/a"
    for data in annotation:
        print("\t".join((annotation.id, data.key().id, str(data.value()), text)))
```


Adding a resource:

```python
resource = store.add_resource(filename="my-text.txt")
```

Create a store and annotations from scratch:

```python
from stam import AnnotationStore, Selector, Offset, AnnotationDataBuilder

store = AnnotationStore(id="test")
resource = store.add_resource(id="testres", text="Hello world")
# Annotate "world" (offset 6 to 11) with a part-of-speech tag.
store.annotate(id="A1",
               target=Selector.text(resource, Offset.simple(6,11)),
               data=[AnnotationDataBuilder(id="D1", key="pos", value="noun", annotationset="testdataset")])
```

In the above example, the `AnnotationDataSet`, `DataKey` and `AnnotationData`
are created on-the-fly. You can also create them explicitly, as shown in the
next snippet, which results in the exact same store:


```python
from stam import AnnotationStore, Selector, Offset, AnnotationDataBuilder

store = AnnotationStore(id="test")
resource = store.add_resource(id="testres", text="Hello world")
# Create the annotation set, key and data explicitly before annotating.
annotationset = store.add_annotationset(id="testdataset")
annotationset.add_key("pos")
data = annotationset.add_data("pos","noun","D1")
store.annotate(id="A1",
               target=Selector.text(resource, Offset.simple(6,11)),
               data=[AnnotationDataBuilder.link(data)])
```

Here we use `AnnotationDataBuilder.link()` to link to the existing annotation
data. Providing the full `AnnotationDataBuilder` as in the previous example
would also have worked, with the same end result, but would be less performant.
The implementation reuses any already existing `AnnotationData` where possible,
as not duplicating data is one of the core characteristics of the STAM model.
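
The reuse of existing data can be pictured with a small pure-Python sketch. It is not the STAM implementation: `ToyDataSet` and its `add_data()` method are hypothetical names invented for illustration, and the sketch only shows the general idea of returning an existing item instead of storing a duplicate.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataSet:
    """Hypothetical stand-in for an annotation data set (not the STAM API)."""

    # Maps each (key, value) pair to the position of the stored item.
    index: dict = field(default_factory=dict)
    items: list = field(default_factory=list)

    def add_data(self, key: str, value: str) -> int:
        """Return the handle of the (key, value) pair, storing it only once."""
        pair = (key, value)
        if pair not in self.index:
            self.index[pair] = len(self.items)
            self.items.append(pair)
        return self.index[pair]


dataset = ToyDataSet()
first = dataset.add_data("pos", "noun")
second = dataset.add_data("pos", "noun")  # same pair: reused, not duplicated
other = dataset.add_data("pos", "verb")   # new pair: stored as a second item
```

Both calls with the same key and value return the same handle, so any number of annotations can point at one stored item.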

You can serialize the entire annotation store (including all sets and annotations) to a STAM JSON file:

```python
store.to_file("example.stam.json")
```

## Differences between the Rust library and the Python library, and performance considerations

Although this Python binding builds on the Rust library, the API it exposes
differs in certain aspects to make it more pythonic and easier to work with.
This results in a higher-level API that hides some of the lower-level details
that are present in the Rust library. This approach comes at the cost of
some additional runtime overhead.

In this Python binding, most classes of the model (`Annotation`,
`AnnotationData`, `DataKey`, etc.) are references to items in the annotation
store (each also holds a reference to the store itself). None of them can be
instantiated directly; they are always created via an `add_*()` or `annotate()`
method, which adds them to the store and returns the reference.

These instances play a bigger role in the Python API than their equivalents in
the Rust API (which distinguishes owned data, borrowed data, i.e. references, and
so-called handles). In the Rust API, search methods are mostly implemented on the main
`AnnotationStore` or `AnnotationDataSet`, reflecting the underlying ownership model more strictly.
In the Python API, they are implemented on the types themselves. Here's a comparison of some common methods:

+----------------------------------+---------------------------------------------------+
| Python API                       | Rust API                                          |
+----------------------------------+---------------------------------------------------+
| `Annotation.annotations()`       | `AnnotationStore::annotations_by_annotation()`    |
| `Annotation.resources()`         | `AnnotationStore::resources_by_annotation()`      |
| `Annotation.textselections()`    | `AnnotationStore::textselections_by_annotation()` |
| `Annotation.text()`              | `AnnotationStore::text_by_annotation()`           |
| `TextResource.annotations()`     | `AnnotationStore::annotations_by_resource()`      |
| `TextSelection.annotations()`    | `AnnotationStore::annotations_by_textselection()` |
| `DataKey.data()`                 | `AnnotationDataSet::data_by_key()`                |
| `DataKey.annotationset()`        | n/a                                               |
| `AnnotationData.annotationset()` | n/a                                               |
+----------------------------------+---------------------------------------------------+
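
The relation between the two columns can be sketched in plain Python. This is a toy model under stated assumptions, not the actual binding: `ToyStore` and `ToyResource` are hypothetical classes, and only the method names `annotations_by_resource()` and `annotations()` are borrowed from the table. A reference object holds a handle plus a pointer back to the store, and its type-level method delegates to the store-level lookup.

```python
class ToyStore:
    """Hypothetical in-memory store; illustrative only, not the STAM API."""

    def __init__(self):
        self._resources = []    # resource texts, addressed by handle (index)
        self._annotations = []  # (resource_handle, begin, end) triples

    def add_resource(self, text: str) -> "ToyResource":
        """Add a resource and return a reference to it."""
        self._resources.append(text)
        return ToyResource(self, len(self._resources) - 1)

    def annotate(self, resource: "ToyResource", begin: int, end: int) -> int:
        """Add an annotation on a resource and return its handle."""
        self._annotations.append((resource.handle, begin, end))
        return len(self._annotations) - 1

    def annotations_by_resource(self, handle: int) -> tuple:
        """Store-level lookup, in the style of the Rust column."""
        return tuple(
            index
            for index, (res, _, _) in enumerate(self._annotations)
            if res == handle
        )


class ToyResource:
    """Lightweight reference: a handle plus a pointer back to the store."""

    def __init__(self, store: ToyStore, handle: int):
        self.store = store
        self.handle = handle

    def annotations(self) -> tuple:
        """Type-level method, in the style of the Python column."""
        return self.store.annotations_by_resource(self.handle)


store = ToyStore()
hello = store.add_resource("Hello world")
other = store.add_resource("Another text")
store.annotate(hello, 6, 11)
store.annotate(other, 0, 7)
store.annotate(hello, 0, 5)
```

In this sketch `hello.annotations()` and `store.annotations_by_resource(hello.handle)` return the same handles; the reference object merely saves the caller from passing the store and handle around.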

The Rust methods return iterators, references or handles whenever they
can, and they do so safely. The Python API is often forced to make a
local copy. For iterators it sometimes lets the entire underlying Rust
iterator run its course and then returns the whole result as a tuple, rather than
returning a Python generator. Here you gain some speed at the cost of some memory.
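
The trade-off between a generator and a fully materialised tuple can be shown with a small pure-Python sketch. The functions `lazy_ids()` and `eager_ids()` are hypothetical and unrelated to the binding's internals; they only contrast the two return styles.

```python
from typing import Iterator


def lazy_ids(count: int) -> Iterator[str]:
    """Yield one ID at a time; nothing is stored up front."""
    for number in range(count):
        yield f"A{number}"


def eager_ids(count: int) -> tuple:
    """Run the whole iteration immediately and return the result as a tuple."""
    return tuple(lazy_ids(count))


lazy = lazy_ids(3)
eager = eager_ids(3)

first_pass = list(lazy)
second_pass = list(lazy)  # empty: a generator is exhausted after one pass
```

The tuple holds all items in memory at once, but it can be indexed, measured with `len()` and iterated repeatedly, whereas the generator keeps memory use low and can only be consumed once.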

Probably needless to say, but using Rust directly will always be more
performant than using this Python binding. However, using this Python binding
should still be far more performant than if the whole thing were implemented in
native Python. The trick is to let the binding do as much of the work for you
as possible: use higher-level methods whenever they are available rather than
implementing your logic in Python.
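
As a loose analogy from the Python standard library (not from STAM itself), the sketch below computes the same tally twice: once with a hand-written loop and once with `collections.Counter`, which does the counting for you. The helper `tally_by_hand()` is a hypothetical name. Both give the same result; the advice above amounts to preferring the second style when the binding offers a method that already does what you need.

```python
from collections import Counter

words = "the cat sat on the mat with the hat".split()


def tally_by_hand(items):
    """Count occurrences with an explicit Python-level loop."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


by_hand = tally_by_hand(words)
by_library = Counter(words)  # one higher-level call does the same work
```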


1 change: 1 addition & 0 deletions test.py
@@ -226,6 +226,7 @@ def test_find_data_missing(self):
self.assertEqual(annotationdata, None)
