-
Notifications
You must be signed in to change notification settings - Fork 0
Throw some documentation down for design-by-docs #1
Conversation
docs/api.rst
Outdated
>>> tree_flatten({'a': {'b': 1, 'c': 2}}) | ||
{'a.b': 1, 'a.c': 2} | ||
>>> tree_flatten({'a': [{'b': 1}, {'c': 2}]}) | ||
{'a.0.b': 1, 'a.1.c': 2} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want flatten to be reversible (via an unflatten), then we need a different syntax here and/or a different mode because otherwise things like 0
and 1
are ambiguous and unflattenable.
docs/api.rst
Outdated
|
||
This is the same error you'd get if you tried to access an index that doesn't | ||
exist in a list. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should tree_set
support an argument to create indexes? Maybe a create_indexes=True
fills in indexes with None
?
>>> tree_set({}, 'a.1.b', value=5, create_indexes=True)
{'a': [None, {'b': 5}]}
This might affect creating new lists if they're missing, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we assume numerical indices are lists? What do we do with dictionaries have numerical keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tree_set
and numbers should work fine with both lists and dicts except when we're creating things that don't already exist. In that case, then treevert wouldn't know what to do because the syntax is ambiguous.
We could use []
syntax:
tree_set({}, 'a[1].b' ...)
That's not ambiguous if we declare that keys can't have [
and ]
in them.
I think configman appends the index to the item like destination.storage0
, destination.storage1
, and so on. I don't know if it lets you use that to set things, though.
Another thing we could do is assume anything that doesn't already exist is a dict. So then:
>>> tree_set({}, 'a.1.b', value='foo')
{'a': {'1': {'b': 'foo}}}
I'm not sure that's better and less surprising, but it's interesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had another idea that has two parts.
- we make
tree_set
only builds dicts for things that are missing and in this way, builds "sparse lists" - we add an argument to
tree_set
for "don't build things that don't exist"
So then we have use cases:
Developer wants to add something deep in the tree that doesn't involve lists. Works as advertised.
>>> tree_set({}, 'a.b.c', value='foo')
{'a': {'b': {'c': 'foo'}}}
Done.
Developer wants to add something deep in the tree that has a list in the middle. If the list can be sparse (this really depends on how the tree is used), then it treats everything as a dict key and creates dicts as it goes along:
>>> tree_set({}, 'a.1.b', value='foo')
{'a': {'1': {'b': 'foo'}}}
Done.
Developer wants lists to be lists. At this point, the developer has to build the list manually. Python has a setdefault
--maybe we could do something like that which creates something if it doesn't exist already.
tree_setdefault(tree, default_tree)
>>> ret = tree_setdefault({}, {'a': [None, None, None]})
>>> ret
{'a': [None, None, None]}
>>> tree_set(ret, 'a.1.b', value='foo')
{'a': [None, {'b': 'foo'}, None]}
That tree_setdefault
would also let us take a tree of data and add default values to it which I think would also be super handy. There are some behavior things there we'd have to figure out like what happens if an edge in the tree leads to a different kind of node than the equivalent edge of the default tree (for example, dict vs. list).
|
||
.. py:func:: tree_validate(tree, schema) | ||
|
||
FIXME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote a schema system for dicts for pyvideo a while back.
https://github.com/pyvideo/old-pyvideo-data/blob/master/src/clive/schemalib.py
With an example schema here:
https://github.com/pyvideo/old-pyvideo-data/blob/master/src/clive/pyvideo_schema.py
Pretty sure there are other schema systems out there, too.
Not sure we'd need this in the first version of treevert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a schema system in use in the data pipeline? Do they already have a tool for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I've seen, the bulk of the telemetry data pipeline is not in Python. So I'm not sure we could use anything they've made without switching languages.
I'll ask around, though.
|
||
.. py:func:: tree_traverse(tree, fun) | ||
|
||
FIXME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea to have a traversal system, but we wouldn't need this in the first version of treevert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would wait on this. Maybe on the flattening API, also, unless we had a clear use case for it up front.
For the immediate need, we need Seems like the other functions would benefit greatly from differentiating. You mentioned jq.... I don't think we're recreating jq with this library since I think there are things I want to do that go beyond what jq does. However, we could definitely use the syntax for paths and they solve the ambiguity issues between keys and indices. I'll add a section about path specification based on jq and we can see what that looks like. Timing-wise, I want to move forward with this so I can use this to clean up signature rules to make them more readable. |
* adjust syntax for paths to distinguish between keys and indices * rework tree_set to try to make it straight-forward but flexible for the myriad of scenarios when edges are missing
@lonnen ^^^ I "extracted" paths and made it jq-like. There are some other jq things we could take, but I want to focus on explicitly what we need for making processor rules safer and easier to read for the moment. I also added some mission text showing the problem we're trying to solve here. Plus I redid |
Paths can be composed using string operations since they're just strings. | ||
|
||
FIXME(willkg): Add diagram showing a tree with edges specified by a path. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Paths are based on, but not as expressive as, jq filters: https://stedolan.github.io/jq/tutorial/
Paths | ||
===== | ||
|
||
A path is a string specifying a period-delimited list of edges. Edges can be: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're rediscovering ClearSilver. I don't say that's good or bad—I've had some good times with ClearSilver—but you can look to it for inspirations and pitfalls.
Index | ||
----- | ||
|
||
Indexes indicate a 0-based list index. They are: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if a dict key is "0"? Not a problem because you're eventually using []
? Seems like you still might have trouble because you're saying some_dict[0]
rather than some_dict["0"]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're wrapping the path elements in []
all the time. That solves it.
FIXME | ||
|
||
|
||
.. py:func:: tree_validate(tree, schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For your consideration, my favorite Python schema lib is https://pypi.python.org/pypi/schema/, by the docopt guy. I've done some work on it. I use it in DXR. I've made some design proposals that I think would result in it being an elegant solution rather than being a bit slapdash in places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schema looks interesting. I'll keep that in mind.
get the utility we want without having to box/unbox data. | ||
|
||
If we're just working with dicts and lists and standard Python things, then | ||
``json.dumps`` and other things just work without us having to do anything about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you familiar with custom json.dumps
and loads
decoders and encoders? You can arrange it so you get fancier, custom types out rather than vanilla dicts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do that for datetimes, but that feature json loads/dumps doesn't solve the big problem. json.dumps
here is just an example.
The big problem is that in order to change the data structure of raw_crash
and processed_crash
, we have to change all the code that touches those things. If the data structure is a subclass of dict, that helps, but it's not a panacea. Switching data structures is huge change that'd take a while to do and it's sort of all-or-nothing.
One of my hopes for treevert is that it gives us an interim step that we can convert to incrementally in the short term without having to commit to a project-wide rewrite.
Overall, I'd like to see some concrete spots in the code that you want to make easier. That would help me understand what the core problem is you're trying to solve. Right now I'm waffling between…
In general, I find deeply nested dicts and lists an antipattern when they have different semantics attached to different levels. Sure, if you're just representing a bunch of textual keys for subbing into a template, you're fine. But when they start to have different data types and default values and such, the responsibility for keeping that stuff straight diffuses all over the program and becomes impossible to understand (as opposed to reading a centralized schema) or update. Which situation do you have? |
We have both use cases (data access and schema validation), but in different places. For data access, it's pretty much what I said in the Goals section. Here are a couple of examples: Generally, the |
Okay. So it's mostly a syntax problem. And from your explanation and a casual search through the code, it looks like the data format is largely defined outside Socorro and nothing [much] but the transformation plugins cares about the structure. So yes, your "multi-get" routine looks like a straightforward factoring out of what you were doing in the code anyway. I'm all for it. I might not bust it out into its own lib until it's grown up for awhile within Socorro (just to save paperwork), but that's just due to my personal values. Code on! |
I think framing it as a syntax problem is really interesting. I think I agree with that, though I think it's a little bit more in that I want to build an API that gives us safety and readability wrapped in convenience for dealing with tree structures of crash data. One interesting thing about making it a lib and not baking it into Socorro proper is that we can use it on Socorro-related projects, too. It's expensive to do things inside of Socorro and a lot easier to do things outside of Socorro currently. I definitely hear you on the paperwork concerns--those are definitely problematic. If this were for a different project, I'd hedge differently. The future will clarify all things and we can adjust accordingly--none of this is permanent. @erikrose, @lonnen: Thank you so much for your thoughts! I'll throw together a minimal implementation for a 0.1 and we can see where that goes. |
No description provided.