Skip to content
This repository has been archived by the owner on Jun 12, 2020. It is now read-only.

Throw some documentation down for design-by-docs #1

Merged
merged 6 commits into from
Aug 15, 2017
Merged

Conversation

willkg
Copy link
Owner

@willkg willkg commented Jul 26, 2017

No description provided.

docs/api.rst Outdated
>>> tree_flatten({'a': {'b': 1, 'c': 2}})
{'a.b': 1, 'a.c': 2}
>>> tree_flatten({'a': [{'b': 1}, {'c': 2}]})
{'a.0.b': 1, 'a.1.c': 2}
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want flatten to be reversible (via an unflatten), then we need a different syntax here and/or a different mode because otherwise things like 0 and 1 are ambiguous and unflattenable.

docs/api.rst Outdated

This is the same error you'd get if you tried to access an index that doesn't
exist in a list.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should tree_set support an argument to create indexes? Maybe a create_indexes=True fills in indexes with None?

>>> tree_set({}, 'a.1.b', value=5, create_indexes=True)
{'a': [None, {'b': 5}]}

This might affect creating new lists if they're missing, too.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we assume numerical indices are lists? What do we do with dictionaries have numerical keys?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tree_set and numbers should work fine with both lists and dicts except when we're creating things that don't already exist. In that case, then treevert wouldn't know what to do because the syntax is ambiguous.

We could use [] syntax:

tree_set({}, 'a[1].b' ...)

That's not ambiguous if we declare that keys can't have [ and ] in them.

I think configman appends the index to the item like destination.storage0, destination.storage1, and so on. I don't know if it lets you use that to set things, though.

Another thing we could do is assume anything that doesn't already exist is a dict. So then:

>>> tree_set({}, 'a.1.b', value='foo')
{'a': {'1': {'b': 'foo}}}

I'm not sure that's better and less surprising, but it's interesting.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had another idea that has two parts.

  1. we make tree_set only builds dicts for things that are missing and in this way, builds "sparse lists"
  2. we add an argument to tree_set for "don't build things that don't exist"

So then we have use cases:

Developer wants to add something deep in the tree that doesn't involve lists. Works as advertised.

>>> tree_set({}, 'a.b.c', value='foo')
{'a': {'b': {'c': 'foo'}}}

Done.

Developer wants to add something deep in the tree that has a list in the middle. If the list can be sparse (this really depends on how the tree is used), then it treats everything as a dict key and creates dicts as it goes along:

>>> tree_set({}, 'a.1.b', value='foo')
{'a': {'1': {'b': 'foo'}}}

Done.

Developer wants lists to be lists. At this point, the developer has to build the list manually. Python has a setdefault--maybe we could do something like that which creates something if it doesn't exist already.

tree_setdefault(tree, default_tree)

>>> ret = tree_setdefault({}, {'a': [None, None, None]})
>>> ret
{'a': [None, None, None]}
>>> tree_set(ret, 'a.1.b', value='foo')
{'a': [None, {'b': 'foo'}, None]}

That tree_setdefault would also let us take a tree of data and add default values to it which I think would also be super handy. There are some behavior things there we'd have to figure out like what happens if an edge in the tree leads to a different kind of node than the equivalent edge of the default tree (for example, dict vs. list).


.. py:func:: tree_validate(tree, schema)

FIXME
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a schema system for dicts for pyvideo a while back.

https://github.com/pyvideo/old-pyvideo-data/blob/master/src/clive/schemalib.py

With an example schema here:

https://github.com/pyvideo/old-pyvideo-data/blob/master/src/clive/pyvideo_schema.py

Pretty sure there are other schema systems out there, too.

Not sure we'd need this in the first version of treevert.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a schema system in use in the data pipeline? Do they already have a tool for it?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I've seen, the bulk of the telemetry data pipeline is not in Python. So I'm not sure we could use anything they've made without switching languages.

I'll ask around, though.


.. py:func:: tree_traverse(tree, fun)

FIXME
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good idea to have a traversal system, but we wouldn't need this in the first version of treevert.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would wait on this. Maybe on the flattening API, also, unless we had a clear use case for it up front.

@willkg
Copy link
Owner Author

willkg commented Jul 28, 2017

For the immediate need, we need tree_get. I think what we have here is fine, except for the syntax of specifying the path. Namely, do we differentiate between keys and indices?

Seems like the other functions would benefit greatly from differentiating.

You mentioned jq.... I don't think we're recreating jq with this library since I think there are things I want to do that go beyond what jq does. However, we could definitely use the syntax for paths and they solve the ambiguity issues between keys and indices.

I'll add a section about path specification based on jq and we can see what that looks like.

Timing-wise, I want to move forward with this so I can use this to clean up signature rules to make them more readable.

* adjust syntax for paths to distinguish between keys and indices
* rework tree_set to try to make it straight-forward but flexible for the myriad
  of scenarios when edges are missing
@willkg
Copy link
Owner Author

willkg commented Jul 28, 2017

@lonnen ^^^ I "extracted" paths and made it jq-like. There are some other jq things we could take, but I want to focus on explicitly what we need for making processor rules safer and easier to read for the moment.

I also added some mission text showing the problem we're trying to solve here.

Plus I redid tree_set. That feels sane, is readable, has sane defaults, and easy to reason about. Does that look better?

Paths can be composed using string operations since they're just strings.

FIXME(willkg): Add diagram showing a tree with edges specified by a path.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paths are based on, but not as expressive as, jq filters: https://stedolan.github.io/jq/tutorial/

Paths
=====

A path is a string specifying a period-delimited list of edges. Edges can be:
Copy link

@erikrose erikrose Aug 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're rediscovering ClearSilver. I don't say that's good or bad—I've had some good times with ClearSilver—but you can look to it for inspirations and pitfalls.

Index
-----

Indexes indicate a 0-based list index. They are:
Copy link

@erikrose erikrose Aug 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a dict key is "0"? Not a problem because you're eventually using []? Seems like you still might have trouble because you're saying some_dict[0] rather than some_dict["0"].

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're wrapping the path elements in [] all the time. That solves it.

FIXME


.. py:func:: tree_validate(tree, schema)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For your consideration, my favorite Python schema lib is https://pypi.python.org/pypi/schema/, by the docopt guy. I've done some work on it. I use it in DXR. I've made some design proposals that I think would result in it being an elegant solution rather than being a bit slapdash in places.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema looks interesting. I'll keep that in mind.

get the utility we want without having to box/unbox data.

If we're just working with dicts and lists and standard Python things, then
``json.dumps`` and other things just work without us having to do anything about

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you familiar with custom json.dumps and loads decoders and encoders? You can arrange it so you get fancier, custom types out rather than vanilla dicts.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do that for datetimes, but that feature json loads/dumps doesn't solve the big problem. json.dumps here is just an example.

The big problem is that in order to change the data structure of raw_crash and processed_crash, we have to change all the code that touches those things. If the data structure is a subclass of dict, that helps, but it's not a panacea. Switching data structures is huge change that'd take a while to do and it's sort of all-or-nothing.

One of my hopes for treevert is that it gives us an interim step that we can convert to incrementally in the short term without having to commit to a project-wide rewrite.

@erikrose
Copy link

Overall, I'd like to see some concrete spots in the code that you want to make easier. That would help me understand what the core problem is you're trying to solve. Right now I'm waffling between…

  1. You're trying to make syntax shorter so you can read and write it more easily: tree.setdefault('a', {}).setdefault('b', {})['c'] = 5.
  2. You're trying to solve a data validity problem.

In general, I find deeply nested dicts and lists an antipattern when they have different semantics attached to different levels. Sure, if you're just representing a bunch of textual keys for subbing into a template, you're fine. But when they start to have different data types and default values and such, the responsibility for keeping that stuff straight diffuses all over the program and becomes impossible to understand (as opposed to reading a centralized schema) or update. Which situation do you have?

@willkg
Copy link
Owner Author

willkg commented Aug 11, 2017

We have both use cases (data access and schema validation), but in different places.

For data access, it's pretty much what I said in the Goals section. Here are a couple of examples:

https://github.com/mozilla-services/socorro/blob/master/socorro/processor/general_transform_rules.py#L28

https://github.com/mozilla-services/socorro/blob/master/socorro/processor/breakpad_transform_rules.py#L257

Generally, the raw_crash and processed_crash contain arbitrary things in them. We can do some validation of the things we know about that don't change shape over time before running data through the processor, but that's not going to happen any time soon.

@erikrose
Copy link

Okay. So it's mostly a syntax problem. And from your explanation and a casual search through the code, it looks like the data format is largely defined outside Socorro and nothing [much] but the transformation plugins cares about the structure. So yes, your "multi-get" routine looks like a straightforward factoring out of what you were doing in the code anyway. I'm all for it. I might not bust it out into its own lib until it's grown up for awhile within Socorro (just to save paperwork), but that's just due to my personal values. Code on!

@willkg
Copy link
Owner Author

willkg commented Aug 15, 2017

I think framing it as a syntax problem is really interesting. I think I agree with that, though I think it's a little bit more in that I want to build an API that gives us safety and readability wrapped in convenience for dealing with tree structures of crash data.

One interesting thing about making it a lib and not baking it into Socorro proper is that we can use it on Socorro-related projects, too. It's expensive to do things inside of Socorro and a lot easier to do things outside of Socorro currently. I definitely hear you on the paperwork concerns--those are definitely problematic. If this were for a different project, I'd hedge differently. The future will clarify all things and we can adjust accordingly--none of this is permanent.

@erikrose, @lonnen: Thank you so much for your thoughts! I'll throw together a minimal implementation for a 0.1 and we can see where that goes.

@willkg willkg merged commit 099ca2f into master Aug 15, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants