Throw some documentation down for design-by-docs #1

willkg · 2017-07-26T01:38:01Z

No description provided.

willkg · 2017-07-26T01:48:03Z

docs/api.rst

+   >>> tree_flatten({'a': {'b': 1, 'c': 2}})
+   {'a.b': 1, 'a.c': 2}
+   >>> tree_flatten({'a': [{'b': 1}, {'c': 2}]})
+   {'a.0.b': 1, 'a.1.c': 2}


If we want flatten to be reversible (via an unflatten), then we need a different syntax here and/or a different mode because otherwise things like 0 and 1 are ambiguous and unflattenable.

willkg · 2017-07-26T01:49:11Z

docs/api.rst

+
+   This is the same error you'd get if you tried to access an index that doesn't
+   exist in a list.
+


Should tree_set support an argument to create indexes? Maybe a create_indexes=True fills in indexes with None?

>>> tree_set({}, 'a.1.b', value=5, create_indexes=True) {'a': [None, {'b': 5}]}

This might affect creating new lists if they're missing, too.

Do we assume numerical indices are lists? What do we do with dictionaries have numerical keys?

tree_set and numbers should work fine with both lists and dicts except when we're creating things that don't already exist. In that case, then treevert wouldn't know what to do because the syntax is ambiguous.

We could use [] syntax:

tree_set({}, 'a[1].b' ...)

That's not ambiguous if we declare that keys can't have [ and ] in them.

I think configman appends the index to the item like destination.storage0, destination.storage1, and so on. I don't know if it lets you use that to set things, though.

Another thing we could do is assume anything that doesn't already exist is a dict. So then:

>>> tree_set({}, 'a.1.b', value='foo') {'a': {'1': {'b': 'foo}}}

I'm not sure that's better and less surprising, but it's interesting.

I had another idea that has two parts.

we make tree_set only builds dicts for things that are missing and in this way, builds "sparse lists"

we add an argument to tree_set for "don't build things that don't exist"

So then we have use cases:

Developer wants to add something deep in the tree that doesn't involve lists. Works as advertised.

>>> tree_set({}, 'a.b.c', value='foo') {'a': {'b': {'c': 'foo'}}}

Done.

Developer wants to add something deep in the tree that has a list in the middle. If the list can be sparse (this really depends on how the tree is used), then it treats everything as a dict key and creates dicts as it goes along:

>>> tree_set({}, 'a.1.b', value='foo') {'a': {'1': {'b': 'foo'}}}

Done.

Developer wants lists to be lists. At this point, the developer has to build the list manually. Python has a setdefault--maybe we could do something like that which creates something if it doesn't exist already.

tree_setdefault(tree, default_tree)

>>> ret = tree_setdefault({}, {'a': [None, None, None]}) >>> ret {'a': [None, None, None]} >>> tree_set(ret, 'a.1.b', value='foo') {'a': [None, {'b': 'foo'}, None]}

That tree_setdefault would also let us take a tree of data and add default values to it which I think would also be super handy. There are some behavior things there we'd have to figure out like what happens if an edge in the tree leads to a different kind of node than the equivalent edge of the default tree (for example, dict vs. list).

willkg · 2017-07-26T01:53:34Z

docs/api.rst

+
+.. py:func:: tree_validate(tree, schema)
+
+   FIXME


I wrote a schema system for dicts for pyvideo a while back.

https://github.com/pyvideo/old-pyvideo-data/blob/master/src/clive/schemalib.py

With an example schema here:

https://github.com/pyvideo/old-pyvideo-data/blob/master/src/clive/pyvideo_schema.py

Pretty sure there are other schema systems out there, too.

Not sure we'd need this in the first version of treevert.

Is there a schema system in use in the data pipeline? Do they already have a tool for it?

As far as I've seen, the bulk of the telemetry data pipeline is not in Python. So I'm not sure we could use anything they've made without switching languages.

I'll ask around, though.

willkg · 2017-07-26T01:56:36Z

docs/api.rst

+
+.. py:func:: tree_traverse(tree, fun)
+
+   FIXME


Seems like a good idea to have a traversal system, but we wouldn't need this in the first version of treevert.

I would wait on this. Maybe on the flattening API, also, unless we had a clear use case for it up front.

willkg · 2017-07-28T13:51:21Z

For the immediate need, we need tree_get. I think what we have here is fine, except for the syntax of specifying the path. Namely, do we differentiate between keys and indices?

Seems like the other functions would benefit greatly from differentiating.

You mentioned jq.... I don't think we're recreating jq with this library since I think there are things I want to do that go beyond what jq does. However, we could definitely use the syntax for paths and they solve the ambiguity issues between keys and indices.

I'll add a section about path specification based on jq and we can see what that looks like.

Timing-wise, I want to move forward with this so I can use this to clean up signature rules to make them more readable.

* adjust syntax for paths to distinguish between keys and indices * rework tree_set to try to make it straight-forward but flexible for the myriad of scenarios when edges are missing

willkg · 2017-07-28T17:07:59Z

@lonnen ^^^ I "extracted" paths and made it jq-like. There are some other jq things we could take, but I want to focus on explicitly what we need for making processor rules safer and easier to read for the moment.

I also added some mission text showing the problem we're trying to solve here.

Plus I redid tree_set. That feels sane, is readable, has sane defaults, and easy to reason about. Does that look better?

willkg · 2017-08-11T12:31:02Z

docs/api.rst

+Paths can be composed using string operations since they're just strings.
+
+FIXME(willkg): Add diagram showing a tree with edges specified by a path.
+


Paths are based on, but not as expressive as, jq filters: https://stedolan.github.io/jq/tutorial/

erikrose · 2017-08-11T15:26:47Z

docs/api.rst

+Paths
+=====
+
+A path is a string specifying a period-delimited list of edges. Edges can be:


You're rediscovering ClearSilver. I don't say that's good or bad—I've had some good times with ClearSilver—but you can look to it for inspirations and pitfalls.

erikrose · 2017-08-11T15:27:41Z

docs/api.rst

+Index
+-----
+
+Indexes indicate a 0-based list index. They are:


What if a dict key is "0"? Not a problem because you're eventually using []? Seems like you still might have trouble because you're saying some_dict[0] rather than some_dict["0"].

Ah, you're wrapping the path elements in [] all the time. That solves it.

erikrose · 2017-08-11T15:42:51Z

docs/api.rst

+   FIXME
+
+
+.. py:func:: tree_validate(tree, schema)


For your consideration, my favorite Python schema lib is https://pypi.python.org/pypi/schema/, by the docopt guy. I've done some work on it. I use it in DXR. I've made some design proposals that I think would result in it being an elegant solution rather than being a bit slapdash in places.

Schema looks interesting. I'll keep that in mind.

erikrose · 2017-08-11T15:46:52Z

docs/api.rst

+get the utility we want without having to box/unbox data.
+
+If we're just working with dicts and lists and standard Python things, then
+``json.dumps`` and other things just work without us having to do anything about


Are you familiar with custom json.dumps and loads decoders and encoders? You can arrange it so you get fancier, custom types out rather than vanilla dicts.

We do that for datetimes, but that feature json loads/dumps doesn't solve the big problem. json.dumps here is just an example.

The big problem is that in order to change the data structure of raw_crash and processed_crash, we have to change all the code that touches those things. If the data structure is a subclass of dict, that helps, but it's not a panacea. Switching data structures is huge change that'd take a while to do and it's sort of all-or-nothing.

One of my hopes for treevert is that it gives us an interim step that we can convert to incrementally in the short term without having to commit to a project-wide rewrite.

erikrose · 2017-08-11T16:00:29Z

Overall, I'd like to see some concrete spots in the code that you want to make easier. That would help me understand what the core problem is you're trying to solve. Right now I'm waffling between…

You're trying to make syntax shorter so you can read and write it more easily: tree.setdefault('a', {}).setdefault('b', {})['c'] = 5.
You're trying to solve a data validity problem.

In general, I find deeply nested dicts and lists an antipattern when they have different semantics attached to different levels. Sure, if you're just representing a bunch of textual keys for subbing into a template, you're fine. But when they start to have different data types and default values and such, the responsibility for keeping that stuff straight diffuses all over the program and becomes impossible to understand (as opposed to reading a centralized schema) or update. Which situation do you have?

willkg · 2017-08-11T16:10:32Z

We have both use cases (data access and schema validation), but in different places.

For data access, it's pretty much what I said in the Goals section. Here are a couple of examples:

https://github.com/mozilla-services/socorro/blob/master/socorro/processor/general_transform_rules.py#L28

https://github.com/mozilla-services/socorro/blob/master/socorro/processor/breakpad_transform_rules.py#L257

Generally, the raw_crash and processed_crash contain arbitrary things in them. We can do some validation of the things we know about that don't change shape over time before running data through the processor, but that's not going to happen any time soon.

erikrose · 2017-08-11T18:20:26Z

Okay. So it's mostly a syntax problem. And from your explanation and a casual search through the code, it looks like the data format is largely defined outside Socorro and nothing [much] but the transformation plugins cares about the structure. So yes, your "multi-get" routine looks like a straightforward factoring out of what you were doing in the code anyway. I'm all for it. I might not bust it out into its own lib until it's grown up for awhile within Socorro (just to save paperwork), but that's just due to my personal values. Code on!

willkg · 2017-08-15T01:42:37Z

I think framing it as a syntax problem is really interesting. I think I agree with that, though I think it's a little bit more in that I want to build an API that gives us safety and readability wrapped in convenience for dealing with tree structures of crash data.

One interesting thing about making it a lib and not baking it into Socorro proper is that we can use it on Socorro-related projects, too. It's expensive to do things inside of Socorro and a lot easier to do things outside of Socorro currently. I definitely hear you on the paperwork concerns--those are definitely problematic. If this were for a different project, I'd hedge differently. The future will clarify all things and we can adjust accordingly--none of this is permanent.

@erikrose, @lonnen: Thank you so much for your thoughts! I'll throw together a minimal implementation for a 0.1 and we can see where that goes.

willkg added 2 commits July 25, 2017 21:37

Throw some documentation down for design-by-docs

93f4989

Add some of Lonnen's thoughts

c38280a

willkg commented Jul 26, 2017

View reviewed changes

willkg added 2 commits July 28, 2017 12:54

Lots of changes

73e2f34

* adjust syntax for paths to distinguish between keys and indices * rework tree_set to try to make it straight-forward but flexible for the myriad of scenarios when edges are missing

Add a mission

87e2fa3

willkg added 2 commits July 28, 2017 13:34

Add research and inspirations

1722143

Tweak some wording for clarity

e0fcfca

willkg commented Aug 11, 2017

View reviewed changes

erikrose reviewed Aug 11, 2017

View reviewed changes

willkg merged commit 099ca2f into master Aug 15, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Throw some documentation down for design-by-docs #1

Throw some documentation down for design-by-docs #1

willkg commented Jul 26, 2017

willkg Jul 26, 2017

willkg Jul 26, 2017

lonnen Jul 27, 2017

willkg Jul 27, 2017

willkg Jul 28, 2017

willkg Jul 26, 2017

lonnen Jul 27, 2017

willkg Jul 27, 2017

willkg Jul 26, 2017

lonnen Jul 27, 2017

willkg commented Jul 28, 2017

willkg commented Jul 28, 2017

willkg Aug 11, 2017

erikrose Aug 11, 2017 •

edited

Loading

erikrose Aug 11, 2017 •

edited

Loading

erikrose Aug 11, 2017

erikrose Aug 11, 2017

willkg Aug 11, 2017

erikrose Aug 11, 2017

willkg Aug 11, 2017

erikrose commented Aug 11, 2017

willkg commented Aug 11, 2017

erikrose commented Aug 11, 2017

willkg commented Aug 15, 2017


		This is the same error you'd get if you tried to access an index that doesn't
		exist in a list.

		Paths can be composed using string operations since they're just strings.

		FIXME(willkg): Add diagram showing a tree with edges specified by a path.

Throw some documentation down for design-by-docs #1

Throw some documentation down for design-by-docs #1

Conversation

willkg commented Jul 26, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

willkg commented Jul 28, 2017

willkg commented Jul 28, 2017

Choose a reason for hiding this comment

erikrose Aug 11, 2017 • edited Loading

Choose a reason for hiding this comment

erikrose Aug 11, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

erikrose commented Aug 11, 2017

willkg commented Aug 11, 2017

erikrose commented Aug 11, 2017

willkg commented Aug 15, 2017

erikrose Aug 11, 2017 •

edited

Loading

erikrose Aug 11, 2017 •

edited

Loading