RFC py/stream: Add .readbin() and .writebin() methods. #3382
These methods allow reading/writing binary packed data directly from/to streams. That's quite a popular paradigm, and many programming languages/libraries use it (Java, Node.js, Android, etc.). Python doesn't support it natively, but such support would make it possible to implement parsers for complex binary structures with low memory overhead, which would be beneficial for MicroPython.
This implements ideas from #2543 . My plan was to use these to implement a DNS parser, then perhaps to assess how these methods would help projects like https://github.com/dmazzella/uble . But I haven't found time for that so far, and the code was just sitting in my git stash, so I'm posting it as an RFC for now. |
Example of a DNS packet writer/parser implemented using these functions. It shows that this is a pretty natural paradigm.
|
The example does look natural, but (to play devil's advocate) there are other existing ways to do it: 1) readinto a preallocated buffer and use struct to unpack the values; 2) to avoid heap allocation, readinto a buf and then use uctypes to access the values. And these methods only need one call to the underlying stream to read the data, not one call per item. |
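To make alternative 1) concrete, here is a minimal sketch in CPython terms (the `struct` and `io` modules stand in for MicroPython's `ustruct` and `uio`). The function name `read_header` and the choice of the fixed 12-byte DNS header as the record are illustrative assumptions, not code from this PR.

```python
import io
import struct

# Assumption for illustration: the fixed DNS header, six big-endian
# 16-bit fields (id, flags, qdcount, ancount, nscount, arcount).
HEADER_FMT = ">HHHHHH"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 12 bytes


def read_header(stream, buf):
    """Fill the caller's preallocated buffer with one readinto() call,
    then unpack every field from it in one go."""
    n = stream.readinto(buf)
    if n != HEADER_SIZE:
        raise EOFError("short read")
    return struct.unpack_from(HEADER_FMT, buf, 0)


# Usage: the buffer is allocated once and can be reused for every packet.
buf = bytearray(HEADER_SIZE)
packet = io.BytesIO(struct.pack(HEADER_FMT, 0x1234, 0x0100, 1, 0, 0, 0))
msg_id, flags, qdcount, ancount, nscount, arcount = read_header(packet, buf)
```

Note that `unpack_from` still returns a freshly allocated tuple, so this approach saves stream calls rather than removing allocation entirely.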
The whole idea of introducing methods working directly on streams is to minimize/avoid extra memory allocation.
Many binary messages contain variable-size fields and context-dependent data. Dealing with those using either the ustruct or the uctypes module is cumbersome: it requires stopping access to the previous structure chunk, extracting the variable-length data, recalculating/resyncing the offset for the next chunk, and repeating. Many binary messages aren't really static structures, but are streams of different-typed elements. That's why the stream paradigm works so naturally for them. The DNS code above exemplifies these points, e.g. write_fqdn, read_fqdn. Pay special attention to read_fqdn: DNS messages employ a kind of simple LZ compression, and it was simple and clear to implement that with the stream paradigm too. (As a disclaimer, the DNS wire format is pretty easy. Having been written using readbin/writebin (easy), it's now easy to see how to "optimize" it (for the particular use case of just resolving IP addresses) to use just read/write and a few int.from_bytes/to_bytes calls (even struct isn't needed). But in the general situation that's not the case, and in many cases such "optimization" is actually "cumbersomization" and additional alloc overhead.)
Right, the stream paradigm is not a panacea; it won't replace struct/uctypes. Then the main requirement for it is to avoid memory allocation, and then it effectively trades that for bytecode size. This patch was also written to minimize machine code size. Going forward, we could extend .writebin() to take a few arguments at once, but that would make it asymmetric to .readbin(). |
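As a rough illustration of the stream paradigm for variable-length fields, the sketch below parses a DNS name, including compression pointers, one value at a time. It is not the code from this PR: `readbin` and `writebin` here are hypothetical module-level stand-ins built on CPython's `struct` (so they do allocate), and `write_name`/`read_name` are illustrative names with no guard against pointer loops.

```python
import io
import struct


def readbin(stream, fmt):
    """Hypothetical stand-in for stream.readbin(fmt): read one packed value."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("short read")
    return struct.unpack(fmt, data)[0]


def writebin(stream, fmt, value):
    """Hypothetical stand-in for stream.writebin(fmt, value)."""
    stream.write(struct.pack(fmt, value))


def write_name(stream, name):
    """Write a dotted name as length-prefixed labels plus a zero terminator."""
    for label in name.split("."):
        raw = label.encode("ascii")
        writebin(stream, "B", len(raw))
        stream.write(raw)
    writebin(stream, "B", 0)


def read_name(stream):
    """Read a name, following compression pointers (top two length bits set)."""
    labels = []
    resume_at = None  # position just after the first pointer, if any
    while True:
        length = readbin(stream, "B")
        if length == 0:
            break
        if length & 0xC0 == 0xC0:
            low = readbin(stream, "B")
            if resume_at is None:
                resume_at = stream.tell()
            # The pointer is a 14-bit offset from the start of the message.
            stream.seek(((length & 0x3F) << 8) | low)
            continue
        labels.append(stream.read(length).decode("ascii"))
    if resume_at is not None:
        stream.seek(resume_at)
    return ".".join(labels)
```

The point of the sketch is that no offset bookkeeping is visible to the caller: each read simply continues where the previous one stopped, and a pointer is handled with a seek.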
Well, the minimal machine code size is no code at all: use existing functionality (like struct) instead. And then why not try to improve the struct methods so they don't need to allocate on the heap? This would then be immediately more useful for other things, not just streams. If readbin/writebin are added they should arguably be available for all streams, meaning adding a whole bunch of entries to existing method tables. On the other hand, improving struct only needs to be done in one place. |
Yes, so this may trigger looking into actually implementing base class support for streams. Though realistically, these functions are most useful for BytesIO.
Which would be the way of ... ? The underlying problem, as was discussed above, is that there're different paradigms of working with memory structures vs working with streams of values. struct module does the former, so it will be always cumbersome to use with streams. Short of making it accept a stream as an argument, which will require noticeable code changes still, and still will look cumbersome, even if technical problem of useless memory allocation is solved.
Unfortunately, that's not always possible. That's why my suggestion is to do that whenever possible (e.g. not add listdir if it's just list(ilistdir), etc., etc.), and save those bytes to implement something which can't be done well any other way. All this is still to approach "choose M features from N, M < N". (While the current situation is that such free selection isn't even possible for all M features, e.g. there's no way to write binary data to streams which is alloc-free and non-cumbersome.) |
Yes, that's one thing that can be done. And that independently would reduce code size (even without readbin/writebin).
Like machine.time_pulse_us, make readbin/writebin functions that operate on streams, possibly putting them in the struct module. Or make existing struct take streams instead of a buffer argument (as you suggest). And/or add struct.unpack_single() that doesn't return a new tuple and can generalise to streams and buffers. And/or struct.unpack_from_into() that takes a list to return the multiple values in. Or provide a mechanism within the runtime to allow functions to return a tuple without allocating it on the heap, as long as it is immediately unpacked. Etc.
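One of the alternatives listed above, module-level functions that operate on streams, could look roughly like the following sketch. The names `write_packed` and `read_packed` and the shared 8-byte scratch buffer are hypothetical; nothing like this exists in `struct` today. In CPython the `memoryview` slice and the returned tuple are still small allocations, so this only shows the shape of the API, not an allocation-free implementation.

```python
import io
import struct

# Hypothetical shared scratch buffer; formats larger than 8 bytes would
# make struct.pack_into() raise struct.error in this sketch.
_scratch = bytearray(8)
_view = memoryview(_scratch)


def write_packed(stream, fmt, *values):
    """Pack values into the scratch buffer, then write only the used part."""
    size = struct.calcsize(fmt)
    struct.pack_into(fmt, _scratch, 0, *values)
    return stream.write(_view[:size])


def read_packed(stream, fmt):
    """Read exactly calcsize(fmt) bytes into the scratch buffer and unpack."""
    size = struct.calcsize(fmt)
    n = stream.readinto(_view[:size])
    if n != size:
        raise EOFError("short read")
    return struct.unpack_from(fmt, _scratch, 0)
```

A call such as `write_packed(stream, ">HB", 0x1234, 5)` then plays the role that a method call on the stream would play in the proposal, at the cost of passing the stream explicitly.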
Take a look at https://github.com/micropython/micropython/blob/master/drivers/sdcard/sdcard.py#L135-L143 . This is doing an alloc-free (the buffer is preallocated) write of binary data ( |
That already adds a bit of cumbersomeness, as method notation is more natural, plus it helps a lot that these methods integrate with existing stream methods, so it all comes naturally (it's really a Python problem that it doesn't provide that functionality).
I like that one, can you please work on that? ;-)
Yeah, so it's like saying "there're multiple ways to implement it". Indeed, there are. But can you choose "the best alternative" which would be competitive with stream.writebin()? Because otherwise it's clear that the only reason the alternatives are listed is to work around adding these methods to the stream interface. But as mentioned, that's the whole "selling point" of it, why it can become a default choice for protocol parsing/generating. And none of the choices you list would help with the original issue you raise, which is adding more code to implement it.
Yeah, that's what I call "cumbersome". With .writebin(), that's exactly 3 lines of code, like the number of values to write.
It can cover these cases, that's the whole point.
How come? Streams are a powerful idea, and the baseline of I/O operations in uPy, so it should prevail, and other I/O protocols should extend it, not override it. We also discussed how to make an object be able to support multiple protocols. |
Did you consider discussing the issue upstream (python-ideas)? Without having it upstream, and to aid compatibility with CPython, instead of adding methods like readbin/writebin it's better to add functions (like struct.readbin) which can be then much more easily emulated in a (C)Python script.
The idea is to pick something which works not only with streams but also with other entities that need to pack/unpack binary data in an efficient way, and which can't be streams. Otherwise we need to invent 2 things: one for streams and one for everything else (which would also work with streams).
Please explain how a generic readbin() can cover the case of reading a byte from SPI, when you also need to specify the value to write at the same time?
SPI is not really a stream, it's close but not quite because it allows (really requires) writing and reading at the same time. And for efficiency you usually buffer the whole SPI transaction and then read/write it only in one call. So multiple readbin/writebin doesn't fit in here. And with I2C it's even less like a stream because of the address of the device. How will readbin/writebin help I2C do more efficient transfers? Something to really address first is #2552. |
You perfectly know that I considered it, and perfectly know that I did, in general. For example, I have posted to upstream lists more, and on more issues, than you. I found this activity to be, to put it mildly, not fruitful, if not discouraging. In particular, the reason I was doing that was to avoid the situation of previous small Python implementations (of which there is more than one), which didn't seem to leave any traces in CPython; I now understand that even if they tried, they soon felt discouraged too. Besides, there's no need to bow to upstream on every question; rather we should be what we are, an experimental Python implementation. We should innovate more freely, and share results with them, not wishful thinking. So: I can post that to python-ideas, but there won't be a productive discussion. Any arguments about memory allocation will fall on deaf ears, and the only discussion will be around "where are keyword-only arguments??" (Do you know that, as of CPython 3.6, they made breaking changes in many places: where previously normal arguments were required, now it's kwonly.)
I didn't see any desire to aid compatibility from CPython side so far. And better for whom? For MicroPython aims (and arguably, users), it would be better to have it easy to use and efficient. And we perfectly know that Python is a flexible language, so argument that it's easier to add a module-level function instead of providing a wrapped class which adds a method is pretty straw one. Again, I'm so far the only one who cared about uPy->CPy compatibility and laid out framework for that (e.g. initial uasyncio compatibility module) - those grew unmaintained, because just as everything else, it was supposed to be community effort, and nobody cared about that so far. Summing up, I find it pretty frustrating that you're talking about "innovation" in relation to a forked repo(s) which violate as basic things as code style, add random things here and there without thinking about any other port (far less all other ports), etc. and here we have valuation like this - after this matter literally was years in RFC and "thinking-over", with pretty clear (for me at least) conclusion that we need improvements on that part, and if doing it, doing in the most useful, not a "big brother watches you" manner. |
More bravery on CPython side: https://github.com/python/typing/issues/495 . So, people just evolve and do what makes sense to do. Of course, all that shouldn't be done randomly and without planning, and most of the ideas in uPy which finally get implemented are years old. |
Thanks, Paul. That looks very useful, when puzzling with coded binary data containing records of varying length. |
@robert-hh : Thanks for the review. |
Very useful! +1 for it |
@dmazzella : As I mentioned, I wanted to use your https://github.com/dmazzella/uble as a "case study" whether it would be useful for existing 3rd-party code (dmazzella/uble#4 came from that review). I didn't post my "findings", because I'm not sure they're conclusive. If you find the above paradigm useful, thanks. |
@pfalcon : Thank you for your interest in uble. If you can give me some advice on how to reduce RAM use I would be really happy; at the moment the dict of events and commands takes up too much memory |
For reference, based on the criticism and suggestion in this PR, I went thru the trouble of implementing an alternative of adding stream argument support to
In this regard, following was implemented:
A detailed report on the result is available in pfalcon/pycopy#10 . In short, pp. 1 & 2 above went well, but pack_into() already showed cumbersomeness, because for streams, offset would be always 0, and this stray 0 is only confusing. But trying to compare .writebin() and ustruct.* shows that the latter is cumbersome in either case, e.g.:
vs
(pack_s() is pack_into() without offset). And overall, .writebin()/.readbin() are just one (pair) of the proposed stream method additions, with #2180 giving more examples. Perhaps .writebin()/.readbin() could be stuffed into Another way to look at .writebin() is as a generalization of .writechar() http://docs.micropython.org/en/latest/pyboard/library/pyb.UART.html?highlight=writechar#pyb.UART.writechar, which somehow was added to the pyb module, and while it is not officially described for the machine module, people somehow implement it in an ad hoc manner: #4014 . Regarding support for .writebin()/.readbin() for SPI, it's exactly the same as for .write()/.read(). How should they behave? Apparently, they should send/receive a filler value. How to specify that value? There are 2 patterns for doing that: a) allow specifying ad hoc, "ancillary" data to generic methods like .read/.write; b) make such an extended param part of the stream's state, e.g. a .set_filler() method in the SPI case. The ancillary data way is "more lightweight", but it effectively breaks the "common interface" (stream interface) paradigm. The "extra state" way requires, well, more state to store (at least a bit more memory to maintain), but preserves the interface paradigm. Btw, the .writev() method of #2180 is again a generalization of what's now proposed as #4020. |
Closing because 1) it's important to not stray too far from CPython compatibility; 2) MicroPython works pretty well without these additions; 3) there are more important things to work on. |