Add support for encoding TypedArrays as primitive objects for serialization #2911
…tation objects)
A TypedArray representation object has two properties: `dtype` and `data`.
`dtype` is a string indicating the type of the typed array (`'int8'`, `'float32'`, `'uint16'`, etc.).
`data` is a primitive JavaScript object that stores the typed array data. It can be one of:
- Standard JavaScript Array
- ArrayBuffer
- DataView
- A base64 encoded string
The representation objects may stand in for TypedArrays in `data_type` properties and in properties with `arrayOk: true`. The representation object is stored in `data`/`layout`, while the converted TypedArray is stored in `_fullData`/`_fullLayout`.
Really appreciate this thorough write-up @jonmmease! Technologically, I'll leave it up to the rest, but I do really like the sound of the use cases that you've enumerated 👍
Dash directly uses this JSON serializer ( When you say 10 times faster, in which ways is it faster? I'm assuming there are 4:
@chriddyp The 10X was my off-the-cuff test of encoding a 1 million element numpy array of random float64 values into a Python string. Something like Comparisons of (2) (3) and (4) would definitely be interesting as well!
Thanks for PR! Looking forward to JSON-serializable typed arrays!
{
    "data": [{
        "type": "scatter",
        "x": {"dtype": "float64", "data": [3, 2, 1]},
Not sure about using "data", "data" has a pretty important meaning already for plotly.js. I'd vote for values.
or maybe just "v" as calling a base64 string a set of values sounds wrong.
I renamed data to value (singular) in 5030d3a. Does that work for you? v felt too short 🙂
src/lib/coerce.js
Outdated
@@ -521,3 +539,48 @@ function validate(value, opts) {
    return out !== failed;
}
exports.validate = validate;

var dtypeStringToTypedarrayType = {
I think that's all of them, except for Uint8ClampedArray:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays#Typed_array_views
Might as well add it here for completeness.
Added in 5030d3a
src/lib/coerce.js
Outdated
 *
 * @returns {TypedArray}
 */
function primitiveTypedArrayReprToTypedArray(v) {
I'm curious. Have you benchmarked this routine for base64 strings corresponding to 1e4, 1e5, and 1e6 pts?
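One way such a benchmark could be set up is sketched below. It is only a rough timing harness, not the routine from this PR: `now`, `bytesToBase64`, `base64ToFloat64Array` and `timeDecode` are hypothetical helper names, the data is random float64 values, and the decode path (`atob` plus a byte-copy loop, with a Node `Buffer` fallback) is an assumption about how a base64 string would be turned into a typed array.

```javascript
// Timing sketch; helper names are hypothetical, not plotly.js functions.

// Millisecond clock: performance.now() where available, Date.now() otherwise.
function now() {
    return (typeof performance !== 'undefined' && typeof performance.now === 'function') ?
        performance.now() : Date.now();
}

// Encode raw bytes as base64 (btoa in browsers, Buffer as a Node fallback).
function bytesToBase64(bytes) {
    if(typeof btoa !== 'function') return Buffer.from(bytes).toString('base64');
    var binary = '';
    var chunk = 0x8000; // keep String.fromCharCode argument lists small
    for(var i = 0; i < bytes.length; i += chunk) {
        binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunk));
    }
    return btoa(binary);
}

// Decode a base64 string into a Float64Array (the step being timed).
function base64ToFloat64Array(b64) {
    var binary = (typeof atob === 'function') ?
        atob(b64) : Buffer.from(b64, 'base64').toString('binary');
    var bytes = new Uint8Array(binary.length);
    for(var i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    return new Float64Array(bytes.buffer);
}

// Build a base64 string for n random float64 values and time its decoding.
function timeDecode(n) {
    var input = new Float64Array(n);
    for(var i = 0; i < n; i++) input[i] = Math.random();
    var b64 = bytesToBase64(new Uint8Array(input.buffer));

    var t0 = now();
    var decoded = base64ToFloat64Array(b64);
    var ms = now() - t0;

    return {n: n, ms: ms, length: decoded.length};
}

[1e4, 1e5, 1e6].forEach(function(n) {
    var r = timeDecode(n);
    console.log(r.n + ' pts decoded in ' + r.ms.toFixed(2) + ' ms');
});
```

A single run per size is noisy; repeating each size several times and taking the median would give more trustworthy numbers.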
src/lib/coerce.js
Outdated
else if(dflt !== undefined) propOut.set(dflt);
if(isArrayOrTypedArray(v)) {
    propOut.set(v);
} else if(isPrimitiveTypedArrayRepr(v)) {
Ideally, the {dtype: '', data: ''} -> typed array conversion should happen during the calc step. More precisely somewhere here:
plotly.js/src/plots/polar/set_convert.js
Lines 103 to 136 in 24a0f91
ax.makeCalcdata = function(trace, coord) {
    var arrayIn = trace[coord];
    var len = trace._length;
    var arrayOut, i;
    var _d2c = function(v) { return ax.d2c(v, trace.thetaunit); };
    if(arrayIn) {
        if(Lib.isTypedArray(arrayIn) && axType === 'linear') {
            if(len === arrayIn.length) {
                return arrayIn;
            } else if(arrayIn.subarray) {
                return arrayIn.subarray(0, len);
            }
        }
        arrayOut = new Array(len);
        for(i = 0; i < len; i++) {
            arrayOut[i] = _d2c(arrayIn[i]);
        }
    } else {
        var coord0 = coord + '0';
        var dcoord = 'd' + coord;
        var v0 = (coord0 in trace) ? _d2c(trace[coord0]) : 0;
        var dv = (trace[dcoord]) ? _d2c(trace[dcoord]) : (ax.period || 2 * Math.PI) / len;
        arrayOut = new Array(len);
        for(i = 0; i < len; i++) {
            arrayOut[i] = v0 + i * dv;
        }
    }
    return arrayOut;
};
Depending on how slow this conversion can be, moving it to the calc step will help ensure faster interactions (note that the calc step is skipped on e.g. zoom and pan).
This might be a fairly big job though, you'll have to replace all the isArrayOrTypedArray calls upstream of the calc step with something like isArrayOrTypedArrayOrIsPrimitiveTypedArrayRepr (or something less verbose 😄 ).
So I guess we should first benchmark primitiveTypedArrayReprToTypedArray to see what kind of potential perf gain we could get.
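A combined predicate of the kind suggested here might look roughly like the sketch below. The names `isPlainObject`, `isPrimitiveTypedArrayRepr` and `isArrayOrTypedArrayOrRepr` are stand-ins written for this example (the local `isPlainObject` is not plotly.js's `Lib.isPlainObject`), and the check uses the `value` key that `data` was renamed to in 5030d3a. What counts as a valid representation object is an assumption here.

```javascript
// Hypothetical stand-in for a plain-object check (not Lib.isPlainObject).
function isPlainObject(obj) {
    return Object.prototype.toString.call(obj) === '[object Object]' &&
        Object.getPrototypeOf(obj) === Object.prototype;
}

// Assumption: a representation object is a plain object with a string
// `dtype` and a `value` property.
function isPrimitiveTypedArrayRepr(v) {
    return isPlainObject(v) && typeof v.dtype === 'string' && ('value' in v);
}

// True for plain arrays, typed arrays (but not DataView), and
// typed array representation objects.
function isArrayOrTypedArrayOrRepr(v) {
    return Array.isArray(v) ||
        (ArrayBuffer.isView(v) && !(v instanceof DataView)) ||
        isPrimitiveTypedArrayRepr(v);
}

console.log(isArrayOrTypedArrayOrRepr([3, 2, 1]));                           // true
console.log(isArrayOrTypedArrayOrRepr(new Float32Array(3)));                 // true
console.log(isArrayOrTypedArrayOrRepr({dtype: 'uint16', value: 'AwACAAEA'})); // true
console.log(isArrayOrTypedArrayOrRepr('AwACAAEA'));                          // false
```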
@etpinard Maybe I'm misunderstanding something. As it is now, I thought these conversions would be happening in the supplyDefaults logic for the various traces. Does that logic run on events like zoom/pan?
Either way, I'll get some performance numbers for primitiveTypedArrayReprToTypedArray. I was hoping I wouldn't need to retrace all of the steps you went through to add the original TypedArray support!
Thanks for bringing this issue to the fore @jonmmease - your format proposal looks great, and I'd be very happy to have this format locked in as an official part of plotly.js. My main concern is where it fits in the pipeline. Leaving the representation in the figure and making conversion part of
Plotly.newPlot(gd,[{y:new Int8Array([1,2,3,4,5])}]);
Plotly.restyle(gd, 'y[2]', 6);
I guess the first two could potentially be fixed by @etpinard's suggestion of moving the conversion to Alternatively, would it be reasonable to have official serialization/deserialization routes, that can be used both on a complete figure and on arguments to
I'll look it over some more tonight, but one quick thought. For plotly.py all I really need is a way to check equality between my data model (representation array), and whatever Plotly.js stores in If we go this route, could the
Yeah, we do modify the input in a few places, but we're trying to break that habit. This is why I'm angling for an explicit deserialize step when new data comes in (and an explicit serialize step when saving). And as the serial format is really all about interfacing with the world outside of javascript, it seems like it should be kept separate from the regular pipeline, to be invoked by whatever application it is that's doing that out-of-js interfacing.
OK great, let's see how far we can get that way!
Quick update. It turns out that I was able to solve the FigureWidget serialization problem by customizing the ipywidgets serialization logic. So my use-case (1) is no longer dependent on any changes here. And there's no need to make the methods discussed above public. Given that, I do think it makes sense to work towards some form of dedicated serialization pathway. Maybe something like...
// inVal is something from the outside with typed array representation objects
var inVal = {...}
// Plotly.import converts these to TypedArrays
Plotly.newPlot(gd, Plotly.import(inVal))
// Do stuff
// outVal has TypedArrays encoded as base64 representation objects
var outVal = Plotly.export(gd, {typedArrayRepr: 'base64'})
what do you think?
- Use `Lib.isPlainObject`
- Renamed `data` -> `value`
- Added `Uint8ClampedArray`
- Committed updated package-lock.json
No changes yet to the logical structure of where conversion happens
Revert changes to coerce.js
Latest push reverts all coerce.js changes and moves the typed array conversion logic to a new
src/lib/is_array.js
Outdated
@@ -5,6 +5,7 @@
 * This source code is licensed under the MIT license found in the
 * LICENSE file in the root directory of this source tree.
 */
var Lib = require('../lib');
You'll need to require './is_plain_object.js' to avoid a circular dependency pattern
lib/index -> lib/is_plain_object -> lib/index -> lib/is_array -> lib/index -> lib/is_plain_object
@jonmmease your
// inVal is something from the outside with typed array representation objects
var inVal = {...}
// Plotly.import converts these to TypedArrays
Plotly.newPlot(gd, Plotly.import(inVal))
// Do stuff
// outVal has TypedArrays encoded as base64 representation objects
var outVal = Plotly.export(gd, {typedArrayRepr: 'base64'})
sounds solid. 🥇 Personally, I find
src/plot_api/plot_api.js
Outdated
 *
 * @returns {TypedArray}
 */
var dtypeStringToTypedarrayType = {
We'll need to append this list:
Lines 12 to 21 in a8c6217
"globals": {
    "Promise": true,
    "Float32Array": true,
    "Float64Array": true,
    "Uint8Array": true,
    "Int16Array": true,
    "Int32Array": true,
    "ArrayBuffer": true,
    "DataView": true,
    "SVGElement": false
to make npm run lint pass.
and we'll need to add fallbacks so that browsers w/o typed array support don't break.
We have a "test" for that:
npm run test-jasmine -- --bundleTest=ie9_test.js
Relevant question from the plotly.py forums: https://community.plot.ly/t/offline-plot-to-div-encode-numpy-data-as-binary-blob/12965
This reminds me that the encoding should be able to handle the 2-dimensional use case as well. I think I'll add an optional
Thanks for the feedback @etpinard, I'm going to circle back around to this in a week or two, after the plotly.py 3.2 release. In terms of naming, I was thinking of keeping "TypedArray" out of the name so that we could eventually add other encodings if useful. How about
I was also wondering if it would make sense to allow @chriddyp would compression at the figure level be useful to Dash or do you already compress at a higher level? Also, what do you use for decompression on the JavaScript side?
👌
Interesting, but perhaps this is out of the scope of plotly.js? How big are common front-end unzip libraries? If we want to have all decode/encode/compress/decompress logic in one place, maybe we should explore placing these
Good point regarding scope. The typed array stuff is pretty tied to Plotly.js, but the compression can happen wherever. I was picturing using the decompression from orca eventually, but it probably makes more sense to just add this as an option to orca down the road (if it proves helpful), rather than Plotly.js.
In this case the decoded value is undefined, but an error won't be thrown.
This can't be a mock anymore because it is not valid as input to Plotly.plot without first passing through Plotly.decode
This function inputs a Plotly object and outputs a copy where all TypedArray instances have been replaced with JSON-serializable representation objects. This function is the inverse of Plotly.decode
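The behavior this commit message describes can be approximated with a small recursive walk, sketched below. This is not the PR's implementation: `encode`, `dtypeOf`, `bytesToBase64` and `isPlainObject` are names chosen for the example, the dtype table covers only a subset of typed array types, and the base64 output reflects the platform's byte order (little-endian on common hardware).

```javascript
// Illustrative dtype table; types not listed here are left untouched.
var DTYPE_CTORS = [
    ['int8', Int8Array], ['uint8', Uint8Array],
    ['int16', Int16Array], ['uint16', Uint16Array],
    ['int32', Int32Array], ['uint32', Uint32Array],
    ['float32', Float32Array], ['float64', Float64Array]
];

// Return the dtype string for a typed array, or undefined if not in the table.
function dtypeOf(v) {
    for(var i = 0; i < DTYPE_CTORS.length; i++) {
        if(v instanceof DTYPE_CTORS[i][1]) return DTYPE_CTORS[i][0];
    }
    return undefined;
}

// Encode raw bytes as base64 (btoa in browsers, Buffer as a Node fallback).
function bytesToBase64(bytes) {
    if(typeof btoa !== 'function') return Buffer.from(bytes).toString('base64');
    var binary = '';
    for(var i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
    return btoa(binary);
}

function isPlainObject(obj) {
    return Object.prototype.toString.call(obj) === '[object Object]' &&
        Object.getPrototypeOf(obj) === Object.prototype;
}

// Return a copy of v with typed arrays replaced by {dtype, value} objects.
// The input is never modified.
function encode(v) {
    var dtype = ArrayBuffer.isView(v) ? dtypeOf(v) : undefined;
    if(dtype) {
        var bytes = new Uint8Array(v.buffer, v.byteOffset, v.byteLength);
        return {dtype: dtype, value: bytesToBase64(bytes)};
    }
    if(Array.isArray(v)) return v.map(encode);
    if(isPlainObject(v)) {
        var result = {};
        Object.keys(v).forEach(function(k) { result[k] = encode(v[k]); });
        return result;
    }
    return v;
}

var fig = {data: [{type: 'scatter', x: new Uint16Array([3, 2, 1]), y: [1, 2, 3]}]};
console.log(JSON.stringify(encode(fig).data[0].x));
// on a little-endian machine: {"dtype":"uint16","value":"AwACAAEA"}
```

Because the walk builds new arrays and objects at every level, the caller's figure keeps its TypedArrays, which matches the "outputs a copy" wording above.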
Ok, I believe I have finished the implementation and testing of the new Handling multi-dimensional arrays will be a bit more complicated, and a bit less useful since Plotly.js doesn't (yet?) use a homogenous multi-dimensional array type internally, so I'd like to put this off for a future PR. Might it be feasible to get this into 1.41? My hope was to use this in plotly.py 3.3, where I'm going to introduce new
@jonmmease after a long talk with @alexcjohnson, we came to a few conclusions:
So, I'm proposing two solutions:
Thanks for taking the time to think through this in detail @etpinard and @alexcjohnson. The approach of eventually creating a separate plotly.py doesn't really need this new npm package for itself (it already has a version of the So I'm happy to plan on pursuing the npm package approach in the not too distant future, but I don't think it needs to be this week. It's not blocking anything, just the next step in efficiency for large datasets. Maybe after 1.42? Thanks!
Thanks very much for your understanding and your hard work @jonmmease
Thanks for info!
['int32', new Int32Array([-2147483648, -123, 345, 32767, 2147483647])],
['uint32', new Uint32Array([0, 345, 32767, 4294967295])],
['float32', new Float32Array([1.2E-38, -2345.25, 2.7182818, 3.1415926, 2, 3.4E38])],
['float64', new Float64Array([5.0E-324, 2.718281828459045, 3.141592653589793, 1.8E308])]
It might be of interest to add BigInts to the list.
It seems that this does not yet support longs. How is plotly dealing with BigInt currently? Is there any support planned?
    return 'float32';
} else if(typeof Float64Array !== 'undefined' && v instanceof Float64Array) {
    return 'float64';
}
It might be of interest to support BigInts as well.
Big integers could be useful for storing time in milliseconds.
@@ -16,6 +16,8 @@ exports.restyle = main.restyle;
exports.relayout = main.relayout;
exports.redraw = main.redraw;
exports.update = main.update;
exports.decode = main.decode;
exports.encode = main.encode;
We may start with underscore (i.e. kind of private) methods here.
For example:
exports._decode = main.decode;
exports._encode = main.encode;
If we want to expose this functionality, then using a more specific name could be considered.
For example:
exports.decodeArray = main.decode;
exports.encodeArray = main.encode;
} else if(Lib.isPlainObject(v)) {
    var result = {};
    for(var k in v) {
        if(v.hasOwnProperty(k)) {
Here we may list the keys using Object.getOwnPropertyNames and loop through them.
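As a rough illustration of this suggestion, the loop could be written along the following lines. `mapOwnProps` is a hypothetical helper name, not part of the PR. One difference worth keeping in mind: `Object.getOwnPropertyNames` also returns non-enumerable own properties, whereas `for...in` with `hasOwnProperty` (and `Object.keys`) only visits enumerable ones.

```javascript
// Copy an object's own properties into a new plain object, applying fn to
// each value. Inherited properties are skipped without needing hasOwnProperty.
function mapOwnProps(obj, fn) {
    var result = {};
    var keys = Object.getOwnPropertyNames(obj);
    for(var i = 0; i < keys.length; i++) {
        result[keys[i]] = fn(obj[keys[i]]);
    }
    return result;
}

var doubled = mapOwnProps({a: 1, b: 2}, function(x) { return x * 2; });
console.log(doubled); // { a: 2, b: 4 }
```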
What's the status of this problem in 2020?
Besides there is
plotly.js/src/traces/surface/convert.js Lines 14 to 15 in a54c502
    } else if(dtype === 'float64' && typeof Float64Array !== 'undefined') {
        return Float64Array;
    }
}
You may consider rewriting this:
function getArrayType(t) {
    return (typeof t !== 'undefined') ? t : undefined;
}
var validInt8Array = getArrayType(Int8Array);
var validUint8Array = getArrayType(Uint8Array);
...
/**
 * Get TypedArray type for a given dtype string
 * @param {String} dtype: Data type string
 * @returns {TypedArray}
 */
function getTypedArrayTypeForDtypeString(dtype) {
    switch(dtype) {
        case 'int8':
            return validInt8Array;
        case 'uint8':
            return validUint8Array;
        ...
    }
}
    return result;
} else {
    return v;
}
You can drop the last else statement and simply return v;
Right now, one could call:
Plotly.newPlot(gd, Plotly.decode({
    'data': [{
        'type': 'scatter',
        'x': {'dtype': 'float64', 'value': 'AAAAAAAACEAAAAAAAAAAQAAAAAAAAPA/'},
        'y': {'dtype': 'float32', 'value': 'AABAQAAAAEAAAIA/'},
        'marker': {
            'color': {
                'dtype': 'uint16',
                'value': 'AwACAAEA',
            },
        }
    }]
}));
But wondering if we could/should expand
Plotly.newPlot(gd, {
    'data': [{
        'type': 'scatter',
        'x': {'dtype': 'float64', 'value': 'AAAAAAAACEAAAAAAAAAAQAAAAAAAAPA/'},
        'y': {'dtype': 'float32', 'value': 'AABAQAAAAEAAAIA/'},
        'marker': {
            'color': {
                'dtype': 'uint16',
                'value': 'AwACAAEA',
            },
        }
    }]
});
That was my original design. I don't remember all of the details, but as I recall it got a little messy to work through what should get stored in I don't have a strong preference if there's a clean way to handle it all internally.
We actually have this comment here: Line 34 in a54c502
And a few more added in bc32981cdb
How may one use this feature when exporting an interactive plot to HTML from Python? The HTML file I got still stores data in ASCII format. I couldn't find proper documentation for using this. I tried to manually replace the data in the HTML file with a base64 string following the format in the given example, which doesn't seem to work...
Overview
This PR implements a proposed approach encoding TypedArrays as primitive representation objects. See some related discussion in #1784.
Background
Plotly.js gained native support for typed arrays in #2388. This provides significantly improved performance when working with large arrays. plotly.py version 3 takes advantage of typed array support by converting numpy arrays into binary buffers on the Python side, and then converting these buffers directly into TypedArrays on the JavaScript side (See #2388 (comment) for more info).
One downside of working with TypedArrays is that there isn't a standard way (at least that I've been able to find) to serialize them to JSON. This PR aims to provide a few options to remedy this problem.
Use cases
There are at least 5 use cases directly relevant to plotly.py where a serialized representation of TypedArrays will be very useful.
1. (This PR is no longer needed for this use case) The reason I'm working on this today is that the lack of serialization support for TypedArrays is why FigureWidget instances containing numpy arrays cannot be rendered statically using nbconvert, nbviewer, and by extension Plotly Cloud. With these changes I could update the JavaScript model for FigureWidget to not include TypedArrays, but instead primitive representation objects that can be serialized.
2. With these changes I could update the plotly.py JSON serializer to encode numpy arrays as base64 strings, which can be up to 10 times faster than the current method of first converting them to lists.
3. This JSON representation can be written to disk and then opened in the JupyterLab Chart Editor more efficiently.
4. This JSON representation should make the plotly.py orca integration more responsive for figures with large numpy arrays.
5. This JSON representation should make Dash more responsive when working with figures with large numpy arrays.
Typed Array Representation
This PR introduces the concept of an encoded TypedArray. An encoded TypedArray is a vanilla JavaScript object that contains dtype and value properties.
- The dtype property is a string indicating the data type of the TypedArray ('int8', 'float32', 'uint16', etc.).
- The value property is a primitive JavaScript object that stores the typed array data. It can be one of the following:
  i. Standard JavaScript Array
  ii. A base64 encoded string
  iii. An ArrayBuffer object
  iv. A DataView object
Encodings (i) and (ii) can be directly serialized to a string representation. Encodings (iii) and (iv) are useful when working with frameworks that already have support for serializing these more primitive binary representations.
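To make the four `value` encodings concrete, here is a minimal sketch of decoding a single representation object into a TypedArray. It is not the PR's code: `DTYPES`, `base64ToArrayBuffer` and `decodeRepr` are hypothetical names, the dtype table lists only a subset of types, and base64 payloads are assumed to be in the platform's byte order (little-endian on common hardware). Unknown dtypes yield `undefined` rather than an error, mirroring the behavior one of the commit messages describes.

```javascript
// Illustrative subset of dtype strings mapped to constructors.
var DTYPES = {
    int8: Int8Array, uint8: Uint8Array,
    int16: Int16Array, uint16: Uint16Array,
    int32: Int32Array, uint32: Uint32Array,
    float32: Float32Array, float64: Float64Array
};

// Decode a base64 string into a fresh ArrayBuffer
// (atob in browsers, Buffer as a Node fallback).
function base64ToArrayBuffer(b64) {
    var binary = (typeof atob === 'function') ?
        atob(b64) : Buffer.from(b64, 'base64').toString('binary');
    var bytes = new Uint8Array(binary.length);
    for(var i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    return bytes.buffer;
}

// Convert one {dtype, value} object to a TypedArray, or undefined if the
// dtype or the value encoding is not recognized.
function decodeRepr(repr) {
    var Ctor = DTYPES[repr.dtype];
    if(!Ctor) return undefined;

    var value = repr.value;
    if(Array.isArray(value)) return new Ctor(value);              // (i)
    if(typeof value === 'string') {                               // (ii)
        return new Ctor(base64ToArrayBuffer(value));
    }
    if(value instanceof ArrayBuffer) return new Ctor(value);      // (iii)
    if(value instanceof DataView) {                               // (iv)
        // Copy the viewed bytes so element alignment is never an issue.
        var start = value.byteOffset;
        return new Ctor(value.buffer.slice(start, start + value.byteLength));
    }
    return undefined;
}

console.log(decodeRepr({dtype: 'uint16', value: 'AwACAAEA'}));
// on a little-endian machine: Uint16Array [3, 2, 1]
console.log(decodeRepr({dtype: 'float64', value: [3, 2, 1]}));
// Float64Array [3, 2, 1]
```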
Decoding and Encoding
A new top-level Plotly.decode function is introduced. This function inputs a JavaScript value, and returns a copy where all encoded TypedArray instances have been decoded into proper TypedArrays.
A new top-level Plotly.encode function is introduced. This function inputs a JavaScript value and returns a copy where all TypedArray instances are encoded as base64-encoded typed array representations.
Future
To support multi-dimensional arrays, an encoded typed array representation object could optionally include a shape parameter, indicating the size of each dimension. Plotly.js does not currently support a homogenous multi-dimensional array type, so initially these would be decoded into nested primitive arrays.
Would it be possible to encode datetime arrays more efficiently with a base64 buffer?
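The idea of a shape parameter decoding into nested primitive arrays could work roughly as sketched below. Nothing here is specified by the PR: `reshape` is a hypothetical helper, and row-major (C-style) ordering of the flat data is an assumption made for the example.

```javascript
// Turn a flat array or typed array into nested plain arrays according to
// `shape`, assuming row-major order (last dimension varies fastest).
function reshape(flat, shape) {
    if(shape.length === 0) throw new Error('shape must have at least one dimension');
    var total = shape.reduce(function(a, b) { return a * b; }, 1);
    if(total !== flat.length) {
        throw new Error('shape does not match data length');
    }

    function build(dim, offset) {
        var size = shape[dim];
        if(dim === shape.length - 1) {
            // Innermost dimension: copy a run of values into a plain Array.
            return Array.prototype.slice.call(flat, offset, offset + size);
        }
        // Number of flat elements spanned by one step along this dimension.
        var stride = 1;
        for(var d = dim + 1; d < shape.length; d++) stride *= shape[d];

        var out = new Array(size);
        for(var i = 0; i < size; i++) {
            out[i] = build(dim + 1, offset + i * stride);
        }
        return out;
    }

    return build(0, 0);
}

console.log(reshape(new Float64Array([1, 2, 3, 4, 5, 6]), [2, 3]));
// → [[1, 2, 3], [4, 5, 6]]
```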