I've implemented the decoupling-related feature from #37. The main commit can be viewed here; below is the commit message for convenience:
Change SharedTensor::read() signature from fn read(&self, device: &DeviceType) -> Result<&MemoryType, ...>
into fn read<D: IDevice>(&self, device: &D) -> Result<&D::M, ...>
The new signature provides a type-level guarantee that if a Cuda device is passed
into read(), then it will return Cuda memory (and not Native or OpenCL memory).
The additional unwraps that were previously required (.as_native().unwrap()) are
no longer needed, and the code is clearer and more concise.
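For illustration, here is a minimal, self-contained sketch of the associated-type pattern behind the new signature; the names (FlatBox, CudaMemory, the String error) are placeholders, not the crate's actual definitions:

```rust
// Sketch only: placeholder types, not the real API.
trait IDevice {
    /// Concrete memory type handled by this device.
    type M;
}

struct Native;
struct FlatBox;                 // stand-in for native host memory
impl IDevice for Native { type M = FlatBox; }

struct Cuda;
struct CudaMemory;              // stand-in for CUDA device memory
impl IDevice for Cuda { type M = CudaMemory; }

struct SharedTensor;

impl SharedTensor {
    // The return type is fixed by the device's associated type, so passing
    // a Cuda device can only ever yield &CudaMemory -- no .as_native()
    // style unwrapping is possible or necessary.
    fn read<D: IDevice>(&self, _device: &D) -> Result<&D::M, String> {
        Err("per-device copy lookup omitted in this sketch".into())
    }
}
```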
Internally, SharedTensor uses the Any type to store objects of different types
uniformly. Synchronization between memories is also done through a type-erased
interface. This makes it possible to define a new Framework in an external
crate, or to extract the Cuda and OpenCL frameworks into their own crates,
though the error types would require some additional work.
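A self-contained sketch of the Any-based storage idea follows (placeholder names again; the real SharedTensor also handles synchronization between copies, which is omitted here):

```rust
use std::any::Any;

// Per-device memory copies are held as Box<dyn Any> and downcast back to
// their concrete type on access. Placeholder types, not the real internals.
struct NativeMemory { data: Vec<u8> }

struct SharedTensor {
    copies: Vec<Box<dyn Any>>,  // one entry per device copy
}

impl SharedTensor {
    fn read_native(&self) -> Option<&NativeMemory> {
        // downcast_ref() returns None on a type mismatch; the typed
        // read<D: IDevice>() wrapper rules such a mismatch out statically.
        self.copies.get(0)?.downcast_ref::<NativeMemory>()
    }
}

fn main() {
    let t = SharedTensor {
        copies: vec![Box::new(NativeMemory { data: vec![0u8; 16] })],
    };
    assert_eq!(t.read_native().unwrap().data.len(), 16);
}
```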
Use of "dynamic typing" has drawbacks -- mainly slightly larger runtime
overhead. Before this patch benchmarks showed that SharedTensor::read() takes
19-22ns, now it takes 23-26ns. For comparison, minimal synchronized CUDA
operation will take about 10-40us. Small NN layers on CPU are much faster,
e.g. 10-input softmax layer takes about 500ns. Still, in typical NNs overhead
looks negligible, and I think it's fair tradeoff for code clarity and better
decoupling.
Here are actual benches, before:
test bench_shared_tensor_access_time_first ... bench: 19 ns/iter (+/- 2)
test bench_shared_tensor_access_time_second ... bench: 21 ns/iter (+/- 0)
after:
test bench_shared_tensor_access_time_first ... bench: 23 ns/iter (+/- 0)
test bench_shared_tensor_access_time_second ... bench: 26 ns/iter (+/- 3)
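For context, these numbers come from the nightly bench harness; a hedged sketch of what such a benchmark looks like (the tensor setup is only indicated in comments, and the real benchmark code lives in the repository):

```rust
#![feature(test)]
extern crate test;
use test::Bencher;

#[bench]
fn bench_shared_tensor_access_time_first(b: &mut Bencher) {
    // Construct a SharedTensor already synchronized to the device under
    // test (setup omitted here).
    b.iter(|| {
        // black_box(tensor.read(&device).unwrap())
    });
}
```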
What's your opinion on it?