Backend decoupling and typed memory interface #63

Open
alexandermorozov opened this issue May 2, 2016 · 0 comments

@alexandermorozov
Contributor

I've implemented the decoupling feature discussed in #37. The main commit can be viewed here. Below is the commit message for convenience:

Change SharedTensor::read() signature from
fn read(&self, device: &DeviceType) -> Result<&MemoryType, ...>
into
fn read<D: IDevice>(&self, device: &D) -> Result<&D::M, ...>
The new signature provides a type-level guarantee that if a Cuda device is passed
into read(), then it will return Cuda memory (and not Native or OpenCL). The
unwraps that were previously required (.as_native().unwrap()) are no longer
needed, and the code is clearer and more concise.
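
To make the call-site difference concrete, here is a minimal self-contained sketch; MemoryType, NativeMemory, CudaMemory and as_native() are stand-ins modeled on the description above, not the exact Collenchyma types:

// Stand-in memory types for the sketch.
struct NativeMemory(Vec<u8>);
struct CudaMemory(u64); // stand-in for a raw device pointer

// Old style: read() returned an enum over all frameworks, so the caller
// had to pick the variant it expected and unwrap it at runtime.
enum MemoryType {
    Native(NativeMemory),
    Cuda(CudaMemory),
}

impl MemoryType {
    fn as_native(&self) -> Option<&NativeMemory> {
        match self {
            MemoryType::Native(m) => Some(m),
            _ => None,
        }
    }
}

fn old_call_site(mem: &MemoryType) -> &NativeMemory {
    // This runtime unwrap is exactly what the generic signature removes:
    // read() on a Native device already returns the native memory type.
    mem.as_native().unwrap()
}

fn main() {
    let mem = MemoryType::Native(NativeMemory(vec![0u8; 16]));
    println!("native buffer: {} bytes", old_call_site(&mem).0.len());
}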

Internally, SharedTensor uses the Any type to store objects of different types
uniformly. Synchronization between memories is also done through a type-erased
interface. This makes it possible to define a new framework in an external
crate, or to extract the Cuda and OpenCL frameworks into their own crates,
though the error types would require some additional work.
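
As an illustration of how the type-erased storage and the generic read() can fit together, here is a minimal self-contained sketch (the names and structure are assumptions for illustration, not the actual implementation):

use std::any::Any;

// Each device type declares its associated memory type.
trait IDevice {
    type M: Any;
    fn alloc(&self, size: usize) -> Self::M;
}

struct Native;
struct NativeMemory(Vec<u8>);

impl IDevice for Native {
    type M = NativeMemory;
    fn alloc(&self, size: usize) -> NativeMemory {
        NativeMemory(vec![0; size])
    }
}

// SharedTensor keeps one type-erased copy per device it is allocated on.
struct SharedTensor {
    copies: Vec<Box<dyn Any>>,
}

impl SharedTensor {
    fn read<D: IDevice>(&self, _device: &D) -> Option<&D::M> {
        // downcast_ref() recovers the concrete memory type tied to D,
        // so callers never see an untyped enum.
        self.copies.iter().find_map(|c| c.downcast_ref::<D::M>())
    }
}

fn main() {
    let dev = Native;
    let tensor = SharedTensor { copies: vec![Box::new(dev.alloc(32))] };
    let mem: &NativeMemory = tensor.read(&dev).unwrap();
    println!("{} bytes on the native device", mem.0.len());
}

The downcast is where the small extra runtime overhead discussed below comes from, while the public interface stays fully typed.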

Use of "dynamic typing" has drawbacks -- mainly slightly larger runtime
overhead. Before this patch benchmarks showed that SharedTensor::read() takes
19-22ns, now it takes 23-26ns. For comparison, minimal synchronized CUDA
operation will take about 10-40us. Small NN layers on CPU are much faster,
e.g. 10-input softmax layer takes about 500ns. Still, in typical NNs overhead
looks negligible, and I think it's fair tradeoff for code clarity and better
decoupling.

Here are the actual benchmarks. Before:

test bench_shared_tensor_access_time_first                            ... bench:          19 ns/iter (+/- 2)
test bench_shared_tensor_access_time_second                           ... bench:          21 ns/iter (+/- 0)

After:

test bench_shared_tensor_access_time_first                        ... bench:          23 ns/iter (+/- 0)
test bench_shared_tensor_access_time_second                       ... bench:          26 ns/iter (+/- 3)

What's your opinion on it?
