Sylvan Clebsch
Microsoft Research
June 19th, 2017
Presenting at ICOOOLPS 2017, Barcelona
Multicore CPUs
Coprocessor packages (Xeon Phi, Single-chip Cloud Computer)
FPGAs
GPUs / GPGPUs
- Software asks hardware to emulate an outdated execution model
- Sequential execution
- Coherent shared memory
- This makes hardware slower and less innovative
- And yet hardware gets faster
- And it gets "more correct"
- What could hardware do if we weren't holding it back?
Non-sequentiality comes in many forms:
- Instruction pipelining
- Vector processing
- Parallel processors
Compilers are getting better at vectorisation and pipelining
- But it's generally a black box: the programmer gets little to no feedback
- Compilation strategies differ by target: can languages offer mechanisms to specify the intended result? Pony doesn't!
- Real-world instructions per cycle tend to be much lower than ideal
Have sequential languages resulted in an inexpressive instruction set?
- Should languages be able to express pipelined computation syntactically?
- What about branch prediction? Is it feasible to express non-dependent computation shared between branches?
- What about HT/SMT? Are there better ways to hide memory latency? Can language-level memory dependency information improve scheduling?
There's no such thing: it's a distributed system
- The hardware gods jump through rings of fire to let us pretend we still have simple shared memory architecture
- But under the hood, it's message passing
- A large amount of transistor budget goes into maintaining the illusion
Why can't we solve this in software?
- Efficient actor-model runtimes can treat parallelism like a distributed system. Pony tries to do this!
- (Or CSP, of course)
- That's good, but not good enough to hand back all that transistor budget
For example, can we eliminate the need for cache coherent hardware?
There's no such thing: it's a distributed system
- I should have reused the other slide entirely
- The costs are more obvious with co-processors
- But even here, the hardware gods are trying to rescue us with things like heterogeneous memory management
We tend to write multi-language programs for GPU and FPGA: C++ with OpenCL, for example
- They don't type check together or optimise together
- Here come the hardware gods: ARM big.LITTLE gives us heterogeneous hardware that can cope with our homogeneous languages and runtimes
Distributed systems are also often written as a collection of programs that don't type check together or optimise together
- Datacenter computing can't emulate sequential instructions and coherent shared memory
- Are distributed operating systems hard because we lack distributed languages to express them in?
Non-uniform memory
Cache effects
Cache coherency
Prefetching
Memory is message passing
- It's another distributed system
- It's non-programmable
- It tries to guess what we want
- We try to guess what it wants
While the memory system guesses what our program wants, our program guesses what the memory system wants by optimising for its expected behaviour
- With no explicit interface, new CPUs can invalidate program assumptions
- For example: x86-64 adjacent cache line prefetch
Processors expose little (ARM) to no (Intel) cache interface (modulo non-temporal writes)
- Domain knowledge in the program (whether static or dynamic) cannot be used to optimise cache behaviour
- NUMA exposes more control: by moving the code to the data rather than the data to the code, we (sometimes) make big gains
- Fine-grained non-processor cache control is critical to databases and many web applications
Why is processor cache a black box?
The coherent shared memory illusion is brittle
- The C11 memory model is a huge improvement, expressing more precise behaviour
- But that behaviour remains emulated: the memory model is an API to a message passing system
- For example, a relaxed atomic exchange: what kind of messages does it generate?
Fine control over program memory layout can give huge performance benefits
- Leverage hardware prefetching and indirect branch prediction
- Express arrays of structs as structs of arrays
- Use allocation locality to express temporal locality: adjacent addresses are likely to be read in sequence
If we could exploit domain knowledge in the program with processor instructions to guide prefetching, could we have more flexible memory layouts with the same performance?
Is this like a database query plan?
Cache effects can dominate algorithm performance
- This is true for everything from L1D to network to disk
- For example, linear vs. quadratic hash table probing: careful benchmarking is needed to find the best performance for a specific use case
Could domain knowledge improve cache behaviour?
RDMA over Ethernet
Infiniband
iWARP
100 GbE
Optical interconnect
Distributed hardware doesn't provide a shared memory abstraction
- MPI Remote Memory Access tries to provide one
- Some PGAS languages do the same
- Two-sided RDMA for message passing can outperform a memory abstraction: FaSST
We have a chance to make languages suit the hardware before the hardware gods try to suit existing languages
First class language support for distribution
- Can we stop RDMA from becoming a hardware shared memory abstraction?
- Can we convince the hardware gods we know what we're doing?
We're gonna have to prove it
Formal verification for hardware is more advanced than for software
- Alastair Reid: End-to-End Verification of ARM Processors with ISA-Formal
- Intel investment following the FDIV bug
- Still much to do, particularly regarding memory models
What can we remove from programming languages to make proofs easier?
We use hardware to isolate software faults
- Per-process virtual address spaces
- Hardware assisted OS virtualisation
Why? And at what cost?
OSes themselves are faulty software in need of isolation
- TPM
- SGX
- CHERI instruction set
Example Hidden Costs: Context Switching
We rely on hardware enforced virtual address spaces to compensate for lack of software memory safety
If a cache is indexed or tagged by virtual addresses, it must be entirely invalidated on a virtual address space change
What non-interference properties are necessary to avoid this?
- Memory safety, including no synthetic pointers
- Data-race freedom
- Static typing? Not needed: can be done dynamically
What properties other than non-interference are needed?
- Capability security: strongly contained faults across processes
- Checkable capability policies: the equivalent of formal verification for capabilities
FORTH gives fine control over a stack machine
- No assumed shared memory model
- But threaded code doesn't seem to work well with modern processors
- No low level memory safety
How can we leverage modern and future hardware?
How can we convince the hardware gods we can produce programming models that don't need to be lied to?
Are distributed systems a way in?
Please don't tell anyone hash table lookup is O(1)
Thanks for listening!