Paper Discussion 15a: Chaos: a System for Criticality-Aware, Multi-core Coordination #111


@Others

Others commented Apr 12, 2020

paper link

Have fun :)

@pcodes

pcodes commented Apr 13, 2020

Reviewer Name: Pat Cody
Review Type: Critical

Problem Being Solved

Keeping mixed-criticality systems secure when running on a single multicore device is challenging due to the complexities of inter-core coordination. A frequent problem is interference, where lower-assurance code generates too many interrupts and degrades the performance of the high-assurance code. However, running each assurance level in its own VM carries high overhead costs that make it impractical.

Main Contributions

CHAOS provides a separate runtime environment with the bare necessities (simple bear necessities) to devirtualize high-assurance/high-criticality code, such that it is not tied to the execution of the lower-assurance code. Instead, the two runtimes communicate via proxies. If the code running in ChaosRT (the high-assurance code) needs data available from the low-assurance code, it can make a request using the proxies. Chaos was implemented within NASA's Core Flight System (cFS), and it relies on a handful of key features provided by the Composite OS.
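To make the proxy mechanism a bit more concrete, below is a minimal sketch (assuming hypothetical names like proxy_request and shared_ring, not the paper's actual API) of how a request from the ChaosRT side could be handed to the low-assurance side over a shared-memory ring, with the rate-limited cross-core notification reduced to a comment.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RING_SLOTS 8

struct request {
    uint32_t id;
    char     payload[32];
};

/* Single-producer/single-consumer ring shared between the two runtimes. */
struct shared_ring {
    struct request slots[RING_SLOTS];
    atomic_uint    head;   /* advanced by the high-criticality (ChaosRT) side */
    atomic_uint    tail;   /* advanced by the low-assurance side              */
};

static struct shared_ring ring;   /* stand-in for a shared-memory region */

/* ChaosRT side: enqueue a request without blocking on the low-assurance
 * subsystem; a rate-limited cross-core notification would follow. */
static int proxy_request(uint32_t id, const char *msg)
{
    unsigned int head = atomic_load(&ring.head);
    unsigned int tail = atomic_load(&ring.tail);

    if (head - tail == RING_SLOTS) return -1;   /* ring full: fail fast, never block */

    struct request *r = &ring.slots[head % RING_SLOTS];
    r->id = id;
    strncpy(r->payload, msg, sizeof(r->payload) - 1);
    r->payload[sizeof(r->payload) - 1] = '\0';
    atomic_store(&ring.head, head + 1);
    /* here: send a rate-limited notification (e.g., an IPI) to the other core */
    return 0;
}

/* Low-assurance side: drain pending requests and service them using the
 * feature-rich subsystem (file systems, networking, ...). */
static void proxy_service(void)
{
    unsigned int tail = atomic_load(&ring.tail);

    while (tail != atomic_load(&ring.head)) {
        struct request *r = &ring.slots[tail % RING_SLOTS];
        printf("servicing request %u: %s\n", (unsigned int)r->id, r->payload);
        tail++;
    }
    atomic_store(&ring.tail, tail);
}

int main(void)
{
    proxy_request(1, "read shared telemetry table");
    proxy_request(2, "log event to flash");
    proxy_service();
    return 0;
}
```

The key property this sketch tries to illustrate is that the high-criticality side never waits on the low-assurance side: it either enqueues the request or fails immediately, keeping its execution time bounded.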

Questions

  • It seems like this system requires dedicating a core to the high-assurance code. What are the tradeoffs of this?
  • Could this scale to multiple assurance levels (does it even make sense to)?

Critiques

  • Dense: I'd love to know the JpS (Jargon per Sentence) count on this; there was a lot to understand
  • Given how deeply technical this topic is, maybe a higher-level example or walkthrough would have helped explain some of the benefits/use-cases.

@samfrey99

Reviewer: Sam Frey
Review Type: Skim

Problem
To reduce size and power consumption, many embedded systems requiring multiple execution streams have transitioned away from multiple single-core processors in favor of a single multi-core processor. Having a multi-core processor adds complexity for systems that run software with differing levels of criticality. Interrupts from low criticality software can interfere with high criticality software.

Contributions
Chaos lowers interference by removing high criticality software from the primary subsystem and allowing it to run in a minimal ChaosRT environment without interruption. Chaos improved processing latency by a factor of 2.7 compared to a standard Linux real-time environment while also improving isolation of critical code.

@jacobcannizzaro

Reviewer: Jacob Cannizzaro
Type: Comprehend/Skim

Problem:

With the trend of switching from multiple single-core processors to a single multicore processor, less critical tasks can cause a lot of interference by taking up processing time, and they also introduce more contention for shared resources. This adds a lot of overhead when trying to run all of this code, no matter the assurance level, in one place.

Main Contributions:

This paper uses Chaos to devirtualize some tasks. By taking highly critical tasks and running them in a minimal ChaosRT environment, the VMs don't need as much overhead, and communication can be handled with proxies. Communication between the lower-priority tasks and the devirtualized, higher-criticality functionality now flows through proxies that are able to bound interference and latency. This reduces the worst-case latency of the system by a factor of 3.5 compared to the Linux equivalent.

@anguyen0204

anguyen0204 commented Apr 13, 2020

Reviewer: Andrew Nguyen
Review Type: Skim

Problem
There is a need to minimize the size and power of embedded systems. Many systems consolidate their execution streams onto a single processor to achieve this, making it difficult to coordinate high- and low-criticality tasks. The paper introduces Chaos, which devirtualizes high-criticality tasks.

Contributions
Chaos reduces interference through its devirtualization process, as previously mentioned. Tasks are isolated and moved into the ChaosRT environment, which supports high-assurance and high-criticality tasks; rate-limiting servers are also used to bound interference. The paper then looks into the design, the scenarios of interference, synchronous communication, and the implementation.

Questions

  1. How exactly do TCaps work?
  2. Why was synchronous communications used in thread migration for Composite and not asynchronous?

@hjaensch7

Reviewer: Henry Jaensch
Review Type: Skim

Problem Being Solved

Embedded systems are moving toward using one chip for all of the tasks on the system. This paper proposes a way to maintain criticality and high resource efficiency when mixing many components and functions on the same processor. The priority here is to maintain efficient feature rich user applications while also providing isolation guarantees for high criticality processes.

Main Contributions

The paper introduces CHAOS, a system for de-virtualizing high-criticality subsystems so that deadlines can be met without interference from other, lower-criticality applications. This is achieved by providing ChaosRT, a bare-bones runtime that allows predictable execution of tasks. To support communication between mixed-criticality tasks, proxies are used to maintain feature richness.

Question

  1. What's the difference between bounded asynchronous communication and synchronous communication?

  2. Why was Linux one of the choices for comparison here? Mixed assurance and criticality yes, but Linux doesn't make any real-time guarantees.

@themarcusyoung

Reviewer: Marcus Young
Review Type: Skim

Problem
Modern embedded systems are increasingly using single multi-core processors that are asked to process extremely complex tasks with different criticality levels. Since these systems are using a single multi-core processor instead of multiple single-core processors, there is a need to extract high criticality tasks to run them in a minimal runtime environment in order to improve human or equipment safety.

Contributions
Chaos removes high-criticality software from the primary subsystem and puts it in ChaosRT, a minimal runtime environment. Chaos improved processing latency for a sensor/actuation loop in satellite software experiencing inter-core interference by a factor of 2.7, while reducing worst-case latency by a factor of 3.5 over a real-time Linux variant.

@rachellkm

rachellkm commented Apr 13, 2020

Reviewer: Rachell Kim
Review Type: Skim

Problem Being Solved:

Embedded systems using multi-core processors to support mixed-criticality and multi-assurance levels often face difficulty in enforcing strict isolation between subsystems. Because shared abstractions between cores may trigger interference, it is important to protect high-criticality tasks from faults caused by subsystems of low-assurance and low-criticality. Moreover, systems must also maintain high-confidence in correctness while supporting feature-rich software, and this condition is considered to be difficult to maintain with current technology.

Main Contributions:

The authors of this paper propose a system called Chaos, which aims to remove interference caused by inter-core coordination in multi-core systems via devirtualization of high-criticality tasks. High-criticality tasks are moved into an execution environment called ChaosRT, thereby allowing predictable execution with minimal interference from shared subsystems. This paper also outlines example situations in which shared memory and inter-core coordination may impact the execution of high-criticality code.

Questions:

  1. Could the proxies and ChaosRT environment introduce more points of failure in place of removing interference?

@rebeccc

rebeccc commented Apr 13, 2020

Reviewer Name: Becky Shanley
Review Type: Critical

Problem Being Solved

Embedded systems struggle to balance minimizing SWaP (Size, Weight, and Power) against providing time guarantees and resources to the highest-priority processes. In embedded systems this problem is much more severe because failures of high-priority tasks can have detrimental impacts on the physical world and on human safety. It's a difficult problem to solve because the high-priority tasks must work with the lower-priority tasks to provide many real-time functionalities, so the two cannot afford to be completely separated.

Main Contributions

CHAOS is a devirtualization system used to guarantee that high-priority tasks have access to the resources they need and can meet their low-latency requirements. It achieves this by extracting high-priority tasks into CHAOSrt, a real-time environment that is separated from the interference of potentially low-assurance tasks. In all, this paper contributes to the problem domain by:

  1. evaluating the impact low-assurance tasks can have on high-priority tasks in the same runtime
  2. introducing the devirtualization runtime CHAOS
  3. introducing a rate-limiting technique for Inter-Processor Interrupts (IPIs) that bounds both the interference from this communication method and its latency (see the sketch after this list)
  4. evaluating CHAOS on reliability-focused systems
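
As a rough illustration of the IPI rate-limiting contribution above, here is a minimal sketch of how bounding interference could work: a notification is only delivered while a per-window budget remains, so the high-criticality core sees at most a fixed number of interrupts per window. All names and constants (ipi_limiter, IPI_BUDGET, WINDOW_CYCLES) are hypothetical, not the paper's actual mechanism or values.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IPI_BUDGET    4           /* max notifications per replenishment window */
#define WINDOW_CYCLES 1000000ULL  /* window length in (hypothetical) cycles     */

struct ipi_limiter {
    uint64_t window_start;        /* start of the current window                */
    uint32_t sent_in_window;      /* notifications already delivered            */
    bool     deferred;            /* a coalesced notification is still pending  */
};

/* Returns true if a notification may be delivered now; otherwise it is
 * coalesced and delivered once the budget is replenished. */
static bool ipi_try_send(struct ipi_limiter *l, uint64_t now)
{
    if (now - l->window_start >= WINDOW_CYCLES) {   /* replenish the budget */
        l->window_start   = now;
        l->sent_in_window = 0;
    }
    if (l->sent_in_window < IPI_BUDGET) {
        l->sent_in_window++;
        /* here: actually raise the inter-processor interrupt */
        return true;
    }
    l->deferred = true;                             /* coalesce the notification */
    return false;
}

int main(void)
{
    struct ipi_limiter lim = { 0 };

    for (uint64_t t = 0; t < 6; t++)
        printf("notification %llu -> %s\n", (unsigned long long)t,
               ipi_try_send(&lim, t) ? "IPI sent" : "deferred");
    return 0;
}
```

Under a scheme like this, worst-case interference on the high-criticality core is bounded by IPI_BUDGET interrupts per window, at the cost of added latency for deferred notifications.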

Questions

  1. Why is there an evaluation of CHAOS on Linux at all? Since most embedded devices aren't running Linux, how meaningful is the comparison of CHAOS on a real-time OS versus CHAOS on Linux?
  2. Since IPI and Shared Memory both have their pros and cons, why was it decided to go with IPI? Rate-limiting reads like a complex process, and I'm curious whether solving the problems of shared memory was ever explored in the same way that solving the problems with IPIs was.

Critiques

  1. As mentioned before, this paper is dense. It was dense to the point of being extremely difficult to get anything meaningful out of on the first skim. It got to a point where I was checking references and googling words at least once per paragraph.
  2. The use of the word "devirtualization": it's explained in a footnote, and after a deep read I get why it's called this, but it also feels unnecessarily confusing. As a person who tends to google everything I don't understand, I was very confused on my first skim because I missed the footnote and was trying to apply compiler devirtualization to this paper, which is not helpful.

@bushidocodes

Reviewer: Sean McBride

Review Type: Simple Skim

Problem Being Solved:

How can one consolidate mixed criticality workloads onto shared multi-core systems? Also, how can one leverage the more differentiated QOS attributes of a modern RTOS to provide better assurance guarantees to industry-standard software systems that run mixed criticality subsystems on a shared POSIX backend?

Main Contributions

  1. Modifies Composite, an existing RTOS, to provide ChaosRT, a minimal sandbox that can run high-assurance code and transparently communicate with a low-assurance sandbox via software proxies.
  2. Uses devirtualization to extract high-criticality sections from legacy systems running on non-specialized operating systems that combine mixed-criticality workloads, and moves them into ChaosRT.
  3. Applies a rate-limiting technique to preemptive notifications passed via inter-processor interrupts from rump software in a low-criticality sandbox to the sandbox running high-criticality devirtualized work, in order to limit interference.
  4. Uses TCaps (delegation of time) to coordinate the different user-level schedulers in the different sandboxes (see the conceptual sketch after this list).
  5. Demonstrates the application of this technique to the cFS software stack used in NASA missions (which depends on a POSIX backend), using a NetBSD rump kernel.
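
Since TCaps come up in a few of the questions in this thread, the following is a purely conceptual sketch of time delegation between user-level schedulers, loosely in the spirit of TCaps; the struct and functions (tcap_delegate, tcap_expend) are invented for illustration and are not the Composite API.

```c
#include <stdint.h>
#include <stdio.h>

/* Conceptual "temporal capability": a scheduler can only run threads while
 * it holds budget, and budget is obtained by delegation from another
 * scheduler. This is an illustrative model only. */
struct tcap {
    const char *owner;     /* which scheduler holds this capability     */
    uint64_t    budget;    /* remaining execution time, in cycles       */
    uint32_t    prio;      /* priority at which the budget may be spent */
};

/* Transfer part of one scheduler's budget into another scheduler's tcap. */
static int tcap_delegate(struct tcap *from, struct tcap *to,
                         uint64_t cycles, uint32_t prio)
{
    if (from->budget < cycles) return -1;  /* cannot delegate what you lack */
    from->budget -= cycles;
    to->budget   += cycles;
    to->prio      = prio;
    return 0;
}

/* Expend budget as a thread runs; execution stops once the budget is gone. */
static uint64_t tcap_expend(struct tcap *t, uint64_t cycles)
{
    uint64_t used = cycles < t->budget ? cycles : t->budget;
    t->budget -= used;
    return used;
}

int main(void)
{
    struct tcap root  = { "root-sched",  1000000, 0 };
    struct tcap chaos = { "chaos-sched", 0,       0 };

    tcap_delegate(&root, &chaos, 400000, 1);
    printf("chaos ran for %llu cycles, %llu left\n",
           (unsigned long long)tcap_expend(&chaos, 450000),
           (unsigned long long)chaos.budget);
    return 0;
}
```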

@ericwendt

Reviewer Name: Eric Wendt
Review Type: Critical

Problem Being Solved:
Some of the fundamental problems that need addressing for IoT devices are size, weight, and power. Finding a good balance among these requirements is exceedingly difficult and requires combining techniques from both hardware and software. This paper dives into software solutions, focusing on cutting down interference between high-criticality tasks and lower-criticality tasks.

Main Contributions
Fortunately, many of the main contributions are laid out under a distinct sub-header in the paper.

  1. Outlining how low-criticality tasks can interfere with high-criticality tasks. Useful graphs are included to demonstrate the impact.
  2. Detailing a technique called devirtualization, which cuts down on interference between high- and low-criticality tasks without compromising the dependencies between the two.
  3. An IPI rate-limiting technique that allows CHAOS to bound IPI latency.
  4. Evaluation of CHAOS on multiple operating systems.

Questions:

  • This paper mentions that Composite uses wait-free operations based on raw atomic instructions. What does this mean? How is this different from Linux?
  • What is thread migration?

Critiques:

  • There are a lot of terms in this paper I am not familiar with. I assume someone more knowledgeable would have an easier time with this, but as an outsider it's a little bit of a headache.
  • Performance metrics and charts at the end were done very well.
  • A coverage of security would lend itself nicely here.

Sorry for the late post, lost power for a few hours.

@RyanFisk2

Reviewer: Ryan Fisk

Review Type: Critical

Problem Being Solved

Embedded systems are increasingly required to run many different processes at varying degrees of criticality. High-criticality systems need to run at a high priority to protect human or equipment safety, whereas lower-criticality systems are nice to have, but not as important. Due to resource constraints on IoT devices, these processes have to be scheduled by the same processor, and the underlying hardware overhead for deciding which processes run can cause interference with high priority tasks.

Contributions

The paper introduces ChaosRT, a minimal runtime environment that removes high-criticality tasks from the management system of the VM they would normally run on, and thus minimizes or eliminates interference from lower-priority tasks. Tasks that are removed from the VM can still communicate with the rest of the system using proxies that handle communication between the devirtualized, higher-criticality tasks and the lower-priority subsystems.

Questions

  1. If a devirtualized system required sensor readings or some other data that was gathered in a lower priority task, wouldn't it still have to wait for that task to complete and for the proxy to get the information?

  2. Why does devirtualization work so well for this? I'm confused as to how this reduces interference from other tasks.

  3. What happens if there is more IPI interference than allowed messages for a certain task?

Critiques

  1. The example of the NASA cFS system they used to explain the problems with virtualization was really helpful; I would've liked to see an example using ChaosRT when they talked about the implementation.

  2. What security concerns are there with the proxy? Can it be spoofed to send bad data to a safety critical system?

@huachuan

Reviewer: Huachuan Wang
Review Type: Skim

Problem being solved

Embedded systems are increasingly required to provide both complicated feature-sets and high confidence in the correctness of mission-critical computations. Functionality traditionally spread across dedicated processors is being consolidated onto less expensive, more capable multi-core commodity processors, which makes these systems very complicated. Supporting both feature-rich, general computation and high-confidence physical control is difficult.

Contributions

This paper presented Chaos, which can effectively use the increased throughput of multi-core machines while ensuring the necessary isolation between tasks of different criticalities and assurance levels. Chaos also devirtualizes high-criticality tasks to remove virtualization overhead.

  1. Devirtualization to extract high-criticality subsystems from lower-assurance legacy systems while maintaining functional dependencies and predictable inter-core message passing mechanisms.
  2. An IPI rate-limiting technique that enables Chaos to bound IPI interference and the latency of notifications for inter-core coordination.
