update
yunwei37 committed Sep 16, 2024
1 parent 1b1591b commit 160d96f
Showing 7 changed files with 17,316 additions and 123 deletions.
85 changes: 83 additions & 2 deletions README.md
@@ -1,4 +1,85 @@
# blog generator
# Code-Survey: Uncovering Insights in Complex Systems with LLM

Make a company and the blog with markdown files.
- **Do we really know how complex systems like Linux work?**
- **How can we understand the design choices and evolution of a super-complex system like the Linux kernel?**

**Code-Survey** is here to change that:

- **No human could do this before, but AI can.**
- **No chatbots, document search, or code generation: everyone else is already doing that.**

Code-Survey helps you explore and analyze the world's largest and most intricate codebases, like the Linux kernel. By carefully **designing a survey** and **transforming** `unstructured data`, such as commits and mailing lists, into organized, `structured, easy-to-analyze data`, Code-Survey makes it simpler to uncover valuable insights in modern complex software systems.

With the power of AI and Large Language Models (LLMs), you can ask questions, run queries, and gain a deeper understanding of how systems evolve over time. AI agents can also help with the analysis. Whether you're a developer, researcher, or enthusiast, Code-Survey bridges the gaps between design, implementation, maintenance, and security, making complex systems more accessible.

**Let's do Code-Survey!**

## Linux-bpf Dataset

The **Linux-bpf dataset** focuses on the eBPF subsystem and is continuously updated via CI. The dataset includes:

- **680+ expert-selected commits**: Features, commit details, types (Map, Helper, Kfunc, Prog, etc.).
- **12,000+ BPF-related commits**: LLM Agent surveys and summaries.
- **150,000+ BPF subsystem-related emails**: Automatically analyzed by LLM Agents.

The simplest way to see how this data works is to **upload the CSV to ChatGPT** (or another platform) and ask questions so it can do the analysis for you!

To see more details, check the analysis in [report_ebpf.md](docs/report_ebpf.md).
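
As a local alternative to uploading the CSV, the same kind of question can be asked with Python's standard library. This is a minimal sketch: the column name `commit_classification` and the sample rows are assumptions for illustration, so check the header of `data/bpf_commits.csv` for the real column names.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count how many rows share each value in the given column."""
    reader = csv.DictReader(csv_file)
    # Rows where the column is missing or empty are skipped.
    return Counter(row[column] for row in reader if row.get(column))


# Hypothetical sample standing in for data/bpf_commits.csv.
SAMPLE = io.StringIO(
    "commit_id,commit_classification\n"
    "a1,It's a bug fix.\n"
    "b2,It's a new feature.\n"
    "c3,It's a bug fix.\n"
)

counts = count_by_column(SAMPLE, "commit_classification")
for label, n in counts.most_common():
    print(f"{n:5d}  {label}")
```

To run it on the real dataset, replace `SAMPLE` with a file opened via `open("data/bpf_commits.csv", newline="")`.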

## Workflow / Methodology



Our approach follows a well-defined workflow:

1. **Human Experts or LLM Agents design surveys**: Tailored questions for each commit or email.
2. **LLM Agents complete the surveys**: Answering yes/no, tagging relevant data, and summarizing key information. **This is the key step that turns unstructured data into structured data.**
3. **Human Experts or LLM Agents evaluate results**: Ensuring accuracy and uncovering new insights from the results.
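
The three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: `Question`, `complete_survey`, `evaluate`, and the stand-in `fake_llm` are hypothetical names, and a real run would replace `fake_llm` with an actual model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Question:
    id: str
    kind: str  # "yes_no" or "summary" in this sketch
    text: str


def complete_survey(item: str, questions: list[Question],
                    ask: Callable[[str], str]) -> dict[str, str]:
    """Step 2: produce one structured row (question id -> answer) per item."""
    return {q.id: ask(f"{q.text}\n\n{item}").strip() for q in questions}


def evaluate(row: dict[str, str], questions: list[Question]) -> list[str]:
    """Step 3: return ids of answers that do not match the expected form."""
    problems = []
    for q in questions:
        if q.kind == "yes_no" and row[q.id].lower() not in ("yes", "no"):
            problems.append(q.id)
    return problems


# Step 1: design the survey.
survey = [
    Question("is_bpf_link", "yes_no", "Was this commit related to bpf_link?"),
    Question("summary", "summary", "Summarize the main purpose of this commit."),
]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical)."""
    return "yes" if "related to" in prompt else "Adds a link-based attach API."


row = complete_survey("bpf: add bpf_link support", survey, fake_llm)
print(row, evaluate(row, survey))
```

Each commit or email becomes one dictionary keyed by question id, which is what makes the result easy to store as a CSV row and query later.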

### Best Practices for Designing Surveys:

- The survey acts as both prompt and action plan for LLM Agents.
- Focus on yes/no, choice-based, or summary questions.
- Domain knowledge questions should remain with experts.
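
To illustrate how a survey can double as a prompt, the sketch below renders a list of questions into a single prompt string with an answer-format hint for each question type. The function name `render_prompt`, the hint wording, and the dictionary layout are assumptions made for this example.

```python
def render_prompt(questions, text):
    """Turn survey questions into one prompt that also constrains the answer format."""
    lines = ["Answer every question about the text below.", ""]
    for number, q in enumerate(questions, start=1):
        if q["type"] == "yes_no":
            hint = "Answer yes or no."
        elif q["type"] == "tag":
            hint = "Choose one of: " + ", ".join(q["options"]) + "."
        else:  # treated as a summary question
            hint = "Answer in one or two sentences."
        lines.append(f"{number}. {q['content']} {hint}")
    lines += ["", "Text:", text]
    return "\n".join(lines)


questions = [
    {"type": "yes_no", "content": "Was this commit related to bpf_link?"},
    {"type": "tag", "content": "What type of BPF feature is this?",
     "options": ["uprobe", "kprobe", "xdp", "bpf_link"]},
    {"type": "summary", "content": "Summarize the main purpose of this commit."},
]

prompt = render_prompt(questions, "bpf: add bpf_link support for XDP")
print(prompt)
```

Keeping the questions closed-form (yes/no or a fixed set of choices) is what lets the answers be checked and aggregated mechanically.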

## Why LLM?

LLMs have been proven effective in survey, summarization, and analysis tasks in fields like market research and chemistry. With LLMs, we can analyze unstructured data, which traditional methods struggle to handle efficiently.

### Why Not Other Methods?

- **Domain Knowledge**: Required for Linux kernel analysis.
- **Unstructured Data**: Commit messages and emails are difficult to process with traditional tools.
- **Expert Cost**: Manually analyzing this data is time-consuming and expensive.

## Example Questions LLM Agents Can Answer:

- How do new feature introductions affect kernel stability and performance?
- What identifiable phases exist in a feature’s lifecycle?
- How has a specific eBPF feature, like `bpf_link`, evolved over successive commits?
- What patterns emerge in commit frequency related to specific features?
- What lessons from eBPF development can improve other eBPF runtimes?

## Configuration Example

```yml
# Configuration for LLM Agent in Code-Survey
task: survey_analysis
memory_access: linux_bpf_database
survey_questions:
  - type: yes_no
    content: "Was this commit related to bpf_link?"
  - type: tag
    options: ["uprobe", "kprobe", "xdp", "bpf_link"]
    content: "What type of BPF feature is this?"
  - type: summary
    length: 1-2 sentences
    content: "Summarize the main purpose of this commit."
```
## References
1. [How to Communicate When Submitting Patches: An Empirical Study of the Linux Kernel](https://dl.acm.org/doi/abs/10.1145/3359210)
2. [Differentiating Communication Styles of Leaders on the Linux Kernel Mailing List](https://dl.acm.org/doi/abs/10.1145/2957792)
This README outlines how Code-survey uses LLM agents to transform unstructured Linux kernel data into actionable insights, particularly in the eBPF subsystem, providing a faster and deeper understanding of feature evolution, design, and collaboration.
17,020 changes: 17,020 additions & 0 deletions data/bpf_commits.csv

Large diffs are not rendered by default.

105 changes: 90 additions & 15 deletions docs/README.md
@@ -1,4 +1,4 @@
# Linux kernel LLM Agent(LKLA)


Imagine if you could ask every kernel developer to complete a survey or questionnaire about a commit or an email. What could you find in the results?

@@ -18,9 +18,9 @@ We are trying to turn impossible into possible with AI, changing weeks of work i

QUESTIONS and insights:

see the full lists in


## Datasets with eBPF: Linux BPF subsystem
### Datasets with eBPF: Linux BPF subsystem

- 680+ important feature commits selected by eBPF experts, with feature name and type (Map, helper, kfunc, prog, attach, etc.)
- BPF-related feature commit information, with LLM Agent survey and summary
@@ -59,28 +59,103 @@ What question can llm help answer but other cannot?

- bug...

## How to define an LLM Agent for analysis

- Task:
- Tool:
- The input survey definition: 3 types of questions
  - Answer: yes or no
  - The tag: choose one of several options: uprobe/kprobe/xdp...
  - The summary: should be complete in one or two sentences.
- Memory: which database can it access?
- Planner (predefined)
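
The fields listed above can be sketched as a small data structure. This is only one possible shape: the class names `SurveyQuestion` and `AgentDefinition`, the `check` method, and the example values are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass, field


@dataclass
class SurveyQuestion:
    kind: str  # "yes_no", "tag", or "summary"
    text: str
    options: list[str] = field(default_factory=list)  # only used by "tag"


@dataclass
class AgentDefinition:
    task: str
    tools: list[str]
    survey: list[SurveyQuestion]
    memory: str  # name of the database the agent may read
    planner: str = "predefined"

    def check(self) -> list[str]:
        """Return a list of problems; an empty list means the definition is usable."""
        problems = []
        for q in self.survey:
            if q.kind not in ("yes_no", "tag", "summary"):
                problems.append(f"unknown question type: {q.kind}")
            if q.kind == "tag" and not q.options:
                problems.append(f"tag question has no options: {q.text}")
        return problems


agent = AgentDefinition(
    task="survey_analysis",
    tools=[],
    survey=[
        SurveyQuestion("yes_no", "Was this commit related to bpf_link?"),
        SurveyQuestion("tag", "What type of BPF feature is this?",
                       ["uprobe", "kprobe", "xdp"]),
    ],
    memory="linux_bpf_database",
)
print(agent.check())
```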

## How to design a survey?

An example survey for eBPF commits:

```yml

title: "Commit Classification Survey"
description: "A survey about the type, use cases and summary of commit in Linux eBPF."
questions:
  - id: summary
    type: fill_in
    question: "Please provide a summary of the commit in one short sentence not longer than 30 words. Only output one sentence."
    required: true

  - id: keywords
    type: fill_in
    question: "Please extract no more than 3 keywords from the commit. Only output 3 keywords without any special characters."
    required: true

  - id: commit_classification
    type: single_choice
    question: "What may be the main type of the commit?"
    choices:
      - value: It's a bug fix.
      - value: It's a new feature.
      - value: It's a performance optimization.
      - value: It's a cleanup or refactoring in the code.
      - value: It's a documentation change or typo fix.
      - value: It's a test case or test infrastructure change.
      - value: It's a build system or CI/CD change.
      - value: It's a security fix.
      - value: It's other type of commit.

  - id: commit_complexity
    type: single_choice
    question: "What is the estimated complexity of implementing this commit?"
    choices:
      - value: Simple can be used without much configuration. For example a simple helper function.
      - value: Moderate requires some setup or understanding of the system. For example a new map type or a new link type.
      - value: Complex needs expert knowledge or significant changes to existing systems. Like adding a completely new subsystem support or a completely new program type don't exist before.
      - value: It's a merge commit not related to any of the above.

  - id: Major related implementation component
    type: single_choice
    question: "What major implementation component that the commit is modified on?"
    choices:
      - value: The eBPF verifier
      - value: The eBPF JIT compiler for different architectures
      - value: The helper and kfuncs
      - value: The syscall interface
      - value: The eBPF maps
      - value: The libbpf library
      - value: The bpftool utility
      - value: The test cases and makefiles
      - value: The attach events e.g. perf events tracepoints network HID LSM etc. This should be the modify on the event implements, e.g. change how uprobe works.
      - value: Other implementation component related to eBPF but not listed above.
      - value: It's a merge commit including changes to multiple implementation components.
      - value: It's not related to any of the above it's not related to bpf subsystem in Linux kernel may be wrong data.

  - id: Major related logic component
    type: single_choice
    question: "What major logic component that the commit is related to? logic component is different to implementation component."
    choices:
      - value: A eBPF Instruction Logic for adding fixing or updating the way eBPF instructions are interpreted validated or executed by the eBPF virtual machine in the kernel. e.g. add fix optimized eBPF instructions.
      - value: A eBPF Program Logic related to different eBPF program types (e.g. XDP tc kprobes) and how the kernel manages attaches and runs these programs. e.g. add eBPF program type fix things related to eBPF program.
      - value: Runtime features Logic related to eBPF helper or kfunc functions which provide access to kernel resources from eBPF programs (e.g. reading from maps manipulating packet data). e.g. helpers kfuncs etc.
      - value: eBPF events Logic for handling events that trigger eBPF programs such as network packet reception system calls or tracing events. e.g. add new event fix things related to XDP HID perf events tracepoint etc.
      - value: Control Plane interface Logic for userspace control. This involves adding fixing or modifying syscalls that are part of the control plane interface for eBPF allowing userspace programs to interact with eBPF features in the kernel.
      - value: Maps Logic for how eBPF maps (data structures shared between kernel and user space) are created accessed and managed.
      - value: BPF Type Format (BTF) Logic. e.g. add fix things related to BTF, changes how eBPF programs are CO-RE, using BTF for verifier etc.
      - value: It's a merge commit including changes to multiple logic components.

  - id: usecases_or_events
    type: multiple_choice
    question: "What eBPF usecases/events may the commit relate to and designed for?"
    choices:
      - value: xdp related type programs
      - value: socket related type programs
      - value: tc related type programs
      - value: netfilter related type programs
      - value: tracepoints related type programs
      - value: kprobe/ftrace like type kernel dynamic probe programs
      - value: uprobe/usdt like type user space dynamic probe programs
      - value: profile related type programs
      - value: LSM type related programs
      - value: cgroup type related programs
      - value: HID driver related type programs
      - value: scheduler related type programs
      - value: It improves the overall eBPF infrastructure (e.g. verifier runtime etc.)
      - value: It's an experimental feature that doesn't fit into existing categories.
      - value: It's not related to any of the above.
      - value: other type of usecases/bpf programs not listed above.
```
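
Answers to a survey like the one above can be checked mechanically before they are stored. The following is a minimal sketch, not the project's validator: `check_answer` is a hypothetical helper, the `max_words` key is an assumption (the survey states its 30-word limit only in the question text), and choices are shown as plain strings rather than `value:` entries for brevity.

```python
def check_answer(question, answer):
    """Return None if the answer fits the question, else a short reason."""
    kind = question["type"]
    if kind == "fill_in":
        limit = question.get("max_words")  # hypothetical key, see lead-in
        if limit is not None and len(answer.split()) > limit:
            return f"more than {limit} words"
    elif kind == "single_choice":
        if answer not in question["choices"]:
            return "not one of the listed choices"
    elif kind == "multiple_choice":
        unknown = [a for a in answer if a not in question["choices"]]
        if unknown:
            return f"unknown choices: {unknown}"
    return None


summary_q = {"type": "fill_in", "max_words": 30}
type_q = {"type": "single_choice",
          "choices": ["It's a bug fix.", "It's a new feature."]}
usecase_q = {"type": "multiple_choice",
             "choices": ["xdp related type programs", "tc related type programs"]}

print(check_answer(summary_q, "Fix a verifier bounds check."))
print(check_answer(type_q, "It's a bug fix"))  # trailing period is missing
print(check_answer(usecase_q, ["xdp related type programs"]))
```

Exact string matching is deliberately strict here: an answer that paraphrases a choice is flagged instead of being silently mapped to the nearest option.
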
## reference
1. okernel mail
1. kernel mail
1. [How to Communicate when Submitting Patches: An Empirical Study of the Linux Kernel](https://dl.acm.org/doi/abs/10.1145/3359210?casa_token=5CrG9X-8QNgAAAAA:mm-N0p2baZSzxgfNbBcSi5HYBF67jdM7VZlJfTbhI2ht2cv1oCHRSL_FRPmM7DHr6ISpV91szCTOEg)
2. [Differentiating Communication Styles of Leaders on the Linux Kernel Mailing List](https://dl.acm.org/doi/abs/10.1145/2957792.2957801?casa_token=VMchS_jhea0AAAAA:EubJDL_ftM5jmV3_yzwWzDLvLq8hAsexZnss1x3j754OZr4VNENST_tSl0ijQEBnVg5AaFWpZGf3kQ)
19 changes: 1 addition & 18 deletions survey/README.md
@@ -1,22 +1,5 @@
# How to define a LLM Agent

- Task:
- Tool:
- The input survey define: 3 types of questions
- Answer:
- if yes

- The tag: choose between usecases: security/network/cgroup/observability...
- if security
- The summary: should be complete in one or 2 sentence.
- Key words:
- number 1-10
- Memory: which database can it access?
- Planer(Predefined)

## Config
# How to define a survey

```yml


```