update

eunomia-bpf · Sep 16, 2024 · 160d96f · 160d96f
1 parent 1b1591b
commit 160d96f

Showing 7 changed files with 17,316 additions and 123 deletions.
diff --git a/README.md b/README.md
@@ -1,4 +1,85 @@
-# blog generator
+# Code-Survey: Uncovering Insights in Complex Systems with LLM
 
-Make a company and the blog with markdown files.
+- **Do we really know how complex systems like Linux work?**
+- **How can we understand the design choices and evolution of a super-complex system like the Linux kernel?**
 
+**Code-Survey** is here to change that:
+
+- **No human could do this before, but AI can.**
+- **No chatbots, document search, or code generation: everyone is already doing those.**
+
+Code-Survey helps you explore and analyze the world's largest and most intricate codebases, like the Linux kernel. By carefully **designing a survey** and **transforming** `unstructured data` such as commits and mailing lists into organized, `structured and easy-to-analyze data`, Code-Survey makes it simpler to uncover valuable insights in modern complex software systems.
+
+With the power of AI and Large Language Models (LLMs), you can ask questions, run queries, and gain a deeper understanding of how systems evolve over time. AI Agents can also help you with the analysis. Whether you're a developer, researcher, or enthusiast, Code-Survey bridges the gaps between design, implementation, maintenance, and security, making complex systems more accessible.
+
+**Let's do Code-Survey!**
+
+## Linux-bpf Dataset
+
+The **Linux-bpf dataset** focuses on the eBPF subsystem and is continuously updated via CI. The dataset includes:
+
+- **680+ expert-selected commits**: Features, commit details, types (Map, Helper, Kfunc, Prog, etc.).
+- **12,000+ BPF-related commits**: LLM Agent surveys and summaries.
+- **150,000+ BPF subsystem-related emails**: Automatically analyzed by LLM Agents.
+
+The simplest way to see how this data works is to **upload the CSV to ChatGPT** (or another platform) and ask questions to let it do the analysis for you!
+
+To see more details, check the analysis in [report_ebpf.md](docs/report_ebpf.md).
+
+## Workflow / Methodology
+
+
+
+Our approach follows a well-defined workflow:
+
+1. **Human Experts or LLM Agents design surveys**: Tailored questions for each commit or email.
+2. **LLM Agents complete the surveys**: Answering yes/no questions, tagging relevant data, and summarizing key information. **This is the key step that turns unstructured data into structured data.**
+3. **Human Experts or LLM Agents evaluate results**: Ensuring accuracy and uncovering new insights from the structured results.
+
+### Best Practices for Designing Surveys:
+
+- A survey acts as both the prompt and the action plan for LLM Agents.
+- Focus on yes/no, choice-based, or summary questions.
+- Domain knowledge questions should remain with experts.
+
+## Why LLM?
+
+LLMs have been proven effective in survey, summarization, and analysis tasks in fields like market research and chemistry. With LLMs, we can analyze unstructured data, which traditional methods struggle to handle efficiently. 
+
+### Why Not Other Methods?
+
+- **Domain Knowledge**: Required for Linux kernel analysis.
+- **Unstructured Data**: Commit messages and emails are difficult to process with traditional tools.
+- **Expert Cost**: Manually analyzing this data is time-consuming and expensive.
+
+## Example Questions LLM Agents Can Answer:
+
+- How do new feature introductions affect kernel stability and performance?
+- What identifiable phases exist in a feature’s lifecycle?
+- How has a specific eBPF feature, like `bpf_link`, evolved over successive commits?
+- What patterns emerge in commit frequency related to specific features?
+- What lessons from eBPF development can improve other eBPF runtimes?
+
+## Configuration Example
+
+```yml
+# Configuration for LLM Agent in Code-survey
+task: survey_analysis
+memory_access: linux_bpf_database
+survey_questions:
+  - type: yes_no
+    content: "Was this commit related to bpf_link?"
+  - type: tag
+    options: ["uprobe", "kprobe", "xdp", "bpf_link"]
+    content: "What type of BPF feature is this?"
+  - type: summary
+    length: 1-2 sentences
+    content: "Summarize the main purpose of this commit."
+```
+
+## References
+
+1. [How to Communicate When Submitting Patches: An Empirical Study of the Linux Kernel](https://dl.acm.org/doi/abs/10.1145/3359210)
+2. [Differentiating Communication Styles of Leaders on the Linux Kernel Mailing List](https://dl.acm.org/doi/abs/10.1145/2957792)
+
+This README outlines how Code-survey uses LLM agents to transform unstructured Linux kernel data into actionable insights, particularly in the eBPF subsystem, providing a faster and deeper understanding of feature evolution, design, and collaboration.
diff --git a/data/bpf_commits.csv b/data/bpf_commits.csv
diff --git a/docs/README.md b/docs/README.md
@@ -1,4 +1,4 @@
-# Linux kernel LLM Agent(LKLA)
+
 
 Imagine if you could ask every kernel developer to complete a survey or questionnaire about a commit or an email: what could you find in the results?
 
@@ -18,9 +18,9 @@ We are trying to turn impossible into possible with AI, changing weeks of work i
 
 QUESTIONS and insights:
 
+see the full lists in 
 
-
-## Datasets with eBPF: Linux BPF subsystem
+### Datasets with eBPF: Linux BPF subsystem
 
 - 680+ important feature commits selected by eBPF experts, with feature name and type (Map, helper, kfunc, prog, attach, etc.)
 - BPF-related feature commits, with LLM Agent survey and summary
@@ -59,28 +59,103 @@ What question can llm help answer but other cannot?
 
 - bug...
 
-## How to define a LLM Agent for analysis
-
-- Task:
-- Tool:
-- The input survey define: 3 types of questions
-    - Answer: yes or no
-    - The tag: choose between something: uprobe/kprobe/xdp...
-    - The summary: should be complete in one or 2 sentence.
-- Memory: which database can it access?
-- Planer(Predefined)
 
 ## How to design a survey?
 
 An example survey for eBPF commits:
 
 ```yml
-
+title: "Commit Classification Survey"
+description: "A survey about the type, use cases and summary of commit in Linux eBPF."
+questions:
+- id: summary
+  type: fill_in
+  question: "Please provide a summary of the commit in one short sentence not longer than 30 words. Only output one sentence."
+  required: true
+
+- id: keywords
+  type: fill_in
+  question: "Please extract no more than 3 keywords from the commit. Only output 3 keywords without any special characters."
+  required: true
+
+- id: commit_classification
+  type: single_choice
+  question: "What may be the main type of the commit?"
+  choices:
+    - value: It's a bug fix.
+    - value: It's a new feature.
+    - value: It's a performance optimization.
+    - value: It's a cleanup or refactoring in the code.
+    - value: It's a documentation change or typo fix.
+    - value: It's a test case or test infrastructure change.
+    - value: It's a build system or CI/CD change.
+    - value: It's a security fix.
+    - value: It's other type of commit.
+
+- id: commit_complexity
+  type: single_choice
+  question: "What is the estimated complexity of implementing this commit?"
+  choices:
+    - value: Simple can be used without much configuration. For example a simple helper function.
+    - value: Moderate requires some setup or understanding of the system. For example a new map type or a new link type.
+    - value: Complex needs expert knowledge or significant changes to existing systems. For example adding support for a completely new subsystem or a completely new program type that did not exist before.
+    - value: It's a merge commit not related to any of the above.
+
+- id: Major related implementation component
+  type: single_choice
+  question: "What major implementation component does the commit modify?"
+  choices:
+    - value: The eBPF verifier
+    - value: The eBPF JIT compiler for different architectures
+    - value: The helper and kfuncs
+    - value: The syscall interface
+    - value: The eBPF maps
+    - value: The libbpf library
+    - value: The bpftool utility
+    - value: The test cases and makefiles
+    - value: The attach events e.g. perf events tracepoints network HID LSM etc. This should be a modification to the event implementation itself, e.g. a change to how uprobe works.
+    - value: Other implementation component related to eBPF but not listed above.
+    - value: It's a merge commit including changes to multiple implementation components.
+    - value: It's not related to any of the above. It's not related to the bpf subsystem in the Linux kernel and may be wrong data.
+
+- id: Major related logic component
+  type: single_choice
+  question: "What major logic component is the commit related to? A logic component is different from an implementation component."
+  choices:
+    - value: A eBPF Instruction Logic for adding fixing or updating the way eBPF instructions are interpreted validated or executed by the eBPF virtual machine in the kernel. e.g. add fix optimized eBPF instructions.
+    - value: A eBPF Program Logic related to different eBPF program types (e.g. XDP tc kprobes) and how the kernel manages attaches and runs these programs. e.g. add eBPF program type fix things related to eBPF program.
+    - value: Runtime features Logic related to eBPF helper or kfunc functions which provide access to kernel resources from eBPF programs (e.g. reading from maps manipulating packet data). e.g. helpers kfuncs etc.
+    - value: eBPF events Logic for handling events that trigger eBPF programs such as network packet reception system calls or tracing events. e.g. add new event fix things related to XDP HID perf events tracepoint etc.
+    - value: Control Plane interface Logic for userspace control. This involves adding fixing or modifying syscalls that are part of the control plane interface for eBPF allowing userspace programs to interact with eBPF features in the kernel.
+    - value: Maps Logic for how eBPF maps (data structures shared between kernel and user space) are created accessed and managed.
+    - value: BPF Type Format (BTF) Logic. e.g. add fix things related to BTF, changes how eBPF programs are CO-RE, using BTF for verifier etc.
+    - value: It's a merge commit including changes to multiple logic components.
+
+- id: usecases_or_events
+  type: multiple_choice
+  question: "What eBPF usecases/events may the commit relate to and be designed for?"
+  choices:
+    - value: xdp related type programs
+    - value: socket related type programs
+    - value: tc related type programs
+    - value: netfilter related type programs
+    - value: tracepoints related type programs
+    - value: kprobe/ftrace like type kernel dynamic probe programs
+    - value: uprobe/usdt like type user space dynamic probe programs
+    - value: profile related type programs
+    - value: LSM type related programs
+    - value: cgroup type related programs
+    - value: HID driver related type programs
+    - value: scheduler related type programs
+    - value: It improves the overall eBPF infrastructure (e.g. verifier runtime etc.)
+    - value: It's an experimental feature that doesn't fit into existing categories.
+    - value: It's not related to any of the above.
+    - value: other type of usecases/bpf programs not listed above.
 ```
 
 ## reference
 
-1. okernel mail
+1. kernel mail
    1. [How to Communicate when Submitting Patches: An Empirical Study of the Linux Kernel](https://dl.acm.org/doi/abs/10.1145/3359210?casa_token=5CrG9X-8QNgAAAAA:mm-N0p2baZSzxgfNbBcSi5HYBF67jdM7VZlJfTbhI2ht2cv1oCHRSL_FRPmM7DHr6ISpV91szCTOEg)
    2. [Differentiating Communication Styles of Leaders on the Linux Kernel Mailing List](https://dl.acm.org/doi/abs/10.1145/2957792.2957801?casa_token=VMchS_jhea0AAAAA:EubJDL_ftM5jmV3_yzwWzDLvLq8hAsexZnss1x3j754OZr4VNENST_tSl0ijQEBnVg5AaFWpZGf3kQ)
 
diff --git a/survey/README.md b/survey/README.md
@@ -1,22 +1,5 @@
-# How to define a LLM Agent
-
-- Task:
-- Tool:
-- The input survey define: 3 types of questions
-    - Answer: 
-        - if yes
-
-    - The tag: choose between usecases: security/network/cgroup/observability...
-        - if security
-    - The summary: should be complete in one or 2 sentence.
-    - Key words: 
-    - number 1-10
-- Memory: which database can it access?
-- Planer(Predefined)
-
-## Config
+# How to define a survey
 
 ```yml
 
-
 ```
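
Note on the survey workflow described in this commit: the README says LLM Agents complete a survey for each commit so that unstructured data becomes structured rows. A minimal Python sketch of that last step follows, assuming the survey has already been loaded into a list of dicts. The names `SURVEY`, `validate_answers`, and `to_csv_row`, and the commit id `abc1234`, are hypothetical and not part of the repository. The field names (`id`, `type`, `required`, `choices`) mirror the YAML in docs/README.md, but the in-memory layout is an assumption made for illustration.

```python
import csv
import io

# Hypothetical, shortened survey. The keys mirror the YAML in docs/README.md;
# the choice lists are truncated here to keep the sketch small.
SURVEY = [
    {"id": "summary", "type": "fill_in", "required": True},
    {"id": "commit_classification", "type": "single_choice",
     "choices": ["It's a bug fix.", "It's a new feature."]},
    {"id": "usecases_or_events", "type": "multiple_choice",
     "choices": ["xdp related type programs", "tc related type programs"]},
]


def validate_answers(survey, answers):
    """Return a list of problems found in one set of survey answers."""
    problems = []
    for q in survey:
        qid = q["id"]
        value = answers.get(qid)
        if value in (None, "", []):
            # Only questions marked required must be answered.
            if q.get("required"):
                problems.append(f"{qid}: missing required answer")
            continue
        if q["type"] == "fill_in":
            if not isinstance(value, str):
                problems.append(f"{qid}: expected text")
        elif q["type"] == "single_choice":
            if value not in q["choices"]:
                problems.append(f"{qid}: not a listed choice")
        elif q["type"] == "multiple_choice":
            if not isinstance(value, list):
                problems.append(f"{qid}: expected a list of choices")
            elif any(v not in q["choices"] for v in value):
                problems.append(f"{qid}: contains an unlisted choice")
    return problems


def to_csv_row(survey, commit_id, answers):
    """Flatten one commit's answers into a single CSV line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    row = [commit_id]
    for q in survey:
        value = answers.get(q["id"], "")
        # Multiple-choice answers are joined so each question stays one column.
        row.append("; ".join(value) if isinstance(value, list) else value)
    writer.writerow(row)
    return buf.getvalue()


answers = {
    "summary": "Fix verifier bounds check",
    "commit_classification": "It's a bug fix.",
    "usecases_or_events": ["xdp related type programs"],
}
if not validate_answers(SURVEY, answers):
    print(to_csv_row(SURVEY, "abc1234", answers), end="")
```

Checking answers against the listed choices before writing a row is one way to catch malformed LLM output early. Whether the project validates its data this way is not stated in the commit.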