 _____ ____   ____    _
|  ___|  _ \ / ___|  / \
| |_  | |_) | |  _  / _ \
|  _| |  __/| |_| |/ ___ \
|_|   |_|    \____/_/   \_\
This repository is intended for folks who are new to FPGA and want to learn something about it. It is a collection of useful resources and links rather than a thorough FPGA tutorial. Traditional HDL (Hard and Difficult Language) is not the main focus; instead, we focus on using high-level languages (e.g., C++) to cook FPGA.
Originally, this repository was started by a newbie to record his learning of FPGA, and was later made public in the hope that it could help researchers start their journey with FPGA, with less pain and whiskey.
Resources collected here, and the way the contents are organized, are not in perfect shape. This repository is still raw and needs major improvements. Any form of contribution is welcome and appreciated.
Main contents:
README.md
- Basics about Digital Design
- Basics about FPGA
- Relevant Courses and Books
- Papers about FPGA internal
Xilinx
xilinx.md
xilinx_constraints.md
xilinx_cheatsheet.md
xilinx_lessons_vivado.md
xilinx_lessons_hls.md
submodules/: GitHub repositories about FPGA
hls/: Sample Xilinx HLS C++ code
- AXI Stream
- Network protocol processing
xilinx_arty_a7: Sample Xilinx projects for Arty A7 100 board
- Tri-mode MAC reference design
- Simple LED
- Clocked LED
FAQ.md
- Some implementation questions about FPGA
FPGA Intro
- URL: RapidWright FPGA Architecture Basics
- URL: RapidWright Xilinx Architecture Terminology
- Book: Parallel Programming for FPGAs
- Basics about FPGA and HLS
- URL: All about FPGAs, EE Times
- Slides: Intro FPGA CSE467 UW
- URL: I/O Pads
- BGA Wiki: In a BGA the pins are replaced by pads on the bottom of the package. If you check a PGA package, you will see the difference between a pin and a pad, and immediately get why it is called a pad. You will also know what the pad in the IO Block diagram is.
Digital Basics
- PDF: The Digital World
- Wiki: Differential signaling and Wiki: Single-ended signaling
- Book: Digital design and computer architecture
- Content-Addressable Memory Introduction
Verilog
High-Level Synthesis (HLS)
- A Survey and Evaluation of FPGA High-Level Synthesis Tools
- Xilinx Introduction to FPGA Design with Vivado High-Level Synthesis
- Xilinx Vivado Design Suite User Guide High-Level Synthesis
- Xilinx SDAccel Development Environment Help for 2018.2 XDF
- The Zynq book
- Parallel Programming for FPGAs
- GMU ECE 699 Software/Hardware Co-design S15
- CMU ECE 18-643
- Cornell ECE5775 from Prof. Zhiru Zhang
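To give a feel for what HLS input looks like, here is a minimal sketch of a C++ function written in HLS style. It is not taken from any of the resources above: the function name `vadd` and the length `kN` are made up for illustration. The `#pragma HLS PIPELINE II=1` directive is the one described in the Vivado HLS user guide; a regular C++ compiler ignores it (possibly with an unknown-pragma warning), so the same code can be tested in software first.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kN = 8;  // illustrative vector length (assumption)

// Hypothetical HLS top function: element-wise vector add.
void vadd(const int32_t a[kN], const int32_t b[kN], int32_t out[kN]) {
    assert(a && b && out);  // software-side sanity check
    for (int i = 0; i < kN; ++i) {
#pragma HLS PIPELINE II=1
        // Request that a new loop iteration starts every clock cycle.
        out[i] = a[i] + b[i];
    }
}
```

In software this is just a loop. With an HLS tool, the pragma asks for one iteration to start per clock cycle, subject to what the tool can actually schedule.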
Courses
- Online: Real Digital
- CMU ECE 18-643
- I like its slides, very informative. Slides about PR, Verilog, HLS are good.
- Also read its references, all quite good papers.
- Cornell ECE5775 from Prof. Zhiru Zhang
- GMU ECE 699 Software/Hardware Co-design S16
- GMU ECE 699 Software/Hardware Co-design S15
- DAMN, this is a good and practical course.
- MIT 6.111 Introductory Digital Systems Laboratory
- MIT 6.375 Complex Digital Systems
- UCB EECS 151/251A
Books
- Parallel Programming for FPGAs
- The Zynq book
- 15.5.3 Pipelining
- 15.5.4 Dataflow
- FPGAs for Software Programmers
- Data Processing on FPGAs, Synthesis Lectures on Data Management
How to apply Operating System concepts to FPGA? How to virtualize on-board memory and on-chip logic? And how is FPGA ultimately different from CPU in terms of resource sharing? Papers in this section could give you some hints.
General
- Sharing, Protection, and Compatibility for Reconfigurable Fabric with AMORPHOS, OSDI'18
- The LEAP Operating System for FPGAs
Memory Hierarchy
- (Papers deal with BRAM, registers, on-board DRAM, and system DRAM)
- LEAP Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic, FPGA'11
- Main design hierarchy: Use BRAM as L1 cache, use on-board DRAM as L2 cache, and host memory as the backing store. Everything is abstracted away through their interface (similar to load/store). Programming is pretty much the same as if you are writing for CPU.
- According to sec 2.2.2, its scratchpad controller uses a simple segment-based mapping scheme, like the one in AmorphOS.
- LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories, FCCM'14
- Follow-up work on LEAP Scratchpads that extends it to provide cache coherence between multiple FPGAs.
- Coherent Scratchpads with MOSI protocol.
- CoRAM: An In-Fabric Memory Architecture for FPGA-Based Computing
- CoRAM provides an interface for managing the on- and off-chip memory resource of an FPGA.
- Cache, TLB, NoC, it has almost everything. The thesis is very comprehensive and informative.
- Sharing, Protection, and Compatibility for Reconfigurable Fabric with AMORPHOS, OSDI'18
- Hull: provides memory protection for on-board DRAM using segment-based address translation.
- Virtualized Execution Runtime for FPGA Accelerators in the Cloud, IEEE Access'17
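The segment-based mapping mentioned above for the LEAP scratchpad controller and the AmorphOS hull can be pictured as a base-and-bound check. The following is a minimal sketch of that general technique, not code from either paper; the names `Segment` and `translate` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical segment descriptor: one contiguous region of on-board DRAM
// assigned to one piece of user logic.
struct Segment {
    uint64_t base;   // physical start address in on-board DRAM
    uint64_t limit;  // segment size in bytes
};

// Translate an offset issued by user logic into a physical DRAM address.
// Returns std::nullopt when the access falls outside the segment.
inline std::optional<uint64_t> translate(const Segment& seg, uint64_t offset) {
    assert(seg.limit <= UINT64_MAX - seg.base);  // descriptor must not wrap around
    if (offset >= seg.limit) return std::nullopt;  // protection violation
    return seg.base + offset;
}
```

One adder and one comparator per access is all this needs, which is likely part of why segments are attractive compared to paging in FPGA shells.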
Dynamic Memory Allocation
- A High-Performance Memory Allocator for Object-Oriented Systems, IEEE'96
- SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems, FPL'15
- malloc() and free() for FPGA on-board DRAM.
- Hi-DMM: High-Performance Dynamic Memory Management in High-Level Synthesis, IEEE'18
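As a rough illustration of what malloc()/free() for on-board DRAM can look like, here is a fixed-size block allocator with a free list. This is a deliberately simple, swapped-in technique and not the design used by SysAlloc or Hi-DMM; the class name `BlockAllocator` and its parameters are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical fixed-size block allocator over a region of on-board DRAM.
// It hands out byte offsets into that region; kBlocks and kBlockSize are
// illustrative parameters.
template <uint32_t kBlocks, uint32_t kBlockSize>
class BlockAllocator {
    static_assert(kBlocks > 0, "need at least one block");

public:
    static constexpr uint32_t kNull = 0xFFFFFFFFu;  // "no block" marker

    BlockAllocator() {
        // Chain every block into the free list: 0 -> 1 -> ... -> kNull.
        for (uint32_t i = 0; i < kBlocks; ++i)
            next_[i] = (i + 1 < kBlocks) ? i + 1 : kNull;
        head_ = 0;
    }

    // Returns the byte offset of a free block, or kNull when exhausted.
    uint32_t alloc() {
        if (head_ == kNull) return kNull;
        const uint32_t idx = head_;
        head_ = next_[idx];
        return idx * kBlockSize;
    }

    // Returns a block to the free list. The offset must come from alloc().
    void dealloc(uint32_t offset) {
        assert(offset % kBlockSize == 0 && offset / kBlockSize < kBlocks);
        const uint32_t idx = offset / kBlockSize;
        next_[idx] = head_;
        head_ = idx;
    }

private:
    std::array<uint32_t, kBlocks> next_{};  // free-list links (would live in BRAM)
    uint32_t head_ = kNull;                 // index of the first free block
};
```

Both operations touch a constant number of table entries, which is the property that makes this style easy to map to hardware.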
Integrate with Virtual Memory
- (Papers that deal with the OS virtual memory system. Note that all these papers introduce some form of MMU into the FPGA to let it work with the host virtual memory system. This added MMU is similar to the CPU's MMU in the sense that both do address translation. But do note that the virtual memory system still runs in Linux; this includes page fault handling, swapping, and TLB shootdown. What could really stand out is implementing the virtual memory system in the FPGA itself. :-/ )
- Virtual Memory Window for Application-Specific Reconfigurable Coprocessors, DAC'04
- Early work that adds a new MMU to FPGA to let FPGA logic access on-chip DRAM. Note, it's not the system main memory. Thus the translation pgtable is different.
- Has some insights on prefetching and MMU CAM design.
- Seamless Hardware Software Integration in Reconfigurable Computing Systems, 2005
- Follow up summary on previous DAC'04 Virtual Memory Window.
- A Reconfigurable Hardware Interface for a Modern Computing System, FCCM'07
- This work adds a new MMU, which includes a 16-entry TLB, to the FPGA. The FPGA and CPU share the same user virtual address space and use the same physical memory. They share memory at cacheline granularity, so the FPGA is just another core in this sense. Upon a TLB miss at the FPGA MMU, the FPGA sends an interrupt to the CPU to let software handle the miss. Software-managed TLB miss handling is not efficient, but they made cache coherence between FPGA and CPU easy.
- Low-Latency High-Bandwidth HW/SW Communication in a Virtual Memory Environment, FPL'08
- This work actually adds a new MMU to the FPGA, which works just like a CPU MMU. It's similar to an IOMMU, in some sense.
- But I think they missed one important aspect: cache coherence between CPU and FPGA. There is not much information about this in the paper; it seems they do not have a cache at the FPGA. Anyhow, this is why CCIX and OpenCAPI were recently proposed.
- Memory Virtualization for Multithreaded Reconfigurable Hardware, FPL'11
- Part of the ReconOS project
- They implemented a simple MMU inside the FPGA that includes a TLB. On protection violations or invalid page accesses, their MMU just hands over to the CPU pgfault routines. How is this different from the FPL'08 one? Actually, IMO, they are the same.
- S4 Virtualized Execution Runtime for FPGA Accelerators in the Cloud, IEEE Access'17
- This paper also implemented a hardware MMU, but the virtual memory system still runs on Linux.
- Also listed in the Cloud Infrastructure part.
- Lightweight Virtual Memory Support for Many-Core Accelerators in Heterogeneous Embedded SoCs, 2015
- Lightweight Virtual Memory Support for Zero-Copy Sharing of Pointer-Rich Data Structures in Heterogeneous Embedded SoCs, IEEE'17
- Part of the PULP project.
- Essentially a software-managed IOMMU. The control path runs as a Linux kernel module. The datapath is a lightweight AXI transaction translation.
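Several of the papers above share one pattern: a small TLB inside the FPGA, with misses handed over to software on the CPU. The sketch below models that pattern in plain C++. It is not taken from any of the papers; the 4 KiB page size, the round-robin replacement policy, and the names `SoftTlb` and `miss_handler` are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>

constexpr uint64_t kPageShift = 12;  // assumed 4 KiB pages

struct TlbEntry {
    bool valid = false;
    uint64_t vpn = 0;  // virtual page number
    uint64_t pfn = 0;  // physical frame number
};

// Hypothetical 16-entry, fully associative TLB with software-managed misses.
class SoftTlb {
public:
    // miss_handler stands in for the interrupt to the CPU:
    // given a virtual page number, it returns the physical frame number.
    explicit SoftTlb(std::function<uint64_t(uint64_t)> miss_handler)
        : miss_handler_(std::move(miss_handler)) {
        assert(miss_handler_ != nullptr);
    }

    uint64_t translate(uint64_t vaddr) {
        const uint64_t vpn = vaddr >> kPageShift;
        const uint64_t off = vaddr & ((1ull << kPageShift) - 1);
        for (const auto& e : entries_)
            if (e.valid && e.vpn == vpn) return (e.pfn << kPageShift) | off;

        // Miss: ask software for the mapping, then refill one entry.
        ++misses_;
        const uint64_t pfn = miss_handler_(vpn);
        entries_[victim_].valid = true;
        entries_[victim_].vpn = vpn;
        entries_[victim_].pfn = pfn;
        victim_ = (victim_ + 1) % entries_.size();  // round-robin replacement
        return (pfn << kPageShift) | off;
    }

    unsigned misses() const { return misses_; }

private:
    std::function<uint64_t(uint64_t)> miss_handler_;
    std::array<TlbEntry, 16> entries_{};
    std::size_t victim_ = 0;
    unsigned misses_ = 0;
};
```

In hardware the lookup loop would be a parallel CAM match, and the handler call is where the expensive round trip to the CPU happens.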
Integrate OS/CPU/FPGA
- A Virtual Hardware Operating System for the Xilinx XC6200, FPL'96
- Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks, IEEE'04
- hthreads: a hardware/software co-designed multithreaded RTOS kernel, 2005
- Reconfigurable computing: architectures and design methods, IEE'05
- BORPH: An Operating System for FPGA-Based Reconfigurable Computers. PhD Thesis.
- FUSE: Front-end user framework for O/S abstraction of hardware accelerators, FCCM'11
- ReconOS – an Operating System Approach for Reconfigurable Computing, IEEE Micro'14
- Invoke kernel from FPGA. They built a shell in FPGA and delegation threads in CPU to achieve this.
- They implemented their own MMU (using pre-established pgtables) to let FPGA logic access system memory. Ref.
- Read the "Operating Systems for Reconfigurable Computing" sidebar, nice summary.
What are the typical applications that can be offloaded into FPGA? What has already been done before? This section lists many interesting applications and systems deployed on FPGA.
Integrate with Frameworks
- Map-reduce as a Programming Model for Custom Computing Machines, FCCM'08
- This paper proposes a model to translate MapReduce code written in C to code that could run on FPGA and GPU. Many details are omitted, and they don't really have the compiler.
- Single-host framework, everything is in FPGA and GPU.
- Axel: A Heterogeneous Cluster with FPGAs and GPUs, FPGA'10
- A distributed MapReduce Framework, targets clusters with CPU, GPU, and FPGA. Mainly the idea of scheduling FPGA/GPU jobs.
- Distributed Framework.
- FPMR: MapReduce Framework on FPGA, FPGA'10
- A MapReduce framework on a single host's FPGA. You need to write Verilog/HLS for the processing logic to hook into their framework. The framework mainly includes a data transfer controller and a simple scheduler that enables certain blocks at certain times.
- Single-host framework, everything is in FPGA.
- Melia: A MapReduce Framework on OpenCL-Based FPGAs, IEEE'16
- Another framework, written in OpenCL, and users can use OpenCL to program as well. Similar to previous work, it's more about the framework design, not specific algorithms on FPGA.
- Single-host framework, everything is in FPGA. But they have a discussion on running on multiple FPGAs.
- Four MapReduce FPGA papers here, I believe there are more. The marriage between MapReduce and FPGA is not hard to understand. FPGA can be viewed as another core with different capabilities. The question is how to design a good scheduling algorithm and data moving/caching mechanisms, given FPGA's reprogramming time and limited on-board memory. Those papers give some hints on this.
- UCLA: When Apache Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration, HotCloud'16
- UCLA: Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale, SoCC'16
- A system that hooks FPGA with Spark.
- There is a line of work that hooks FPGA into big data processing frameworks (Spark), so the FPGA implementation and the scale-out software can be separated. Spark can schedule FPGA jobs to different machines and take care of scale-out, failure handling, etc. But I personally think this line of work is really just an extension of the ReconOS/FUSE/BORPH line of work. The main reason is that both lines of work try to integrate jobs running on CPU with jobs running on FPGA, so that CPU and FPGA have an easier way to talk, or, put another way, a better division of labor. Whether it's single-machine (like ReconOS, Melia) or distributed (like Blaze, Axel), they are essentially the same.
- UCLA: Heterogeneous Datacenters: Options and Opportunities, DAC'16
- Follow up work of Blaze. Nice comparison of big and wimpy cores.
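The MapReduce frameworks above share one division of labor: the user supplies only the map and reduce logic, and the framework handles data movement and scheduling. The sketch below shows that split in software only. It is not the API of FPMR, Melia, or Axel; `map_reduce`, `KV`, and the callback signatures are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

using KV = std::pair<uint32_t, uint32_t>;  // (key, value)

// Framework side: feeds every input record to the user's map logic, groups
// the emitted pairs by key, and folds each group with the user's reduce logic.
std::map<uint32_t, uint32_t> map_reduce(
    const std::vector<uint32_t>& input,
    const std::function<KV(uint32_t)>& mapper,
    const std::function<uint32_t(uint32_t, uint32_t)>& reducer) {
    assert(mapper != nullptr && reducer != nullptr);
    std::map<uint32_t, uint32_t> result;
    for (uint32_t rec : input) {
        const KV kv = mapper(rec);
        auto it = result.find(kv.first);
        if (it == result.end())
            result.emplace(kv.first, kv.second);  // first value for this key
        else
            it->second = reducer(it->second, kv.second);
    }
    return result;
}
```

On an FPGA, the mapper and reducer would be the user-written Verilog/HLS blocks, and the loop would be replaced by the framework's data transfer controller and scheduler.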
Cloud Infrastructure
- Huawei: FPGA as a Service in the Cloud
- UCLA: Customizable Computing: From Single Chip to Datacenters, IEEE'18
- UCLA: Accelerator-Rich Architectures: Opportunities and Progresses, DAC'14
- Reminds me of OmniX. Disaggregation at a different scale.
- This paper actually targets single-machine case. But it can reflect a distributed setting.
- Enabling FPGAs in the Cloud, CF'14
- The paper raises four important aspects of enabling FPGAs in the cloud: Abstraction, Sharing, Compatibility, and Security. The FPGA itself requires a shell (the paper calls it service logic) and is partitioned into multiple slots. Things discussed in the paper are straightforward, but worth reading. They did not solve the FPGA sharing issue, which is solved by AmorphOS.
- FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack, FCCM'14
- Use OpenStack to manage FPGA resource. The FPGA is partitioned into multiple regions, each region can use PR. The FPGA shell includes: 1) basic MAC, and packet dispatcher, 2) memory controller, and segment-based partition scheme, 3) a soft processor used for runtime PR control. One very important aspect of this project is: they envision input to FPGA comes from Ethernet, which is very true nowadays. And this also makes their project quite similar to Catapult. It's a very solid paper, though the evaluation is a little bit weak. What could be added: migration, different-sized region.
- The above CF and FCCM papers are similar in the sense that both build a SW framework and a HW shell to provide a unified cloud management system. They differ in their shell design: the CF one takes inputs from a DMA engine, i.e., local system DRAM, while the FCCM one takes inputs from Ethernet. The things after the DMA or MAC are essentially similar.
- It seems all of them use simple segment-based memory partitioning for user FPGA logic. What are the pros and cons of using paging here?
- S1 DyRACT: A partial reconfiguration enabled accelerator and test platform, FPL'14
- S2 Virtualized FPGA Accelerators for Efficient Cloud Computing, CloudCom'15
- S3 Designing a Virtual Runtime for FPGA Accelerators in the Cloud, FPL'16
- S4 Virtualized Execution Runtime for FPGA Accelerators in the Cloud, IEEE Access'17
- The above four papers came from the same group of folks. S1 developed a framework to use PCIe to do PR, okay. S2 is a follow-up on S1; read S2's chapter IV on hardware architecture, which has many implementation details like the internal FPGA switch and AXI stream interface, but no memory virtualization discussion. S3 is a two-page short paper. S4 is the realization of S3. I was particularly interested in whether S4 implemented its own virtual memory management. The answer is NO. S4 leveraged on-chip Linux; they just built a customized MMU (in the form of using BRAM to store page tables. This approach is similar to the papers listed in Integrate with Virtual Memory). Many things discussed in S4 have been proposed multiple times in previous cloud FPGA papers since 2014.
- MS: A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA'14
- MS: A Cloud-Scale Acceleration Architecture, Micro'16
- Catapult is unique in its shell, which includes the Lightweight Transport Layer (LTL) and the Elastic Router (ER). The cloud management part, which the paper only briefly mentions, should actually include everything the above CF'14 and FCCM'14 papers have. The LTL has congestion control, packet loss detection/resend, and ACK/NACK. The ER is a crossbar switch used by FPGA internal modules, which is essential to connect shell and roles.
- These two Catapult papers are simply a must read.
- MS: A Configurable Cloud-Scale DNN Processor for Real-Time AI, ISCA'18
- MS: Azure Accelerated Networking: SmartNICs in the Public Cloud, NSDI'18
- MS: Direct Universal Access: Making Data Center Resources Available to FPGA, NSDI'19
- Catapult is just sweet, isn't it?
- ASIC Clouds: Specializing the Datacenter, ISCA'16
Programmable Network
- MS: ClickNP: Highly Flexible and High Performance Network Processing with Reconfigurable Hardware, SIGCOMM'16
- MS: Multi-Path Transport for RDMA in Datacenters, NSDI'18
- MS: Azure Accelerated Networking: SmartNICs in the Public Cloud, NSDI'18
Database
Machine Learning
- Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, FPGA'15
- From High-Level Deep Neural Models to FPGAs, ISCA'16
- Deep Learning on FPGAs: Past, Present, and Future, arXiv'16
- Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC, FPT'16
- FINN: A Framework for Fast, Scalable Binarized Neural Network Inference, FPGA'17
- In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA'17
- Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs, FPGA'17
- A Configurable Cloud-Scale DNN Processor for Real-Time AI, ISCA'18
- A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks, MICRO'18
- DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs, ICCAD'18
- FA3C: FPGA-Accelerated Deep Reinforcement Learning, ASPLOS'19
Graph
- A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing, ISCA'15
- Energy Efficient Architecture for Graph Analytics Accelerators, ISCA'16
- Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search, FPGA'17
- FPGA-Accelerated Transactional Execution of Graph Workloads, FPGA'17
- An FPGA Framework for Edge-Centric Graph Processing, CF'18
KVS
- Achieving 10Gbps line-rate key-value stores with FPGAs, HotCloud'13
- Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached, ISCA'13
- An FPGA Memcached Appliance, FPGA'13
- Scaling out to a Single-Node 80Gbps Memcached Server with 40Terabytes of Memory, HotStorage'15
- KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC, SOSP'17
- This link is also useful for better understanding: Morning Paper
- Ultra-Low-Latency and Flexible In-Memory Key-Value Store System Design on CPU-FPGA, FPT'18
Genome
Consensus
- Consensus in a Box: Inexpensive Coordination in Hardware, NSDI'16
Video Processing
- TODO
Blockchain
- TODO
Micro-services
- TODO
Languages
- From JVM to FPGA: Bridging Abstraction Hierarchy via Optimized Deep Pipelining, HotCloud'18
General
- FPGA and CPLD architectures: a tutorial, 1996
- Reconfigurable computing: a survey of systems and software, 2002
- Reconfigurable computing: architectures and design methods
- FPGA Architecture: Survey and Challenges, 2007
- Read the first two paragraphs of each section and then come back to read all of that if needed.
- RAMP: Research Accelerator For Multiple Processors, 2007
- Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology, IEEE'15
Partial Reconfiguration
- DyRACT: A partial reconfiguration enabled accelerator and test platform, FPL'14
- FPGA Dynamic and Partial Reconfiguration: A Survey of Architectures, Methods, and Applications, CSUR'18
Logical Optimization and Technology Mapping
- FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs, 1994
- Combinational Logic Synthesis for LUT Based Field Programmable Gate Arrays, 1996
- DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs, 2004
Place and Route
- VPR: A New Packing, Placement and Routing Tool for FPGA Research, 1997
- VTR 7.0: Next Generation Architecture and CAD System for FPGAs, 2014
RTL2FPGA