Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Troubleshooting RR graphs with poor router performance #2803

Open
petergrossmann21 opened this issue Nov 7, 2024 · 4 comments
Open

Troubleshooting RR graphs with poor router performance #2803

petergrossmann21 opened this issue Nov 7, 2024 · 4 comments

Comments

@petergrossmann21
Copy link
Contributor

In a revision of an existing architecture to clean up RR graph bugs, the router wirelength and runtime have both blown up for test circuits with size on the order of thousands of LUTs. The particulars are posted to current behavior:

Both RR graphs are generated from scratch (not derived/revised from VPR's RR graph generator). The QOR of the original graph is benchmarked against the RR graph VPR generates from the arch XML and found to be roughly equal (VPR's graph wirelength: 71017, runtime also ~15 seconds). The VPR arch XML generates a relatively different graph vs. the custom graph, discussed in current behavior.

The salient differences between custom graphs are:

Original graph:

  1. one of two types of IOBs contains an extra IPIN node at ptc=0, side=TOP that does not correspond to hardware. There are 8 instances of this IOB type and 72 IOBs of another type.
  2. The routing channel permutation algorithm is implemented with a bug such that the ptc_num value of CHANX/CHANY nodes with edges to each IPIN do not necessarily appear in pairs of consecutive ptc_num values. For example, the lowest ptc values assigned for CLB IPIN ptc=0 are CHANX ptc=1 and CHANX ptc=8.

Revised graph:

  1. The IOB IPINs and with side=TOP, and the edges from CHANX/CHANY leading to them, are removed.
  2. The routing channel permutation bug is fixed, and so pairs of consecutive ptc_nums (INC_DIR and DEC_DIR) are assigned, i.e. CHANX ptc=0 and ptc=1 have edges to the same IPIN.

Expected Behaviour

Because the graphs are so similar, the expected behavior is to see similar wirelength and router results across both RR graph XML versions.

Current Behaviour

The architecture is a customized eFPGA with some unusual grid properties:
*The perimeter X/Y locations (x=0, y=0, y=30, x=36) are all EMPTY
*There are two IO block types that differ only in the number of I/Os located in them:
a. two copies of an IOB with 8 I/Os per IOB are located on each of the four sides of the array. These are the IOBs with the spurious TOP side IPIN in the original graph.
b. 24 copies of an IOB with 50 IOs per IOB are located on three of the four sides of the array.
c. leftover boundary tile locations (x=1, y=1, y=29, x=35) are populated with CLBs.
d. fc_in=0.15 and fc_out=0.15 for all pb_types.

example test circuit: synthetic finite state machine, packs into 631 CLBs with 8 4-LUTs each.
original graph wirelength: 70173
original graph runtime: ~15 seconds
current graph wirelength: routing fails.
current graph runtime: > 30 minutes

I have also shown that if I modify the original RR graph by manually deleting the edges from the CHANX/CHANY nodes to the IOB IPINs with side=TOP (but not deleting the nodes), the wirelength and runtime are effectively the same as the original RR graph.

Regarding a comparison to a VPR-generated RR graph derived from the original architecture XML, the following observations can be made:

*VPR inserts an additional row and column of routing channels at X=0, Y=0 that are not included in the custom RR graph
*The custom RR graph attempts to compensate for the missing row/column by looping channels that would go to X=0, Y=0 back to X=1, Y=1, respectively.
*VPR's routing channel index permutation algorithm for creating edges from CHANX/CHANY to IPINs differs from the custom graph generator's algorithm (original and bug fixed).

Possible Solution

If the mere existence of a second side for the IPIN node helps VPR in some way (for example, by fooling the router lookahead into making better choices), it would be nice to understand why this is the case, to help motivate modifying my graph generator to add these nodes systematically instead of as part of a bug.

Additionally, if there is risk in allowing the architecture XML to specify the architecture in a way that implies a different RR graph than what is loaded with --read_vpr_graph, it would be good to know what problems this kind of deviation can cause.

Finally, some observations from other experiments I have run attempting to improve the health of the architecture post-bug fix:

  • Increasing the flexibility of the connection boxes partially mitigates but does not fully resolve the issue.
  • Trials on smaller versions of the architecture with smaller test circuits produce similar (scaled) results.

Steps to Reproduce

Reproduction of the problem is possible on request. The original graph is available in Zero ASIC's logik release, and current graphs for the same architecture can be furnished for collaboration purposes if deemed appropriate.

Context

The goal of this work is to generate an eFPGA core to insert into an SoC. A custom RR graph generator is needed to properly support genfasm and to ensure CAD correspondence to the eFPGA hardware. For these reasons, we rely on generating a custom RR graph rather than reverse-engineering VPR's auto-generated RR graph. Given that, whatever can be done to keep the router stable vs. tiny perturbations of the routing graph is beneficial.

Your Environment

  • VTR revision used: commit hash de31f09
  • Operating System and version: Ubuntu 20.04
  • Compiler version: GCC 10.0
@petergrossmann21
Copy link
Contributor Author

@rs-dhow @AlexandreSinger @vaughnbetz I believe I have found the smoking gun issue in this setup. It appears that concurrently with fixing the bugs in the RR graph generation I introduced another error in which all switch_id values in the RR graph were being set to zero. It would appear that when the router is fed an RR graph comprised entirely of switch_id = 0 (vpr_delayless_switch) edges that something in the router goes haywire. Chalk this up to the hazards of fixing too many things at once while trying to turn out an eFPGA quickly for a short IC design cycle.

I am reporting after running only one round of experiments, but it would appear that much of the background I provided above is not relevant to the underlying issue. Let me know what's helpful from a VPR development standpoint re: next steps; in the meantime my bug fix will be subjected to a much larger battery of tests.

@vaughnbetz
Copy link
Contributor

Thanks @petergrossmann21 . I'm surprised setting the delay of all switches to 0 made the router go haywire, unless the delay of the nodes is also 0 (that could start getting strange, as the router might not see a benefit to branching off shorter paths for critical connections). But fixing that bug and re-running certainly makes sense.

There are a couple of router report options that can be helpful in debugging issues like this. See the option at https://docs.verilogtorouting.org/en/latest/vpr/command_line_usage/#cmdoption-vpr-generate_rr_node_overuse_report and the one right above it; turning on the graphics and selecting the congestion display after a failed routing can also help.

@petergrossmann21
Copy link
Contributor Author

@vaughnbetz I will need to confirm, but there's a very good chance the node delay is also zero. Additional testing is showing that correcting the switch_id settings is producing sensible results across several strawman architectures I use to verify my eFPGA generator. For context, my generator as a few different settings for delay modeling, most of which are still in beta as we have been codesigning them with ongoing IC design work. The baseline is a "zero" delay model in which no delay tags are provided in the , but a 1ns delay is furnished for the default switches in the section of the arch XML. Because all RC values are also set to zero, the 1ns delays in these switch types should be the only delays used in this architecture model.

Thanks for pointing me towards the router report options.

@vaughnbetz
Copy link
Contributor

If the delay of everything is 0, the normalizations we do for resource cost could also start going crazy; we normalize the resource costs to the average delay so they have the same units. If the average delay is 0, you might normalize everything to infinity. If we silently let that happen we should probably add an assert for that case during resource cost normalization at the start of the router.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants