Bandwidth drops for linux #29
Comments
I don't understand the title of your issue, which says bandwidth drops. Drops with what, and compared to what? By how much? What I see is that you asked for 9999999999 words, about 10 billion. Not only does this overflow 32 bits (see the code of the app and try
Just to be clear, in #18 I didn't modify the way kernel paging mechanisms are used. I just applied the macro/function renamings that the kernel developers made across kernel versions.
If you noticed a BW drop for all kernel versions, this makes me think of the recent Spectre/Meltdown patches that suddenly got applied to ALL kernel versions. As this isolates kernel pages from other process pages (if you're using an Intel CPU), the only impact it can have on performance is a slowdown. Assuming this is the cause, of course. In that case, I know there were discussions about adding a kernel parameter so that this costly kernel page table isolation can be disabled, but I don't know the current status of this (if the parameter does exist, it's probably only in the latest kernels anyway). You can always recompile the kernel with this disabled and give it a try.
Other than that, please check your system logs from when the FPGA is discovered by the kernel. They print the negotiated speed and other interesting information. Your 1400/1500 MB/s looks like PCIe Gen1 2.5GT/s x8 to me, with BW significantly lower than the theoretical 2GB/s. You should test with much larger buffers (at least 100MB) to mask sync latency and get closer to what your machine can achieve.
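The two pieces of arithmetic in the comment above can be checked with a short sketch. It assumes the usual line encodings (8b/10b for Gen1/Gen2, 128b/130b for Gen3) and ignores packet and protocol overhead, so the result is an upper bound rather than what RIFFA would actually reach. The function names are hypothetical and not part of RIFFA or testutil, and the range check says nothing about how testutil itself parses its argument.

```python
# Sketch only: theoretical PCIe link bandwidth and a 32-bit range check.
# Names below are illustrative and do not come from the RIFFA code base.

# Line-encoding efficiency per link speed in GT/s
# (Gen1/Gen2 use 8b/10b, Gen3 uses 128b/130b).
ENCODING_EFFICIENCY = {2.5: 8 / 10, 5.0: 8 / 10, 8.0: 128 / 130}


def theoretical_bandwidth_mb_s(speed_gt_s: float, lanes: int) -> float:
    """Upper bound on link bandwidth in MB/s (1 MB = 10**6 bytes)."""
    bits_per_second = speed_gt_s * 1e9 * ENCODING_EFFICIENCY[speed_gt_s] * lanes
    return bits_per_second / 8 / 1e6


def fits_in_uint32(value: int) -> bool:
    """True if value is representable as an unsigned 32-bit integer."""
    return 0 <= value < 2**32


if __name__ == "__main__":
    # Gen1 x8, the configuration guessed at in the comment above.
    print(round(theoretical_bandwidth_mb_s(2.5, 8)))
    # The word count that was requested on the command line.
    print(fits_in_uint32(9999999999))
```

For Gen1 x8 this gives roughly 2000 MB/s, which matches the 2GB/s figure quoted above, and 9999999999 is outside the unsigned 32-bit range.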
The bandwidth drop is relative to my previous Gen 2 result, which was around 3.5Gbps. I have tried re-downloading the bitstream and rebooting my computer. Only one of many retries solved the problem. I am not sure whether this relates to the linux driver or the kernel version. Besides, as for why we are receiving less data than we sent out through the chnl_tester module, please see https://groups.google.com/forum/#!topic/riffa_users/z1hQ7F0vFGs or the following:
|
What do you observe when you run lspci |
"lspci | grep Altera" gave me:
03:00.0 Unassigned class [ff00]: Altera Corporation Device 0004 (rev 01)
How does this affect the number of received words?
How about lspci -vvv
I need to see more than that first line
… On Feb 28, 2018, at 9:45 PM, promach ***@***.***> wrote:
"lspci | grep Altera" gave me
03:00.0 Unassigned class [ff00]: Altera Corporation Device 0004 (rev 01)
|
Here you go:
|
Still can't see what I'm looking for. Try |
Here you go :
|
Okay - it didn’t show up in the first 10 lines
How about `lspci -vvv | grep -A 100 Altera`
I’m looking for a line like:
LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
And while you’re at it - do you have the debug driver installed? If so - then also run the command
dmesg | grep riffa
… On Feb 28, 2018, at 10:07 PM, promach ***@***.***> wrote:
Here you go :
***@***.***:~$ sudo lspci -vvv | grep -A 10 altera
[sudo] password for phung:
***@***.***:~$ sudo lspci -vvv | grep -A 10 altera
***@***.***:~$ sudo lspci -vvv | grep -A 10 Altera
03:00.0 Unassigned class [ff00]: Altera Corporation Device 0004 (rev 01)
Subsystem: Altera Corporation Device 0004
Physical Slot: 4
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 52
Region 0: Memory at ef300000 (32-bit, non-prefetchable) [size=1K]
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00618 Data: 0000
Capabilities: [78] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
***@***.***:~$
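The `LnkSta` line asked for in the comment above can also be pulled out of saved `lspci -vvv` output programmatically. The following is a minimal sketch, assuming the output has been captured into a string. The helper name `parse_lnksta` is hypothetical, and the optional parenthesised status after each field is an allowance for newer lspci versions, which may print something like "(ok)" or "(downgraded)" there.

```python
import re

# Matches e.g. "LnkSta: Speed 2.5GT/s, Width x16, ...".
# The optional "(...)" group tolerates a status suffix after the speed.
LNKSTA_RE = re.compile(
    r"LnkSta:\s*Speed\s+([\d.]+)GT/s(?:\s*\([^)]*\))?,\s*Width\s+x(\d+)"
)


def parse_lnksta(lspci_output: str):
    """Return (speed_gt_s, lanes) from the first LnkSta line, or None if absent."""
    match = LNKSTA_RE.search(lspci_output)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


# Example line taken from the comment above.
SAMPLE = (
    "LnkSta: Speed 2.5GT/s, Width x16, "
    "TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-"
)

if __name__ == "__main__":
    print(parse_lnksta(SAMPLE))
```

Applied to the `grep -A 10` output pasted in this thread, the helper would return `None`, because the listing stops before the PCI Express capability where `LnkSta` is printed. That is why a larger `-A` value was requested.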
|
Please check my updated comment. I have just run "./testutil 2 0 0 55536". This is what I got for "dmesg | grep riffa". The log still shows that the number of received words (32768) is less than the number of words sent (55536).
The driver code that discards additional data is highlighted at https://github.com/KastnerRG/riffa/blob/master/driver/linux/riffa_driver.c#L640-L650 Does anyone have comments on how to trigger this segment of the C driver code? I have tried, but I have been receiving all the words I sent after modifying the for-loop in the testutil C code at https://gist.github.com/promach/5fb1ddfab95d4e72033a3053735f9df1#file-testutil-c-L155
I am using an Altera DE4, and it seems that my intermittent PCIe hardware detection issue is related to some timing analysis errors. Could anyone guide me on how to apply Altera's timing closure recommendation for this particular path, which involves the PCIe hard IP core?
When I ran the dmesg command to check the FPGA status during boot-up, I saw the following: [ 2.119690] riffa: loading out-of-tree module taints kernel. Will the "module verification failed" message cause any other issue? I don't know whether this is normal. I tested with testutil, and it seems to be working fine.
I have tried linux kernels 4.4, 4.13 and 4.15.5; the testutil check only gives me around 1.5Gbps for a very large sample size.
@marzoul Does this have to do with the paging mechanism in newer linux kernels, which you modified for #18? I suspect it is something else, because in my experience linux kernel 4.4 is also affected. Could you advise?
Is there a way to debug this bandwidth reduction using the linux driver compiled with debug support enabled ("make debug")?
Why am I only getting 65536 words when I sent 99999 words?
kernel log for "./testutil 2 0 0 99999"