Submission #6
base: main
PROF_SCOPED_MARKER("WorkLoop");
#pragma omp parallel for
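Wouldn't it be better to partition the parallel work by submatrix?
That said, the data set looks too small to put any real pressure on the cache.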
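What exactly do you mean by submatrix? Using the input's 4x4 unit, or a block made up of several 4x4 units?
The data set may indeed be a bit small. I was quite surprised that the profile showed so few cache misses.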
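> What exactly do you mean by submatrix? Using the input's 4x4 unit, or a block made up of several 4x4 units? The data set may indeed be a bit small. I was quite surprised that the profile showed so few cache misses.
For example, if the input is a 1024*1024 image, it could be split so that each thread takes a 64 * 64 submatrix to compute, which gives better reuse of data along the column direction.
But a 1024*1024 matrix seems small enough that it might even fit entirely in cache (?), so lines would not be repeatedly evicted and cause cache misses.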
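>> What exactly do you mean by submatrix? Using the input's 4x4 unit, or a block made up of several 4x4 units? The data set may indeed be a bit small. I was quite surprised that the profile showed so few cache misses.
> For example, if the input is a 1024*1024 image, it could be split so that each thread takes a 64 * 64 submatrix to compute, which gives better reuse of data along the column direction.
> But a 1024*1024 matrix seems small enough that it might even fit entirely in cache (?), so lines would not be repeatedly evicted and cause cache misses.
Yes, I considered trying this while working out how to load the input. The code here loads by row, which in principle should easily cause cache misses in the column direction. But the profiler reported very few misses and I was short on time, so I did not go down that path. The approach itself does make sense.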
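The 64 * 64 partitioning discussed above could look roughly like the sketch below. It is not the PR's code: `process_tiled` and `kernel` are hypothetical names, and the per-pixel work (here, summing a pixel with the one directly below it) is only a stand-in, since the real computation is not shown in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in kernel: each output pixel sums the input pixel and the
// one directly below it, so adjacent rows of the same columns get reused.
inline std::uint32_t kernel(const std::vector<std::uint32_t>& in,
                            std::size_t width, std::size_t height,
                            std::size_t x, std::size_t y) {
    const std::size_t below = std::min(y + 1, height - 1);  // clamp at bottom edge
    return in[y * width + x] + in[below * width + x];
}

// Tiled traversal: every (ty, tx) pair is one independent tile of at most
// tile x tile pixels, and whole tiles are distributed across threads.
std::vector<std::uint32_t> process_tiled(const std::vector<std::uint32_t>& in,
                                         std::size_t width, std::size_t height,
                                         std::size_t tile = 64) {
    assert(tile > 0 && in.size() == width * height);
    std::vector<std::uint32_t> out(width * height);
    const long tiles_y = static_cast<long>((height + tile - 1) / tile);
    const long tiles_x = static_cast<long>((width + tile - 1) / tile);

    // collapse(2) merges both tile loops into one parallel iteration space.
    // If OpenMP is not enabled, the pragma is ignored and the loops run serially.
    #pragma omp parallel for collapse(2) schedule(static)
    for (long ty = 0; ty < tiles_y; ++ty) {
        for (long tx = 0; tx < tiles_x; ++tx) {
            const std::size_t y0 = static_cast<std::size_t>(ty) * tile;
            const std::size_t x0 = static_cast<std::size_t>(tx) * tile;
            const std::size_t y1 = std::min(y0 + tile, height);
            const std::size_t x1 = std::min(x0 + tile, width);
            for (std::size_t y = y0; y < y1; ++y)
                for (std::size_t x = x0; x < x1; ++x)
                    out[y * width + x] = kernel(in, width, height, x, y);
        }
    }
    return out;
}
```

Each tile writes a disjoint region of `out`, so no synchronization is needed. Whether this beats the row-wise loop would have to be measured; with an image that already fits in cache, the difference may be small.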
I checked: HUST.PNG is 4.19M and my CPU's L2$ is 6M, so there really is plenty of room.
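> I checked: HUST.PNG is 4.19M and my CPU's L2$ is 6M, so there really is plenty of room.
The machine the profiler ran on has a 4M L3$. Since only the most recent 4 rows of the input are accessed at a time, that is also more than enough.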
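The arithmetic behind these sizes can be checked with a small sketch. It assumes a 1024-pixel-wide image stored at 4 bytes per pixel; neither the pixel format nor whether "4.19M" refers to the decoded image is stated in this thread, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption for illustration only: 4 bytes per pixel (e.g. RGBA8).
constexpr std::size_t kBytesPerPixel = 4;

// Size of the whole decoded image in bytes.
constexpr std::size_t image_bytes(std::size_t width, std::size_t height) {
    return width * height * kBytesPerPixel;
}

// Size of a sliding window of `rows` full-width rows in bytes.
constexpr std::size_t row_window_bytes(std::size_t width, std::size_t rows) {
    return width * rows * kBytesPerPixel;
}

constexpr bool fits_in_cache(std::size_t working_set, std::size_t cache_bytes) {
    return working_set <= cache_bytes;
}

constexpr std::size_t kMiB = 1024 * 1024;

// 1024 * 1024 * 4 = 4,194,304 bytes, which would match the quoted 4.19M.
static_assert(image_bytes(1024, 1024) == 4194304, "whole image is 4 MiB");
// Four rows of 1024 pixels are only 16 KiB.
static_assert(row_window_bytes(1024, 4) == 16384, "4-row window is 16 KiB");
static_assert(fits_in_cache(image_bytes(1024, 1024), 6 * kMiB), "fits a 6M cache");
static_assert(fits_in_cache(row_window_bytes(1024, 4), 4 * kMiB), "fits a 4M cache");
```

Under these assumptions the whole image fits a 6M cache, and the 4-row window is far below a 4M cache, which is consistent with the low miss counts reported by the profiler.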
With HUST.PNG as the input,
The number of threads is controlled through the environment variable OMP_NUM_THREADS.
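To confirm what the runtime picked up from OMP_NUM_THREADS, a small check like the following can help. This is a sketch rather than code from the PR: `max_threads` and `omp_num_threads_env` are hypothetical helper names, while `omp_get_max_threads()` is the standard OpenMP runtime call.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Upper bound on the team size of the next parallel region.
// The OpenMP runtime initializes this from OMP_NUM_THREADS when it is set.
inline int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;  // built without OpenMP: pragmas are ignored, code runs serially
#endif
}

// Raw value of OMP_NUM_THREADS, or an empty string when it is unset.
inline std::string omp_num_threads_env() {
    const char* value = std::getenv("OMP_NUM_THREADS");
    return value ? std::string(value) : std::string();
}
```

Running the program as `OMP_NUM_THREADS=4 ./app` (where `./app` is a placeholder binary name) and logging `max_threads()` at startup shows whether the setting took effect before the profiled work loop begins.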