[SYSTEMDS-3546] Push down image pre-processing to blas/mkl #1843

Open
wants to merge 90 commits into
base: main

anishsapkota

This is where I intend to add the native bindings for Intel MKL image processing.

Contributor

@Baunsgaard Baunsgaard left a comment

Hi anishsapkota,

Welcome to open source.

I suggest you start by doing the following:

  1. Add the .h file with a function definition.
  2. Add the native function in NativeHelper.java.
  3. Add a test verifying the behavior of your new function. To do this, add a new folder named native in src/test/java/org/apache/sysds/test/functions, and inside it create a Java file with an appropriate name.

Furthermore, I suggest that you initially make something that just hooks up and calls the C++ code without actually doing much, to verify the binding, and only after this make the MKL part work.

src/main/cpp/libAnish.cpp (outdated)
Contributor

@Baunsgaard Baunsgaard left a comment

This is nice progress.

I understand from your comment that commenting out the other code was the only way for you to make it work.

But we should strive to find a way to make all of it work (your part as well).
Perhaps the limitation is that you do not have BLAS installed, and maybe both need to be installed to compile the library?

Alternatively, we can add a new cpp file called systemds_img.cpp, with a header file, that you hook into. But then you need to make a new NativeHelper class instance that hooks up to the new file. In general, I would be interested to know whether there is a nice way of doing such a thing.

Thanks

@Baunsgaard Baunsgaard changed the title [SYSTEMDS-3546] good start [SYSTEMDS-3546] Push down image pre-processing to blas/mkl Jun 16, 2023
@Baunsgaard
Contributor

Hi @anishsapkota,

How is it going? Is there uncommitted progress, or do you need any further comments?
Also, if any of the comments are resolved, feel free to click "resolved" yourself.

@anishsapkota
Author

Hi @Baunsgaard,
I have been loading the shared libraries manually in my tests, because NativeHelper currently supports only Linux and Windows. I am working on macOS, so the libraries are not loaded automatically. Could you please suggest how to handle this?

Apart from this, I have created a new file systemds_img.cpp and added the methods I have implemented there, added the native function tests to JavaTest.yaml, and created a new Java class ImgNativeHelper that extends NativeHelper.
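
The platform-dependent loading problem described above could be handled by deriving the library file name from the operating system before calling System.load. The following is only a sketch under stated assumptions: the Linux and macOS file name patterns follow names that appear elsewhere in this thread, the Windows pattern is a guess, and NativeLibName and libraryFileName are hypothetical names that are not part of SystemDS.

```java
import java.util.Locale;

public class NativeLibName {
    /**
     * Builds the file name of the compiled native library for a BLAS backend
     * ("openblas" or "mkl") and an OS name as returned by
     * System.getProperty("os.name").
     */
    public static String libraryFileName(String blasType, String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check macOS first: the string "darwin" also contains "win".
        if (os.contains("mac") || os.contains("darwin"))
            return "libsystemds_" + blasType + "-Darwin-x86_64.dylib";
        if (os.contains("win"))
            // Assumed naming; the thread does not show the Windows file name.
            return "libsystemds_" + blasType + "-Windows-AMD64.dll";
        // Default to the Linux naming seen in the CI error message.
        return "libsystemds_" + blasType + "-Linux-x86_64.so";
    }

    public static void main(String[] args) {
        System.out.println(libraryFileName("openblas", System.getProperty("os.name")));
    }
}
```

Keeping the name resolution in a pure function makes it testable without any native library installed.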

@anishsapkota
Author

anishsapkota commented Sep 11, 2023

Hi @Baunsgaard,
I am waiting for the tests to run, so that we can merge.

Regarding the performance results, should I compare the performance between OpenBLAS and Intel MKL, or with the existing DML implementation?

@Baunsgaard
Contributor

> Hi @Baunsgaard, I am waiting for the tests to run, so that we can merge.
>
> Regarding the performance results, should I compare the performance between OpenBLAS and Intel MKL, or with the existing DML implementation?

You need to compare the results across all the variants you have.
But most importantly between the DML Java version and your MKL version.

@Baunsgaard
Contributor

Regarding the tests, I need to approve them to run: since this is your first commit, they are not allowed to run automatically.
I will keep an eye on it and run the tests when you commit. If you feel like you are waiting too long, just write a comment.

@anishsapkota
Author

java.lang.UnsatisfiedLinkError: /github/workspace/src/main/cpp/lib/libsystemds_openblas-Linux-x86_64.so: libopenblas.so.0: cannot open shared object file: No such file or directory

I still can't figure out why it can't find the library. Could you please review the static block from my last commit?

@Baunsgaard
Contributor

Baunsgaard commented Sep 15, 2023

Hi @Baunsgaard

System.IO.IOException: No space left on device : '/home/runner/runners/2.309.0/_diag/Worker_20230915-060835-utc.log'
Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/runners/2.309.0/_diag/Worker_20230915-060835-utc.log'

https://github.com/apache/systemds/actions/runs/6191936104

Well, I guess we need to clean up some of the resources.
I think we only have 8 GB of disk space, but I might remember it wrong.

What we need to do is remove the cached files from the install of our Docker image, to reduce its size, via apt-get clean in

docker/testsysds.Dockerfile

Then include MKL and OpenBLAS in the image rather than installing them in the cpp build script, and make sure we do not keep around anything that is not needed in our install.

But this is out of scope for this PR. I would prefer that we make it pass as it is and do the Docker refinements some other time.

@anishsapkota
Author

anishsapkota commented Sep 15, 2023

@Baunsgaard @j143
Hi,
after some intensive trying using Gitpod, I came to the following conclusions:

  1. The error "libmkl_rt.so cannot open the file: no such file or directory" occurred because ldconfig could not find the exported LD_LIBRARY_PATH. I could solve this problem by adding a .conf file containing /opt/intel/oneapi/mkl/latest/lib/intel64 to the /etc/ld.so.conf.d directory and reloading ldconfig.

  2. Next, thanks to @j143, I came across his video about installing Intel MKL on Ubuntu. If I do it as he describes, using the offline package installer, which has a CLI-based UI and requires user interaction (I could not find a way to do this from a shell script), everything works fine after doing what I mentioned in 1.
    If we use the apt package installer instead of the offline one, the include folder containing the necessary header files such as "mkl.h" is somehow missing from the MKL root folder. I tried "locate mkl.h", with no result.
    I could not find a solution to this.
    The next option would be to install the whole intel-basekit, which also contains MKL. Unfortunately it is too large (about 13 GB, I think) and could have led to the not-enough-space error.

  3. Regarding "mkl_dnn.h": as I mentioned in an earlier comment on #1843 (https://github.com/apache/systemds/pull/1843#issuecomment-1718879255), if we download it and unzip it, it contains two folders, include and lib. We then copy "mkl_dnn.h" and "mkl_dnn_types.h" (the second name is something similar) from its include folder to /opt/intel/oneapi/mkl/latest/include (if present), and the libs to /opt/intel/oneapi/mkl/latest/lib/intel64. Using this trick I could compile the cpp files with the "mkl_dnn.h" header file as well. But this is not recommended; it was just an experiment.

  4. Nevertheless, I could run the benchmarks using both MKL and OpenBLAS, as well as the DML implementations with the -stats flag, in Gitpod, although I could not install MKL properly in the GitHub test Docker container. I will process the results and post them.

@Baunsgaard, should I comment out the "mkl test case" so that we can merge? Or how should I proceed now?

@Baunsgaard
Contributor

> @Baunsgaard should I comment out "mkl test case" so that we can merge? or how should I proceed now?

Let's go with the BLAS test then, and add a TODO that references this PR for the MKL tests. I will then try to make it work at a later time, since I need it for other things.

Contributor

@Baunsgaard Baunsgaard left a comment

I like it, nice implementation.

My main comment is that the comparison is not fair. There are some things we can do to improve it if you want (read the detailed comments for more).

Before we merge it, we need to address all the comments, but you do not need to do this before your presentation on Monday, and it is up to you whether you want to at all.

Thanks!

path: target/jacoco.exec
retention-days: 1
- name: Checkout Repository
uses: actions/checkout@v4
Contributor

we need to revert here, since the change is only syntax.


apt-get update
apt-get install libopenblas-dev -y
fi
Contributor

add comment on MKL install.

cout << endl;
}
cout << "\n"<< endl;
}
Contributor

i suggest removing this debugging printImage.

cblas_dcopy(end_x - start_x, &img_in[(x_in) + (y_in) * in_w],
1, &img_out[y * out_w + start_x + static_cast<int>(offset_x)], 1);
}
}
Contributor

indentation.

return img_out;
}

void img_translate(double* img_in, double offset_x, double offset_y,
Contributor

Cast the offsets once in the first line rather than at every use of offset_x and offset_y, and maybe consider changing the API to use int offsets.
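
The cast-once suggestion can be illustrated with a plain Java reference version of a translate operation, which could also serve as a test oracle for the native call. This is a sketch only, not the C++ code from this PR: the semantics (row-major layout, truncating casts, a fill value for uncovered pixels), the parameter list, and the name TranslateSketch are all assumptions made for illustration. System.arraycopy plays the role that cblas_dcopy plays in the native code.

```java
import java.util.Arrays;

public class TranslateSketch {
    /**
     * Shifts a row-major image by (offsetX, offsetY) into an outW x outH
     * output; pixels not covered by the shifted input get the fill value.
     */
    public static double[] translate(double[] in, int inW, int inH,
            double offsetX, double offsetY, int outW, int outH, double fill) {
        // Cast once up front instead of at every use (truncates toward zero).
        final int offX = (int) offsetX;
        final int offY = (int) offsetY;

        double[] out = new double[outW * outH];
        Arrays.fill(out, fill);

        // Range of input columns and rows that land inside the output.
        int startX = Math.max(0, -offX);
        int endX = Math.min(inW, outW - offX);
        int startY = Math.max(0, -offY);
        int endY = Math.min(inH, outH - offY);
        if (endX <= startX)
            return out; // shifted completely out of view horizontally

        for (int yIn = startY; yIn < endY; yIn++) {
            int yOut = yIn + offY;
            // Copy one contiguous row segment per iteration.
            System.arraycopy(in, yIn * inW + startX,
                out, yOut * outW + startX + offX, endX - startX);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] img = {1, 2, 3, 4}; // 2 x 2 image
        System.out.println(Arrays.toString(translate(img, 2, 2, 1.0, 0.0, 2, 2, 0.0)));
    }
}
```

With integer offsets computed once, the row loop contains only index arithmetic on ints, which is the point of the review comment.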


import java.io.File;

public class ImgNativeHelper extends NativeHelper {
Contributor

you can add doc on the Helper class itself.
i would suggest a hint at where to find the CPP files.

System.load(System.getProperty("user.dir")+"/src/main/cpp/lib/libsystemds_" + blasType + "-Darwin-x86_64.dylib");
}
} catch (Exception e) {
e.printStackTrace();
Contributor

i would like this to throw the exception, not just print stack trace.

you can wrap the exception in a throw new RuntimeException(e);
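
A minimal sketch of the rethrowing variant suggested above, with hypothetical names (LoadSketch, loadNativeLibrary) that are not taken from the PR. One detail worth noting: System.load reports a missing or unloadable library with UnsatisfiedLinkError, which is an Error rather than an Exception, so a catch (Exception e) block would not see it. The sketch therefore catches it explicitly.

```java
import java.io.File;

public class LoadSketch {
    /**
     * Loads a native library from the given path and fails loudly:
     * any load problem is rethrown as a RuntimeException with the cause attached.
     */
    public static void loadNativeLibrary(String path) {
        try {
            // System.load requires an absolute path.
            System.load(new File(path).getAbsolutePath());
        } catch (UnsatisfiedLinkError | SecurityException e) {
            throw new RuntimeException("Failed to load native library: " + path, e);
        }
    }

    public static void main(String[] args) {
        try {
            // Illustrative path that is not expected to exist.
            loadNativeLibrary("/nonexistent/libsystemds_openblas-Linux-x86_64.so");
        } catch (RuntimeException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Failing at load time makes a broken setup visible immediately, instead of surfacing later as a confusing error on the first native call.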

assertArrayEquals(expectedOutput, img_out, 1e-9); // Compare arrays with a small epsilon
}

public void runTests(String blasType) {
Contributor

I understand why you designed it this way, but please make the test integrate with JUnit.

To do this, remove the runTests method and make each of your static calls a method annotated with @Test.
Then make the imgNativeHelper a field of this class that the constructor instantiates.
Finally, change the class into a parameterized class that calls the constructor selecting either BLAS or MKL.

Contributor

Hi @anishsapkota, there are plenty of tests in the neighboring folders you can take inspiration from!


import java.util.Random;

import static org.apache.sysds.utils.ImgNativeHelper.*;
Contributor

no wildcard imports allowed.

}

public void runBlasTests(boolean sparse,int n, int seed) {
double spSparse = 0.1;
Contributor

indentation

@anishsapkota
Author

> My main comment is the comparison is not fair. There are some things we can do to improve it if you want (read the detailed comments for more).

Hi @Baunsgaard,
thank you for reviewing the code and for the suggestions. Rest assured, I will address all of them after my presentation.

Could you be more specific about how I can make the comparison fairer? I will incorporate your previous suggestions of calculating the standard deviation and analysing the DML execution time using the -stats flag. Do you have any further suggestions?
It would be interesting to address this issue before the presentation.

@Baunsgaard
Contributor

> could you be more specific, how can I make the comparison fairer? I will be incorporating your previous suggestions of calculating std and analysing dml execution time using -stats flag. Do you have any further suggestions? It would be interesting for the presentation to address this issue before that.

The fairest comparison would be to integrate the native instruction as a DML-supported command. It does not take long to make, but it requires you to know where to do it.
Once it is integrated, both executions include the overhead of the system, and we thereby get a fair comparison.

For now the stats comparison is close enough.

@anishsapkota
Author

> The fairest comparison would be to integrate the native instruction as a DML supported command. It does not take long to make. But requires you to know where to do it.

Could you please let me know which files I should be looking at, so that I can implement this as well?

@anishsapkota
Author

[Four consecutive comments, each containing only an image attachment; the images are not preserved.]

@anishsapkota
Author

anishsapkota commented Sep 17, 2023

DML Implementations Benchmark sample with -stats flag

image_rotate

SystemDS Statistics:
Total elapsed time: 1,842 sec.
Total compilation time: 0,042 sec.
Total execution time: 1,800 sec.
Number of compiled Spark inst: 1.
Number of executed Spark inst: 0.
Cache hits (Mem/Li/WB/FS/HDFS): 1835090/0/0/0/1.
Cache writes (Li/WB/FS/HDFS): 3/14/0/1.
Cache times (ACQr/m, RLS, EXP): 0,946/0,352/1,060/0,058 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0,167 sec.
Spark ctx create time (lazy): 0,000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0,000/0,000/0,000 secs.
Spark async. count (pf,bc,op): 0/0/0.
ParFor loops optimized: 1.
ParFor optimize time: 0,001 sec.
ParFor initialize time: 0,004 sec.
ParFor result merge time: 0,028 sec.
ParFor total update in-place: 0/262144/262168
Total JIT compile time: 36.02 sec.
Total JVM GC count: 175.
Total JVM GC time: 0.374 sec.
Heavy hitter instructions:
Instruction Time(s) Count
1 rightIndex 3,288 786443
2 m_img_rotate 1,711 1
3 m_img_transform 1,711 1
4 rmvar 1,434 2359350
5 leftIndex 1,213 262168
6 createvar 0,997 1048631
7 castdts 0,846 524294
8 && 0,544 786432
9 mvvar 0,392 524300
10 < 0,355 524288

image_cutout

SystemDS Statistics:
Total elapsed time: 0,108 sec.
Total compilation time: 0,042 sec.
Total execution time: 0,065 sec.
Number of compiled Spark inst: 1.
Number of executed Spark inst: 0.
Cache hits (Mem/Li/WB/FS/HDFS): 2/0/0/0/1.
Cache writes (Li/WB/FS/HDFS): 0/1/0/1.
Cache times (ACQr/m, RLS, EXP): 0,032/0,000/0,000/0,033 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0,000 sec.
Spark ctx create time (lazy): 0,000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0,000/0,000/0,000 secs.
Spark async. count (pf,bc,op): 0/0/0.
Total JIT compile time: 13.577 sec.
Total JVM GC count: 176.
Total JVM GC time: 0.368 sec.
Heavy hitter instructions:
Instruction Time(s) Count
1 write 0,033 1
2 sp_rblk 0,032 1
3 rmvar 0,001 2
4 leftIndex 0,000 1
5 rand 0,000 1
6 createvar 0,000 4
7 + 0,000 2
8 < 0,000 1
9 mvvar 0,000 8
10 - 0,000 2

image_crop

New w/h=409.0/409.0
SystemDS Statistics:
Total elapsed time: 0,134 sec.
Total compilation time: 0,029 sec.
Total execution time: 0,105 sec.
Number of compiled Spark inst: 1.
Number of executed Spark inst: 0.
Cache hits (Mem/Li/WB/FS/HDFS): 9/0/0/0/1.
Cache writes (Li/WB/FS/HDFS): 1/8/0/1.
Cache times (ACQr/m, RLS, EXP): 0,037/0,000/0,000/0,063 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0,000 sec.
Spark ctx create time (lazy): 0,000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0,000/0,000/0,000 secs.
Spark async. count (pf,bc,op): 0/0/0.
Total JIT compile time: 36.052 sec.
Total JVM GC count: 176.
Total JVM GC time: 0.367 sec.
Heavy hitter instructions:
Instruction Time(s) Count
1 write 0,063 1
2 sp_rblk 0,037 1
3 m_img_crop 0,005 1
4 - 0,002 4
5 rmempty 0,001 1
6 rshape 0,000 3
7 rmvar 0,000 8
8 + 0,000 8
9 leftIndex 0,000 1
10 rand 0,000 2

image_translate

SystemDS Statistics:
Total elapsed time: 0,166 sec.
Total compilation time: 0,049 sec.
Total execution time: 0,117 sec.
Number of compiled Spark inst: 1.
Number of executed Spark inst: 0.
Cache hits (Mem/Li/WB/FS/HDFS): 3/0/0/0/1.
Cache writes (Li/WB/FS/HDFS): 0/3/0/1.
Cache times (ACQr/m, RLS, EXP): 0,039/0,000/0,000/0,077 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0,000 sec.
Spark ctx create time (lazy): 0,000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0,000/0,000/0,000 secs.
Spark async. count (pf,bc,op): 0/0/0.
Total JIT compile time: 13.584 sec.
Total JVM GC count: 176.
Total JVM GC time: 0.367 sec.
Heavy hitter instructions:
Instruction Time(s) Count
1 write 0,077 1
2 sp_rblk 0,039 1
3 m_img_translate 0,001 1
4 leftIndex 0,000 1
5 rmvar 0,000 12
6 rand 0,000 1
7 rightIndex 0,000 1
8 createvar 0,000 5
9 < 0,000 8
10 mvvar 0,000 10

@Baunsgaard
Contributor

Baunsgaard commented Sep 18, 2023

In the results you have above, the time under Heavy hitter instructions is what I would use to compare. The rest of the reported time is overhead of Java, I/O, and other compilation-related elements.
But I would modify the script you use. An example of a measuring script:

res = read($1)
print(sum(res))
for(i in 1:$2) {
    res2 = OPERATION(res)
}
print(sum(res2))

This reads the file given as the first argument and then repeats the operation the number of times specified by the second argument. Because of the way we execute in SystemDS, the sum before the loop makes sure res is actually read from disk.

The downside of this approach is that it is hard, if not impossible, to get the execution time of the individual calls.
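
Per-call times, which the DML loop above does not expose, could instead be collected on the JVM side when a native helper method is called directly from Java. The following is a hedged sketch of such a harness, not something taken from this PR: TimingSketch and its methods are hypothetical names, the timed operation is a stand-in array copy, and the warm-up count is an arbitrary choice meant to reduce JIT effects. It reports the mean and the sample standard deviation, in line with the earlier suggestion to calculate std.

```java
public class TimingSketch {
    /** Runs op warmup times untimed, then reps times timed; returns per-call milliseconds. */
    public static double[] timeMillis(Runnable op, int warmup, int reps) {
        for (int i = 0; i < warmup; i++)
            op.run();
        double[] times = new double[reps];
        for (int i = 0; i < reps; i++) {
            long start = System.nanoTime();
            op.run();
            times[i] = (System.nanoTime() - start) / 1e6;
        }
        return times;
    }

    /** Arithmetic mean; 0 for an empty array. */
    public static double mean(double[] v) {
        if (v.length == 0)
            return 0.0;
        double sum = 0;
        for (double x : v)
            sum += x;
        return sum / v.length;
    }

    /** Sample standard deviation (divides by n - 1); 0 for fewer than two values. */
    public static double std(double[] v) {
        if (v.length < 2)
            return 0.0;
        double m = mean(v);
        double ss = 0;
        for (double x : v)
            ss += (x - m) * (x - m);
        return Math.sqrt(ss / (v.length - 1));
    }

    public static void main(String[] args) {
        // Stand-in workload; a real benchmark would call the native helper here.
        double[] src = new double[512 * 512];
        double[] dst = new double[src.length];
        double[] t = timeMillis(() -> System.arraycopy(src, 0, dst, 0, src.length), 10, 50);
        System.out.printf("mean=%.4f ms, std=%.4f ms%n", mean(t), std(t));
    }
}
```

Numbers from such a harness exclude SystemDS compilation and I/O overhead, so they are comparable to each other but not directly to the totals printed by -stats.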

@j143
Contributor

j143 commented Sep 24, 2023

Hi @anishsapkota -- let's move this forward: please address the review comments (which seem minor) so that the reviewer can commit what is there so far.

> could you please let me know which files I should be looking at, so that I can implement this as well ?

After that you can add further functionality. This would reduce the burden on you and avoid merge conflicts.

Thanks, Janardhan

@j143
Contributor

j143 commented Nov 6, 2023

Hi @anishsapkota -- thanks for your contribution. Do you need some help here?

@j143
Contributor

j143 commented Dec 4, 2023

Hi @Baunsgaard, this PR seems fine; there are only minor comments. Is it okay to take this in?

@j143 j143 added this to the systemds-3.2.0 milestone Dec 4, 2023
@Baunsgaard
Contributor

> Hi @Baunsgaard , this PR does seems fine. only minor comments are there. Is it okay to take this in?

Hi @j143, it can be taken in, but we need to be careful and go through the code, since some of the changes (for instance to the GitHub Actions) should not be made.

@corepointer
Contributor

Hi!

Sorry that I'm a bit late to the discussion.

Our current Intel MKL support ends at version 2019.5 (you noticed the problems with dnn ops not compiling). I would be surprised if all magically worked by just using OneMKL and OneDNN.

We should probably put the native ops testing into a separate test suite and use a Docker container that supports it.

Furthermore, moving to OneAPI would change external behavior and require a major version bump.

Regards, Mark

@j143 j143 modified the milestones: systemds-3.2.0, next-release Feb 8, 2024
@j143
Contributor

j143 commented Feb 8, 2024

Marking this PR for the next release because of the concerns pointed out by @corepointer:

> Our current Intel MKL support ends at version 2019.5 (you noticed the problems with dnn ops not compiling). I would be surprised if all magically worked by just using OneMKL and OneDNN.
>
> We should probably put the native ops testing into a separate test suite and use a Docker container that supports it.
>
> Furthermore, moving to OneAPI would change external behavior and require a major version bump.

4 participants