Skip to content

Commit

Permalink
[SYCL][Docs][Joint matrix] Add overloads and restrictions for the off…
Browse files Browse the repository at this point in the history
…set load store (#15499)

- Add missing restriction on the stride of the checked variants of
load/store
- Add new overloads of `joint_matrix_load` and `joint_matrix_store`
where the offsets are separated from the base pointer and added as
separate arguments. I kept the same name as the expectation is to remove
the regular variants once the new ones are used instead.
- Add restrictions on both the regular and the offset
`joint_matrix_load/store` on PVC since in the current implementation, no
runtime checks are added as they are expensive. The fall back to 1d
load/store is done using a flag instead.
  • Loading branch information
dkhaldi authored Oct 22, 2024
1 parent f26139e commit 475ca2d
Show file tree
Hide file tree
Showing 2 changed files with 181 additions and 44 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -171,6 +171,37 @@ void joint_matrix_store(Group g,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> dest,
size_t stride);

// Overloads for offset store
template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> base_dest, size_t row_index,
size_t col_index, size_t stride);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> base_dest, size_t row_index,
size_t col_index, size_t stride);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, typename PropertyListT>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT>
base_dest, size_t row_index, size_t col_index, size_t stride);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, typename PropertyListT>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT>
base_dest, size_t row_index, size_t col_index, size_t stride);

} // namespace sycl::ext::intel::experimental::matrix
```

Expand Down Expand Up @@ -244,19 +275,19 @@ supporting the out of bounds checked APIs that are defined in this section.
In this section, we refer to the memory buffer where a `joint_matrix`
is loaded from or stored to as the global matrix. This global matrix
is also interpreted as a two-dimensional memory region as follows, where
`GlobalRows` is number of rows in the global matrix, `GlobalCols` is number of
columns in the global matrix, `Stride` is number of columns that include
`global_rows` is number of rows in the global matrix, `global_cols` is number of
columns in the global matrix, `stride` is number of columns that include
the out of bounds data (depicted as x here).

```
GlobalCols
global_cols
<----------->
dddddddddddddxxx ^
dddddddddddddxxx | GlobalRows
dddddddddddddxxx | global_rows
dddddddddddddxxx v
xxxxxxxxxxxxxxxx
<-------------->
Stride
stride
```

In the diagram above, the global matrix has 13 columns and 3
Expand Down Expand Up @@ -293,15 +324,15 @@ checking, namely `joint_matrix_fill`, `joint_matrix_load`, and
the global memory matrix, which is different from the APIs that do not
do bounds checking. Those non-bounds-checking APIs take a pointer to
the base of the joint matrix.
* The coordinates `RowIndex` and `ColIndex` into the global matrix to
* The coordinates `row_index` and `col_index` into the global matrix to
calculate the pointer offset to load/store are given as separate
arguments.
* These variants take extra arguments to determine the global bounds
`GlobalRows` and `GlobalCols` of the global matrix.
`global_rows` and `global_cols` of the global matrix.

To illustrate the out-of-bounds checking, consider the global matrix
shown above which has 13 columns and 3 rows (`GlobalRows=3` and
`GlobalCols=13`), where the joint matrix size is 8 columns by 2 rows defined as
shown above which has 13 columns and 3 rows (`global_rows=3` and
`global_cols=13`), where the joint matrix size is 8 columns by 2 rows defined as
```
joint_matrix<sub_group, bfloat16, use::b, 2, 8, layout::row_major> sub_b;
```
Expand All @@ -311,14 +342,14 @@ both dimensions. This is shown below, where capital letters correspond
to the elements that are accessed by this joint matrix load:

```
GlobalCols
global_cols
<----------->
dddddddddddddxxx ^
dddddddddddddxxx | GlobalRows
dddddddddddddxxx | global_rows
ddddddddDDDDDXXX v
xxxxxxxxXXXXXXXX
<-------------->
Stride
stride
```

If the joint matrix is loaded via `joint_matrix_load_checked` using
Expand All @@ -335,18 +366,18 @@ namespace sycl::ext::intel::experimental::matrix {
template <typename Group, typename T, size_t Rows, size_t Cols,
use Use, layout Layout, typename Tv>
void joint_matrix_fill_checked(Group g, joint_matrix<Group, T, Use, Rows,
Cols, Layout> &m, Tv v, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);
Cols, Layout> &m, Tv v, size_t global_rows, size_t global_cols,
size_t row_index, size_t col_index);

// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_load_checked(Group g,
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
multi_ptr<T2, Space, IsDecorated> base_src, size_t Stride,
layout Layout, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);
multi_ptr<T2, Space, IsDecorated> base_src, size_t stride,
layout Layout, size_t global_rows, size_t global_cols,
size_t row_index, size_t col_index);

// Only available when Layout != layout::dynamic
// and when std::is_same_v<T1, std::remove_const_t<T2>>
Expand All @@ -356,17 +387,17 @@ template <typename Group, typename T1, typename T2,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_load_checked(Group g,
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
multi_ptr<T2, Space, IsDecorated> base_src, size_t Stride,
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
multi_ptr<T2, Space, IsDecorated> base_src, size_t stride,
size_t global_rows, size_t global_cols, size_t row_index, size_t col_index);

// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols, typename PropertyListT>
void joint_matrix_load_checked(Group g,
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
ext::oneapi::experimental::annotated_ptr<T2, PropertyListT> base_src,
size_t Stride, layout Layout, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);
size_t stride, layout Layout, size_t global_rows, size_t global_cols,
size_t row_index, size_t col_index);

// Only available when Layout != layout::dynamic
// and when std::is_same_v<T1, std::remove_const_t<T2>>
Expand All @@ -375,55 +406,55 @@ template <typename Group, typename T1, typename T2, size_t Rows,
void joint_matrix_load_checked(Group g,
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T2, PropertyListT> base_src,
size_t Stride, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);
size_t stride, size_t global_rows, size_t global_cols,
size_t row_index, size_t col_index);

template <typename Group, typename T, size_t Rows, size_t Cols,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride, layout Layout,
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
multi_ptr<T, Space, IsDecorated> base_dest, size_t stride, layout Layout,
size_t global_rows, size_t global_cols, size_t row_index, size_t col_index);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride,
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
multi_ptr<T, Space, IsDecorated> base_dest, size_t stride,
size_t global_rows, size_t global_cols, size_t row_index, size_t col_index);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride,
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
multi_ptr<T, Space, IsDecorated> base_dest, size_t stride,
size_t global_rows, size_t global_cols, size_t row_index, size_t col_index);

template <typename Group, typename T, size_t Rows, size_t Cols,
typename PropertyListT>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
size_t Stride, layout Layout, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);
size_t stride, layout Layout, size_t global_rows, size_t global_cols,
size_t row_index, size_t col_index);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, typename PropertyListT>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
size_t Stride, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);
size_t stride, size_t global_rows, size_t global_cols,
size_t row_index, size_t col_index);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, typename PropertyListT>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
size_t Stride, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);
size_t stride, size_t global_rows, size_t global_cols,
size_t row_index, size_t col_index);

} // namespace sycl::ext::intel::experimental::matrix
```
Expand All @@ -445,12 +476,12 @@ the following queries to get these requirements:
|Tells the required alignment (in bytes) of the base pointer for
`joint_matrix_load_checked` and `joint_matrix_store_checked`.
|`ext::intel::experimental::info::device::matrix_checked_rowindex_multiple_of<T>`|
`size_t`|Returns a value, of which `RowIndex` must be multiple of;
`size_t`|Returns a value, of which `row_index` must be multiple of;
where `T` is the element type of the matrix. When using the matrices
with the machine learning types, `T` should be the element type
(e.g. `precision::tf32`) not the storage type.
|`ext::intel::experimental::info::device::matrix_checked_globalcols_multiple_of<T>`|
`size_t` | Returns a value, of which `GlobalCols` must be multiple of;
`size_t` | Returns a value, of which `global_cols` must be multiple of;
where `T` is the element type of the matrix. When using the matrices
with the machine learning types, `T` should be the element type
(e.g. `precision::tf32`) not the storage type.
Expand All @@ -462,14 +493,19 @@ The checked APIs are currently available in devices with the architecture
`architecture::intel_gpu_pvc`. The following restrictions apply to
these checked APIs:

- The `stride` parameter has the following restrictions:

* The value `stride * sizeof(T1)` must be a multiple of 8, and
* The value of `stride * sizeof(T1)` must not exceed `2^24^`.

- The base pointer must be 4 bytes aligned.

- For 8 bits data type, `RowIndex` must be a multiple of 4. For 16 bits
data type, `RowIndex` must be a multiple of 2. So `RowIndex` must be a
- For 8 bits data type, `row_index` must be a multiple of 4. For 16 bits
data type, `row_index` must be a multiple of 2. So `row_index` must be a
multiple of 4 divided by size of the element type (`4/sizeof(T)`).

- For 8 bits data type, `GlobalCols` must be a multiple of 4. For 16 bits
data type, `GlobalCols` must be a multiple of 2. So `GlobalCols` must be a
- For 8 bits data type, `global_cols` must be a multiple of 4. For 16 bits
data type, `global_cols` must be a multiple of 2. So `global_cols` must be a
multiple of 4 divided by size of the element type (`4/sizeof(T)`).

=== New Device Information Descriptor
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -326,6 +326,86 @@ of `sycl::multi_ptr`. The property list associated with the
`annotated_ptr` argument represents the compile-time constant
properties for cache control included in the SYCL extenion link:../../proposed/sycl_ext_intel_cache_controls.asciidoc[sycl_ext_intel_cache_controls]


==== Offset Load
```c++
namespace sycl::ext::oneapi::experimental::matrix {

// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
multi_ptr<T2, Space, IsDecorated> base_src, size_t row_index,
size_t col_index, size_t stride, layout Layout);

// Only available when Layout != layout::dynamic
// and when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols,
use Use, layout Layout,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
multi_ptr<T2, Space, IsDecorated> base_src, size_t row_index,
size_t col_index size_t stride);

// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols,
typename PropertyListT>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
annotated_ptr<T2, PropertyListT> base_src, size_t row_index, size_t
col_index, size_t stride, layout Layout);

// Only available when Layout != layout::dynamic
// and when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols, use Use, layout Layout,
typename PropertyListT>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
annotated_ptr<T2, PropertyListT> base_src, size_t row_index, size_t
col_index, size_t stride);

} // namespace sycl::ext::oneapi::experimental::matrix
```

These overloads of `joint_matrix_load` takes the pointer `base_src` to
designate the base pointer of the global memory matrix. The
coordinates `row_index` and `col_index` into the global matrix to
calculate the pointer offset to load/store are given as separate
arguments.

==== Offset Store
```c++
namespace sycl::ext::oneapi::experimental::matrix {

// T1 must be the same as T2
template <typename Group, typename T1, typename T2, size_t Rows, size_t Cols,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
multi_ptr<T2, Space, IsDecorated> base_dest, size_t row_index,
size_t col_index, size_t stride, layout Layout);

template <typename Group, typename T1, typename T2, size_t Rows, size_t Cols,
typename PropertyListT>
void joint_matrix_store(Group g,
const joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
annotated_ptr<T2, PropertyListT> base_dest, size_t row_index, size_t
col_index, size_t stride, layout Layout);

} // namespace sycl::ext::oneapi::experimental::matrix
```
These overloads of `joint_matrix_store` takes the pointer `base_dest` to
designate the base pointer of the global memory matrix. The
coordinates `row_index` and `col_index` into the global matrix to
calculate the pointer offset to load/store are given as separate
arguments.

==== Multiply and Add

```c++
Expand Down Expand Up @@ -562,7 +642,7 @@ float *buf = malloc_shared<float>(M*K, q);
auto pBuf = address_space_cast<sycl::access::address_space::global_space,
sycl::access::decorated::no>(buf);

joint_matrix_load(sg, tA, pBuf + Offset, Stride);
joint_matrix_load(sg, tA, pBuf + Offset, stride);
```

==== store
Expand All @@ -576,7 +656,7 @@ float *buf = malloc_shared<float>(M*K, q);
auto pBuf = address_space_cast<sycl::access::address_space::global_space,
sycl::access::decorated::no>(buf);

joint_matrix_store(sg, tA, pBuf + Offset, Stride, layout::row_major);
joint_matrix_store(sg, tA, pBuf + Offset, stride, layout::row_major);
```

==== fill
Expand Down Expand Up @@ -979,7 +1059,7 @@ for (int i = 0; sizeof(combinations); i++) {
}
```

=== Appendix: Supported Combinations Per Hardware
=== Appendix: Supported Combinations and Restrictions Per Hardware
The table below provides a list of the combinations that
`joint_matrix` implementations support on each of Intel AMX and Intel
XMX hardware. Note that these can be returned using
Expand Down Expand Up @@ -1116,6 +1196,27 @@ architecture::intel_gpu_dg2_g11, architecture::intel_gpu_dg2_g12`,
`architecture::intel_gpu_lnl_m`
|======================

===== Restrictions on `architecture::intel_gpu_pvc`

- The `stride` parameter to `joint_matrix_load` and
`joint_matrix_store` has the following restrictions:

* The value `stride * sizeof(T1)` must be a multiple of 8, and
* The value of `stride * sizeof(T1)` must not exceed `2^24^`.

- The base pointer argument to `joint_matrix_load` and
`joint_matrix_store` must be 4 bytes aligned.

- In the case of the offset overloads of `joint_matrix_load` and
`joint_matrix_store`, for 8 bits data type, `row_index` must be a
multiple of 4. For 16 bits data type, `row_index` must be a multiple
of 2. So `row_index` must be a multiple of 4 divided by size of the
element type (`4/sizeof(T)`).

- If these restrictions are not satisfied, users can switch to slower
implementations of `joint_matrix_load` and `joint_matrix_store` by
setting the driver flag `IGC_JointMatrixLoadStoreOpt=1`.

==== Nvidia Tensor Cores Supported Combinations
The complete set of matrix data types and shapes that are supported by
the `ext_oneapi_cuda` backend are represented in the following
Expand Down

0 comments on commit 475ca2d

Please sign in to comment.