Uplift dram and l1 allocators to use dram/l1 specific alignment #17122

Open · wants to merge 74 commits into base: main

Conversation

@llongTT (Contributor) commented Jan 26, 2025

Ticket

#13609

Problem description

Using the max of the DRAM and L1 alignments for both DRAM and L1 buffers was causing PCC mismatches in interleaved-to-sharded (i2s) and sharded-to-interleaved (s2i).

What's changed

Use L1/DRAM-specific alignment for the respective allocations. This requires some ops to be uplifted to handle re-alignment.
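
As a rough sketch of the direction (a hypothetical helper, not code from this PR), the idea is to derive the alignment from the buffer's memory type via the HAL instead of taking the max of the DRAM and L1 alignments:

// Hypothetical helper, for illustration only.
uint32_t alignment_for(tt::tt_metal::BufferType buffer_type) {
    // DRAM buffers get the DRAM alignment; everything else falls back to the L1 alignment here.
    auto mem_type = (buffer_type == tt::tt_metal::BufferType::DRAM) ? tt::tt_metal::HalMemType::DRAM
                                                                    : tt::tt_metal::HalMemType::L1;
    return tt::tt_metal::hal.get_alignment(mem_type);
}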

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • (For models and ops writers) Full new model tests pass
  • New/Existing tests provide coverage for changes

abhullar-tt and others added 30 commits December 4, 2024 00:40
@@ -1061,7 +1061,7 @@ conv_op_l1_usage conv2d::calculate_L1_usage(
} else if (output_dtype == DataType::FLOAT32) {
per_core_out_width_aligned *= 4;
}
-output_size = round_up(per_core_out_width_aligned, 32) * pconfig.per_core_out_matrix_height;
+output_size = round_up(per_core_out_width_aligned, 16) * pconfig.per_core_out_matrix_height;
Contributor:

It would be nice to have defines/constexpr for these magic numbers (16 in this case, 32 before).
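
For example, a named constant could replace the literal (a hypothetical name, shown only to illustrate the suggestion):

// Hypothetical constant name; the value mirrors the hardcoded 16 in this diff.
constexpr uint32_t l1_alignment_bytes = 16;
output_size = round_up(per_core_out_width_aligned, l1_alignment_bytes) * pconfig.per_core_out_matrix_height;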

Contributor:

If this is a function used just for L1, we can use the HAL API: get_alignment(HalMemType::L1)
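
A minimal sketch of that suggestion, assuming the hal object is reachable from this function (it is used that way in the snippet commented on further down):

// Query the L1 alignment from the HAL instead of hardcoding 16.
uint32_t l1_alignment = tt::tt_metal::hal.get_alignment(tt::tt_metal::HalMemType::L1);
output_size = round_up(per_core_out_width_aligned, l1_alignment) * pconfig.per_core_out_matrix_height;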

@@ -108,7 +108,7 @@ def test_unet_trace(

@skip_for_grayskull("UNet not currently supported on GS")
@pytest.mark.parametrize(
"device_params", [{"l1_small_size": 68864, "trace_region_size": 442368, "num_command_queues": 2}], indirect=True
"device_params", [{"l1_small_size": 68864, "trace_region_size": 917504, "num_command_queues": 2}], indirect=True
Contributor:

Why so much larger? Seems odd we now need to double the trace region size.

@@ -343,7 +343,7 @@ def test_unet_trace_2cq_multi_device(

@skip_for_grayskull("UNet not currently supported on GS")
@pytest.mark.parametrize(
"device_params", [{"l1_small_size": 68864, "trace_region_size": 424960, "num_command_queues": 2}], indirect=True
"device_params", [{"l1_small_size": 68864, "trace_region_size": 1376256, "num_command_queues": 2}], indirect=True
Contributor:

Same here - why such a large increase?

@tt-aho (Contributor) left a comment:

lgtm if hardcoding is removed

@@ -1164,7 +1164,7 @@ conv_op_l1_usage conv2d::calculate_L1_usage(
} else if (output_dtype == DataType::FLOAT32) {
per_core_out_width_aligned *= 4;
}
-output_size = round_up(per_core_out_width_aligned, 32) * pconfig.per_core_out_matrix_height;
+output_size = round_up(per_core_out_width_aligned, 16) * pconfig.per_core_out_matrix_height;
Contributor:

Should also not be hardcoded to 16

Comment on lines +174 to +175
uint32_t l1_alignment = tt::tt_metal::hal.get_alignment(tt::tt_metal::HalMemType::L1);
uint32_t per_core_N_bytes_padded = tt::round_up(per_core_N * datum_size_bytes, l1_alignment);
Contributor:

Would it make more sense to query a's buffer alignment here instead of querying the hal?
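
A rough sketch of that alternative, assuming the input tensor a exposes its underlying buffer and that Buffer provides an alignment() accessor (neither is shown in this diff):

// Hedged sketch: take the alignment from a's buffer rather than querying the HAL directly.
uint32_t l1_alignment = a.buffer()->alignment();
uint32_t per_core_N_bytes_padded = tt::round_up(per_core_N * datum_size_bytes, l1_alignment);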
