LST followups: better work divisions, concrete kernel dimension, some cleanup and fixes #47084

Open
wants to merge 5 commits into
base: master

ariostas
Contributor

This PR addresses some of the LST followups that we have listed in #46746.

Here is the list of fixes/changes:

  • Better work division: we switched to cms::alpakatools::make_workdiv (instead of our custom createWorkDiv) and now use cms::alpakatools::uniform_elements for kernel loops.
  • We switched to explicitly specifying kernel dimensions instead of using templated types.
  • Started removal of kVerticalModuleSlope (previously named lst_INF). We're doing this in two steps instead of one since the data files also need to be updated. We ensure a smooth transition by first supporting both options and later removing the legacy one.
  • We fixed some issues with our includes and an overflow that occurred occasionally.
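
For readers unfamiliar with the pattern, here is a rough CPU-only sketch of what a 1D work division plus a strided element loop amounts to. It is plain C++ and does not use alpaka or cms::alpakatools; every name in it (WorkDiv1D, divideUp, makeWorkDiv1D, stridedIndices, visitCounts) is hypothetical and chosen only for illustration. The assumption is that each thread starts at its global index and advances by the total number of threads, so every element is visited exactly once even when there are fewer threads than elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 1D work division (not the alpaka type).
struct WorkDiv1D {
  std::size_t blocks;
  std::size_t threadsPerBlock;
};

// Integer division rounded up.
constexpr std::size_t divideUp(std::size_t n, std::size_t d) { return (n + d - 1) / d; }

// Choose enough blocks to cover `size` elements, capped at `maxBlocks`.
inline WorkDiv1D makeWorkDiv1D(std::size_t size, std::size_t threadsPerBlock, std::size_t maxBlocks) {
  assert(threadsPerBlock > 0 && maxBlocks > 0);
  std::size_t blocks = divideUp(size, threadsPerBlock);
  if (blocks == 0)
    blocks = 1;  // always launch at least one block
  if (blocks > maxBlocks)
    blocks = maxBlocks;
  return {blocks, threadsPerBlock};
}

// Element indices that one (block, thread) pair visits in a strided loop.
inline std::vector<std::size_t> stridedIndices(WorkDiv1D wd, std::size_t block, std::size_t thread, std::size_t size) {
  std::vector<std::size_t> indices;
  const std::size_t stride = wd.blocks * wd.threadsPerBlock;  // total number of threads
  for (std::size_t i = block * wd.threadsPerBlock + thread; i < size; i += stride)
    indices.push_back(i);  // the bound check keeps every access inside [0, size)
  return indices;
}

// Emulate the whole grid serially and count how often each element is visited.
inline std::vector<int> visitCounts(WorkDiv1D wd, std::size_t size) {
  std::vector<int> counts(size, 0);
  for (std::size_t b = 0; b < wd.blocks; ++b)
    for (std::size_t t = 0; t < wd.threadsPerBlock; ++t)
      for (std::size_t i : stridedIndices(wd, b, t, size))
        ++counts[i];
  return counts;
}
```

With 1000 elements, 64 threads per block and a cap of 4 blocks, the stride is 256, so thread 0 of block 0 handles elements 0, 256, 512 and 768. Fixing the dimension to 1D in the struct, rather than templating on it, mirrors the move to explicit kernel dimensions in spirit only.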

c.c. @slava77 @VourMa

@cmsbuild
Contributor

cmsbuild commented Jan 10, 2025

cms-bot internal usage

@cmsbuild
Contributor

A new Pull Request was created by @ariostas for master.

It involves the following packages:

  • RecoTracker/LSTCore (reconstruction)

@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @dgulhan, @felicepantaleo, @gpetruc, @missirol, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@ariostas
Contributor Author

Tagging @fwyzard since most (if not all) of the comments addressed were his

@slava77
Contributor

slava77 commented Jan 10, 2025

test parameters:

  • enable_tests = gpu
  • workflows_gpu = 29634.704,29834.704
  • workflows = 29634.703,29834.703
  • relvals_opt = -w upgrade,standard
  • relvals_opt_gpu = -w upgrade,standard

@slava77
Contributor

slava77 commented Jan 10, 2025

@cmsbuild please test

@cmsbuild
Contributor

-1

Failed Tests: UnitTests RelVals-GPU
Size: This PR adds an extra 104KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1eb2fd/43723/summary.html
COMMIT: 1a27b2a
CMSSW: CMSSW_15_0_X_2025-01-10-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47084/43723/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 error in the following unit tests:

---> test test-das-selected-lumis had ERRORS

RelVals-GPU

  • 29834.704: 29834.704_TTbar_14TeV+Run4D110PU_lstOnGPUIters01TrackingOnly/step3_TTbar_14TeV+Run4D110PU_lstOnGPUIters01TrackingOnly.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 52
  • DQMHistoTests: Total histograms compared: 3996179
  • DQMHistoTests: Total failures: 64
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3996095
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 51 files compared)
  • Checked 226 log files, 195 edm output root files, 52 DQM output files
  • TriggerResults: no differences found

@slava77
Contributor

slava77 commented Jan 10, 2025

29834.704: 29834.704_TTbar_14TeV+Run4D110PU_lstOnGPUIters01TrackingOnly/step3_TTbar_14TeV+Run4D110PU_lstOnGPUIters01TrackingOnly.log

there are a bunch of errors like

alpaka/event/EventUniformCudaHipRt.hpp(66) 
'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 
'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!

the same workflow step3 in the baseline ran OK. So, the crash seems related to this PR.
