Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added script for PyTorch issues debugging #137

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

kodxana
Copy link

@kodxana kodxana commented Mar 11, 2024

Added new command: gpu-test
Curently it's including one of my script to test out PyTorch (PyTorch,go file)
Saves debug informations to /workspace/gpu_diagnostics.json

What script does:
Gather all informations about host including:

  • pod id
  • machine id
  • CUDA version
  • PyTorch Version
  • GPU model

Runs test on all attached GPU's to makre sure PyTorch fully utilizes CUDA
Logs error messages into .json file for tech support to review.

Copy link
Contributor

@ef0xa ef0xa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly looks very good. Have a lot of small comments on general performance and 'go/house style' that are just meant to be pointers. Feel free to ignore the stuff labeled OPTIONAL or MINOR: just keep them in mind for next time.

}
}
`
data := map[string]interface{}{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MINOR: General go HTTP/REST performance tips:

  • use a struct here rather than map[string]any to avoid extraneous allocations.
  • don't make a map to iterate through the headers: just do req.Header.Set("Content-Type", "application/json")
  • don't make a new http.Client for each request: they're safe to share between invocations. if you don't care which client to use, use http.DefaultClient
  • either or both of the client or request context should have a timeout.

"github.com/spf13/cobra"
)

func getPodMachineID(podID, apiKey string) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function should return an error.

}
jsonData, _ := json.Marshal(data)

req, _ := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check this error and return it.

headers := map[string]string{
"Content-Type": "application/json",
}
query := `
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OPTIONAL: this can be a const.


var result map[string]interface{}
json.NewDecoder(resp.Body).Decode(&result)
if pod, ok := result["data"].(map[string]interface{})["pod"].(map[string]interface{}); ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, don't use map[string]any: just make a struct type that's the shape you want. in this case, it would be

type Result struct { 
    Data struct  { 
        Pod struct { 
            MachineID string  `json:"machineId"` 
         } `json:"pod"`
    } `json:"data"`
 }

driverVersion := driverVersionRegex.FindStringSubmatch(output)
gpuName := gpuNameRegex.FindStringSubmatch(output)

info := map[string]string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(MINOR, OPTIONAL): see above re: structs vs maps


client := &http.Client{}
resp, err := client.Do(req)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use return fmt.Errorf with %w instead of just printing the error.

return parseNvidiaSMIOutput(string(output))
}

func getSystemInfo() map[string]interface{} {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(OPTIONAL, MINOR): this function is tiny and only called once. inline it.

return results
}

func saveInfoToFile(info map[string]interface{}, filename string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(OPTIONAL, MINOR): this function is tiny and only called once. inline it.

return info
}

func getNvidiaSMIInfo() map[string]string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(OPTIONAL, MINOR): this function is tiny and only called once. inline it.

@ef0xa ef0xa self-assigned this Mar 12, 2024
@DireLines
Copy link
Collaborator

What will happen if people try to run this on the client? Is there anything telling them that it is a diagnostic to be run specifically through web terminal on the pod? I think somebody will try to run it locally and be confused by the output if not

@kodxana
Copy link
Author

kodxana commented Mar 12, 2024

What will happen if people try to run this on the client? Is there anything telling them that it is a diagnostic to be run specifically through web terminal on the pod? I think somebody will try to run it locally and be confused by the output if not

Most of the output in json file would be empty though script will try to run PyTorch test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants