-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added script for PyTorch issues debugging #137
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly looks very good. Have a lot of small comments on general performance and 'go/house style' that are just meant to be pointers. Feel free to ignore the stuff labeled OPTIONAL or MINOR: just keep them in mind for next time.
cmd/diagnostic/PyTorch.go
Outdated
} | ||
} | ||
` | ||
data := map[string]interface{}{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MINOR: General go HTTP/REST performance tips:
- use a struct here rather than
map[string]any
to avoid extraneous allocations. - don't make a map to iterate through the headers: just do
req.Header.Set("Content-Type", "application/json")
- don't make a new
http.Client
for each request: they're safe to share between invocations. if you don't care which client to use, usehttp.DefaultClient
- either or both of the client or request context should have a timeout.
"github.com/spf13/cobra" | ||
) | ||
|
||
func getPodMachineID(podID, apiKey string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function should return an error.
cmd/diagnostic/PyTorch.go
Outdated
} | ||
jsonData, _ := json.Marshal(data) | ||
|
||
req, _ := http.NewRequest("POST", url, bytes.NewBuffer(jsonData)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check this error and return it.
headers := map[string]string{ | ||
"Content-Type": "application/json", | ||
} | ||
query := ` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OPTIONAL: this can be a const
.
cmd/diagnostic/PyTorch.go
Outdated
|
||
var result map[string]interface{} | ||
json.NewDecoder(resp.Body).Decode(&result) | ||
if pod, ok := result["data"].(map[string]interface{})["pod"].(map[string]interface{}); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, don't use map[string]any
: just make a struct type that's the shape you want. in this case, it would be
type Result struct {
Data struct {
Pod struct {
MachineID string `json:"machineId"`
} `json:"pod"`
} `json:"data"`
}
driverVersion := driverVersionRegex.FindStringSubmatch(output) | ||
gpuName := gpuNameRegex.FindStringSubmatch(output) | ||
|
||
info := map[string]string{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(MINOR, OPTIONAL): see above re: structs vs maps
|
||
client := &http.Client{} | ||
resp, err := client.Do(req) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use return fmt.Errorf
with %w
instead of just printing the error.
return parseNvidiaSMIOutput(string(output)) | ||
} | ||
|
||
func getSystemInfo() map[string]interface{} { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(OPTIONAL, MINOR): this function is tiny and only called once. inline it.
return results | ||
} | ||
|
||
func saveInfoToFile(info map[string]interface{}, filename string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(OPTIONAL, MINOR): this function is tiny and only called once. inline it.
return info | ||
} | ||
|
||
func getNvidiaSMIInfo() map[string]string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(OPTIONAL, MINOR): this function is tiny and only called once. inline it.
What will happen if people try to run this on the client? Is there anything telling them that it is a diagnostic to be run specifically through web terminal on the pod? I think somebody will try to run it locally and be confused by the output if not |
Most of the output in json file would be empty though script will try to run PyTorch test. |
Added new command: gpu-test
Curently it's including one of my script to test out PyTorch (PyTorch,go file)
Saves debug informations to /workspace/gpu_diagnostics.json
What script does:
Gather all informations about host including:
Runs test on all attached GPU's to makre sure PyTorch fully utilizes CUDA
Logs error messages into .json file for tech support to review.