Memory issue #34

Open
vvuonghn opened this issue Aug 14, 2023 · 3 comments

@vvuonghn

Hi @samleoqh

Thank you for releasing the source code, it has helped me a lot.

During training I ran into a memory problem: the process consumes a lot of memory, over 150 GB of RAM. I think the problem is in the validate function, because every batch's inputs/outputs are appended to inputs_all, gts_all, and predictions_all:

def validate(net, val_set, val_loader, criterion, optimizer, epoch, new_ep):
    net.eval()
    val_loss = AverageMeter()
    inputs_all, gts_all, predictions_all = [], [], []

    with torch.no_grad():
        for vi, (inputs, gts) in enumerate(val_loader):
            inputs, gts = inputs.cuda(), gts.cuda()
            N = inputs.size(0) * inputs.size(2) * inputs.size(3)  # pixel count, used to weight the loss
            outputs = net(inputs)

            val_loss.update(criterion(outputs, gts).item(), N)
            # only ~save_rate of the inputs are kept for visualization
            if random.random() > train_args.save_rate:
                inputs_all.append(None)
            else:
                inputs_all.append(inputs.data.squeeze(0).cpu())

            # these two appends run on every iteration, so gts_all and
            # predictions_all grow with the entire validation set
            gts_all.append(gts.data.squeeze(0).cpu().numpy())
            predictions = outputs.data.max(1)[1].squeeze(1).squeeze(0).cpu().numpy()
            predictions_all.append(predictions)

    update_ckpt(net, optimizer, epoch, new_ep, val_loss,
                inputs_all, gts_all, predictions_all)

    net.train()
    return val_loss, inputs_all, gts_all, predictions_all
@samleoqh (Owner)

Ah, there is a bug: those three lines should be moved inside the else branch:

gts_all.append(gts.data.squeeze(0).cpu().numpy())
predictions = outputs.data.max(1)[1].squeeze(1).squeeze(0).cpu().numpy()
predictions_all.append(predictions)

The save_rate value controls what fraction of the val images is appended for later visualization. I set it to 0.1 by default; it can be lowered further, e.g. to 0.001, if the number of val images is very large.
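
For reference, a minimal sketch of the loop body with those three lines moved as suggested, so that gts_all and predictions_all also grow only at the save_rate fraction (same names as in the function above):

with torch.no_grad():
    for vi, (inputs, gts) in enumerate(val_loader):
        inputs, gts = inputs.cuda(), gts.cuda()
        N = inputs.size(0) * inputs.size(2) * inputs.size(3)
        outputs = net(inputs)
        val_loss.update(criterion(outputs, gts).item(), N)

        # store nothing for most images; keep only ~save_rate of them
        if random.random() > train_args.save_rate:
            inputs_all.append(None)
        else:
            inputs_all.append(inputs.data.squeeze(0).cpu())
            gts_all.append(gts.data.squeeze(0).cpu().numpy())
            predictions = outputs.data.max(1)[1].squeeze(1).squeeze(0).cpu().numpy()
            predictions_all.append(predictions)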

@vvuonghn (Author)

Hi

I think that if you move those lines as above, the evaluation may no longer run, because evaluate needs predictions_all, gts_all, and train_args.nb_classes for the whole val set:

acc, acc_cls, mean_iu, fwavacc, f1 = evaluate(predictions_all, gts_all, train_args.nb_classes)

I think the best way to fix this is to evaluate every sample individually rather than the whole val set at once, as in the sketch below.
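
A minimal sketch of that per-sample idea, assuming evaluate accepts single-element lists and returns the five metrics in the order shown above (note the result is a plain per-image average, not pixel-weighted):

import numpy as np

scores = []
with torch.no_grad():
    for inputs, gts in val_loader:
        outputs = net(inputs.cuda())
        pred = outputs.data.max(1)[1].squeeze(0).cpu().numpy()
        gt = gts.squeeze(0).numpy()
        # score one image at a time; only the metric values are accumulated
        scores.append(evaluate([pred], [gt], train_args.nb_classes))

# average each metric over the validation images
acc, acc_cls, mean_iu, fwavacc, f1 = np.mean(scores, axis=0)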

@samleoqh (Owner)

Yes, you are right. I'll refactor the code a bit when I get some free time. The reason I appended all predictions together to compute the metrics is that each test image contains only one or two of the 9 classes, so aggregating them gives a whole, stable confusion matrix across all val images. It is also fine to evaluate images one by one and then average the results, like the val loss.
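
As a middle ground, a minimal sketch of that idea with a running confusion matrix: each batch only adds into a nb_classes x nb_classes array, so whole-dataset statistics are preserved without storing any predictions (fast_hist here is illustrative, not from this repo):

import numpy as np

def fast_hist(pred, gt, n_cls):
    # per-pixel confusion matrix; rows are ground truth, columns are predictions
    mask = (gt >= 0) & (gt < n_cls)
    return np.bincount(n_cls * gt[mask].astype(int) + pred[mask],
                       minlength=n_cls ** 2).reshape(n_cls, n_cls)

hist = np.zeros((train_args.nb_classes, train_args.nb_classes))
with torch.no_grad():
    for inputs, gts in val_loader:
        outputs = net(inputs.cuda())
        pred = outputs.data.max(1)[1].cpu().numpy()
        hist += fast_hist(pred.flatten(), gts.numpy().flatten(),
                          train_args.nb_classes)

# whole-val-set metrics from the accumulated confusion matrix
acc = np.diag(hist).sum() / hist.sum()
iu = np.diag(hist) / (hist.sum(1) + hist.sum(0) - np.diag(hist))
mean_iu = np.nanmean(iu)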
