Memory issue #34

Open
vvuonghn opened this issue Aug 14, 2023 · 3 comments

@vvuonghn

Hi @samleoqh

Thank you for releasing the source code, it has helped me a lot.

During training I ran into a memory problem: the process consumes a lot of memory, over 150 GB of RAM. I think the problem is in the validate function, because every batch's inputs/outputs are appended to inputs_all, gts_all, and predictions_all:

def validate(net, val_set, val_loader, criterion, optimizer, epoch, new_ep):
    net.eval()
    val_loss = AverageMeter()
    inputs_all, gts_all, predictions_all = [], [], []

    with torch.no_grad():
        for vi, (inputs, gts) in enumerate(val_loader):
            inputs, gts = inputs.cuda(), gts.cuda()
            N = inputs.size(0) * inputs.size(2) * inputs.size(3)  # pixel count, used to weight the loss
            outputs = net(inputs)

            val_loss.update(criterion(outputs, gts).item(), N)
            # only ~save_rate of the inputs are kept for visualization
            if random.random() > train_args.save_rate:
                inputs_all.append(None)
            else:
                inputs_all.append(inputs.data.squeeze(0).cpu())

            # these two appends run on every iteration, so gts_all and
            # predictions_all grow with the entire validation set
            gts_all.append(gts.data.squeeze(0).cpu().numpy())
            predictions = outputs.data.max(1)[1].squeeze(1).squeeze(0).cpu().numpy()
            predictions_all.append(predictions)

    update_ckpt(net, optimizer, epoch, new_ep, val_loss,
                inputs_all, gts_all, predictions_all)

    net.train()
    return val_loss, inputs_all, gts_all, predictions_all
@samleoqh (Owner)

Ah, there is a bug: those three lines should be moved inside the else branch:

gts_all.append(gts.data.squeeze(0).cpu().numpy())
predictions = outputs.data.max(1)[1].squeeze(1).squeeze(0).cpu().numpy()
predictions_all.append(predictions)

The save_rate value controls what fraction of the val images is appended for later visualization. I set it to 0.1 by default; it can be lowered further, e.g. to 0.001, if the number of val images is very large.
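
For reference, a minimal sketch of the loop body with those three lines moved as suggested, so that gts_all and predictions_all also grow only at the save_rate fraction (same names as in the function above):

with torch.no_grad():
    for vi, (inputs, gts) in enumerate(val_loader):
        inputs, gts = inputs.cuda(), gts.cuda()
        N = inputs.size(0) * inputs.size(2) * inputs.size(3)
        outputs = net(inputs)
        val_loss.update(criterion(outputs, gts).item(), N)

        # store nothing for most images; keep only ~save_rate of them
        if random.random() > train_args.save_rate:
            inputs_all.append(None)
        else:
            inputs_all.append(inputs.data.squeeze(0).cpu())
            gts_all.append(gts.data.squeeze(0).cpu().numpy())
            predictions = outputs.data.max(1)[1].squeeze(1).squeeze(0).cpu().numpy()
            predictions_all.append(predictions)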

@vvuonghn (Author)

Hi

I think that if you move those lines as above, the evaluation may no longer run, because evaluate needs predictions_all, gts_all, and train_args.nb_classes for the whole val set:

acc, acc_cls, mean_iu, fwavacc, f1 = evaluate(predictions_all, gts_all, train_args.nb_classes)

I think the best way to fix this is to evaluate every sample individually rather than the whole val set at once, as in the sketch below.
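
A minimal sketch of that per-sample idea, assuming evaluate accepts single-element lists and returns the five metrics in the order shown above (note the result is a plain per-image average, not pixel-weighted):

import numpy as np

scores = []
with torch.no_grad():
    for inputs, gts in val_loader:
        outputs = net(inputs.cuda())
        pred = outputs.data.max(1)[1].squeeze(0).cpu().numpy()
        gt = gts.squeeze(0).numpy()
        # score one image at a time; only the metric values are accumulated
        scores.append(evaluate([pred], [gt], train_args.nb_classes))

# average each metric over the validation images
acc, acc_cls, mean_iu, fwavacc, f1 = np.mean(scores, axis=0)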

@samleoqh (Owner)

Yes, you are right. I'll refactor the code a bit when I get some free time. The reason I appended all predictions together to compute the metrics is that each test image contains only one or two of the 9 classes, so aggregating them gives a whole, stable confusion matrix across all val images. It is also fine to evaluate images one by one and then average the results, like the val loss.
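
As a middle ground, a minimal sketch of that idea with a running confusion matrix: each batch only adds into a nb_classes x nb_classes array, so whole-dataset statistics are preserved without storing any predictions (fast_hist here is illustrative, not from this repo):

import numpy as np

def fast_hist(pred, gt, n_cls):
    # per-pixel confusion matrix; rows are ground truth, columns are predictions
    mask = (gt >= 0) & (gt < n_cls)
    return np.bincount(n_cls * gt[mask].astype(int) + pred[mask],
                       minlength=n_cls ** 2).reshape(n_cls, n_cls)

hist = np.zeros((train_args.nb_classes, train_args.nb_classes))
with torch.no_grad():
    for inputs, gts in val_loader:
        outputs = net(inputs.cuda())
        pred = outputs.data.max(1)[1].cpu().numpy()
        hist += fast_hist(pred.flatten(), gts.numpy().flatten(),
                          train_args.nb_classes)

# whole-val-set metrics from the accumulated confusion matrix
acc = np.diag(hist).sum() / hist.sum()
iu = np.diag(hist) / (hist.sum(1) + hist.sum(0) - np.diag(hist))
mean_iu = np.nanmean(iu)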
