Evaluating on the validation dataset #14

Open
yinoue0426 opened this issue Jun 17, 2022 · 1 comment

Comments

@yinoue0426

I am currently trying to run the evaluation code on the valid-unseen dataset with the available pretrained models, and I have some questions.

First, this is the code I am using to run the evaluation:

python main.py -n1 --max_episode_length 1000 --num_local_steps 25 --num_processes 1 --eval_split valid_unseen --from_idx 0 --to_idx 820 --max_fails 10 --debug_local --learned_depth --use_sem_seg --set_dn tmp --use_sem_policy -v 0 --which_gpu 0 --x_display 0

The code fails to run, however, complaining that rewards.json is missing. So I pulled it from the [ALFRED repo](https://github.com/askforalfred/alfred/blob/master/models/config/rewards.json) and added a --reward_config flag to arguments.py.
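For reference, the addition to arguments.py was roughly the following. This is a minimal, self-contained sketch assuming the file's existing argparse parser; the default path is simply where I placed the copied rewards.json, mirroring ALFRED's models/config/rewards.json location.

```python
import argparse

# Sketch of the flag I added; in arguments.py the add_argument call goes on
# the existing parser rather than a fresh one created here for illustration.
parser = argparse.ArgumentParser()
parser.add_argument(
    '--reward_config',
    type=str,
    default='models/config/rewards.json',
    help='path to the rewards.json copied from the ALFRED repo',
)
args, _ = parser.parse_known_args()
print(args.reward_config)
```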

The code runs with the modification, and I get SR=18.03%, which matches the val-unseen score for "without template assumption" (Table 2 of the FILM paper). I am using best_model_multi.pt as the semantic search policy.
What I wasn't quite sure about is which language processing module is being used. I think the predicted templates are read from the models/instructions_processed_LP/instruction2_params*.p files, and I was wondering whether they were generated with or without the template assumption.
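For context, this is roughly how I have been inspecting those files. It is a small sketch; the specific file name below and the assumption that each file is a dict mapping an instruction to its predicted parameters are my own guesses from poking at the data, not something documented in the repo.

```python
import pickle

# Hypothetical example path; the directory holds several
# instruction2_params*.p files, so adjust the name to the one you want.
path = 'models/instructions_processed_LP/instruction2_params_valid_unseen.p'

with open(path, 'rb') as f:
    instruction2_params = pickle.load(f)

# Print a few entries to see what the predicted templates look like.
# Assumes a dict keyed by the language instruction, with the predicted
# task type / arguments as the value.
for instruction, params in list(instruction2_params.items())[:3]:
    print(instruction)
    print(params)
    print('-' * 40)
```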

Thanks,

@yinoue0426
Author

I looked at my results in more detail and have one update to the original post.

> The code runs with the modification, and I get SR=18.03%

The score I reported here actually came from splitting the evaluation across 3 PCs (the --from_idx/--to_idx range was divided among them), so I doubt the random number generation matches the run used in the paper. Please disregard my remark about 18.03% matching the paper result.
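For illustration, an even three-way split of the 0-820 range would look roughly like the commands below, one per machine. The exact boundaries I used may have differed, and I did not check whether --to_idx is inclusive, so treat these as a sketch rather than the exact commands:

python main.py -n1 --max_episode_length 1000 --num_local_steps 25 --num_processes 1 --eval_split valid_unseen --from_idx 0 --to_idx 274 --max_fails 10 --debug_local --learned_depth --use_sem_seg --set_dn tmp --use_sem_policy -v 0 --which_gpu 0 --x_display 0

python main.py -n1 --max_episode_length 1000 --num_local_steps 25 --num_processes 1 --eval_split valid_unseen --from_idx 274 --to_idx 547 --max_fails 10 --debug_local --learned_depth --use_sem_seg --set_dn tmp --use_sem_policy -v 0 --which_gpu 0 --x_display 0

python main.py -n1 --max_episode_length 1000 --num_local_steps 25 --num_processes 1 --eval_split valid_unseen --from_idx 547 --to_idx 820 --max_fails 10 --debug_local --learned_depth --use_sem_seg --set_dn tmp --use_sem_policy -v 0 --which_gpu 0 --x_display 0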

I am still interested in knowing which language processing module is being used, i.e., whether the models/instructions_processed_LP/instruction2_params*.p files were generated with or without the template assumption.
