Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pu.pileupsWithControl might skip some pile-up regions or raise vague domain_score not in dict2 AssertionError #146

Open
nvaulin opened this issue May 25, 2024 · 1 comment

Comments

@nvaulin
Copy link

nvaulin commented May 25, 2024

Hi, I bumped into some unclear behavior of coolpup.py. Finally, I managed to overcome the problem, but maybe this issue might be helpful for other users and might help to improve this awesome tool.

TLDR:

AssertionError: domain_score not in dict2 from pu.pileupsWithControl means there are regions you want to pile-up may be intersected or close to bad bins.

Story

I tried to get rescaled pileup of TADs with calculation of TADs scores (strengths). I've done everything according to the Distribution of TAD strength scores part of the tutorial. For your example HFF_MicroC data everything worked fine, but for my data the behavior was strange:

  • pu.pileupsWithControl raises AssertionError: domain_score not in dict2 form the accumulate_values function

I found out that there were too many NaNs in the expected df. (Yes, I should pay more attention to my data, but back to the coolpup.py). I recalculated expected without respect to the centromeres and the AssertionError disappeared.

But then, after pileup ran without any errors, I found that length of pup.loc[0].domain_score is less the len(tads). I tried to record all the TADs processed by the add_domain_score and found that some of the TADs are missed:

image

So, these unprocessed TADs were skipped silently, without any warning, When I tried to run pu.pileupsWithControl specifically on the unprocessed TADs, it gave me AssertionError: domain_score not in dict2 again.

These unprocessed TADs were first or last TADs on their chromosomes and, apparently, are intersected with Hi-C map black stripes ("bad bins"). For their snippets data was containing only NaNs. After I performed more strict TADs filtering - pileup ended up fine for my data.

So the conclusions for the users:

  • Always check the shapes of the input and output data
  • Pay attention to the handling of bad bins prior to the post-hoc analysis (e.x. filtering out TADs intersected or close to bad bins)

Conclusions for the coolpuppy team

  • If there are bad bins influencing excepted values or intersected with TADs df -> then coolpuppy behave a bit vague:
    • It raises AssertionError: domain_score not in dict2 error, which is very far away from the real problem
    • In some cases it even just skips bad TADs without any warning.
  • May be it worth adding some additional checks if the pipeup of some particular TAD contains only NaN. I didnt find why in some cases such TADs are just dropped (they don't even pass into the postprocess_func).

Again, in my case it just solved by more strict filtering, however probably there may be cases when such TADs skippings may spoil the results.

I don't know weather this issue might be useful, especially I am not able to provide my data right now. You can close it if you want. If you will be discovering this case, then I can assist you.

System

  • OS: Linux Ubuntu 20.04.6 LTS
  • coolpuppy 1.1.0
  • Python 3.10.8

Sincerely,
Nikita

@Phlya
Copy link
Member

Phlya commented May 30, 2024

Thank you for reporting it Nikita! This is an important issue, at the very least regions should not be silently skipped for no clear reason... It would be more useful if you could provide some reproducible example even with some random public data... But we'll try to look into it when we find the time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants