Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AraTha map read in with zero rate #903

Closed
apragsdale opened this issue Apr 9, 2021 · 4 comments
Closed

AraTha map read in with zero rate #903

apragsdale opened this issue Apr 9, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@apragsdale
Copy link
Member

apragsdale commented Apr 9, 2021

Short description of the problem:
Creating a contig with AraTha Salome map has zero rate, because the cM column of the map is all zeros even though the rates are positive.

In [118]: cat ~/.cache/stdpopsim/genetic_maps/AraTha/SalomeAveraged_TAIR7/arab_chr4_map_loess.txt
Chromosome Position(bp) Rate(cM/Mb) Map(cM)
chr4 208650 10.379124 0
chr4 434712 7.015134 0
chr4 945976 5.078784 0
chr4 1512987 8.739854 0
chr4 1782389 2.500659 0
...

How to reproduce the problem:

In [115]: species = stdpopsim.get_species("AraTha")

In [116]: contig = species.get_contig("4", genetic_map="SalomeAveraged_TAIR7")

In [117]: contig.recombination_map.rate
Out[117]: 
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.])

Suggested fixes:
The Map(cM) column needs to be populated. But more generally, we need to make sure all maps are properly formatted, and test for that.

Now in 1.0, msprime.RateMap.read_hapmap() by default uses the Map column, it looks like. The rate column can be used by specifying rate_col=2. One possible test could be to load a map based on rate_col and then load it again based on map_col and ensure that they are equivalent.

E.g.

In [127]: species = stdpopsim.get_species("HomSap")

In [128]: m1 = msprime.RateMap.read_hapmap("~/.cache/stdpopsim/genetic_maps/HomSap/HapMapII_GRCh37/genetic_map_GRCh37_chr22.txt", rate_col=2)

In [129]: m2 = msprime.RateMap.read_hapmap("~/.cache/stdpopsim/genetic_maps/HomSap/HapMapII_GRCh37/genetic_map_GRCh37_chr22.txt", map_col=3)

In [131]: m1.rate
Out[131]: 
array([0.000000e+00, 8.096992e-08, 8.131520e-08, ..., 6.439380e-09,
       6.487070e-09, 6.552390e-09])

In [132]: m2.rate
Out[132]: 
array([0.00000000e+00, 8.09677419e-08, 8.13220676e-08, ...,
       6.43979058e-09, 6.48666233e-09, 6.55155642e-09])

In [133]: m1.total_mass
Out[133]: 0.7410956239614878

In [134]: m2.total_mass
Out[134]: 0.74109562

In [141]: np.allclose(m1.rate, m2.rate)
Out[141]: True

@apragsdale apragsdale added the bug Something isn't working label Apr 10, 2021
@andrewkern
Copy link
Member

did something move in the rate map format here? some of the maps don't even have that fourth column of data, e.g., HomSap/DeCodeSexAveraged_GRCh36

@jeromekelleher
Copy link
Member

Good spot, thanks @apragsdale. I think our effort would be best spent in converting the maps into zarr, and doing the checking up front, at import time. See #852 for initial progress. HapMap is just a mess (why, oh why oh why would you allow two ways of saying the same thing?), and it's super, super slow.

@grahamgower
Copy link
Member

Now in 1.0, msprime.RateMap.read_hapmap() by default uses the Map column, it looks like. The rate column can be used by specifying rate_col=2.

I noticed some problem that may be the same as this issue, and made the change you suggest here in #592. Can you try again with the tip of the main branch? Here's what I see:

>>> import stdpopsim
>>> species = stdpopsim.get_species("AraTha")
>>> contig = species.get_contig("4", genetic_map="SalomeAveraged_TAIR7")
>>> contig.recombination_map.rate
array([          nan, 1.0379124e-07, 7.0151340e-08, 5.0787840e-08,
       8.7398540e-08, 2.5006590e-08, 1.7177200e-09, 2.6797600e-09,
       5.6738710e-08, 3.3156900e-08, 0.0000000e+00, 0.0000000e+00,
       1.1800820e-08, 2.2438780e-08, 7.8200960e-08, 8.0121670e-08,
       7.6937590e-08, 6.4977360e-08, 6.6370430e-08, 6.8539320e-08,
       5.8319910e-08, 5.8240740e-08, 4.1347960e-08, 3.2404150e-08,
       3.5142890e-08, 4.2664890e-08, 4.7095890e-08, 4.8082140e-08,
       4.5151300e-08, 4.1365400e-08, 3.4242260e-08, 4.4641040e-08,
       4.5720070e-08, 3.4355990e-08, 2.5936730e-08, 2.5052750e-08,
       1.9271120e-08, 2.5627460e-08, 3.0640580e-08, 3.2697590e-08,
       3.2484830e-08, 1.3125780e-08, 1.7708340e-08, 2.5028560e-08,
       3.1251320e-08, 5.3774620e-08, 4.1579840e-08, 3.7319190e-08,
       8.3851750e-08, 8.4583290e-08, 3.3992390e-08, 3.7049640e-08,
       2.7696990e-08, 0.0000000e+00])

@apragsdale
Copy link
Member Author

Sorry @graham, yeah I hadn't pulled in new changes for the last day or two. I get the same as what you see now. Converting maps into zarr and doing an audit of all maps then sounds good to me. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants