Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IUPAC sites were not removed #6

Open
smallfishcui opened this issue Feb 12, 2024 · 5 comments
Open

IUPAC sites were not removed #6

smallfishcui opened this issue Feb 12, 2024 · 5 comments

Comments

@smallfishcui
Copy link

Hi,

I was using the ascbias.py script to remove the invariant sites. It seems to not work for me. I first converted my vcf file to phylip format using the script https://raw.githubusercontent.com/edgardomortiz/vcf2phylip/master/vcf2phylip.py, and then using this script to remove invariant sites for raxml phylogeny inference. But the phylip file after processing still contains IUPAC sites, and raxml was conplaining there are invariant sites in the input file that I shouldn't user the model GTR+ASC_LEWIS.

Do you know what could be the reason?

thanks,
Cui

@btmartin721
Copy link
Owner

Hi Cui,

I will look into this next week. I apologize for the delay, but I am swamped this week. Just wanted to give you a heads-up that it might be a bit.

If I happen to forget to follow up on this, feel free to send me a reminder message.

-Bradley

@smallfishcui
Copy link
Author

Thanks a lot Bradley! That's okay, please take your time!
I have never used python, so I tried to ask help from chatgpt. Eventually I changed the code like this
line 49, changed to bases = ["A","G","C","T"]
line 68, changed to if any(value for value in column_unique if value not in bases):
The resulted file seems to work for me. But I have no idea whether this works correctly. Please do get back to me once you get a moment!

thanks,
Cui

@pvdam3
Copy link

pvdam3 commented Mar 26, 2024

I just wanted to add here that I also experienced the same thing; the invariable sites (E.g. all samples having A nucleotides and one R, meaning A/G), are not removed, and raxml-ng is complaining about a large number of remaining invariable sites in the MSA.
The suggestion by @smallfishcui removes all columns with any IUPAC code completely, which is not what I would like to do.

@pvdam3
Copy link

pvdam3 commented Mar 26, 2024

I added a section to the filter_invariants function that looks for the above situation and also removes columns with multiple IUPAC code bases. Maybe it can be of use to you @btmartin721.

def filter_invariants(dframe):
    initial_dframe_shape=dframe.shape
    bases = ["A","G","C","T","R","Y","S","W","K","M","B","D","H","V"]
    IUPAC_bases = ["R","Y","S","W","K","M","B","D","H","V","N"]
    IUPAC_dict = {'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'K': 'GT', 'M': 'AC', 'N': 'ACGT', 'S': 'CG', 'R': 'AG', 'W': 'AT', 'V': 'ACG', 'Y': 'CT', 'X': 'ACGT'}


    # collections::Counter library
    stamatakis_cnt = Counter()
    fels_cnt = 0

    invariant_lst = list()
    # Loop through each dataframe column
    for i in dframe.columns:

        # Gets unique values at each column and saves to list
        column_unique = [x.upper() for x in dframe[i].unique().tolist()]

        # Intersects column_unique with bases list
        intersect = [value for value in bases if value in column_unique]

        # If column contains only ambigous or IUPAC characters
        # Save the column index for dropping later
        if not any(value for value in bases if value in column_unique):
            invariant_lst.append(i)

        # If site is invariant (only A, C, G, or T); ignores N's and "-"
        if len(intersect) == 1:
            # Uses collections::Counter to get Stamatakis counts
            stamatakis_cnt[intersect[0]] += 1

            # Counts number of invariant sites for Felsenstein count
            fels_cnt += 1

            # Saves column indexes to list
            invariant_lst.append(i)

        # Assesses whether the column is considered invariate because a pattern like AAAAAARAAAAA (R being A or G)
        # Save the column index for dropping later
        IUPAC_bases_this_column = [x for x in intersect if x in IUPAC_bases]
        column_unique_without_IUPAC = [x for x in intersect if x not in IUPAC_bases_this_column]

        if len(IUPAC_bases_this_column) == 1 and \
            len(column_unique_without_IUPAC) == 1 and \
            column_unique_without_IUPAC[0] in list(IUPAC_dict[IUPAC_bases_this_column[0]]):  
            ## if 1 IUPAC base AND 1 non-IUPAC base AND the non-IUPAC base is part of the IUPAC base set stored in IUPAC_dict
            print(column_unique_without_IUPAC[0])
            invariant_lst.append(i)
        if len(IUPAC_bases_this_column) > 1: 
            ## if there are more than one IUPAC bases in this column
            invariant_lst.append(i)

    # Drops invariant sites from dataframe
    invariant_set=set(invariant_lst)
    dframe.drop(invariant_set, axis=1, inplace=True)
    print("## dimensions of the initial alignment (samples, nr of positions):",initial_dframe_shape)
    print("##",len(invariant_set), "invariable sites removed")
    print("## dimensions of the remaining alignment (samples, nr of positions):", dframe.shape)
    return stamatakis_cnt, fels_cnt, dframe

@btmartin721
Copy link
Owner

Awesome thanks! Sorry for the trouble. This was one of the first python scripts I ever wrote and I haven't touched it in a long time. @pvdam3 do you think you could do a pull request and I'll merge it with my original code so that this issue could be fixed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants