Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The selective regions to find candidate genes #23

Open
QianghuiZhu opened this issue Jul 30, 2022 · 2 comments
Open

The selective regions to find candidate genes #23

QianghuiZhu opened this issue Jul 30, 2022 · 2 comments

Comments

@QianghuiZhu
Copy link

Dear all,
Hello! I know that this question may have been asked and answered. But I am in trouble, and it confuses me deeply.
I am a student that use SweeD for my research recently. It is really a good software that is very convenient to use.
However, I find that I cannot interpret the results easily.


Here are some of my commond lines:

1st, get chr and grid file

I used a grid value that I thought may divide each chr into ~20kb bins.

  • $ head -n 100 input.vcf | grep -f list-chr.txt | sed 's/##contig=<ID=//g; s/,length=/\t/g; s/>//g' | gawk '{printf "%s\t%d\n", $1,$2/20000+1}' > list-chr-gird.txt
  • $ head -n 3 list-chr-gird.txt
    chr1 1011
    chr2 1016
    chr3 1024
    chr1A 3569

2nd, RUN SweeD

$ while read i; do 
  chr=$(echo  ${i} | gawk '{print $1}')
  grid=$(echo  ${i} | gawk '{print $2}')
  echo "chr: ${chr}; grid: ${grid}"
  SweeD -name input.${chr}.20kb -input input-${chr}.vcf -grid ${grid} -minsnps 200 -maf 0.05 -missing 0.1
done < list-chr-gird.txt

The above commond lines all run well! And the results files is also OK. But since I extract top 1% selective regions (Based on StartPos and EndPos, reffering to: #10), I found that many regions are overlapping.
Here are part of my results (I add the chr name at the beginning):

  • $ cat results-top.txt
    Chr Position Likelihood Alpha StartPos EndPos
    chr1A 948409.7871 164.3214 1.314105e-06 8460 10074939
    chr1A 1088402.3086 46.14001 1.777709e-06 8460 7837473
    chr1A 59245295.5195 141.4263 2.429103e-07 9844380 71364650
    chr1A 59265294.4512 164.2197 2.409952e-07 9474848 71364650

when I extract "chr StartPos EndPos" INFO, and used "bedtools merge -i", I got a region "chr1A 8460 71364650", ~71Mb, it's amazing!


Like this, since my ref genom is ~1Gb, I got ~400MB regions as selected regions, it may be impossible!
But in present issues: [https://github.com//issues/10] (#10),
selected regions is [Start, End], or I misunderstand it or grid values?
I also tried to divide each chr into ~50kb (by "gawk '{printf "%s\t%d\n", $1,$2/50000+1}'") or even ~1kb (by "gawk '{printf "%s\t%d\n", $1,$2/1000+1}'"), and got ~400Mb and ~500Mb regions as selected regions, separately.

So, my questions are:

  1. Is: grid value = (chr length) / (window size)?
  2. How to choose suitable grid value (based on LD or others?)?
  3. Why merged selected regions are so long, and how to choose suitable selected regions?

Best Wishes!

@alachins
Copy link
Owner

alachins commented Aug 4, 2022 via email

@QianghuiZhu
Copy link
Author

Thanks for your response.

Fortunately, I asked. Or I'll always misunderstand

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants