The selective regions to find candidate genes #23

QianghuiZhu · 2022-07-30T07:27:58Z

Dear all,
Hello! I know this question may already have been asked and answered, but I am stuck and deeply confused.
I am a student who recently started using SweeD for my research. It is really good software and very convenient to use.
However, I find that I cannot interpret the results easily.

Here are some of my command lines:

1st, get chr and grid file

I used a grid value that I thought would divide each chromosome into ~20 kb bins.

$ head -n 100 input.vcf | grep -f list-chr.txt | sed 's/##contig=<ID=//g; s/,length=/\t/g; s/>//g' | gawk '{printf "%s\t%d\n", $1,$2/20000+1}' > list-chr-gird.txt
$ head -n 3 list-chr-gird.txt
chr1 1011
chr2 1016
chr3 1024
chr1A 3569

2nd, RUN SweeD

$ while read -r chr grid; do
  echo "chr: ${chr}; grid: ${grid}"
  SweeD -name "input.${chr}.20kb" -input "input-${chr}.vcf" -grid "${grid}" -minsnps 200 -maf 0.05 -missing 0.1
done < list-chr-gird.txt

The above command lines all ran well, and the result files look OK. But when I extracted the top 1% of selective regions (based on StartPos and EndPos, referring to #10), I found that many regions overlap.
Here is part of my results (I added the chr name at the beginning):

$ cat results-top.txt
Chr Position Likelihood Alpha StartPos EndPos
chr1A 948409.7871 164.3214 1.314105e-06 8460 10074939
chr1A 1088402.3086 46.14001 1.777709e-06 8460 7837473
chr1A 59245295.5195 141.4263 2.429103e-07 9844380 71364650
chr1A 59265294.4512 164.2197 2.409952e-07 9474848 71364650

When I extracted the "chr StartPos EndPos" columns and ran "bedtools merge -i", I got a single region "chr1A 8460 71364650" of ~71 Mb, which is astonishing!
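The chaining effect can be reproduced without bedtools. The sketch below is an illustrative re-implementation of overlap merging in awk, not bedtools itself; the function name `merge_overlaps` is hypothetical, and it assumes input sorted by chromosome and then by start. Because the first interval ends at 10074939 and the third starts before that, all four rows collapse into one region.

```shell
# Hypothetical helper: merge overlapping or touching intervals.
# Input columns: chr start end (whitespace-separated, sorted by chr, then start).
merge_overlaps() {
  awk -v OFS='\t' '
    $1 == chr && $2 + 0 <= cur_end + 0 {    # overlaps the interval currently open
      if ($3 + 0 > cur_end + 0) cur_end = $3
      next
    }
    chr != "" { print chr, cur_start, cur_end }   # no overlap: flush the open interval
    { chr = $1; cur_start = $2; cur_end = $3 }    # open a new interval
    END { if (chr != "") print chr, cur_start, cur_end }
  ' "$@"
}

# The four StartPos/EndPos pairs listed in results-top.txt above
printf '%s\n' \
  'chr1A 8460 10074939' \
  'chr1A 8460 7837473' \
  'chr1A 9844380 71364650' \
  'chr1A 9474848 71364650' |
  sort -k1,1 -k2,2n | merge_overlaps
# prints: chr1A	8460	71364650
```

A single wide interval is enough to bridge two otherwise separate clusters, which is why merging on overlap alone can produce regions of tens of Mb.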

By the same procedure, although my reference genome is only ~1 Gb, I got ~400 Mb of selected regions, which seems impossible!
But according to the existing issue [https://github.com//issues/10] (#10),
the selected region is [Start, End]. Have I misunderstood that, or the grid values?
I also tried dividing each chr into ~50 kb bins (by "gawk '{printf "%s\t%d\n", $1,$2/50000+1}'") or even ~1 kb bins (by "gawk '{printf "%s\t%d\n", $1,$2/1000+1}'"), and got ~400 Mb and ~500 Mb of selected regions, respectively.

So, my questions are:

1. Is grid value = (chr length) / (window size)?
2. How should I choose a suitable grid value (based on LD, or something else)?
3. Why are the merged selected regions so long, and how should I choose suitable selected regions?

Best Wishes!

alachins · 2022-08-04T11:53:46Z

Hello Hui,

The grid is a parameter that lets us trade execution time against how thoroughly SweeD scans the data. The grid does not really divide the chromosome into bins; it defines a number of fixed positions to be evaluated. The algorithm then scans windows around each position to calculate the score, but these window sizes are not constant. If you use a very high grid value, you will notice many consecutive results with the same score. These can be merged, since they are all based on the same SNPs. It is not recommended to merge regions with different SweeD scores.

Best regards, Nikos A.
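The merging rule described in this reply (merge consecutive results only when they share the same score) could be sketched in awk as follows. This is a minimal illustration, not part of SweeD: the function name `merge_equal_scores` is hypothetical, and it assumes a report with a header line and the columns Chr, Position, Likelihood, Alpha, StartPos, EndPos in grid order, as in the listing from the question.

```shell
# Hypothetical helper: collapse runs of consecutive grid positions that share
# the same chromosome and the same Likelihood score.
# Output columns: chr, min StartPos, max EndPos, score, number of positions merged.
merge_equal_scores() {
  awk -v OFS='\t' '
    NR == 1 { next }                          # skip the header line
    $1 == chr && $3 == score {                # same chromosome and score: extend the run
      if ($5 + 0 < cur_start + 0) cur_start = $5
      if ($6 + 0 > cur_end + 0)   cur_end   = $6
      n++
      next
    }
    n > 0 { print chr, cur_start, cur_end, score, n }   # score changed: flush the run
    { chr = $1; score = $3; cur_start = $5; cur_end = $6; n = 1 }
    END { if (n > 0) print chr, cur_start, cur_end, score, n }
  ' "$@"
}
```

It would be run on the full report for a chromosome, before any top-1% filtering, for example `merge_equal_scores report-with-chr.txt` (a hypothetical file name), because filtering first removes rows and can make non-adjacent grid positions look consecutive.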

-- Nikolaos Alachiotis

QianghuiZhu · 2022-08-05T03:44:34Z

Thanks for your response.

Fortunately I asked; otherwise I would have kept misunderstanding this.
