The selective regions to find candidate genes #23
Comments
Hello Hui,
The grid parameter lets you trade execution time against how thoroughly
SweeD scans the data. The grid does not really divide the chromosome into
bins. It defines a number of fixed positions to be evaluated. The algorithm
then scans windows around each position to calculate the score, but these
window sizes are not constant. If you use a very high grid value, you will
notice many consecutive results with the same score. These can be merged,
since they are all based on the same SNPs. It is not recommended to merge
regions with different SweeD scores.
Best regards,
Nikos A.
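The merging rule described above can be sketched in shell. This is only an illustration, not part of SweeD: the function name merge_same_score is hypothetical, and it assumes the six-column layout shown later in this thread (Chr, Position, Likelihood, Alpha, StartPos, EndPos, with one header line) rather than SweeD's raw report format.

```shell
# Collapse runs of consecutive rows that share the same likelihood score.
# Assumed input (hypothetical layout taken from this thread):
#   Chr Position Likelihood Alpha StartPos EndPos, preceded by one header line.
# Output columns: Chr, first position, last position, score, rows in the run.
merge_same_score() {
  awk -v OFS='\t' '
    NR == 1 { next }                        # skip the header line
    $1 == chr && $3 == score {              # same chromosome, same score:
      last = $2; n++; next                  #   extend the current run
    }
    n > 0 { print chr, first, last, score, n }    # score changed: flush the run
    { chr = $1; first = $2; last = $2; score = $3; n = 1 }   # start a new run
    END { if (n > 0) print chr, first, last, score, n }
  ' "$@"
}
```

Each output line gives the chromosome, the first and last grid position of the run, the shared score, and the number of rows collapsed. Rows with different scores are never joined, in line with the recommendation above.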
On Sat, Jul 30, 2022 at 9:28 AM Hui ***@***.***> wrote:
Dear all,
Hello! I know that this question may already have been asked and answered,
but I am stuck and deeply confused.
I am a student who recently started using SweeD for my research. It is
really good software and very convenient to use.
However, I find that I cannot easily interpret the results.
------------------------------
Here are some of my command lines:
1st, get the chr and grid file
I used a grid value that I thought would divide each chr into ~20kb bins.
- $ head -n 100 input.vcf | grep -f list-chr.txt \
    | sed 's/##contig=<ID=//g; s/,length=/\t/g; s/>//g' \
    | gawk '{printf "%s\t%d\n", $1, $2/20000+1}' > list-chr-gird.txt
- $ head -n 3 list-chr-gird.txt
chr1 1011
chr2 1016
chr3 1024
chr1A 3569
2nd, RUN SweeD
$ while read i; do
    chr=$(echo ${i} | gawk '{print $1}')
    grid=$(echo ${i} | gawk '{print $2}')
    echo "chr: ${chr}; grid: ${grid}"
    SweeD -name input.${chr}.20kb -input input-${chr}.vcf -grid ${grid} -minsnps 200 -maf 0.05 -missing 0.1
  done < list-chr-gird.txt
------------------------------
The above command lines all ran well, and the result files also look OK.
But when I extracted the top 1% selective regions (based on StartPos and
EndPos, referring to #10), I found that many regions overlap.
Here is part of my results (I added the chr name at the beginning):
- $ cat results-top.txt
Chr    Position       Likelihood  Alpha         StartPos  EndPos
chr1A  948409.7871    164.3214    1.314105e-06  8460      10074939
chr1A  1088402.3086   46.14001    1.777709e-06  8460      7837473
chr1A  59245295.5195  141.4263    2.429103e-07  9844380   71364650
chr1A  59265294.4512  164.2197    2.409952e-07  9474848   71364650
When I extracted the "chr StartPos EndPos" columns and ran "bedtools merge -i",
I got a single region "chr1A 8460 71364650" of ~71Mb, which is surprisingly large!
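To see why the merge produces one huge region, the overlap logic can be sketched with sort and awk. This is a minimal stand-in for what bedtools merge does with its default settings (joining overlapping or book-ended intervals), not its actual implementation; merge_intervals is a hypothetical helper name.

```shell
# Merge overlapping or touching intervals given as "chr start end" rows.
merge_intervals() {
  sort -k1,1 -k2,2n "$@" | awk -v OFS='\t' '
    $1 == chr && $2 <= cur_end {            # overlaps or touches the open interval
      if ($3 > cur_end) cur_end = $3        #   extend it if this row reaches further
      next
    }
    chr != "" { print chr, start, cur_end } # gap found: emit the finished interval
    { chr = $1; start = $2; cur_end = $3 }  # open a new interval
    END { if (chr != "") print chr, start, cur_end }
  '
}

# The four StartPos/EndPos pairs from the rows above:
printf 'chr1A\t8460\t10074939\nchr1A\t8460\t7837473\nchr1A\t9844380\t71364650\nchr1A\t9474848\t71364650\n' \
  | merge_intervals
# prints one line: chr1A 8460 71364650 (tab-separated)
```

Because each reported [StartPos, EndPos] window is several Mb wide, every window overlaps the next, so the four rows chain into a single interval.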
------------------------------
In the same way, since my reference genome is ~1Gb, I got ~400Mb of selected
regions, which seems impossible!
But according to issue #10 (https://github.com//issues/10), the selected
region is [Start, End]. Or am I misunderstanding it, or the grid values?
I also tried dividing each chr into ~50kb bins (with "gawk '{printf
"%s\t%d\n", $1,$2/50000+1}'") or even ~1kb bins (with "gawk '{printf
"%s\t%d\n", $1,$2/1000+1}'"), and got ~400Mb and ~500Mb of selected
regions, respectively.
So, my questions are:
1. Is grid value = (chr length) / (window size)?
2. How should I choose a suitable grid value (based on LD or something else)?
3. Why are the merged selected regions so long, and how should I choose
suitable selected regions?
Best Wishes!
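On question 1, a quick back-of-the-envelope check may help. The sketch below assumes (this is an assumption, not something confirmed in this thread) that the grid points are spread evenly between the first and last SNP, so the spacing is roughly (last - first) / (grid - 1). The helper name grid_spacing is hypothetical, and the values 8460 and 71364650 are simply the smallest StartPos and largest EndPos in the excerpt above, used as stand-ins for the first and last SNP positions.

```shell
# Approximate distance between neighbouring grid positions.
# ASSUMPTION: grid points are evenly spaced between the first and last SNP.
# usage: grid_spacing FIRST_POS LAST_POS GRID   (requires GRID > 1)
grid_spacing() {
  awk -v first="$1" -v last="$2" -v grid="$3" \
    'BEGIN { printf "%.4f\n", (last - first) / (grid - 1) }'
}

# chr1A with the grid value 3569 from list-chr-gird.txt
grid_spacing 8460 71364650 3569
```

For comparison, the two adjacent positions reported for chr1A above, 59245295.5195 and 59265294.4512, are 19998.9317 apart, close to the intended 20 kb. So the grid value sets how densely positions are evaluated, while the width of each [StartPos, EndPos] window is chosen by the algorithm.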
--
Nikolaos Alachiotis
Thanks for your response. Fortunately, I asked. Or I'll always misunderstand