id: mothur-miseq-sop
name: Galaxy Tour
description: >-
In this tour we will perform the Standard Operating Procedure (SOP) for MiSeq
data
title_default: mothur-miseq-sop
steps:
- title: 16S Microbial Analysis with Mothur
content: >-
In this tour we will perform the Standard Operating Procedure (SOP) for
MiSeq data.
backdrop: true
- title: 16S Microbial Analysis with Mothur
content: >-
In this tutorial we use 16S rRNA data, but similar pipelines can be used
for WGS data.<br><br> The 16S rRNA gene has several properties that make
it ideally suited for our purposes: <ol>
<li>Present in all living organisms</li>
<li>Single copy (no recombination)</li>
<li>Highly conserved + highly variable regions</li>
<li>Huge reference databases</li>
</ol> The highly conserved regions make it easy to target the gene across
different organisms, while the highly variable regions allow us to
distinguish between different species.
backdrop: true
- title: Understanding our input data
content: >-
In this tutorial we are interested in understanding the effect of normal
variation in the gut microbiome on host health. To that end, fresh feces
from mice were collected on a daily basis for 365 days post weaning.
During the first 150 days post weaning (dpw), nothing was done to our mice
except allow them to eat, get fat, and be merry. We were curious whether
the rapid change in weight observed during the first 10 dpw affected the
stability of the microbiome compared to the microbiome observed between
days 140 and 150. We will address this question in this tutorial using a
combination of OTU, phylotype, and phylogenetic methods. <br><br>To make
this tutorial easier to execute, we are providing only part of the data -
you are given the fastq files for one animal at 10 time points (5 early and
5 late). In order to assess the error rate of our analysis pipeline and
experimental setup, we additionally resequenced a mock community composed
of genomic DNA from 21 bacterial strains.
backdrop: true
- title: Dataset details
content: >-
Because of the large size of the original dataset (3.9 GB) you are given
20 of the 362 pairs of fastq files. For example, you will see two files:
F3D0_S188_L001_R1_001.fastq, and F3D0_S188_L001_R2_001.fastq These two
files correspond to Female 3 on Day 0 (F3D0) (i.e. the day of weaning).
The first file (and all those with R1 in the name) corresponds to the
forward reads, while the second (and all those with R2 in the name)
corresponds to the reverse reads. <br>These sequences are 250 bp and
overlap in the V4 region of the 16S rRNA gene; this region is about 253 bp
long. Looking at the datasets, you will see 22 fastq files, representing
10 time points from Female 3 and 1 mock community. <br>You will also see
HMP_MOCK.v35.fasta, which contains, in fasta format, the sequences used in
the mock community.
backdrop: true
- title: Step 1. History options
element: '#history-options-button'
content: >-
We will start the analyses by creating a new history. Click on this button
and then "Create New". Give it a name.
placement: left
backdrop: false
- title: Step 2. Import Sample Data
element: '#shared .dropdown a[href$="/library/index"]'
content: >-
The data for this course may be available from a shared library in Galaxy
(ask your instructor). If this is not the case, you can upload it
yourself.
placement: right
- title: Step 3. Load data from shared library
element: 'li a[href$="/library/list"]'
content: >-
In the dropdown menu click on Data Libraries. Navigate to the shared data
library; you should find 20 pairs of fastq files: 19 from the mice, and
one pair from the mock community.
placement: right
- title: Import Sample Data.
element: '#tool-panel-upload-button .fa.fa-upload'
content: >-
Otherwise you can upload the data directly from your computer. Obtain the
data from <a
href="https://zenodo.org/record/165147#.Wa_FXsgjHIU">zenodo</a>. Unzip it
on your computer and upload the files with the help of the Upload manager.
placement: right
- title: Step 4. Import Sample Data
element: '#shared .dropdown a[href$="/library/index"]'
content: >-
Go back to the data library and import the following reference datasets,
or download them from Zenodo (reference_data.zip) and upload them to your
history:<ol>
<li>silva.v4.fasta</li>
<li>HMP_MOCK.v35.fasta</li>
<li>trainset9_032012.pds.fasta</li>
<li>trainset9_032012.pds.tax</li>
</ol>
placement: right
- title: Step 5. Dataset collections
content: >-
Now that’s a lot of files to manage. Luckily Galaxy can make life a bit
easier by allowing us to create dataset collections. This enables us to
easily run tools on multiple datasets at once. Let’s create a collection
now. <br><br>Since we have paired-end data, each sample consists of two
separate fastq files, one containing the forward reads, and one containing
the reverse reads. We can recognize the pairing from the file names, which
will differ only by _R1 or _R2 in the filename. We can tell Galaxy about
this paired naming convention, so that our tools will know which files
belong together.
backdrop: true
- title: Step 6. Organizing our data into a collection
element: >-
#current-history-panel .controls .actions a[href$="javascript:void(0);"]
.fa.fa-check-square-o
content: >-
Click on the <b>checkmark icon</b> at the top of your history. Select all the
fastq files (40 in total), then click on <b>For all selected</b> and
select <b>Build List of Dataset Pairs</b> from the dropdown menu.
placement: left
- title: Step 7. Organizing our data into a collection
content: >-
In the next dialog window you can create the list of pairs. By default
Galaxy will look for pairs of files that differ only by a <b>_1 and
_2</b> part in their names. In our case however, these should be <b>_R1
and _R2</b>. Please change these values accordingly. You should now see a
list of pairs suggested by Galaxy. <br><br>Examine the pairings; if they
look good, you can click on <b>auto-pair</b> to create the suggested
pairs.<br><br>The middle segment is the name for each pair. You can
change these names by clicking on them. These names will be used as sample
names in the downstream analysis so always make sure they are informative.
backdrop: false
- title: Step 8. Organizing our data into a collection
content: >-
Once you are happy with your pairings, enter a name for your new
collection at the bottom right of the screen. Then click the <b>Create
List</b> button. A new dataset collection item will now appear in your
history.
backdrop: false
- title: Reducing sequencing and PCR errors
content: >-
The first thing we want to do is combine our forward and reverse reads for
each sample. This is done using the <b>make.contigs</b> command, which
requires the paired collection as input. This command will extract the
sequence and quality score data from your fastq files, create the reverse
complement of the reverse read and then join the reads into contigs. Then
we will combine all samples into a single fasta file, remembering which
reads came from which samples using a group file.
backdrop: true
- title: Reducing sequencing and PCR errors
content: >-
We have a very simple algorithm to do this. First, we align the pairs of
sequences. Next, we look across the alignment and identify any positions
where the two reads disagree. If one sequence has a base and the other has
a gap, the quality score of the base must be over 25 to be considered
real. If both sequences have a base at that position, then we require one
of the bases to have a quality score 6 or more points better than the
other. If it is less than 6 points better, then we set the consensus base
to an N. <br><br>In this experiment we used paired-end sequencing; this
means sequencing was done from both ends of each fragment, resulting
in an overlap in the middle. We will now combine these pairs of reads into
contigs.
backdrop: true
- title: Step 9. Combine forward and reverse reads into contigs
element: '#tool-search-query'
content: Search for Make.contigs tool
placement: right
textinsert: Make.contigs
- title: Step 10. Combine forward and reverse reads into contigs
element: '#tool-search'
content: Click on the "Make.contigs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_make_contigs%2Fmothur_make_contigs%2F1.36.1.0"]
- title: Step 11. Combine forward and reverse reads into contigs
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“Way to provide files” to the "Multiple pairs - Combo mode"</li>
<li>“Fastq pairs” to the collection you just created</li>
<li>Leave all other parameters to the default settings</li>
</ul>
position: left
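# For reference: outside Galaxy, this merge step corresponds roughly to the mothur
# command below (the "file" listing the fastq pairs is an illustrative placeholder):
#   make.contigs(file=stability.files)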
- title: Step 12. Combine forward and reverse reads into contigs
element: '.history-right-panel .list-items > *:first'
content: >-
Observe the output. This step merged the forward and reverse reads into
contigs for each pair, and then combined the results into a single fasta
file. To retain information about which reads originated from which
samples, it also created a group file.<br>
The first column contains the read name, and the second column contains
the sample name.
position: left
- title: Summarize data
content: >-
Before starting to work on the quality of the imported data, let's get a
feel for it.
backdrop: true
- title: Step 13. Summarize data
element: '#tool-search-query'
content: Search for Summary.seqs tool
placement: right
textinsert: Summary.seqs
- title: Step 14. Summarize data
element: '#tool-search'
content: Click on the "Summary.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_summary_seqs%2Fmothur_summary_seqs%2F1.36.1.0"]
- title: Step 15. Summarize data
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” parameter to the <b>trim.contigs.fasta</b> file created by the make.contigs tool</li>
<li>We do not need to supply a names or count file</li>
</ul>
position: left
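# Roughly equivalent mothur CLI call for this summary step (the input name is a placeholder):
#   summary.seqs(fasta=input.trim.contigs.fasta)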
- title: Step 16. Summarize data
element: '.history-right-panel .list-items > *:first'
content: >-
Observe the output. The summary output files give information per read.
The logfile outputs also contain some summary statistics.<br>
This tells us that we have 152,360 sequences that for the most part vary
between 248 and 253 bases. Interestingly, the longest read in the dataset
is 502 bp. Be suspicious of this. Recall that the reads are supposed to be
251 bp each. This read clearly didn’t assemble well (or at all). Also,
note that at least 2.5% of our sequences had some ambiguous base calls.
<br>We’ll take care of these issues in the next step when we run
<b>screen.seqs.</b>
position: left
- title: Step 17. Filter reads based on quality and length
element: '#tool-search-query'
content: >-
Search for Screen.seqs tool. It will remove any sequences with ambiguous
bases and anything longer than 275 bp.
placement: right
textinsert: Screen.seqs
- title: Step 18. Filter reads based on quality and length
element: '#tool-search'
content: Click on the "Screen.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_screen_seqs%2Fmothur_screen_seqs%2F1.36.1.0"]
- title: Step 19. Filter reads based on quality and length
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” to the <b>trim.contigs.fasta</b> file created by the make.contigs tool</li>
<li>“group” the group file created in the make.contigs step</li>
<li>“maxlength” parameter to <b>275</b></li>
<li>“maxambig” parameter to <b>0</b></li>
</ul>
position: left
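# Roughly equivalent mothur CLI call for this screening step (input names are placeholders):
#   screen.seqs(fasta=input.trim.contigs.fasta, group=input.contigs.groups, maxambig=0, maxlength=275)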
- title: Question. Filter reads based on quality and length
content: |-
Inspect the output <ul>
<li>How many reads were removed in this screening step? (Hint: run the summary.seqs tool again)</li>
</ul>
backdrop: true
- title: Optimize files for computation
content: >-
Because we are sequencing many of the same organisms, we anticipate that
many of our sequences are duplicates of each other. Because it’s
computationally wasteful to align the same thing a bazillion times, we’ll
unique our sequences using the <b>unique.seqs</b> command.
backdrop: true
- title: Step 20. Remove duplicate sequences
element: '#tool-search-query'
content: Search for Unique.seqs tool
placement: right
textinsert: Unique.seqs
- title: Step 21. Remove duplicate sequences
element: '#tool-search'
content: Click on the "Unique.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_unique_seqs%2Fmothur_unique_seqs%2F1.36.1.0"]
- title: Step 22. Remove duplicate sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” to the <b>good.fasta</b> output from Screen.seqs</li>
</ul>
position: left
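# Roughly equivalent mothur CLI call for deduplication (the input name is a placeholder):
#   unique.seqs(fasta=input.good.fasta)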
- title: Question. Remove duplicate sequences
content: |-
Inspect the output <ul>
<li>How many sequences were unique?</li>
<li>How many duplicates were removed?</li>
</ul>
backdrop: true
- title: Remove duplicate sequences
content: >-
Inspect the output. This tool produced two files: a fasta file
containing only the unique sequences, and a names file. The names file
consists of two columns: the first contains the sequence names for each of
the unique sequences, and the second column contains all other sequence
names that are identical to the representative sequence in the first
column. <br>To reduce file sizes further and streamline analysis, we can
now summarize the data in a count table.
backdrop: true
- title: Step 23. Generate count table
element: '#tool-search-query'
content: Search for Count.seqs tool
placement: right
textinsert: Count.seqs
- title: Step 24. Generate count table
element: '#tool-search'
content: Click on the "Count.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_count_seqs%2Fmothur_count_seqs%2F1.36.1.0"]
- title: Step 25. Generate count table
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“name” to the <b>names</b> output from Unique.seqs</li>
<li>“Use a Group file” to <b>yes</b></li>
<li>“group” to the group file we created using the Screen.seqs tool</li>
</ul>
position: left
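# Roughly equivalent mothur CLI call for building the count table (input names are placeholders):
#   count.seqs(name=input.good.names, group=input.good.groups)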
- title: Generate count table
content: >-
Inspect the output. The first column contains the read names of the
representative sequence, and the subsequent columns contain the number of
duplicates of this sequence observed in each sample.
backdrop: true
- title: Align sequences
content: >-
We are now ready to align our sequences to the reference. This is an
important step for improving the clustering of your OTUs.
backdrop: true
- title: Step 26. Align sequences
element: '#tool-search-query'
content: Search for Align.seqs tool
placement: right
textinsert: Align.seqs
- title: Step 27. Align sequences
element: '#tool-search'
content: Click on the "Align.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_align_seqs%2Fmothur_align_seqs%2F1.36.1.0"]
- title: Step 28. Align sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” to the fasta output from Unique.seqs</li>
<li>“reference” to the <b>silva.v4.fasta</b> reference file</li>
</ul>
position: left
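# Roughly equivalent mothur CLI call for the alignment step (the input name is a placeholder):
#   align.seqs(fasta=input.good.unique.fasta, reference=silva.v4.fasta)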
- title: Step 29. Align sequences
element: '#tool-search-query'
content: Search for Summary.seqs tool
placement: right
textinsert: Summary.seqs
- title: Step 30. Align sequences
element: '#tool-search'
content: Click on the "Summary.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_summary_seqs%2Fmothur_summary_seqs%2F1.36.1.0"]
- title: Step 31. Align sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” parameter to the aligned output from step 28</li>
<li>“count” parameter to <b>count_table</b> output from Count.seqs</li>
</ul>
position: left
- title: Step 32. Align sequences
element: '.history-right-panel .list-items > *:first'
content: >-
Observe the output. So what does this mean? You’ll see that the bulk of
the sequences start at position 1968 and end at position 11550. Some
sequences start at position 1250 or 1982 and end at 10693 or 13400. These
deviants from the mode positions are likely due to an insertion or
deletion at the terminal ends of the alignments. Sometimes you’ll see
sequences that start and end at the same position indicating a very poor
alignment, which is generally due to non-specific amplification.
position: left
- title: More Data Cleaning
content: >-
To make sure that everything overlaps the same region we’ll re-run
screen.seqs to get sequences that start at or before position 1968 and end
at or after position 11550. We’ll also set the maximum homopolymer length
to 8 since there’s nothing in the database with a stretch of 9 or more of
the same base in a row (this could also have been done in the first
execution of screen.seqs).
backdrop: true
- title: Step 33. Remove poorly aligned sequences
element: '#tool-search-query'
content: Search for Screen.seqs tool
placement: right
textinsert: Screen.seqs
- title: Step 34. Remove poorly aligned sequences
element: '#tool-search'
content: Click on the "Screen.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_screen_seqs%2Fmothur_screen_seqs%2F1.36.1.0"]
- title: Step 35. Remove poorly aligned sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” to the aligned fasta file</li>
<li>“start” to <b>1968</b></li>
<li>“end” to <b>11550</b></li>
<li>“maxhomop” to <b>8</b></li>
<li>“count” to our most recent count_table</li>
</ul>
position: left
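# Roughly equivalent mothur CLI call for this second screening step (input names are placeholders):
#   screen.seqs(fasta=input.align, count=input.count_table, start=1968, end=11550, maxhomop=8)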
- title: Question. Remove poorly aligned sequences
content: |-
Inspect the output <ul>
<li>How many sequences were removed in this step?</li>
</ul>
backdrop: true
- title: Filter sequences
content: >-
Now we know our sequences overlap the same alignment coordinates, we want
to make sure they only overlap that region.<br>So we’ll filter the
sequences to remove the overhangs at both ends. Since we’ve done
paired-end sequencing, this shouldn’t be much of an issue. In addition,
there are many columns in the alignment that only contain gap characters
(i.e. “.”). These can be pulled out without losing any information. We’ll
do all this with <b>filter.seqs</b>.
backdrop: true
- title: Step 36. Filter sequences
element: '#tool-search-query'
content: Search for Filter.seqs tool
placement: right
textinsert: Filter.seqs
- title: Step 37. Filter sequences
element: '#tool-search'
content: Click on the "Filter.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_filter_seqs%2Fmothur_filter_seqs%2F1.36.1.0"]
- title: Step 38. Filter sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>"fasta” to good.fasta output from Sreen.seqs</li>
<li>“vertical” to <b>Yes</b></li>
<li>“trump” to <b>.</b></li>
</ul>
position: left
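# Roughly equivalent mothur CLI call for the filtering step (the input name is a placeholder):
#   filter.seqs(fasta=input.good.align, vertical=T, trump=.)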
- title: Step 39. Filter sequences
element: '.history-right-panel .list-items > *:first'
content: >-
Observe the output. Our initial alignment was 13425 columns wide, and
we were able to remove 13049 terminal gap characters using trump=. and
vertical gap characters using vertical=yes. The final alignment length is
376 columns. Because we’ve perhaps created some redundancy across our
sequences by trimming the ends, we can re-run unique.seqs.
position: left
- title: Step 40. Re-obtain unique sequences
element: '#tool-search-query'
content: Search for Unique.seqs tool
placement: right
textinsert: Unique.seqs
- title: Step 41. Re-obtain unique sequences
element: '#tool-search'
content: Click on the "Unique.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_unique_seqs%2Fmothur_unique_seqs%2F1.36.1.0"]
- title: Step 42. Re-obtain unique sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” to the <b>filtered fasta</b> output from Filter.seqs</li>
<li>“name file or count table” to the count table from the last Screen.seqs</li>
</ul>
position: left
- title: Question. Re-obtain unique sequences
content: |-
Inspect the output <ul>
<li>How many duplicate sequences did our filter step produce?</li>
</ul>
backdrop: true
- title: Pre-clustering
content: >-
The next thing we want to do to further de-noise our sequences, is to
pre-cluster the sequences using the pre.cluster command, allowing for up
to 2 differences between sequences.<br><br>This command will split the
sequences by group and then sort them by abundance and go from most
abundant to least and identify sequences that differ by no more than 2
nucleotides from one another. If this is the case, then they get merged. We
generally recommend allowing 1 difference for every 100 basepairs of
sequence.
backdrop: true
- title: Step 43. Perform preliminary clustering of sequences
element: '#tool-search-query'
content: Search for Pre.cluster tool
placement: right
textinsert: Pre.cluster
- title: Step 44. Perform preliminary clustering of sequences
element: '#tool-search'
content: Click on the "Pre.cluster" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_pre_cluster%2Fmothur_pre_cluster%2F1.36.1.0"]
- title: Step 45. Perform preliminary clustering of sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” to the fasta output from the last Unique.seqs run</li>
<li>“name file or count table” to the count table from the last Unique.seqs</li>
<li>“diffs” to <b>2</b></li>
</ul>
position: left
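# Roughly equivalent mothur CLI call for pre-clustering (input names are placeholders):
#   pre.cluster(fasta=input.unique.fasta, count=input.count_table, diffs=2)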
- title: Question. Perform preliminary clustering of sequences
content: |-
Inspect the output <ul>
<li>How many unique sequences are we left with after this clustering of highly similar sequences?</li>
</ul>
backdrop: true
- title: Chimera Removal
content: >-
At this point we have removed as much sequencing error as we can, and it
is time to turn our attention to removing sequencing artefacts known as
chimeras. <br>
backdrop: true
- title: What is a chimeric sequence?
content: The combination of multiple sequences during PCR to create a hybrid
backdrop: true
- title: Chimera Removal
content: >-
We’ll do this chimera removal using the UCHIME algorithm that is called
within Mothur, using the <b>chimera.uchime</b> command. This command will
split the data by sample and check for chimeras.
<br>Our preferred way of doing this is to use the abundant sequences as
our reference. In addition, if a sequence is flagged as chimeric in one
sample, the default (dereplicate=No) is to remove it from all samples. Our
experience suggests that this is a bit aggressive since we’ve seen rare
sequences get flagged as chimeric when they’re the most abundant sequence
in another sample.
backdrop: true
- title: Step 46. Remove chimeric sequences
element: '#tool-search-query'
content: Search for Chimera.uchime tool
placement: right
textinsert: Chimera.uchime
- title: Step 47. Remove chimeric sequences
element: '#tool-search'
content: Click on the "Chimera.uchime" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_chimera_uchime%2Fmothur_chimera_uchime%2F1.36.1.0"]
- title: Step 48. Remove chimeric sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” to the fasta output from Pre.cluster</li>
<li>“Select Reference Template from” to <b>Self</b></li>
<li>“count” to the count table from the last Pre.cluster</li>
<li>“dereplicate” to Yes</li>
</ul>
position: left
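# Roughly equivalent mothur CLI call for chimera detection (input names are placeholders):
#   chimera.uchime(fasta=input.precluster.fasta, count=input.precluster.count_table, dereplicate=t)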
- title: Step 49. Remove chimeric sequences
element: '.history-right-panel .list-items > *:first'
content: >-
Running chimera.uchime with the count file will remove the chimeric
sequences from the count table, but we still need to remove those
sequences from the fasta file as well.
position: left
- title: Step 50. Remove chimeric sequences
element: '#tool-search-query'
content: Search for Remove.seqs tool
placement: right
textinsert: Remove.seqs
- title: Step 51. Remove chimeric sequences
element: '#tool-search'
content: Click on the "Remove.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_remove_seqs%2Fmothur_remove_seqs%2F1.36.1.0"]
- title: Step 52. Remove chimeric sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“accnos” to the uchime.accnos file from Chimera.uchime</li>
<li>“fasta” to the fasta output from Pre.cluster</li>
<li>“count” to the count table from Chimera.uchime</li>
</ul>
position: left
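# Roughly equivalent mothur CLI call for removing the flagged chimeras (input names are placeholders):
#   remove.seqs(fasta=input.precluster.fasta, count=input.uchime.count_table, accnos=input.uchime.accnos)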
- title: Question. Remove chimeric sequences
content: |-
Inspect the output <ul>
<li>How many sequences were flagged as chimeric? What is the percentage? (Hint: summary.seqs)</li>
</ul>
backdrop: true
- title: Removal of non-bacterial sequences 1
content: >-
As a final quality control step, we need to see if there are any
“undesirables” in our dataset. Sometimes when we pick a primer set they
will amplify other stuff that survives to this point in the pipeline, such
as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and
mitochondria. There’s also just the random stuff that we want to get rid
of.<br>Now you may say, “But wait, I want that stuff”. Fine. But the
primers we use are only supposed to amplify members of the Bacteria, and
if they’re hitting Eukaryota or Archaea, then it is a mistake. Also,
mitochondria and chloroplasts have no functional role in a microbial
community.
backdrop: true
- title: Step 53. Remove undesired sequences
element: '#tool-search-query'
content: Search for Classify.seqs tool
placement: right
textinsert: Classify.seqs
- title: Step 54. Remove undesired sequences
element: '#tool-search'
content: Click on the "Classify.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_classify_seqs%2Fmothur_classify_seqs%2F1.36.1.0"]
- title: Step 55. Remove undesired sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” to the fasta output from Remove.seqs</li>
<li>“reference” to trainset9_032012.pds.fasta from your history</li>
<li>“taxonomy” to trainset9_032012.pds.tax from your history</li>
<li>“count” to the count table file from Remove.seqs</li>
<li>“cutoff” to <b>80</b></li>
</ul>
position: left
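# Roughly equivalent mothur CLI call for classification (input names are placeholders):
#   classify.seqs(fasta=input.pick.fasta, count=input.pick.count_table, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)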
- title: Step 56. Remove undesired sequences
element: '.history-right-panel .list-items > *:first'
content: >-
Have a look at the taxonomy output. You will see that every read now has a
classification. <br><br>Now that everything is classified we want to
remove our undesirables. We do this with the remove.lineage command.
position: left
- title: Step 57. Remove undesired sequences
element: '#tool-search-query'
content: Search for Remove.lineage tool
placement: right
textinsert: Remove.lineage
- title: Step 58. Remove undesired sequences
element: '#tool-search'
content: Click on the "Remove.lineage" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_remove_lineage%2Fmothur_remove_lineage%2F1.36.1.0"]
- title: Step 59. Remove undesired sequences
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“taxonomy” to the taxonomy output from Classify.seqs</li>
<li>“taxon” to <b>Chloroplast-Mitochondria-unknown-Archaea-Eukaryota</b> in the text box under Manually select taxons for filtering</li>
<li>“fasta” to the fasta output from Remove.seqs</li>
<li>“count” to the count table from Remove.seqs</li>
</ul>
position: left
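# Roughly equivalent mothur CLI call for removing unwanted lineages (input names are placeholders):
#   remove.lineage(fasta=input.fasta, count=input.count_table, taxonomy=input.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)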
- title: Questions
content: |-
Inspect the output <ul>
<li>How many unique (representative) sequences were removed in this step?</li>
<li>How many sequences in total?</li>
</ul>
backdrop: true
- title: Assessing error rates based on our mock community
content: >-
Measuring the error rate of your sequences is something you can only do if
you have co-sequenced a mock community, that is, a sample of which you
know the exact composition. This is something we include for every 95
samples we sequence. You should too because it will help you gauge your
error rates and allow you to see how well your curation is going, and
whether something is wrong with your sequencing setup.
backdrop: true
- title: Mock community
content: >-
A defined mixture of microbial cells and/or viruses or nucleic acid
molecules created in vitro to simulate the composition of a microbiome
sample or the nucleic acid isolated therefrom.
<br>Our mock community is composed of genomic DNA from 21 bacterial
strains. So in a perfect world, this is exactly what we would expect the
analysis to produce as a result.
<br>First, let’s extract the sequences belonging to our mock samples from
our data
backdrop: true
- title: Step 59. Extract mock sample from our dataset
element: '#tool-search-query'
content: Search for Get.groups tool
placement: right
textinsert: Get.groups
- title: Step 60. Extract mock sample from our dataset
element: '#tool-search'
content: Click on the "Get.groups" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_get_groups%2Fmothur_get_groups%2F1.36.1.0"]
- title: Step 61. Extract mock sample from our dataset
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“group file or count table” to the count table from Remove.lineage</li>
<li>“groups” to <b>Mock</b></li>
<li>“fasta” to fasta output from Remove.lineage</li>
</ul>
position: left
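# Roughly equivalent mothur CLI call for extracting the Mock sample (input names are placeholders):
#   get.groups(fasta=input.pick.fasta, count=input.pick.count_table, groups=Mock)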
- title: Step 62. Extract mock sample from our dataset
element: '.history-right-panel .list-items > *:first'
content: >-
Have a look at the output. It should tell you that we had 67
unique sequences and a total of 4,060 sequences in our Mock sample.
<br>We can now use the seq.error command to measure the error rates based
on our mock reference. Here we align the reads from our mock sample back
to their known sequences, to see how many fail to match.
position: left
- title: Step 63. Assess error rates based on a mock community
element: '#tool-search-query'
content: Search for Seq.error tool
placement: right
textinsert: Seq.error
- title: Step 64. Assess error rates based on a mock community
element: '#tool-search'
content: Click on the "Seq.error" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_seq_error%2Fmothur_seq_error%2F1.36.1.0"]
- title: Step 65. Assess error rates based on a mock community
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” to the fasta from Get.groups</li>
<li>“reference” to <b>HMP_MOCK.v35.fasta</b> file from your history</li>
<li>“count” to the count table from Get.groups</li>
</ul>
position: left
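# Roughly equivalent mothur CLI call for estimating the error rate (input names are placeholders):
#   seq.error(fasta=mock.fasta, count=mock.count_table, reference=HMP_MOCK.v35.fasta)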
- title: Step 66. Assess error rates based on a mock community
element: '.history-right-panel .list-items > *:first'
content: Inspect the output. The error rate should be 0.0065%!
position: left
- title: Cluster mock sequences into OTUs
content: >-
In 16S metagenomics approaches, OTUs are clusters of similar sequence
variants of the 16S rDNA marker gene sequence. Each of these clusters is
intended to represent a taxonomic unit of a bacterial species or genus
depending on the sequence similarity threshold. Typically, OTU clusters are
defined by a 97% identity threshold of the 16S gene sequence variants at
genus level. 98% or 99% identity is suggested for species separation.
backdrop: true
- title: Cluster mock sequences into OTUs
content: First we calculate the pairwise distances between our sequences
backdrop: true
- title: Step 67. Cluster mock sequences into OTUs
element: '#tool-search-query'
content: Search for Dist.seqs tool
placement: right
textinsert: Dist.seqs
- title: Step 68. Cluster mock sequences into OTUs
element: '#tool-search'
content: Click on the "Dist.seqs" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_dist_seqs%2Fmothur_dist_seqs%2F1.36.1.0"]
- title: Step 69. Cluster mock sequences into OTUs
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“fasta” to the fasta from Get.groups</li>
<li>“cutoff” to <b>0.20</b></li>
</ul>
position: left
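# Roughly equivalent mothur CLI call for the distance calculation (the input name is a placeholder):
#   dist.seqs(fasta=mock.fasta, cutoff=0.20)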
- title: Cluster mock sequences into OTUs
content: Next we group sequences into OTUs
backdrop: true
- title: Step 70. Cluster mock sequences into OTUs
element: '#tool-search-query'
content: Search for Cluster tool
placement: right
textinsert: Cluster
- title: Step 71. Cluster mock sequences into OTUs
element: '#tool-search'
content: Click on the "Cluster" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_cluster%2Fmothur_cluster%2F1.36.1.0"]
- title: Step 72. Cluster mock sequences into OTUs
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“column” to the dist output from Dist.seqs</li>
<li>“count” to the count table from Get.groups</li>
</ul>
position: left
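# Roughly equivalent mothur CLI call for clustering the mock sequences (input names are placeholders):
#   cluster(column=mock.dist, count=mock.count_table)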
- title: Cluster mock sequences into OTUs
content: >-
Now we make a shared file that summarizes all our data into one handy
table
backdrop: true
- title: Step 70. Cluster mock sequences into OTUs
element: '#tool-search-query'
content: Search for Make.shared tool
placement: right
textinsert: Make.shared
- title: Step 71. Cluster mock sequences into OTUs
element: '#tool-search'
content: Click on the "Make.shared" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_make_shared%2Fmothur_make_shared%2F1.36.1.0"]
- title: Step 72. Cluster mock sequences into OTUs
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“list” to the OTU list from Cluster</li>
<li>“count” to the count table from Get.groups</li>
<li>“label” to <b>0.03</b> (this indicates we are interested in the clustering at a 97% identity threshold)</li>
</ul>
position: left
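# Roughly equivalent mothur CLI call for building the shared table (input names are placeholders):
#   make.shared(list=mock.list, count=mock.count_table, label=0.03)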
- title: Cluster mock sequences into OTUs
content: And now we generate intra-sample rarefaction curves
backdrop: true
- title: Step 73. Cluster mock sequences into OTUs
element: '#tool-search-query'
content: Search for Rarefaction.single tool
placement: right
textinsert: Rarefaction.single
- title: Step 74. Cluster mock sequences into OTUs
element: '#tool-search'
content: Click on the "Rarefaction.single" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_rarefaction_shared%2Fmothur_rarefaction_shared%2F1.36.1.0"]
- title: Step 75. Cluster mock sequences into OTUs
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“shared” to the shared file from Make.shared</li>
</ul>
position: left
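# Roughly equivalent mothur CLI call for the rarefaction curves (the input name is a placeholder):
#   rarefaction.single(shared=mock.shared)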
- title: Question
content: <ul><li>How many OTUs were identified in our mock community?</li></ul>
backdrop: true
- title: Step 76. Cluster mock sequences into OTUs
element: >-
#current-history-panel .controls .actions a[href$="javascript:void(0);"]
.fa.fa-check-square-o
content: >-
Open the rarefaction output (dataset named sobs inside the rarefaction
curves output collection). You’ll see that for 4060 sequences, we’d have
34 OTUs from the Mock community. This number of course includes some
stealthy chimeras that escaped our detection methods. If we used 3000
sequences, we would have about 31 OTUs. In a perfect world with no
chimeras and no sequencing errors, we’d have 21 OTUs. This is not a
perfect world. But this is pretty darn good!
placement: left
- title: Rarefaction
content: >-
To estimate the fraction of species sequenced, rarefaction curves are
typically used. A rarefaction curve plots the number of species as a
function of the number of individuals sampled. The curve usually begins
with a steep slope, which at some point begins to flatten as fewer species
are being discovered per sample: the gentler the slope, the smaller the
contribution of additional sampling to the total number of operational
taxonomic units or OTUs.<br>Now that we have assessed our error rates we are ready
for some real analysis.
backdrop: true
- title: Removing Mock sample
content: >-
We’re almost to the point where you can have some fun with your data (I’m
already having fun, aren’t you?). Next, we will assign sequences to OTUs,
but first we should remove the Mock sample from our dataset; it has
served its purpose by allowing us to estimate our error rate, but in
subsequent steps we only want to use our real samples.
backdrop: true
- title: Step 77. Remove Mock community from our dataset
element: '#tool-search-query'
content: Search for Remove.groups tool
placement: right
textinsert: Remove.groups
- title: Step 78. Remove Mock community from our dataset
element: '#tool-search'
content: Click on the "Remove.groups" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_remove_groups%2Fmothur_remove_groups%2F1.36.1.0"]
- title: Step 79. Remove Mock community from our dataset
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“Select input type” to <b>fasta , name, taxonomy, or list with a group file or count table</b></li>
<li>“count table”, “fasta”, and “taxonomy” to the respective outputs from Remove.lineage</li>
<li>“groups” to <b>Mock</b></li>
</ul>
position: left
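# Roughly equivalent mothur CLI call for dropping the Mock sample (input names are placeholders):
#   remove.groups(fasta=input.fasta, count=input.count_table, taxonomy=input.taxonomy, groups=Mock)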
- title: Clustering sequences into OTUs
content: >-
Now, we have a couple of options for clustering sequences into OTUs. For a
small dataset like this, we could do the traditional approach using
dist.seqs and cluster as we did with the Mock sample.<br>
The alternative is to use the cluster.split command. In this approach, we
use the taxonomic information to split the sequences into bins and then
cluster within each bin. The Schloss lab have published results showing
that if you split at the level of Order or Family, and cluster to a 0.03
cutoff, you’ll get just as good a clustering as you would with the
“traditional” approach.<br>
The advantage of the cluster.split approach is that it should be faster,
use less memory, and can be run on multiple processors. In an ideal world
we would prefer the traditional route because “Trad is rad”, but we also
think that kind of humor is funny…. In this command we use taxlevel=4,
which corresponds to the level of Order. This is the approach that we
generally use in the Schloss lab.
backdrop: true
- title: Step 80. Cluster our data into OTUs
element: '#tool-search-query'
content: Search for Cluster.split tool
placement: right
textinsert: Cluster.split
- title: Step 81. Cluster our data into OTUs
element: '#tool-search'
content: Click on the "Cluster.split" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_cluster_split%2Fmothur_cluster_split%2F1.36.1.0"]
- title: Step 82. Cluster our data into OTUs
element: '#tool-search'
content: |-
Execute the tool with <ul>
<li>“Split by” to <b>Classification using fasta</b></li>
<li>“fasta” to the fasta output from Remove.groups</li>
<li>“taxonomy” to the taxonomy output from Remove.groups</li>
<li>“taxlevel” to <b>4</b></li>
<li>“count” to the count table output from Remove.groups</li>
<li>“cutoff” to <b>0.15</b></li>
</ul>
position: left
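# Roughly equivalent mothur CLI call for the split clustering (input names are placeholders):
#   cluster.split(fasta=input.fasta, count=input.count_table, taxonomy=input.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)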
- title: Cluster our data into OTUs
content: >-
Next we want to know how many sequences are in each OTU from each group
and we can do this using the Make.shared command. Here we tell Mothur that
we’re really only interested in the 0.03 cutoff level.
backdrop: true
- title: Step 83. Cluster our data into OTUs
element: '#tool-search-query'
content: Search for Make.shared tool
placement: right
textinsert: Make.shared
- title: Step 84. Cluster our data into OTUs
element: '#tool-search'
content: Click on the "Make.shared" tool to open it
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_make_shared%2Fmothur_make_shared%2F1.36.1.0"]
- title: Step 85. Cluster our data into OTUs
element: '#tool-search'
content: |-
Execute the tool with <ul>