SNP phylogeny

Overview

Teaching: 15 min
Exercises: 60 min
Questions
  • How to generate a phylogenetic tree from SNP data?

Objectives
  • Map reads against a reference genome

  • Extract single nucleotide polymorphisms

  • Establish a phylogenetic tree from the SNP data

In this episode we will try to pinpoint single nucleotide variants or single nucleotide polymorphism (SNPs) between our samples and the reference. The SNPs are determined by a process called read mapping in which they are aligned to the reference sequence.

Mapping

The identified SNPs will be used to compare the isolates to each other and to establish a phylogenetic tree.

Starting a new screen session

Again we will use a ‘screen session’ to do our calculation in the background. You can start a session and give it a descriptive name:

$ screen -S snp

This creates a session with the name ‘snp’

As you work, this session will stay active until you close this session. Even if you disconnect from your machine, the jobs you start in this session will run till completion.

SNP calling

SNIPPY is a pipeline that contains different tools to determine SNPs in sequencing reads against a reference genome. It takes forward and reverse reads of paired-end sequences and aligns them against a reference.

It is very fast but a single run of SNIPPY still takes about 10 to 12 minutes. We will therefore tell SNIPPY to run all the samples after each other. However this time we can not use a wildcard to do so. We will instead run all the samples in a loop.

$ cd ~/dc_workshop/data/trimmed_fastq/
$ for sample in ERR026473 ERR026474 ERR026478 ERR026481 ERR026482 ERR029206 ERR029207
>  do
>  snippy  --outdir ~/dc_workshop/results/snps/"${sample}" --ref ~/dc_workshop/data/GCF_000195955.2_ASM19595v2_genomic.fna --R1 "${sample}"_1.fastq.gz_trim.fastq --R2 "${sample}"_2.fastq.gz_trim.fastq
>  done

Detach session (process keeps running in background)

You can detach from a session by pressing control + a followed by d (for detach) on your keyboard.

You can now safely work on a different task or log out of your machine. The assembly will keep on running.

After the run is finished we can check the results.

$ head -n10 ~/dc_workshop/results/snps/ERR029207/snps.tab 
CHROM	POS	TYPE	REF	ALT	EVIDENCE	FTYPE	STRAND	NT_POS	AA_POS	EFFECT	LOCUS_TAG	GENE	PRODUCT
NC_000962.3	1849	snp	C	A	A:225 C:0								
NC_000962.3	1977	snp	A	G	G:130 A:0								
NC_000962.3	4013	snp	T	C	C:136 T:0								
NC_000962.3	7362	snp	G	C	C:141 G:0								
NC_000962.3	7585	snp	G	C	C:136 G:0								
NC_000962.3	9304	snp	G	A	A:128 G:0								
NC_000962.3	11820	snp	C	G	G:165 C:0								
NC_000962.3	11879	snp	A	G	G:107 A:0								
NC_000962.3	14785	snp	T	C	C:180 T:0	

This list gives us information on every SNP that was found by SNIPPY when compared to the reference genome. The first SNP is found at the position 1849 of the reference genome, and is a C in the H37Rv and an A in isolate ERR029207. There is a high confidence associated with this SNP: an A has been found 225 times in the sequencing reads and never a C at this position.

Challenge: How many SNPs were identified in each sample??

Find out how many SNPs were identified in the M. tuberculosis isolates when compared to H37Rv. Enter your solution in the table under the head ‘Number of SNPs’.

Hint: The .txt file in the SNIPPY output contains summary information

Solution

$ head .-n10 ~/dc_workshop/results/snps/ERR029207/snps.txt
 
DateTime	2018-01-09T13:34:32
ReadFiles	/home/dcuser/dc_workshop/data/ERR029207_1.fastq.gz /home/dcuser/dc_workshop/data/ERR029207_2.fastq.gz
Reference	/home/dcuser/dc_workshop/data/GCF_000195955.2_ASM19595v2_genomic.fna
ReferenceSize	4485121
Software	snippy 3.2-dev
Variant-COMPLEX	29
Variant-DEL	54
Variant-INS	37
Variant-MNP	7
Variant-SNP	1292

This *M. tuberculosis* isolate contains 1292 SNPs compared to H37Rv.

Repeat this for the other isolates.

Core SNPs

In order to compare the identified SNPs with each other we need to know if a certain position exists in all isolates. A core site can have the same nucleotide in every sample (monomorphic) or some samples can be different (polymorphic). SNIPPY will concatenate the core SNPs, i.e. ignoring sites that are monomorphic in all isolates and in the reference. Concatenation of the SNP sites reduces the size of the alignment considerably.

The ‘–noref’ argument tells SNIPPY to exclude the reference from the alignment.
The ‘–aformat’ argument determines the alignment output format. We need a phylip format as input for our next tool.

$ cd ~/dc_workshop/results/snps/
$ snippy-core --noref --aformat=phylip ERR026473 ERR026474 ERR026478 ERR026481 ERR026482 ERR029206 ERR029207

The last few lines look like this:

...
[08:20:35] Will write alignment statistics to core.txt
[08:20:35] Loading pre-masked/aligned sequences...
[08:20:35] 1/7	ERR026473 coverage 4243097/4411532 = 96.18%
[08:20:35] 2/7	ERR026474 coverage 4247216/4411532 = 96.28%
[08:20:35] 3/7	ERR026478 coverage 4262038/4411532 = 96.61%
[08:20:35] 4/7	ERR026481 coverage 4245636/4411532 = 96.24%
[08:20:35] 5/7	ERR026482 coverage 4240999/4411532 = 96.13%
[08:20:35] 6/7	ERR029206 coverage 4245210/4411532 = 96.23%
[08:20:35] 7/7	ERR029207 coverage 4256440/4411532 = 96.48%
[08:20:35] Patching variant sites into whole genome alignment...
[08:20:35] Constructing alignment object for core.full.aln
[08:20:36] Writing 'phylip' alignment to core.full.aln
[08:20:41] Writing core SNP table
[08:20:41] Found 395 core SNPs from 1733 variant sites.
[08:20:41] Saved SNP table: core.tab
[08:20:41] Constructing alignment object for core.aln
[08:20:41] Writing 'phylip' alignment to core.aln
[08:20:41] Done.

Our output in phylip format was written to ‘core.aln’. But let’s have a look at the results.

Discussion: What’s in the output of SNIPPY??

Have a look at the content of these three files with ‘cat’ or ‘head’.

‘core.aln’

‘core.full.aln’

‘core.nway.tab’

What is the difference between these files? Why is core.aln smaller? What is in core.aln?

Phylogenetic tree

Phylogenetic trees have been discussed during the lectures. We will here establish a phylogenetic tree from the file ‘core.aln’ with PhyML. PhyML is a phylogeny software based on the maximum-likelihood principle. Since the original publication, PhyML has been widely used and cited. There is a range of parameters that need to be chosen, such as nucleotide or amino-acid substitution model (see the web-version of PhyML but for simplicity we will run PhyML with standard parameters.

$ phyml -i core.aln
...
. Log likelihood of the current tree: -1807.238286.

. Compute fast branch supports on the most likely tree...

. Printing the most likely tree in file 'core.aln_phyml_tree.txt'...

. Time used 0h0m1s

oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

With this small data set, PhyML finishes very quickly. Let’s put the resulting files into a separate folder and let’s rename our resulting tree.


$ mkdir tree
$ mv core.aln_phyml_stats tree/
$ mv core.aln_phyml_tree tree/core_snps.newick

Let’s inspect our tree.

$ cd ~/dc_workshop/results/snps/
$ cd tree
$ head -n10 core_snps.newick
(ERR026481:0.00000001,ERR026482:0.00767108,((ERR026473:0.31329695,ERR026474:0.26681667)0.921000:0.15635534,(ERR026478:0.00000012,(ERR029207:0.00000001,ERR029206:0.00504197)0.000000:0.00000001)1.000000:0.58688282)1.000000:0.41758464);

This does not look much like a tree yet. The tree is written in a bracket annotation, the Newick format. In order to make sense of it we can better view this in a tree viewer. For this, we need to copy the core_snps.newick file to our own computer.

Visualization of phylogenetic trees

We will be using iTOL to view the phylogenetic tree established from the core SNPs. iTOL is a tree viewer which allows to display, annotate and management of phylogenetic and other trees, especially very large tree. First, download the core_snps.newick file to your own computer, then go to iTOL and upload your tree. Do play around with the different tree visualization options (‘MODE’) until you find a clear clustering.

Try to make three groups out of the 7 isolates.

Solution

3 groups:
..
..
..

Extra challenge: How does the clustering change when aligning the assembled contigs to the reference genome?

Align the assembled contigs to the reference genome with snippy. Produce a core genome alignment and a phylogenetic tree as done previously for the reads. How many SNPs were identified by SNIPPY when compared to the reference strain? More or less than for read alignment? Why?

$ for sample in ERR026473 ERR026474 ERR026478 ERR026481 ERR026482 ERR029206 ERR029207
do  
snippy --outdir ~/dc_workshop/results/snps/"${sample}"_contigs --ref ~/dc_workshop/data/GCF_000195955.2_ASM19595v2_genomic.fna --contigs ~/dc_workshop/results/assembly/"${sample}".fasta
done
head -n 10 ~/dc_workshop/results/snps/*contigs/snps.txt
...
snippy-core --noref --aformat=phylip ERR026473_contigs ERR026474_contigs ERR026478_contigs ERR026481_contigs ERR026482_contigs ERR029206_contigs ERR029207_contigs
phyml -i core.aln
mv core.aln_phyml_tree tree/core_snps_contigs.newick

Key Points

  • Single nucleotide polymorphisms can be identified by mapping reads to a reference genome

  • Parameters for the analysis have to be selected based on expected outcomes for this organism

  • Concatenation of SNPs helps to reduce analysis volume

  • Phylogenetic trees can be written with a bracket syntax in Newick format