Pangenome analysis
Overview
Teaching: 10 min
Exercises: 50 min
Questions
How to determine a pangenome from a collection of isolate genome sequences?
Objectives
Interpret the output of a pangenome analysis.
Pangenome analysis
The microbial pangenome is the union of all genes found in the genomes of interest. The core genome is the intersection of the genes of these genomes; core genes are therefore present in all strains. The accessory genome (also: variable, flexible, dispensable genome) refers to genes not present in all strains of a species. These include genes present in two or more (but not all) strains as well as genes unique to a single strain. Acquired antibiotic resistance genes are typically part of the accessory genome.
Find more information on the pangenome here
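The definitions above map directly onto set operations. The following is a minimal Python sketch: the strain names and gene contents are made up for illustration, and a real analysis clusters gene sequences rather than comparing gene names.

```python
# Hypothetical gene content of three strains (illustrative, not real data).
genomes = {
    "strainA": {"dnaA", "recX", "pblB", "ermB"},
    "strainB": {"dnaA", "recX", "pblB"},
    "strainC": {"dnaA", "recX", "tetM"},
}

gene_sets = list(genomes.values())

# Pangenome: every gene seen in at least one strain (union).
pangenome = set.union(*gene_sets)

# Core genome: genes present in every strain (intersection).
core = set.intersection(*gene_sets)

# Accessory genome: pangenome genes missing from at least one strain.
accessory = pangenome - core

# Strain-specific genes: accessory genes found in exactly one strain.
unique = {g for g in accessory if sum(g in s for s in gene_sets) == 1}

print("pangenome:", sorted(pangenome))
print("core:     ", sorted(core))
print("accessory:", sorted(accessory))
print("unique:   ", sorted(unique))
```

In this toy example pblB is an accessory gene shared by two strains, while ermB and tetM are each unique to one strain.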
Extract the genes recX, pblB and dnaA from two genomes.
For this, we first need to find out whether the genes are present in the annotated files.
Challenge: How can I extract genes from the annotation files?
Hints:
grep -A 10 [gene] *gbk prints every line that matches the gene name, together with the ten lines that follow it
Solution
$ cd ~/annotation
$ grep -A 10 'recX' */*gbk | grep translation
Note for each genome whether the gene has been found. In this file, mark a gene as present (1) if it was found; mark it as 0 if it is absent or if its sequence differs.
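The bookkeeping above can also be scripted. The sketch below is one possible approach, not the lesson's method: it looks for a /gene="..." qualifier in GenBank text and fills a small presence table. The function name gene_present, the genome names and the shortened GenBank excerpts are hypothetical, and the optional _1, _2 suffix on duplicated gene names is an assumption about the annotation.

```python
import re

def gene_present(gbk_text, gene):
    """Return 1 if a /gene="<gene>" qualifier occurs in the GenBank text, else 0."""
    # Assumption: duplicated genes may carry a numeric suffix such as recX_1.
    pattern = r'/gene="' + re.escape(gene) + r'(_\d+)?"'
    return 1 if re.search(pattern, gbk_text) else 0

# Hypothetical, heavily shortened GenBank excerpts for two genomes.
genome1 = '''
     CDS             1..1362
                     /gene="dnaA"
                     /translation="MKE"
     CDS             2000..2780
                     /gene="recX"
                     /translation="MAI"
'''
genome2 = '''
     CDS             1..1362
                     /gene="dnaA"
                     /translation="MKE"
'''

genes = ["recX", "pblB", "dnaA"]
excerpts = {"genome1": genome1, "genome2": genome2}

# One row per genome, one 1/0 entry per gene.
table = {
    name: {gene: gene_present(text, gene) for gene in genes}
    for name, text in excerpts.items()
}

for name, row in table.items():
    print(name, row)
```

This only checks presence by name. Deciding whether two copies of a gene differ would additionally require comparing their /translation sequences.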
Pangenome analysis
Comparing every gene of every genome to every other gene in the dataset is an enormous task, and takes a long time even if automated. Roary is a pipeline to determine genes of the core and pangenome. It takes a few short-cuts such as clustering instead of pair-wise alignment and can perform this task in a relatively short time frame. An excellent step-by-step tutorial can be found here
First we need to find the files generated by Prokka. We are using the annotations prepared beforehand for all genomes. How many are there?
$ ls ~student100/annotations/*.gff
Then we can go to the home directory and start Roary, which writes its results to the folder orthology. The -p 2 option tells Roary to use two CPU cores. The -s option tells it not to split paralogs, that is, to disregard genetic context, as we are looking at draft assemblies with different contigs. The -r option generates some interesting plots in R. The -f option tells it where to store the output files (the folder orthology). The > and 2> redirect the standard screen output and the error output to files for later viewing. Roary needs quite some CPU power.
$ cd ~
$ roary ~student100/annotations/*.gff -s -p 2 -r -f orthology >roary.stdout.log 2> roary.error.log
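To make the role of each option explicit, the same call can be assembled as an argument list. This is only a sketch: build_roary_command is a hypothetical helper, the GFF file names are placeholders, and the options are placed before the input files. The block builds and prints the command but does not run Roary.

```python
import shlex

def build_roary_command(gff_files, threads=2, outdir="orthology"):
    """Assemble the Roary call from the lesson as an argument list."""
    return [
        "roary",
        "-s",                # do not split paralogs
        "-p", str(threads),  # number of threads
        "-r",                # create R plots
        "-f", outdir,        # output directory
        *gff_files,          # one annotated GFF file per genome
    ]

# Placeholder file names; in the lesson they come from ~student100/annotations/*.gff
cmd = build_roary_command(["isolate1.gff", "isolate2.gff"])
print(shlex.join(cmd))
```

Such a list could be handed to subprocess.run, with its stdout and stderr arguments pointing at open log files to mirror the > and 2> redirections.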
Discussion: Open or closed pangenome?
After Roary finishes, have a look at the summary file (summary_statistics.txt in the orthology folder). How many core and pangenome genes are there? Visit the definition of an open and closed pangenome here. Does S. pneumoniae have a closed or an open pangenome? View the file Rplots.pdf or the file conserved_vs_total_genes.png in your browser for a different visualization of the core and pangenome.
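The counts in the summary file can also be read programmatically. The sketch below assumes that the file is tab-separated with a category label in the first field and the gene count in the last field; the numbers are invented for illustration and are not results from this dataset.

```python
def parse_summary(text):
    """Map each category label to its gene count."""
    counts = {}
    for line in text.splitlines():
        fields = line.split("\t")
        # Skip blank lines and lines whose last field is not a number.
        if len(fields) < 2 or not fields[-1].strip().isdigit():
            continue
        counts[fields[0].strip()] = int(fields[-1])
    return counts

# Invented example in the assumed layout (label, strain range, count).
example = (
    "Core genes\t(99% <= strains <= 100%)\t1500\n"
    "Soft core genes\t(95% <= strains < 99%)\t50\n"
    "Shell genes\t(15% <= strains < 95%)\t450\n"
    "Cloud genes\t(0% <= strains < 15%)\t1000\n"
    "Total genes\t(0% <= strains <= 100%)\t3000\n"
)

counts = parse_summary(example)
core_fraction = counts["Core genes"] / counts["Total genes"]
print(counts)
print("core fraction:", core_fraction)
```

A core fraction that keeps shrinking as more genomes are added, while the total keeps growing, is what one would expect from an open pangenome.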
Visualization
Some genes are present in all genomes, some are present in some and absent in others. Data on presence and absence of genes was collected in a matrix called gene_presence_absence.csv. Clustering of this information was used to build a tree (available as accessory_binary_genes.fa.newick). As a next step we are going to visualize this clustering.
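As an illustration of what the matrix contains, the sketch below turns a file in the layout of gene_presence_absence.csv into a 1/0 table. It assumes that the first 14 columns hold per-gene metadata and that every further column is one isolate whose cell is empty when the gene is absent. The isolate names (apart from OXC141), the locus tags and the helper toy_row are hypothetical.

```python
import csv
import io

# Assumed metadata columns preceding the per-isolate columns.
META = [
    "Gene", "Non-unique Gene name", "Annotation", "No. isolates",
    "No. sequences", "Avg sequences per isolate", "Genome Fragment",
    "Order within Fragment", "Accessory Fragment",
    "Accessory Order with Fragment", "QC", "Min group size nuc",
    "Max group size nuc", "Avg group size nuc",
]

def presence_matrix(handle, n_meta=len(META)):
    """Return (isolates, {gene: {isolate: 0 or 1}}) from an open CSV file."""
    reader = csv.reader(handle)
    header = next(reader)
    isolates = header[n_meta:]
    matrix = {}
    for row in reader:
        # A non-empty cell (a locus tag) means the gene is present.
        matrix[row[0]] = {
            iso: int(bool(cell.strip()))
            for iso, cell in zip(isolates, row[n_meta:])
        }
    return isolates, matrix

def toy_row(gene, cells):
    """Build one toy CSV row with empty metadata fields."""
    return [gene] + [""] * (len(META) - 1) + cells

# Build a tiny in-memory file instead of reading the real one.
buf = io.StringIO(newline="")
writer = csv.writer(buf)
writer.writerow(META + ["OXC141", "isolate1", "isolate2"])
writer.writerow(toy_row("dnaA", ["tag1", "tag2", "tag3"]))
writer.writerow(toy_row("pblB", ["tag4", "", "tag5"]))
buf.seek(0)

isolates, matrix = presence_matrix(buf)
core = [gene for gene, row in matrix.items() if all(row.values())]
print(isolates)
print(matrix)
print("core genes:", core)
```

For the real file, the in-memory buffer would be replaced by open("gene_presence_absence.csv", newline="").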
Challenge: Which isolate(s) is/are related to our reference genome OXC141 based on gene presence/absence?
Copy accessory_binary_genes.fa.newick and gene_presence_absence.csv to your own computer. Open phandango in Chrome, drop the file accessory_binary_genes.fa.newick and then the file gene_presence_absence.csv.
FileZilla, scp or the web browser can be used to copy the files from the machine to your own computer.
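Relatedness by gene presence/absence can also be quantified outside the viewer. The sketch below ranks isolates by Jaccard distance to the reference. This is a simple stand-in chosen for illustration, not the method Roary uses to build accessory_binary_genes.fa.newick, and the isolates and gene sets other than the name OXC141 are invented.

```python
def jaccard_distance(a, b):
    """1 - |A and B| / |A or B| for two sets of gene names."""
    union = a | b
    if not union:
        return 0.0  # two empty gene sets are treated as identical
    return 1 - len(a & b) / len(union)

# Invented gene content per isolate (illustrative only).
presence = {
    "OXC141":   {"dnaA", "recX", "pblB", "geneX"},
    "isolate1": {"dnaA", "recX", "pblB"},
    "isolate2": {"dnaA", "recX", "geneY", "geneZ"},
}

reference = "OXC141"

# Closest isolates first.
ranked = sorted(
    (iso for iso in presence if iso != reference),
    key=lambda iso: jaccard_distance(presence[reference], presence[iso]),
)

for iso in ranked:
    d = jaccard_distance(presence[reference], presence[iso])
    print(f"{iso}\t{d:.2f}")
```

Isolates with a small distance share most of their gene content with the reference and would be expected to sit close to it in the tree.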
Solution
3 strains .. .. ..
Key Points
The microbial pangenome is the union of genes in genomes of interest.
The microbial core genome is the intersection of genes shared by genomes of interest.
Roary is a pipeline to determine genes of the pangenome.