Inspecting sequence graphs

Overview

Teaching: 10 min
Exercises: 50 min
Questions
  • Can we find out which scaffolds or contigs are connected?

Objectives
  • Find a 6 kb scaffold with high coverage

  • Figure out the copy number

  • Determine which scaffolds the copies of this scaffold are connected to

Introducing the sequence graph

The assembly lecture introduced the concept of a sequence graph. You can view this graph with the tool Bandage (https://rrwick.github.io/Bandage/).

Examining the file containing the sequence graph

View the sequence graph using less:

$ cd ~/assemblies/ERR326690/
$ ls
$ less -S assembly_graph.fastg
>EDGE_1_length_4276_cov_92.8356:EDGE_743_length_70_cov_235.8';
AAATTAATTTGACTTTCCTGATAGAGTTGTTCACATCTTATTTCAATCTACTATATTTTA
TAGAACAGACTACTCTGAAAGTAGTTTCAGACCTCTTATGATTTCGTATCAGCCTGAATG
TCATCAAAAAAAGATAGCAGGCTTAAAAACCTGCTATCTCCTTCTATTTTTACAAAATCA

Can you work out how the sequence graph is encoded? Look at the headers:

$ grep ">" assembly_graph.fastg

>EDGE_1_length_4276_cov_92.8356:EDGE_743_length_70_cov_235.8';
>EDGE_1_length_4276_cov_92.8356':EDGE_2_length_62_cov_265.143;
>EDGE_2_length_62_cov_265.143:EDGE_567_length_57_cov_188',EDGE_820_length_34692_cov_92.9214';
>EDGE_2_length_62_cov_265.143':EDGE_1_length_4276_cov_92.8356,EDGE_424_length_250_cov_165.728';
>EDGE_3_length_73_cov_163.333:EDGE_572_length_1655_cov_89.6863,EDGE_573_length_10177_cov_90.9533;
>EDGE_3_length_73_cov_163.333':EDGE_283_length_8100_cov_90.431',EDGE_895_length_70_cov_83.8';
>EDGE_4_length_221_cov_0.801205;
>EDGE_4_length_221_cov_0.801205';
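
The header layout can be made explicit by splitting it into columns. The following is a minimal sketch, assuming every header follows the pattern seen above: EDGE_<number>_length_<bp>_cov_<coverage>, an optional trailing ' marking the reverse complement, then optionally a colon and a comma-separated list of the edges that follow it. The function name fastg_headers is a hypothetical helper, not part of Bandage or of the assembler.

```shell
# Hypothetical helper: print one tab-separated row per FASTG header:
# edge number, strand (+ or -), length, coverage, following edges ("-" if none).
# Usage: fastg_headers assembly_graph.fastg | head
fastg_headers() {
  grep '^>' "$1" | tr -d '>;' | awk -F':' -v q="'" '
    {
      split($1, f, "_")                      # EDGE, <number>, length, <bp>, cov, <coverage>
      cov = f[6]
      strand = "+"
      if (substr(cov, length(cov)) == q) {   # trailing quote marks the reverse complement
        strand = "-"
        cov = substr(cov, 1, length(cov) - 1)
      }
      succ = (NF > 1) ? $2 : "-"             # everything after the colon
      printf "%s\t%s\t%s\t%s\t%s\n", f[2], strand, f[4], cov, succ
    }'
}
```

Every edge appears twice, once per strand, so each contig yields a + row and a - row.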

Visualizing the sequence graph

Download the sequence graph of your assembly and open it in Bandage. Click on Draw graph. What does this remind you of? Click on a few large contigs. What is the average coverage?
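
The average coverage can also be estimated on the command line as a cross-check of what you read off in Bandage. This is a rough sketch, assuming the header layout shown earlier and weighting each edge by its length so that short, high-coverage edges do not dominate. mean_coverage is a hypothetical helper name; the value it prints is only as meaningful as the cov values in the headers.

```shell
# Hypothetical helper: length-weighted mean of the cov values in the headers.
# Reverse-complement entries (names ending in a quote) are skipped so that
# each edge is counted once.
# Usage: mean_coverage assembly_graph.fastg
mean_coverage() {
  grep '^>' "$1" | tr -d '>;' | cut -d: -f1 | grep -v "'\$" |
    awk -F'_' '
      { bases += $4; weighted += $4 * $6 }   # $4 = length, $6 = coverage
      END { if (bases > 0) printf "%.1f\n", weighted / bases }'
}
```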

Selecting a contig with high coverage

Now we will select a contig with roughly 3x the average coverage. Under Scope, select "Depth range". Enter 200 as the minimum and 1000 as the maximum depth (or other values, around 3x the average contig depth). Press Draw graph. Which contigs are now displayed? Why do these have a higher coverage?
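
The same depth filter can be approximated directly on the FASTG headers. This is a sketch under the assumption that the cov value in each header corresponds to the depth Bandage displays. edges_in_depth_range is a hypothetical helper name, and 200 and 1000 are the example thresholds from the text.

```shell
# Hypothetical helper: list edges whose coverage lies between min and max.
# Output columns: edge number, length, coverage; longest edge first.
# Usage: edges_in_depth_range assembly_graph.fastg 200 1000
edges_in_depth_range() {
  grep '^>' "$1" | tr -d '>;' | cut -d: -f1 | grep -v "'\$" |
    awk -F'_' -v min="$2" -v max="$3" '
      $6 >= min && $6 <= max { print $2 "\t" $4 "\t" $6 }' |
    sort -k2,2nr
}
```

Sorting by length puts the largest high-coverage contig on the first line, which is the one examined in the next step.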

Investigating a contig with high coverage using Bandage and BLAST

Click on the largest contig, which should be around 5-6 kb. What do you think this contig is? In the “Output” menu, at the top, the sequence of the selected contig can be copied to the clipboard. Figure out what the contig codes for using BLASTN https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome. Also record the contig number.
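
If copying from Bandage is inconvenient, the sequence can also be pulled out of the FASTG file and pasted into BLASTN. This is a minimal sketch, assuming the header layout shown earlier. edge_sequence is a hypothetical helper name, and the edge number you pass is the contig number you recorded.

```shell
# Hypothetical helper: print the forward-strand sequence of one edge as FASTA.
# Usage: edge_sequence assembly_graph.fastg <edge number> > contig.fasta
edge_sequence() {
  awk -v id="$2" -v q="'" '
    /^>/ {
      split($0, f, "_")                      # f[2] holds the edge number
      name = $0
      sub(/[:;].*/, "", name)                # drop the list of following edges
      # keep only the requested edge, and only its forward-strand entry
      keep = (f[2] == id && substr(name, length(name)) != q)
      if (keep) print ">EDGE_" id
      next
    }
    keep { print }                           # sequence lines of the selected edge
  ' "$1"
}
```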

Where are these sequences located?

Although the sequence is a separate contig in the fasta file, we can figure out which contigs it is connected to. We could, for instance, inspect assembly_graph.fastg in a text editor. Alternatively, we select the scope “Around nodes” and fill in the number of our selected contig. Next, we fill in 4 for the distance and press Draw graph. What do you think this does? Try to figure out which contigs the contig of interest is connected to. Name another situation in which an assembly graph could be useful (hint: plasmids).
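
The text-file route can be sketched with grep and awk instead of a text editor. The example assumes the header layout shown earlier, where the edges listed after the colon are the ones that follow the named edge. edge_links is a hypothetical helper name; + and - stand for the forward strand and the reverse complement (the trailing ' in the file).

```shell
# Hypothetical helper: list the direct connections of one edge, one per line,
# for both of its strands.
# Usage: edge_links assembly_graph.fastg <edge number>
edge_links() {
  grep "^>EDGE_${2}_" "$1" | tr -d '>;' | awk -F':' -v q="'" '
    # Turn a full edge name into "<number><strand>", for example "2-"
    function label(e,   p, s) {
      split(e, p, "_")
      s = (substr(e, length(e)) == q) ? "-" : "+"
      return p[2] s
    }
    {
      n = split($2, next_edges, ",")         # n is 0 when nothing follows
      for (i = 1; i <= n; i++) print label($1) " -> " label(next_edges[i])
    }'
}
```

A line such as "2- -> 1+" means that the reverse complement of edge 2 is followed by edge 1, so edge 1 sits on the other side of edge 2 compared with the edges listed for "2+". This only shows direct neighbours, whereas the distance setting in Bandage also draws nodes several steps away.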

Key Points

  • A genome assembly is fragmented because of repeats in the genome. The assembly graph displays possible connections between contigs.