Sequence assembly

Overview

Teaching: min
Exercises: 60 min
Questions
  • How can the information in the sequencing reads be reduced?

  • What are the different methods for assembly?

Objectives
  • Understand differences between assembly methods

  • Assemble the short reads

  • Understand the effect of different k-mer sizes

Sequence assembly means the alignment and merging of reads in order to reconstruct the original sequence. Assembling a genome from short sequencing reads can take from a few minutes to several hours per genome.

Sequence assembly

The assembler we will run is SPAdes. SPAdes builds its final assembly from several k-mer sizes. It selects the list of k-mer sizes automatically from the maximum read length of the input data, and the assembly at each k-mer size contributes to the final result. If desired, a list of k-mer sizes can be specified with the -k flag, which overrides the automatic selection.
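To make the term concrete, the following minimal sketch lists all k-mers of a short sequence. The helper name kmers and the toy sequence are made up for illustration. It only shows what a k-mer is, not how SPAdes processes reads internally.

```shell
# Hypothetical helper (not part of SPAdes): print every k-mer of a sequence,
# one per line. $1 = sequence, $2 = k.
kmers() {
  awk -v seq="$1" -v k="$2" 'BEGIN {
    # slide a window of width k along the sequence, one base at a time
    for (i = 1; i + k - 1 <= length(seq); i++) print substr(seq, i, k)
  }'
}

# Toy 8-base sequence (an assumption, not real data)
kmers ACGTACGT 3
kmers ACGTACGT 5
```

A sequence of length L contains L - k + 1 k-mers, so the toy sequence yields six 3-mers but only four 5-mers.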

Assembly

Because the assembly of each genome might take a while, we will assemble two isolates per person. The reads have already been trimmed and error corrected.

Preparation

$ cd ~
$ mkdir assembly
$ cd ~/reads

To run SPAdes we will use the spades.py command with the --only-assembler option, as the reads have already been error corrected, -o for the output folder, -1 for the path to the forward reads and -2 for the path to the reverse reads. We will be using the standard k-mer sizes of 21, 33 and 55 bp. We can now start the loop with the assemblies. The following is an example; replace ERR026473 and ERR026474 with the names of your isolates.

$ ls

$ for sample in ERR026473 ERR026474; do
    spades.py --only-assembler -1 "${sample}_1.fastq.gz" -2 "${sample}_2.fastq.gz" -o ~/assembly/"$sample"
  done
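Because each assembly can run for a long time, it may be worth checking the commands before starting them. The following is a sketch of a dry run that only prints the SPAdes command for each sample and flags missing read files. The helper name spades_dry_run is hypothetical, and the explicit -k 21,33,55 is included here only to show how the k-mer sizes could be set by hand.

```shell
# Hypothetical helper: print (do not run) the SPAdes command for one sample.
# Expects <sample>_1.fastq.gz and <sample>_2.fastq.gz in the current folder.
spades_dry_run() {
  sample=$1
  fwd="${sample}_1.fastq.gz"
  rev="${sample}_2.fastq.gz"
  if [ -f "$fwd" ] && [ -f "$rev" ]; then
    # the tilde is printed literally and expands when the line is pasted into a shell
    echo "spades.py --only-assembler -k 21,33,55 -1 $fwd -2 $rev -o ~/assembly/$sample"
  else
    echo "# skipped $sample: read files not found"
  fi
}

# Run from ~/reads; replace the names with those of your isolates
for sample in ERR026473 ERR026474; do
  spades_dry_run "$sample"
done
```

Once the printed lines look right, they can be pasted into the terminal to start the real assemblies.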

$ cd ~/assembly
$ ls 
$ cd ERR026473
$ ls

The assemblies for each k-mer size are found in the folders named K21, K33 and K55, where they are called final_contigs.fasta. The end result before scaffolding is called contigs.fasta, and the scaffolded contigs are in the file scaffolds.fasta.

Challenge: How many contigs did SPAdes generate with the different k-mer sizes? What about the final assembly? And what is the difference between contigs and scaffolds?

Find out how many contigs or scaffolds there are in the S. pneumoniae isolates. Enter your solution in the table.

Hint:

$ grep -c

prints a count of matching lines for each input file.

Solution

$ grep -c '>'  ERR*/K21/final_contigs.fasta
$ grep -c '>'  ERR*/K33/final_contigs.fasta
$ grep -c '>'  ERR*/K55/final_contigs.fasta
$ grep -c '>'  ERR*/contigs.fasta
$ grep -c '>'  ERR*/scaffolds.fasta
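To fill in the table, the counts can also be collected in one pass. The following sketch prints one tab-separated line per file. The helper name count_seqs is hypothetical, and the folder layout is assumed to be the one produced by the loop above.

```shell
# Hypothetical helper: print "file<TAB>number of sequences" for each FASTA file.
count_seqs() {
  for f in "$@"; do
    # skip names that do not exist, e.g. a glob that matched nothing
    [ -f "$f" ] || continue
    # every FASTA record starts with a header line beginning with '>'
    printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
  done
}

# Run from ~/assembly
count_seqs ERR*/K21/final_contigs.fasta ERR*/K33/final_contigs.fasta \
           ERR*/K55/final_contigs.fasta ERR*/contigs.fasta ERR*/scaffolds.fasta
```

Anchoring the pattern with ^ counts only lines that start with '>', which is where FASTA headers begin.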

At the moment, the scaffolds of every sample are stored in a file called scaffolds.fasta, so the file name does not identify the sample. This is not ideal. In the next episode we will rename the assembled scaffolds before processing them further.

Key Points

  • Assembly is a process which aligns and merges fragments from a longer DNA sequence in order to reconstruct the original sequence.

  • k-mers are short fragments of DNA of length k.