Sequence assembly

Overview

Teaching: 10 min
Exercises: 30 min
Questions
  • How can the information in the sequencing reads be reduced?

  • What are the different methods for assembly?

Objectives
  • Understand differences between assembly methods

  • Assemble the short reads

Sequence assembly means the alignment and merging of reads in order to reconstruct the original sequence. We will talk about sequence assembly in a separate lecture. The assembly of a genome from short sequencing reads can take a while - up to several hours per genome. We will therefore run this process overnight in the background.

To this end we will again start a screen session. It will allow us to run our entire shell and keep that process running even if we disconnect.

Starting and attaching to screen sessions

Starting a new session

You can start a session and give it a descriptive name:

$ screen -S assembly

This creates a session with the name ‘assembly’

As you work, this session will stay active until you close it. Even if you log out or work on something else, the jobs you start in this session will run until completion.

Sequence assembly

The assembler we will run is SKESA. SKESA generates a final assembly from multiple kmers. A list of kmers is automatically selected by SKESA using the maximum read length of the input data, and each individual kmer contributes to the final assembly. If desired, the lengths of kmers can be specified with the –kmer and –steps flags which will adjust automatic kmer selection.

Assembly (over night)

Because assembly of each genome might take a couple of hours, we will run all assemblies in a loop overnight. It is important to run them within the screen session or else the process will be terminated if we disconnect from the machine.

Preparation

$ mkdir ~/dc_workshop/results/assembly
$ cd ~/dc_workshop/data/trimmed_fastq

To run SKESA we will use the skesa command with a number of option that we will explore later on. We can start the loop with the assemblies with

$ for sample in ERR026473 ERR026474 ERR026478 ERR026481 ERR026482 ERR029206 ERR029207
>  do
>  skesa \
--cores 2 \
--memory 32 \
--fastq "${sample}"_1.fastq.gz_trim.fastq "${sample}"_2.fastq.gz_trim.fastq 1> ~/dc_workshop/results/assembly/"${sample}".fasta
>  done

A non-quoted backslash, \, is used as an escape character in Bash. It preserves the literal value of the next character that follows, with the exception of newline. This means that the back slash starts a new line without starting a new command - we only add it for better readability.

Detach session (process keeps running in background)

You can detach from a session by pressing control + a followed by d (for detach) on your keyboard.

You can now safely work on something else. The assembly will keep on running.

Challenge: What do the command line parameters of SKESA mean??

Find out a) the version of SKESA you are using and b) what the used command line parameters mean.

Hint:

$ skesa -h

prints the ‘help’

Solution

$ skesa -h
..
General options:
-h [ --help ]                 Produce help message
-v [ --version ]              Print version
..
--cores arg (=0)              Number of cores to use (default all) [integer]
--memory arg (=32)            Memory available (GB, only for sorted counter)
..  
--fastq arg                   Input fastq file(s) (could be used multiple times for different runs) [string]
..
..

Exercise: How many contigs were generated by SKESA??

Find out how many contigs there are in the M. tuberculosis isolates. Enter your solution in the table

Hint:

$ infoalign

prints information about a sequencing file

Solution

infoalign ERR026473.fasta
wc -l infoalign err026743.fasta
...
ERR026473.fasta:390
ERR026474.fasta:367
ERR026478.fasta:296
ERR026481.fasta:336
ERR026482.fasta:345
ERR029206.fasta:451
ERR029207.fasta:458

Key Points

  • Assembly is a process which aligns and merges fragments from a longer DNA sequence in order to reconstruct the original sequence.

  • k-mers are short fragments of DNA of length k