Sequence assembly
Overview
Teaching: 10 min
Exercises: 30 minQuestions
How can the information in the sequencing reads be reduced?
What are the different methods for assembly?
Objectives
Understand differences between assembly methods
Assemble the short reads
Sequence assembly means the alignment and merging of reads in order to reconstruct the original sequence. We will talk about sequence assembly in a separate lecture. The assembly of a genome from short sequencing reads can take a while - up to several hours per genome. We will therefore run this process overnight in the background.
To this end we will again start a screen session. It will allow us to run our entire shell and keep that process running even if we disconnect.
Starting and attaching to screen sessions
Starting a new session
You can start a session and give it a descriptive name:
$ screen -S assembly
This creates a session with the name ‘assembly’
As you work, this session will stay active until you close it. Even if you log out or work on something else, the jobs you start in this session will run until completion.
Sequence assembly
The assembler we will run is SKESA. SKESA generates a final assembly from multiple kmers. A list of kmers is automatically selected by SKESA using the maximum read length of the input data, and each individual kmer contributes to the final assembly. If desired, the lengths of kmers can be specified with the –kmer and –steps flags which will adjust automatic kmer selection.
Assembly (over night)
Because assembly of each genome might take a couple of hours, we will run all assemblies in a loop overnight. It is important to run them within the screen session or else the process will be terminated if we disconnect from the machine.
Preparation
$ mkdir ~/dc_workshop/results/assembly
$ cd ~/dc_workshop/data/trimmed_fastq
To run SKESA we will use the skesa command with a number of option that we will explore later on. We can start the loop with the assemblies with
$ for sample in ERR026473 ERR026474 ERR026478 ERR026481 ERR026482 ERR029206 ERR029207
> do
> skesa \
--cores 2 \
--memory 32 \
--fastq "${sample}"_1.fastq.gz_trim.fastq "${sample}"_2.fastq.gz_trim.fastq 1> ~/dc_workshop/results/assembly/"${sample}".fasta
> done
A non-quoted backslash, \, is used as an escape character in Bash. It preserves the literal value of the next character that follows, with the exception of newline. This means that the back slash starts a new line without starting a new command - we only add it for better readability.
Detach session (process keeps running in background)
You can detach from a session by pressing control + a
followed by d
(for detach) on your keyboard.
You can now safely work on something else. The assembly will keep on running.
Challenge: What do the command line parameters of SKESA mean??
Find out a) the version of SKESA you are using and b) what the used command line parameters mean.
Hint:
$ skesa -h
prints the ‘help’
Solution
$ skesa -h .. General options: -h [ --help ] Produce help message -v [ --version ] Print version .. --cores arg (=0) Number of cores to use (default all) [integer] --memory arg (=32) Memory available (GB, only for sorted counter) .. --fastq arg Input fastq file(s) (could be used multiple times for different runs) [string] .. ..
Exercise: How many contigs were generated by SKESA??
Find out how many contigs there are in the M. tuberculosis isolates. Enter your solution in the table
Hint:
$ infoalign
prints information about a sequencing file
Solution
infoalign ERR026473.fasta wc -l infoalign err026743.fasta ... ERR026473.fasta:390 ERR026474.fasta:367 ERR026478.fasta:296 ERR026481.fasta:336 ERR026482.fasta:345 ERR029206.fasta:451 ERR029207.fasta:458
Key Points
Assembly is a process which aligns and merges fragments from a longer DNA sequence in order to reconstruct the original sequence.
k-mers are short fragments of DNA of length k