Sequence assembly

Overview

Teaching: min
Exercises: 60 min
Questions
  • How can the information in the sequencing reads be reduced?

  • What is the effect of k-mer size on assembly?

Objectives
  • Assemble the short reads

Sequence assembly is the alignment and merging of reads to reconstruct the original sequence. Assembling a genome from short sequencing reads can take anywhere from a few minutes to several hours per genome.

Sequence assembly

The assembler we will run is SKESA. SKESA is a de Bruijn graph-based de novo assembler designed for assembling reads of microbial genomes sequenced using Illumina. It uses multiple k-mer lengths during assembly: SKESA automatically selects a list of k-mer sizes based on the maximum read length of the input data, and each k-mer size contributes to the final assembly. We will collect all contigs produced at the different k-mer sizes in the file allkmers.txt.

$ cd ~/Secretome_prediction/trimmed
$ skesa --cores 2 --memory 32 --all ~/Secretome_prediction/assembly/allkmers.txt --fastq ERR022075_1.trimmed.fastq ERR022075_2.trimmed.fastq > ~/Secretome_prediction/assembly/ERR022075.fasta

This will take a while. Try not to interrupt your connection to the server while the calculation takes place.

You might notice that the sequences are no longer in fastq format but in fasta format.

The header (identifier) line, which begins with ‘>’, gives a name and/or a unique identifier for the sequence, and may also contain additional information. Here, we get information about the contigs, their length and k-mer coverage. The remaining lines contain the sequence, encoded as nucleotides (A, T, C, G).

$ cd ~/Secretome_prediction/assembly
$ head ERR022075.fasta
>contig_1
CACGTTAAATCATATCAGGCGTAATACCACAACCCTTAAGTTAGCGCTTATGGGAATTAT
CCCCGGCTTTTTTATGTATGGTCTTACAGCACCAGTGCTGCGATTGACGCAGACAGCACA
CTCACCAGGGTAGAGCCGTAAACCAGCTTCAGACCGAAGCGAGAAACCACGTTACCTTGC
TCTTCATTCAGGCCTTTAACTGCACCTGCGATAATCCCGATTGAAGAGAAGTTAGCGAAG
GAAACCAGGAACACAGAGATGATGCCTTCAGCACGCGGAGAGAGCGTGGAAGCAATTTTC
TGCAGATCCATCATCGCAACGAACTCGTTGGAAACCAGTTTGGTCGCCATGATACTGCCC
ACTTGCAGTGCTTCACTGGAAGGAACACCCATCACCCATGCAATCGGATAGAAGATGTAG
CCCAGGATGCCCTGGAAGGAGATGCTGTAGCCAAACCAGCCAGTAACGGTGGCAAACAGT
GCGTTCAGCGCGGCGATCAGGGCGATAAAGCCAATCAGCATCGCGGCAACGATAATGGCA
ACTTTGAAACCTGCCAGAATGTATTCACCCAGCATTTCGAAGAAGCTCTGACCTTCGTGC
AGGTTGGACATCTGGATGTTTTCTTCACTGGCATCAACACGGTAAGGATTGATCAGCGAC
AGCACGATAAAGGTGCTGAACATGTTCAGTACCAGCGCAGCAACGACGTATTTCGGTTCC
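Beyond looking at the first lines, it can help to summarise the whole FASTA file. The following is a minimal sketch, assuming only the FASTA layout described above (one ‘>’ header line per contig, followed by sequence lines). The function name fasta_stats is a hypothetical helper introduced here for illustration and is not part of SKESA.

```shell
# Print the number of records and the total sequence length of a FASTA file.
# Usage: fasta_stats FILE
# (fasta_stats is a hypothetical helper name, not a SKESA tool.)
fasta_stats() {
    awk '
        /^>/ { n++; next }              # header line: one more contig
             { bases += length($0) }    # sequence line: add its length
        END  { printf "contigs=%d total_bases=%d\n", n, bases }
    ' "$1"
}
```

Once the function is defined in your shell, running fasta_stats ERR022075.fasta in the assembly directory should report how many contigs the final assembly contains and how many bases they cover in total.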

Challenge: How many contigs were generated by SKESA at different k-mer sizes?

What is the effect of k-mer size on genome assembly? K-mers of a specific size are used to construct the de Bruijn graph. Larger k-mer sizes increase the number of edges in the graph, which in turn influences the amount of memory needed to construct it. The chance of finding an overlap is also lower with larger k-mers. What’s an ideal k-mer size for this assembly?
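To make the notion of a k-mer concrete, the sketch below lists every k-mer of a short sequence. It only illustrates how a sequence is split into overlapping substrings of length k; it is not how SKESA builds its graph. The function name kmers and the toy sequence ATCGGA are assumptions made for this example.

```shell
# List every k-mer (substring of length k) of a sequence, one per line.
# Usage: kmers SEQUENCE K
# (kmers is a hypothetical helper name used only for illustration.)
kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        # slide a window of width k along the sequence
        for (i = 1; i + k - 1 <= length($0); i++)
            print substr($0, i, k)
    }'
}

kmers ATCGGA 3   # a sequence of length 6 yields 6 - 3 + 1 = 4 k-mers
kmers ATCGGA 5   # a larger k yields fewer k-mers: 6 - 5 + 1 = 2
```

A sequence of length L contains L - k + 1 k-mers, so a larger k leaves fewer, longer k-mers per read to overlap with those of other reads.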

Find out how many contigs have been produced at the following k-mer sizes:

  1. kmer21
  2. kmer27
  3. kmer33
  4. kmer45
  5. kmer57
  6. kmer71
  7. kmer83

Hint:

$ grep -c "kmer21" allkmers.txt

The -c option makes grep print the number of matching lines in the input file instead of the lines themselves.

Solution

  1. 1256
  2. 602
  3. 470
  4. 287
  5. 238
  6. 216
  7. 198
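The grep command from the hint can be repeated for all k-mer sizes with a small loop instead of being typed seven times. This is a sketch that reuses the same search pattern as the hint, so it assumes that each contig in allkmers.txt is labelled with a string such as kmer21. The function name count_kmer_contigs is hypothetical.

```shell
# Count the lines that mention each k-mer size in a file.
# Usage: count_kmer_contigs FILE K [K ...]
# (count_kmer_contigs is a hypothetical helper name.)
count_kmer_contigs() {
    file=$1
    shift
    for k in "$@"; do
        # grep -c prints the number of matching lines (0 if none match).
        # Note: the pattern is a plain substring, so "kmer2" would also
        # match "kmer21"; the sizes used in this lesson do not collide.
        printf 'kmer%s\t%s\n' "$k" "$(grep -c "kmer${k}" "$file")"
    done
}
```

With the function defined, count_kmer_contigs allkmers.txt 21 27 33 45 57 71 83 prints one line per k-mer size, each with the label and its count separated by a tab.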

Key Points

  • Assembly is a process which aligns and merges fragments from a longer DNA sequence in order to reconstruct the original sequence.

  • k-mers are short fragments of DNA of length k.