Eventalign_collapse usage

In brief

This program collapses the raw file generated by nanopolish eventalign by kmers rather than by event.

Quick start

Example CLI call

nanocompore eventalign_collapse \
    -i eventalign_file.tsv
    --outpath results

Example API call

# Import package
from nanocompore.Eventalign_collapse import Eventalign_collapse

# Init the object
e = Eventalign_collapse(
    eventalign_fn="eventalign_file.tsv",
    outpath="results")

# Run the analysis
e()

Description of options

Input nanopolish eventalign file(s)/stream

Path to a nanopolish eventalign tsv output file (read access required) which can optionally be gzipped. This can also be a list of files, or a regex or a list of regex or a mix of everything. We are flexible. In command line mode it is also possible to pipe the output of the nanopolish eventalign directly into Eventalign_collapse.

Output options

outdir is a path to the directory where to write all the output files generated by Eventalign_collapse(write access required, created if it does not exist). If it already exists, an error will be raise except if overwrite is given. outprefix is the file prefix used for all files generated.

n_lines

Controls the maximum number of lines to parse before stopping to read the input file. This could be useful for testing or downsampling.

threads

Eventalign_collapse is multi threaded to speed up the data processing and keep pace with Nanopolish if using the direct piping strategy. Take advantage of many threads if you have access to a large compute cluster

Output files format

Contrary to nanopolish eventalign output text file, in Eventalign_collapse the reads are separated by a hashtag headers containing the read_id and ref_id. This reduces the redundancy and makes it easier to find the start and end of a read.

Example : #7ef1d7b9-5824-4382-b23b-78d82c07ebbd YHR055C.

The main data file contains the following fields:

ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time status median mad num_signals

  • ref_pos: Reference sequence ID (contig).
  • ref_kmer: Sequence of the reference kmers.
  • num_events: Number of events for this kmer before collapsing.
  • num_signals: Number of raw signal data points.
  • dwell_time: dwell time for this kmer in seconds
  • NNNNN_dwell_time: dwell time of events for this kmers with a model sequence "NNNNN" (events ignored by nanopolish HMM).
  • mismatch_dwell_time: dwell time of events for this kmers with a model sequence different from the reference kmer
  • status: Status of the kmer. valid if all the model and predicted kmers are the same for all events. Otherwise NNNNN or mismatching depending on the invalid state having the most signal points.
  • median: Median of the normalised intensity values provided by Nanopolish eventalign
  • mad: Median absolute deviation of the normalised intensity values provided by Nanopolish eventalign

In addition Eventalign_collapse also generates a useful index file containing reads level information. It contains the following fields:

num_events num_signals kmers missing_kmers NNNNN_kmers mismatch_kmers valid_kmers byte_offset byte_len

  • ref_id: Name of the reference sequence the read was aligned on (contig)
  • ref_start: Start coordinate of the alignment on the reference sequence
  • ref_end: End coordinate of the alignment on the reference sequence
  • read_id: Name or index of the read
  • num_events: Total number of events per read
  • num_signals: Total number of signal points per read
  • dwell_time: Cumulative dwell time in seconds for the entire resquiggled sequence
  • kmers: Overall number of kmers in the read, including the missing ones
  • NNNNN_kmers: Number of resquiggled kmers for which the majority of missing data points correspond to "NNNNN" kmers
  • mismatching_kmers:Number of resquiggled kmers for which the majority of missing data points correspond to kmers different from the model kmer (excluding NNNNN)
  • missing_kmers: Number of skipped/missing reference positions in nanopolish output
  • valid_kmers: Number of fully valid kmers
  • byte_offset: Number of characters before the start of the sequence in the main output file. This can be used in conjunction with file.seek() to directly access the start of a read. An example is provided in the Usage notebook.
  • byte_len: Length of characters after byte_offset to the end of the read, excluding the last newline. This can be used in conjunction with read() to read all the text chunk corresponding to the read.

Python API usage

Import the package

from nanocompore.Eventalign_collapse import Eventalign_collapse
from nanocompore.common import *
from pycltools.pycltools import *
import shutil

# Create output dir and define log level
outdir = "results"
shutil.rmtree (outdir)
mkdir(outdir, exist_ok=True)
set_logger("info")
Creating /home/aleg/Programming/Packages/nanocompore/docs/demo/results

Example usage

Example with minimal file

e = Eventalign_collapse(
    eventalign_fn="./nanopolish_eventalign/nanopolish_reads_2.tsv",
    outpath="results",
    outprefix="s2",
    overwrite=True,
    progress=True,
    nthreads=4)
e()

head("./results/s2_eventalign_collapse.tsv")
head("./results/s2_eventalign_collapse.tsv.idx")
2020-12-07T22:07:39.526605+0000 INFO - MainProcess | Checking and initialising Eventalign_collapse
2020-12-07T22:07:39.528104+0000 INFO - MainProcess | Starting data processing
28 reads [00:00, 94.18 reads/s]
2020-12-07T22:07:39.900955+0000 INFO - Process-4 | Output reads written:28
#c96c0f7b-9596-43c5-a545-8e15a8a7f368  control
ref_pos ref_kmer    num_events  num_signals dwell_time  NNNNN_dwell_time    mismatch_dwell_time status  median  mad
1   TACTC   1   22  0.007300000172108412    0.0 0.0 valid   96.916504   0.8642044
2   ACTCG   1   12  0.003980000037699938    0.0 0.0 valid   95.18815    0.864151
3   CTCGA   1   25  0.008299999870359898    0.0 0.0 valid   85.6821 1.5363007
4   TCGAC   2   61  0.02026000013574958 0.0 0.0 valid   96.4364 1.9204025
5   CGACA   5   77  0.025550000369548798    0.0 0.0 valid   110.84  4.2249985
6   GACAT   1   12  0.003980000037699938    0.0 0.0 valid   89.23485    0.7681503
7   ACATA   1   12  0.003980000037699938    0.0 0.0 valid   86.6423 1.3443031
8   CATAG   3   52  0.017259999876841903    0.015269999857991934    0.0 NNNNN   152.513 6.1455

ref_id  ref_start ref_end read_id                              num_events num_signals dwell_time         kmers missing_kmers NNNNN_kmers mismatch_kmers valid_kmers byte_offset byte_len 
control 1         85      c96c0f7b-9596-43c5-a545-8e15a8a7f368 414        7781        2.5830999956233427 85    34            8           0              43          0           3673     
control 1         89      fd13d01a-95ef-4073-b9cf-fb7277da2c04 453        8578        2.8478099999483675 89    40            8           0              41          3674        3546     
control 1         87      b5783126-4472-45b4-9414-322a4a1a1776 540        10909       3.6216799857793376 87    32            10          0              45          7221        4019     
control 1         90      778ba618-ced5-4e9e-81f8-16419d209940 201        4132        1.3718199981376529 90    1             3           0              86          11241       6125     
control 2         88      506ff40c-92c3-4ff5-9169-3fbb9de2fc6b 301        6408        2.127389994333498  87    45            3           0              39          17367       3014     
control 2         80      0402647b-c92d-4b5d-a820-2ce437a4ec6f 171        3288        1.0915600005537271 79    2             1           0              76          20382       5263     
control 1         88      7f5e9d3b-60a5-4a27-9d74-053cc7f4791a 540        10888       3.6146999960765243 88    32            10          0              46          25646       4058     
control 2         87      97166ef2-5d04-4544-9b6d-f3084aff6951 326        6433        2.135599997243844  86    34            5           0              47          29705       3716     
control 4         88      3127653f-163f-4fcc-bcb4-90ec8ea7e638 357        7978        2.6487000052584335 85    29            4           0              52          33422       3953     

Using the index to random access a specific entry in the file

Random access with standard library

output_fn = "./results/s2_eventalign_collapse.tsv"
index_fn = output_fn+".idx"

# Imports
import csv
from tabulate import tabulate
from random import sample
from itertools import islice

# read index file and select random lines
index_list = []
with open (index_fn) as fp:
    for l in csv.DictReader(fp, delimiter='\t'):
        index_list.append(l)
random_index = sample(index_list, k=5)
print ("RANDOM INDEX LINES")
print (tabulate(random_index, headers="keys"))

# Open the collapsed event align file
with open (output_fn) as fp:
    for index in random_index:
        # Access the header corresponding to the randomly selected index line using seek 
        fp.seek(0) # Return to file start
        fp.seek(int(index["byte_offset"])) # Move to the offset indicated in the index file
        print ("\n" + fp.readline().rstrip()) # Print read header
        # Get 5 first lines and print then
        data_list = []
        for l in islice(csv.DictReader(fp, delimiter='\t'), 5):
            data_list.append(l)
        print (tabulate(data_list, headers="keys"))
RANDOM INDEX LINES
ref_id      ref_start    ref_end  read_id                                 num_events    num_signals    dwell_time    kmers    missing_kmers    NNNNN_kmers    mismatch_kmers    valid_kmers    byte_offset    byte_len
--------  -----------  ---------  ------------------------------------  ------------  -------------  ------------  -------  ---------------  -------------  ----------------  -------------  -------------  ----------
control            10         89  6b53dc3c-4913-4970-bc92-98c8a28d510a           299           6562       2.17847       80               26              9                 0             45          81568        3895
control             2         80  0402647b-c92d-4b5d-a820-2ce437a4ec6f           171           3288       1.09156       79                2              1                 0             76          20382        5263
control             8         89  465fdd43-e5be-4c66-9165-3ba9e2817109           365           7285       2.4185        82               21              6                 0             55          67007        4332
control             9         80  0c08b5d8-6c89-4ca6-b4dc-202096535762           178           3627       1.20414       72               13              6                 0             53          85464        4180
control             1         85  c96c0f7b-9596-43c5-a545-8e15a8a7f368           414           7781       2.5831        85               34              8                 0             43              0        3673

#6b53dc3c-4913-4970-bc92-98c8a28d510a   control
  ref_pos  ref_kmer      num_events    num_signals    dwell_time    NNNNN_dwell_time    mismatch_dwell_time  status      median      mad
---------  ----------  ------------  -------------  ------------  ------------------  ---------------------  --------  --------  -------
       10  TAGAT                  1              8       0.00266             0                            0  valid     153.746   10.5975
       11  AGATA                  4             57       0.01892             0                            0  valid     138.135   10.027
       12  GATAG                  4             95       0.03155             0.00332                      0  NNNNN     107.141   15.725
       14  TAGGA                 10            359       0.11919             0                            0  valid      97.3414   6.837
       15  AGGAC                  2             40       0.01328             0                            0  valid     114.434    5.6975

#0402647b-c92d-4b5d-a820-2ce437a4ec6f   control
  ref_pos  ref_kmer      num_events    num_signals    dwell_time    NNNNN_dwell_time    mismatch_dwell_time  status      median     mad
---------  ----------  ------------  -------------  ------------  ------------------  ---------------------  --------  --------  ------
        2  ACTCG                  2             57       0.01892                   0                      0  valid      83.736   1.8462
        3  CTCGA                  1             34       0.01129                   0                      0  valid      91.6244  2.1819
        4  TCGAC                  2             48       0.01593                   0                      0  valid      93.3028  2.2658
        5  CGACA                  1             72       0.0239                    0                      0  valid     108.073   6.6295
        6  GACAT                  1             17       0.00564                   0                      0  valid      83.2325  1.6784

#465fdd43-e5be-4c66-9165-3ba9e2817109   control
  ref_pos  ref_kmer      num_events    num_signals    dwell_time    NNNNN_dwell_time    mismatch_dwell_time  status      median      mad
---------  ----------  ------------  -------------  ------------  ------------------  ---------------------  --------  --------  -------
        8  CATAG                  1              7       0.00232                   0                      0  valid      94.5734  1.5419
        9  ATAGA                  1             10       0.00332                   0                      0  valid     118.087   3.2765
       10  TAGAT                  1             33       0.01096                   0                      0  valid     138.903   2.50501
       11  AGATA                  5            149       0.04947                   0                      0  valid     136.205   3.084
       12  GATAG                  1             10       0.00332                   0                      0  valid     106.62    5.3005

#0c08b5d8-6c89-4ca6-b4dc-202096535762   control
  ref_pos  ref_kmer      num_events    num_signals    dwell_time    NNNNN_dwell_time    mismatch_dwell_time  status      median      mad
---------  ----------  ------------  -------------  ------------  ------------------  ---------------------  --------  --------  -------
        9  ATAGA                  2             17       0.00565             0                            0  valid      98.6508  9.0242
       10  TAGAT                  3             51       0.01693             0.00232                      0  NNNNN     115.811   4.14201
       14  TAGGA                 13            248       0.08232             0                            0  valid     114.183   6.287
       15  AGGAC                  5            179       0.05943             0                            0  valid     116.107   4.734
       16  GGACT                 14            358       0.11886             0                            0  valid     126.905   3.55

#c96c0f7b-9596-43c5-a545-8e15a8a7f368   control
  ref_pos  ref_kmer      num_events    num_signals    dwell_time    NNNNN_dwell_time    mismatch_dwell_time  status      median       mad
---------  ----------  ------------  -------------  ------------  ------------------  ---------------------  --------  --------  --------
        1  TACTC                  1             22       0.0073                    0                      0  valid      96.9165  0.864204
        2  ACTCG                  1             12       0.00398                   0                      0  valid      95.1881  0.864151
        3  CTCGA                  1             25       0.0083                    0                      0  valid      85.6821  1.5363
        4  TCGAC                  2             61       0.02026                   0                      0  valid      96.4364  1.9204
        5  CGACA                  5             77       0.02555                   0                      0  valid     110.84    4.225

Example of random access with pandas

output_fn = "./results/s2_eventalign_collapse.tsv"
index_fn = output_fn+".idx"

# Imports
import pandas as pd
pd.set_option('display.max_columns', 6)

# read index file and select random lines
index_df = pd.read_csv (index_fn, sep="\t")
random_lines = index_df.sample(5)
print ("Random index lines")
display (random_lines)

# Open the collapsed event align file
with open (output_fn) as fp:
    for id, read in random_lines.iterrows():

        # Access the header corresponding to the randomly selected index line using seek 
        fp.seek(0) # Return to file start
        fp.seek(read.byte_offset) # Move to the offset indicated in the index file
        print (fp.readline().rstrip()) # Print read header
        df = pd.read_csv (fp, nrows=5, sep="\t") # Read lines corresponding to the read
        display(df)
Random index lines
ref_id ref_start ref_end ... valid_kmers byte_offset byte_len
16 control 8 89 ... 55 67007 4332
9 control 4 92 ... 51 37376 4138
26 control 9 85 ... 39 103487 3341
22 control 9 86 ... 41 89645 3670
0 control 1 85 ... 43 0 3673

5 rows × 14 columns

#465fdd43-e5be-4c66-9165-3ba9e2817109  control
ref_pos ref_kmer num_events ... status median mad
0 8 CATAG 1 ... valid 94.573400 1.541901
1 9 ATAGA 1 ... valid 118.087494 3.276501
2 10 TAGAT 1 ... valid 138.903000 2.505005
3 11 AGATA 5 ... valid 136.205000 3.084000
4 12 GATAG 1 ... valid 106.619500 5.300503

5 rows × 10 columns

#73f1c3ca-2bd3-4543-9b34-693935dce530  control
ref_pos ref_kmer num_events ... status median mad
0 4 TCGAC 44 ... NNNNN 95.305800 7.780205
1 5 CGACA 15 ... valid 121.944500 11.669998
2 6 GACAT 1 ... valid 75.939900 0.930199
3 7 ACATA 4 ... valid 77.039300 2.452454
4 8 CATAG 1 ... valid 92.515045 0.845703

5 rows × 10 columns

#2cc2d044-d8d9-4e72-a616-a0d25a787b2d  control
ref_pos ref_kmer num_events ... status median mad
0 9 ATAGA 1 ... NNNNN 82.00195 2.714901
1 13 ATAGG 6 ... valid 82.24150 2.714897
2 14 TAGGA 1 ... valid 109.31050 4.791000
3 15 AGGAC 17 ... NNNNN 118.49300 5.828999
4 16 GGACT 45 ... NNNNN 126.79800 4.152000

5 rows × 10 columns

#6f044625-cf92-4819-9e1d-31757d003355  control
ref_pos ref_kmer num_events ... status median mad
0 9 ATAGA 4 ... valid 109.451000 13.992195
1 10 TAGAT 7 ... valid 118.098495 6.315506
2 11 AGATA 4 ... valid 134.714000 6.024002
3 12 GATAG 23 ... valid 98.568100 1.360298
4 13 ATAGG 11 ... valid 84.964900 3.109299

5 rows × 10 columns

#c96c0f7b-9596-43c5-a545-8e15a8a7f368  control
ref_pos ref_kmer num_events ... status median mad
0 1 TACTC 1 ... valid 96.916504 0.864204
1 2 ACTCG 1 ... valid 95.188150 0.864151
2 3 CTCGA 1 ... valid 85.682100 1.536301
3 4 TCGAC 2 ... valid 96.436400 1.920403
4 5 CGACA 5 ... valid 110.840000 4.224998

5 rows × 10 columns

Full CLI and API documentations

API documentation

from nanocompore.Eventalign_collapse import Eventalign_collapse
from nanocompore.common import jhelp
jhelp(Eventalign_collapse.__init__)

init (eventalign_fn, outpath, outprefix, overwrite, n_lines, nthreads, progress)

Collapse the nanopolish eventalign events at kmer level


  • eventalign_fn (required) [str]

Path to a nanopolish eventalign tsv output file, or a list of file, or a regex (can be gzipped)

  • outpath (default: ./) [str]

Path to the output folder (will be created if it does exist yet)

  • outprefix (default: out) [str]

text outprefix for all the files generated

  • overwrite (default: False) [bool]

If the output directory already exists, the standard behaviour is to raise an error to prevent overwriting existing data This option ignore the error and overwrite data if they have the same outpath and outprefix.

  • n_lines (default: None) [int]

Maximum number of read to parse.

  • nthreads (default: 3) [int]

Number of threads (two are used for reading and writing, all the others for parallel processing).

  • progress (default: False) [bool]

Display a progress bar during execution

CLI documentation

nanocompore eventalign_collapse --help
usage: nanocompore eventalign_collapse [-h] [--eventalign EVENTALIGN]
                                       [--n_lines N_LINES]
                                       [--nthreads NTHREADS]
                                       [--outpath OUTPATH]
                                       [--outprefix OUTPREFIX] [--overwrite]
                                       [--log_level {warning,info,debug}]
                                       [--progress]

Collapse the nanopolish eventalign output at kmers level and compute kmer level statistics
* Minimal example
    nanocompore eventalign_collapse -i nanopolish_eventalign.tsv -outprefix out

optional arguments:
  -h, --help            show this help message and exit

Input options:
  --eventalign EVENTALIGN, -i EVENTALIGN
                        Path to a nanopolish eventalign tsv output file, or a
                        list of file, or a regex (can be gzipped). It can be
                        ommited if piped to standard input (default: piped to
                        stdin)

Run parameters options:
  --n_lines N_LINES, -l N_LINES
                        Number of lines to parse.(default: no limits

Other options:
  --nthreads NTHREADS, -t NTHREADS
                        Total number of threads. 2 threads are reserved for
                        the reader and the writer (default: 3)

Output options:
  --outpath OUTPATH, -o OUTPATH
                        Path to the output folder (default: ./)
  --outprefix OUTPREFIX, -p OUTPREFIX
                        text outprefix for all the files generated (default:
                        out)
  --overwrite, -w       Use --outpath even if it exists already (default:
                        False)

Verbosity options:
  --log_level {warning,info,debug}
                        Set the log level (default: info)
  --progress            Display a progress bar during execution (default:
                        False)
list(range(0,1))
[0]