

Eventalign_collapse usage¶

In brief¶

This program collapses the raw file generated by nanopolish eventalign by kmers rather than by event.

Quick start¶

Example CLI call

nanocompore eventalign_collapse \
    -i eventalign_file.tsv
    --outpath results

Example API call

# Import package
from nanocompore.Eventalign_collapse import Eventalign_collapse

# Init the object
e = Eventalign_collapse(
    eventalign_fn="eventalign_file.tsv",
    outpath="results")

# Run the analysis
e()

Description of options¶

Input nanopolish eventalign file(s)/stream¶

Path to a nanopolish eventalign tsv output file (read access required) which can optionally be gzipped. This can also be a list of files, or a regex or a list of regex or a mix of everything. We are flexible. In command line mode it is also possible to pipe the output of the nanopolish eventalign directly into Eventalign_collapse.

Output options¶

outdir is a path to the directory where to write all the output files generated by Eventalign_collapse(write access required, created if it does not exist). If it already exists, an error will be raise except if overwrite is given. outprefix is the file prefix used for all files generated.

n_lines¶

Controls the maximum number of lines to parse before stopping to read the input file. This could be useful for testing or downsampling.

threads¶

Eventalign_collapse is multi threaded to speed up the data processing and keep pace with Nanopolish if using the direct piping strategy. Take advantage of many threads if you have access to a large compute cluster

Output files format¶

Contrary to nanopolish eventalign output text file, in Eventalign_collapse the reads are separated by a hashtag headers containing the read_id and ref_id. This reduces the redundancy and makes it easier to find the start and end of a read.

Example : #7ef1d7b9-5824-4382-b23b-78d82c07ebbd YHR055C.

The main data file contains the following fields:

ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time status median mad num_signals

ref_pos: Reference sequence ID (contig).
ref_kmer: Sequence of the reference kmers.
num_events: Number of events for this kmer before collapsing.
num_signals: Number of raw signal data points.
dwell_time: dwell time for this kmer in seconds
NNNNN_dwell_time: dwell time of events for this kmers with a model sequence "NNNNN" (events ignored by nanopolish HMM).
mismatch_dwell_time: dwell time of events for this kmers with a model sequence different from the reference kmer
status: Status of the kmer. valid if all the model and predicted kmers are the same for all events. Otherwise NNNNN or mismatching depending on the invalid state having the most signal points.
median: Median of the normalised intensity values provided by Nanopolish eventalign
mad: Median absolute deviation of the normalised intensity values provided by Nanopolish eventalign

In addition Eventalign_collapse also generates a useful index file containing reads level information. It contains the following fields:

num_events num_signals kmers missing_kmers NNNNN_kmers mismatch_kmers valid_kmers byte_offset byte_len

ref_id: Name of the reference sequence the read was aligned on (contig)
ref_start: Start coordinate of the alignment on the reference sequence
ref_end: End coordinate of the alignment on the reference sequence
read_id: Name or index of the read
num_events: Total number of events per read
num_signals: Total number of signal points per read
dwell_time: Cumulative dwell time in seconds for the entire resquiggled sequence
kmers: Overall number of kmers in the read, including the missing ones
NNNNN_kmers: Number of resquiggled kmers for which the majority of missing data points correspond to "NNNNN" kmers
mismatching_kmers:Number of resquiggled kmers for which the majority of missing data points correspond to kmers different from the model kmer (excluding NNNNN)
missing_kmers: Number of skipped/missing reference positions in nanopolish output
valid_kmers: Number of fully valid kmers
byte_offset: Number of characters before the start of the sequence in the main output file. This can be used in conjunction with file.seek() to directly access the start of a read. An example is provided in the Usage notebook.
byte_len: Length of characters after byte_offset to the end of the read, excluding the last newline. This can be used in conjunction with read() to read all the text chunk corresponding to the read.

Python API usage¶

Import the package¶

from nanocompore.Eventalign_collapse import Eventalign_collapse
from nanocompore.common import *
from pycltools.pycltools import *
import shutil

# Create output dir and define log level
outdir = "results"
shutil.rmtree (outdir)
mkdir(outdir, exist_ok=True)
set_logger("info")

Creating /home/aleg/Programming/Packages/nanocompore/docs/demo/results

Example usage¶

Example with minimal file¶

e = Eventalign_collapse(
    eventalign_fn="./nanopolish_eventalign/nanopolish_reads_2.tsv",
    outpath="results",
    outprefix="s2",
    overwrite=True,
    progress=True,
    nthreads=4)
e()

head("./results/s2_eventalign_collapse.tsv")
head("./results/s2_eventalign_collapse.tsv.idx")

2020-12-07T22:07:39.526605+0000 INFO - MainProcess | Checking and initialising Eventalign_collapse
2020-12-07T22:07:39.528104+0000 INFO - MainProcess | Starting data processing
28 reads [00:00, 94.18 reads/s]
2020-12-07T22:07:39.900955+0000 INFO - Process-4 | Output reads written:28

#c96c0f7b-9596-43c5-a545-8e15a8a7f368  control
ref_pos ref_kmer    num_events  num_signals dwell_time  NNNNN_dwell_time    mismatch_dwell_time status  median  mad
1   TACTC   1   22  0.007300000172108412    0.0 0.0 valid   96.916504   0.8642044
2   ACTCG   1   12  0.003980000037699938    0.0 0.0 valid   95.18815    0.864151
3   CTCGA   1   25  0.008299999870359898    0.0 0.0 valid   85.6821 1.5363007
4   TCGAC   2   61  0.02026000013574958 0.0 0.0 valid   96.4364 1.9204025
5   CGACA   5   77  0.025550000369548798    0.0 0.0 valid   110.84  4.2249985
6   GACAT   1   12  0.003980000037699938    0.0 0.0 valid   89.23485    0.7681503
7   ACATA   1   12  0.003980000037699938    0.0 0.0 valid   86.6423 1.3443031
8   CATAG   3   52  0.017259999876841903    0.015269999857991934    0.0 NNNNN   152.513 6.1455

ref_id  ref_start ref_end read_id                              num_events num_signals dwell_time         kmers missing_kmers NNNNN_kmers mismatch_kmers valid_kmers byte_offset byte_len 
control 1         85      c96c0f7b-9596-43c5-a545-8e15a8a7f368 414        7781        2.5830999956233427 85    34            8           0              43          0           3673     
control 1         89      fd13d01a-95ef-4073-b9cf-fb7277da2c04 453        8578        2.8478099999483675 89    40            8           0              41          3674        3546     
control 1         87      b5783126-4472-45b4-9414-322a4a1a1776 540        10909       3.6216799857793376 87    32            10          0              45          7221        4019     
control 1         90      778ba618-ced5-4e9e-81f8-16419d209940 201        4132        1.3718199981376529 90    1             3           0              86          11241       6125     
control 2         88      506ff40c-92c3-4ff5-9169-3fbb9de2fc6b 301        6408        2.127389994333498  87    45            3           0              39          17367       3014     
control 2         80      0402647b-c92d-4b5d-a820-2ce437a4ec6f 171        3288        1.0915600005537271 79    2             1           0              76          20382       5263     
control 1         88      7f5e9d3b-60a5-4a27-9d74-053cc7f4791a 540        10888       3.6146999960765243 88    32            10          0              46          25646       4058     
control 2         87      97166ef2-5d04-4544-9b6d-f3084aff6951 326        6433        2.135599997243844  86    34            5           0              47          29705       3716     
control 4         88      3127653f-163f-4fcc-bcb4-90ec8ea7e638 357        7978        2.6487000052584335 85    29            4           0              52          33422       3953

Using the index to random access a specific entry in the file¶

Random access with standard library¶

output_fn = "./results/s2_eventalign_collapse.tsv"
index_fn = output_fn+".idx"

# Imports
import csv
from tabulate import tabulate
from random import sample
from itertools import islice

# read index file and select random lines
index_list = []
with open (index_fn) as fp:
    for l in csv.DictReader(fp, delimiter='\t'):
        index_list.append(l)
random_index = sample(index_list, k=5)
print ("RANDOM INDEX LINES")
print (tabulate(random_index, headers="keys"))

# Open the collapsed event align file
with open (output_fn) as fp:
    for index in random_index:
        # Access the header corresponding to the randomly selected index line using seek 
        fp.seek(0) # Return to file start
        fp.seek(int(index["byte_offset"])) # Move to the offset indicated in the index file
        print ("\n" + fp.readline().rstrip()) # Print read header
        # Get 5 first lines and print then
        data_list = []
        for l in islice(csv.DictReader(fp, delimiter='\t'), 5):
            data_list.append(l)
        print (tabulate(data_list, headers="keys"))

RANDOM INDEX LINES
ref_id      ref_start    ref_end  read_id                                 num_events    num_signals    dwell_time    kmers    missing_kmers    NNNNN_kmers    mismatch_kmers    valid_kmers    byte_offset    byte_len
--------  -----------  ---------  ------------------------------------  ------------  -------------  ------------  -------  ---------------  -------------  ----------------  -------------  -------------  ----------
control            10         89  6b53dc3c-4913-4970-bc92-98c8a28d510a           299           6562       2.17847       80               26              9                 0             45          81568        3895
control             2         80  0402647b-c92d-4b5d-a820-2ce437a4ec6f           171           3288       1.09156       79                2              1                 0             76          20382        5263
control             8         89  465fdd43-e5be-4c66-9165-3ba9e2817109           365           7285       2.4185        82               21              6                 0             55          67007        4332
control             9         80  0c08b5d8-6c89-4ca6-b4dc-202096535762           178           3627       1.20414       72               13              6                 0             53          85464        4180
control             1         85  c96c0f7b-9596-43c5-a545-8e15a8a7f368           414           7781       2.5831        85               34              8                 0             43              0        3673

#6b53dc3c-4913-4970-bc92-98c8a28d510a   control
  ref_pos  ref_kmer      num_events    num_signals    dwell_time    NNNNN_dwell_time    mismatch_dwell_time  status      median      mad
---------  ----------  ------------  -------------  ------------  ------------------  ---------------------  --------  --------  -------
       10  TAGAT                  1              8       0.00266             0                            0  valid     153.746   10.5975
       11  AGATA                  4             57       0.01892             0                            0  valid     138.135   10.027
       12  GATAG                  4             95       0.03155             0.00332                      0  NNNNN     107.141   15.725
       14  TAGGA                 10            359       0.11919             0                            0  valid      97.3414   6.837
       15  AGGAC                  2             40       0.01328             0                            0  valid     114.434    5.6975

#0402647b-c92d-4b5d-a820-2ce437a4ec6f   control
  ref_pos  ref_kmer      num_events    num_signals    dwell_time    NNNNN_dwell_time    mismatch_dwell_time  status      median     mad
---------  ----------  ------------  -------------  ------------  ------------------  ---------------------  --------  --------  ------
        2  ACTCG                  2             57       0.01892                   0                      0  valid      83.736   1.8462
        3  CTCGA                  1             34       0.01129                   0                      0  valid      91.6244  2.1819
        4  TCGAC                  2             48       0.01593                   0                      0  valid      93.3028  2.2658
        5  CGACA                  1             72       0.0239                    0                      0  valid     108.073   6.6295
        6  GACAT                  1             17       0.00564                   0                      0  valid      83.2325  1.6784

#465fdd43-e5be-4c66-9165-3ba9e2817109   control
  ref_pos  ref_kmer      num_events    num_signals    dwell_time    NNNNN_dwell_time    mismatch_dwell_time  status      median      mad
---------  ----------  ------------  -------------  ------------  ------------------  ---------------------  --------  --------  -------
        8  CATAG                  1              7       0.00232                   0                      0  valid      94.5734  1.5419
        9  ATAGA                  1             10       0.00332                   0                      0  valid     118.087   3.2765
       10  TAGAT                  1             33       0.01096                   0                      0  valid     138.903   2.50501
       11  AGATA                  5            149       0.04947                   0                      0  valid     136.205   3.084
       12  GATAG                  1             10       0.00332                   0                      0  valid     106.62    5.3005

#0c08b5d8-6c89-4ca6-b4dc-202096535762   control
  ref_pos  ref_kmer      num_events    num_signals    dwell_time    NNNNN_dwell_time    mismatch_dwell_time  status      median      mad
---------  ----------  ------------  -------------  ------------  ------------------  ---------------------  --------  --------  -------
        9  ATAGA                  2             17       0.00565             0                            0  valid      98.6508  9.0242
       10  TAGAT                  3             51       0.01693             0.00232                      0  NNNNN     115.811   4.14201
       14  TAGGA                 13            248       0.08232             0                            0  valid     114.183   6.287
       15  AGGAC                  5            179       0.05943             0                            0  valid     116.107   4.734
       16  GGACT                 14            358       0.11886             0                            0  valid     126.905   3.55

#c96c0f7b-9596-43c5-a545-8e15a8a7f368   control
  ref_pos  ref_kmer      num_events    num_signals    dwell_time    NNNNN_dwell_time    mismatch_dwell_time  status      median       mad
---------  ----------  ------------  -------------  ------------  ------------------  ---------------------  --------  --------  --------
        1  TACTC                  1             22       0.0073                    0                      0  valid      96.9165  0.864204
        2  ACTCG                  1             12       0.00398                   0                      0  valid      95.1881  0.864151
        3  CTCGA                  1             25       0.0083                    0                      0  valid      85.6821  1.5363
        4  TCGAC                  2             61       0.02026                   0                      0  valid      96.4364  1.9204
        5  CGACA                  5             77       0.02555                   0                      0  valid     110.84    4.225

Example of random access with pandas¶

output_fn = "./results/s2_eventalign_collapse.tsv"
index_fn = output_fn+".idx"

# Imports
import pandas as pd
pd.set_option('display.max_columns', 6)

# read index file and select random lines
index_df = pd.read_csv (index_fn, sep="\t")
random_lines = index_df.sample(5)
print ("Random index lines")
display (random_lines)

# Open the collapsed event align file
with open (output_fn) as fp:
    for id, read in random_lines.iterrows():

        # Access the header corresponding to the randomly selected index line using seek 
        fp.seek(0) # Return to file start
        fp.seek(read.byte_offset) # Move to the offset indicated in the index file
        print (fp.readline().rstrip()) # Print read header
        df = pd.read_csv (fp, nrows=5, sep="\t") # Read lines corresponding to the read
        display(df)

Random index lines

	ref_id	ref_start	ref_end	...	valid_kmers	byte_offset	byte_len
16	control	8	89	...	55	67007	4332
9	control	4	92	...	51	37376	4138
26	control	9	85	...	39	103487	3341
22	control	9	86	...	41	89645	3670
0	control	1	85	...	43	0	3673

5 rows × 14 columns

#465fdd43-e5be-4c66-9165-3ba9e2817109  control

	ref_pos	ref_kmer	num_events	...	status	median	mad
0	8	CATAG	1	...	valid	94.573400	1.541901
1	9	ATAGA	1	...	valid	118.087494	3.276501
2	10	TAGAT	1	...	valid	138.903000	2.505005
3	11	AGATA	5	...	valid	136.205000	3.084000
4	12	GATAG	1	...	valid	106.619500	5.300503

5 rows × 10 columns

#73f1c3ca-2bd3-4543-9b34-693935dce530  control

	ref_pos	ref_kmer	num_events	...	status	median	mad
0	4	TCGAC	44	...	NNNNN	95.305800	7.780205
1	5	CGACA	15	...	valid	121.944500	11.669998
2	6	GACAT	1	...	valid	75.939900	0.930199
3	7	ACATA	4	...	valid	77.039300	2.452454
4	8	CATAG	1	...	valid	92.515045	0.845703

5 rows × 10 columns

#2cc2d044-d8d9-4e72-a616-a0d25a787b2d  control

	ref_pos	ref_kmer	num_events	...	status	median	mad
0	9	ATAGA	1	...	NNNNN	82.00195	2.714901
1	13	ATAGG	6	...	valid	82.24150	2.714897
2	14	TAGGA	1	...	valid	109.31050	4.791000
3	15	AGGAC	17	...	NNNNN	118.49300	5.828999
4	16	GGACT	45	...	NNNNN	126.79800	4.152000

5 rows × 10 columns

#6f044625-cf92-4819-9e1d-31757d003355  control

	ref_pos	ref_kmer	num_events	...	status	median	mad
0	9	ATAGA	4	...	valid	109.451000	13.992195
1	10	TAGAT	7	...	valid	118.098495	6.315506
2	11	AGATA	4	...	valid	134.714000	6.024002
3	12	GATAG	23	...	valid	98.568100	1.360298
4	13	ATAGG	11	...	valid	84.964900	3.109299

5 rows × 10 columns

#c96c0f7b-9596-43c5-a545-8e15a8a7f368  control

	ref_pos	ref_kmer	num_events	...	status	median	mad
0	1	TACTC	1	...	valid	96.916504	0.864204
1	2	ACTCG	1	...	valid	95.188150	0.864151
2	3	CTCGA	1	...	valid	85.682100	1.536301
3	4	TCGAC	2	...	valid	96.436400	1.920403
4	5	CGACA	5	...	valid	110.840000	4.224998

5 rows × 10 columns

Full CLI and API documentations¶

API documentation¶

from nanocompore.Eventalign_collapse import Eventalign_collapse
from nanocompore.common import jhelp
jhelp(Eventalign_collapse.__init__)

init (eventalign_fn, outpath, outprefix, overwrite, n_lines, nthreads, progress)

Collapse the nanopolish eventalign events at kmer level

eventalign_fn (required) [str]

Path to a nanopolish eventalign tsv output file, or a list of file, or a regex (can be gzipped)

outpath (default: ./) [str]

Path to the output folder (will be created if it does exist yet)

outprefix (default: out) [str]

text outprefix for all the files generated

overwrite (default: False) [bool]

If the output directory already exists, the standard behaviour is to raise an error to prevent overwriting existing data This option ignore the error and overwrite data if they have the same outpath and outprefix.

n_lines (default: None) [int]

Maximum number of read to parse.

nthreads (default: 3) [int]

Number of threads (two are used for reading and writing, all the others for parallel processing).

progress (default: False) [bool]

Display a progress bar during execution

CLI documentation¶

nanocompore eventalign_collapse --help

usage: nanocompore eventalign_collapse [-h] [--eventalign EVENTALIGN]
                                       [--n_lines N_LINES]
                                       [--nthreads NTHREADS]
                                       [--outpath OUTPATH]
                                       [--outprefix OUTPREFIX] [--overwrite]
                                       [--log_level {warning,info,debug}]
                                       [--progress]

Collapse the nanopolish eventalign output at kmers level and compute kmer level statistics
* Minimal example
    nanocompore eventalign_collapse -i nanopolish_eventalign.tsv -outprefix out

optional arguments:
  -h, --help            show this help message and exit

Input options:
  --eventalign EVENTALIGN, -i EVENTALIGN
                        Path to a nanopolish eventalign tsv output file, or a
                        list of file, or a regex (can be gzipped). It can be
                        ommited if piped to standard input (default: piped to
                        stdin)

Run parameters options:
  --n_lines N_LINES, -l N_LINES
                        Number of lines to parse.(default: no limits

Other options:
  --nthreads NTHREADS, -t NTHREADS
                        Total number of threads. 2 threads are reserved for
                        the reader and the writer (default: 3)

Output options:
  --outpath OUTPATH, -o OUTPATH
                        Path to the output folder (default: ./)
  --outprefix OUTPREFIX, -p OUTPREFIX
                        text outprefix for all the files generated (default:
                        out)
  --overwrite, -w       Use --outpath even if it exists already (default:
                        False)

Verbosity options:
  --log_level {warning,info,debug}
                        Set the log level (default: info)
  --progress            Display a progress bar during execution (default:
                        False)

list(range(0,1))

[0]