Eventalign_collapse usage¶
In brief¶
This program collapses the raw file generated by nanopolish eventalign
by kmers rather than by event.
Quick start¶
Example CLI call
nanocompore eventalign_collapse \
-i eventalign_file.tsv
--outpath results
Example API call
# Import package
from nanocompore.Eventalign_collapse import Eventalign_collapse
# Init the object
e = Eventalign_collapse(
eventalign_fn="eventalign_file.tsv",
outpath="results")
# Run the analysis
e()
Description of options¶
Input nanopolish eventalign file(s)/stream¶
Path to a nanopolish eventalign
tsv output file (read access required) which can optionally be gzipped.
This can also be a list of files, or a regex or a list of regex or a mix of everything. We are flexible.
In command line mode it is also possible to pipe the output of the nanopolish eventalign
directly into Eventalign_collapse
.
Output options¶
outdir
is a path to the directory where to write all the output files generated by Eventalign_collapse
(write access required, created if it does not exist).
If it already exists, an error will be raise except if overwrite
is given. outprefix
is the file prefix used for all files generated.
n_lines¶
Controls the maximum number of lines to parse before stopping to read the input file. This could be useful for testing or downsampling.
threads¶
Eventalign_collapse
is multi threaded to speed up the data processing and keep pace with Nanopolish if using the direct piping strategy. Take advantage of many threads if you have access to a large compute cluster
Output files format¶
Contrary to nanopolish eventalign
output text file, in Eventalign_collapse
the reads are separated by a hashtag headers containing the read_id and ref_id. This reduces the redundancy and makes it easier to find the start and end of a read.
Example : #7ef1d7b9-5824-4382-b23b-78d82c07ebbd YHR055C.
The main data file contains the following fields:
ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time status median mad num_signals
- ref_pos: Reference sequence ID (contig).
- ref_kmer: Sequence of the reference kmers.
- num_events: Number of events for this kmer before collapsing.
- num_signals: Number of raw signal data points.
- dwell_time: dwell time for this kmer in seconds
- NNNNN_dwell_time: dwell time of events for this kmers with a model sequence "NNNNN" (events ignored by nanopolish HMM).
- mismatch_dwell_time: dwell time of events for this kmers with a model sequence different from the reference kmer
- status: Status of the kmer.
valid
if all the model and predicted kmers are the same for all events. OtherwiseNNNNN
ormismatching
depending on the invalid state having the most signal points. - median: Median of the normalised intensity values provided by Nanopolish eventalign
- mad: Median absolute deviation of the normalised intensity values provided by Nanopolish eventalign
In addition Eventalign_collapse
also generates a useful index file containing reads level information. It contains the following fields:
num_events num_signals kmers missing_kmers NNNNN_kmers mismatch_kmers valid_kmers byte_offset byte_len
- ref_id: Name of the reference sequence the read was aligned on (contig)
- ref_start: Start coordinate of the alignment on the reference sequence
- ref_end: End coordinate of the alignment on the reference sequence
- read_id: Name or index of the read
- num_events: Total number of events per read
- num_signals: Total number of signal points per read
- dwell_time: Cumulative dwell time in seconds for the entire resquiggled sequence
- kmers: Overall number of kmers in the read, including the missing ones
- NNNNN_kmers: Number of resquiggled kmers for which the majority of missing data points correspond to "NNNNN" kmers
- mismatching_kmers:Number of resquiggled kmers for which the majority of missing data points correspond to kmers different from the model kmer (excluding NNNNN)
- missing_kmers: Number of skipped/missing reference positions in nanopolish output
- valid_kmers: Number of fully valid kmers
- byte_offset: Number of characters before the start of the sequence in the main output file. This can be used in conjunction with file.seek() to directly access the start of a read. An example is provided in the Usage notebook.
- byte_len: Length of characters after byte_offset to the end of the read, excluding the last newline. This can be used in conjunction with read() to read all the text chunk corresponding to the read.
Python API usage¶
Import the package¶
from nanocompore.Eventalign_collapse import Eventalign_collapse from nanocompore.common import * from pycltools.pycltools import * import shutil # Create output dir and define log level outdir = "results" shutil.rmtree (outdir) mkdir(outdir, exist_ok=True) set_logger("info")
Example usage¶
Example with minimal file¶
e = Eventalign_collapse( eventalign_fn="./nanopolish_eventalign/nanopolish_reads_2.tsv", outpath="results", outprefix="s2", overwrite=True, progress=True, nthreads=4) e() head("./results/s2_eventalign_collapse.tsv") head("./results/s2_eventalign_collapse.tsv.idx")
Using the index to random access a specific entry in the file¶
Random access with standard library¶
output_fn = "./results/s2_eventalign_collapse.tsv" index_fn = output_fn+".idx" # Imports import csv from tabulate import tabulate from random import sample from itertools import islice # read index file and select random lines index_list = [] with open (index_fn) as fp: for l in csv.DictReader(fp, delimiter='\t'): index_list.append(l) random_index = sample(index_list, k=5) print ("RANDOM INDEX LINES") print (tabulate(random_index, headers="keys")) # Open the collapsed event align file with open (output_fn) as fp: for index in random_index: # Access the header corresponding to the randomly selected index line using seek fp.seek(0) # Return to file start fp.seek(int(index["byte_offset"])) # Move to the offset indicated in the index file print ("\n" + fp.readline().rstrip()) # Print read header # Get 5 first lines and print then data_list = [] for l in islice(csv.DictReader(fp, delimiter='\t'), 5): data_list.append(l) print (tabulate(data_list, headers="keys"))
Example of random access with pandas¶
output_fn = "./results/s2_eventalign_collapse.tsv" index_fn = output_fn+".idx" # Imports import pandas as pd pd.set_option('display.max_columns', 6) # read index file and select random lines index_df = pd.read_csv (index_fn, sep="\t") random_lines = index_df.sample(5) print ("Random index lines") display (random_lines) # Open the collapsed event align file with open (output_fn) as fp: for id, read in random_lines.iterrows(): # Access the header corresponding to the randomly selected index line using seek fp.seek(0) # Return to file start fp.seek(read.byte_offset) # Move to the offset indicated in the index file print (fp.readline().rstrip()) # Print read header df = pd.read_csv (fp, nrows=5, sep="\t") # Read lines corresponding to the read display(df)
Full CLI and API documentations¶
API documentation¶
from nanocompore.Eventalign_collapse import Eventalign_collapse from nanocompore.common import jhelp jhelp(Eventalign_collapse.__init__)
CLI documentation¶
nanocompore eventalign_collapse --help
usage: nanocompore eventalign_collapse [-h] [--eventalign EVENTALIGN]
[--n_lines N_LINES]
[--nthreads NTHREADS]
[--outpath OUTPATH]
[--outprefix OUTPREFIX] [--overwrite]
[--log_level {warning,info,debug}]
[--progress]
Collapse the nanopolish eventalign output at kmers level and compute kmer level statistics
* Minimal example
nanocompore eventalign_collapse -i nanopolish_eventalign.tsv -outprefix out
optional arguments:
-h, --help show this help message and exit
Input options:
--eventalign EVENTALIGN, -i EVENTALIGN
Path to a nanopolish eventalign tsv output file, or a
list of file, or a regex (can be gzipped). It can be
ommited if piped to standard input (default: piped to
stdin)
Run parameters options:
--n_lines N_LINES, -l N_LINES
Number of lines to parse.(default: no limits
Other options:
--nthreads NTHREADS, -t NTHREADS
Total number of threads. 2 threads are reserved for
the reader and the writer (default: 3)
Output options:
--outpath OUTPATH, -o OUTPATH
Path to the output folder (default: ./)
--outprefix OUTPREFIX, -p OUTPREFIX
text outprefix for all the files generated (default:
out)
--overwrite, -w Use --outpath even if it exists already (default:
False)
Verbosity options:
--log_level {warning,info,debug}
Set the log level (default: info)
--progress Display a progress bar during execution (default:
False)
list(range(0,1))