Compare samples with SampComp

In brief

First, SampComp parses the sample eventalign collapse files and then the observed results are piled-up per reference at position level. In a second time, positions are compared using various statistical methods and the statistics are stored in a shelve DBM database containing the results for all positions with sufficient coverage. The API returns a SampCompDB database wrapper object that can be subsequently interrogated to extract data and plots.

Quick start

Example CLI call

nanocompore sampcomp \
    --file_list1 ./data/S1_R1.tsv,./data/S1_R2.tsv \
    --file_list2 ./data/S2_R1.tsv,./data/S2_R2.tsv \
    --label1 S1 \
    --label2 S2 \
    --fasta ./reference/ref.fa \
    --outpath ./results

Example API call

# Import package
from nanocompore.SampComp import SampComp

# Init the object
s = SampComp(
    eventalign_fn_dict = {
        'S1':{'rep1':'./data/S1_R1.tsv', 'rep2':'./data/S1_R2.tsv'},
        'S2':{'rep1':'./data/S2_R1.tsv', 'rep2':'./data/S2_R2.tsv'}},
    outpath = "./results/",
    fasta_fn = "./reference/ref.fa")

# Run the analysis
s()

Description of main options

SampComp provides a very flexible analysis framework with a few mandatory options and many optional parameters. The full CLI and API documentations are provided at the bottom of this page.

Sample files

SampComp requires sample files obtained with NanopolishComp EventalignCollapse as explained before (see data preparation) for both the control and the experimental conditions. 2 conditions are expected and at least 2 replicates per conditions are highly recommended. If SampComp is called through the CLI the files can be provided using either relevant command options or a YAML file. With the Python API you can pass either a python dictionary or a YAML file.

!!! info "YAML file option (CLI or API)" This option allows to pass a YAML formatted file indicating the sample condition labels and paths to data files with the option --sample_yaml for the CLI or directly with eventalign_fn_dict for the API. The file should be formatted as follow:

```yaml
WT:
    rep1:   ./data/S1_R1.tsv
    rep2:   ./data/S1_R2.tsv

KO:
    rep1:   ./data/S2_R1.tsv
    rep2:   ./data/S1_R2.tsv
</code></pre>
<p>!!! info "Command line option (CLI only)"
    This option requires to provide a comma separated list of files per condition using <code>--file_list1</code> and <code>--file_list2</code> arguments as well as the labels for condition each conditions using <code>--label1</code> and <code>--label2</code>.</p>
<p>!!! info "Python dictionary (API only)"
    This option allows to pass a multi-level python dictionary containing the sample condition labels and paths to data files. The dictionary should be formatted as follow:</p>

<pre><code>```python
eventalign_fn_dict = {
    "WT":  {"rep1":"./data/S1_R1.tsv", "rep2":"./data/S1_R2.tsv"},
    "KO": {"rep1":"./data/S2_R1.tsv", "rep2":"./data/S1_R2.tsv"}
    }
```</code></pre>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Transcriptome-reference-FASTA-file">Transcriptome reference FASTA file<a class="anchor-link" href="#Transcriptome-reference-FASTA-file">&#182;</a></h4><p>A transcriptome FASTA reference file is required to extract kmer sequences during the analyses. The reference has to be the same as the one used at the mapping step. (CLI: <code>--fasta</code>, API: <code>fasta_fn</code>)</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Output-folder">Output folder<a class="anchor-link" href="#Output-folder">&#182;</a></h4><p>Although it is not mandatory, it is recommended to provide a path to a directory where the program will output the result files  (CLI: <code>--outpath</code>, API: <code>outpath</code>). In addition, users can also specify a prefix for the files to be generated (CLI: <code>--outprefix</code>, API: <code>outprefix</code>). Finally, if the outpath directory already exists, the program will raise an error to avoid erasing result files. To ignore the error you can to specify to overwrite previous results (CLI: <code>--overwrite</code>, API: <code>overwrite</code>).</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Genome-annotation-BED-file">Genome annotation BED file<a class="anchor-link" href="#Genome-annotation-BED-file">&#182;</a></h4><p>Optionally, a BED file containing the genome annotations corresponding to the transcriptome fasta file can be provided. In that case Nanocompore will also convert the transcript coordinates into the genome space (CLI: <code>--bed</code>, API: <code>bed_fn</code>)</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Statistical-options">Statistical options<a class="anchor-link" href="#Statistical-options">&#182;</a></h4><p><code>SampComp</code> implements several statistical methods to evaluate the difference between the 2 conditions (<code>comparison_method</code>).</p>
<ul>
<li>Gaussian Mixture Model = GMM (default)</li>
<li>Kolmogorov–Smirnov test = KS (default)</li>
<li>Mann–Whitney U test = MW</li>
<li>T-test = TT</li>
</ul>
<p>In addition, it is also possible to specify the number of adjacent positions to take into account for the pvalue calculation (<code>sequence_context</code>) as well as the weights to give to adjacent position, using either an "uniform" or a "harmonic" distribution (<code>sequence_context_weights</code>).</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Coverage-options">Coverage options<a class="anchor-link" href="#Coverage-options">&#182;</a></h4><p>The default coverage threshold for <code>SampComp</code> to perform a statistical test is 30 reads in each replicates. This is quite conservative and can be modified if needed (<code>min_coverage</code>). In addition, to reduce the computational burden it is possible to randomly down-sample the number of reads for high coverage references (<code>downsample_high_coverage</code>).</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Manually-exclude-or-include-references-(API-only)">Manually exclude or include references (API only)<a class="anchor-link" href="#Manually-exclude-or-include-references-(API-only)">&#182;</a></h4><p>The API allows to specify references to be included or excluded from the analysis (<code>select_ref_id</code> and
<code>exclude_ref_id</code>). This can be useful to analyse a specific set of transcripts or to run a small test before analysing the whole dataset.</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Advanced-API-Usage">Advanced API Usage<a class="anchor-link" href="#Advanced-API-Usage">&#182;</a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Import-the-package">Import the package<a class="anchor-link" href="#Import-the-package">&#182;</a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">

```python
from nanocompore.SampComp import SampComp
from nanocompore.common import *
import shutil

# Create output dir and define log level
outdir = "results"
shutil.rmtree (outdir)
mkdir(outdir, exist_ok=True)
set_logger("info")

Using a Python dictionary to specify the locations of the eventalign files

Default option using a multilevel dictionary.

# Init the object
s = SampComp ( 
    eventalign_fn_dict = {
        "KO":{
            "rep1":'./eventalign_files/yeast/KO_1_eventalign_collapsed.tsv', 
            "rep2":"./eventalign_files/yeast/KO_2_eventalign_collapsed.tsv"},
        "WT":{
            "rep1":'./eventalign_files/yeast/WT_1_eventalign_collapsed.tsv', 
            "rep2":"./eventalign_files/yeast/WT_2_eventalign_collapsed.tsv"}},
    outpath = "./results",
    outprefix= "yeast_",
    nthreads=3,
    max_invalid_kmers_freq = 0.2,
    min_coverage=30,
    fasta_fn = "./references/yeast/Yeast_transcriptome.fa",
    overwrite = True)

# Run the analysis
db = s()
2020-12-07T18:30:16.356391+0000 INFO - MainProcess | Checking and initialising SampComp
2020-12-07T18:30:16.419459+0000 INFO - MainProcess | Reading eventalign index files
2020-12-07T18:30:16.431004+0000 ERROR - MainProcess | High fraction of invalid kmers (20.74%) for read 107
2020-12-07T18:30:16.432924+0000 INFO - MainProcess |    References found in index: 146
2020-12-07T18:30:16.433256+0000 INFO - MainProcess | Filtering out references with low coverage
2020-12-07T18:30:16.613162+0000 INFO - MainProcess |    References remaining after reference coverage filtering: 1
2020-12-07T18:30:16.613684+0000 INFO - MainProcess | Starting data processing
2020-12-07T18:31:08.593363+0000 INFO - Process-17 | All Done. Transcripts processed: 1
2020-12-07T18:31:08.636924+0000 INFO - MainProcess | Loading SampCompDB
2020-12-07T18:31:08.654674+0000 INFO - MainProcess | Calculate results

Using a YAML file instead to specify the files location

To use a YAML file, just pass a path to the file instead of a dictionary. In this example we also specify to use 10 threads to speed-up the data processing (nthreads = 10).

# Read YAML file to display structure
print(open("./samples.yaml").read())
Modified:
    rep1:   ./eventalign_files/simulated/modified_rep_1.tsv
    rep2:   ./eventalign_files/simulated/modified_rep_2.tsv
Unmodified:
    rep1:   ./eventalign_files/simulated/unmodified_rep_1.tsv
    rep2:   ./eventalign_files/simulated/unmodified_rep_2.tsv

# Init the object
s = SampComp (
    eventalign_fn_dict = "./samples.yaml",
    outpath= "./results",
    outprefix= "simulated_",
    fasta_fn = "./references/simulated/ref.fa",
    overwrite = True,
    progress = True,
    nthreads = 4)

# Run the analysis
db = s ()
2020-12-07T18:31:19.526366+0000 INFO - MainProcess | Checking and initialising SampComp
2020-12-07T18:31:19.536530+0000 INFO - MainProcess | Reading eventalign index files
2020-12-07T18:31:19.559451+0000 INFO - MainProcess |    References found in index: 5
2020-12-07T18:31:19.559797+0000 INFO - MainProcess | Filtering out references with low coverage
2020-12-07T18:31:19.563239+0000 INFO - MainProcess |    References remaining after reference coverage filtering: 5
2020-12-07T18:31:19.566024+0000 INFO - MainProcess | Starting data processing
100%|██████████| 5/5 [00:13<00:00,  2.63s/ Processed References]
2020-12-07T18:31:32.819370+0000 INFO - Process-21 | All Done. Transcripts processed: 5
2020-12-07T18:31:32.828972+0000 INFO - MainProcess | Loading SampCompDB
2020-12-07T18:31:32.841026+0000 INFO - MainProcess | Calculate results

Tweaking the statistical tests

In this example, we require all 4 possible statistical tests to be performed (comparison_methods=["GMM", "MW", "KS", "TT"]), and we want a pvalue that takes into account 2 adjacent positions(sequence_context=2). In addition we also specify that the weights for adjacent position should follow a harmonic distribution (sequence_context_weights="harmonic"). Finally, we also want the Gaussian mixture model to use the logistic regression test on top of the ANOVA (logit = True).

# Init the object
s = SampComp (
    eventalign_fn_dict = "./samples.yaml",
    outpath = "./results/",
    outprefix= "simulated_stats_",
    fasta_fn = "./references/simulated/ref.fa",
    overwrite=True,
    comparison_methods=["GMM", "MW", "KS", "TT"],
    sequence_context=2,
    sequence_context_weights="harmonic",
    logit = True,
    nthreads = 6)

# Run the analysis
db = s ()
2020-12-07T18:31:48.171748+0000 INFO - MainProcess | Checking and initialising SampComp
2020-12-07T18:31:48.176382+0000 INFO - MainProcess | Reading eventalign index files
2020-12-07T18:31:48.189340+0000 INFO - MainProcess |    References found in index: 5
2020-12-07T18:31:48.189937+0000 INFO - MainProcess | Filtering out references with low coverage
2020-12-07T18:31:48.193885+0000 INFO - MainProcess |    References remaining after reference coverage filtering: 5
2020-12-07T18:31:48.195954+0000 INFO - MainProcess | Starting data processing
2020-12-07T18:32:08.969484+0000 INFO - Process-27 | All Done. Transcripts processed: 5
2020-12-07T18:32:08.983224+0000 INFO - MainProcess | Loading SampCompDB
2020-12-07T18:32:09.000121+0000 INFO - MainProcess | Calculate results

Explore the database object returned by Sampcomp

The database object returned by Sampcomp is a Python GDBM object database indexed by reference id. Although we highly recommend to used SampCompDB to extract information, it is also possible to open the database directly using the python standard library module shelve.

import shelve

with shelve.open ("./results/simulated_SampComp.db") as db:
    # Read metadata stored in metadata slot
    print(db["__metadata"])
    # Read list of references ids stored in ref_id_list slot
    print(db["__ref_id_list"])
    # Access stats for reference 'ref_0000' in position 1
    print(db["ref_0000"][1]["txComp"])
{'package_name': '1.0.1.dev0', 'package_version': 'nanocompore', 'timestamp': '2020-12-07 18:31:32.813597', 'comparison_methods': ['GMM', 'KS'], 'pvalue_tests': ['GMM_anova_pvalue', 'KS_dwell_pvalue', 'KS_intensity_pvalue'], 'sequence_context': 0, 'min_coverage': 30, 'n_samples': 4}
['ref_0001', 'ref_0000', 'ref_0002', 'ref_0003', 'ref_0004']
{'GMM_model': {'model': GaussianMixture(n_components=2,
                random_state=RandomState(MT19937) at 0x7FE80C820270), 'cluster_counts': 'Modified_rep1:43/12__Modified_rep2:49/11__Unmodified_rep1:4/51__Unmodified_rep2:3/57'}, 'GMM_anova_pvalue': 0.0026070675453166692, 'GMM_anova_model': {'pvalue': 0.0026070675453166692, 'delta_logit': 3.8311555435, 'table': F_onewayResult(statistic=382.0733766141018, pvalue=0.0026070675453166692), 'log_ratios': array([ 1.21924028,  1.42711636, -2.34180581, -2.67414865])}, 'KS_intensity_pvalue': 2.0829525794874904e-18, 'KS_dwell_pvalue': 4.9251621679434625e-36, 'shift_stats': OrderedDict([('c1_mean_intensity', 99.54709799468591), ('c2_mean_intensity', 91.84415815799566), ('c1_median_intensity', 100.61305720731731), ('c2_median_intensity', 92.05605629218894), ('c1_sd_intensity', 6.85551651760893), ('c2_sd_intensity', 5.143626863495356), ('c1_mean_dwell', 0.02567361308475615), ('c2_mean_dwell', 0.009745078440682191), ('c1_median_dwell', 0.023047132241771167), ('c2_median_dwell', 0.007591605413309661), ('c1_sd_dwell', 0.010681688969392145), ('c2_sd_dwell', 0.007203231747770627)])}

Full CLI and API documentations

API documentation

API help can be obtained with conventional python methods (help or ?) or rendered nicely in Jupyter with the jhelp function from nanocompore

from nanocompore.SampComp import SampComp
from nanocompore.common import jhelp
jhelp(SampComp)

SampComp (eventalign_fn_dict, fasta_fn, bed_fn, outpath, outprefix, overwrite, whitelist, comparison_methods, logit, allow_warnings, sequence_context, sequence_context_weights, min_coverage, min_ref_length, downsample_high_coverage, max_invalid_kmers_freq, select_ref_id, exclude_ref_id, nthreads, progress)

Init analysis and check args


  • eventalign_fn_dict (required) [dict]

Multilevel dictionnary indicating the condition_label, sample_label and file name of the eventalign_collapse output. 2 conditions are expected and at least 2 sample replicates per condition are highly recommended. One can also pass YAML file describing the samples instead. Example d = {"S1": {"R1":"path1.tsv", "R2":"path2.tsv"}, "S2": {"R1":"path3.tsv", "R2":"path4.tsv"}}

  • fasta_fn (required) [str]

Path to a fasta file corresponding to the reference used for read alignment.

  • bed_fn (default: None) [str]

Path to a BED file containing the annotation of the transcriptome used as reference when mapping.

  • outpath (default: results) [str]

Path to the output folder.

  • outprefix (default: out_) [str]

text outprefix for all the files generated by the function.

  • overwrite (default: False) [bool]

If the output directory already exists, the standard behaviour is to raise an error to prevent overwriting existing data This option ignore the error and overwrite data if they have the same outpath and outprefix.

  • whitelist (default: None) [Whitelist]

Whitelist object previously generated with nanocompore Whitelist. If not given, will be automatically generated.

  • comparison_methods (default: ['GMM', 'KS']) [list]

Statistical method to compare the 2 samples (mann_whitney or MW, kolmogorov_smirnov or KS, t_test or TT, gaussian_mixture_model or GMM). This can be a list or a comma separated string. {MW,KS,TT,GMM}

  • logit (default: False) [bool]

Force logistic regression even if we have less than 2 replicates in any condition.

  • allow_warnings (default: False) [bool]

If True runtime warnings during the ANOVA tests don't raise an error.

  • sequence_context (default: 0) [int]

Extend statistical analysis to contigous adjacent base if available.

  • sequence_context_weights (default: uniform) [str]

type of weights to used for combining p-values. {uniform,harmonic}

  • min_coverage (default: 30) [int]

minimal read coverage required in all sample.

  • min_ref_length (default: 100) [int]

minimal length of a reference transcript to be considered in the analysis

  • downsample_high_coverage (default: 1000) [int]

For reference with higher coverage, downsample by randomly selecting reads.

  • max_invalid_kmers_freq (default: 0.1) [float]

maximum frequency of NNNNN, mismatching and missing kmers in reads.

  • select_ref_id (default: []) [list]

if given, only reference ids in the list will be selected for the analysis.

  • exclude_ref_id (default: []) [list]

if given, refid in the list will be excluded from the analysis.

  • nthreads (default: 3) [int]

Number of threads (two are used for reading and writing, all the others for parallel processing).

  • progress (default: False) [bool]

Display a progress bar during execution

CLI documentation

nanocompore sampcomp --help
usage: nanocompore sampcomp [-h] [--sample_yaml sample_yaml]
                            [--file_list1 /path/to/Condition1_rep1,/path/to/Condition1_rep2]
                            [--file_list2 /path/to/Condition2_rep1,/path/to/Condition2_rep2]
                            [--label1 Condition1] [--label2 Condition2]
                            --fasta FASTA [--bed BED]
                            [--max_invalid_kmers_freq MAX_INVALID_KMERS_FREQ]
                            [--min_coverage MIN_COVERAGE]
                            [--downsample_high_coverage DOWNSAMPLE_HIGH_COVERAGE]
                            [--min_ref_length MIN_REF_LENGTH]
                            [--comparison_methods COMPARISON_METHODS]
                            [--sequence_context {0,1,2,3,4}]
                            [--sequence_context_weights {uniform,harmonic}]
                            [--pvalue_thr PVALUE_THR] [--logit]
                            [--allow_warnings] [--nthreads NTHREADS]
                            [--outpath OUTPATH] [--outprefix OUTPREFIX]
                            [--overwrite] [--log_level {warning,info,debug}]
                            [--progress]

Compare 2 samples and find significant signal differences

* Minimal example with file_list arguments

    nanocompore sampcomp -1 f1.tsv,f2.tsv -2 f3.tsv,f4.tsv -f ref.fa -o results
* Minimal example with sample YAML file

    nanocompore sampcomp -y samples.yaml -f ref -o results

optional arguments:
  -h, --help            show this help message and exit

YAML sample files:
  Option allowing to describe sample files in a YAML file

  --sample_yaml sample_yaml, -y sample_yaml
                        YAML file containing the sample file labels. See
                        formatting in documentation. (required if --file_list1
                        and --file_list2 not given)

Arguments sample files:
  Option allowing to describe sample files directly as command line arguments

  --file_list1 /path/to/Condition1_rep1,/path/to/Condition1_rep2, -1 /path/to/Condition1_rep1,/path/to/Condition1_rep2
                        Comma separated list of NanopolishComp files for label
                        1. (required if --sample_yaml not given)
  --file_list2 /path/to/Condition2_rep1,/path/to/Condition2_rep2, -2 /path/to/Condition2_rep1,/path/to/Condition2_rep2
                        Comma separated list of NanopolishComp files for label
                        2. (required if --sample_yaml not given)
  --label1 Condition1   Label for files in --file_list1 (default: Condition1)
  --label2 Condition2   Label for files in --file_list2 (default: Condition2)

Input options:
  --fasta FASTA, -f FASTA
                        Fasta file used for mapping (required)
  --bed BED             BED file with annotation of transcriptome used for
                        mapping (optional)

Transcript filtering options:
  --max_invalid_kmers_freq MAX_INVALID_KMERS_FREQ
                        Max fequency of invalid kmers (default: 0.1)
  --min_coverage MIN_COVERAGE
                        Minimum coverage required in each condition to do the
                        comparison (default: 30)
  --downsample_high_coverage DOWNSAMPLE_HIGH_COVERAGE
                        Transcripts with high coverage will be downsampled
                        (default: 5000)
  --min_ref_length MIN_REF_LENGTH
                        Minimum length of a reference transcript to include it
                        in the analysis (default: 100)

Statistical testing options:
  --comparison_methods COMPARISON_METHODS
                        Comma separated list of comparison methods. Valid
                        methods are: GMM,KS,TT,MW. (default: GMM,KS)
  --sequence_context {0,1,2,3,4}
                        Sequence context for combining p-values (default: 0)
  --sequence_context_weights {uniform,harmonic}
                        Type of weights to use for combining p-values
  --pvalue_thr PVALUE_THR
                        Adjusted p-value threshold for reporting significant
                        sites (default: 0.05)
  --logit               Use logistic regression testing also when all
                        conditions have replicates (default: False)
  --allow_warnings      If True runtime warnings during the ANOVA tests don't
                        raise an error (default: False)

Other options:
  --nthreads NTHREADS, -t NTHREADS
                        Number of threads (default: 3)

Output options:
  --outpath OUTPATH, -o OUTPATH
                        Path to the output folder (default: ./)
  --outprefix OUTPREFIX, -p OUTPREFIX
                        text outprefix for all the files generated (default:
                        out)
  --overwrite, -w       Use --outpath even if it exists already (default:
                        False)

Verbosity options:
  --log_level {warning,info,debug}
                        Set the log level (default: info)
  --progress            Display a progress bar during execution (default:
                        False)