Skip to content

Compare samples with SampComp

In brief

First, SampComp parses the sample eventalign collapse files and then the observed results are piled-up per reference at position level. In a second time, positions are compared using various statistical methods and the statistics are stored in a shelve DBM database containing the results for all positions with sufficient coverage. The API returns a SampCompDB database wrapper object that can be subsequently interrogated to extract data and plots.

Quick start

Example CLI call

nanocompore sampcomp \
    --file_list1 ./data/S1_R1.tsv,./data/S1_R2.tsv \
    --file_list2 ./data/S2_R1.tsv,./data/S2_R2.tsv \
    --label1 S1 \
    --label2 S2 \
    --fasta ./reference/ref.fa \
    --outpath ./results/ \

Example API call

# Import package
from nanocompore.SampComp import SampComp

# Init the object
s = SampComp(
    eventalign_fn_dict = {
        'S1':{'rep1':'./data/S1_R1.tsv', 'rep2':'./data/S1_R2.tsv'},
        'S2':{'rep1':'./data/S2_R1.tsv', 'rep2':'./data/S2_R2.tsv'}},
    outpath = "./results/",
    fasta_fn = "./reference/ref.fa")

# Run the analysis
s()

Description of main options

SampComp provides a very flexible analysis framework with a few mandatory options and many optional parameters. The full CLI and API documentations are provided at the bottom of this page.

Sample files

SampComp requires sample files obtained with NanopolishComp EventalignCollapse as explained before (see data preparation) for both the control and the experimental conditions. 2 conditions are expected and at least 2 replicates per conditions are highly recommended. If SampComp is called through the CLI the files can be provided using either relevant command options or a YAML file. With the Python API you can pass either a python dictionary or a YAML file.

YAML file option (CLI or API)

This option allows to pass a YAML formatted file indicating the sample condition labels and paths to data files with the option --sample_yaml for the CLI or directly with eventalign_fn_dict for the API. The file should be formatted as follow:

WT:
    rep1:   ./data/S1_R1.tsv
    rep2:   ./data/S1_R2.tsv

KO:
    rep1:   ./data/S2_R1.tsv
    rep2:   ./data/S1_R2.tsv

Command line option (CLI only)

This option requires to provide a comma separated list of files per condition using --file_list1 and --file_list2 arguments as well as the labels for condition each conditions using --label1 and --label2.

Python dictionary (API only)

This option allows to pass a multi-level python dictionary containing the sample condition labels and paths to data files. The dictionary should be formatted as follow:

eventalign_fn_dict = {
    "WT":  {"rep1":"./data/S1_R1.tsv", "rep2":"./data/S1_R2.tsv"},
    "KO": {"rep1":"./data/S2_R1.tsv", "rep2":"./data/S1_R2.tsv"}
    }

Transcriptome reference FASTA file

A transcriptome FASTA reference file is required to extract kmer sequences during the analyses. The reference has to be the same as the one used at the mapping step. (CLI: --fasta, API: fasta_fn)

Output folder

Although it is not mandatory, it is recommended to provide a path to a directory where the program will output the result files (CLI: --outpath, API: outpath). In addition, users can also specify a prefix for the files to be generated (CLI: --outprefix, API: outprefix). Finally, if the outpath directory already exists, the program will raise an error to avoid erasing result files. To ignore the error you can to specify to overwrite previous results (CLI: --overwrite, API: overwrite).

Genome annotation BED file

Optionally, a BED file containing the genome annotations corresponding to the transcriptome fasta file can be provided. In that case Nanocompore will also convert the transcript coordinates into the genome space (CLI: --bed, API: bed_fn)

Statistical options

SampComp implements several statistical methods to evaluate the difference between the 2 conditions (comparison_method).

  • Gaussian Mixture Model = GMM (default)
  • Kolmogorov–Smirnov test = KS (default)
  • Mann–Whitney U test = MW
  • T-test = TT

In addition, it is also possible to specify the number of adjacent positions to take into account for the pvalue calculation (sequence_context) as well as the weights to give to adjacent position, using either an "uniform" or a "harmonic" distribution (sequence_context_weights).

Coverage options

The default coverage threshold for SampComp to perform a statistical test is 30 reads in each replicates. This is quite conservative and can be modified if needed (min_coverage). In addition, to reduce the computational burden it is possible to randomly down-sample the number of reads for high coverage references (downsample_high_coverage).

Manually exclude or include references (API only)

The API allows to specify references to be included or excluded from the analysis (select_ref_id and exclude_ref_id). This can be useful to analyse a specific set of transcripts or to run a small test before analysing the whole dataset.

Advanced API Usage

Import the package

from nanocompore.SampComp import SampComp

Using a Python dictionary to specify the locations of the eventalign files

Default option using a multilevel dictionary.

# Init the object
s = SampComp ( 
    eventalign_fn_dict = {
        "KO":{
            "rep1":'./eventalign_files/yeast/KO_1_eventalign_collapsed.tsv', 
            "rep2":"./eventalign_files/yeast/KO_2_eventalign_collapsed.tsv"},
        "WT":{
            "rep1":'./eventalign_files/yeast/WT_1_eventalign_collapsed.tsv', 
            "rep2":"./eventalign_files/yeast/WT_2_eventalign_collapsed.tsv"}},
    outpath = "./results",
    outprefix= "yeast_",
    fasta_fn = "./references/yeast/Yeast_transcriptome.fa",
    overwrite = True)

# Run the analysis
db = s()
Initialising SampComp and checking options
Initialising Whitelist and checking options
Reading eventalign index files
    References found in index: 125
Filtering out references with low coverage
    References remaining after reference coverage filtering: 1
Starting data processing
100%|██████████| 1/1 [00:25<00:00, 25.07s/ Processed References]
Loading SampCompDB
Calculate results

Using a YAML file instead to specify the files location

To use a YAML file, just pass a path to the file instead of a dictionary. In this example we also specify to use 10 threads to speed-up the data processing (nthreads = 10).

# Read YAML file to display structure
print(open("./samples.yaml").read())
Modified:
    rep1:   ./eventalign_files/simulated/modified_rep_1.tsv
    rep2:   ./eventalign_files/simulated/modified_rep_2.tsv
Unmodified:
    rep1:   ./eventalign_files/simulated/unmodified_rep_1.tsv
    rep2:   ./eventalign_files/simulated/unmodified_rep_2.tsv
# Init the object
s = SampComp (
    eventalign_fn_dict = "./samples.yaml",
    outpath= "./results",
    outprefix= "simulated_",
    fasta_fn = "./references/simulated/ref.fa",
    overwrite = True,
    nthreads = 10)

# Run the analysis
db = s ()
Initialising SampComp and checking options
Initialising Whitelist and checking options
Reading eventalign index files
    References found in index: 5
Filtering out references with low coverage
    References remaining after reference coverage filtering: 5
Starting data processing
100%|██████████| 5/5 [00:05<00:00,  2.41s/ Processed References]
Loading SampCompDB
Calculate results

Tweaking the statistical tests

In this example, we require all 4 possible statistical tests to be performed (comparison_methods=["GMM", "MW", "KS", "TT"]), and we want a pvalue that takes into account 2 adjacent positions(sequence_context=2). In addition we also specify that the weights for adjacent position should follow a harmonic distribution (sequence_context_weights="harmonic"). Finally, we also want the Gaussian mixture model to use the logistic regression test on top of the ANOVA (logit = True).

# Init the object
s = SampComp (
    eventalign_fn_dict = "./samples.yaml",
    outpath = "./results/",
    outprefix= "simulated_stats_",
    fasta_fn = "./references/simulated/ref.fa",
    overwrite=True,
    comparison_methods=["GMM", "MW", "KS", "TT"],
    sequence_context=2,
    sequence_context_weights="harmonic",
    logit = True)

# Run the analysis
db = s ()
Initialising SampComp and checking options
Initialising Whitelist and checking options
Reading eventalign index files
    References found in index: 5
Filtering out references with low coverage
    References remaining after reference coverage filtering: 5
Starting data processing
100%|██████████| 5/5 [00:17<00:00,  3.45s/ Processed References]
Loading SampCompDB
Calculate results

Explore the database object returned by Sampcomp

The database object returned by Sampcomp is a Python GDBM object database indexed by reference id. Although we highly recommend to used SampCompDB to extract information, it is also possible to open the database directly using the python standard library module shelve.

import shelve

with shelve.open ("./results/simulated_SampComp.db") as db:
    # Read metadata stored in metadata slot
    print(db["__metadata"])
    # Read list of references ids stored in ref_id_list slot
    print(db["__ref_id_list"])
    # Access stats for reference 'ref_0000' in position 1
    print(db["ref_0000"][1]["txComp"])
{'package_name': 'nanocompore', 'package_version': '1.0.0rc3-dev', 'timestamp': '2019-06-19 14:11:42.582069', 'comparison_methods': ['GMM', 'KS'], 'pvalue_tests': ['GMM_anova_pvalue', 'KS_dwell_pvalue', 'KS_intensity_pvalue'], 'sequence_context': 0, 'min_coverage': 30, 'n_samples': 4}
['ref_0002', 'ref_0000', 'ref_0003', 'ref_0001', 'ref_0004']
{'GMM_model': {'model': GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
                means_init=None, n_components=2, n_init=1, precisions_init=None,
                random_state=<mtrand.RandomState object at 0x7ff59c481ee8>,
                reg_covar=1e-06, tol=0.001, verbose=0, verbose_interval=10,
                warm_start=False, weights_init=None), 'cluster_counts': 'Modified_rep1:6/49__Modified_rep2:6/54__Unmodified_rep1:50/5__Unmodified_rep2:52/8'}, 'GMM_anova_pvalue': 0.0022723661815992765, 'GMM_anova_model': {'pvalue': 0.0022723661815992765, 'delta_logit': -3.9703346955, 'table': F_onewayResult(statistic=438.5704870887501, pvalue=0.0022723661815992765), 'log_ratios': array([-1.96611286, -2.06142304,  2.14006616,  1.77306734])}, 'KS_intensity_pvalue': 2.2306904419226962e-17, 'KS_dwell_pvalue': 1.0667209507829918e-31, 'shift_stats': OrderedDict([('c1_mean_intensity', 99.54709799468591), ('c2_mean_intensity', 91.84415815799566), ('c1_median_intensity', 100.61305720731731), ('c2_median_intensity', 92.05605629218894), ('c1_sd_intensity', 6.85551651760893), ('c2_sd_intensity', 5.143626863495356), ('c1_mean_dwell', 0.02567361308475615), ('c2_mean_dwell', 0.009745078440682191), ('c1_median_dwell', 0.023047132241771167), ('c2_median_dwell', 0.007591605413309661), ('c1_sd_dwell', 0.010681688969392145), ('c2_sd_dwell', 0.007203231747770627)])}

Full CLI and API documentations

API documentation

API help can be obtained with conventional python methods (help or ?) or rendered nicely in Jupyter with the jhelp function from nanocompore

from nanocompore.SampComp import SampComp
from nanocompore.common import jhelp
jhelp(SampComp)

SampComp (eventalign_fn_dict, fasta_fn, bed_fn, outpath, outprefix, overwrite, whitelist, comparison_methods, logit, allow_warnings, sequence_context, sequence_context_weights, min_coverage, min_ref_length, downsample_high_coverage, max_invalid_kmers_freq, select_ref_id, exclude_ref_id, nthreads, log_level)

Init analysis and check args


  • eventalign_fn_dict (required) [dict]

Multilevel dictionnary indicating the condition_label, sample_label and file name of the eventalign_collapse output. 2 conditions are expected and at least 2 sample replicates per condition are highly recommended. One can also pass YAML file describing the samples instead. Example d = {"S1": {"R1":"path1.tsv", "R2":"path2.tsv"}, "S2": {"R1":"path3.tsv", "R2":"path4.tsv"}}

  • fasta_fn (required) [str]

Path to a fasta file corresponding to the reference used for read alignment.

  • bed_fn (default: None) [str]

Path to a BED file containing the annotation of the transcriptome used as reference when mapping.

  • outpath (default: results) [str]

Path to the output folder.

  • outprefix (default: out_) [str]

text outprefix for all the files generated by the function.

  • overwrite (default: False) [bool]

If the output directory already exists, the standard behaviour is to raise an error to prevent overwriting existing data This option ignore the error and overwrite data if they have the same outpath and outprefix.

  • whitelist (default: None) [Whitelist]

Whitelist object previously generated with nanocompore Whitelist. If not given, will be automatically generated.

  • comparison_methods (default: ['GMM', 'KS']) [list]

Statistical method to compare the 2 samples (mann_whitney or MW, kolmogorov_smirnov or KS, t_test or TT, gaussian_mixture_model or GMM). This can be a list or a comma separated string. {MW,KS,TT,GMM}

  • logit (default: False) [bool]

Force logistic regression even if we have less than 2 replicates in any condition.

  • allow_warnings (default: False) [bool]

If True runtime warnings during the ANOVA tests don't raise an error.

  • sequence_context (default: 0) [int]

Extend statistical analysis to contigous adjacent base if available.

  • sequence_context_weights (default: uniform) [str]

type of weights to used for combining p-values. {uniform,harmonic}

  • min_coverage (default: 30) [int]

minimal read coverage required in all sample.

  • min_ref_length (default: 100) [int]

minimal length of a reference transcript to be considered in the analysis

  • downsample_high_coverage (default: 0) [int]

For reference with higher coverage, downsample by randomly selecting reads.

  • max_invalid_kmers_freq (default: 0.1) [float]

maximum frequency of NNNNN, mismatching and missing kmers in reads.

  • select_ref_id (default: []) [list]

if given, only reference ids in the list will be selected for the analysis.

  • exclude_ref_id (default: []) [list]

if given, refid in the list will be excluded from the analysis.

  • nthreads (default: 3) [int]

Number of threads (two are used for reading and writing, all the others for parallel processing).

  • log_level (default: info) [str]

Set the log level. {warning,info,debug}

CLI documentation

nanocompore sampcomp --help
usage: nanocompore sampcomp [-h] [--sample_yaml sample_yaml]
                            [--file_list1 /path/to/Condition1_rep1,/path/to/Condition1_rep2]
                            [--file_list2 /path/to/Condition2_rep1,/path/to/Condition2_rep2]
                            [--label1 Condition1] [--label2 Condition2]
                            --fasta FASTA [--bed BED] [--outpath OUTPATH]
                            [--outprefix OUTPREFIX] [--overwrite]
                            [--max_invalid_kmers_freq MAX_INVALID_KMERS_FREQ]
                            [--min_coverage MIN_COVERAGE]
                            [--downsample_high_coverage DOWNSAMPLE_HIGH_COVERAGE]
                            [--min_ref_length MIN_REF_LENGTH]
                            [--comparison_methods COMPARISON_METHODS]
                            [--sequence_context {0,1,2,3,4}]
                            [--sequence_context_weights {uniform,harmonic}]
                            [--pvalue_thr PVALUE_THR] [--logit]
                            [--allow_warnings] [--nthreads NTHREADS]
                            [--log_level {warning,info,debug}]

Compare 2 samples and find significant signal

* Minimal example with file_list arguments

    nanocompore sampcomp -1 f1.tsv,f2.tsv -2 f3.tsv,f4.tsv -f ref.fa -o results
* Minimal example with sample YAML file

    nanocompore sampcomp -y samples.yaml -f ref -o results

optional arguments:
  -h, --help            show this help message and exit

YAML sample files:
  Option allowing to describe sample files in a YAML file

  --sample_yaml sample_yaml, -y sample_yaml
                        YAML file containing the sample file labels. See
                        formatting in documentation. (required if --file_list1
                        and --file_list2 not given)

Arguments sample files:
  Option allowing to describe sample files directly as command line arguments

  --file_list1 /path/to/Condition1_rep1,/path/to/Condition1_rep2, -1 /path/to/Condition1_rep1,/path/to/Condition1_rep2
                        Comma separated list of NanopolishComp files for label
                        1. (required if --sample_yaml not given)
  --file_list2 /path/to/Condition2_rep1,/path/to/Condition2_rep2, -2 /path/to/Condition2_rep1,/path/to/Condition2_rep2
                        Comma separated list of NanopolishComp files for label
                        2. (required if --sample_yaml not given)
  --label1 Condition1   Label for files in --file_list1 (default: Condition1)
  --label2 Condition2   Label for files in --file_list2 (default: Condition2)

Input/Output options:
  --fasta FASTA, -f FASTA
                        Fasta file used for mapping (required)
  --bed BED             BED file with annotation of transcriptome used for
                        mapping (optional)
  --outpath OUTPATH, -o OUTPATH
                        Path to the output folder (default: results)
  --outprefix OUTPREFIX, -p OUTPREFIX
                        text outprefix for all the files generated by the
                        function (default: out_)
  --overwrite           Use --outpath even if it exists already (default:
                        False)

Transcript filtering options:
  --max_invalid_kmers_freq MAX_INVALID_KMERS_FREQ
                        Max fequency of invalid kmers (default: 0.1)
  --min_coverage MIN_COVERAGE
                        Minimum coverage required in each condition to do the
                        comparison (default: 30)
  --downsample_high_coverage DOWNSAMPLE_HIGH_COVERAGE
                        Used for debug: transcripts with high covergage will
                        be downsampled (default: 0)
  --min_ref_length MIN_REF_LENGTH
                        Minimum length of a reference transcript to include it
                        in the analysis (default: 100)

Statistical testing options:
  --comparison_methods COMPARISON_METHODS
                        Comma separated list of comparison methods. Valid
                        methods are: GMM,KS,TT,MW. (default: GMM,KS)
  --sequence_context {0,1,2,3,4}
                        Sequence context for combining p-values (default: 0)
  --sequence_context_weights {uniform,harmonic}
                        Type of weights to use for combining p-values
  --pvalue_thr PVALUE_THR
                        Adjusted p-value threshold for reporting significant
                        sites (default: 0.05)
  --logit               Use logistic regression testing also when all
                        conditions have replicates (default: False)
  --allow_warnings      If True runtime warnings during the ANOVA tests don't
                        raise an error (default: False)

Other options:
  --nthreads NTHREADS, -t NTHREADS
                        Number of threads (default: 3)
  --log_level {warning,info,debug}
                        log level (default: info)