Compare samples with SampComp¶
In brief¶
First, SampComp
parses the sample eventalign collapse files and then the observed results are piled-up per reference at position level. In a second time, positions are compared using various statistical methods and the statistics are stored in a shelve DBM database containing the results for all positions with sufficient coverage. The API returns a SampCompDB
database wrapper object that can be subsequently interrogated to extract data and plots.
Quick start¶
Example CLI call
nanocompore sampcomp \
--file_list1 ./data/S1_R1.tsv,./data/S1_R2.tsv \
--file_list2 ./data/S2_R1.tsv,./data/S2_R2.tsv \
--label1 S1 \
--label2 S2 \
--fasta ./reference/ref.fa \
--outpath ./results
Example API call
# Import package
from nanocompore.SampComp import SampComp
# Init the object
s = SampComp(
eventalign_fn_dict = {
'S1':{'rep1':'./data/S1_R1.tsv', 'rep2':'./data/S1_R2.tsv'},
'S2':{'rep1':'./data/S2_R1.tsv', 'rep2':'./data/S2_R2.tsv'}},
outpath = "./results/",
fasta_fn = "./reference/ref.fa")
# Run the analysis
s()
Description of main options¶
SampComp
provides a very flexible analysis framework with a few mandatory options and many optional parameters. The full CLI and API documentations are provided at the bottom of this page.
Sample files¶
SampComp
requires sample files obtained with NanopolishComp EventalignCollapse
as explained before (see data preparation) for both the control and the experimental conditions. 2 conditions are expected and at least 2 replicates per conditions are highly recommended. If SampComp
is called through the CLI the files can be provided using either relevant command options or a YAML file. With the Python API you can pass either a python dictionary or a YAML file.
!!! info "YAML file option (CLI or API)"
This option allows to pass a YAML formatted file indicating the sample condition labels and paths to data files with the option --sample_yaml
for the CLI or directly with eventalign_fn_dict
for the API. The file should be formatted as follow:
```yaml
WT:
rep1: ./data/S1_R1.tsv
rep2: ./data/S1_R2.tsv
KO:
rep1: ./data/S2_R1.tsv
rep2: ./data/S1_R2.tsv
</code></pre>
<p>!!! info "Command line option (CLI only)"
This option requires to provide a comma separated list of files per condition using <code>--file_list1</code> and <code>--file_list2</code> arguments as well as the labels for condition each conditions using <code>--label1</code> and <code>--label2</code>.</p>
<p>!!! info "Python dictionary (API only)"
This option allows to pass a multi-level python dictionary containing the sample condition labels and paths to data files. The dictionary should be formatted as follow:</p>
<pre><code>```python
eventalign_fn_dict = {
"WT": {"rep1":"./data/S1_R1.tsv", "rep2":"./data/S1_R2.tsv"},
"KO": {"rep1":"./data/S2_R1.tsv", "rep2":"./data/S1_R2.tsv"}
}
```</code></pre>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Transcriptome-reference-FASTA-file">Transcriptome reference FASTA file<a class="anchor-link" href="#Transcriptome-reference-FASTA-file">¶</a></h4><p>A transcriptome FASTA reference file is required to extract kmer sequences during the analyses. The reference has to be the same as the one used at the mapping step. (CLI: <code>--fasta</code>, API: <code>fasta_fn</code>)</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Output-folder">Output folder<a class="anchor-link" href="#Output-folder">¶</a></h4><p>Although it is not mandatory, it is recommended to provide a path to a directory where the program will output the result files (CLI: <code>--outpath</code>, API: <code>outpath</code>). In addition, users can also specify a prefix for the files to be generated (CLI: <code>--outprefix</code>, API: <code>outprefix</code>). Finally, if the outpath directory already exists, the program will raise an error to avoid erasing result files. To ignore the error you can to specify to overwrite previous results (CLI: <code>--overwrite</code>, API: <code>overwrite</code>).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Genome-annotation-BED-file">Genome annotation BED file<a class="anchor-link" href="#Genome-annotation-BED-file">¶</a></h4><p>Optionally, a BED file containing the genome annotations corresponding to the transcriptome fasta file can be provided. In that case Nanocompore will also convert the transcript coordinates into the genome space (CLI: <code>--bed</code>, API: <code>bed_fn</code>)</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Statistical-options">Statistical options<a class="anchor-link" href="#Statistical-options">¶</a></h4><p><code>SampComp</code> implements several statistical methods to evaluate the difference between the 2 conditions (<code>comparison_method</code>).</p>
<ul>
<li>Gaussian Mixture Model = GMM (default)</li>
<li>Kolmogorov–Smirnov test = KS (default)</li>
<li>Mann–Whitney U test = MW</li>
<li>T-test = TT</li>
</ul>
<p>In addition, it is also possible to specify the number of adjacent positions to take into account for the pvalue calculation (<code>sequence_context</code>) as well as the weights to give to adjacent position, using either an "uniform" or a "harmonic" distribution (<code>sequence_context_weights</code>).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Coverage-options">Coverage options<a class="anchor-link" href="#Coverage-options">¶</a></h4><p>The default coverage threshold for <code>SampComp</code> to perform a statistical test is 30 reads in each replicates. This is quite conservative and can be modified if needed (<code>min_coverage</code>). In addition, to reduce the computational burden it is possible to randomly down-sample the number of reads for high coverage references (<code>downsample_high_coverage</code>).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Manually-exclude-or-include-references-(API-only)">Manually exclude or include references (API only)<a class="anchor-link" href="#Manually-exclude-or-include-references-(API-only)">¶</a></h4><p>The API allows to specify references to be included or excluded from the analysis (<code>select_ref_id</code> and
<code>exclude_ref_id</code>). This can be useful to analyse a specific set of transcripts or to run a small test before analysing the whole dataset.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Advanced-API-Usage">Advanced API Usage<a class="anchor-link" href="#Advanced-API-Usage">¶</a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Import-the-package">Import the package<a class="anchor-link" href="#Import-the-package">¶</a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
```python
from nanocompore.SampComp import SampComp
from nanocompore.common import *
import shutil
# Create output dir and define log level
outdir = "results"
shutil.rmtree (outdir)
mkdir(outdir, exist_ok=True)
set_logger("info")
Using a Python dictionary to specify the locations of the eventalign files¶
Default option using a multilevel dictionary.
# Init the object s = SampComp ( eventalign_fn_dict = { "KO":{ "rep1":'./eventalign_files/yeast/KO_1_eventalign_collapsed.tsv', "rep2":"./eventalign_files/yeast/KO_2_eventalign_collapsed.tsv"}, "WT":{ "rep1":'./eventalign_files/yeast/WT_1_eventalign_collapsed.tsv', "rep2":"./eventalign_files/yeast/WT_2_eventalign_collapsed.tsv"}}, outpath = "./results", outprefix= "yeast_", nthreads=3, max_invalid_kmers_freq = 0.2, min_coverage=30, fasta_fn = "./references/yeast/Yeast_transcriptome.fa", overwrite = True) # Run the analysis db = s()
Using a YAML file instead to specify the files location¶
To use a YAML file, just pass a path to the file instead of a dictionary. In this example we also specify to use 10 threads to speed-up the data processing (nthreads = 10
).
# Read YAML file to display structure print(open("./samples.yaml").read())
# Init the object s = SampComp ( eventalign_fn_dict = "./samples.yaml", outpath= "./results", outprefix= "simulated_", fasta_fn = "./references/simulated/ref.fa", overwrite = True, progress = True, nthreads = 4) # Run the analysis db = s ()
Tweaking the statistical tests¶
In this example, we require all 4 possible statistical tests to be performed (comparison_methods=["GMM", "MW", "KS", "TT"]
), and we want a pvalue that takes into account 2 adjacent positions(sequence_context=2
). In addition we also specify that the weights for adjacent position should follow a harmonic distribution (sequence_context_weights="harmonic"
). Finally, we also want the Gaussian mixture model to use the logistic regression test on top of the ANOVA (logit = True
).
# Init the object s = SampComp ( eventalign_fn_dict = "./samples.yaml", outpath = "./results/", outprefix= "simulated_stats_", fasta_fn = "./references/simulated/ref.fa", overwrite=True, comparison_methods=["GMM", "MW", "KS", "TT"], sequence_context=2, sequence_context_weights="harmonic", logit = True, nthreads = 6) # Run the analysis db = s ()
Explore the database object returned by Sampcomp¶
The database object returned by Sampcomp
is a Python GDBM object database indexed by reference id. Although we highly recommend to used SampCompDB
to extract information, it is also possible to open the database directly using the python standard library module shelve.
import shelve with shelve.open ("./results/simulated_SampComp.db") as db: # Read metadata stored in metadata slot print(db["__metadata"]) # Read list of references ids stored in ref_id_list slot print(db["__ref_id_list"]) # Access stats for reference 'ref_0000' in position 1 print(db["ref_0000"][1]["txComp"])
Full CLI and API documentations¶
API documentation¶
API help can be obtained with conventional python methods (help or ?) or rendered nicely in Jupyter with the jhelp function from nanocompore
from nanocompore.SampComp import SampComp from nanocompore.common import jhelp jhelp(SampComp)
CLI documentation¶
nanocompore sampcomp --help
usage: nanocompore sampcomp [-h] [--sample_yaml sample_yaml]
[--file_list1 /path/to/Condition1_rep1,/path/to/Condition1_rep2]
[--file_list2 /path/to/Condition2_rep1,/path/to/Condition2_rep2]
[--label1 Condition1] [--label2 Condition2]
--fasta FASTA [--bed BED]
[--max_invalid_kmers_freq MAX_INVALID_KMERS_FREQ]
[--min_coverage MIN_COVERAGE]
[--downsample_high_coverage DOWNSAMPLE_HIGH_COVERAGE]
[--min_ref_length MIN_REF_LENGTH]
[--comparison_methods COMPARISON_METHODS]
[--sequence_context {0,1,2,3,4}]
[--sequence_context_weights {uniform,harmonic}]
[--pvalue_thr PVALUE_THR] [--logit]
[--allow_warnings] [--nthreads NTHREADS]
[--outpath OUTPATH] [--outprefix OUTPREFIX]
[--overwrite] [--log_level {warning,info,debug}]
[--progress]
Compare 2 samples and find significant signal differences
* Minimal example with file_list arguments
nanocompore sampcomp -1 f1.tsv,f2.tsv -2 f3.tsv,f4.tsv -f ref.fa -o results
* Minimal example with sample YAML file
nanocompore sampcomp -y samples.yaml -f ref -o results
optional arguments:
-h, --help show this help message and exit
YAML sample files:
Option allowing to describe sample files in a YAML file
--sample_yaml sample_yaml, -y sample_yaml
YAML file containing the sample file labels. See
formatting in documentation. (required if --file_list1
and --file_list2 not given)
Arguments sample files:
Option allowing to describe sample files directly as command line arguments
--file_list1 /path/to/Condition1_rep1,/path/to/Condition1_rep2, -1 /path/to/Condition1_rep1,/path/to/Condition1_rep2
Comma separated list of NanopolishComp files for label
1. (required if --sample_yaml not given)
--file_list2 /path/to/Condition2_rep1,/path/to/Condition2_rep2, -2 /path/to/Condition2_rep1,/path/to/Condition2_rep2
Comma separated list of NanopolishComp files for label
2. (required if --sample_yaml not given)
--label1 Condition1 Label for files in --file_list1 (default: Condition1)
--label2 Condition2 Label for files in --file_list2 (default: Condition2)
Input options:
--fasta FASTA, -f FASTA
Fasta file used for mapping (required)
--bed BED BED file with annotation of transcriptome used for
mapping (optional)
Transcript filtering options:
--max_invalid_kmers_freq MAX_INVALID_KMERS_FREQ
Max fequency of invalid kmers (default: 0.1)
--min_coverage MIN_COVERAGE
Minimum coverage required in each condition to do the
comparison (default: 30)
--downsample_high_coverage DOWNSAMPLE_HIGH_COVERAGE
Transcripts with high coverage will be downsampled
(default: 5000)
--min_ref_length MIN_REF_LENGTH
Minimum length of a reference transcript to include it
in the analysis (default: 100)
Statistical testing options:
--comparison_methods COMPARISON_METHODS
Comma separated list of comparison methods. Valid
methods are: GMM,KS,TT,MW. (default: GMM,KS)
--sequence_context {0,1,2,3,4}
Sequence context for combining p-values (default: 0)
--sequence_context_weights {uniform,harmonic}
Type of weights to use for combining p-values
--pvalue_thr PVALUE_THR
Adjusted p-value threshold for reporting significant
sites (default: 0.05)
--logit Use logistic regression testing also when all
conditions have replicates (default: False)
--allow_warnings If True runtime warnings during the ANOVA tests don't
raise an error (default: False)
Other options:
--nthreads NTHREADS, -t NTHREADS
Number of threads (default: 3)
Output options:
--outpath OUTPATH, -o OUTPATH
Path to the output folder (default: ./)
--outprefix OUTPREFIX, -p OUTPREFIX
text outprefix for all the files generated (default:
out)
--overwrite, -w Use --outpath even if it exists already (default:
False)
Verbosity options:
--log_level {warning,info,debug}
Set the log level (default: info)
--progress Display a progress bar during execution (default:
False)