Outputs
While running, Nanocompore stores the results in an SQLite database. After all transcripts are analyzed, it performs a postprocessing step and writes an easy-to-read TSV (tab-separated values) file.
Results TSV
The results TSV is written to the output directory, typically as "out_nanocompore_results.tsv". Some columns are always present in the results, while others depend on the input configuration. For example, changing the statistical tests that Nanocompore performs changes the set of output columns.
Mandatory output columns
Column | Description | Example |
---|---|---|
pos | Starting position on the transcript's reference sequence of the k-mer for which the row reports results. Indexing is 0-based, meaning that the first nucleotide on the transcript is position 0. | 42 |
chr | Chromosome id. This will be set only if a GTF file is provided using the gtf parameter in the configuration. | chr1 |
genomicPos | The position on the chromosome corresponding to the position on the transcript. This is set only when a GTF file is provided. | 1324576 |
ref_id | Reference name of the transcript. | ENST00000676788.1 |
strand | Genomic strand. This is set only when a GTF file is provided. | + |
ref_kmer | The k-mer sequence as found in the reference. | GCTACGT |
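Because pos is 0-based and marks the start of the k-mer, it can be used directly as a Python slice index into the transcript sequence. The sketch below checks a row against a reference sequence; the sequence and the helper name `kmer_at` are made up for illustration and are not part of Nanocompore.

```python
def kmer_at(ref_seq: str, pos: int, k: int) -> str:
    """Return the k-mer that starts at the 0-based position `pos`."""
    if pos < 0 or pos + k > len(ref_seq):
        raise ValueError("k-mer falls outside the reference sequence")
    return ref_seq[pos:pos + k]


# Hypothetical transcript sequence, for illustration only.
ref_seq = "AAGCTACGTTT"

# Values as they would be read from the TSV (all strings).
row = {"pos": "2", "ref_kmer": "GCTACGT"}

kmer = kmer_at(ref_seq, int(row["pos"]), len(row["ref_kmer"]))
matches = kmer == row["ref_kmer"]

# Tools that expect 1-based coordinates need pos + 1.
one_based_start = int(row["pos"]) + 1
```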
Gaussian-Mixture Models (GMM) output columns
Column | Description | Example |
---|---|---|
GMM_chi2_pvalue | p-value from a chi-squared test of association between the reads' condition labels and the cluster assignments obtained from the GMM. | 0.00123 |
GMM_chi2_qvalue | The p-value corrected for multiple testing using the Benjamini-Hochberg procedure. | 1.0 |
GMM_LOR | The log odds ratio from the GMM. Suppose that we have a wildtype (WT) condition and a mod-writer knockdown (KD) condition and that we have clusters C1 and C2 detected by the GMM, then GMM_LOR = ln((WT_C1/WT_C2)/(KD_C1/KD_C2)), where ln denotes the natural logarithm. | 1.3 |
<SAMPLE>_mod | The number of reads for the given k-mer assigned to the GMM cluster considered to represent the modification state. | 47 |
<SAMPLE>_unmod | The number of reads for the given k-mer assigned to the GMM cluster considered to represent the non-modification state. | 72 |
The last two columns are repeated for each of the samples provided in the input configuration YAML file. <SAMPLE> will be substituted with the sample label used in the configuration.
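As a worked example of the GMM_LOR formula in the table above, the sketch below evaluates it for one set of read counts. The counts are invented, and the function is only a direct transcription of the formula: it does not show how Nanocompore itself handles empty clusters or pools replicates.

```python
import math


def log_odds_ratio(wt_c1: int, wt_c2: int, kd_c1: int, kd_c2: int) -> float:
    """ln((WT_C1/WT_C2) / (KD_C1/KD_C2)), as given in the column description."""
    return math.log((wt_c1 / wt_c2) / (kd_c1 / kd_c2))


# Hypothetical read counts per condition and cluster.
lor = log_odds_ratio(wt_c1=60, wt_c2=40, kd_c1=20, kd_c2=80)
# (60/40) / (20/80) = 1.5 / 0.25 = 6, so lor equals ln(6), roughly 1.79.

# Swapping the two conditions flips the sign but keeps the magnitude.
lor_swapped = log_odds_ratio(wt_c1=20, wt_c2=80, kd_c1=60, kd_c2=40)
```

A count of zero in any denominator raises ZeroDivisionError in this sketch, so it is suitable only for illustrating the arithmetic.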
IMPORTANT:
We recommend using both the q-value and the GMM_LOR value when filtering the results. The q-value indicates how likely it is that the separation between the two conditions is due to chance, while the LOR measures the size of that separation. As a rule of thumb, we suggest treating a site as modified when its q-value <= 0.01 and |GMM_LOR| >= 0.5 (i.e. the absolute value of the LOR is at least 0.5).
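The rule of thumb above can be applied with a few lines of standard-library Python. This is a minimal sketch: the function name `filter_significant` and the sample rows are made up, and skipping rows with empty or NaN values is an assumption about how missing results might appear in the file.

```python
import csv
import io
import math


def filter_significant(tsv_file, max_qvalue=0.01, min_abs_lor=0.5):
    """Keep rows with q-value <= max_qvalue and |GMM_LOR| >= min_abs_lor."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    missing = {"GMM_chi2_qvalue", "GMM_LOR"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing GMM columns: {sorted(missing)}")

    kept = []
    for row in reader:
        try:
            qvalue = float(row["GMM_chi2_qvalue"])
            lor = float(row["GMM_LOR"])
        except (TypeError, ValueError):
            continue  # assumption: empty or non-numeric cells mean "no result"
        if math.isnan(qvalue) or math.isnan(lor):
            continue
        if qvalue <= max_qvalue and abs(lor) >= min_abs_lor:
            kept.append(row)
    return kept


# Invented rows standing in for a results TSV.
sample_tsv = (
    "pos\tref_id\tref_kmer\tGMM_chi2_qvalue\tGMM_LOR\n"
    "42\tENST00000676788.1\tGCTACGT\t0.001\t1.3\n"
    "43\tENST00000676788.1\tCTACGTA\t0.001\t0.2\n"
    "44\tENST00000676788.1\tTACGTAA\t0.2\t-2.0\n"
    "45\tENST00000676788.1\tACGTAAC\t0.01\t-0.5\n"
    "46\tENST00000676788.1\tCGTAACG\tnan\tnan\n"
)

significant = filter_significant(io.StringIO(sample_tsv))
positions = [row["pos"] for row in significant]
```

To run it on a real results file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")`.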
Shift statistics TSV
The shift statistics TSV gives position-level summary statistics (mean, median, standard deviation) of the signal measurements (current intensity and dwell time) for the two conditions. The data are always gathered during the analysis and saved to the database, but they are exported to a TSV file only when export_shift_stats: True is added to the configuration. The TSV includes the following columns:
Column | Description | Example |
---|---|---|
ref_id | Reference name of the transcript. | ENST00000676788.1 |
pos | Starting position on the transcript's reference sequence of the k-mer for which the row reports results. Indexing is 0-based, meaning that the first nucleotide on the transcript is position 0. | 42 |
c1_mean_intensity | Mean value for the current intensity at the position for condition 1. | 78.91 |
c2_mean_intensity | Mean value for the current intensity at the position for condition 2. | 81.91 |
c1_median_intensity | Median value for the current intensity at the position for condition 1. | 78.21 |
c2_median_intensity | Median value for the current intensity at the position for condition 2. | 80.21 |
c1_sd_intensity | Standard deviation for the current intensity at the position for condition 1. | 1.21 |
c2_sd_intensity | Standard deviation for the current intensity at the position for condition 2. | 2.21 |
c1_mean_dwell | Mean value for the dwell time at the position for condition 1. | 0.31 |
c2_mean_dwell | Mean value for the dwell time at the position for condition 2. | 0.23 |
c1_median_dwell | Median value for the dwell time at the position for condition 1. | 0.29 |
c2_median_dwell | Median value for the dwell time at the position for condition 2. | 0.33 |
c1_sd_dwell | Standard deviation for the dwell time at the position for condition 1. | 0.19 |
c2_sd_dwell | Standard deviation for the dwell time at the position for condition 2. | 0.23 |
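One simple use of these columns is to compute the per-position difference between the two conditions. The sketch below does this for the mean intensity and mean dwell time; the helper name `mean_shifts` and the sample row are hypothetical, and the choice of "condition 2 minus condition 1" is arbitrary.

```python
import csv
import io


def mean_shifts(tsv_file):
    """Return per-position differences (condition 2 minus condition 1)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    shifts = []
    for row in reader:
        shifts.append({
            "ref_id": row["ref_id"],
            "pos": int(row["pos"]),
            "intensity_shift": float(row["c2_mean_intensity"])
                               - float(row["c1_mean_intensity"]),
            "dwell_shift": float(row["c2_mean_dwell"])
                           - float(row["c1_mean_dwell"]),
        })
    return shifts


# Invented row using the example values from the table above.
sample_tsv = (
    "ref_id\tpos\tc1_mean_intensity\tc2_mean_intensity\t"
    "c1_mean_dwell\tc2_mean_dwell\n"
    "ENST00000676788.1\t42\t78.91\t81.91\t0.31\t0.23\n"
)

shifts = mean_shifts(io.StringIO(sample_tsv))
first = shifts[0]
```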
Result database
The TSV files described above are created for the user's convenience at the end of Nanocompore's run. All their data come from the SQLite database that Nanocompore uses throughout the run to store results. The database is found in the output directory under the filename "out_sampComp_sql.db".
The database schema is as follows:
CREATE TABLE transcripts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE
);
CREATE INDEX transcripts_name_index
    ON transcripts(name);
CREATE TABLE kmer_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    transcript_id INTEGER NOT NULL,
    pos INTEGER NOT NULL,
    kmer INTEGER NOT NULL,
    c1_mean_intensity FLOAT,
    c1_mean_dwell FLOAT,
    c1_median_intensity FLOAT,
    c1_median_dwell FLOAT,
    c1_std_intensity FLOAT,
    c1_std_dwell FLOAT,
    c2_mean_intensity FLOAT,
    c2_mean_dwell FLOAT,
    c2_median_intensity FLOAT,
    c2_median_dwell FLOAT,
    c2_std_intensity FLOAT,
    c2_std_dwell FLOAT,
    UNIQUE (transcript_id, pos),
    FOREIGN KEY (transcript_id) REFERENCES transcripts(id)
);
CREATE INDEX kmer_results_transcript_id_index
    ON kmer_results(transcript_id);
However, depending on the choice of statistical tests, the samples used, and other parameters, additional columns may be added. For example, suppose we're using the GMM and KS tests to compare 3 knockdown and 3 wild-type samples. We'd get the following additional columns in the kmer_results table:
CREATE TABLE kmer_results (
    ...
    -- GMM columns
    GMM_chi2_pvalue FLOAT,
    GMM_LOR VARCHAR,
    KD_1_mod FLOAT,
    KD_1_unmod FLOAT,
    KD_2_mod FLOAT,
    KD_2_unmod FLOAT,
    KD_3_mod FLOAT,
    KD_3_unmod FLOAT,
    WT_1_mod FLOAT,
    WT_1_unmod FLOAT,
    WT_2_mod FLOAT,
    WT_2_unmod FLOAT,
    WT_3_mod FLOAT,
    WT_3_unmod FLOAT,
    -- KS columns
    KS_intensity_pvalue FLOAT,
    KS_dwell_pvalue FLOAT,
    ...
)
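Since the set of columns varies between runs, it can help to inspect the table before querying it. The sketch below uses Python's built-in sqlite3 module against an in-memory database that mimics a trimmed-down version of the schema above. The inserted row is invented, the value stored in kmer is only a placeholder (the integer encoding of k-mers is not described here), and the helper names are hypothetical.

```python
import sqlite3

# In-memory stand-in for the results database, with a subset of the columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transcripts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE
);
CREATE TABLE kmer_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    transcript_id INTEGER NOT NULL,
    pos INTEGER NOT NULL,
    kmer INTEGER NOT NULL,
    c1_mean_intensity FLOAT,
    c2_mean_intensity FLOAT,
    GMM_chi2_pvalue FLOAT,
    UNIQUE (transcript_id, pos),
    FOREIGN KEY (transcript_id) REFERENCES transcripts(id)
);
INSERT INTO transcripts (name) VALUES ('ENST00000676788.1');
INSERT INTO kmer_results
    (transcript_id, pos, kmer, c1_mean_intensity, c2_mean_intensity, GMM_chi2_pvalue)
VALUES (1, 42, 0, 78.91, 81.91, 0.00123);  -- kmer = 0 is a placeholder
""")


def result_columns(conn):
    """List the column names of kmer_results (field 1 of PRAGMA table_info)."""
    return [row[1] for row in conn.execute("PRAGMA table_info(kmer_results)")]


def results_for_transcript(conn, name):
    """Fetch per-position mean intensities for one transcript, ordered by pos."""
    query = """
        SELECT t.name, k.pos, k.c1_mean_intensity, k.c2_mean_intensity
        FROM kmer_results AS k
        JOIN transcripts AS t ON t.id = k.transcript_id
        WHERE t.name = ?
        ORDER BY k.pos
    """
    return conn.execute(query, (name,)).fetchall()


columns = result_columns(conn)
has_gmm = "GMM_chi2_pvalue" in columns
has_ks = "KS_intensity_pvalue" in columns
rows = results_for_transcript(conn, "ENST00000676788.1")
conn.close()
```

Against a real run, the same helpers should work after replacing `":memory:"` with the path to "out_sampComp_sql.db" in the output directory and dropping the `executescript` call.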