Generate simulated reads

In brief

...

Quick start

...

Description of main options

...

Advanced API Usage

...

Import the package and plotting tools

from nanocompore.SimReads import SimReads

# Plotting lib imports
import matplotlib.pyplot as pl
%matplotlib inline

Generate reads without modifications

SimReads (
    fasta_fn="./references/simulated/ref.fa",
    ref_list=["ref_0000"],
    outpath="./results/",
    overwrite=True,
    plot=True,
    nreads_per_ref=100)
Initialising SimReads and checking options
Importing RNA model file
Reading Fasta file and simulate corresponding data
100%|██████████| 1/1 [00:00<00:00,  1.59 References/s]

Generate reads with modifications

SimReads (
    fasta_fn="./references/simulated/ref.fa",
    ref_list=["ref_0000"],
    outpath="./results/",
    overwrite=True,
    plot=True,
    mod_extend_context=3,
    nreads_per_ref=100,
    intensity_mod=5,
    dwell_mod=5,
    mod_reads_freq=0.5)
Initialising SimReads and checking options
Importing RNA model file
Reading Fasta file and simulate corresponding data
100%|██████████| 1/1 [00:00<00:00,  1.25 References/s]

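The intensity_mod and dwell_mod values used in the call above are expressed as multiples of the model distribution's standard deviation. As a rough illustration of that arithmetic only (this is not nanocompore's internal code), the sketch below shifts the location of a distribution by mod × SD and draws values from it. The normal distribution, the shifted_loc and simulate_values helpers, and the example numbers are all assumptions made for illustration; the distributions stored in the actual model file may differ.

```python
import random
import statistics


def shifted_loc(loc, sd, mod):
    """Return the distribution location shifted by `mod` standard deviations."""
    return loc + mod * sd


def simulate_values(loc, sd, n, mod=0.0, seed=None):
    """Draw n values from a normal distribution (an assumption, for illustration)."""
    rng = random.Random(seed)
    new_loc = shifted_loc(loc, sd, mod)
    return [rng.gauss(new_loc, sd) for _ in range(n)]


# Hypothetical kmer model values, not taken from any real model file
loc, sd = 100.0, 2.0

unmodified = simulate_values(loc, sd, n=1000, seed=1)
modified = simulate_values(loc, sd, n=1000, mod=5, seed=1)

print(f"unmodified mean: {statistics.mean(unmodified):.1f}")
print(f"modified mean:   {statistics.mean(modified):.1f}")
```

With a modifier of 5 and an SD of 2, the two means differ by 10 units, which is the kind of shift a downstream comparison is expected to pick up.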

Generate a small dataset with both modified and unmodified conditions

# Options
fasta = "./references/simulated/ref.fa"
data_dir = "./eventalign_files/simulated/"

for replicate, nreads in [(1, 55), (2, 60)]:
    # Generate the unmodified control
    SimReads (
        fasta_fn=fasta,
        outpath=data_dir,
        outprefix=f"unmodified_rep_{replicate}",
        overwrite=True,
        nreads_per_ref=nreads)

    # Generate the modified condition
    SimReads (
        fasta_fn=fasta,
        outpath=data_dir,
        outprefix=f"modified_rep_{replicate}",
        overwrite=True,
        nreads_per_ref=nreads,
        intensity_mod=3,
        dwell_mod=3,
        mod_reads_freq=0.9,
        mod_bases_freq=0.25,
        pos_rand_seed=2)
Initialising SimReads and checking options
Importing RNA model file
Reading Fasta file and simulate corresponding data
100%|██████████| 5/5 [00:00<00:00,  6.43 References/s]
Initialising SimReads and checking options
Importing RNA model file
Reading Fasta file and simulate corresponding data
100%|██████████| 5/5 [00:01<00:00,  4.36 References/s]
Initialising SimReads and checking options
Importing RNA model file
Reading Fasta file and simulate corresponding data
100%|██████████| 5/5 [00:00<00:00,  5.63 References/s]
Initialising SimReads and checking options
Importing RNA model file
Reading Fasta file and simulate corresponding data
100%|██████████| 5/5 [00:01<00:00,  4.71 References/s]
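After the loop above has run, it can be useful to check what was written for each condition and replicate. The sketch below only relies on the fact that outprefix is used as a prefix for the generated files; it makes no assumption about the file names or extensions that SimReads produces. The files_by_prefix helper is hypothetical and not part of nanocompore.

```python
from pathlib import Path


def files_by_prefix(data_dir, prefixes):
    """Map each output prefix to the sorted names of the files starting with it."""
    data_dir = Path(data_dir)
    found = {}
    for prefix in prefixes:
        # The glob pattern is anchored at the start of the file name, so
        # "modified_rep_1*" does not match "unmodified_rep_1..." files
        found[prefix] = sorted(
            p.name for p in data_dir.glob(f"{prefix}*") if p.is_file())
    return found


# Prefixes used in the loop above
prefixes = [f"{condition}_rep_{replicate}"
            for condition in ("unmodified", "modified")
            for replicate in (1, 2)]

for prefix, names in files_by_prefix("./eventalign_files/simulated/", prefixes).items():
    print(prefix, names if names else "(no files found)")
```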

Full CLI and API documentation

API documentation

API help can be obtained with the conventional Python methods (help or ?), or rendered nicely in Jupyter with the jhelp function from nanocompore.

from nanocompore.SimReads import SimReads
from nanocompore.common import jhelp
jhelp(SimReads)

SimReads (fasta_fn, outpath, outprefix, overwrite, run_type, ref_list, nreads_per_ref, plot, intensity_mod, dwell_mod, mod_reads_freq, mod_bases_freq, mod_bases_type, mod_extend_context, min_mod_dist, pos_rand_seed, data_rand_seed, not_bound, log_level)

Simulate reads in a NanopolishComp-like file from a fasta file and a built-in model. The simulated reads correspond to the sequences provided in the fasta file and follow the intensity and dwell time distributions of the corresponding model (RNA or DNA).


  • fasta_fn (required) [str]

Fasta file containing references to use to generate artificial reads.

  • outpath (default: ./) [str]

Path to the output folder.

  • outprefix (default: out) [str]

Text prefix for all the files generated by the function.

  • overwrite (default: False) [bool]

If the output directory already exists, the standard behaviour is to raise an error to prevent overwriting existing data. This option ignores the error and overwrites the data if they have the same outpath and outprefix.

  • run_type (default: RNA) [str]

Define the run type model to import {RNA,DNA}

  • ref_list (default: []) [list]

Restrict the references to the listed IDs.

  • nreads_per_ref (default: 100) [int]

Number of reads to generate per reference.

  • plot (default: False) [bool]

If True, generate an interactive plot of the simulated trace.

  • intensity_mod (default: 0) [float]

Fraction of intensity distribution SD by which to modify the intensity distribution loc value.

  • dwell_mod (default: 0) [float]

Fraction of dwell time distribution SD by which to modify the dwell time distribution loc value.

  • mod_reads_freq (default: 0) [float]

Frequency of reads to modify.

  • mod_bases_freq (default: 0.25) [float]

Frequency of bases to modify in each read (if possible).

  • mod_bases_type (default: A) [str]

Base for which to modify the signal. {A,T,C,G}

  • mod_extend_context (default: 2) [int]

Number of adjacent bases affected by the signal modification, following a harmonic series.

  • min_mod_dist (default: 6) [int]

Minimum distance between two bases to modify.

  • pos_rand_seed (default: 42) [int]

Define a seed for random position picking to get a deterministic behaviour.

  • data_rand_seed (default: None) [int]

Define a seed for generating the data. If None (default) the seed is drawn from /dev/urandom.

  • not_bound (default: False) [bool]

Do not bound the values generated by the distributions to the min and max values observed in the model file.

  • log_level (default: info) [str]

Set the log level {warning, info, debug}
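The mod_extend_context option above states that the signal modification extends to adjacent bases following a harmonic series. One possible reading of that description is sketched below: the modified base gets the full modifier, and a neighbour at distance d gets the modifier scaled by 1 / (d + 1). The exact weighting used by SimReads is not given in this documentation, so the formula, the symmetric treatment of both sides, and the context_weights and effective_shift helpers are assumptions for illustration only.

```python
def context_weights(mod_extend_context):
    """Return {offset: weight} for a modified base (offset 0) and its neighbours.

    Assumption: a neighbour at distance d gets weight 1 / (d + 1),
    i.e. 1, 1/2, 1/3, ... (the terms of the harmonic series).
    """
    weights = {}
    for offset in range(-mod_extend_context, mod_extend_context + 1):
        weights[offset] = 1 / (abs(offset) + 1)
    return weights


def effective_shift(modifier, mod_extend_context):
    """Scale a modifier (e.g. intensity_mod) by the weight at each offset."""
    return {offset: modifier * weight
            for offset, weight in context_weights(mod_extend_context).items()}


# Modifier of 3 SD on the modified base, decaying over 2 neighbours on each side
for offset, shift in effective_shift(modifier=3, mod_extend_context=2).items():
    print(f"offset {offset:+d}: shift of {shift:.2f} SD")
```

Under this reading, a larger mod_extend_context spreads a progressively weaker version of the modification over more positions around each modified base.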

CLI documentation

nanocompore simreads --help
usage: nanocompore simreads [-h] --fasta FASTA [--run_type {RNA,DNA}]
                            [--outpath OUTPATH] [--outprefix OUTPREFIX]
                            [--overwrite] [--nreads_per_ref NREADS_PER_REF]
                            [--intensity_mod INTENSITY_MOD]
                            [--dwell_mod DWELL_MOD]
                            [--mod_reads_freq MOD_READS_FREQ]
                            [--mod_bases_freq MOD_BASES_FREQ]
                            [--mod_bases_type {A,T,C,G}]
                            [--mod_extend_context MOD_EXTEND_CONTEXT]
                            [--min_mod_dist MIN_MOD_DIST]
                            [--pos_rand_seed POS_RAND_SEED] [--not_bound]
                            [--log_level {warning,info,debug}]

Simulate reads in a NanopolishComp like file from a fasta file and an inbuild model

* Minimal example without model alteration
    nanocompore simreads -f ref.fa -o results -n 50

* Minimal example with alteration of model intensity loc parameter for 50% of the reads
    nanocompore simreads -f ref.fa -o results -n 50 --intensity_mod 2 --mod_reads_freq 0.5 --mod_bases_freq 0.2

optional arguments:
  -h, --help            show this help message and exit

Input/Output options:
  --fasta FASTA, -f FASTA
                        Fasta file containing references to use to generate
                        artificial reads
  --run_type {RNA,DNA}  Define the run type model to import (default: RNA)
  --outpath OUTPATH, -o OUTPATH
                        Path to the output folder (default: ./)
  --outprefix OUTPREFIX, -p OUTPREFIX
                        text outprefix for all the files generated by the
                        function (default: out)
  --overwrite           Use --outpath even if it exists already (default:
                        False)
  --nreads_per_ref NREADS_PER_REF, -n NREADS_PER_REF
                        Number of reads to generate per references (default:
                        100)

Signal modification options:
  --intensity_mod INTENSITY_MOD
                        Fraction of intensity distribution SD by which to
                        modify the intensity distribution loc value (default:
                        0)
  --dwell_mod DWELL_MOD
                        Fraction of dwell time distribution SD by which to
                        modify the intensity distribution loc value (default:
                        0)
  --mod_reads_freq MOD_READS_FREQ
                        Frequency of reads to modify (default: 0)
  --mod_bases_freq MOD_BASES_FREQ
                        Frequency of bases to modify in each read (if
                        possible) (default: 0.25)
  --mod_bases_type {A,T,C,G}
                        Base for which to modify the signal (default: A)
  --mod_extend_context MOD_EXTEND_CONTEXT
                        number of adjacent base affected by the signal
                        modification following an harmonic series (default: 2)
  --min_mod_dist MIN_MOD_DIST
                        Minimal distance between 2 bases to modify (default:
                        6)

Other options:
  --pos_rand_seed POS_RAND_SEED
                        Define a seed for randon position picking to get a
                        deterministic behaviour (default: 42)
  --not_bound           Do not bind the values generated by the distributions
                        to the observed min and max observed values from the
                        model file (default: False)
  --log_level {warning,info,debug}
                        Set the log level (default: info)
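When several simulations have to be launched from a script, the command line shown in the help above can be assembled programmatically. The sketch below builds the argument list from keyword options using only the long option names listed in the help text, and runs it only if a nanocompore executable and the input file are actually available. The build_simreads_cmd helper is hypothetical and not part of nanocompore.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def build_simreads_cmd(fasta, **options):
    """Build the argv list for `nanocompore simreads` from keyword options."""
    cmd = ["nanocompore", "simreads", "--fasta", str(fasta)]
    for name, value in options.items():
        if isinstance(value, bool):
            # Flags such as --overwrite or --not_bound take no value
            if value:
                cmd.append(f"--{name}")
        else:
            cmd.extend([f"--{name}", str(value)])
    return cmd


# Same options as the second example of the help text
cmd = build_simreads_cmd(
    "ref.fa", outpath="results", nreads_per_ref=50,
    intensity_mod=2, mod_reads_freq=0.5, mod_bases_freq=0.2)
print(shlex.join(cmd))

# Only run the command if the executable and the input file are available
if shutil.which("nanocompore") and Path("ref.fa").is_file():
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.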