Data API

The Data API reads the preprocessed signal data that Nanocompore uses for the analysis, for example for custom plotting. Load the configuration file used for the analysis with the load_config function, then query the data with get_references, get_reads, and get_pos.

For example:

>>> from nanocompore.api import load_config, get_pos

# Load the YAML configuration file to a Config object.
>>> config = load_config('analysis.yaml')

# Get the signal data for a given position:
>>> ref_id = 'ENST00000464651.1|ENSG00000166136.16|OTTHUMG00000019346.4|OTTHUMT00000051221.1|NDUFB8-204|NDUFB8|390|retained_intron|'
>>> get_pos(config, ref_id, 243)
    condition sample                                  read  intensity  dwell
0          WT   WT_2  a6f3e188-6288-4215-acdc-fe28beba411f    -1624.0   27.0
1          WT   WT_2  09923db6-eccc-497f-8621-8adeea9b1bfb     4072.0   20.0
2          WT   WT_2  f65926cc-bf13-4396-ba92-7f2f690b71d9    -2571.0    5.0
3          WT   WT_2  aebabd0a-5260-41c4-b38b-1ebb117dc0fb      586.0   16.0
4          WT   WT_2  994256e9-afab-4b54-94ff-cc37ae4cbe08     5229.0   16.0
..        ...    ...                                   ...        ...    ...
383        WT   WT_1  79df3c74-a4c6-4335-93c5-a0ca7e3aec78    -1067.0   25.0
384        WT   WT_1  f7dad9c6-d3d9-4501-85fb-6c6246a03719    -2225.0   56.0
385        WT   WT_1  8653efdc-943f-48f8-b6f1-174cc4bb1ad5     2837.0   12.0
386        WT   WT_1  fdf524f0-5bb5-45fc-a783-7e3a592eb149      462.0   30.0
387        WT   WT_1  b05c004e-5f58-4bfa-896b-ce28b4225ab2     -469.0   27.0

[388 rows x 5 columns]

Reference

get_metadata(db)

Returns the metadata from the given SQLite database.

The metadata contains information such as input files, resquiggler used, and data types for the binary encoded fields.

Parameters:

Name Type Description Default
db str

Path to the SQLite database produced by the preprocessing command of Nanocompore.

required

Returns:

Type Description
dict

Dictionary containing the metadata

Source code in nanocompore/api.py
def get_metadata(db: str) -> dict[str, str]:
    """
    Returns the metadata from the given SQLite database.

    The metadata contains information such as input files,
    resquiggler used, and data types for the binary encoded fields.

    Parameters
    ----------
    db : str
        Path to the SQLite database produced by the preprocessing
        command of Nanocompore.

    Returns
    -------
    dict
        Dictionary containing the metadata
    """
    with closing(sqlite3.connect(db)) as conn,\
         closing(conn.cursor()) as cursor:
        query = "SELECT key, value FROM metadata"
        return {k: v for k, v in cursor.execute(query).fetchall()}
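
As a minimal sketch of the query above, the following builds a throwaway SQLite database with a key/value table named metadata and reads it back the same way. The helper name read_metadata and the keys and values (resquiggler, intensity_dtype) are made up for illustration; they are not necessarily what Nanocompore stores.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def read_metadata(db_path: str) -> dict[str, str]:
    """Read all key/value pairs from a table named 'metadata'."""
    with closing(sqlite3.connect(db_path)) as conn, \
         closing(conn.cursor()) as cursor:
        rows = cursor.execute("SELECT key, value FROM metadata").fetchall()
        return dict(rows)


# Build a small stand-in database. The keys and values below are
# hypothetical placeholders, not the real Nanocompore metadata.
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "example.sqlite")
with closing(sqlite3.connect(db_path)) as conn:
    conn.execute("CREATE TABLE metadata (key TEXT PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT INTO metadata VALUES (?, ?)",
        [("resquiggler", "example_tool"), ("intensity_dtype", "int16")],
    )
    conn.commit()

metadata = read_metadata(db_path)
print(metadata)
```

With a real preprocessing database, calling get_metadata on its path should return a dictionary of the same shape.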

get_pos(config, reference_id, pos)

Get the data for a given position for all samples. Note that position is a 0-based index of the first nucleotide of a k-mer.

Returns the signal data for a specific position of the given reference transcript from all reads.

Parameters:

Name Type Description Default
config Config

Configuration object, as returned by load_config.

required
reference_id str

ID for a reference sequence (transcript).

required
pos int

Position on the transcript for which to get data. A 0-based index is assumed.

required

Returns:

Type Description
DataFrame

Where the DataFrame contains the following columns:

  • condition: condition label (as defined in the configuration)
  • sample: sample label (as defined in the configuration)
  • read: id of the read (qname)
  • intensity: current intensity
  • dwell: dwell time for the k-mer

Examples:

>>> from nanocompore.api import load_config, get_pos
>>> config = load_config('analysis.yaml')
>>> get_pos(config, 'ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|', 532)
   condition sample                                  read  intensity  dwell
0         WT    WT1  a4395b0d-dd3b-48e3-8afb-4085374b1147     3800.0    7.0
1         WT    WT1  f9733448-6e6b-47ba-9501-01eda2f5ea26     4865.0  126.0
2         WT    WT1  6f5e3b2e-f27b-47ef-b3c6-2ab4fdefd20a     3272.0   42.0
3         WT    WT2  2da07406-70c2-40a1-835a-6a7a2c914d49     6241.0   44.0
4         WT    WT2  54fc1d38-5e3d-4d77-a717-2d41b4785af6     4047.0    9.0
5         WT    WT2  3cfa90d1-7dfb-4398-a224-c75a3ab99873     3709.0   70.0
6         KD    KD1  3f46f499-8ce4-4817-8177-8ad61b784f27     4807.0   57.0
7         KD    KD1  73d62df4-f04a-4207-a4bc-7b9739b3c3b2     4336.0  132.0
8         KD    KD1  b7bc9a36-318e-4be2-a90f-74a5aa6439bf     -861.0    7.0
9         KD    KD2  ac486e16-15be-47a8-902c-2cfa2887c534     2706.0   45.0
10        KD    KD2  797fd991-570e-42d4-8292-0a7557b192d7     5450.0   24.0
11        KD    KD2  4e1ad358-ec2b-40b4-8e9a-54db28a40551      206.0   47.0
Source code in nanocompore/api.py
def get_pos(config: Config, reference_id: str, pos: int) -> pd.DataFrame:
    """
    Get the data for a given position for all samples.
    Note that position is a 0-based index of the first
    nucleotide of a k-mer.

    Returns the signal data for a specific position
    of the given reference transcript from all reads.

    Parameters
    ----------
    config : Config
        Configuration object, as returned by load_config.
    reference_id : str
        ID for a reference sequence (transcript).
    pos : int
        Position on the transcript for which to get data. A 0-based
        index is assumed.

    Returns
    -------
    pandas.DataFrame
        Where the DataFrame contains the following columns:

        - condition: condition label (as defined in the configuration)
        - sample:    sample label (as defined in the configuration)
        - read:      id of the read (qname)
        - intensity: current intensity
        - dwell:     dwell time for the kmer

    Examples
    --------
    >>> from nanocompore.api import load_config, get_pos
    >>> config = load_config('analysis.yaml')
    >>> get_pos(config, 'ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|', 532)
       condition sample                                  read  intensity  dwell
    0         WT    WT1  a4395b0d-dd3b-48e3-8afb-4085374b1147     3800.0    7.0
    1         WT    WT1  f9733448-6e6b-47ba-9501-01eda2f5ea26     4865.0  126.0
    2         WT    WT1  6f5e3b2e-f27b-47ef-b3c6-2ab4fdefd20a     3272.0   42.0
    3         WT    WT2  2da07406-70c2-40a1-835a-6a7a2c914d49     6241.0   44.0
    4         WT    WT2  54fc1d38-5e3d-4d77-a717-2d41b4785af6     4047.0    9.0
    5         WT    WT2  3cfa90d1-7dfb-4398-a224-c75a3ab99873     3709.0   70.0
    6         KD    KD1  3f46f499-8ce4-4817-8177-8ad61b784f27     4807.0   57.0
    7         KD    KD1  73d62df4-f04a-4207-a4bc-7b9739b3c3b2     4336.0  132.0
    8         KD    KD1  b7bc9a36-318e-4be2-a90f-74a5aa6439bf     -861.0    7.0
    9         KD    KD2  ac486e16-15be-47a8-902c-2cfa2887c534     2706.0   45.0
    10        KD    KD2  797fd991-570e-42d4-8292-0a7557b192d7     5450.0   24.0
    11        KD    KD2  4e1ad358-ec2b-40b4-8e9a-54db28a40551      206.0   47.0
    """
    data_files = _get_data_files(config)
    sample_mapper = np.vectorize(dict(enumerate(data_files)).get)
    if config.get_resquiggler() == UNCALLED4:
        kit = config.get_kit()
        df = _get_bam_pos(data_files.values(), reference_id, pos, kit)
    else:
        df = _get_db_pos(data_files.values(), reference_id, pos)
    df['sample'] = sample_mapper(df['sample'])
    condition_mapper = np.vectorize(config.sample_to_condition().get)
    df['condition'] = condition_mapper(df['sample'])
    return df.loc[:, ['condition', 'sample', 'read', 'intensity', 'dwell']]
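
Because the result is a regular pandas DataFrame, per-condition summaries at a position need only a groupby. The sketch below does not call Nanocompore: it rebuilds a stand-in DataFrame from the WT1 and KD1 rows of the example output above (read ids shortened) and computes the median intensity and dwell time per condition.

```python
import pandas as pd

# Stand-in for the DataFrame returned by get_pos: the WT1 and KD1 rows
# of the example output, with the read ids truncated for brevity.
df = pd.DataFrame({
    "condition": ["WT", "WT", "WT", "KD", "KD", "KD"],
    "sample": ["WT1", "WT1", "WT1", "KD1", "KD1", "KD1"],
    "read": ["a4395b0d", "f9733448", "6f5e3b2e",
             "3f46f499", "73d62df4", "b7bc9a36"],
    "intensity": [3800.0, 4865.0, 3272.0, 4807.0, 4336.0, -861.0],
    "dwell": [7.0, 126.0, 42.0, 57.0, 132.0, 7.0],
})

# Median intensity and dwell time per condition at this position.
summary = df.groupby("condition")[["intensity", "dwell"]].median()
print(summary)
```

The same pattern works with "sample" as the grouping column to compare replicates.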

get_reads(config, reference_id, selected_reads=None)

Get the data for all reads mapping to the given reference.

Parameters:

Name Type Description Default
config Config

Configuration object, as returned by load_config.

required
reference_id str

ID for a reference sequence (transcript).

required
selected_reads Optional[list[str]]

Optional list of read UUIDs for which to get data. Defaults to None, which returns all reads.

None

Returns:

Type Description
tuple[Float[np.ndarray, "reads positions variables"], list[str], list[str], list[str]]

A tuple with (signal_data, reads, samples, conditions)

  • signal_data is a 3D array with shape (reads, positions, variables). In the variables dimension 0=intensity, 1=dwell time.
  • reads is a list of read ids (qname).
  • samples is a list of the sample labels (as defined in the config).
  • conditions is a list of the condition labels (as defined in the config).

Raises:

Type Description
KeyError

If the reference_id is not found in the data sources.

Examples:

>>> from nanocompore.api import load_config, get_reads
>>> config = load_config('analysis.yaml')
>>> get_reads(config, 'ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|')
Source code in nanocompore/api.py
def get_reads(
        config: Config,
        reference_id: str,
        selected_reads: Optional[list[str]]=None
    ) -> tuple[Float[np.ndarray, "reads positions variables"],
               list[str],
               list[str],
               list[str]]:
    """
    Get the data for all reads mapping to the given reference.

    Parameters
    ----------
    config : Config
        Configuration object, as returned by load_config.
    reference_id : str
        ID for a reference sequence (transcript).
    selected_reads : Optional[list[str]]
        Optional list of UUIDs of the reads for which to get data.
        By default it's set to None and returns all reads.

    Returns
    -------
    tuple[Float[np.ndarray, ["reads positions variables"]],
          list[str],
          list[str],
          list[str]]

        A tuple with (signal_data, reads, samples, conditions)

        - signal_data is a 3D array with shape (reads, positions, variables).
          In the variables dimension 0=intensity, 1=dwell time.
        - reads is a list of read ids (qname).
        - samples is a list of the sample labels (as defined in the config).
        - conditions is a list of the condition labels (as defined in the config).

    Raises
    ------
    KeyError
        If the reference_id is not found in the data sources.

    Examples
    --------
    >>> from nanocompore.api import load_config, get_reads
    >>> config = load_config('analysis.yaml')
    >>> get_reads(config, 'ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|')
    """
    data_files = _get_data_files(config)
    if config.get_resquiggler() == UNCALLED4:
        kit = config.get_kit()
        data, reads, samples = _get_bam_reads(data_files.values(),
                                              reference_id,
                                              kit,
                                              selected_reads)
    else:
        data, reads, samples = _get_db_reads(data_files.values(),
                                             reference_id,
                                             selected_reads)
    sample_mapper = np.vectorize(dict(enumerate(data_files)).get)
    samples = sample_mapper(samples)
    condition_mapper = np.vectorize(config.sample_to_condition().get)
    conditions = condition_mapper(samples)
    return data, reads, samples.tolist(), conditions.tolist()
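
The returned array can be sliced with ordinary NumPy indexing. The sketch below uses a made-up array in place of a real get_reads result and assumes, as the return description suggests, that the reads, samples, and conditions lists are aligned with the first axis of signal_data. All values and read names are placeholders.

```python
import numpy as np

# Stand-in for the tuple returned by get_reads: 4 reads, 3 positions,
# 2 variables (0 = intensity, 1 = dwell time). All values are made up.
signal_data = np.arange(24, dtype=float).reshape(4, 3, 2)
reads = ["read_a", "read_b", "read_c", "read_d"]
samples = ["WT1", "WT2", "KD1", "KD2"]
conditions = ["WT", "WT", "KD", "KD"]

# Split the variables dimension.
intensity = signal_data[:, :, 0]   # shape (reads, positions)
dwell = signal_data[:, :, 1]       # shape (reads, positions)

# Select the reads of one condition with a boolean mask over the first axis.
is_kd = np.array(conditions) == "KD"
kd_intensity = intensity[is_kd]

# Intensities of all KD reads at position index 1.
print(kd_intensity[:, 1])
```

This sketch makes no assumption about how positions that a read does not cover are encoded; check the real array before aggregating across reads.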

get_references(config, has_data=True)

Returns a list of all references found in the list of samples defined in the configuration.

Parameters:

Name Type Description Default
config Config

Configuration object, as returned by load_config.

required
has_data bool

If True (default), return only references that have mapped reads.

True

Returns:

Type Description
list

List of transcript reference id strings.

Examples:

>>> from nanocompore.api import load_config, get_references
>>> config = load_config('analysis.yaml')
>>> get_references(config)
['ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|', 'ENST00000642480.2|ENSG00000075624.17|OTTHUMG00000023268|OTTHUMT00000495153.1|ACTB-213|ACTB|2021|protein_coding|']
Source code in nanocompore/api.py
def get_references(config: Config, has_data=True) -> list[str]:
    """
    Returns a list of all references found in the
    list of samples defined in the configuration.

    Parameters
    ----------
    config : Config
        Configuration object, as returned by load_config.
    has_data : bool, default=True
        If True (default) will return only references
        for which there are mapped reads.

    Returns
    -------
    list
        List of transcript reference id strings.

    Examples
    --------
    >>> from nanocompore.api import load_config, get_references
    >>> config = load_config('analysis.yaml')
    >>> get_references(config)
    ['ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|', 'ENST00000642480.2|ENSG00000075624.17|OTTHUMG00000023268|OTTHUMT00000495153.1|ACTB-213|ACTB|2021|protein_coding|']
    """
    data_files = list(_get_data_files(config).values())
    if config.get_resquiggler() == UNCALLED4:
        return _get_bam_references(data_files, has_data)
    else:
        return _get_db_references(data_files, has_data)
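
The reference ids in the examples are pipe-delimited strings, so the returned list can be filtered by gene once the fields are split out. The helper below is hypothetical and not part of Nanocompore. It assumes the GENCODE-style field order seen in the examples (transcript id, gene id, two Havana ids, transcript name, gene name, length, biotype); references built from a different annotation may not follow it.

```python
from typing import NamedTuple


class ReferenceFields(NamedTuple):
    transcript_id: str
    gene_id: str
    transcript_name: str
    gene_name: str
    length: int
    biotype: str


def parse_reference_id(reference_id: str) -> ReferenceFields:
    """Split a pipe-delimited reference id into named fields.

    Assumed layout (not guaranteed by Nanocompore):
    transcript|gene|havana gene|havana transcript|transcript name|
    gene name|length|biotype|
    """
    parts = reference_id.split("|")
    if len(parts) < 8:
        raise ValueError(f"Unexpected reference id format: {reference_id!r}")
    return ReferenceFields(
        transcript_id=parts[0],
        gene_id=parts[1],
        transcript_name=parts[4],
        gene_name=parts[5],
        length=int(parts[6]),
        biotype=parts[7],
    )


# The two reference ids from the get_references example.
references = [
    "ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|",
    "ENST00000642480.2|ENSG00000075624.17|OTTHUMG00000023268|OTTHUMT00000495153.1|ACTB-213|ACTB|2021|protein_coding|",
]

# Keep only the transcripts of one gene.
actb = [r for r in references if parse_reference_id(r).gene_name == "ACTB"]
first = parse_reference_id(references[0])
print(first.transcript_name, first.length)
```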

load_config(config_path)

Load a configuration file.

Parameters:

Name Type Description Default
config_path str

Path to the Nanocompore configuration file.

required

Returns:

Type Description
Config

A configuration object.

Source code in nanocompore/api.py
def load_config(config_path: str) -> Config:
    """
    Load a configuration file.

    Parameters
    ----------
    config_path : str
        Path to the Nanocompore configuration file.

    Returns
    -------
    Config
        A configuration object.
    """
    with open(config_path, 'rb') as f:
        return Config(yaml.safe_load(f))