Data API

The Data API reads the preprocessed signal data that Nanocompore uses for the analysis, which is useful for custom plotting and other downstream work. In general, load the configuration file used for the analysis with the `load_config` function, then query the data with `get_references`, `get_reads`, and `get_pos`.

For example:

>>> from nanocompore.api import load_config, get_pos

# Load the YAML configuration file to a Config object.
>>> config = load_config('analysis.yaml')

# Get the signal data for a given position:
>>> ref_id = 'ENST00000464651.1|ENSG00000166136.16|OTTHUMG00000019346.4|OTTHUMT00000051221.1|NDUFB8-204|NDUFB8|390|retained_intron|'
>>> get_pos(config, ref_id, 243)
    condition sample                                  read  intensity  dwell
0          WT   WT_2  a6f3e188-6288-4215-acdc-fe28beba411f    -1624.0   27.0
1          WT   WT_2  09923db6-eccc-497f-8621-8adeea9b1bfb     4072.0   20.0
2          WT   WT_2  f65926cc-bf13-4396-ba92-7f2f690b71d9    -2571.0    5.0
3          WT   WT_2  aebabd0a-5260-41c4-b38b-1ebb117dc0fb      586.0   16.0
4          WT   WT_2  994256e9-afab-4b54-94ff-cc37ae4cbe08     5229.0   16.0
..        ...    ...                                   ...        ...    ...
383        WT   WT_1  79df3c74-a4c6-4335-93c5-a0ca7e3aec78    -1067.0   25.0
384        WT   WT_1  f7dad9c6-d3d9-4501-85fb-6c6246a03719    -2225.0   56.0
385        WT   WT_1  8653efdc-943f-48f8-b6f1-174cc4bb1ad5     2837.0   12.0
386        WT   WT_1  fdf524f0-5bb5-45fc-a783-7e3a592eb149      462.0   30.0
387        WT   WT_1  b05c004e-5f58-4bfa-896b-ce28b4225ab2     -469.0   27.0

[388 rows x 5 columns]
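
Because the result is an ordinary pandas DataFrame, it can be summarised or plotted with standard pandas operations. The following is a minimal sketch that assumes only the column layout shown above. The rows are made-up values standing in for a real `get_pos` result, and `summarise_position` is a hypothetical helper, not part of Nanocompore.

```python
import pandas as pd

# Made-up rows with the same columns that get_pos returns.
# In real use this frame would come from get_pos(config, ref_id, pos).
df = pd.DataFrame({
    'condition': ['WT', 'WT', 'KD', 'KD'],
    'sample': ['WT_1', 'WT_2', 'KD_1', 'KD_2'],
    'read': ['read-a', 'read-b', 'read-c', 'read-d'],
    'intensity': [100.0, 300.0, -50.0, 150.0],
    'dwell': [10.0, 20.0, 30.0, 50.0],
})


def summarise_position(df: pd.DataFrame) -> pd.DataFrame:
    """Median intensity and dwell per condition, plus the read count."""
    grouped = df.groupby('condition')
    summary = grouped[['intensity', 'dwell']].median()
    summary['n_reads'] = grouped['read'].count()
    return summary


summary = summarise_position(df)
print(summary)
```

The same grouping works per sample by grouping on the `sample` column instead of `condition`.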

Reference

`get_metadata(db)`

Returns the metadata from the given SQLite database.

The metadata contains information such as input files, resquiggler used, and data types for the binary encoded fields.

Parameters:

Name	Type	Description	Default
`db`	`str`	Path to the SQLite database produced by the preprocessing command of Nanocompore.	required

Returns:

Type	Description
`dict`	Dictionary containing the metadata

Source code in nanocompore/api.py

def get_metadata(db: str) -> dict[str, str]:
    """
    Returns the metadata from the given SQLite database.

    The metadata contains information such as input files,
    resquiggler used, and data types for the binary encoded fields.

    Parameters
    ----------
    db : str
        Path to the SQLite database produced by the preprocessing
        command of Nanocompore.

    Returns
    -------
    dict
        Dictionary containing the metadata
    """
    with closing(sqlite3.connect(db)) as conn,\
         closing(conn.cursor()) as cursor:
        query = "SELECT key, value FROM metadata"
        return {k: v for k, v in cursor.execute(query).fetchall()}
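
To illustrate what this query returns, here is a self-contained sketch that builds a throwaway SQLite database and reads it back with the same `SELECT key, value FROM metadata` statement. The table schema, the keys, and the values are assumptions made up for the illustration. A real database produced by the preprocessing command will contain different entries. `read_metadata` is a stand-in name so the sketch runs without Nanocompore installed.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def read_metadata(db: str) -> dict[str, str]:
    """Read all key/value pairs with the same query as get_metadata."""
    with closing(sqlite3.connect(db)) as conn, \
         closing(conn.cursor()) as cursor:
        query = "SELECT key, value FROM metadata"
        return dict(cursor.execute(query).fetchall())


# Build a throwaway database with a hypothetical metadata table.
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, 'example.db')
with closing(sqlite3.connect(db_path)) as conn:
    conn.execute("CREATE TABLE metadata (key TEXT PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT INTO metadata (key, value) VALUES (?, ?)",
        [('resquiggler', 'example_tool'), ('intensity_type', 'float32')],
    )
    conn.commit()

metadata = read_metadata(db_path)
print(metadata)
```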

`get_pos(config, reference_id, pos)`

Get the data for a given position for all samples. Note that position is a 0-based index of the first nucleotide of a k-mer.

Returns the signal data for a specific position of the given reference transcript from all reads.

Parameters:

Name	Type	Description	Default
`config`	`Config`	Configuration object, as returned by `load_config`.	required
`reference_id`	`str`	ID for a reference sequence (transcript).	required
`pos`	`int`	Position on the transcript for which to get data. A 0-based index is assumed.	required

Returns:

Type	Description
`DataFrame`	A DataFrame with the following columns: condition: condition label (as defined in the configuration); sample: sample label (as defined in the configuration); read: id of the read (qname); intensity: current intensity; dwell: dwell time for the k-mer.

Examples:

>>> from nanocompore.api import load_config, get_pos
>>> config = load_config('analysis.yaml')
>>> get_pos(config, 'ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|', 532)
   condition sample                                  read  intensity  dwell
0         WT    WT1  a4395b0d-dd3b-48e3-8afb-4085374b1147     3800.0    7.0
1         WT    WT1  f9733448-6e6b-47ba-9501-01eda2f5ea26     4865.0  126.0
2         WT    WT1  6f5e3b2e-f27b-47ef-b3c6-2ab4fdefd20a     3272.0   42.0
3         WT    WT2  2da07406-70c2-40a1-835a-6a7a2c914d49     6241.0   44.0
4         WT    WT2  54fc1d38-5e3d-4d77-a717-2d41b4785af6     4047.0    9.0
5         WT    WT2  3cfa90d1-7dfb-4398-a224-c75a3ab99873     3709.0   70.0
6         KD    KD1  3f46f499-8ce4-4817-8177-8ad61b784f27     4807.0   57.0
7         KD    KD1  73d62df4-f04a-4207-a4bc-7b9739b3c3b2     4336.0  132.0
8         KD    KD1  b7bc9a36-318e-4be2-a90f-74a5aa6439bf     -861.0    7.0
9         KD    KD2  ac486e16-15be-47a8-902c-2cfa2887c534     2706.0   45.0
10        KD    KD2  797fd991-570e-42d4-8292-0a7557b192d7     5450.0   24.0
11        KD    KD2  4e1ad358-ec2b-40b4-8e9a-54db28a40551      206.0   47.0

Source code in nanocompore/api.py

def get_pos(config: Config, reference_id: str, pos: int) -> pd.DataFrame:
    """
    Get the data for a given position for all samples.
    Note that position is a 0-based index of the first
    nucleotide of a k-mer.

    Returns the signal data for a specific position
    of the given reference transcript from all reads.

    Parameters
    ----------
    config : Config
        Configuration object, as returned by load_config.
    reference_id : str
        ID for a reference sequence (transcript).
    pos : int
        Position on the transcript for which to get data. A 0-based
        index is assumed.

    Returns
    -------
    pandas.DataFrame
        Where the DataFrame contains the following columns:

        - condition: condition label (as defined in the configuration)
        - sample:    sample label (as defined in the configuration)
        - read:      id of the read (qname)
        - intensity: current intensity
        - dwell:     dwell time for the k-mer

    Examples
    --------
    >>> from nanocompore.api import load_config, get_pos
    >>> config = load_config('analysis.yaml')
    >>> get_pos(config, 'ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|', 532)
       condition sample                                  read  intensity  dwell
    0         WT    WT1  a4395b0d-dd3b-48e3-8afb-4085374b1147     3800.0    7.0
    1         WT    WT1  f9733448-6e6b-47ba-9501-01eda2f5ea26     4865.0  126.0
    2         WT    WT1  6f5e3b2e-f27b-47ef-b3c6-2ab4fdefd20a     3272.0   42.0
    3         WT    WT2  2da07406-70c2-40a1-835a-6a7a2c914d49     6241.0   44.0
    4         WT    WT2  54fc1d38-5e3d-4d77-a717-2d41b4785af6     4047.0    9.0
    5         WT    WT2  3cfa90d1-7dfb-4398-a224-c75a3ab99873     3709.0   70.0
    6         KD    KD1  3f46f499-8ce4-4817-8177-8ad61b784f27     4807.0   57.0
    7         KD    KD1  73d62df4-f04a-4207-a4bc-7b9739b3c3b2     4336.0  132.0
    8         KD    KD1  b7bc9a36-318e-4be2-a90f-74a5aa6439bf     -861.0    7.0
    9         KD    KD2  ac486e16-15be-47a8-902c-2cfa2887c534     2706.0   45.0
    10        KD    KD2  797fd991-570e-42d4-8292-0a7557b192d7     5450.0   24.0
    11        KD    KD2  4e1ad358-ec2b-40b4-8e9a-54db28a40551      206.0   47.0
    """
    data_files = _get_data_files(config)
    sample_mapper = np.vectorize(dict(enumerate(data_files)).get)
    if config.get_resquiggler() == UNCALLED4:
        kit = config.get_kit()
        df = _get_bam_pos(data_files.values(), reference_id, pos, kit)
    else:
        df = _get_db_pos(data_files.values(), reference_id, pos)
    df['sample'] = sample_mapper(df['sample'])
    condition_mapper = np.vectorize(config.sample_to_condition().get)
    df['condition'] = condition_mapper(df['sample'])
    return df.loc[:, ['condition', 'sample', 'read', 'intensity', 'dwell']]
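
The last lines of `get_pos` translate sample codes into sample labels and then into condition labels. Below is a minimal sketch of that mapping step in isolation. It assumes the raw `sample` column holds integer indices into the configured data files, which is inferred from the `dict(enumerate(...))` lookup in the source. The file names, labels, and codes are made up, and the two dictionaries stand in for `_get_data_files(config)` and `config.sample_to_condition()`.

```python
import numpy as np

# Hypothetical sample label -> file mapping, standing in for the
# result of the private helper _get_data_files(config).
data_files = {'WT_1': 'wt1.db', 'WT_2': 'wt2.db', 'KD_1': 'kd1.db'}
# Hypothetical sample -> condition mapping, standing in for
# config.sample_to_condition().
sample_to_condition = {'WT_1': 'WT', 'WT_2': 'WT', 'KD_1': 'KD'}

# Integer sample codes as they might appear in the raw data.
sample_codes = np.array([0, 2, 1, 0])

# Enumerating a dict yields (index, key) pairs, so this maps
# 0 -> 'WT_1', 1 -> 'WT_2', 2 -> 'KD_1'.
sample_mapper = np.vectorize(dict(enumerate(data_files)).get)
samples = sample_mapper(sample_codes)

# Second lookup: sample label -> condition label.
condition_mapper = np.vectorize(sample_to_condition.get)
conditions = condition_mapper(samples)

print(samples.tolist())
print(conditions.tolist())
```

The index-to-label lookup relies on dictionaries preserving insertion order, so the codes are only meaningful relative to the order in which the samples are listed.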

`get_reads(config, reference_id, selected_reads=None)`

Get the data for all reads mapping to the given reference.

Parameters:

Name	Type	Description	Default
`config`	`Config`	Configuration object, as returned by `load_config`.	required
`reference_id`	`str`	ID for a reference sequence (transcript).	required
`selected_reads`	`Optional[list[str]]`	Optional list of UUIDs of the reads for which to get data. By default it's set to None and returns all reads.	`None`

Returns:

Type	Description
`tuple[Float[np.ndarray, "reads positions variables"], list[str], list[str], list[str]]`	A tuple with (signal_data, reads, samples, conditions). signal_data is a 3D array with shape (reads, positions, variables); in the variables dimension 0=intensity, 1=dwell time. reads is a list of read ids (qname). samples is a list of the sample labels (as defined in the config). conditions is a list of the condition labels (as defined in the config).

Raises:

Type	Description
`KeyError`	If the reference_id is not found in the data sources.

Examples:

>>> from nanocompore.api import load_config, get_reads
>>> config = load_config('analysis.yaml')
>>> get_reads(config, 'ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|')

Source code in nanocompore/api.py

def get_reads(
        config: Config,
        reference_id: str,
        selected_reads: Optional[list[str]] = None
    ) -> tuple[Float[np.ndarray, "reads positions variables"],
               list[str],
               list[str],
               list[str]]:
    """
    Get the data for all reads mapping to the given reference.

    Parameters
    ----------
    config : Config
        Configuration object, as returned by load_config.
    reference_id : str
        ID for a reference sequence (transcript).
    selected_reads : Optional[list[str]]
        Optional list of UUIDs of the reads for which to get data.
        By default it's set to None and returns all reads.

    Returns
    -------
    tuple[Float[np.ndarray, "reads positions variables"],
          list[str],
          list[str],
          list[str]]

        A tuple with (signal_data, reads, samples, conditions)

        - signal_data is a 3D array with shape (reads, positions, variables).
          In the variables dimension 0=intensity, 1=dwell time.
        - reads is a list of read ids (qname).
        - samples is a list of the sample labels (as defined in the config).
        - conditions is a list of the condition labels (as defined in the config).

    Raises
    ------
    KeyError
        If the reference_id is not found in the data sources.

    Examples
    --------
    >>> from nanocompore.api import load_config, get_reads
    >>> config = load_config('analysis.yaml')
    >>> get_reads(config, 'ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|')
    """
    data_files = _get_data_files(config)
    if config.get_resquiggler() == UNCALLED4:
        kit = config.get_kit()
        data, reads, samples = _get_bam_reads(data_files.values(),
                                              reference_id,
                                              kit,
                                              selected_reads)
    else:
        data, reads, samples = _get_db_reads(data_files.values(),
                                             reference_id,
                                             selected_reads)
    sample_mapper = np.vectorize(dict(enumerate(data_files)).get)
    samples = sample_mapper(samples)
    condition_mapper = np.vectorize(config.sample_to_condition().get)
    conditions = condition_mapper(samples)
    return data, reads, samples.tolist(), conditions.tolist()
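
As a rough guide to working with the returned tuple, the sketch below indexes a 3D array laid out as described above (reads, positions, variables) and selects the reads of one condition. The array contents, read ids, and labels are made up and merely stand in for a real `get_reads` result. The constants `INTENSITY` and `DWELL` are hypothetical names for the two indices of the variables dimension.

```python
import numpy as np

# Made-up stand-ins for the tuple returned by get_reads:
# 3 reads, 4 positions, 2 variables (0=intensity, 1=dwell time).
signal_data = np.arange(24, dtype=float).reshape(3, 4, 2)
reads = ['read-a', 'read-b', 'read-c']
samples = ['WT_1', 'KD_1', 'WT_2']
conditions = ['WT', 'KD', 'WT']

INTENSITY, DWELL = 0, 1

# Intensity of every read at every position: shape (reads, positions).
intensity = signal_data[:, :, INTENSITY]

# Dwell times of all reads at position 2: shape (reads,).
dwell_at_2 = signal_data[:, 2, DWELL]

# Keep only the reads that belong to the WT condition.
is_wt = np.array(conditions) == 'WT'
wt_signal = signal_data[is_wt]
wt_reads = [r for r, keep in zip(reads, is_wt) if keep]

print(intensity.shape, dwell_at_2.tolist(), wt_signal.shape, wt_reads)
```

The first axis of the array and the three lists share the same order, which is why a boolean mask built from `conditions` can index `signal_data` directly.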

`get_references(config, has_data=True)`

Returns a list of all references found in the list of samples defined in the configuration.

Parameters:

Name	Type	Description	Default
`config`	`Config`	Configuration object, as returned by `load_config`.	required
`has_data`	`bool`	If True (default), return only references for which there are mapped reads.	`True`

Returns:

Type	Description
`list`	List of transcript reference id strings.

Examples:

>>> from nanocompore.api import load_config, get_references
>>> config = load_config('analysis.yaml')
>>> get_references(config)
['ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|', 'ENST00000642480.2|ENSG00000075624.17|OTTHUMG00000023268|OTTHUMT00000495153.1|ACTB-213|ACTB|2021|protein_coding|']

Source code in nanocompore/api.py

def get_references(config: Config, has_data: bool = True) -> list[str]:
    """
    Returns a list of all references found in the
    list of samples defined in the configuration.

    Parameters
    ----------
    config : Config
        Configuration object, as returned by load_config.
    has_data : bool, default=True
        If True (default), return only references
        for which there are mapped reads.

    Returns
    -------
    list
        List of transcript reference id strings.

    Examples
    --------
    >>> from nanocompore.api import load_config, get_references
    >>> config = load_config('analysis.yaml')
    >>> get_references(config)
    ['ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|ACTB-219|ACTB|2554|protein_coding|', 'ENST00000642480.2|ENSG00000075624.17|OTTHUMG00000023268|OTTHUMT00000495153.1|ACTB-213|ACTB|2021|protein_coding|']
    """
    data_files = list(_get_data_files(config).values())
    if config.get_resquiggler() == UNCALLED4:
        return _get_bam_references(data_files, has_data)
    else:
        return _get_db_references(data_files, has_data)
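
The reference ids in the examples above are long, pipe-delimited strings. When only the gene or transcript name is needed, for instance to label a plot, they can be split into fields. The sketch below assumes the eight-field, GENCODE-style layout seen in these examples. Ids that come from other annotations will not follow it. `ReferenceInfo` and `parse_reference_id` are hypothetical helpers, not part of Nanocompore.

```python
from typing import NamedTuple


class ReferenceInfo(NamedTuple):
    transcript_id: str
    gene_id: str
    transcript_name: str
    gene_name: str
    length: int
    biotype: str


def parse_reference_id(reference_id: str) -> ReferenceInfo:
    """Split a GENCODE-style, pipe-delimited reference id into fields."""
    # The ids end with a trailing '|', which would yield an empty field.
    fields = reference_id.rstrip('|').split('|')
    if len(fields) != 8:
        raise ValueError(f'Expected 8 fields, got {len(fields)}')
    return ReferenceInfo(
        transcript_id=fields[0],
        gene_id=fields[1],
        transcript_name=fields[4],
        gene_name=fields[5],
        length=int(fields[6]),
        biotype=fields[7],
    )


ref_id = ('ENST00000674681.1|ENSG00000075624.17|OTTHUMG00000023268|-|'
          'ACTB-219|ACTB|2554|protein_coding|')
info = parse_reference_id(ref_id)
print(info.gene_name, info.length)  # prints "ACTB 2554"
```

The full, unparsed string is still what `get_reads` and `get_pos` expect as `reference_id`.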

`load_config(config_path)`

Load a configuration file.

Parameters:

Name	Type	Description	Default
`config_path`	`str`	Path to the Nanocompore configuration file.	required

Returns:

Type	Description
`Config`	A configuration object.

Source code in nanocompore/api.py

def load_config(config_path: str) -> Config:
    """
    Load a configuration file.

    Parameters
    ----------
    config_path : str
        Path to the Nanocompore configuration file.

    Returns
    -------
    Config
        A configuration object.
    """
    with open(config_path, 'rb') as f:
        return Config(yaml.safe_load(f))