SEGUID v2: Checksums for Linear, Circular, Single- and Double-Stranded Biological Sequences

The seguid Python package implements functions for calculating biological sequence checksums for linear, circular, single- and double-stranded sequences based on either the SEGUID v2 algorithm (Pereira et al., 2024) or the original SEGUID v1 (Babnigg & Giometti, 2006).

Single-stranded DNA examples

>>> from seguid import *

## Linear single-stranded DNA

>>> lsseguid("TATGCCAA")
'lsseguid=EevrucUNYjqlsxrTEK8JJxPYllk'

## Linear single-stranded DNA

>>> lsseguid("AATATGCC")
'lsseguid=XsJzXMxgv7sbpqIzFH9dgrHUpWw'

## Circular single-stranded DNA

>>> csseguid("TATGCCAA")
'csseguid=XsJzXMxgv7sbpqIzFH9dgrHUpWw'

## Same after rotating three bases

>>> csseguid("GCCAATAT")
'csseguid=XsJzXMxgv7sbpqIzFH9dgrHUpWw'

Double-stranded DNA examples

>>> from seguid import *

## Linear double-stranded DNA

## AATATGCC
## ||||||||
## TTATACGG

>>> ldseguid("AATATGCC", "GGCATATT")
'ldseguid=dUxN7YQyVInv3oDcvz8ByupL44A'

## Same swapping Watson and Crick

>>> ldseguid("GGCATATT", "AATATGCC")
'ldseguid=dUxN7YQyVInv3oDcvz8ByupL44A'

## Circular double-stranded DNA

>>> cdseguid("TATGCCAA", "TTGGCATA")
'cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A'

## Same swapping Watson and Crick

>>> cdseguid("TTGGCATA", "TATGCCAA")
'cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A'

## Same rotating two basepairs (= minimal rotation by Watson)

>>> cdseguid("AATATGCC", "GGCATATT")
'cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A'

Installation

The seguid package is available on PyPI and can be installed as:

python -m pip install --user seguid

Module contents

SEGUID checksums for linear, circular, single- and double-stranded sequences

This package provides four functions, lsseguid(), csseguid(), ldseguid(), and cdseguid() (Table 1), for calculating SEGUID v2 checksums, and one function, seguid(), for calculating SEGUID v1 checksums. SEGUID v2 is described in Pereira et al. (2024), and SEGUID v1 in Babnigg & Giometti (2006).

Topology  Strandedness  Function

linear    single        lsseguid()
circular  single        csseguid()
linear    double        ldseguid()
circular  double        cdseguid()

Table 1: The four functions that calculate SEGUID v2 checksums for biological sequences with linear or circular topology and a single or two complementary strands.

This package works without external dependencies, but csseguid() and cdseguid() can be made faster by installing pydivsufsort, which provides a faster implementation of the algorithm for finding the smallest string rotation.

Usage

lsseguid(seq, alphabet="{DNA}", form="long")

csseguid(seq, alphabet="{DNA}", form="long")

ldseguid(watson, crick, alphabet="{DNA}", form="long")

cdseguid(watson, crick, alphabet="{DNA}", form="long")

seguid(seq, alphabet="{DNA}", form="long")

Function arguments

seq (string) The sequence for which the checksum should be calculated. The sequence may only consist of symbols in the alphabet specified by the alphabet argument.

watson, crick (strings) Two reverse-complementary DNA sequences. Both sequences should be specified in the 5’-to-3’ direction.

alphabet (string) The type of sequence used. If "{DNA}" (default), then the input is a DNA sequence. If "{RNA}", then the input is an RNA sequence. If "{protein}", then the input is an amino-acid sequence. If "{DNA-extended}" or "{RNA-extended}", then the input is a DNA or RNA sequence specified with an extended set of symbols, including IUPAC symbols (4). If "{protein-extended}", then the input is an amino-acid sequence with an extended set of symbols, including IUPAC symbols (5). A custom alphabet may also be used. A non-complementary alphabet is specified as a comma-separated set of single symbols, e.g. "X,Y,Z". A complementary alphabet is specified as a comma-separated set of paired symbols, e.g. "AT,CG". It is also possible to extend a pre-defined alphabet, e.g. "{DNA},XY".
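To make the custom-alphabet syntax concrete, the following is a minimal sketch of how such a specification could be read into a symbol-to-complement table. The helper parse_alphabet() and the PREDEFINED expansions are hypothetical names introduced for illustration only; they are not part of the seguid API, and the package's own parsing may differ.

```python
# Hypothetical helper for illustration; not part of the seguid API.
# Assumed expansions of two pre-defined alphabets (standard base pairing).
PREDEFINED = {"{DNA}": "AT,CG", "{RNA}": "AU,CG"}


def parse_alphabet(spec: str) -> dict:
    """Map each symbol to its complement ('' if it has none)."""
    for name, expansion in PREDEFINED.items():
        spec = spec.replace(name, expansion)
    table = {}
    for item in spec.split(","):
        if len(item) == 1:      # non-complementary symbol, e.g. "X"
            table[item] = ""
        elif len(item) == 2:    # complementary pair, e.g. "AT"
            a, b = item
            table[a] = b
            table[b] = a
        else:
            raise ValueError(f"Invalid alphabet item: {item!r}")
    return table


print(parse_alphabet("X,Y,Z"))     # three symbols without complements
print(parse_alphabet("AT,CG"))     # A<->T and C<->G
print(parse_alphabet("{DNA},XY"))  # DNA pairs plus the extra pair X<->Y
```

In this sketch a single symbol has no complement, while a two-symbol item defines a pair in both directions, which is the distinction the argument description draws between non-complementary and complementary alphabets.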

form (string) How the checksum is presented. If "long" (default), the full-length checksum is returned. If "short", the short, six-character checksum is returned. If "both", both the short and the long checksums are returned.

Value

The SEGUID functions return a single string, if form is either "long" or "short". If form is "both", then a tuple of two strings is returned, where the first component holds the "short" checksum and the second the "long" checksum. The long checksum, without the prefix, is a string with 27 characters. The short checksum, without the prefix, is the first six characters of the long checksum. All long checksums are prefixed with a label indicating which SEGUID method was used.

All functions produce checksums using the Base64url encoding (“Base 64 Encoding with URL and Filename Safe Alphabet”), with the exception of seguid(), which uses Base64 encoding. The “long” checksums returned are always 27 characters long. This is because the SHA-1 hash (6) is 160 bits long (20 bytes), so its encoded representation is 28 characters long and always ends with a single padding character (=), which makes the length a multiple of four characters. We relax this requirement by dropping the padding character.
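The length arithmetic can be checked with Python's standard library alone. This is a minimal sketch that does not use the seguid package; the input b"AT" is an arbitrary example.

```python
import base64
import hashlib

digest = hashlib.sha1(b"AT").digest()  # SHA-1 always yields 20 bytes (160 bits)
encoded = base64.urlsafe_b64encode(digest).decode("ascii")
checksum = encoded.rstrip("=")         # drop the padding character

print(len(digest))    # 20
print(len(encoded))   # 28, including one trailing "="
print(len(checksum))  # 27
```

Twenty bytes split into six full 3-byte groups (24 characters) plus one 2-byte remainder (3 characters and one "="), which gives 28 characters with padding and 27 without.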

Base64 and Base64url encodings

The Base64url encoding is the Base64 encoding with non-URL-safe characters substituted with URL-safe ones (Josefsson, 2006). Specifically, the plus symbol (+) is replaced by the minus symbol (-), and the forward slash (/) is replaced by the underscore symbol (_).

The Base64 checksum used for the original SEGUID checksum is not guaranteed to contain only symbols that can safely be used as-is in a Uniform Resource Locator (URL). Specifically, it may contain forward slashes (/) and plus symbols (+), which are characters that carry special meaning in a URL. For the same reason, a Base64 checksum cannot safely be used as a file or directory name, because it may contain a forward slash.
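The substitution can be illustrated with a plain string translation. The sketch below uses the checksum part of the seguid("AT") example from this documentation; the helper name to_base64url() is hypothetical and not part of the seguid API.

```python
# Translation table for the two substitutions defined by Base64url (RFC 4648).
BASE64_TO_URLSAFE = str.maketrans("+/", "-_")


def to_base64url(checksum: str) -> str:
    """Replace '+' with '-' and '/' with '_' in a Base64 string."""
    return checksum.translate(BASE64_TO_URLSAFE)


v1 = "Ax/RG6hzSrMEEWoCO1IWMGska+4"  # checksum part of seguid("AT")
v2 = to_base64url(v1)
print(v2)  # Ax_RG6hzSrMEEWoCO1IWMGska-4
```

The result matches the checksum part of the lsseguid("AT") example, and it contains neither of the two characters that are problematic in URLs and file names.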

References

  1. G Babnigg & CS Giometti, A database of unique protein sequence identifiers for proteome studies. Proteomics. 2006 Aug;6(16):4514-22, doi:10.1002/pmic.200600032.

  2. H Pereira, PC Silva, WM Davis, L Abraham, G Babnigg, H Bengtsson & B Johansson, SEGUID v2: Extending SEGUID Checksums for Circular, Linear, Single- and Double-Stranded Biological Sequences, bioRxiv, doi:10.1101/2024.02.28.582384.

  3. S Josefsson, The Base16, Base32, and Base64 Data Encodings, RFC 4648, October 2006, doi:10.17487/RFC4648.

  4. Wikipedia article ‘Nucleic acid notation’, February 2024, https://en.wikipedia.org/wiki/Nucleic_acid_notation.

  5. Wikipedia article ‘Amino acids’, February 2024, https://en.wikipedia.org/wiki/Amino_acid.

  6. Wikipedia article ‘SHA-1’ (Secure Hash Algorithm 1), December 2023, https://en.wikipedia.org/wiki/SHA-1.

seguid.seguid(seq: str, alphabet: str = '{DNA}', form: str = 'long') -> str

SEGUID v1 checksum for linear protein or single-stranded DNA.

Warning

seguid() (obsolete) is superseded by lsseguid() (recommended).

Given a nucleotide or amino-acid sequence seq in uppercase, the function returns a string containing the SEquence Globally Unique IDentifier (SEGUID). The SEGUID is defined as the Base64-encoded SHA-1 checksum calculated for the sequence in uppercase, with the trailing padding symbol (=) removed.

The original definition of the SEGUID v1 checksum algorithm (Babnigg & Giometti, 2006) included transformation to uppercase before calculating the checksum. Here, seguid() does not coerce the input sequence to uppercase. If your input sequence has lowercase symbols, you can use str.upper() to achieve what the original method does. seguid() only accepts symbols as specified by the alphabet argument. Thus, our implementation is more conservative, which has the benefit of lowering the risk of passing an incorrect sequence by mistake.
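The behaviour described above can be sketched with the standard library. The function seguid_v1_sketch() below is a hypothetical stand-in written from the definition given here (Base64-encoded SHA-1 with the padding removed); it is not the package's implementation, and its symbols parameter is a simplified substitute for the alphabet argument.

```python
import base64
import hashlib


def seguid_v1_sketch(seq: str, symbols: str = "ACGT") -> str:
    """Sketch of SEGUID v1 with strict symbol checking and no upper-casing."""
    unknown = set(seq) - set(symbols)
    if unknown:
        raise ValueError(f"Symbols not in alphabet: {sorted(unknown)}")
    digest = hashlib.sha1(seq.encode("ascii")).digest()
    return "seguid=" + base64.b64encode(digest).decode("ascii").rstrip("=")


try:
    seguid_v1_sketch("at")             # lowercase is rejected, not coerced
except ValueError as err:
    print(err)

print(seguid_v1_sketch("at".upper()))  # identical to seguid_v1_sketch("AT")
```

If the sketch follows the definition correctly, the final call should reproduce the seguid("AT") value shown under Examples.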

The resulting checksum string may contain forward slash (/) and plus-sign (+) symbols. These characters cannot be a part of a Uniform Resource Locator (URL) or a filename on some operating systems. The SEGUID v2 checksum produced by lsseguid() is similar to the SEGUID v1 checksum by seguid(), but uses the Base64url encoding, which does not produce these characters.

The checksum is prefixed with seguid=.

Examples

>>> seguid("AT")
'seguid=Ax/RG6hzSrMEEWoCO1IWMGska+4'

seguid.lsseguid(seq: str, alphabet: str = '{DNA}', form: str = 'long') -> str

SEGUID checksum for linear single-stranded DNA.

The same as the seguid() function, except that forward slashes (/) and plus signs (+) in the resulting checksum are replaced by underscores (_) and minus signs (-), respectively, following the Base64url standard in RFC 4648.

This checksum is applicable to linear single-stranded DNA sequences and protein sequences, among other sequences. If protein sequences are analyzed, the alphabet argument should be "{protein}" or "{protein-extended}".

The checksum is prefixed with lsseguid=.

Examples

>>> lsseguid("AT")
'lsseguid=Ax_RG6hzSrMEEWoCO1IWMGska-4'

seguid.csseguid(seq: str, alphabet: str = '{DNA}', form: str = 'long') -> str

SEGUID checksum for circular single-stranded DNA.

The csseguid() checksum is the lsseguid() checksum calculated for the lexicographically smallest string rotation of seq. This checksum is only defined for circular single-stranded sequences.
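The idea can be sketched with a naive search over all rotations. The names smallest_rotation(), checksum_sketch() and csseguid_sketch() are hypothetical and written only from the description above; the package itself uses a faster rotation algorithm, and this sketch skips alphabet validation.

```python
import base64
import hashlib


def smallest_rotation(seq: str) -> str:
    """Naive O(n^2) search for the lexicographically smallest rotation."""
    if not seq:
        return seq
    return min(seq[i:] + seq[:i] for i in range(len(seq)))


def checksum_sketch(seq: str) -> str:
    """Base64url-encoded SHA-1 digest with the padding character removed."""
    digest = hashlib.sha1(seq.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def csseguid_sketch(seq: str) -> str:
    return "csseguid=" + checksum_sketch(smallest_rotation(seq))


print(smallest_rotation("TTTA"))                           # ATTT
print(csseguid_sketch("TTTA") == csseguid_sketch("ATTT"))  # True
```

Because every rotation of a circular sequence shares the same smallest rotation, the checksum does not depend on where the sequence is cut open.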

The checksum is prefixed with csseguid=.

Examples

>>> csseguid("ATTT")
'csseguid=ot6JPLeAeMmfztW1736Kc6DAqlo'
>>> lsseguid("ATTT")
'lsseguid=ot6JPLeAeMmfztW1736Kc6DAqlo'
>>> csseguid("TTTA")
'csseguid=ot6JPLeAeMmfztW1736Kc6DAqlo'
>>> lsseguid("TTTA")
'lsseguid=8zCvKwyQAEsbPtC4yTV-pY0H93Q'

seguid.ldseguid(watson: str, crick: str, alphabet: str = '{DNA}', form: str = 'long') -> str

SEGUID checksum for linear double-stranded DNA.

This function calculates the SEGUID checksum for a double-stranded DNA (dsDNA) sequence defined by two strings representing the upper (Watson) and the complementary (Crick) DNA strands. Watson and Crick strands should be of equal length. Optional single-stranded DNA regions at the ends are indicated by a dash (-) in either strand.

The algorithm first selects the lexicographically smaller of the Watson and Crick strands. The two strings, each written 5’-to-3’, are then joined with the smaller one first, separated by a semicolon (;), and the lsseguid() function is used on the resulting string.

For example, consider the linear dsDNA sequence defined by watson="-TATGCC" and crick="-GCATAC" as in:

dsDNA    SEGUID checksum

-TATGCC  ldseguid=rr65d6AYuP-CdMaVmdw3L9FPt6I
 |||||
CATACG-

-GCATAC  ldseguid=rr65d6AYuP-CdMaVmdw3L9FPt6I
 |||||
CCGTAT-

The SEGUID algorithm identifies the "-GCATAC" strand as the lexicographically smaller of the two. Then it concatenates the two as:

"-GCATAC" + ";" + "-TATGCC"

and calculates the final checksum based on that sequence.
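The selection and concatenation steps can be sketched as follows. The functions canonical_spec() and ldseguid_sketch() are hypothetical names written from the description above, not the package's implementation; in particular, the sketch does not check that the two strands are actually complementary or that they use a valid alphabet.

```python
import base64
import hashlib


def canonical_spec(watson: str, crick: str) -> str:
    """Join the two strands with ';', lexicographically smaller strand first."""
    if len(watson) != len(crick):
        raise ValueError("watson and crick must be of equal length")
    first, second = sorted((watson, crick))
    return first + ";" + second


def ldseguid_sketch(watson: str, crick: str) -> str:
    spec = canonical_spec(watson, crick)
    digest = hashlib.sha1(spec.encode("ascii")).digest()
    return "ldseguid=" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


print(canonical_spec("-TATGCC", "-GCATAC"))  # -GCATAC;-TATGCC
print(ldseguid_sketch("-TATGCC", "-GCATAC") == ldseguid_sketch("-GCATAC", "-TATGCC"))  # True
```

Sorting the two strands before joining them is what makes the checksum independent of which strand is passed as watson and which as crick.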

The checksum is prefixed with ldseguid=.

Examples

>>> ldseguid("-TATGCC", "-GCATAC")
'ldseguid=rr65d6AYuP-CdMaVmdw3L9FPt6I'
>>> ldseguid("-GCATAC", "-TATGCC")
'ldseguid=rr65d6AYuP-CdMaVmdw3L9FPt6I'

seguid.cdseguid(watson: str, crick: str, alphabet: str = '{DNA}', form: str = 'long') -> str

SEGUID checksum for circular double-stranded DNA.

The cdseguid() checksum is the lsseguid() checksum calculated for the lexicographically smallest string rotation of a double-stranded DNA sequence. It is only defined for circular double-stranded sequences.

The checksum is prefixed with cdseguid=.
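One way to picture this is the sketch below. It rests on assumptions that this documentation does not spell out: each strand is rotated to its smallest rotation, the strand with the smaller result is placed first, the other strand is rotated by the matching amount so that the base pairing is preserved, and the two are joined with a semicolon as for ldseguid(). All function names are hypothetical, and no alphabet or complementarity checks are made.

```python
import base64
import hashlib


def rotate_left(seq: str, k: int) -> str:
    k %= len(seq)
    return seq[k:] + seq[:k]


def min_rotation_index(seq: str) -> int:
    """Naive search for the left-rotation amount giving the smallest rotation."""
    return min(range(len(seq)), key=lambda i: seq[i:] + seq[:i])


def cd_spec_sketch(watson: str, crick: str) -> str:
    n = len(watson)
    if n == 0 or n != len(crick):
        raise ValueError("watson and crick must be non-empty and of equal length")
    kw, kc = min_rotation_index(watson), min_rotation_index(crick)
    w_min, c_min = rotate_left(watson, kw), rotate_left(crick, kc)
    # Rotating one strand left by k corresponds to rotating its
    # reverse complement left by n - k.
    if w_min <= c_min:
        return w_min + ";" + rotate_left(crick, n - kw)
    return c_min + ";" + rotate_left(watson, n - kc)


def cdseguid_sketch(watson: str, crick: str) -> str:
    spec = cd_spec_sketch(watson, crick)
    digest = hashlib.sha1(spec.encode("ascii")).digest()
    return "cdseguid=" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


print(cd_spec_sketch("TATGCCAA", "TTGGCATA"))  # AATATGCC;GGCATATT
print(cd_spec_sketch("TTGGCATA", "TATGCCAA"))  # AATATGCC;GGCATATT
```

Under these assumptions the result is unchanged when the strands are swapped or when the circular sequence is cut open at a different position, which is the behaviour the double-stranded examples at the top of this page show for cdseguid().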