SEGUID v2: Checksums for Linear, Circular, Single- and Double-Stranded Biological Sequences¶
The seguid Python package implements functions for calculating biological sequence checksums for linear, circular, single- and double-stranded sequences based on either the SEGUID v2 algorithm (Pereira et al., 2024) or the original SEGUID v1 (Babnigg & Giometti, 2006).
Single-stranded DNA examples¶
>>> from seguid import *
## Linear single-stranded DNA
>>> lsseguid("TATGCCAA")
'lsseguid=EevrucUNYjqlsxrTEK8JJxPYllk'
## Linear single-stranded DNA
>>> lsseguid("AATATGCC")
'lsseguid=XsJzXMxgv7sbpqIzFH9dgrHUpWw'
## Circular single-stranded DNA
>>> csseguid("TATGCCAA")
'csseguid=XsJzXMxgv7sbpqIzFH9dgrHUpWw'
## Same rotating two basepairs
>>> csseguid("GCCAATAT")
'csseguid=XsJzXMxgv7sbpqIzFH9dgrHUpWw'
Double-stranded DNA examples¶
>>> from seguid import *
## Linear double-stranded DNA
## AATATGCC
## ||||||||
## TTATACGG
>>> ldseguid("AATATGCC", "GGCATATT")
'ldseguid=dUxN7YQyVInv3oDcvz8ByupL44A'
## Same swapping Watson and Crick
>>> ldseguid("GGCATATT", "AATATGCC")
'ldseguid=dUxN7YQyVInv3oDcvz8ByupL44A'
## Circular double-stranded DNA
>>> cdseguid("TATGCCAA", "TTGGCATA")
'cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A'
## Same swapping Watson and Crick
>>> cdseguid("TTGGCATA", "TATGCCAA")
'cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A'
## Same rotating two basepairs (= minimal rotation by Watson)
>>> cdseguid("AATATGCC", "GGCATATT")
'cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A'
Installation¶
The seguid package is available on PyPI and can be installed as:
python -m pip install --user seguid
Module contents¶
SEGUID checksums for linear, circular, single- and double-stranded sequences¶
This package provides four functions, lsseguid()
, csseguid()
, ldseguid()
, and cdseguid()
(Table 1), for calculating SEGUID v2 checksums, and one function, seguid()
, for calculating SEGUID v1 checksums. SEGUID v2 is described in Pereira et al. (2024), and SEGUID v1 in Babnigg & Giometti (2006).
Topology |
Strandedness |
Function |
---|---|---|
linear |
single |
|
circular |
single |
|
linear |
double |
|
circular |
double |
|
Table 1: The four functions that calculate SEGUID v2 checksums for biological sequences with linear or circular topology and a single or two complementary strands.
This package works without external dependencies, but csseguid()
and cdseguid()
can be made faster by installing pydivsufsort, provides a faster implementation of the algorithm for finding the
smallest string rotation.
Usage¶
lsseguid(seq, alphabet="{DNA}", form="long")
csseguid(seq, alphabet="{DNA}", form="long")
ldseguid(watson, crick, alphabet="{DNA}", form="long")
cdseguid(watson, crick, alphabet="{DNA}", form="long")
seguid(seq, alphabet="{DNA}", form="long")
Function arguments¶
seq (string) The sequence for which the checksum should be calculated. The sequence may only comprise of symbols in the alphabet specified by the alphabet argument.
watson, crick (strings) Two reverse-complementary DNA sequences. Both sequences should be specified in the 5’-to-3’ direction.
alphabet (string)
The type of sequence used. If "{DNA}"
(default), then the input is a DNA sequence. If "{RNA}"
, then the
input is an RNA sequence. If "{protein}"
, then the input is an amino-acid sequence. If "{DNA-extended}"
or "{RNA-extended}"
, then the input is a DNA or RNA sequence specified an extended set of symbols, including
IUPAC symbols (4). If "{protein-extended}"
, then the input is an amino-acid sequence with an extended set
of symbols, including IUPAC symbols (5). A custom alphabet may also be used. A non-complementary alphabet
is specified as a comma-separated set of single symbols, e.g. "X,Y,Z"
. A complementary alphabet is specified
as a comma-separated set of paired symbols, e.g. "AT,CG"
. It is also possible to extend a pre-defined
alphabet, e.g. "{DNA},XY"
.
form (string)
How the checksum is presented. If "long"
(default), the full-length checksum is returned. If "short"
,
the short, six-digit checksum is returned. If "both"
, both the short and the long checksums are returned.
Value¶
The SEGUID functions return a single string, if form is either "long"
or "short"
. If form
is "both"
, then a tuple of two strings is returned, where the first component holds the "short"
checksum and the second the "long"
checksum. The long checksum, without the prefix, is a string with 27
characters. The short checksum, without the prefix, is the first six characters of the long checksum.
All long checksums are prefixed with a label indicating which SEGUID method was used.
All functions produce checksums using the Base64url encoding (“Base 64 Encoding with URL and Filename Safe Alphabet”), with the exception for seguid()
,
which uses Base64 encoding. The “long” checksums returned are always 27-character long. This is because the SHA-1 hash (6) is 160-bit long
(20 bytes), which result in the encoded representation always end with a padding character (=
) so that the
length is a multiple of four character. We relax this requirement, by dropping the padding character.
Base64 and Base64url encodings¶
The Base64url encoding is the Base64 encoding with non-URL-safe characters substituted with URL-safe ones (Josefsson, 2006).
Specifically, the plus symbol (+
) is replaced by the minus symbol (-
), and the forward slash (/
) is replaced
by the underscore symbol (_
).
The Base64 checksum used for the original SEGUID checksum is not guaranteed to contain symbols
that can safely be used as-is in a Uniform Resource Locator (URL). Specifically, it may consist of forward
slashes (/
) and plus symbols (+
), which are characters that carry special meaning in a URL. For the same
reason, a Base64 checksum cannot safely be used as a file or directory name, because it may have a forward slash.
References
G Babnigg & CS Giometti, A database of unique protein sequence identifiers for proteome studies. Proteomics. 2006 Aug;6(16):4514-22, doi:10.1002/pmic.200600032.
H Pereira, PC Silva, WM Davis, L Abraham, G Babnigg, H Bengtsson & B Johansson, SEGUID v2: Extending SEGUID Checksums for Circular, Linear, Single- and Double-Stranded Biological Sequences, bioRxiv, doi:10.1101/2024.02.28.582384.
S Josefsson, The Base16, Base32, and Base64 Data Encodings, RFC 4648, October 2006, doi:10.17487/RFC4648.
Wikipedia article ‘Nucleic acid notation’, February 2024, https://en.wikipedia.org/wiki/Nucleic_acid_notation.
Wikipedia article ‘Amino acids’, February 2024, https://en.wikipedia.org/wiki/Amino_acid.
Wikipedia article ‘SHA-1’ (Secure Hash Algorithm 1), December 2023, https://en.wikipedia.org/wiki/SHA-1.
- seguid.seguid(seq: str, alphabet: str = '{DNA}', form: str = 'long') str [source]¶
SEGUID v1 checksum for linear protein or single-stranded DNA.
Warning
seguid()
(obsolete) is superseded bylsseguid()
(recommended).Given a nucleotide or amino-acid sequence
seq
in uppercase, the function returns a string containing the SEquence Globally Unique IDentifier (SEGUID). The SEGUID is defined as the Base64 encoded SHA1 checksum calculated for the sequence in uppercase with the trailing padding symbol (=
) removed.The original definition of the SEGUID v1 checksum algorithm (Babnigg & Giometti, 2006) included transformation to uppercase before calculating the checksum. Here,
seguid()
does not coerce the input sequence to upper case. If your input sequence has lower-case symbols, you can usestr.upper()
to achieve what the original method does.seguid()
only accepts symbols as specified by the alphabet argument. Thus, our implementation is more conservative, which has the benefit of lowering the risk of passing the incorrect sequence by mistake.The resulting checksum string may contain forward slash (
/
) and plus-sign (+
) symbols. These characters cannot be a part of a Uniform Resource Locator (URL) or a filename on some operating systems. The SEGUID v2 checksum produced bylsseguid()
is similar to the SEGUID v1 checksum byseguid()
, but uses the Base64url encoding that do not produce these characters.The checksum is prefixed with
seguid=
.Examples
>>> seguid("AT") 'seguid=Ax/RG6hzSrMEEWoCO1IWMGska+4'
- seguid.lsseguid(seq: str, alphabet: str = '{DNA}', form: str = 'long') str [source]¶
SEGUID checksum for linear single-stranded DNA.
The same as the
seguid()
function except that forward slashes (/
) and plus signs (+
) in the resulting checksum are replaced by underscores (_
) and minus signs (-
), respectively following the Base64url standard in RFC 4648.This checksum is applicable to linear single-stranded DNA sequences and protein sequences, among other sequences. If protein sequences are analyzed, the alphabet argument should be
"{protein}"
or"{protein-extended}"
.The checksum is prefixed with
lsseguid=
.Examples
>>> lsseguid("AT") 'lsseguid=Ax_RG6hzSrMEEWoCO1IWMGska-4'
- seguid.csseguid(seq: str, alphabet: str = '{DNA}', form: str = 'long') str [source]¶
SEGUID checksum for circular single-stranded DNA.
The
csseguid()
is thelsseguid()
checksum calculated for the lexicographically smallest string rotation ofseq
. This checksum is Only defined for circular single-stranded sequences.The checksum is prefixed with
csseguid=
.Examples
>>> csseguid("ATTT") 'csseguid=ot6JPLeAeMmfztW1736Kc6DAqlo' >>> lsseguid("ATTT") 'lsseguid=ot6JPLeAeMmfztW1736Kc6DAqlo' >>> csseguid("TTTA") 'csseguid=ot6JPLeAeMmfztW1736Kc6DAqlo' >>> lsseguid("TTTA") 'lsseguid=8zCvKwyQAEsbPtC4yTV-pY0H93Q'
- seguid.ldseguid(watson: str, crick: str, alphabet: str = '{DNA}', form: str = 'long') str [source]¶
SEGUID checksum for linear double-stranded DNA.
This function calculates the SEGUID checksum for a double-stranded DNA (dsDNA) sequence defined by two strings representing the upper (Watson) and the complementary (Crick) DNA strands. Watson and Crick strands should be of equal length. Optional single-stranded DNA regions in the ends are indicated by a dash (
-
) in either strand.The algorithm first selects the lexicographically smallest of the Watson and Crick strands. The two string are then joined 5’-3’, separated by a semicolon (
;
), and thelsseguid()
function is used on the resulting string.For example, consider the linear dsDNA sequence defined by
watson="-TATGCC"
andcrick="-GCATAC"
as in:dsDNA SEGUID checksum -TATGCC ldseguid=rr65d6AYuP-CdMaVmdw3L9FPt6I ||||| CATACG- -GCATAC ldseguid=rr65d6AYuP-CdMaVmdw3L9FPt6I ||||| CCGTAT-
The SEGUID algorithm identifies the
"-GCATAC"
strand as the lexicographic smallest of the two. Then it concattenates the two as:"-GCATAC" + ";" + "-TATGCC"
and calculates the final checksum based on that sequence.
The checksum is prefixed with
ldseguid=
.Examples
>>> ldseguid("-TATGCC", "-GCATAC") 'ldseguid=rr65d6AYuP-CdMaVmdw3L9FPt6I' >>> ldseguid("-GCATAC", "-TATGCC") 'ldseguid=rr65d6AYuP-CdMaVmdw3L9FPt6I'
- seguid.cdseguid(watson: str, crick: str, alphabet: str = '{DNA}', form: str = 'long') str [source]¶
SEGUID checksum for circular double-stranded DNA.
The
cdseguid()
is thelsseguid()
checksum calculated for the lexicographically smallest string rotation of a double-stranded DNA sequence. Only defined for circular sequences.The checksum is prefixed with
cdseguid=
.