crops.elements.sequences module

Sequence and multi-sequence objects are defined here.

guess_type(inseq)[source]

Return the biological type of the sequence as guessed from residue types.

Parameters:inseq (str) – Sequence to be evaluated.
Returns:Sequence type (‘Protein’ or ‘DNA’ or ‘RNA’ or ‘Unknown’).
Return type:str
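
As a rough illustration of guessing a type from residue content, the sketch below classifies a string by the alphabet it uses. The name guess_type_sketch and its rules are assumptions made for this example and are not the CROPS implementation; short strings such as 'ACG' are ambiguous and fall into the first matching class here.

```python
def guess_type_sketch(inseq: str) -> str:
    """Guess a biological type from the letters in inseq (illustrative rules only)."""
    if not isinstance(inseq, str):
        raise TypeError("inseq must be a string")
    # Ignore gap/crop symbols and letter case when collecting the alphabet.
    residues = {c for c in inseq.upper() if c.isalpha()}
    if not residues:
        return "Unknown"
    if residues <= set("ACGT"):
        return "DNA"
    if residues <= set("ACGU"):
        return "RNA"
    if residues <= set("ACDEFGHIKLMNPQRSTVWY"):
        return "Protein"
    return "Unknown"


print(guess_type_sketch("GATTACA"))     # DNA
print(guess_type_sketch("GAUUACA"))     # RNA
print(guess_type_sketch("MRTLWIMAVL"))  # Protein
```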
class oligoseq(oligomer_id=None, imer=None)[source]

Bases: object

An object grouping several crops.elements.sequences.sequence objects pertaining to a common oligomer.

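Parameters:
  • oligomer_id (str, optional) – Oligomer sequence identifier (e.g. PDB id), defaults to None.
  • imer (dict [str, crops.elements.sequences.sequence], optional) – Sequences making up the oligomer, defaults to None.
Variables:
  • id (str) – Oligomer sequence identifier (e.g. PDB id).
  • imer (dict [str, crops.elements.sequences.sequence]) – Container of the crops.elements.sequences.sequence objects making up the oligomer.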
Raises:

TypeError – If the input formats are wrong.

Example:
>>> from crops.elements import sequences as ces
>>> my_oligoseq = ces.oligoseq(oligomer_id='exampleID')
>>> my_oligoseq.add_monomer('header_example', 'GATTACA', nid='mychain')
>>> my_oligoseq.add_monomer('another_header', 'TACATACA')
>>> my_oligoseq.nchains()
2
>>> my_oligoseq.length('mychain')
7
>>> my_oligoseq.write('/path/to/output/dir/')
>>> print(my_oligoseq)
Protein/polynucleotide sequence object: (id='exampleID', # chains = 2)
>>> my_oligoseq.purge()
>>> my_oligoseq.nchains()
0
add_sequence(newseq)[source]

Add a new crops.elements.sequences.sequence to the object.

Parameters:

newseq (crops.elements.sequences.sequence) – Sequence object.

Raises:
chainlist()[source]

Return a set with all the chain names in the object.

Returns:Chain names in crops.elements.sequences.oligoseq.
Return type:set [str]
copy()[source]
deepcopy()[source]
del_sequence(seqid)[source]

Remove the selected crops.elements.sequences.sequence from the object.

Parameters:seqid (str) – Doomed sequence’s identifier.
Raises:TypeError – If seqid is not a string.
id
imer
length(seqid)[source]

Return the length of a certain sequence.

Parameters:

seqid (str) – ID of crops.elements.sequences.sequence.

Raises:
Returns:

Length of crops.elements.sequences.sequence.

Return type:

int

nchains()[source]

Return number of chains in object, counting all sequence objects contained.

Returns:Number of chains in object, counting all crops.elements.sequences.sequence contained.
Return type:int
nseqs()[source]

Return number of sequence objects in object.

Returns:Number of crops.elements.sequences.sequence objects in object.
Return type:int
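
The difference between nseqs() and nchains() is easiest to see when several chains share one sequence. The sketch below models the container as a plain dictionary from sequence ID to chain names; this layout and the *_sketch helper names are assumptions for illustration, not the actual oligoseq internals.

```python
# Hypothetical stand-in for the oligomer content:
# sequence ID -> names of the chains that share that sequence.
imer_sketch = {
    "1": {"A", "B"},  # two identical chains stored as a single sequence
    "2": {"C"},
}


def nseqs_sketch(imer):
    """Number of distinct sequence objects."""
    return len(imer)


def nchains_sketch(imer):
    """Number of chains, summed over all sequence objects."""
    return sum(len(chains) for chains in imer.values())


def chainlist_sketch(imer):
    """Set with every chain name found in the container."""
    return set().union(*imer.values())


print(nseqs_sketch(imer_sketch))    # 2
print(nchains_sketch(imer_sketch))  # 3
```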
purge()[source]

Clear the object’s content without deleting the object itself.

set_cropmaps(mapdict, cropmain=False)[source]

Set the parsed cropmaps obtained from crops.iomod.parsers.parsemapfile.

Parameters:
  • mapdict (dict [str, dict [str, dict [int, int]]]) – Parsed maps for this specific object.
  • cropmain (bool, optional) – If True, ‘mainseq’ is cropped and ‘fullseq’ and ‘cropseq’ are generated. If ‘mainseq’ was edited beforehand, this operation yields wrong results. Defaults to False.
Raises:

TypeError – If mapdict does not have the appropriate format.
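
The nested type dict [str, dict [str, dict [int, int]]] can be hard to picture. The sketch below shows one possible value and a format check that raises TypeError. The meaning given to the keys (sequence ID on the outside, map name such as 'cropmap' or 'cropbackmap' inside) is an assumption, as is the helper check_mapdict_sketch; neither is taken from the CROPS source.

```python
def check_mapdict_sketch(mapdict):
    """Raise TypeError unless mapdict looks like dict[str, dict[str, dict[int, int]]]."""
    if not isinstance(mapdict, dict):
        raise TypeError("mapdict must be a dictionary")
    for seqid, maps in mapdict.items():
        if not isinstance(seqid, str) or not isinstance(maps, dict):
            raise TypeError("outer level must map str to dict")
        for mapname, resmap in maps.items():
            if not isinstance(mapname, str) or not isinstance(resmap, dict):
                raise TypeError("middle level must map str to dict")
            for old, new in resmap.items():
                if not isinstance(old, int) or not isinstance(new, int):
                    raise TypeError("residue maps must map int to int")


# Assumed layout: sequence '1' keeps original residues 2 and 4.
example_mapdict = {
    "1": {
        "cropmap": {2: 1, 4: 2},
        "cropbackmap": {1: 2, 2: 4},
    }
}
check_mapdict_sketch(example_mapdict)  # passes silently
```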

whatseq(chain)[source]

Return the identifier of the sequence to which a given chain belongs.

Parameters:chain (str) – The chain ID.
Returns:The ID of the crops.elements.sequences.sequence containing that chain.
Return type:str
write(outdir, infix='', split=False, oneline=False)[source]

Write all crops.elements.sequences.sequence objects to .fasta file or string.

Parameters:
  • outdir (str) – Output directory or ‘string’.
  • infix (str, optional) – Filename tag to distinguish from original input file, defaults to “”.
  • split (bool, optional) – If True, identical sequences are dumped for each chain, defaults to False.
  • oneline (bool, optional) – If True, sequences are not split into 80-residue lines, defaults to False.
Raises:

FileNotFoundError – Output directory not found.

class sequence(seqid=None, oligomer=None, seq=None, chains=None, source=None, header=None, biotype=None, extrainfo=None)[source]

Bases: object

A crops.elements.sequences.sequence object representing a single chain sequence.

The crops.elements.sequences.sequence class represents a data structure to hold all sequence versions and other useful information characterising it. It contains functions to store, manipulate and organise sequence versions.

Parameters:
  • seqid (str, optional) – Sequence identifier. Can be used alone or together with oligomer ID, defaults to None.
  • oligomer (str, optional) – Oligomer identifier. Sometimes as important as seqid, defaults to None.
  • seq (str, optional) – Sequence string, defaults to None.
  • chains (set [str], optional) – The names of chains having this sequence, defaults to None.
  • source (str, optional) – Source of the sequence, defaults to None.
  • header (str, optional) – Standard .fasta header, starting with “>”, defaults to None.
  • biotype (str, optional) – Type of molecule (‘Protein’, ‘DNA’, ‘RNA’…), defaults to None.
  • extrainfo (str, optional) – Other useful information about the sequence, defaults to None.
Variables:
  • name (str) – Sequence identifier.
  • oligomer_id (str) – Oligomer identifier.
  • chains (set [str]) – The names of chains having this sequence.
  • seqs (dict [str, str]) – The set of sequences, including default “mainseq”.
  • source (str) – Source of the sequence.
  • source_headers (list [str]) – A list of headers from input files.
  • crops_header (str) – A new header containing the information from the object that will be used when printing sequence and cropmap.
  • biotype (str) – Type of molecule (‘Protein’, ‘DNA’, ‘RNA’…).
  • infostring (str) – Other useful information about the sequence.
  • cropmap (dict [int, int]) – A dictionary mapping residue numbers from original sequence to cropped sequence.
  • cropbackmap (dict [int, int]) – A dictionary mapping residue numbers from cropped sequence to original sequence.
  • msa (Any) – A free variable not used by CROPS itself.
  • cropmsa (Any) – A free variable not used by CROPS itself.
  • intervals (crops.elements.intervals.intinterval) – The integer interval object containing the cropping information.
Raises:

TypeError – For wrong input formats.

Example:
>>> from crops.elements import sequences as ces
>>> myseq = ces.sequence(seqid='1', oligomer='exampleID')
>>> myseq.mainseq('GATTACA')
'GATTACA'
>>> myseq.mainseq()
'GATTACA'
>>> myseq.chains = {'A', 'B'}
>>> myseq.addseq('gapseq','GAT--C-')
>>> myseq.addseq('cobra','TACATACA')
>>> myseq.length()
7
>>> myseq.ngaps('gapseq')
3
>>> myseq.guess_biotype()
'DNA'
>>> print(myseq)
Sequence object >EXAMPLEID_1|Chains A,B (seq=GATTACA, type=DNA, length=7)
>>> myseq.source = 'Example'
>>> myseq.addseq('cropseq', '+A+T++')
>>> myseq.addseq('cropgapseq', '+A+-++')
>>> myseq.full_length()
7
>>> myseq.mainseq('AT')
'AT'
>>> myseq.ncrops()
4
>>> myseq.update_cropsheader()
>>> myseq.cropinfo()
'#Residues cropped: 4 (1 not from terminal segments) ; % cropped: 66.67 (16.67 not from terminal segments)'
>>> myseq.dump(out='string')
'>crops|exampleID_1|Chains A,B|Source: Example|#Residues cropped: 4 (1 not from terminal segments) ; % cropped: 66.67 (16.67 not from terminal segments)\nAT\n'
Example:
>>> from crops.elements import sequences as ces
>>> from crops.iomod import parsers as cip
>>> myseq = cip.parseseqfile('7M6C.fasta')
>>> myseq
Sequence object: (>7M6C_1|Chain A, seq=MRTLWIMAVL[...]KPLCKKADPC, type=Undefined, length=138)
>>> myseq.guess_biotype()
'Protein'
>>> myseq
Sequence object: (>7M6C_1|Chain A, seq=MRTLWIMAVL[...]KPLCKKADPC, type=Protein, length=138)
addseq(newid, newseq)[source]

Add sequence to seqs dictionary.

Parameters:
  • newid (str) – New sequence’s identifier.
  • newseq (str) – New sequence.
Raises:
biotype
chains
copy()[source]
cropbackmap
cropinfo()[source]

Return a string containing statistics about the cropped residues.

Returns:Statistics on number of crops.
Return type:str
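
To make the statistics string concrete, the sketch below formats it from three numbers. The function cropinfo_sketch and its arguments are hypothetical; the reference length of 6 is chosen only so that the percentages match the string shown in the class example, and the way CROPS picks that length is not stated here.

```python
def cropinfo_sketch(ncrops: int, nmid: int, ref_length: int) -> str:
    """Format crop statistics from precomputed counts (illustrative only)."""
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")
    pct = 100 * ncrops / ref_length
    pct_mid = 100 * nmid / ref_length
    return (
        f"#Residues cropped: {ncrops} ({nmid} not from terminal segments) ; "
        f"% cropped: {pct:.2f} ({pct_mid:.2f} not from terminal segments)"
    )


print(cropinfo_sketch(4, 1, 6))
```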
cropmap
cropmsa
crops_header
deepcopy()[source]
delseq(delid=None, wipeall=False)[source]

Delete sequence(s) from the seqs dictionary.

Parameters:
  • delid (str, optional) – ID of sequence to be deleted, defaults to None.
  • wipeall (bool, optional) – If True, all the sequences are deleted, defaults to False.
Raises:

TypeError – If delid is not a string or wipeall is not a boolean.

dump(out, split=False, oneline=False)[source]

Write header and main sequence to a file. If the file exists, output is appended.

Parameters:
  • out (str, file) – An output filepath (str), ‘string’, or an open file.
  • split (bool, optional) – If True, identical sequences are dumped for every chain, defaults to False.
  • oneline (bool, optional) – If True, sequences are not split into 80-residue lines, defaults to False.
Raises:
  • TypeError – If out is neither a string nor an open file.
  • KeyError – If object contains no chains.
Returns:

A string containing the output if and only if out == ‘string’.

Return type:

str
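
The effect of split and oneline on the output can be sketched with a small formatter that returns a FASTA string. The header layout ('|Chains A,B' versus one '|Chain X' record per chain) and the helper fasta_sketch are assumptions for illustration; only the 80-residue wrapping, the per-chain duplication and the KeyError on an empty chain set come from the description above.

```python
def fasta_sketch(header, seq, chains, split=False, oneline=False, width=80):
    """Return FASTA text for seq; wrap at `width` residues unless oneline is True."""
    if not chains:
        raise KeyError("object contains no chains")
    if oneline:
        lines = [seq]
    else:
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    body = "\n".join(lines) + "\n"
    if split:
        # One identical record per chain.
        return "".join(f">{header}|Chain {c}\n{body}" for c in sorted(chains))
    return f">{header}|Chains {','.join(sorted(chains))}\n{body}"


print(fasta_sketch("exampleID_1", "GATTACA", {"A", "B"}, split=True), end="")
```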

dumpmap(out, split=False)[source]

Write header and cropmap to a file. If the file exists, output is appended.

Parameters:
  • out (str, file) – An output filepath (str) or an open file.
  • backmap (bool, optional) – If True, the output will be self.cropbackmap, defaults to False.
  • split (bool, optional) – If True, identical maps are dumped for every chain, defaults to False.
Raises:
  • TypeError – If out is neither a string nor an open file.
  • ValueError – If one or both of cropmap and cropbackmap are empty.
  • KeyError – If object contains no chains.
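
The relationship between cropmap and cropbackmap can be sketched by deriving both from a sequence in which removed residues are marked with '+' or '*'. The helper build_maps_sketch is hypothetical, and two choices in it are assumptions rather than documented CROPS behaviour: residues are numbered from 1, and removed residues are simply left out of cropmap.

```python
def build_maps_sketch(cropseq: str):
    """Build (cropmap, cropbackmap) from a sequence with '+'/'*' marking removed residues."""
    cropmap = {}      # original residue number -> cropped residue number
    cropbackmap = {}  # cropped residue number -> original residue number
    new = 0
    for old, residue in enumerate(cropseq, start=1):
        if residue in "+*":
            continue  # removed residue: no entry in either map (assumption)
        new += 1
        cropmap[old] = new
        cropbackmap[new] = old
    return cropmap, cropbackmap


cropmap, cropbackmap = build_maps_sketch("+A+T++")
print(cropmap)      # {2: 1, 4: 2}
print(cropbackmap)  # {1: 2, 2: 4}
```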
full_length()[source]

Return the length of the full sequence. If no full sequence is found, the main sequence is taken as the full sequence and saved as such.

Returns:Length of the full sequence.
Return type:int
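
The fallback described above can be sketched with a plain dictionary standing in for seqs. The key 'fullseq' is the one mentioned under set_cropmaps; the function full_length_sketch itself is an assumption for illustration.

```python
def full_length_sketch(seqs: dict) -> int:
    """Return len(seqs['fullseq']), storing the main sequence as 'fullseq' if absent."""
    if "fullseq" not in seqs:
        seqs["fullseq"] = seqs["mainseq"]
    return len(seqs["fullseq"])


seqs = {"mainseq": "GATTACA"}
print(full_length_sketch(seqs))  # 7
seqs["mainseq"] = "AT"           # later edits to the main sequence...
print(full_length_sketch(seqs))  # 7 (...do not change the stored full sequence)
```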
guess_biotype()[source]

Save the guessed biotype and return it.

Returns:Guessed biotype.
Return type:str
infostring
intervals
length()[source]

Return the length of the main sequence.

Returns:Length of the main sequence.
Return type:int
mainseq(add=None)[source]

Return or modify the main sequence.

Parameters:add (str, optional) – If given, the main sequence is replaced by ‘add’, defaults to None.
Raises:TypeError – If ‘add’ is given and is not a string.
Returns:The (new) main sequence.
Return type:str
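
A method that both reads and replaces a value, as mainseq does, can be sketched as follows. The class SeqSketch is a hypothetical stand-in that only mirrors the documented behaviour (optional add argument, TypeError on non-strings, new value returned); it is not the CROPS class.

```python
class SeqSketch:
    """Minimal stand-in holding sequence versions in a dictionary."""

    def __init__(self):
        self.seqs = {"mainseq": ""}

    def mainseq(self, add=None):
        """Return the main sequence, replacing it first if `add` is given."""
        if add is not None:
            if not isinstance(add, str):
                raise TypeError("'add' must be a string")
            self.seqs["mainseq"] = add
        return self.seqs["mainseq"]


s = SeqSketch()
print(s.mainseq("GATTACA"))  # GATTACA
print(s.mainseq())           # GATTACA
```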
msa
name
ncrops(seqid='cropseq', offterminals=False, offmidseq=False)[source]

Return the number of cropped elements (‘+’, ‘*’) in a sequence.

Parameters:
  • seqid (str, optional) – The ID of the sequence containing the cropped elements, defaults to ‘cropseq’.
  • offterminals (bool, optional) – If True, count only the elements removed from terminal segments, defaults to False.
  • offmidseq (bool, optional) – If True, count only the elements removed from outside the terminal segments, defaults to False.
Raises:

TypeError – If seqid is not a string, or offterminals, offmidseq are not boolean.

Returns:

Number of cropped elements in seqid according to interval chosen. If seqid not found, 0 is returned.

Return type:

int
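
The distinction between terminal and mid-sequence crops can be sketched by treating the leading and trailing runs of crop symbols as the terminal segments. This reading, the helper ncrops_sketch, and the choice to return the total when both flags are set are assumptions for illustration.

```python
CROP_SYMBOLS = "+*"


def ncrops_sketch(cropseq, offterminals=False, offmidseq=False):
    """Count crop symbols in cropseq, optionally restricted to terminal or mid segments."""
    if not isinstance(cropseq, str):
        raise TypeError("cropseq must be a string")
    if not isinstance(offterminals, bool) or not isinstance(offmidseq, bool):
        raise TypeError("offterminals and offmidseq must be boolean")
    total = sum(1 for c in cropseq if c in CROP_SYMBOLS)
    lead = len(cropseq) - len(cropseq.lstrip(CROP_SYMBOLS))
    trail = len(cropseq) - len(cropseq.rstrip(CROP_SYMBOLS))
    # min() avoids double counting when the whole string is crop symbols.
    terminal = min(total, lead + trail)
    if offterminals and not offmidseq:
        return terminal
    if offmidseq and not offterminals:
        return total - terminal
    return total


print(ncrops_sketch("+A+T++"))                     # 4
print(ncrops_sketch("+A+T++", offterminals=True))  # 3
print(ncrops_sketch("+A+T++", offmidseq=True))     # 1
```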

ngaps(seqid='gapseq')[source]

Return the number of gaps (‘-’) in a sequence.

Parameters:seqid (str, optional) – The ID of the sequence containing the gaps, defaults to ‘gapseq’.
Raises:TypeError – If seqid is not a string.
Returns:Number of gaps in seqid. If the sequence stored under seqid is a list of several models, a list is returned. If seqid is not found, 0 is returned.
Return type:int or list [int]
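
The int-or-list return value can be sketched as below, where a single string gives one count and a list of models gives one count per model. The helper ngaps_sketch takes the sequence directly rather than an ID; that simplification is an assumption made to keep the example self-contained.

```python
def ngaps_sketch(gapseq):
    """Count '-' in a sequence string, or in each string of a list of models."""
    if isinstance(gapseq, list):
        return [model.count("-") for model in gapseq]
    return gapseq.count("-")


print(ngaps_sketch("GAT--C-"))               # 3
print(ngaps_sketch(["GAT--C-", "GATTACA"]))  # [3, 0]
```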
oligomer_id
seqs
source
source_headers
update_cropsheader()[source]

Update crops_header. Useful after changing any of the information held by the sequence.