Genetic sequences, XML and the Big Bang

What do genetic sequences, XML, and the Big Bang have in common?

WIPO (the World Intellectual Property Organization) published Standard ST.26 in Dec. 2021, which describes how to specify a nucleotide or amino acid sequence, and they refer to the date for the switchover (July 1, 2022) as the 'big bang implementation date'.

The standard allows for exact descriptions of amino acids (including D-amino acids and amino acids containing modified or synthetic side chains), DNA sequences and RNA sequences, using a particular format called 'XML', or extensible markup language. XML itself defines a set of rules for encoding documents in a format that is both human-readable and machine-readable, which makes it good for storing, transmitting, and reconstructing arbitrary data.

Lest the wary reader fear this article constitutes a rabbit-hole of incomprehensible technobabble, we jump straight to an example: D-arginine, a venerable amino acid and an interesting example since it is a handed molecule, D being the dextro or right-handed version while L (levo) is what's left. Our two-dimensional file format (or even one-dimensional, since the text file is 'serializable' into a string of characters) will have to somehow describe the three-dimensional situation.

In this case the molecule is actually referred to by name, making things relatively simple - the XML for such a molecule looks like this:

<INSDFeature>
    <INSDFeature_key>SITE</INSDFeature_key>
    <INSDFeature_location>9</INSDFeature_location>
    <INSDFeature_quals>
        <INSDQualifier>
            <INSDQualifier_name>note</INSDQualifier_name>
            <INSDQualifier_value>D-Arginine</INSDQualifier_value>
        </INSDQualifier>
    </INSDFeature_quals>
</INSDFeature>
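
To show that the 'machine-readable' half of the XML promise holds up, here is a minimal sketch that reads the feature above using Python's standard library. The function name read_feature and the constant FEATURE_XML are made up for this illustration and are not part of ST.26 or of any WIPO tool.

```python
import xml.etree.ElementTree as ET

# The D-arginine feature from the article, pasted in as a string.
FEATURE_XML = """
<INSDFeature>
    <INSDFeature_key>SITE</INSDFeature_key>
    <INSDFeature_location>9</INSDFeature_location>
    <INSDFeature_quals>
        <INSDQualifier>
            <INSDQualifier_name>note</INSDQualifier_name>
            <INSDQualifier_value>D-Arginine</INSDQualifier_value>
        </INSDQualifier>
    </INSDFeature_quals>
</INSDFeature>
"""


def read_feature(xml_text):
    """Return (key, location, {qualifier name: value}) for one INSDFeature.

    Hypothetical helper for illustration only.
    """
    feature = ET.fromstring(xml_text.strip())
    key = feature.findtext("INSDFeature_key")
    location = feature.findtext("INSDFeature_location")
    # Collect every qualifier as a name -> value pair.
    quals = {
        q.findtext("INSDQualifier_name"): q.findtext("INSDQualifier_value")
        for q in feature.iter("INSDQualifier")
    }
    return key, location, quals


key, location, quals = read_feature(FEATURE_XML)
print(key, location, quals)
```

In other words, the snippet says: at position 9 of the sequence there is a site, and the note attached to it names the residue as D-Arginine.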

The full XML file has some extra info about the patent application - inventors, applicants, software that produced the XML, etc. A complete example is shown below, where the 'meat' is just the line near the bottom with the sequence "atagatagatagatgwrtkhg".

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ST26SequenceListing PUBLIC "-//WIPO//DTD Sequence Listing 1.3//EN" "ST26SequenceListing_V1_3.dtd">
<ST26SequenceListing originalFreeTextLanguageCode="en" nonEnglishFreeTextLanguageCode="ru" dtdVersion="V1_3" fileName="/Users/jr/Downloads/test2.xml" softwareName="WIPO Sequence" softwareVersion="2.1.0" productionDate="2022-06-29">
	<ApplicationIdentification>
		<IPOfficeCode>IL</IPOfficeCode>
		<ApplicationNumberText></ApplicationNumberText>
		<FilingDate>2022-06-29</FilingDate>
	</ApplicationIdentification>
	<ApplicantFileReference>rutman ip</ApplicantFileReference>
	<ApplicantName languageCode="en">Jeremy Rutman</ApplicantName>
	<ApplicantNameLatin>Jeremy Rutman</ApplicantNameLatin>
	<InventionTitle languageCode="en">Test sequence</InventionTitle>
	<SequenceTotalQuantity>1</SequenceTotalQuantity>
	<SequenceData sequenceIDNumber="1">
		<INSDSeq>
			<INSDSeq_length>21</INSDSeq_length>
			<INSDSeq_moltype>DNA</INSDSeq_moltype>
			<INSDSeq_division>PAT</INSDSeq_division>
			<INSDSeq_feature-table>
				<INSDFeature>
					<INSDFeature_key>source</INSDFeature_key>
					<INSDFeature_location>1..21</INSDFeature_location>
					<INSDFeature_quals>
						<INSDQualifier>
							<INSDQualifier_name>mol_type</INSDQualifier_name>
							<INSDQualifier_value>other DNA</INSDQualifier_value>
						</INSDQualifier>
						<INSDQualifier id="q2">
							<INSDQualifier_name>note</INSDQualifier_name>
							<INSDQualifier_value>Free Text</INSDQualifier_value>
							<NonEnglishQualifier_value>test</NonEnglishQualifier_value>
						</INSDQualifier>
						<INSDQualifier id="q3">
							<INSDQualifier_name>organism</INSDQualifier_name>
							<INSDQualifier_value>synthetic construct</INSDQualifier_value>
							<NonEnglishQualifier_value>ok</NonEnglishQualifier_value>
						</INSDQualifier>
					</INSDFeature_quals>
				</INSDFeature>
			</INSDSeq_feature-table>
			<INSDSeq_sequence>atagatagatagatgwrtkhg</INSDSeq_sequence>
		</INSDSeq>
	</SequenceData>
</ST26SequenceListing>

So as you can see, the 'action' is in the <INSDSeq_sequence> tag. In case you were wondering, the w, r, k, and h symbols are variables or options, each standing for one of two or three possible bases: w is a or t, r is a or g, k is g or t, and h is a, c or t.
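
To make the 'variable' symbols concrete, the following sketch expands the example sequence into every plain a/c/g/t sequence it could stand for. The mapping follows the usual IUPAC nucleotide ambiguity codes (for instance w is a or t, and h is a, c or t); the names AMBIGUITY, count_variants and expand are invented for this illustration and are not taken from the standard or from the WIPO software.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes, lower case as in the example listing.
AMBIGUITY = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg",
    "n": "acgt",
}

SEQ = "atagatagatagatgwrtkhg"


def count_variants(seq):
    """Number of concrete sequences that seq can represent."""
    total = 1
    for symbol in seq:
        total *= len(AMBIGUITY[symbol])
    return total


def expand(seq):
    """List every concrete sequence that seq can represent."""
    choices = (AMBIGUITY[symbol] for symbol in seq)
    return ["".join(combo) for combo in product(*choices)]


print(count_variants(SEQ))
print(expand(SEQ)[:3])
```

The example sequence has three two-way symbols (w, r, k) and one three-way symbol (h), so it stands for 2 x 2 x 2 x 3 = 24 concrete sequences. Full expansion grows quickly with the number of ambiguous positions, so for long sequences counting is the safer operation.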

The amino acids are enumerated similarly, again with some variable or optional values at the end, and lists of modified nucleotides and modified amino acids are also defined.

The full standard is defined here and a software package for generating and verifying these files is available on the WIPO sequence homepage, here. The standards document also lists some tips for converting from the current standard ST.25 to ST.26.

The software has a very useful 'validate' feature (see pic), which allows you to check that your file will come out kosher, and also has some import tools to allow you to import sequences in several formats, including the previous format, ST.25.
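
As a rough idea of the kind of thing a validator looks at, here is a small sanity check written against a trimmed copy of the listing above. It is only a sketch and in no way a substitute for the validation done by the WIPO software: the three checks, the function name sanity_check, and the ALLOWED symbol set (my understanding of the lower-case nucleotide symbols ST.26 permits) are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumed set of permitted nucleotide symbols (lower case, no 'u').
ALLOWED = set("acgtmrwsykvhdbn")

# Trimmed copy of the example listing, keeping only what the checks need.
LISTING = """<ST26SequenceListing dtdVersion="V1_3">
  <SequenceTotalQuantity>1</SequenceTotalQuantity>
  <SequenceData sequenceIDNumber="1">
    <INSDSeq>
      <INSDSeq_length>21</INSDSeq_length>
      <INSDSeq_moltype>DNA</INSDSeq_moltype>
      <INSDSeq_sequence>atagatagatagatgwrtkhg</INSDSeq_sequence>
    </INSDSeq>
  </SequenceData>
</ST26SequenceListing>"""


def sanity_check(xml_text):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    root = ET.fromstring(xml_text)
    entries = root.findall("SequenceData")

    # Check 1: the declared sequence count matches the entries present.
    declared_total = int(root.findtext("SequenceTotalQuantity", default="0"))
    if declared_total != len(entries):
        problems.append(f"declared {declared_total} sequences, found {len(entries)}")

    for entry in entries:
        seq_id = entry.get("sequenceIDNumber")
        seq = entry.findtext("INSDSeq/INSDSeq_sequence", default="")
        declared_len = int(entry.findtext("INSDSeq/INSDSeq_length", default="0"))

        # Check 2: the declared length matches the actual sequence length.
        if declared_len != len(seq):
            problems.append(f"sequence {seq_id}: declared length {declared_len}, actual {len(seq)}")

        # Check 3: nucleotide sequences use only the assumed symbol set.
        if entry.findtext("INSDSeq/INSDSeq_moltype") in ("DNA", "RNA"):
            bad = sorted(set(seq) - ALLOWED)
            if bad:
                problems.append(f"sequence {seq_id}: unexpected symbols {bad}")
    return problems


print(sanity_check(LISTING))
```

An empty list means none of these three checks found a problem; it does not mean the file would pass the official validation, which checks far more than this.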
