Abstract
Tardigrades, also known as water bears, are a phylum of microscopic metazoans with the extraordinary ability to endure environmental extremes. When threatened by suboptimal habitat conditions, these creatures enter a suspended animation-like state called cryptobiosis, in which metabolism is diminished, similar to hibernation. In this state, tardigrades benefit from enhanced extremotolerance, withstanding dehydration efficiently for years at a time in a type of cryptobiosis called anhydrobiosis. Recent studies have demonstrated that the tardigrade proteome is at the heart of cryptobiosis. Principally, intrinsically disordered proteins (IDPs) and tardigrade-specific intrinsically disordered proteins (TDPs) are known to help protect cell function in the absence of water. Importantly, TDPs have been successfully expressed in cells of other species experimentally, even protecting human tissue against stress in vitro. However, previous work has failed to address how to strategically identify TDPs in the tardigrade proteome holistically. The overarching purpose of this current study, consequently, was to generate a list of IDPs/TDPs associated with tardigrade cryptobiosis that are high-priority for further investigation. Firstly, a novel database containing 44,836 tardigrade proteins from 338 different species was constructed to consolidate and standardize publicly available data. Secondly, a support vector machine (SVM) was created to sort the newly constructed database entries on the binary basis of disorder (i.e., IDP versus non-IDP). Features of this model draw from disorder metrics and literature curation, correctly classifying 160 of the 171 training set proteins (~93.6%). Of the 5,415 putative IDPs/TDPs our SVM identified, we present 82 (30 having confident subclass prediction and 52 having experimental detection in previous studies). Subsequently, the role each protein might play in tardigrade resilience is discussed. 
By and large, this supervised machine learning classifier represents a promising new approach for identifying IDPs/TDPs, opening doors to harness the tardigrade’s remarkable faculties for biomaterial preservation, genetic engineering, astrobiological research, and ultimately, the benefit of humankind.
Introduction
Forms of life that can withstand environmental extremes have long fascinated scientists. Understanding the proteomics of these organisms may aid us in harnessing their unique extremotolerance to develop biotechnologies conferring this resilience to other creatures. One key extremotolerant organism is the tardigrade, commonly known as the water bear (Figure 1).
With its ability to survive environmental stresses sufficient to kill many other animals, the tardigrade has piqued the interest of researchers for 248 years (Bonnet & Goeze, 1773). These creatures are ubiquitous. There are over 1,300 species, all of which are meiofauna, minute animals inhabiting watery films and gaps between grains of sediment (Degma, Bertolani, & Guidetti, 2021). Tellingly, tardigrades have survived the five major mass extinctions, and Sloan, Alves Batista, and Loeb (2017) even suggest it would require boiling Earth’s oceans to eliminate tardigrades definitively.
Understanding tardigrade extremotolerance is the central focus of this research. Currently, the literature includes robust documentation of the tardigrade’s capabilities in surviving environmental stresses, such as extreme temperatures (Doyère, 1842; Rahm, 1921), radiation (Beltrán-Pardo, Jönsson, Harms-Ringdahl, Haghdoost, & Wojcik, 2015; Hashimoto & Kunieda, 2017; Horikawa et al., 2006; Horikawa et al., 2013; Jönsson, Harms-Ringdahl, & Torudd, 2005), and vacuums/intense pressures (Jönsson, Rabbow, Schill, Harms-Ringdahl, & Rettberg, 2008; Seki & Toyoshima, 1998). In particular, tardigrades can withstand desiccation/dehydration adeptly. In 1948, zoologist Tina Franceschi described witnessing tardigrades from a 120-year-old dried moss sample being revived (Franceschi, 1948). Though this claim is heavily disputed (Jönsson & Bertolani, 2001), there is considerable contemporary evidence that tardigrades can indeed survive in a dehydrated state, also called anhydrobiosis, for nine to thirty years (Tsujimoto, Imura, & Kanda, 2016). This rare ability to cope with desiccation drives this computational investigation of the tardigrade proteome. Here, by pinpointing proteins involved in tardigrade desiccation tolerance, we provide insight for developing tardigrade protein-based technologies that could abate deleterious cell processes. This work could open doors for a range of applications, such as engineering drought and radiation-resistant plants, extending viability time for transfusion of blood products or transplant of organs, and establishing stable vaccine stockpiles.
Hence, in this study, we created and deployed a support vector machine (SVM) to generate a concise list of proteins of interest in tardigrade extremotolerance. Annotations of tardigrade-specific intrinsically disordered proteins (TDPs), a protein family recently implicated in tardigrade resilience (Hesgrove & Boothby, 2020; Yamaguchi et al., 2012), served as positive training data. We created and report here a novel, comprehensive, nonredundant pan-proteome database (PPD), from which we derived our training and testing sets. This database is composed of the phylum’s publicly available protein sequences, with flagged proteins involved in anhydrobiosis and other forms of extremotolerance. The feature set selected includes DISOPRED3 disorder metrics (Jones & Cozzetto, 2015), such as fractions of disordered residues and concentrations of certain amino acids, as well as phenotypic properties derived from literature curation. In short, this study consisted of creating an all-inclusive database and narrowing it down to locate proteins of interest. Altogether, understanding TDPs could help elucidate how to repurpose the tardigrade’s aptitude for survival for human benefit. This dataset, to the best of our knowledge, is the most comprehensive Tardigrada PPD that has been deduplicated to eliminate redundancy. An efficient regular expression-based algorithm permitted the assignment of a unifying identifier to most PPD proteins. Whereas previous proteomic studies have struggled with inconsistent nomenclature obfuscating protein identity, our database overcomes this obstacle. Overall, this study represents the first attempt of its kind to strategically mine proteomes for disordered proteins involved specifically in cryptobiosis. Ultimately, an enhanced understanding of such proteins could enable humans to imitate tardigrade resilience through development of novel TDP technology with wide-ranging applications in translational medicine and genetic engineering.
Review of Literature
Mechanisms of Anhydrobiosis
The tardigrade’s unique ability to undergo anhydrobiosis has been long documented but only recently understood. Anhydrobiosis is a type of cryptobiosis that can occur in any developmental stage (Schill & Fritz, 2008) whereby the tardigrade retracts its limbs and curls into a spherical formation called a tun (Figure 2; Baumann, 1922). In a 1997 landmark study, Ricci and Pagani proposed a “Sleeping Beauty” hypothesis of aging, postulating that the tardigrade’s biological clock pauses during the tun state. Hengherr, Brümmer, and Schill (2008a) corroborated this idea by showing how lifespans of periodically dried Milnesium tardigradum were similar to those of their control counterparts, excluding time spent in the tun state. In this state, a type of biostasis, tardigrades rely on a host of molecular mechanisms to survive prolonged periods of desiccation, such as the disaccharide trehalose.
Implication of Trehalose
Trehalose is a sugar reported to be involved in anhydrobiosis in some tardigrade species (Crowe, 2002; Kinchin, 2008; Webb, 1964). However, a preponderance of inconsistencies, both between and within individual studies, reveals considerable inter-/intraspecies variation in concentrations of this substance. To illustrate, Jönsson and Persson (2010) observed increased levels of trehalose in Macrobiotus islandicus (accounting for 2.9% of anhydrobiote dry weight) compared to lower amounts in other species such as M. tardigradum (0.077% of dry weight). By contrast, in a different study, the latter species was previously shown to lack trehalose altogether (Hengherr, Heyer, Köhler, & Schill, 2008b). Taken together, these conflicting findings indicate trehalose may not be solely responsible for the phenomenon of anhydrobiosis in tardigrades.
While historically the field has spent a great deal of time focused on trehalose, scientists have recently redirected their attention to the tardigrade proteome. Promising findings have implicated proteins such as late embryogenesis abundant proteins (LEAs; Förster et al., 2009; Förster et al., 2012; Schokraie et al., 2010; Tanaka et al., 2015), heat shock proteins (Hsps; Alterio, Guidetti, Boschini, & Rebecchi, 2012; Förster et al., 2009; Förster et al., 2012; Jönsson & Schill, 2007; Reuner et al., 2010; Rizzo et al., 2010; Schill, Steinbrück, & Köhler, 2004; Schokraie et al., 2010; Schokraie et al., 2011; Wang, Grohme, Mali, Schill, & Frohme, 2014; Yoshida et al., 2017), and DNA damage suppressor proteins (Dsup; Hashimoto et al., 2016; Hashimoto & Kunieda, 2017; Yoshida et al., 2017) in augmenting tardigrade extremotolerance. Consequently, the tardigrade proteome has taken center stage as a cache for previously unexplored factors with ties to desiccation tolerance. In particular, proteins lacking order, meaning they do not have fixed tertiary structures, are of heightened interest.
Tardigrade-Specific Intrinsically Disordered Proteins
A family of proteins called tardigrade-specific intrinsically disordered proteins (TDPs) is heavily involved in tardigrade cryptobiosis. Some TDPs are heat soluble, including cytosolic abundant heat soluble (CAHS), secretory abundant heat soluble (SAHS), and mitochondrial abundant heat soluble (MAHS; Yamaguchi et al., 2012). In addition, Dsup is a type of nucleosome-binding and DNA-protecting protein (Chavez, Cruz-Becerra, Fei, Kassavetis, & Kadonaga, 2019) that is also considered a TDP. The term “tardigrade-specific” indicates that these proteins, to date, have not been identified outside of the phylum, and “intrinsically disordered” signifies proteins (IDPs) or protein regions (IDRs) lacking a consistent tertiary structure under certain cellular conditions (Jirgensons, 1996). In essence, unique properties of the TDP family can be attributed to this disorder, in that proteins within this classification adopt a material state upon desiccation, forming non-crystalline, amorphous solids (Boothby et al., 2017). Considered a type of biological glass, these solids have been shown to protect cells during dehydration in a multitude of ways (Crowe, Carpenter, & Crowe, 1998; Sun & Leopold, 1997) by coming together to form specific intra- and extracellular structures supporting cell structure when water is scarce (Richaud et al., 2020). Further, radical ions and reactive oxygen species can be sequestered by some IDPs, thereby mitigating oxidative stress. TDPs were initially characterized by their location of expression in the cell. Yamaguchi et al. (2012) used green fluorescent protein analysis to confirm the cytosolic abundant heat soluble (CAHS) TDP tended to gravitate toward the matrix of the cytoplasm. On the other hand, the secretory abundant heat soluble (SAHS) TDP was detected in the culture medium, indicating the protein had crossed the plasma membrane.
As implied by the practice of naming TDPs simply for where they localize (Yamaguchi et al., 2012), there is, as yet, a paucity of research exploring the intricate mechanisms by which these proteins operate. This study, therefore, focuses on intrinsically disordered proteins, with an emphasis on tardigrade-specific cases.
Study Overview
This study uses machine learning to uncover previously unidentified TDPs—as well as IDPs not exclusive to the phylum—to generate a novel list of high-priority proteins in need of further investigation. This research takes into consideration disorder metrics and phenotypic properties derived from literature curation and is novel in (1) comprehensively mining tardigrade proteomes for IDPs involved in cryptobiosis, (2) reporting, for the first time, a deduplicated and highly organized tardigrade pan-proteome database, and (3) crafting a concise list of proteins of interest in the anhydrobiotic phenotype. We anticipate that this nonredundant and centralized database, coupled with our novel machine learning pipeline, will serve as an asset for tardigrade researchers, and that the targeted list of proteins generated will grant future studies more direction.
Project Goal
The overarching goal of this study was to strategically mine the tardigrade proteome, by way of machine learning, for intrinsically disordered proteins involved in the cryptobiotic phenotype.
Objectives
To construct and deduplicate a comprehensive tardigrade pan-proteome database (PPD)
To use an intrinsic disorder prediction server (i.e., DISOPRED3) for analyzing location and intensity of disorder in proteins from the PPD
To employ a combination of machine learning and literature curation to filter disorder prediction results in order to yield a list of proteins in need of further exploration as prospective cryptobiosis-associated proteins
To utilize InterPro, a bioinformatic tool for functional analysis, to annotate the protein list generated
Methodology
Pan-Proteome Database Construction
Data Sourcing
To maximize the likelihood of detecting TDPs through disorder prediction and machine learning, tardigrade proteomic data were consolidated. First and foremost, a table was created encompassing all publicly available tardigrade protein sequences from UniProt (Universal Protein Resource (UniProt), 2021) and NCBI (National Center for Biotechnology Information (NCBI) Resource Coordinators, 2016). To eliminate redundancy, global pairwise sequence alignments were performed between NCBI and UniProt sequences using Harvard’s supercomputer cluster, O2. Only 104 NCBI sequences could be mapped (overlapped) as fragments of UniProt sequences, and 6,000+ unique NCBI protein sequences were added to the UniProt table. These merged datasets produced a table just under 45,000 proteins long. Experimental detection of proteins was noted in this supplementary table. Specifically, relevant peptide sequences were searched for in articles published between 2010 and 2019. Data from six articles were selected: Kamilari, Jørgensen, Schiøtt, and Møbjerg (2019), Schokraie et al. (2010), Schokraie et al. (2011), Schokraie et al. (2012), Yamaguchi et al. (2012), and Yoshida et al. (2017).
Redundancy Elimination (Deduplication)
Because these datasets were so vast (44,836 sequences), Excel was used to remove redundant sequence entries.
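The consolidation step can be illustrated with a minimal Python sketch. It is not the procedure used in this study (which relied on global pairwise alignments and Excel); it assumes, for illustration only, that a redundant entry is either an exact duplicate or an exact substring (fragment) of a longer retained sequence. The function name `deduplicate` and the toy identifiers are hypothetical.

```python
def deduplicate(records):
    """Collapse exact duplicates and exact fragments into the longest sequence.

    `records` is a list of (identifier, sequence) pairs. Returns the retained
    records plus a map from each retained identifier to the identifiers that
    were folded into it. Assumption: redundancy means exact containment.
    """
    # Sort longest first so a fragment is always compared against its parent.
    ordered = sorted(records, key=lambda r: len(r[1]), reverse=True)
    kept = []
    merged = {}
    for ident, seq in ordered:
        seq = seq.strip().upper()
        parent = next((k for k, s in kept if seq in s), None)
        if parent is None:
            kept.append((ident, seq))
            merged[ident] = []
        else:
            merged[parent].append(ident)
    return kept, merged


# Toy example with made-up identifiers and sequences.
toy_records = [
    ("U1", "MKTAYIAKQR"),
    ("N1", "TAYIAK"),      # fragment of U1
    ("N2", "MKTAYIAKQR"),  # exact duplicate of U1
    ("N3", "MSSPLLG"),     # unique
]
toy_kept, toy_merged = deduplicate(toy_records)
```

This pairwise scan is quadratic in the number of sequences, so a table of tens of thousands of entries would in practice call for hashing exact duplicates first and reserving containment or alignment checks for the remainder.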
Sequence Mapping
To avoid introducing redundancy, to map between identical sequences cataloged under different names, and to maximize data inclusivity, literature-derived sequences were compared to sequences from UniProt and NCBI. A regular expression-based algorithm was devised in Perl: a brute force strategy that enumerates all possibilities for an edit distance (ED) between two sequences being aligned. For small ED, this algorithm runs faster than global sequence alignment and is therefore well-suited for mapping between nearly identical sequences for large datasets. Edit number dictated how sequences were dealt with (Figure 3).
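To make the brute-force idea concrete, the following Python sketch enumerates every regular expression that matches a string within an edit distance of one from a query. It is an illustration only: the original algorithm was written in Perl and its exact enumeration is not reproduced here, and the function names are hypothetical.

```python
import re


def one_edit_patterns(query):
    """Enumerate regexes covering all strings within edit distance 1 of query."""
    chars = [re.escape(c) for c in query]
    n = len(chars)
    patterns = ["".join(chars)]  # edit distance 0: exact match
    for i in range(n):
        # Substitution: position i may hold any residue.
        patterns.append("".join(chars[:i] + ["."] + chars[i + 1:]))
        # Deletion: position i is missing from the target.
        patterns.append("".join(chars[:i] + chars[i + 1:]))
    for i in range(n + 1):
        # Insertion: the target has one extra residue before position i.
        patterns.append("".join(chars[:i] + ["."] + chars[i:]))
    return patterns


def within_one_edit(query, target):
    """Return True if target is at most one edit away from query."""
    return any(re.fullmatch(p, target) for p in one_edit_patterns(query))


exact = within_one_edit("MKTAYIAK", "MKTAYIAK")
substituted = within_one_edit("MKTAYIAK", "MKTGYIAK")
deleted = within_one_edit("MKTAYIAK", "MKTYIAK")
inserted = within_one_edit("MKTAYIAK", "MKTAAYIAK")
two_edits = within_one_edit("MKTAYIAK", "MKGGYIAK")
```

The number of patterns grows linearly with sequence length for one edit but combinatorially with the allowed edit distance, which is consistent with the observation that this strategy is attractive only for small ED.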
Intrinsic Disorder Prediction Utilizing DISOPRED3
DISOPRED3 is an intrinsic disorder prediction server that makes a binary call between ordered and disordered residues. DISOPRED2, its predecessor, was trained on sequences associated with missing residues in X-ray crystallography structures, a telltale sign of an intrinsically disordered region (IDR; Jones & Cozzetto, 2015). DISOPRED3 was selected because it was ranked highly in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) evaluation (Monastyrskyy, Kryshtafovych, Moult, Tramontano, & Fidelis, 2014). All PPD proteins were subjected to DISOPRED3 assessment.
Machine Learning
Currently in the field of tardigrade research, there is no preemptive full-scale TDP identification system in place. Given the expensive and time-consuming nature of wet lab procedures for measuring protein expression and characterizing protein function, there is a chance TDPs that could be used to transfer tardigrade resilience to other species have yet to be identified, which compelled us to create the machine learning pipeline presented here as a means of automating the TDP identification process.
Feature Construction and IDP Metric Compilation
There is no single accepted set of guidelines for distinguishing IDPs from ordered proteins, ordered regions, or IDRs, so an additional literature search was conducted to locate common IDP characteristics. From this search, certain guiding principles were gleaned, namely that IDPs tend to (1) contain fewer hydrophobic amino acids (Oldfield & Dunker, 2014), which impedes hydrophobic collapse; (2) contain higher concentrations of disorder-promoting amino acids such as proline and serine (Atkins, Boateng, Sorensen, & McGuffin, 2015); and (3) contain lower concentrations of aromatic residues such as phenylalanine, tryptophan, and tyrosine (Oldfield & Dunker, 2014). In addition, (4) IDRs of more than 30 residues are considered long (Mohan, Uversky, & Radivojac, 2009). Based upon this literature search, the following dimensions were established for the disorder classifier: protein length; number of hydrophilic/hydrophobic residues; frequency of proline, glutamic acid, serine, glutamine, lysine, phenylalanine, tryptophan, and tyrosine; and continuously disordered region length. Hydropathy determination was in accordance with Kyte-Doolittle (Kyte & Doolittle, 1982) and Hopp-Woods (Hopp & Woods, 1981) scales. Python was used to parse the long DISOPRED3 file and produce a summary output file recording the above-mentioned disorder statistics.
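As an illustration of how such a summary could be computed, the sketch below derives the listed dimensions from one sequence and a per-residue list of disorder calls. Several choices are assumptions rather than the study's actual implementation: the DISOPRED3 output is taken to have already been parsed into booleans, only the Kyte-Doolittle scale is used, residues with a positive score count as hydrophobic, and all function and key names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
TRACKED_RESIDUES = "PESQKFWY"  # residues whose frequencies serve as features


def longest_run(flags):
    """Length of the longest consecutive stretch of True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def disorder_features(sequence, disordered):
    """Summarise one protein; `disordered` holds one boolean per residue."""
    if not sequence or len(sequence) != len(disordered):
        raise ValueError("sequence and disorder calls must align and be non-empty")
    n = len(sequence)
    features = {
        "length": n,
        # Assumption: the sign of the Kyte-Doolittle score decides the class.
        "hydrophobic": sum(KYTE_DOOLITTLE.get(aa, 0.0) > 0 for aa in sequence),
        "hydrophilic": sum(KYTE_DOOLITTLE.get(aa, 0.0) < 0 for aa in sequence),
        "fraction_disordered": sum(disordered) / n,
        "longest_disordered_run": longest_run(disordered),
    }
    for aa in TRACKED_RESIDUES:
        features["freq_" + aa] = sequence.count(aa) / n
    return features


# Toy protein: the first eight residues are called disordered.
toy_features = disorder_features("MSPEEKSQWLLV", [True] * 8 + [False] * 4)
```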
Training Sets
Negatives (ordered proteins) were flagged in the PPD based on whether they had (1) a crystal structure in the Protein Data Bank (PDB; though this only accounted for three in the PDB, as accessed in October of 2020) and (2) enzyme commission (EC) numbers, as most enzymes have stable tertiary structures. Others were manually annotated, relying on confirmation that there were ordered PDB homologs. Positives (disordered proteins) were flagged in the PPD by searching the accompanying NCBI or UniProt annotation methodically for TDP keywords (i.e., CAHS, SAHS, MAHS, LEA, Dsup, and associated abbreviation expansions).
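A keyword search of this kind could be sketched in Python as follows. The regular expression, the function name `label_entry`, and the decision to treat either a PDB structure or an EC number as sufficient for a negative label are assumptions made for illustration; the manual annotation step is not modeled.

```python
import re

# Abbreviations (optionally followed by a paralog number) and their expansions.
TDP_KEYWORDS = re.compile(
    r"\b(?:(?:CAHS|SAHS|MAHS|LEA|Dsup)\d*"
    r"|(?:cytosolic|secretory|mitochondrial)[ -]abundant[ -]heat[ -]soluble"
    r"|late[ -]embryogenesis[ -]abundant"
    r"|damage[ -]suppressor)\b",
    re.IGNORECASE,
)


def label_entry(annotation, has_pdb_structure=False, ec_number=None):
    """Return 'positive', 'negative', or None (unlabeled) for one PPD entry."""
    if TDP_KEYWORDS.search(annotation):
        return "positive"
    # Assumption: either criterion alone marks an ordered (negative) example.
    if has_pdb_structure or ec_number:
        return "negative"
    return None


label_expansion = label_entry("Cytosolic abundant heat soluble protein 1")
label_abbreviation = label_entry("CAHS3")
label_enzyme = label_entry("Glutathione S-transferase", ec_number="2.5.1.18")
label_unknown = label_entry("hypothetical protein")
label_substring = label_entry("nuclear pore protein")  # 'lea' inside a word
```

Word boundaries keep short abbreviations such as LEA from matching inside unrelated words, which matters when annotations are free text.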
Support Vector Machine
Once training sets were established, it was necessary to select a machine learning model. When graphing a decision boundary, as the number of features, or dimensions, describing each data point increases (here, there were 11), the difficulty of separating data (44,836 points in the PPD) cleanly into distinct classes also rises. This “Curse of Dimensionality” (Bellman, 1966) is why we turned to machine learning. This study utilized supervised machine learning in the form of a support vector machine (SVM). SVMs are highly effective classification tools with a computational edge over alternative separation techniques (Cristianini & Shawe-Taylor, 2000). The present SVM was created by modifying preexisting code from Chang et al. (2020).
One challenge encountered while training the model was a severe class imbalance (55 entries in the positive training set versus 2,500+ entries in the negative training set). To account for this issue, 116 ordered examples were randomly selected to soften the imbalance while continuing to reflect size differences between the two sets. The balance issue was also dealt with using weighted scoring in the code itself. The scalar magnitude of the hyperplane equation’s coefficients for the resulting classifier (Table 1) served as a measure of relative influence for each SVM feature. A linear kernel was utilized so coefficient values would be more readily interpretable. Also, feature values ranged drastically in magnitude. Therefore, standardization was performed through Z-scoring, mean centering the value distribution at zero and causing standard deviation of the transformed distribution to equal one. This aided hyperplane parameterization by preventing bias toward a certain scale. Gradient descent was employed to optimize hyperplane parameterization, running for between 100 and 1,000 iterations with the squared hinge loss function. This methodology combatted overfitting, given the traditional bias-variance tradeoff that plagues machine learning (Luxburg & Schölkopf, 2008).
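The training recipe described above (Z-scoring, a linear decision function, class weighting, and gradient descent on the squared hinge loss) can be sketched in plain Python. This is not the code adapted from Chang et al. (2020); the learning rate, the regularization strength, the inverse-frequency class weights, and the toy data are all assumptions chosen for illustration.

```python
import statistics


def zscore_columns(rows):
    """Standardise each feature column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def train_linear_svm(X, y, class_weight, lr=0.1, epochs=500, reg=0.01):
    """Gradient descent on the class-weighted squared hinge loss; y is +1 or -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [reg * wj for wj in w]  # L2 penalty on the coefficients
        grad_b = 0.0
        for xi, yi in zip(X, y):
            slack = 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if slack > 0:  # only margin violators contribute to the gradient
                coef = -2.0 * class_weight[yi] * slack * yi / n
                grad_w = [g + coef * xj for g, xj in zip(grad_w, xi)]
                grad_b += coef
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy data: [protein length, fraction of disordered residues]; +1 = disordered.
raw = [[120, 0.90], [300, 0.85], [210, 0.95], [450, 0.10],
       [500, 0.05], [380, 0.20], [260, 0.15], [410, 0.12]]
y = [1, 1, 1, -1, -1, -1, -1, -1]
X = zscore_columns(raw)
# Assumption: weights inversely proportional to class frequency.
weights = {label: len(y) / (2 * y.count(label)) for label in (1, -1)}
w, b = train_linear_svm(X, y, weights)
predictions = [predict(w, b, x) for x in X]
# Rank features by absolute coefficient, mirroring the influence measure.
influence = sorted(enumerate(w), key=lambda item: abs(item[1]), reverse=True)
```

Because the features are standardized before training, the absolute coefficient values are comparable across features, which is what makes the ranking in `influence` meaningful for a linear kernel.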
Because the positive training set was derived from evolutionarily distinct groups (CAHS, SAHS, MAHS, LEA, and Dsup), they were not sufficiently homologous to each other to warrant using alignment scores as SVM features (Hesgrove & Boothby, 2020). To clarify, homology exists within each TDP class (e.g., CAHS versus CAHS), but less so across TDP classes (e.g., CAHS versus SAHS). Using alignment data in a classifier could have diluted its predictive power. Instead, the model was trained to generically predict TDPs. Proteins were subclassified afterward by performing ~300,000 local sequence alignments between positive predictions and positive training set proteins to classify predictions into their respective TDP subgroups. Bit scores for alignments between predicted positives and known positives were averaged with respect to each TDP subgroup, with the predicted subgroup being declared according to the highest average. Specifically, the BLOSUM62 standard substitution scoring matrix (Henikoff & Henikoff, 1992) was used, but with a more severe gap penalty, tailoring the code to locate loose evolutionary relationships (the penalty for a gap of any length was −5, and the added penalty for each residue greater than one for a gap was −1).
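The subclass assignment rule and the gap penalty can be written out as a short Python sketch. The alignment itself is not reproduced; bit scores are taken as given, and the identifiers, scores, and function names below are invented for illustration.

```python
from collections import defaultdict


def gap_penalty(length, opening=5, extension=1):
    """Affine gap cost: -5 for any gap, plus -1 per residue beyond the first."""
    if length <= 0:
        return 0
    return -(opening + extension * (length - 1))


def assign_subclass(bit_scores, subclass_of):
    """Average bit scores per TDP subgroup and return the best-scoring subgroup.

    `bit_scores` maps each known positive to the score of its alignment with
    the predicted protein; `subclass_of` maps each known positive to its class.
    """
    grouped = defaultdict(list)
    for known_id, score in bit_scores.items():
        grouped[subclass_of[known_id]].append(score)
    means = {name: sum(vals) / len(vals) for name, vals in grouped.items()}
    return max(means, key=means.get), means


# Hypothetical known positives and alignment scores for one prediction.
toy_subclass_of = {"cahs_a": "CAHS", "cahs_b": "CAHS",
                   "sahs_a": "SAHS", "dsup_a": "Dsup"}
toy_scores = {"cahs_a": 62.0, "cahs_b": 48.0, "sahs_a": 21.5, "dsup_a": 18.0}
toy_best, toy_means = assign_subclass(toy_scores, toy_subclass_of)
```

Under this penalty a single-residue gap costs 5 and a four-residue gap costs 8.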
Statistical Analysis
To assess how well the model separated the training set and generalized to unseen data, leave-one-out cross-validation was implemented. Afterward, to ensure cross-validation results were not rendered by chance, additional leave-one-out cross-validation rounds were conducted (cross-validation confirmation), once with a shuffled feature space and once with shuffled training set class labels (a randomized control). Subclass prediction confidence was also evaluated. The 55 known TDPs from the positive training set were aligned against each other. Then, bit scores from within each class (e.g., CAHS versus CAHS) and across classes (e.g., CAHS versus SAHS) were recorded in Excel to locate a class-specific threshold separating bimodal distributions (Table 2). To be deemed a confident prediction, the predicted positive bit score had to meet the threshold for its respective predicted class. Though this procedure does not produce an e-value or p-value, it does indicate the confidence of each predicted positive, erring on the more conservative side, as it corresponds to a false positive rate of 0%, with respect to TDP subtypes.
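The logic of leave-one-out cross-validation with a shuffled-label control is sketched below in Python. To keep the example short, a nearest-centroid classifier stands in for the SVM; that substitution, the toy data, and the number of shuffles are assumptions and not part of the study.

```python
import random


def fit_centroids(X, y):
    """Stand-in classifier (not the SVM): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    return min(centroids,
               key=lambda c: sum((a - m) ** 2 for a, m in zip(x, centroids[c])))


def loocv_accuracy(X, y):
    """Hold out each example once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(X)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += predict_centroid(model, X[i]) == y[i]
    return correct / len(X)


# Toy, well-separated data: [fraction disordered, fraction hydrophobic].
X = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15], [0.95, 0.05],
     [0.10, 0.90], [0.20, 0.80], [0.15, 0.85], [0.05, 0.95]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
real_accuracy = loocv_accuracy(X, y)

# Randomized control: repeat with shuffled labels and average the accuracy.
rng = random.Random(0)
shuffled_runs = []
for _ in range(20):
    shuffled = y[:]
    rng.shuffle(shuffled)
    shuffled_runs.append(loocv_accuracy(X, shuffled))
control_accuracy = sum(shuffled_runs) / len(shuffled_runs)
```

If the features carry real signal, accuracy on the true labels should clearly exceed the average obtained after shuffling, which is the comparison the confirmation rounds are designed to make.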
Results
SVM Performance
The classifier designed in this study was proficient at separating known TDPs from known non-TDPs in cross-validated training set performance, correctly classifying 160/171 training set proteins, yielding raw accuracy of ~93.6% (Figure 4). This points to the suitability of disorder and sequence features selected for SVM training. For training set performance, the area under the curve (AUC) score for the receiver operating characteristic (ROC) curve was ~0.98. To confirm the hyperplane was not overfitted to training set data, leave-one-out cross-validation was performed. The leave-one-out cross-validated AUC of the ROC was ~0.95. To ensure the promising cross-validation results were not generated by chance, secondary cross-validation (confirmation) was conducted, which involved rerunning cross-validation, but with shuffled training data. For the first round of cross-validation confirmation, with training on shuffled labels, the AUC score was ~0.56. For the second round, with training on shuffled feature space, the AUC score was ~0.52 (Figure 5). Shuffled training data diminished the predictive capacity of the SVM, as expected. Qualitatively speaking, the closeness of model performance during training and leave-one-out cross-validation alludes to negligible overfitting and the strong, generalizable predictive power of the overall pipeline.
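For readers less familiar with the metrics reported here, the Python sketch below computes raw accuracy and the area under the ROC curve using the rank-based (Mann-Whitney) formulation. The decision scores and labels are invented; only the 160/171 figure comes from this study.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical decision scores for three positives and three negatives.
toy_scores = [0.90, 0.80, 0.35, 0.60, 0.30, 0.10]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)

# Raw accuracy reported for the training set: 160 of 171 correct.
raw_accuracy_percent = round(160 / 171 * 100, 1)
```

An AUC near 0.5, as obtained with shuffled labels or features, corresponds to a classifier that ranks positives above negatives no better than chance.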
Finalized, High-Priority List
At the core of this study was the creation of a concise list of TDPs intended to guide researchers in the budding field of tardigrade-based technology development. The SVM predicted 5,415 previously unknown IDPs/TDPs out of 44,836 total PPD proteins. Of these 5,415 proteins, 52 were also detected experimentally in previous studies, meaning they map to sequences from the six publications listed under “Data Sourcing.” Also of these 5,415 proteins, 30 passed the bit score threshold for homology to TDP subclasses. To gain insight into how these proteins might play a role in cryptobiosis, InterPro, a protein annotation tool (Apweiler et al., 2001; Blum et al., 2021), was used to analyze the 82 proteins (Appendix). Many entries were “hypothetical proteins,” which we were able to characterize with InterPro, shedding light on their function. This narrowed down list will empower researchers to sidestep tedious experimental sifting for disordered proteins.
InterPro Annotations
Proteins in the finalized list were manually annotated with InterPro to provide a foothold for scientists exploring TDPs, as well as to expedite the genesis of new TDP technology. InterPro annotations showed that the 82 proteins of interest are involved in diverse biological processes, including motor activity, ATP binding, DNA binding, and ion sequestration. Curiously, 15/82 (~18.3%) were annotated as heat shock proteins (Hsp20, Hsp40, Hsp70, and Hsp90 were all detected). In this group, prevalence of heat shock proteins, a type of molecular chaperone that assists in biomolecule assembly/disassembly, reinforces their role in the tardigrade cryptobiotic molecular landscape (Schokraie et al., 2010; Schokraie et al., 2011). Characterizing these putative TDPs is a pivotal step forward in this field.
Subclassification Trends
For TDP subclassification, 297,825 local sequence alignments were performed (55 training set proteins × 5,415 positive predictions). The 52 predictions with experimental detection in previous literature were not deemed confident by way of threshold comparison, though it should be noted that the threshold confidence metric errs on the conservative side (corresponding to a false positive rate of 0%). Of the 5,415 disordered predictions, 30 subclass assignment bit score averages met the threshold. This confidence measurement does not apply to the disordered prediction of the protein, but exclusively to subclass assignment (e.g., CAHS versus SAHS). Most positives were subclassified as damage suppressor proteins (Figure 6). The bit score distributions utilized to determine the class-specific thresholds are displayed in Figure 7.
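One way to obtain a class-specific threshold with a 0% false positive rate is sketched below in Python. The study located thresholds by inspecting bimodal distributions in Excel, so the rule used here (the highest across-class bit score) is an assumption, and the scores are invented.

```python
def zero_fp_threshold(across_class_scores):
    """Threshold that no across-class alignment exceeds (assumed rule)."""
    return max(across_class_scores)


def is_confident(mean_bit_score, threshold):
    """A subclass call is confident only if it clears the class threshold."""
    return mean_bit_score > threshold


# Hypothetical bit scores for one TDP class.
within_class = [52.0, 60.5, 71.2]   # e.g., CAHS versus CAHS
across_class = [12.0, 19.4, 25.1]   # e.g., CAHS versus SAHS
threshold = zero_fp_threshold(across_class)
false_positive_rate = sum(s > threshold for s in across_class) / len(across_class)
confident_call = is_confident(55.0, threshold)
unconfident_call = is_confident(21.0, threshold)

# Number of alignments needed: every known TDP against every prediction.
n_alignments = 55 * 5415
```

A threshold chosen this way trades sensitivity for specificity, which is consistent with only a small fraction of predictions receiving a confident subclass.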
Discussion
A Tool for More Targeted Research
One immediate application of this classifier is compensating for both the deficit in and drawbacks of tardigrade experiments. The handful of existing publications we pulled data from, though carefully selected, presumably reflect a fraction of tardigrade protein expression under various stress conditions. Moreover, IDPs are difficult to measure and identify because many can form higher-order structures that can make them inaccessible to standard platforms such as nuclear magnetic resonance (Radivojac et al., 2004) or X-ray crystallography. Now that we have narrowed down proteins of interest, these can be targeted for specific experimental measurements, such as mass spectrometry or genetic experiments. Since it would be a costly and lengthy process to investigate all PPD proteins experimentally, identifying proteins of interest beforehand ensures resources are spent prudently. While predictive pipelines are typically utilized to weed out IDPs and focus on ordered proteins (Atkins et al., 2015), the same principle can be readily reversed, as was done here, for the purpose of studying IDPs.
Making Sense of Disorder
IDPs are vital to extremotolerance and cell biology in general (Dunker, Bondos, Huang, & Oldfield, 2015; Wright & Dyson, 2015), playing crucial roles in protein-protein interactions due to their flexibility, but they are severely understudied (Necci, Piovesan, Critical Assessment of Protein Intrinsic Disorder (CAID) Predictors, Database of Protein Disorder (DisProt) Curators, & Tosatto, 2021).
Identifying illustrative examples of IDPs here paves the way for more deeply understanding them. This study expands the number of known examples and thereby potentiates further analysis into what constitutes an IDP. The 5,415 SVM positive predictions, if they are indeed disordered, dramatically enlarge the register of known tardigrade-specific IDPs (TDPs), which is a notable step forward for creating TDP technologies with impacts in fields ranging from astrobiology to genetic engineering.
Next Steps
An area for additional investigation would be constructing a separate SVM classifier for each TDP subclass. Future studies could also increase the number of shuffled cross-validation confirmation trials and develop a system for strategically, instead of randomly, selecting the negative training set. To fully reap the benefits of TDP-based technology, researchers need to assess in vitro the proteins we identified, such as through mass spectrometry and western blotting (as in Schokraie et al., 2010). Determining which of these potential IDPs are upregulated during desiccation, and to what extent, will be critical for identifying with confidence which are ideal for incorporation into biotechnology. Other avenues of investigation would include designing genetic knockout/knockdown experiments in tardigrades for targeting proteins identified here, as well as testing for transfer of stress tolerance to heterologous systems and performing stress tolerance assays.
Tardigrade-Specific Intrinsically Disordered Proteins and Biotechnology
Recent studies have observed how TDPs can be utilized as a means for transferring oxidative stress tolerance (Chavez et al., 2019), radiation tolerance (Hashimoto et al., 2016; Kirke, Jin, & Zhang, 2020; Westover et al., 2020), and osmotic stress tolerance (Tanaka et al., 2015) between organisms. Notably, in 2017, transfer of CAHS proteins to yeast increased their desiccation tolerance approximately 100-fold (Boothby et al., 2017). The same study found Escherichia coli exhibited a similar response, pointing to the promise of using TDPs to confer tardigrade extremotolerance to other life forms.
Therefore, while this study fortifies knowledge of basic tardigrade biology and stress response, it has ramifications for higher life forms as well. Serendipitously, certain proteins pinpointed were likened to those found in humans. Common examples from the Appendix include fatty acid-binding proteins (Smathers & Petersen, 2011) and DnaJ/DnaK proteins. Table 3 presents select proteins from the Appendix and their human homologs, logical candidates for technological applications.
Insofar as TDPs take on a material state upon desiccation (Boothby et al., 2017), form supportive cell structures (Richaud et al., 2020), and sequester ions (Bray, 1993), they hold great promise for many advantageous applications. For example, TDPs could be used to design more effective, affordable biomaterial preservation methods. The concept of mimicking anhydrobiotic creatures to preserve organs, tissues, and blood has been pondered for more than half a century (Keilin, 1959). Today, cytosolic abundant heat-soluble TDPs (CAHS) are already being evaluated as pharmaceutical excipients for stabilizing enzymes such as lactate dehydrogenase and lipoprotein lipase, and the results are encouraging (Piszkiewicz et al., 2019). Such a preservative could also aid storage of medical supplies containing biological components, such as vaccines. To that end, because CAHS proteins protect enzymes from desiccation and lyophilization, they have been tested as a vaccine stabilizer. When a vaccine with this preservative was tested in mice, the stabilizer was deemed not toxic, and treatment elicited antibody production as intended (Esterly et al., 2020). Cold storage is sometimes a limiting factor when it comes to distributing vaccines (Kaufmann, Miller, & Cheyne, 2011; Zaffran et al., 2013), such as some COVID-19, chicken pox, and Ebola vaccines. With TDP stabilizers, however, certain vaccine stockpiles could be dispersed and stored safely and affordably, circumventing the cumbersome cold chain and even ultra-cold chain vaccine preservation methods, thus increasing global medical equity.
Outside the realm of biopreservation, TDPs could be a valuable addition to therapeutics. For instance, they could be incorporated into treatments that delay deleterious cell processes when an injury occurs in an isolated location. Delaying necrosis and apoptosis would be especially germane to improving patient outcomes when treatment is time-sensitive. Such an advancement could be crucial for the military in decreasing battlefield mortality.
TDPs could also have applications in medical conditions involving oxidative stress. Because oxidative stress can contribute to cancer (Liguori et al., 2018), antioxidant properties of these proteins (Hashimoto et al., 2016; Rizzo et al., 2010) could prove useful as part of new treatments. Specifically, certain tardigrade proteins form structures that can capture reactive oxygen species and cut short their damaging ripple effect within the cell.
Aside from medical implications, mimicking tardigrade resilience has broader technological applications, namely in coping with Earth’s changing climate. For instance, Kirke et al. (2020) expressed a damage suppressor protein (Dsup) gene in plants and observed decreased radiation-induced DNA damage. Theoretically, extremotolerant plants could be engineered to maintain a stable food supply despite escalating environmental challenges. In the far future, perhaps such recombinant plants could play a role in sustaining life on other planets.
On the subject of space travel, tardigrades have survived the brutal conditions of space on multiple missions since the early 2000s (Jönsson et al., 2008; Persson et al., 2011; Rebecchi et al., 2009, 2010, 2011; Vukich et al., 2012) and are considered model organisms for space research (Bertolani et al., 2001; Guidetti, Rizzo, Altiero, & Rebecchi, 2012; Jönsson, 2007; May, Maria, & Guimard, 1964). Because certain TDPs are involved in radiotolerance (Hashimoto et al., 2016), they could be used to engineer radiation-resistant crops or even to protect astronauts against cosmic radiation, helping them explore new frontiers.
Conclusion
Researchers have long marveled at tardigrade resilience, inspiring today’s endeavors in tardigrade biomimicry (the imitation of biological phenomena in man-made technology). Given mounting evidence that intrinsically disordered proteins (IDPs) play a pivotal role in cryptobiotic phenomena, the dearth of literature on tardigrade IDPs is peculiar. In this study, a computational method for identifying tardigrade-specific IDPs (TDPs) of interest was devised, paving the way for potentially revolutionary TDP/IDP-based technology. The novel and thorough pan-proteome database (PPD) presented here yielded a prioritized shortlist of hypothetically cryptobiosis-associated IDPs, which were identified by a novel support vector machine classifier trained on IDP properties. Our classifier performed promisingly well, correctly classifying 160 of the 171 training set proteins (~93.6%). During testing, 5,415 of the 44,836 PPD proteins (~12.1%) were categorized as disordered. InterPro annotations of 82 positives, some heretofore unreported in the literature, suggest a plausible facilitative role for many of them in tardigrade extremotolerance. By methodically funneling the expansive tardigrade proteome into a more succinct list of proteins of interest, this study charts a course for more directed research into tardigrade cryptobiosis, potentially permitting humans to one day exploit and cross-apply the tardigrade’s natural strengths for the benefit of such fields as medicine, biotechnology, and even space travel. Further, considering how water shortages have wreaked havoc on crops and livestock in the wake of climate change, the prospect of engineering TDP-expressing crops that can withstand drought is of special import. The tardigrade’s perplexing proteins could be key to conferring the same desiccation tolerance that has been observed in the enigmatic tardigrade for hundreds of years. Ironically, unraveling the underpinnings of protein disorder could be paramount for creating a more ordered world.
Acknowledgements
High-throughput computation was enabled by the Harvard Medical School O2 high performance compute cluster.
Appendix
Catalog of 82 Putative Intrinsically Disordered Proteins
Greater distance to hyperplane indicates more confident classification. Predicted Type column is based on bit score averages from local alignments between predicted/known positives. Amino acid is abbreviated as AA throughout. 52 red entries: experimentally detected in previous studies, with predicted types not satisfying bit score significance threshold. 30 blue entries: confident TDP type predictions without experimental detection.
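As a reading aid for the distance-to-hyperplane values, the following minimal sketch shows how such a distance can be computed and used to rank proteins by classification confidence. It assumes a linear decision boundary of the form w·x + b = 0, which is an illustrative simplification and not necessarily the kernel used in this study. The weights, bias, feature vectors, and protein names are hypothetical placeholders, not values from the classifier.

```python
import math

def signed_distance(weights, bias, features):
    """Signed distance from a feature vector to the hyperplane w.x + b = 0.

    Positive values fall on the predicted-IDP side and negative values on the
    non-IDP side (an assumed sign convention for this sketch).
    """
    norm = math.sqrt(sum(w * w for w in weights))
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return score / norm

# Hypothetical two-feature linear model (not the study's fitted parameters).
weights = [3.0, 4.0]
bias = -2.0

# Hypothetical proteins described by two made-up disorder features.
proteins = {
    "protein_A": [1.0, 1.0],
    "protein_B": [0.5, 0.2],
    "protein_C": [0.0, 0.0],
}

# Rank by absolute distance: farther from the hyperplane = more confident call.
ranked = sorted(
    ((name, signed_distance(weights, bias, x)) for name, x in proteins.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

for name, dist in ranked:
    label = "IDP" if dist > 0 else "non-IDP"
    print(f"{name}: distance={dist:+.3f} predicted={label}")
```

In this toy example, a protein lying close to the boundary would be listed last, reflecting a less confident classification than one lying far from it.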