Five Challenges for the Marriage of AI and Synthetic Biology: Technology, Data, Algorithms, Evaluation and Sociology


Source丨ACM Newsletter

Compilation | Wang Yue

Editor | Chen Caixian

Over the past two decades, biology has changed dramatically, and engineering biological systems has become possible. The genomic revolution, which gave us the ability to sequence the genetic code (DNA) of cells, is a major driver of this change. One of the latest developments to come out of the genomic revolution is the ability to precisely edit DNA in vivo using CRISPR.

Higher-level manifestations of the genetic code, such as protein production, are called phenotypes. High-throughput phenotype data, combined with precise DNA editing, makes it possible to link changes in the underlying code to external phenotypes.


Legend: Wacomka


Legend: A high-level representation of the cell’s genetic code (DNA).


Legend: datasets/data types frequently used in biology (this list is incomplete)

The potential of synthetic biology

Synthetic biology will have a transformative impact on food, energy, climate, medicine and materials…every field in the world.


Legend: Synthetic biology may affect every field in the world

Synthetic biology has already given us insulin produced without sacrificing pigs (an achievement that dates back to the early days of genetic engineering), synthetic leather, coats made of spider silk that no spider ever spun, anti-malaria and anti-cancer medicines, meatless burgers that taste like meat, renewable biofuels, hoppy-tasting beer brewed without hops, the scent of extinct flowers, artificial collagen for cosmetics, and gene drives to eliminate dengue-carrying mosquitoes. Many believe this is just the tip of the iceberg, as the ability to engineer living things opens up endless possibilities to transform the world, and public and private investment in this area is growing.


Legend: Significant growth in both academic (a) and commercial (b) activity has provided a wealth of information, data, and context for applying AI to synthetic biology.

In addition, as AI enters its third wave, with its focus on incorporating context into models, its potential to impact synthetic biology has greatly increased.

It is well known that an organism’s genotype is less a blueprint for its phenotype than the initial conditions of a complex, interconnected, dynamic system. Biologists have spent decades building and curating a large body of knowledge, covering regulation, association, rates of change, and function, to describe this system. Other resources such as gene networks, known functional associations, protein-protein interactions, protein-metabolite interactions, and knowledge-driven dynamic models of transcription, translation, and interactions provide rich input for AI models.

Model interpretability is also crucial for revealing new design principles. Interpretable models give biologists the ability to tackle more complex questions about biological systems and to build comprehensive models that accelerate discovery and research. The growth of knowledge and resources in the field is clearly visible in the number of synthetic biology publications and in the commercial opportunities in synthetic biology.

AI and its impact on synthetic biology

Relative to its potential, AI has so far had only a limited impact on synthetic biology.

We have seen successful applications of AI, but they are still limited to specific datasets and research questions. The current challenge for AI in the field is how well these successes generalize to a wider range of applications and to other datasets.

Data mining, statistics, and mechanistic modeling are currently the main drivers of computational biology and bioinformatics, but the lines between these techniques and AI/ML (machine learning) are often blurred. For example, clustering is a data mining technique that identifies patterns and structure in gene expression data, which can indicate whether an engineered modification leads to toxic outcomes in cells. The same clustering techniques can also serve as unsupervised learning models that find structure in unlabeled datasets. These classic techniques, together with newer AI/ML methods, will play a greater role in synthetic biology as larger datasets become routine. Transcriptomic data volume is doubling every 7 months, and high-throughput workflows for proteomics and metabolomics are becoming more widely available.
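To make the clustering example concrete, here is a minimal sketch of k-means applied to gene expression profiles. The strain names, the expression values, and the farthest-point initialization are all illustrative assumptions; the article does not describe a specific algorithm or dataset.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the profile farthest from every centroid chosen so far.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        # Assignment step: each profile joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical log fold changes of three stress-response genes in six engineered strains.
profiles = {
    "strain_A": [0.1, -0.2, 0.0],
    "strain_B": [0.0, 0.1, -0.1],
    "strain_C": [-0.1, 0.0, 0.2],
    "strain_D": [2.9, 3.1, 2.8],
    "strain_E": [3.2, 2.7, 3.0],
    "strain_F": [3.0, 3.0, 3.1],
}
_, labels = kmeans(list(profiles.values()), k=2)
clusters = dict(zip(profiles, labels))  # strain name -> cluster index
```

In this toy data the strains with a strongly induced stress response fall into their own cluster without any labels being supplied, which is the sense in which clustering doubles as unsupervised learning.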

Furthermore, the progressive automation and miniaturization of laboratory work through microfluidic chips heralds a future in which data processing and analysis will multiply the productivity of synthetic biology. DARPA’s Synergistic Discovery and Design program (SD2, 2018–2021) focused on building AI models that bridge the gap between AI and the needs of synthetic biology. The same trend is evident in companies applying state-of-the-art technology in this area (e.g., Amyris, Zymergen, and Ginkgo Bioworks).

AI and synthetic biology overlap in three areas: applying existing AI/ML to existing datasets; generating new datasets (e.g., the upcoming NIH Bridge2AI); and creating new AI/ML techniques to apply to new or existing data. Although SD2 contributed to the last of these, much of the potential remains untapped and there is still a long way to go.

Artificial intelligence can help synthetic biology overcome a grand challenge: predicting the impact of bioengineering approaches on living organisms and the environment. Because the outcome of bioengineering cannot be predicted, the cellular engineering goal of synthetic biology (i.e., inverse design) can currently be achieved only through extensive trial and error. Artificial intelligence offers an opportunity to use publicly available and experimental data to predict impacts on engineered organisms and the environment.

Designing genetic constructs for cellular programming. Much research in synthetic biology has focused on engineering gene constructs and gene circuits, which poses challenges very different from those of designing electronic circuits.

By combining known biophysical models with machine learning and reinforcement learning, AI techniques can predict the effect of a construct on its host and vice versa; these predictions are already quite powerful, but there is still room for improvement. In machine-assisted gene circuit design, a variety of AI techniques have been applied, including expert systems, multi-agent systems, constraint-based reasoning, heuristic search, optimization, and machine learning.

Sequence-based models and graph convolutional networks have also received attention in the engineering of biological systems. Factor-graph neural networks have been used to incorporate biological knowledge into deep learning models. Graph convolutional networks have been used to predict protein functions from protein-protein interaction networks. Sequence-based convolutional and recurrent neural network models have been used to identify potential protein binding sites, predict gene expression, and design new biological constructs. AI is most useful where it helps build comprehensive models that reduce the number of experiments or designs that need to be carried out.
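As a rough illustration of how a graph convolutional network passes information along a protein-protein interaction network, the sketch below implements a single propagation step of the commonly used form ReLU(Â H W), where Â is the adjacency matrix with self-loops, symmetrically normalized. The three-protein network, the single "known annotation" feature, and the weight matrix are hypothetical; the works the article alludes to may use different architectures.

```python
import math


def gcn_layer(adj, feats, weights):
    """One graph-convolution step: ReLU(norm(A + I) @ feats @ weights)."""
    n = len(adj)
    # Add self-loops so each protein keeps part of its own signal.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization: divide each edge by sqrt(deg_i * deg_j).
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    n_in, n_out = len(weights), len(weights[0])
    # Aggregate neighbor features, then apply the weights and the ReLU.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n)) for f in range(n_in)] for i in range(n)]
    return [
        [max(0.0, sum(agg[i][f] * weights[f][o] for f in range(n_in))) for o in range(n_out)]
        for i in range(n)
    ]


# Hypothetical interaction network: P1 interacts with P2, and P2 with P3.
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
# One input feature per protein: 1.0 if a function annotation is known (only P1 here).
features = [[1.0], [0.0], [0.0]]
identity_weights = [[1.0]]  # untrained, for illustration only

hidden = gcn_layer(adjacency, features, identity_weights)
```

After one step, P1's annotation signal has reached its direct partner P2 but not P3; stacking layers extends the reach by one interaction per layer.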

Metabolic engineering. In metabolic engineering, artificial intelligence has been applied to almost every stage of the bioengineering process. For example, artificial neural networks have been used to predict translation initiation sites, annotate protein functions, predict synthetic pathways, optimize the expression levels of multiple exogenous genes, predict the strength of regulatory elements, predict plasmid expression, optimize nutrient concentrations and fermentation conditions, predict enzyme kinetic parameters, understand the association between genotype and phenotype, and predict CRISPR guide efficacy. Clustering has been used to discover secondary metabolite biosynthetic gene clusters and to identify enzymes that catalyze specific reactions. Ensemble approaches have been used to predict pathway dynamics and optimal growth temperatures, and in directed evolution to find proteins that confer higher fitness. Support vector machines have been used to optimize ribosome binding site sequences and to predict the behavior of CRISPR guide RNAs. Among the stages of metabolic engineering, AI holds the most promise for process scale-up, a major bottleneck in the field, and for downstream processing (such as the systematic extraction of the produced molecules from fermentation broth).

Experiment automation. The impact of AI in automating lab work and recommending experimental designs extends well beyond the “learn” phase of the design-build-test-learn (DBTL) cycle. Automation is becoming increasingly important in practice, as it is the most reliable way to obtain the high-quality, high-volume, low-bias data needed to train AI algorithms, and it also enables predictable bioengineering. Automation makes it possible to rapidly transfer and scale complex protocols to other laboratories. For example, liquid-handling robotic stations form the backbone of biofoundries and cloud labs. These foundries are likely to be transformed by robotics and planning algorithms, gaining the ability to iterate rapidly through DBTL cycles. Semantic networks, ontologies, and schemas have revolutionized the representation, communication, and exchange of designs and protocols. These tools enable rapid experimentation and generate more data in a structured, queryable format. In a field where most experimental detail is either lost or recorded by hand in lab notebooks, the promise of artificial intelligence is driving major change by lowering the barriers to generating data.

Microfluidics is an alternative to macroscopic liquid handling that offers higher throughput, lower reagent consumption, and cheaper scaling. In fact, microfluidics may be the key technology for realizing self-driving labs, which promise to greatly speed up R&D by augmenting automated experimental platforms with artificial intelligence. Self-driving labs run fully automated DBTL cycles in which AI algorithms form hypotheses from previous experimental results and actively search for promising experiments to run next. This may be the biggest opportunity for AI researchers in synthetic biology. While automated DBTL loops have been demonstrated on liquid-handling robotic workstations, the scalability, high throughput, and manufacturing flexibility of microfluidic chips may provide the final technological leap that makes self-driving labs a reality.
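The closed loop described here can be sketched in a few lines. Everything below is a toy under stated assumptions: `run_assay` stands in for the robotic build-and-test steps with a made-up response curve, and `propose` is a deliberately simple explore/exploit heuristic rather than any published self-driving-lab algorithm.

```python
def run_assay(inducer_level):
    """Stand-in for build + test on a robot: a hypothetical titer response."""
    return -(inducer_level - 0.62) ** 2  # the optimum (0.62) is unknown to the loop


def propose(candidates, observed, explore=1.0):
    """Learn + design: pick the untested condition with the best optimistic score."""
    def score(x):
        # Predict from the nearest measured condition, plus a bonus for being far from it.
        nearest_x, nearest_y = min(observed, key=lambda xy: abs(xy[0] - x))
        return nearest_y + explore * abs(nearest_x - x)

    tested = {x for x, _ in observed}
    untested = [x for x in candidates if x not in tested]
    return max(untested, key=score)


def dbtl(candidates, budget):
    """Run up to `budget` experiments and return the best (condition, result) pair."""
    observed = [(candidates[0], run_assay(candidates[0]))]
    for _ in range(min(budget, len(candidates)) - 1):
        x = propose(candidates, observed)       # design the next experiment
        observed.append((x, run_assay(x)))      # build and test it
    return max(observed, key=lambda xy: xy[1])  # report the best design found


candidates = [i / 20 for i in range(21)]  # hypothetical inducer levels from 0.0 to 1.0
best_level, best_titer = dbtl(candidates, budget=8)
```

A real system would replace the nearest-neighbor guess with a proper surrogate model (a Gaussian process is a common choice) and `run_assay` with calls to laboratory hardware; the loop structure stays the same.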

Challenges of studying synthetic biology with AI

Artificial intelligence has begun to find its way into various synthetic biology applications, but remaining technical and societal issues stand as barriers between the two fields.

Technical challenges. The technical challenges of applying AI to synthetic biology are as follows: data is scattered across different modalities, difficult to combine, unstructured, and often lacks the context in which it was collected; models require far more data than is typically collected in a single experiment, and they lack interpretability and uncertainty quantification; and for larger design tasks, there are no metrics or standards for effectively evaluating model performance. Furthermore, experiments are often designed to explore only positive outcomes, which complicates or biases model evaluation.


Legend: The challenges of applying artificial intelligence technology to the field of synthetic biology.

Data challenges. The lack of suitable datasets remains the primary hurdle to combining artificial intelligence with synthetic biology. Applying AI to synthetic biology requires large amounts of labeled, curated, high-quality, context-rich data from individual experiments. Although the community has made progress in building databases of biological sequences (even whole genomes) and phenotypes, labeled data is still scarce. Here, “labeled data” refers to phenotypic data mapped to measurements that capture biological function or cellular response. It is the availability of such measurements and labels that will allow AI/ML solutions in synthetic biology to mature, as happened in other fields where AI came to rival human capabilities.

Lack of investment in data engineering is part of the reason suitable datasets are scarce. Behind the brilliance of AI’s technical advances, people often fail to see the computing infrastructure needed to support them and ensure their success. The AI community calls this the hierarchy (or pyramid) of needs, and data engineering is an important part of it. Data engineering includes experiment planning, data collection, structuring, access, and exploration. Successful AI applications rest on standardized, consistent, and reproducible data engineering steps. While we can now collect biological data at unprecedented scale and detail, this data is often not immediately usable for machine learning. Many barriers remain to adopting community-wide standards for storing and sharing measurement data, experimental conditions, and other metadata that would make the data more amenable to AI techniques. Rigorous work and a high level of consensus are required to get such standards adopted quickly and to promote common standards for assessing data quality. In short, AI models require consistent and comparable measurements across all experiments, which lengthens experimental timelines. This requirement adds a heavy burden for experimenters who already follow complex protocols. Consequently, the long-term need to collect data is often sacrificed to meet looming project deadlines.
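One small, concrete piece of such data engineering is recording each measurement together with the experimental context that makes it reusable. The sketch below is only an illustration: the field names and the list of required context keys are assumptions, not a community standard.

```python
from dataclasses import dataclass, field

# Hypothetical minimum context a measurement needs before it is useful for ML.
REQUIRED_CONTEXT = ("temperature_c", "medium", "timepoint_h")


@dataclass
class Measurement:
    """A single assay value plus the conditions under which it was taken."""
    sample_id: str
    strain: str
    assay: str
    value: float
    unit: str
    conditions: dict = field(default_factory=dict)

    def missing_context(self):
        """Return the required condition keys that were not recorded."""
        return [key for key in REQUIRED_CONTEXT if key not in self.conditions]


record = Measurement(
    sample_id="s1",
    strain="E. coli MG1655",
    assay="OD600",
    value=0.42,
    unit="AU",
    conditions={"temperature_c": 37, "medium": "M9"},
)
gaps = record.missing_context()  # lists what a curator would still need to ask for
```

Checking for missing context at collection time is cheap; reconstructing it months later from lab notebooks usually is not.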


Legend: A canonical AI/ML infrastructure can support synthetic biology research. While mid-stage research is often the focus of attention, it is the foundation that is critical and requires a significant investment of resources.

This situation often results in sparse datasets that represent only a small fraction of the layers that make up the omics data stack. Data representation then has a significant impact on the ability to integrate these isolated datasets for comprehensive modeling. Industry currently invests heavily, across many verticals, in data cleaning, schema alignment, and extract, transform, and load (ETL) operations that gather unwieldy digital data and prepare it in a form suitable for analysis. These tasks take up roughly 50% to 80% of a data scientist’s time, limiting their ability to explore in depth. Handling a large number of data types (data multimodality) is a challenge for synthetic biology researchers, and the complexity of preprocessing grows far faster with data diversity than with data volume.

Modeling/algorithmic challenges. Many of the popular algorithms driving current AI advances, such as those in computer vision and NLP, are not robust when applied to omics data. Conventional applications of these models often suffer from the “curse of dimensionality” when applied to data collected in a single experiment. Under a given condition, a single experimenter can generate genomic, transcriptomic, and proteomic data with more than 12,000 measurements (dimensions) for one organism, whereas the number of labeled instances (e.g., success or failure) is usually only in the tens to hundreds. The dynamics (time resolution) of the system are rarely captured for these high-dimensional data types. These measurement gaps make inference about complex dynamic systems a significant challenge.
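Schema alignment is easiest to see on a tiny example. In the sketch below, two hypothetical labs report the same assay with different field names and units, and a small transform step maps both onto one target schema. The lab formats, field names, and target schema are invented for illustration.

```python
# Hypothetical raw exports from two labs measuring the same product titer.
LAB_A = [{"sample": "a1", "titer_mg_per_l": 120.0, "temp_c": 30}]
LAB_B = [{"id": "b7", "titer_g_per_l": 0.2, "temp_f": 86}]


def from_lab_a(row):
    """Lab A already uses the target units; only the field names change."""
    return {
        "sample_id": row["sample"],
        "titer_mg_per_l": float(row["titer_mg_per_l"]),
        "temperature_c": float(row["temp_c"]),
        "source": "lab_a",
    }


def from_lab_b(row):
    """Lab B needs renaming plus unit conversion (g/L to mg/L, Fahrenheit to Celsius)."""
    return {
        "sample_id": row["id"],
        "titer_mg_per_l": row["titer_g_per_l"] * 1000.0,
        "temperature_c": (row["temp_f"] - 32) * 5 / 9,
        "source": "lab_b",
    }


def load(rows_a, rows_b):
    """Combine both sources into one list of records with a shared schema."""
    return [from_lab_a(r) for r in rows_a] + [from_lab_b(r) for r in rows_b]


combined = load(LAB_A, LAB_B)
```

Each additional data source or modality needs its own mapping of this kind, which is one reason preprocessing effort scales with data diversity rather than with data volume.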
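One practical consequence of having thousands of measured dimensions but only tens of labeled samples can be shown with pure noise. The sketch below, which uses simulated data and illustrative sample and feature counts, searches random features for the one most correlated with a random phenotype.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_noise_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between a random phenotype and purely random features."""
    rng = random.Random(seed)
    phenotype = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, phenotype)))
    return best


# 20 labeled samples; 10 measured features versus 12,000 measured features.
few_features = strongest_noise_correlation(20, 10)
many_features = strongest_noise_correlation(20, 12000)
```

With 12,000 candidate features and 20 samples, some feature will correlate strongly with the phenotype by chance alone, even though none of them carries any signal. A model trained naively on such data can look accurate and still have learned nothing about the biology.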


Legend: The curse of dimensionality

Omics data has both similarities to and differences from other data modalities such as sequence data, text data, and network-based data, so classical approaches are not always applicable. Features shared with these modalities include positional encoding and dependencies, as well as complex interaction patterns. However, there are also fundamental differences, such as the underlying representation, the context required for meaningful analysis, and the normalization across modalities needed for biologically meaningful comparisons. It is therefore difficult to find robust generative models (analogous to Gaussian models or stochastic block models) that accurately describe omics data.

Furthermore, biological sequences and systems encode complex biological functions, but there are few systematic approaches for interpreting these codes in the way that semantics or context are interpreted in written text. These characteristics make it challenging to extract insights and to generate and validate hypotheses through data exploration. Engineering biology involves learning about black-box systems in which we can observe inputs and outputs but have limited knowledge of the inner workings. Because these biological systems operate in a combinatorially large parameter space, AI solutions that design experiments strategically and efficiently, probing the system to generate and validate hypotheses, represent both a great need and a great opportunity.

Finally, many popular AI algorithms neither account explicitly for uncertainty nor offer robust mechanisms for controlling error under input perturbations. This fundamental gap is especially important in synthetic biology, given the randomness and noise inherent in the biological systems we are trying to engineer.
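One simple way to attach an uncertainty estimate to a prediction is a bootstrap ensemble: refit the model on resampled data and report the spread of the predictions. The sketch below uses a straight-line fit and made-up dose-response numbers purely for illustration; it is one common technique, not the approach the article prescribes.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: every x identical
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def bootstrap_predict(xs, ys, x_new, n_boot=200, seed=0):
    """Mean and standard deviation of predictions from models fit on resampled data."""
    rng = random.Random(seed)
    indices = range(len(xs))
    preds = []
    for _ in range(n_boot):
        sample = [rng.choice(indices) for _ in indices]  # resample with replacement
        slope, intercept = fit_line([xs[i] for i in sample], [ys[i] for i in sample])
        preds.append(slope * x_new + intercept)
    return statistics.fmean(preds), statistics.stdev(preds)


# Hypothetical inducer doses and noisy expression readouts.
doses = [0, 1, 2, 3, 4, 5, 6, 7]
readouts = [1.3, 2.6, 5.4, 6.8, 9.5, 10.7, 13.2, 15.4]

inside = bootstrap_predict(doses, readouts, 3.5)    # within the measured range
outside = bootstrap_predict(doses, readouts, 50.0)  # far outside it
```

The spread is much larger for the extrapolated dose than for the interpolated one, which is the kind of warning a point prediction alone would not give.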

Metrics/evaluation challenges. Standard AI evaluation metrics based on prediction accuracy are insufficient for synthetic biology. Metrics such as R² for regression models or accuracy for classification models do not account for the complexity of the underlying biological systems we are trying to model. In this field, it is equally important to quantify how well a model elucidates the inner workings of a biological system and captures existing domain knowledge. To this end, AI solutions that incorporate the principles of explainability and transparency are key to supporting iterative, interdisciplinary research. Furthermore, we need to develop new metrics, creatively, for measuring how well these methods quantify uncertainty.

We also need appropriate metrics for experimental design. Evaluating and validating models in synthetic biology sometimes requires additional experiments and resources. A small misclassification or error can have a significant impact on research goals. These costs should be integrated into the objective function or the evaluation of the AI model to reflect the real-world impact of misclassification.
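For reference, the two standard metrics discussed here, the coefficient of determination for regression and accuracy for classification, are usually defined as follows. The notation is ours: y_i are measured values, ŷ_i are predictions, ȳ is the mean of the measurements, and TP, TN, FP, FN are the counts of true and false positives and negatives.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```

Both quantities compare predictions with measurements only; neither says anything about whether the model reflects the underlying mechanism or how uncertain each prediction is.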
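A minimal sketch of folding experimental cost into model evaluation follows. The two confusion matrices and the cost figures are hypothetical; the point is only that models with identical accuracy can differ widely in what their mistakes cost.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of designs classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def experimental_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of the errors: wasted builds (FP) plus missed good designs (FN)."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical costs: building and testing a dud strain versus overlooking a promising one.
COST_FALSE_POSITIVE = 500
COST_FALSE_NEGATIVE = 5000

# Two hypothetical models evaluated on the same 100 candidate designs.
model_a = {"tp": 40, "fp": 10, "fn": 0, "tn": 50}   # errs toward extra experiments
model_b = {"tp": 30, "fp": 0, "fn": 10, "tn": 60}   # errs toward missing good designs

acc_a = accuracy(**model_a)
acc_b = accuracy(**model_b)
cost_a = experimental_cost(model_a["fp"], model_a["fn"], COST_FALSE_POSITIVE, COST_FALSE_NEGATIVE)
cost_b = experimental_cost(model_b["fp"], model_b["fn"], COST_FALSE_POSITIVE, COST_FALSE_NEGATIVE)
```

Under these assumed costs, the two models share the same accuracy while one is ten times more expensive in practice, so accuracy alone would not distinguish them.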

Sociological challenges. The sociological challenges of combining artificial intelligence with synthetic biology may be as hard to overcome as the technical ones, if not harder. Our impression is that many of these barriers stem from a lack of coordination and understanding between the disparate research cultures involved. While some solutions to this obstacle already exist, protracted sociological problems persist in both academia and industry.

The sociological problem arises because two very different groups of experts, computational scientists and laboratory scientists, must work together and often disagree.

Computational scientists and laboratory scientists are trained differently. Computational scientists are trained to focus on abstraction and tend to be passionate about automation, computational efficiency, and disruptive approaches. They naturally specialize and look for ways to offload repetitive tasks to automated computer systems. Laboratory scientists, on the other hand, are practical, trained to make specific observations, and prefer interpretable analyses that accurately describe the specific results of an experiment.


Legend: Computational scientists and laboratory scientists come from different research cultures and must learn to work together to fully benefit from the combination of artificial intelligence and synthetic biology.

The two worlds have different cultures, and this is reflected not only in how the two groups solve problems, but also in what problems they think are worth solving.

For example, there has long been tension between efforts to build infrastructure that supports research in general and efforts to answer specific research questions. Computational scientists tend to favor reliable infrastructure that can be reused across projects, while experimental scientists tend to focus on the end goal. Computational scientists like to develop mathematical models that explain and predict the behavior of biological systems, whereas laboratory scientists like to generate qualitative hypotheses and test them experimentally as quickly as possible (at least when studying microbes, where experiments can be completed quickly).

In addition, computational scientists are often excited only by grand, ambitious goals, such as bioengineering organisms for Mars, writing a compiler for life that creates DNA to meet a required specification, reengineering trees to grow into desired shapes, bioengineering real-life dragons, or replacing scientists with artificial intelligence. Laboratory scientists see such goals as pure “hype,” because computational approaches have promised a lot in the past without delivering, and they prefer to consider only what can be achieved with the current state of the technology.

Addressing the sociological challenges. The solution to these sociological problems is to encourage interdisciplinary teams and requirements. Admittedly, such an inclusive environment may be easier to achieve in a company, where the team succeeds or fails together, than in an academic setting, where a graduate student or postdoc can often claim success by publishing a few first-author papers without needing to integrate with other disciplines.

One possible way to achieve this integration is to run cross-training courses in which laboratory scientists are trained in programming and machine learning, and computational scientists are trained in experimental work. This would bring a valuable, unique, and necessary cultural exchange to both communities. The sooner this happens, the faster synthetic biology can develop.

In the long term, we need university curricula that combine the teaching of biology and bioengineering with automation and mathematics. Although some schools currently offer such courses, they remain a drop in the bucket.

Perspectives and Opportunities

Artificial intelligence can fundamentally enhance synthetic biology and help it achieve its full impact, which amounts to adding a third, biological axis to the engineering phase space alongside the physical and chemical ones. Most obviously, AI can produce accurate predictions of bioengineering outcomes, enabling efficient inverse design.

In addition, AI could support scientists in designing experiments and choosing when and where to sample, a problem that currently requires trained experts. AI can also support automated searches, high-throughput analysis, and hypothesis generation based on big data sources, including historical experimental data, online databases, ontologies, and other technical materials.

Artificial intelligence could allow synthetic biology experts to explore large design spaces more quickly and to come up with interesting “outside the box” hypotheses, augmenting the experts’ knowledge. Synthetic biology poses some unique challenges to current AI solutions that, if addressed, will lead to fundamental advances in both synthetic biology and AI. Designing a biological system ultimately relies on the ability to control it, which is the ultimate test of understanding the fundamental laws that govern it. Therefore, AI solutions that enable synthetic biology research must be able to describe the mechanisms behind their best predictions.

Although recent AI techniques based on deep learning architectures have changed the way we think about feature engineering and pattern discovery, they are still in their infancy in terms of their ability to reason about and explain their learning mechanisms.

Therefore, AI solutions that combine causal inference, interpretability, robustness, and uncertainty estimation have enormous potential impact in this interdisciplinary field. Biological systems are so complex that AI solutions based purely on brute-force discovery of correlations cannot effectively describe a system’s intrinsic characteristics. A new class of algorithms that smoothly integrate physical and mechanistic models with data-driven models is an exciting new research direction. We are currently seeing some initial positive results in climate science and computational chemistry, and hope to see similar progress in the study of biological systems.

As AI provides the tools to modify biological systems, synthetic biology can in turn inspire new AI approaches. Biology has inspired fundamental elements of artificial intelligence such as neural networks, genetic algorithms, reinforcement learning, computer vision, and swarm robotics. In fact, there are many biological phenomena that can be, and are worth, emulating digitally. For example, gene regulation involves an elaborate network of interactions that not only allows cells to sense and respond to their environment, but also keeps them alive and stable. Maintaining homeostasis (the state of stable internal physical and chemical conditions maintained by living systems) involves producing the appropriate cellular components at the right time and in the right quantities, sensing internal gradients, and carefully regulating the cell’s exchanges with the environment. Can we understand and harness this ability to produce truly self-regulating artificial intelligence or robots?

Another example involves emergent properties (that is, properties displayed by a system but not by its components). For example, an ant colony behaves and responds like a single organism, not just as the sum of its individual ants. Similarly, consciousness (that is, the perception or awareness of internal or external existence) is a qualitative characteristic that emerges from a physical substrate such as neurons. Swarm robots that self-organize and collectively build structures already exist. Can we use a general theory of emergence to create hybrids of robotic and biological systems? Can we create consciousness from a completely different physical substrate, such as transistors? A final possible example involves self-repair and replication: even the simplest forms of life show the ability to repair and replicate themselves. Can we understand the basis of this phenomenon well enough to produce self-repairing and self-replicating AI?
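One simple way to combine a mechanistic model with a data-driven one is to let the mechanistic model make the first prediction and fit a statistical correction to whatever it gets wrong. The sketch below does this with Michaelis-Menten kinetics and a linear residual model. The kinetic parameters, the measurements, and the choice of a linear correction are all illustrative assumptions rather than a method taken from the article.

```python
def michaelis_menten(s, vmax=1.0, km=0.5):
    """Mechanistic prior: reaction rate at substrate concentration s (assumed parameters)."""
    return vmax * s / (km + s)


def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def fit_hybrid(substrate, measured):
    """Data-driven part: learn a linear correction to the mechanistic residuals."""
    residuals = [y - michaelis_menten(s) for s, y in zip(substrate, measured)]
    slope, intercept = fit_linear(substrate, residuals)
    return lambda s: michaelis_menten(s) + slope * s + intercept


def sse(model, xs, ys):
    """Sum of squared errors of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical rates that fall below the mechanistic curve at high substrate levels.
substrate = [0.1, 0.5, 1.0, 2.0, 4.0]
measured = [0.18, 0.52, 0.64, 0.70, 0.66]

hybrid = fit_hybrid(substrate, measured)
mechanistic_error = sse(michaelis_menten, substrate, measured)
hybrid_error = sse(hybrid, substrate, measured)
```

The mechanistic term keeps the prediction interpretable, and the learned correction shows where the assumed mechanism falls short (here, at high substrate concentrations), which is itself a biological hint worth following up.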

While this kind of biomimicry has been considered before, the beauty of synthetic biology is that it lets us “tinker” with biological systems to test the models and principles behind biomimicry. For example, we can now tinker with cellular gene regulation at the genome scale, modify it, and test what exactly is responsible for its extraordinary resilience and adaptability. Or we could bioengineer ants to test what colony behavior ensues and how that behavior affects ant survival. Or we could alter a cell’s self-repair and self-replication mechanisms and test the effects of long-term evolution on its ability to compete.

Furthermore, in cellular modeling we have a good understanding of the biological mechanisms involved. Knowing how a neural network detects the shape of an eye is unlikely to tell us how the brain does the same thing, but the situation in synthetic biology is different. Mechanistic models do not predict perfectly, but they produce qualitatively acceptable results. Combining these mechanistic models with the predictive power of ML can help bridge the gap between the two and provide biological insight into why some ML models predict biological behavior better than others. This insight can lead us to investigate new ML architectures and methods.

Artificial intelligence can help synthetic biology, and synthetic biology can in turn help artificial intelligence. The interaction of these two disciplines in a continuous feedback loop will create a future we cannot imagine now, just as Benjamin Franklin could not have imagined that his discoveries about electricity would one day make the Internet possible.

Original link:


Leifeng Network
