Model

Introduction

    Nano luciferase (NanoLuc), introduced by Promega in 2012, is a new member of the luciferase reporter gene/protein family. It converts furimazine (FMZ) and oxygen into furimamide and carbon dioxide, generating a glow-type signal as output (Fig. 1). NanoLuc has a smaller molecular weight (19.1 kDa), a brighter signal (about 150-fold brighter than conventional luciferases), a longer signal half-life (about 120 min), and a higher signal-to-noise ratio (1). These advantages give NanoLuc excellent application prospects in bioluminescence imaging, in exploring gene regulation and cell signaling, and in investigating protein-protein interactions. For the same reasons, NanoLuc is chosen as one of the output signals of the Ribozyme-Enabled Detection of RNA (RENDR) system in the detection part of our project.
Fig. 1 Reaction mechanism of Nano Luciferase.
    According to the standard protocol provided by Promega, a mixed reagent containing Nano-Glo® Luciferase Assay Buffer and Nano-Glo® Luciferase Assay Substrate is added to the test system. After the light intensity of the sample is measured, the NanoLuc concentration in the test system can be calculated from a calibration curve. The standard protocol guarantees strong linearity of luminescence for NanoLuc concentrations between $10^{-2}$ and $10^{4}$ pM, so dilution is required during experiments (Fig. 2). Although some researchers worry about the non-linearity that may occur at higher NanoLuc concentrations, such a situation is unlikely because of the saturation limit of common laboratory luminometers. However, the light detectors used in laboratory luminometers are mostly single-photon detectors such as photomultiplier tubes (PMTs) or avalanche photodiodes (APDs), which require a high-voltage configuration that is unrealistic for low-cost devices designed for our target environment. Instead, cameras and photoresistors are the light detectors most used in low-cost devices. Because both have a higher limit of detection (LOD) than PMTs and APDs, a higher NanoLuc concentration is required. Another shortcoming of a low-cost device is that the operation of adding reagents to the test sample is error-prone. The reaction between NanoLuc and its substrate in our system is swift, and the light intensity decays rapidly, so measuring the transient peak luminescence requires a luminometer equipped with an injector. Since our device users are most likely not laboratory-trained, the situation is even more serious. All in all, the standard protocol's calibration-curve method is unsuitable for our project's hardware design, and a new method for estimating NanoLuc concentrations has to be built.
Fig. 2 Linear dynamic range of NanoLuc.
    The original intention of this model is to mathematically describe the relationship between the luminescence decay kinetics recorded by the hardware and the NanoLuc concentration, so that it can be used to estimate the likelihood of acute hepatopancreatic necrosis disease (AHPND) in shrimp, the pursuit of the RENDR system. Such a modeling process is essential for implementing the OMEGA project and its measurement.

Model Derivation

    Before starting modeling, we first describe the new protocol for measuring NanoLuc concentrations and how the hardware collects the luminescence decay kinetics of a test sample (Fig. 3). Assume a sample (such as a reaction-finished cell-free system or purified NanoLuc) is waiting for measurement. The sample is mixed with the Nano-Glo® Luciferase Assay Substrate (FMZ) on a paper chip placed in the monitoring device, and the camera in the upper cover records the whole luminescence decay once the inside of the device is dark (Fig. 3a). The recorded video is then processed in the software part using the back-end algorithms for paper-chip positioning (Fig. 3b) and data filtering (Fig. 3c). After this processing, the output data is ready for modeling and analysis (Fig. 3d). Notice that our new protocol intentionally omits the Nano-Glo® Luciferase Assay Buffer to keep the luminescence recordable by a low-cost camera, which also shortens the luminescence decay for faster detection.
Fig. 3 The process of data collection. a The monitoring device captures the luminescence decay kinetics using a camera. b The paper-chip positioning algorithm determines the position of the samples. c The filter reduces the noise in the collected data. d Data modeling and analysis mine the information hidden in the data.
    To derive a model explaining the relationship between the luminescence decay kinetics and the NanoLuc concentration, Video A was created as the learning data using purified NanoLuc (147.12 μg/mL, i.e., 7.7 μM NanoLuc) at three relative concentrations (100%, 50%, and 10%). Each sample in Video A contains 0.2 μL FMZ and 10 μL NanoLuc solution of the defined concentration. Video A is attached as follows.
The video is a GIF; you may need to refresh the page to view it.
    The data extracted using the back-end algorithm in the software is shown as follows (Fig. 4). The data in Video A is denoted ${D_A} = \{ {t_i},{x_i},{u_i}\}, i = 1,2, \cdots, n$, where ${t_i}$ represents the time, ${x_i}$ denotes the NanoLuc concentration, and ${u_i}$ is the light intensity at ${t_i}$.
Fig. 4 The data extracted from Video A.
    Here we pay attention to two critical features in the luminescence decay kinetics data:
    i. The data collected by the hardware is scarce and noisy. Deriving governing equations from such imperfect data is not easy, especially for a system that has not been studied before. Given that, obtaining physical laws from rigorous first principles seems impossible.
    ii. Changing the NanoLuc concentration changes the initial light intensity, showing a strong positive correlation. Meanwhile, the luminescence decay time (in fact, the time until the luminescence falls below the camera's LOD) is similar across groups, which indicates that the decay speed is related to the NanoLuc concentration.
    Considering the scarce and noisy data, data-driven equation discovery was adopted to analyze the data in Video A. Our strategy is based on work published in Nature Communications, which developed PINN-SR (physics-informed neural networks with sparse regression), a method with salient interpretability and generalizability for discovering the governing PDEs of nonlinear spatiotemporal systems from scarce and noisy data. However, PINN-SR requires pre-specifying a library of candidate PDE terms for later deletion, which restricts its application to free-form governing-equation discovery (2). Our new strategy (PINN-GP-SR) keeps almost the same implementation as PINN-SR but introduces genetic programming (GP) to generate a candidate library in the first rounds (Fig. 5). GP is itself a method for mining physical laws from data (3): it encodes a mathematical formula as a tree structure so that different trees can evolve, exchanging sub-trees or mutating themselves, to generate varied trees for evaluation. GP easily overfits the dataset, which makes it convenient for producing a large library for PINN-SR. In turn, the Occam's-razor sparsity of SR can later shrink the large library to the correct size, and PINN can embed the physical law described by the governing equation into the data-learning process.
Fig. 5 The new strategy (PINN-GP-SR) discovers the free-form governing equation.
    Let us now use PINN-GP-SR to derive the governing equation of the luminescence decay.
    (1) A neural network is trained as a universal function approximator of the luminescence decay process. Since we have found that the decay speed is related to the NanoLuc concentration, a neural network is employed to predict the light intensity from the time and the NanoLuc concentration, written as $u = NN1\left( {x,t;\theta } \right)$, where $\theta$ denotes the parameters learned from the data. Video A was recorded at a 30 fps camera frame rate, so each of the three traces has 5484 points. All the data points are normalized and split into a training set (70% of the data) and a test set (30%). NN1 has eight hidden layers with 100 nodes each and a ReLU activation function in each layer (Fig. 6a). $\theta$ is learned from the training set with the Adam algorithm, with the mean squared error (MSE) as the loss function: $$MSE_u = \frac{1}{n}\sum\limits_{i = 1}^n {{{\left[ {{u_i} - NN\left( {{x_i},{t_i};\theta } \right)} \right]}^2}}$$ The best network is evaluated and selected using the same metric on the test set. NN1 converged with good performance on both the training set (MSE: 2.59e-05, R2: 0.99905) and the test set (MSE: 2.70e-05, R2: 0.99904) after 22503 training steps (Fig. 6b, c). The converged NN1 then serves as the GP input for constructing the governing equation in the next step.
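The wiki does not publish the network code, so as a rough illustration, here is a minimal NumPy sketch of the NN1 architecture described above (inputs $(x, t)$, eight hidden ReLU layers of 100 nodes, scalar output $u$) together with the $MSE_u$ loss and a 70/30 split. The initialization scheme, the toy data, and the split logic are our own assumptions; the actual NN1 is trained with Adam, which is omitted here.

```python
import numpy as np

def init_nn1(seed=0, n_hidden=8, width=100):
    """He-initialized weights for NN1: (x, t) -> 8 ReLU layers x 100 nodes -> u."""
    rng = np.random.default_rng(seed)
    sizes = [2] + [width] * n_hidden + [1]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def nn1(params, x, t):
    """Forward pass u = NN1(x, t; theta)."""
    h = np.stack([x, t], axis=1)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)      # ReLU hidden layers
    W, b = params[-1]
    return (h @ W + b).ravel()              # linear output layer

def mse_u(params, x, t, u):
    """MSE_u = (1/n) * sum_i [u_i - NN1(x_i, t_i; theta)]^2."""
    return float(np.mean((u - nn1(params, x, t)) ** 2))

# toy normalized data and a 70/30 train/test split, mirroring the text
rng = np.random.default_rng(1)
x, t = rng.random(300), rng.random(300)
u = 1.0 / (1.0 + 5.0 * t)                   # decay-shaped stand-in target
idx = rng.permutation(300)
train, test = idx[:210], idx[210:]
params = init_nn1()
loss = mse_u(params, x[train], t[train], u[train])
```

Training would repeatedly update `params` to lower this loss on the training split while monitoring the same metric on the test split.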
Fig. 6 Training of NN1. a The architecture of NN1. b The loss over the training process. c The R2 over the training process.
    (2) A modified genetic programming algorithm is constructed to obtain the baseline PDE. The essential mathematical operations are designed as tree structures so that different formulas can be represented by stacking the base trees (Fig. 7).
Fig. 7 The basic GP trees used to represent essential mathematical operations.
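As an illustration of the tree encoding (a generic sketch, not the team's actual implementation), the snippet below represents a formula as a small expression tree whose leaves are the light intensity `u` or a constant, and applies sub-tree mutation by swapping in a freshly grown random tree. The operator set, depth, and probabilities are arbitrary choices for the example.

```python
import random

# operator set for internal nodes (an arbitrary choice for this sketch)
OPS = {'+': lambda a, b: a + b, '*': lambda a, b: a * b}

class Node:
    """One node of a GP tree: either a leaf ('u' or a constant) or an operator."""
    def __init__(self, op=None, children=(), value=None):
        self.op, self.children, self.value = op, list(children), value

    def eval(self, u):
        if self.op is None:
            return u if self.value == 'u' else self.value
        return OPS[self.op](*(c.eval(u) for c in self.children))

    def __repr__(self):
        if self.op is None:
            return str(self.value)
        return '(' + f' {self.op} '.join(map(repr, self.children)) + ')'

def random_tree(depth, rng):
    """Grow a random tree; leaves are 'u' or a random constant."""
    if depth == 0 or rng.random() < 0.3:
        return Node(value=rng.choice(['u', round(rng.uniform(-1, 1), 2)]))
    op = rng.choice(list(OPS))
    return Node(op=op, children=[random_tree(depth - 1, rng) for _ in range(2)])

def mutate(tree, rng, depth=2):
    """Sub-tree mutation: replace a randomly chosen sub-tree with a fresh one."""
    if tree.op is None or rng.random() < 0.5:
        return random_tree(depth, rng)
    i = rng.randrange(len(tree.children))
    tree.children[i] = mutate(tree.children[i], rng, depth)
    return tree

rng = random.Random(0)
tree = Node(op='*', children=[Node(value='u'), Node(value='u')])  # encodes u*u
```

Crossover works the same way, except the replacement sub-tree is cut from another tree in the population rather than grown at random.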
    During tree evolution, each tree can mutate in one of its sub-trees or cross over with other trees. After mutation and crossover, the tree is transformed back into a mathematical formula and expanded as a weighted polynomial. The weights are then recalculated using a sequential threshold ridge regression (STRidge) algorithm, which guarantees sparsity; we call this pruning the GP tree. After pruning, each GP tree is evaluated with the following MSE loss: $$\frac{1}{n}\sum\limits_{i = 1}^n {{{\left\{ {\frac{{\partial NN\left( {{x_i},{t_i};\theta } \right)}}{{\partial t}} - N\left[ {NN\left( {{x_i},{t_i};\theta } \right)} \right]} \right\}}^2}}$$ where $N[u]$ is the nonlinear differential operator represented by the GP tree. The evaluation result is also used in the crossover operation to imitate the natural biological process of evolution. Thanks to automatic differentiation implemented via the chain rule, $N[u]$ can be computed exactly from the neural network, without numerical-differentiation error. In this part, we obtain an ODE library with only one term: $$\alpha u^{1+\beta}$$ where $\alpha$ is a learnable parameter and $\beta$ is the normalized NanoLuc concentration.
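STRidge itself can be stated in a few lines. The sketch below is a generic re-implementation (not the team's code): it ridge-solves for the weights of each candidate term, zeroes out coefficients below a threshold, and re-solves on the survivors. On a synthetic trace obeying du/dt = -0.5·u², it keeps only the u² column of a [u, u², u³] library; the threshold, ridge penalty, and toy data are our assumptions.

```python
import numpy as np

def stridge(theta, dudt, lam=1e-5, tol=0.05, iters=10):
    """Sequential threshold ridge regression: ridge-solve for the weights of
    each candidate term, zero small coefficients, and re-solve on survivors."""
    n_terms = theta.shape[1]
    w = np.linalg.solve(theta.T @ theta + lam * np.eye(n_terms), theta.T @ dudt)
    for _ in range(iters):
        small = np.abs(w) < tol
        w[small] = 0.0
        big = ~small
        if big.any():
            w[big] = np.linalg.solve(
                theta[:, big].T @ theta[:, big] + lam * np.eye(int(big.sum())),
                theta[:, big].T @ dudt)
    return w

# synthetic decay obeying du/dt = -0.5 * u^2 (so u(t) = 1 / (1 + 0.5 t))
t = np.linspace(0.0, 10.0, 500)
u = 1.0 / (1.0 + 0.5 * t)
dudt = -0.5 * u ** 2
library = np.column_stack([u, u ** 2, u ** 3])   # candidate terms
w = stridge(library, dudt)                        # only the u^2 weight survives
```

The surviving sparse weight vector is exactly the "pruned polynomial" attached to a GP tree.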
    (3) Post-learning using PINN to gain a more precise PDE. Since the previous step yields an ODE library with only one term, sparse regression is no longer required to shrink the library. PINN is used to embed the governing equation into the neural network by redesigning the loss function. In our case, the new loss adds together the MSE of the data and that of the best GP tree: $$MSE_u = \frac{1}{n}\sum\limits_{i = 1}^n {{{\left[ {{u_i} - NN\left( {{x_i},{t_i};\theta } \right)} \right]}^2}}$$ $$MSE_f = \frac{1}{n}\sum\limits_{i = 1}^n {{{\left\{ {\frac{{\partial NN\left( {{x_i},{t_i};\theta } \right)}}{{\partial t}} - N\left[ {NN\left( {{x_i},{t_i};\theta } \right)} \right]} \right\}}^2}}$$ $$loss = MSE_u + MSE_f$$ The converged NN1 is then fine-tuned with the new loss function. After another 17724 learning steps, NN1 reaches higher performance on the training set (MSE: 2.06e-05, R2: 0.99924) and the test set (MSE: 2.30e-05, R2: 0.99922), which indicates that PINN successfully embeds the physical law into the network. The final luminescence decay kinetics of Video A is represented as: $$\frac{{du}}{{dt}} = \alpha {u^{1 + \beta }},\quad \alpha = 0.0016$$ Fig. 8 shows the simulation using the ODE above and its gap from the experimental data. Considering that data collected with the hardware is noisy, the performance is quite good.
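The discovered ODE is separable, so its closed form is worth writing down. Assuming the decay sign convention (the intensity decreases, i.e. the right-hand side is $-\alpha u^{1+\beta}$ with $\alpha = 0.0016 > 0$; the sign is our reading, since the fitted $\alpha$ is reported as positive), integration gives:

```latex
\frac{du}{dt} = -\alpha u^{1+\beta}
\quad\Longrightarrow\quad
u(t) = \left(u_0^{-\beta} + \alpha\beta t\right)^{-1/\beta}
     = u_0\left(1 + \alpha\beta\, u_0^{\beta}\, t\right)^{-1/\beta},
\qquad
t_{\mathrm{LOD}} = \frac{L^{-\beta} - u_0^{-\beta}}{\alpha\beta}
\approx \frac{1}{\alpha\beta L^{\beta}} \quad (u_0 \gg L).
```

Here $u_0$ is the initial intensity and $L$ the camera's LOD. Notably, $t_{\mathrm{LOD}}$ is almost independent of $u_0$ when $u_0 \gg L$, which matches feature ii above: the decay time is similar across concentrations even though the initial intensities differ.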
Fig. 8 Luminescence decay kinetics. a ODE simulation result. b The difference between experimental data and simulation data.

Model Validation

    To test the accuracy of the equation, two new videos were prepared simultaneously with Video A. Video B has four NanoLuc concentrations (100%, 50%, 25%, and 10%), while Video C has two (100% and 30%). Each sample contains 0.2 μL FMZ and 10 μL NanoLuc solution of the defined concentration, the same conditions as Video A. After processing in the software, the datasets ${D_B}$ and ${D_C}$ were created (Fig. 9).
Fig. 9 Data collected using hardware. a Data of Video B. b Data of Video C.
    The data with 100% NanoLuc concentration was used to calibrate the data-filter hyperparameters in the software, while the other samples were used for the test. By solving the following optimization problem, we can obtain the unknown NanoLuc concentration: $$\mathop {\min }\limits_\beta \sum\limits_{i = 1}^n {{{\left[ {{u_i} - DataSim\left( {\frac{{du}}{{dt}} = 0.0016\,{u^{1 + \beta }}} \right)} \right]}^2}}$$ where the data simulation with the ODE is implemented using the Euler method, and the optimization problem is solved by a genetic algorithm (GA). The inferred NanoLuc concentrations are shown in the following table. All the test samples have an error within five percentage points, which means the ODE is reliable.
        Video Index    Ground truth    Prediction
        B              50%             54%
        B              25%             23%
        B              10%             10%
        C              30%             28%
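The inference step above can be sketched as follows (our own minimal reconstruction, not the team's code): `data_sim` integrates the discovered ODE with forward Euler, and the unknown $\beta$ is the value whose simulated trace best matches the observations. The wiki solves this with a genetic algorithm; a plain grid search stands in here, and the time grid, decay sign convention, and candidate range are our assumptions.

```python
import numpy as np

ALPHA = 0.0016  # decay-rate parameter fitted from Video A

def data_sim(u0, beta, t, alpha=ALPHA):
    """Forward-Euler integration of du/dt = -alpha * u**(1 + beta)."""
    u = np.empty_like(t)
    u[0] = u0
    for i in range(1, len(t)):
        u[i] = u[i - 1] - alpha * u[i - 1] ** (1.0 + beta) * (t[i] - t[i - 1])
    return u

def estimate_beta(t, u_obs, betas):
    """Return the beta minimizing sum_i [u_i - DataSim(beta)]^2."""
    losses = [np.sum((u_obs - data_sim(u_obs[0], b, t)) ** 2) for b in betas]
    return float(betas[int(np.argmin(losses))])

# sanity check: recover a known beta from a synthetic trace
t = np.arange(0.0, 5000.0)
u_obs = data_sim(1.0, 0.5, t)
beta_hat = estimate_beta(t, u_obs, np.linspace(0.1, 1.0, 10))
```

A GA would explore the same loss landscape but with mutation and crossover over a population of candidate $\beta$ values instead of a fixed grid.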
    To understand the learnable parameter in the equation, three more videos were prepared asynchronously with Video A. Video D, Video E, and Video F were recorded 14, 24, and 32 days after Video A, and the datasets ${D_D}$, ${D_E}$, and ${D_F}$ were generated accordingly (Fig. 10).
Fig. 10 Data collected in Video D, Video E, Video F.
    From the form of the ODE, it is easy to infer that the learnable parameter may relate to the enzyme activity. So three new datasets were created with the same enzyme used in the old datasets but recorded at different times, generating a gradient of enzyme activity. All samples use 100% NanoLuc concentration, so the learnable parameter can be calculated by solving the following optimization problem: $$\mathop {\min }\limits_\alpha \sum\limits_{i = 1}^n {{{\left[ {{u_i} - DataSim\left( {\frac{{du}}{{dt}} = \alpha {u^2}} \right)} \right]}^2}}$$ The learnable parameter $\alpha$ decreases as the enzyme activity decreases (Fig. 11). The luminescence decay kinetics is controlled by two factors: $\alpha$ and the light intensity. A higher NanoLuc concentration, which corresponds to higher total enzyme activity, results in a higher initial light intensity. To maintain the same luminescence decay time, $\alpha$ changes dynamically with the enzyme activity.
Fig. 11 The dynamics of $\alpha$ with the change of enzyme activity.

Model Extrapolation

    Up to this point, all tests of the ODE used NanoLuc concentrations at the μM level (100% NanoLuc = 7.7 μM, with at most a 10-fold concentration difference) without changing the FMZ concentration. Next, we want to test the model's generalizability with concentration changes larger than 10-fold for both NanoLuc and FMZ.
    The purified NanoLuc was diluted from $10^3$ to $10^8$ times to obtain NanoLuc concentrations from the nM to the pM level. The same buffer-free protocol was used, but the monitoring device was changed to a microplate reader, since the NanoLuc concentration is too low to be detected by the camera. All the collected data is shown in Fig. 12.
Fig. 12 The data collected in the microplate reader with different NanoLuc concentrations from nM to pM.
    The data in Fig. 12 shows that the luminescence decay kinetics persists at NanoLuc concentrations from 7.7 nM down to 0.77 pM. To check whether the luminescence decay obeys the ODE discovered by PINN-GP-SR, the linearity between the numerical differentiation of the light intensity (du/dt) and the squared light intensity (u^2) is checked, since each trace can be identified as 100% NanoLuc (also called a standard sample in the following text). The result is shown in Fig. 13. As the NanoLuc concentration decreases, linearity is no longer guaranteed, possibly because: (1) when the light intensity is close to the LOD of the microplate reader, the collected data oscillates, as seen in Fig. 13d, e; (2) numerical differentiation is unreliable, especially for oscillating data. A data filter is recommended in future data processing to handle the oscillation.
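The linearity check can be reproduced generically: take the numerical derivative du/dt, regress it on u^2, and report R2. The sketch below is our own code with toy clean and noisy traces (not the wiki's data); it also illustrates failure mode (2) above, since small sensor noise near the LOD sharply degrades the R2 of the numerical gradient.

```python
import numpy as np

def linearity_r2(t, u):
    """R2 of the linear regression du/dt ~ u^2 (the Fig. 13-style check:
    for a standard sample the ODE predicts du/dt proportional to u^2)."""
    dudt = np.gradient(u, t)                 # central-difference derivative
    a, b = np.polyfit(u ** 2, dudt, 1)       # fitted line: a * u^2 + b
    resid = dudt - (a * u ** 2 + b)
    return 1.0 - np.sum(resid ** 2) / np.sum((dudt - dudt.mean()) ** 2)

# clean trace obeying du/dt = -0.5 u^2, vs. the same trace with sensor noise
t = np.linspace(0.0, 20.0, 400)
u_clean = 1.0 / (1.0 + 0.5 * t)
u_noisy = u_clean + np.random.default_rng(0).normal(0.0, 0.01, t.size)
r2_clean, r2_noisy = linearity_r2(t, u_clean), linearity_r2(t, u_noisy)
```

Smoothing `u` before differentiating (the data filter recommended above) would recover much of the lost linearity for noisy traces.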
Fig. 13 The linearity between the numerical differentiation of the light intensity (du/dt) and the squared light intensity (u^2) at different NanoLuc concentrations. a 7.7 nM. b 0.77 nM. c 77 pM. d 7.7 pM. e 0.77 pM. f R2 table.
    The same exploration was carried out for FMZ. For a detection protocol, keeping the substrate in excess is an important condition, so the lowest workable FMZ concentration is one of the key points of our new protocol. The FMZ was diluted with PBS at different ratios: 1:1, 1:3, 1:7, 1:15, and 1:31. Using the same protocol and testing in the microplate reader, the data is plotted in Fig. 14. As the FMZ decreases, the reaction rate also decreases, and a peak becomes observable.
Fig. 14 The data collected in the microplate reader with different FMZ concentrations.
    The linearity between the numerical differentiation of the light intensity (du/dt) and the squared light intensity (u^2) is checked and shown in Fig. 15. Linearity is not maintained at high dilution ratios due to the lack of FMZ.
Fig. 15 The linearity between the numerical differentiation of the light intensity (du/dt) and the squared light intensity (u^2) at different FMZ dilution ratios. a 1:1. b 1:3. c 1:7. d 1:15. e 1:31. f R2 table.

Discussion

    NanoLuc, a new member of the luciferase reporter gene/protein family, has great potential in imaging and labeling research. However, according to the standard protocol provided by Promega, it can only be measured on standard devices such as microplate readers, which limits its use in rapid detection and special environments.
    In our project, low-cost hardware with a camera was created to detect AHPND in situ. Considering the skill level of potential users and the actual test conditions, a new protocol was designed, and a new luminescence decay kinetics model was built to estimate the NanoLuc concentration. The new kinetics is suitable not only for a low-cost device such as a camera but also for the microplate readers assumed by Promega's protocol, which shows the high generalizability of our model (Fig. 16).
Fig. 16 The new luminescence decay kinetics has high generalizability.
    In our modeling process, we successfully distilled the kinetics from sparse and noisy data with the help of physics-informed neural networks and genetic programming. Extensive experimental work was then used to explain the meaning of the parameters in the new ODE and its generalizability, which opens the black box of the neural network. To our knowledge, our work is the first AI-for-Science effort in the iGEM community that connects data-driven and mechanism-based modeling, and we expect this framework to become increasingly popular for governing-equation mining.
    In conclusion, the whole modeling process sets a good example for model development. The systematic implementation of the NanoLuc luminescence decay kinetics and proper modeling in our project will also contribute to the iGEM community and inspire other teams who choose NanoLuc as an imaging and labeling system in their projects.

References

      1. M. P. Hall et al., Engineered luciferase reporter from a deep sea shrimp utilizing a novel imidazopyrazinone substrate. ACS Chem. Biol. 7, 1848-1857 (2012).
      2. Z. Chen, Y. Liu, H. Sun, Physics-informed learning of governing equations from scarce data. Nat. Commun. 12, 6136 (2021).
      3. M. Schmidt, H. Lipson, Distilling Free-Form Natural Laws from Experimental Data. Science 324, 81-85 (2009).