Statistical Design of Experiments

  • What are we trying to achieve?
    Multiple factors govern the yield of a protein produced in bacteria. We want to find the optimal values of the factors that govern scFv expression in SHuffle E. coli by running experiments at different values of these factors and measuring the protein yield in each case. The combinations of factor values to test are chosen by statistical design.
  • Why is this the best method?
    Statistical design of experiments samples the multifactorial landscape of the response variable (in our case, yield) much more efficiently than the traditional one-factor-at-a-time approach, in which the influence of one factor is investigated while all other factors are held at a single fixed value.
  • What does this tell us?
    • Through this, we can determine statistically the best factor values for producing our protein in SHuffle E. coli. Identifying the right values of factors such as incubation time or culture temperature can be valuable for industrial-scale production.
    • We can also identify interactions between factors, i.e., whether the value of one factor changes the relationship between another factor and the response variable.
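
As a minimal illustration of what an interaction means, the sketch below computes the two main effects and the interaction effect from a two-level, two-factor experiment. The factor names and yield numbers are invented for illustration and are not from our experiments.

```python
# Hypothetical yields (arbitrary units) from a 2x2 factorial experiment.
# Keys are coded levels (temperature, IPTG): -1 = low, +1 = high.
# These numbers are made up for illustration, not measured data.
yields = {
    (-1, -1): 10.0,
    (+1, -1): 20.0,
    (-1, +1): 12.0,
    (+1, +1): 14.0,
}

def effect(contrast):
    """Mean response where the contrast is +1 minus mean response where it is -1."""
    high = [y for levels, y in yields.items() if contrast(levels) > 0]
    low = [y for levels, y in yields.items() if contrast(levels) < 0]
    return sum(high) / len(high) - sum(low) / len(low)

temperature_effect = effect(lambda lv: lv[0])          # main effect of factor 1
iptg_effect = effect(lambda lv: lv[1])                 # main effect of factor 2
interaction_effect = effect(lambda lv: lv[0] * lv[1])  # product contrast
```

In this toy data, raising the temperature adds 10 units at low IPTG but only 2 units at high IPTG. The nonzero interaction effect captures exactly that dependence, which a one-factor-at-a-time scan at a single IPTG level could not reveal.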

Optimising protein expression

While producing a drug and characterizing its properties are necessary, we also felt that to maximize the impact of our therapy, it needs to reach as many people as possible, i.e., it needs to be mass produced. One of the steps in industrial-scale production of biologics in bioreactors is process optimization: maximizing the output of the system by narrowing down to an ideal set of values for the governing factors.

Moreover, production of disulfide-bonded proteins in SHuffle E. coli works a little differently, and it is important to identify which factors influence protein production in this strain. After brainstorming and a literature review, we hypothesized that the following factors govern the yield of scFv produced in SHuffle E. coli cells[1][2][3][4]:

Continuously varying factors:

  • % dissolved oxygen
    Dissolved oxygen in a culture can govern the growth rate of bacteria and hence the protein produced.
  • Concentration of inducer (IPTG)
    Inducer concentration has been found to be an important factor to optimize for protein production in general, and its ideal value is protein-specific.
  • Period of incubation
    Period of incubation is another factor, and has been found to interact strongly with dissolved oxygen and post-induction temperature.
  • OD600 at induction
    OD600 at induction reflects the cell density, and hence the growth phase, at which the culture is induced to produce protein.
  • Nutrient concentration
    Naturally, a higher concentration of nutrients can mean a higher bacterial growth rate and more protein produced, but this factor can also interact with % dissolved oxygen.

Categorical factors:

  • Strain (SHuffle B or SHuffle K12)
    The SHuffle strain has been engineered in two different E. coli backgrounds (B and K12), which have different protein production properties.
  • Plasmid backbone
    We used a pET21b backbone with a T7/lac promoter. Other plasmid backbones may influence transcription and hence protein expression.
  • Molecular Chaperones to Enhance Folding
    For producing antibodies in SHuffle, we had access to multiple molecular folding chaperones. We wanted to experiment with the PDI-GPx7 fusion chaperone, which shuttles the oxidative power of H2O2 into disulfide bond formation[5].
  • Media
    Growth medium for bacteria can govern the growth rate and interact with a lot of other factors. We primarily used LB media.

Design of Experiments (DoE) is the empirical statistical methodology we employed to build models of protein yield as a function of these factors.

Introduction to DoE

Often in biological research, when trying to optimize the factors governing the output of a particular system, there is a tendency to optimize one factor at a time (OFAT) while keeping the other factors constant. This approach is intuitive but flawed: it does not sample the factor space adequately and cannot detect interactions between factors.

A typical DoE workflow consists of specifically designed experimental runs, called treatments, for each of which the output, or response, is measured. The factor values are varied together across treatments, i.e., this is a multifactorial approach.

Figure reproduced from Gilman et al. 2021[6]. Panel a shows the response landscape of a process governed by two factors. Panels b and c show the OFAT optimisation of the process. This approach is resource-intensive in the number of experiments performed and misses the optimum. Panels d and e show an iterative DoE approach reaching the peak of the landscape much more efficiently.
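
The contrast between OFAT and a multifactorial design can be sketched on a toy surface. Everything here is hypothetical: the function `toy_yield`, the candidate levels, and the OFAT starting point are invented, and the result depends on those choices. The sketch only shows how a single OFAT pass can stop short of the peak when two factors interact.

```python
from itertools import product

def toy_yield(x1, x2):
    # Invented surface with an x1*x2 interaction; its true peak is 0 at (0, 0).
    return -x1**2 - x2**2 + x1 * x2

levels = list(range(-4, 3))  # seven candidate settings per factor: -4 .. 2

# OFAT: hold x2 at its starting value and scan x1,
# then hold the best x1 found and scan x2.
x2_start = -4
best_x1 = max(levels, key=lambda x1: toy_yield(x1, x2_start))
best_x2 = max(levels, key=lambda x2: toy_yield(best_x1, x2))
ofat_runs = len(levels) + len(levels) - 1  # the (best_x1, x2_start) run is shared
ofat_best = toy_yield(best_x1, best_x2)

# Multifactorial: a 3-level full factorial varies both factors together.
design = list(product([-4, -1, 2], repeat=2))
doe_point = max(design, key=lambda p: toy_yield(*p))
doe_best = toy_yield(*doe_point)
```

On this surface the OFAT pass uses 13 runs and ends at (-2, -1) with a response of -3, whereas the best of the 9 factorial runs is (-1, -1) with a response of -1. Fitting a quadratic model to the factorial runs, as in a response surface analysis, would in addition point towards the true peak.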

A textbook DoE approach has three steps: scoping, screening and optimisation. Since our budget allowed experiments on only a limited number of factors, we designed only screening and optimisation experiments.

Model and Designs

We used JMP (SAS) to create all our experimental designs and to analyse the results. Of the factors we hypothesized, we narrowed down to six and planned a Definitive Screening Design to screen for the two or three most influential ones, on which we then planned to characterize a response surface using Response Surface Methodology.
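
For reference, a response surface model is usually a second-order polynomial in the factors. The form below is the standard full quadratic model written in our own notation (y for the response, x_i for the coded level of factor i, k for the number of factors). It is given as background and is not taken from the JMP output.

```latex
y = \beta_0
  + \sum_{i=1}^{k} \beta_i x_i
  + \sum_{i=1}^{k} \beta_{ii} x_i^2
  + \sum_{i<j} \beta_{ij} x_i x_j
  + \varepsilon
```

For k = 3 factors this model has 1 + 3 + 3 + 3 = 10 coefficients (intercept, linear, quadratic and two-factor interaction terms), so at least 10 runs are needed to estimate all of them.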

We eventually realized that it would be difficult to execute such a design at our scale and with our resources. Fortunately, the literature on antibody fragments in SHuffle E. coli is substantial enough that we could narrow down to three important factors and characterize a response surface on them for producing our protein.

We finally decided to run a response surface design in the lab on three factors:

  1. Temperature post-induction
    We varied this between 16°C and 37°C. From our literature review, we expected the ideal temperature, and any interactions involving it, to show up within this range.
  2. IPTG concentration
    Varied from 0.1 to 1 mM
  3. Optical density of the culture at induction (OD600)
    Varied from 0.5 to 1
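
To make the factor ranges concrete, the sketch below builds one common response surface design for three factors, a face-centred central composite design, and maps its coded levels onto the ranges listed above. This is an assumption for illustration: the design we actually ran was generated by JMP and may differ in type and number of runs. The dictionary keys and the helper `decode` are hypothetical names.

```python
from itertools import product

# Factor ranges taken from the text: (low, high) in natural units.
ranges = {
    "temperature_C": (16.0, 37.0),
    "iptg_mM": (0.1, 1.0),
    "od600_at_induction": (0.5, 1.0),
}

def decode(coded, low, high):
    """Map a coded level in [-1, +1] onto the natural range [low, high]."""
    centre = (high + low) / 2
    half_range = (high - low) / 2
    return centre + coded * half_range

k = len(ranges)
corners = list(product([-1, 1], repeat=k))             # 2^3 factorial points
axial = [tuple(s if i == j else 0 for j in range(k))   # centres of the cube faces
         for i in range(k) for s in (-1, 1)]
centre_point = [(0,) * k]
coded_design = corners + axial + centre_point          # 8 + 6 + 1 = 15 runs

# Translate every coded run into natural units.
design = [
    {name: decode(c, *ranges[name]) for c, name in zip(run, ranges)}
    for run in coded_design
]
```

With these ranges the centre level of temperature is 26.5°C, the midpoint of 16°C and 37°C. Replicated centre points, which are often added to estimate run-to-run variation, are left out of this sketch.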

The final design that we executed and its output are given below. Note that the output is the SDS-PAGE band intensity, which is proportional to the yield.

Measurement and Analysis

To measure our response variable, yield, we used the darkness of the SDS-PAGE band as a proxy for the amount of protein. To convert band darkness to mg/ml, we plotted a standard curve from known concentrations of bovine serum albumin (BSA). BSA also served to calibrate between gel images, which were auto-imaged at different illumination settings. Details can be found on our measurements page.
We used ImageJ (NIH) to analyse the SDS-PAGE gels. Intensity units were normalized across gels using the known BSA concentrations.
The measured responses were entered into JMP, which fitted an empirical multivariable regression model to estimate its parameters.
The JMP report is presented below.

In our experiments, no factor had a statistically significant effect on the yield of our antibody fragment (all p-values > 0.05).
The term closest to significance was the quadratic effect of temperature on yield, with a p-value of 0.21. Although not significant, this is consistent with literature on SHuffle protein expression reporting temperature as the most important factor influencing protein yield.
The entire JMP report is available here (PDF). We also plotted response surfaces for the pairwise interactions Temperature-IPTG, Temperature-OD, and IPTG-OD.
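
JMP's fitting procedure is not reproduced here. As a rough sketch of what such a regression does, the code below fits a full quadratic model in two coded factors by ordinary least squares, using only the Python standard library. The data are hypothetical and noise-free, generated from known coefficients so that the fit can be checked. The helper names `quadratic_terms`, `solve` and `fit_least_squares` are our own.

```python
from itertools import product

def quadratic_terms(x1, x2):
    """Model-matrix row for a full quadratic model in two coded factors."""
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_least_squares(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical, noise-free responses on a 3x3 grid of coded levels, generated
# from known coefficients; these are not our measured band intensities.
true_beta = [5.0, 1.0, -2.0, -3.0, 0.0, 0.5]
points = list(product([-1.0, 0.0, 1.0], repeat=2))
rows = [quadratic_terms(x1, x2) for x1, x2 in points]
y = [sum(b * t for b, t in zip(true_beta, row)) for row in rows]

beta_hat = fit_least_squares(rows, y)  # recovers true_beta up to rounding
```

With real, noisy data the estimated coefficients carry uncertainty. The p-values in a regression report come from comparing each coefficient with its standard error, which this sketch does not compute.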

The best condition in our sampling was 0.1 mM IPTG, an incubation temperature of 26.5°C, and induction at an OD600 of 1.
Using our BSA standard curve on SDS-PAGE gels, we estimated the yield of our scFv at this condition to be 379 μg/ml.
Yields could not be estimated for lower band intensities, since they fall below the intercept of the BSA standard curve.
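
The conversion from band intensity to concentration, and the reason low intensities cannot be converted, can be sketched as follows. The BSA standards below are invented numbers chosen to lie on a straight line. The real standard curve is on our measurements page, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares straight line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical BSA standards: concentration (ug/ml) vs. band intensity (a.u.).
# Illustrative numbers only, not our calibration data.
bsa_conc = [100.0, 200.0, 400.0, 800.0]
bsa_intensity = [1500.0, 2000.0, 3000.0, 5000.0]

slope, intercept = fit_line(bsa_conc, bsa_intensity)

def intensity_to_conc(intensity):
    """Invert the standard curve; return None at or below the intercept."""
    if intensity <= intercept:
        return None  # would map to a zero or negative concentration
    return (intensity - intercept) / slope
```

In this toy curve the intercept is 1000 intensity units, so a band of intensity 2500 converts to 300 μg/ml, while a band of intensity 800 has no meaningful concentration estimate.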

Collaboration with Team Virginia

Although we had to narrow the number of factors we could investigate to three, and left out time of incubation, we learned that understanding the relationship between incubation time and incubation temperature is important, especially for the SHuffle strain.
Thankfully, we were partnering with Team Virginia, who were using SHuffle E. coli to produce scFvs and antibodies for atherosclerosis diagnostics.
Since production optimisation was an important aspect of each of our projects, we decided to investigate the relationship between time and temperature. We designed and analyzed the experiments, while Team Virginia executed them.

For this model, we were limited to 9 experiments. JMP can accommodate such constraints when generating a design. We set up a Custom Design in JMP with the following factor ranges:

  • Temperature of incubation post-induction
    Between 16 and 37°C
  • Period of incubation
    Between 6 and 24 hours
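
One simple way to spend exactly 9 runs on two factors is a three-level full factorial, sketched below. This is an assumption for illustration only: JMP's Custom Design chooses runs by an optimality criterion and may have produced a different set of 9 treatments.

```python
import random
from itertools import product

temperatures_c = [16.0, 26.5, 37.0]  # low, midpoint, high (degrees C)
incubation_h = [6.0, 15.0, 24.0]     # low, midpoint, high (hours)

# Every combination of the three levels of each factor: 3 x 3 = 9 treatments.
runs = list(product(temperatures_c, incubation_h))

# Randomise the run order so that drift over time is not confounded with a factor.
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
rng.shuffle(runs)
```

Three levels per factor are the minimum needed to estimate curvature, which is what allows a response surface to show an interior optimum in temperature.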

For this model as well, we used SDS-PAGE band intensity as a proxy for yield. The table of experiments with the yield outputs and the response surface plot for the two factors are presented below:

For this model as well, neither factor had a statistically significant effect.

The best condition in this experiment was incubation for 24 hours at 26.5°C.

More importantly, there appeared to be an ideal temperature range between the extreme values we chose, where yield was highest for a given incubation period. This can be seen in the response surface plot below:


  • Statistically significant factor interactions could not be observed for these experiments.
  • Ideal production conditions were found to be 0.1 mM IPTG, 26.5°C and an OD600 of 1 at induction.
  • The yield calculated at the optimal condition was 379 μg/ml.

Future Directions

  • To observe statistically significant effects, the culture volume could be increased, since differences in cell pellet size between conditions would then be larger and easier to measure.
  • Small-scale fermenters could be used to probe factors like % dissolved oxygen (%DO), which is a more direct measure of aeration than the shaker RPM used in shake-flask experiments.


  1. Ren G, Ke N, Berkmen M. Use of the SHuffle Strains in Production of Proteins. Curr Protoc Protein Sci. 2016 Aug 1;85:5.26.1-5.26.21. PMID: 27479507.
  2. Ahmadzadeh M, Farshdari F, Nematollahi L, Behdani M, Mohit E. Anti-HER2 scFv Expression in Escherichia coli SHuffle®T7 Express Cells: Effects on Solubility and Biological Activity. Mol Biotechnol. 2020 Jan;62(1):18-30. PMID: 31691197.
  3. Behravan A, Hashemi A. RSM-based Model to Predict Optimum Fermentation Conditions for Soluble Expression of the Antibody Fragment Derived from 4D5MOC-B Humanized Mab in SHuffle™ T7 E. coli. Iran J Pharm Res. 2021;20(1):e127052.
  4. Hashemi A, Basafa M, Behravan A. Machine learning modeling for solubility prediction of recombinant antibody fragment in four different E. coli strains. Sci Rep. 2022;12:5463.
  5. Lénon M, Ke N, Szady C, Sakhtah H, Ren G, Manta B, Causey B, Berkmen M. Improved production of Humira antibody in the genetically engineered Escherichia coli SHuffle, by co-expression of human PDI-GPx7 fusions. Appl Microbiol Biotechnol. 2020 Nov;104(22):9693-9706. Epub 2020 Sep 30. PMID: 32997203; PMCID: PMC7595990.
  6. Gilman J, Walls L, Bandiera L, Menolascina F. Statistical Design of Experiments for Synthetic Biology. ACS Synth Biol. 2021;10(1):1-18.