Barcelona_UB - iGEM 2022

Introduction

Simulations and models have become increasingly important in recent years. They provide a first approach to a project without using any biological material, which reduces costs and points to the most promising path to follow. Models and simulations are not only used in the first steps of a project: they are powerful tools to support the data and the hypotheses, and even to visualize things that cannot be seen with the naked eye, such as protein structures.

The main objective of our project is to dock two proteins: CD19 and CD19-LIGAND (from now on, CD19L). The difficulty of this step is that CD19 (a B-lymphocyte marker), a natural protein encoded in the human genome, has few known ligands, which made finding one difficult. After searching the literature, we found an artificial ligand used in the development of a drug that was ultimately not released. Another problem was that we had no experimental evidence of the interaction between CD19 and CD19L and, moreover, we could not find the structure of CD19L. Nevertheless, starting only from the sequence of CD19L, we used structural biology tools to study whether this ligand was suitable for our aim.

1. Modeling of CD19L

First, we needed the structure of CD19L, because in the RCSB Protein Data Bank (RCSB PDB) we only found the structure of CD19 (6AL5). To obtain it, we tried different approaches. The first was AlphaFold2, an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence and showed remarkable results in the CASP competition in 2020. However, the result was disappointing, as more than 50% of the structure was disordered.

Our second approach was homology modeling. Homology modeling searches for proteins with a similar structure and uses them as templates to model the target protein, applying restraints and energy minimization. The modeling was done only for the recognition domain, both to reduce computation time and because it is the only domain relevant to the protein-protein docking. A scheme of the method is:

1. Fold assignment: There are many ways to do this. We decided to search with PSI-BLAST, which runs the search iteratively and saves the position-specific scoring matrix (PSSM). We ran PSI-BLAST for 4 iterations on the UniProt database (which is not biased toward proteins with known structures) and then used the PSSM to search once against the PDB database, which only contains proteins with known structures.
2. Template selection: This step is crucial to obtain a good model, so you may sometimes need to go back to a previous step and readjust your choices. From all the possible templates, we finally selected 6ANI_L (chain L), 6ANI_H (chain H), and 1C5B_H (chain H).
3. Model building: Here we used the program Modeller, which works in 3 steps: rigid-body assembly, where only the backbone of the residues is used; spatial restraints, where probability density functions derived from the templates define the restraints; and finally side-chain modeling, using backbone-dependent rotamer libraries together with energetic and packing criteria.
4. Evaluation: There are several strategies to evaluate the model. We used the program PROSA to select the model closest to the native conformation, a selection that has become crucial for structure prediction. Several methods have been developed to score protein models by energies, knowledge-based potentials, or a combination of both. The analysis of a database of protein structures shows that certain residues tend to be in closer proximity than others. This frequency can be interpreted as a probability, and by inverting the Boltzmann law we can calculate energies. The resulting force field is called a statistical potential or knowledge-based potential.
5. Improvement: As mentioned above, you sometimes need to go back to earlier steps in order to obtain the best model.
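
The knowledge-based potential mentioned in the evaluation step can be illustrated with a small sketch. It assumes a toy table of observed and reference contact counts and applies the inverse Boltzmann relation E = -kT ln(f_obs / f_ref). The function name, the residue pairs, and all counts are hypothetical and chosen only for illustration; this is not the potential that PROSA actually uses.

```python
import math

# kT in kcal/mol at roughly 298 K
KT = 0.593

def inverse_boltzmann(observed, reference, kt=KT):
    """Convert observed vs. reference contact counts into pseudo-energies.

    observed, reference: dicts mapping a residue pair to a count.
    Returns a dict with E = -kT * ln(f_obs / f_ref) for each pair.
    Counts must be positive, otherwise the logarithm is undefined.
    """
    total_obs = sum(observed.values())
    total_ref = sum(reference.values())
    energies = {}
    for pair, n_obs in observed.items():
        f_obs = n_obs / total_obs
        f_ref = reference[pair] / total_ref
        energies[pair] = -kt * math.log(f_obs / f_ref)
    return energies

# Toy counts (illustrative only, not taken from any database)
observed = {("LEU", "ILE"): 300, ("LYS", "GLU"): 150, ("LYS", "LYS"): 50}
reference = {("LEU", "ILE"): 200, ("LYS", "GLU"): 150, ("LYS", "LYS"): 150}

energies = inverse_boltzmann(observed, reference)
for pair, energy in energies.items():
    print(pair, round(energy, 3))
```

Pairs seen more often than the reference expects get a negative (favorable) energy, and pairs seen less often get a positive one.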

**Figure 1.** Python script used to run Modeller, showing the templates used for the model and a function that sorts the models by total energy.
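
As a rough idea of what such a sorting step can look like, here is a minimal sketch that does not require Modeller to be installed. It assumes each model is described by a dictionary with `name`, `failure`, and `molpdf` entries, similar to the records Modeller reports after a run. The file names and scores below are invented, and the script in the figure may rank by a different energy term.

```python
def sort_models_by_energy(outputs, key="molpdf"):
    """Return successfully built models sorted from lowest to highest score."""
    # Models that failed to build carry a non-None 'failure' entry; skip them
    ok_models = [m for m in outputs if m.get("failure") is None]
    return sorted(ok_models, key=lambda m: m[key])

# Hypothetical records; file names and scores are invented for illustration
outputs = [
    {"name": "cd19l.B99990001.pdb", "failure": None, "molpdf": 1523.4},
    {"name": "cd19l.B99990002.pdb", "failure": None, "molpdf": 1388.9},
    {"name": "cd19l.B99990003.pdb", "failure": "optimizer error", "molpdf": None},
]

ranked = sort_models_by_energy(outputs)
best = ranked[0]
print(f"Best model: {best['name']} ({best['molpdf']:.1f})")
```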

**Figure 2.** Representation of the structure of CD19L (recognition domain only), colored by secondary structure.

**Figure 3. Plot of the Z-score (vertical axis) against the residue position (horizontal axis) using PROSA.** For a good structure (native conformation), the Z-score should stay below 0. The Z-score is plotted using a window size of 30 residues.
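
The windowing mentioned in the caption can be sketched as a simple moving average over per-residue scores. This is only an illustration under stated assumptions: the centered, end-truncated window and the function names are our own choices, the score profile is made up, and PROSA's exact smoothing may differ.

```python
def window_average(values, winsize=30):
    """Average each position over a centered window, truncated at the ends.

    The window covers up to winsize // 2 residues on each side of the
    position being smoothed.
    """
    half = winsize // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        end = min(len(values), i + half + 1)
        window = values[start:end]
        smoothed.append(sum(window) / len(window))
    return smoothed

def positive_regions(smoothed):
    """Return 1-based residue positions whose smoothed score is above zero."""
    return [i + 1 for i, s in enumerate(smoothed) if s > 0]

# Made-up profile: a well-scored stretch followed by a poorly scored tail
scores = [-1.0] * 80 + [3.0] * 20
flagged = positive_regions(window_average(scores, winsize=30))
print("Residues to inspect:", flagged)
```

Positions returned by `positive_regions` would be the ones to revisit in the improvement step.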

2. Modeling of docking between CD19 and CD19L

Once we have the structures of the receptor (CD19) and the ligand (CD19L), it is time to see whether they interact. There are different approaches to study the interaction between two proteins. Here, we chose an exhaustive search (pyDock = ZDOCK + energy scoring) instead of stochastic sampling. These methods use geometry-based docking with a Fast Fourier Transform (FFT) based grid search: the proteins are discretized onto grids, and a correlation function evaluated with the FFT is used to find the most stable interaction. This algorithm has a computational cost of O(N³ ln N³), which is why we only use the recognition domain. We generated 2000 interaction models and then scored them by energy using electrostatics, van der Waals, surface, and hydrogen-bond terms. From the 10 best solutions we created the structures in PDB format and then chose the best one.
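
To make the FFT grid search more concrete, the sketch below scores every translation of a ligand grid against a receptor grid in a single pass using the correlation theorem. It is a toy under stated assumptions, not ZDOCK's or pyDock's actual scoring: the grids are simple 0/1 occupancy maps that we made up, translations wrap around the grid edges, rotations are left out (a real search repeats the scan for each ligand orientation), and the function name is hypothetical. It requires NumPy.

```python
import numpy as np

def fft_translation_scan(receptor, ligand):
    """Score every cyclic translation of `ligand` against `receptor`.

    Both inputs are 3D arrays of the same shape. Entry [a, b, c] of the
    result equals the sum over grid points x of
    receptor[x] * ligand[x - (a, b, c)], with indices wrapping around.
    """
    f_rec = np.fft.fftn(receptor)
    f_lig = np.fft.fftn(ligand)
    # Correlation theorem: a correlation in real space becomes a product
    # with the complex conjugate in Fourier space
    return np.real(np.fft.ifftn(f_rec * np.conj(f_lig)))

# Toy grids: a favorable 3x3x3 "pocket" on the receptor and a 3x3x3 ligand
n = 16
receptor = np.zeros((n, n, n))
receptor[5:8, 5:8, 5:8] = 1.0
ligand = np.zeros((n, n, n))
ligand[0:3, 0:3, 0:3] = 1.0

scores = fft_translation_scan(receptor, ligand)
best = np.unravel_index(np.argmax(scores), scores.shape)
print("Best translation:", tuple(int(i) for i in best))
```

Each transform on an N×N×N grid costs on the order of N³ ln N³ operations, whereas evaluating the same correlation directly for all translations would cost on the order of N⁶.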