Proof of Concept | Tianjin

Proof of Concept for "Dancing Venus" Data Storage

Statistics of sequencing results

After receiving the sequencing results, we decoded them and generated a GIF image. We first tallied the editing events at every site across all colonies sent for sequencing.

Figure 1: The number of colonies with editing events at each site.

Figure 2: All sites in yNuwa002.

The yGIF024 strain had the best editing result, so we decoded it first.

The original decoding program

We wrote the decoding programs in Python. The program first identifies and reads the information sequences in the sequencing results, translates them into an array of 0s and 1s, converts that array into 7 pixel maps, and finally concatenates the 7 pixel maps into a single GIF.

Figure 3: Flowchart of the original decoding program
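The last two steps of this pipeline can be sketched as follows. This is a minimal illustration, not the team's program: the function names, the 2 x 2 toy frame size, and the rule that an edited site flips a pixel relative to the previous frame are all assumptions made for the example.

```python
from typing import List

Frame = List[List[int]]  # rows of 0/1 pixels


def bits_to_frame_edits(bits: List[int], n_frames: int,
                        height: int, width: int) -> List[Frame]:
    """Split a flat 0/1 list into n_frames grids of height x width."""
    per_frame = height * width
    if len(bits) != n_frames * per_frame:
        raise ValueError("bit list length does not match the frame layout")
    grids = []
    for f in range(n_frames):
        chunk = bits[f * per_frame:(f + 1) * per_frame]
        grids.append([chunk[r * width:(r + 1) * width] for r in range(height)])
    return grids


def build_frames(edits: List[Frame]) -> List[Frame]:
    """Assumed rule: frame k is frame k-1 with the edited pixels flipped."""
    height, width = len(edits[0]), len(edits[0][0])
    current = [[0] * width for _ in range(height)]  # blank canvas before frame 0
    frames = []
    for edit in edits:
        current = [[p ^ e for p, e in zip(prow, erow)]
                   for prow, erow in zip(current, edit)]
        frames.append(current)
    return frames


# Toy layout: 7 frames of 2 x 2 pixels (hypothetical size).
bits = [0] * 28
bits[0] = 1  # frame 0, pixel (0, 0) edited
bits[7] = 1  # frame 1, pixel (1, 1) edited
frames = build_frames(bits_to_frame_edits(bits, n_frames=7, height=2, width=2))
```

Because each frame is derived from the one before it, a single missed edit in an early frame carries through to every later frame. Writing the frames out as an animated GIF would need an imaging library and is left out to keep the sketch dependency-free.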

The first-generation decoding program produced the following result.

Figure 4: First-generation decoding results

This first-generation result is still relatively far from the "Dancing Venus" GIF we originally envisioned. Our analysis identified two main reasons for the differences:

1) Many loci were not successfully edited. Because every frame except the first is generated from the previous frame, the resulting errors accumulate and are magnified.

2) Bystander editing can change the positioning sequence or the bases that encode the row and column, which causes two kinds of errors:

a. Unrecognized: the positioning sequence cannot be recognized, so the information sequence cannot be read;

b. Error reading: wrong row and column values are read out, producing incorrect results.

To bring the decoding results closer to the expected GIF, we reduced the errors in two ways: correcting against the original sequence, and combining the sequencing results of multiple colonies.

Decoding using the original sequence correction

In the program that uses original-sequence correction, we imported the original plasmid sequence from before editing, then located and sliced it. Each information sequence was split into five parts, oritar2=[], ori1g2=[], orib=[], orir=[], oril=[], corresponding to the editing target, the gRNA comparison recognition region, the number dimension, the row dimension, and the column dimension, respectively. When decoding each information sequence from a sequencing result, we took the three dimensions from the corresponding position in the original plasmid sequence instead of reading them from the sequencing result, where they may have been altered by bystander editing. This greatly reduced the "Unrecognized" and "Error reading" results.
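The idea of taking the dimensions from the reference can be sketched as below. This is a simplified illustration under stated assumptions: the `RefUnit` class, the function name, and the toy sequences are hypothetical, the read is assumed to be already sliced into units in the same order as the original plasmid, and a unit is called edited whenever its target bases differ from the original.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RefUnit:
    """One information sequence sliced from the original plasmid (hypothetical layout)."""
    target: str  # unedited editing-target bases (cf. oritar2)
    number: int  # number dimension (cf. orib)
    row: int     # row dimension (cf. orir)
    col: int     # column dimension (cf. oril)


def decode_with_reference(ref_units: List[RefUnit],
                          read_targets: List[str]) -> List[Tuple[int, int, int, int]]:
    """Return (number, row, col, bit) per unit.

    The three dimensions always come from the reference, so bystander edits
    in the sequenced dimension bases cannot corrupt them.
    """
    decoded = []
    for ref, observed in zip(ref_units, read_targets):
        bit = 1 if observed != ref.target else 0  # simplifying assumption
        decoded.append((ref.number, ref.row, ref.col, bit))
    return decoded


# Toy data, not real plasmid sequence.
ref_units = [RefUnit("ACGT", 1, 0, 0), RefUnit("ACGT", 1, 0, 1)]
read_targets = ["ATGT", "ACGT"]
decoded = decode_with_reference(ref_units, read_targets)
```

Only the edit status is read from the sequencing result; everything that locates the pixel comes from the unedited reference.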

The GIF decoded with the correction program has only 1 "Unrecognized" result and no "Error reading" results. However, many sites in a single colony were still not successfully edited, so we hoped to combine the sequencing results of multiple colonies so that they complement one another and give a better decoding result.

Figure 5: Second-generation decoding results

Decoding using multicolony sequencing results

In the multicolony decoding program, we defined the functions "imagechange1" and "imagechange2". They work like the first-generation decoding program: each decodes one sequencing result and returns a list of 0s and 1s for image generation. With this program, we can decode 5 sequencing results to obtain the 0/1 lists for the 7 pixel maps. The program then merges the lists that correspond to the same pixel map: a site is considered successfully edited as long as any one result shows a successful edit, and a result affected by bystander editing at a site is discarded and replaced by a result without bystander editing.
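The merging rule can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and `None` is used here as an assumed marker for a site whose result was discarded (for example because of bystander editing).

```python
from typing import List, Optional

Call = Optional[int]  # 1 = edited, 0 = not edited, None = discarded result


def merge_site(calls: List[Call]) -> Call:
    """A site counts as edited if any usable colony result shows an edit."""
    usable = [c for c in calls if c is not None]
    if not usable:
        return None  # no colony gave a usable result for this site
    return 1 if any(usable) else 0


def merge_colonies(bit_lists: List[List[Call]]) -> List[Call]:
    """Merge per-colony lists of equal length, site by site."""
    if not bit_lists:
        raise ValueError("need at least one colony")
    length = len(bit_lists[0])
    if any(len(b) != length for b in bit_lists):
        raise ValueError("all colonies must cover the same sites")
    return [merge_site(list(calls)) for calls in zip(*bit_lists)]


# Two toy colonies covering four sites.
merged = merge_colonies([
    [1, 0, None, None],
    [0, 0, 1, None],
])
```

In this toy input the third site is recovered from the second colony, while the fourth site stays unresolved because no colony produced a usable result.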

We decoded yGIF024, yGIF032, yGIF034, yGIF36, and yGIF107. Apart from 3 sites with no editing results at all, there were no "Unrecognized" or "Error reading" errors.

Figure 6: Third-generation decoding results

All three generations of our decoding programs have been submitted to GitLab, where you can download them to verify our decoding results.

Proof of Concept for "Ode To Joy" Data Storage

Decoding

Figure 1: Ode to Joy with error. The wrong note is in the red box.

Decoding in this project is done mainly by aligning the sequences before and after editing, using software we wrote. The decoding method is as follows:

We defined an identification sequence for each edited unit. There are three types of identification sequences: "CCTAGA", "CCATGG" and "CCAAGA". Here, the first three bases are a PAM sequence, the fourth and fifth bases are our x1 value, and the sixth base is our octave value. This is also a part of our identification units.
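As a small illustration, an identification sequence can be split into its fields as shown below. The sketch assumes the six bases divide into three PAM bases, two x1 bases, and one octave base; the function name and the dictionary keys are hypothetical.

```python
from typing import Dict

IDENTIFICATION_SEQUENCES = ("CCTAGA", "CCATGG", "CCAAGA")


def split_identification(seq: str) -> Dict[str, str]:
    """Split a 6-base identification sequence into PAM, x1, and octave fields."""
    if seq not in IDENTIFICATION_SEQUENCES:
        raise ValueError(f"unknown identification sequence: {seq!r}")
    return {
        "pam": seq[0:3],   # bases 1-3
        "x1": seq[3:5],    # bases 4-5 (assumed layout)
        "octave": seq[5],  # base 6
    }


fields = split_identification("CCTAGA")
```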

Because gRNA arrays take a long time to construct and the DNA itself may mutate, it was difficult to reproduce in DNA exactly the encoding designed in the dry lab. We therefore made a compromise: we treated x1, y1, and z1 as fixed values and omitted the notes that are not edited. This shortens the segment, simplifies the wet-lab experiments, and shortens the experiment cycle.

For the "Joy" project, we evaluated Ode to Joy and chose E3, an average value, as x1. The other notes are expressed as changes relative to it, which lowers the difficulty of our wet-lab experiments and speeds up our iteration.

We wrote a C++ program that reads the DNA fragments and finds our anchor tags and editing sites.
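The search for anchor tags can be sketched as a simple scan over a read. The team's program is written in C++; this Python version is only an illustration, the function name and the toy read are hypothetical, and it assumes the anchor tags are the three identification sequences and must match exactly.

```python
from typing import List, Tuple

ANCHOR_TAGS = ("CCTAGA", "CCATGG", "CCAAGA")


def find_anchor_tags(read: str,
                     tags: Tuple[str, ...] = ANCHOR_TAGS) -> List[Tuple[int, str]]:
    """Return (position, tag) for every exact tag match in the read."""
    tag_len = len(tags[0])
    hits = []
    for i in range(len(read) - tag_len + 1):
        window = read[i:i + tag_len]
        if window in tags:
            hits.append((i, window))
    return hits


# Toy read, not real sequencing data.
hits = find_anchor_tags("TTCCTAGAGGGCCATGGAA")
```

An exact-match scan like this misses any tag that has itself been mutated, so a real decoder would need some tolerance or a reference alignment.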

First generation decoding

Figure 2: The first decoding of Ode to Joy. The correctly edited notes are in the green box.

The first-generation decoder treats a site as edited whenever a base in the editing window differs from the original sequence.

We built a series of vectors that together form a matrix describing the editing results. The matrix shows at a glance which sites were edited and which were not.
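A minimal sketch of this first-generation rule is given below. It assumes the original and observed sequences are already aligned and of equal length; the function name, the window coordinates, and the toy sequences are hypothetical.

```python
from typing import List, Tuple


def first_generation_row(original: str, observed: str,
                         windows: List[Tuple[int, int]]) -> List[int]:
    """Mark a window 1 if any base in it differs from the original, else 0."""
    if len(original) != len(observed):
        raise ValueError("sequences must be aligned to the same length")
    return [1 if observed[start:end] != original[start:end] else 0
            for start, end in windows]


# Toy sequences with three editing windows of four bases each.
original = "AAAACCCCGGGG"
observed = "AAATCCCCGGAG"
row = first_generation_row(original, observed, [(0, 4), (4, 8), (8, 12)])
```

Every difference counts, so an unintended change inside a window is indistinguishable from the intended edit.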

We decoded the segment obtained from the wet-lab experiments 24 times in total; the 11th data set was lost, leaving 23 results. The 20th to 23rd results contained the highest number of edits, so we decoded mainly from these data and obtained the following matrix.

[1,0,1,0,0,0,0,0] //Expected value: 5 actual value: 6
[0,1,1,0,0,0,0,0] //Expected value: 2 actual value: 4
[0,1,1,0,0,0,0,0] //Expected value: 2 actual value: 4
[0,1,0,0,0,0,0,0] //Expected value: 2 actual value: 2
[1,1,0,0,0,0,0,0] //Expected value: 4 actual value: error
[0,0,1,0,0,0,0,0] //Expected value: 5 actual value: 5
[0,0,0,0,0,0,0,0] //Expected value: 5 actual value: 3
[0,1,0,0,0,0,0,0] //Expected value: 2 actual value: 2
[0,0,1,1,0,0,0,0] //Expected value: 5 actual value: error

Here, 0 represents "not edited" and 1 represents "edited". The decoded data were unexpected: the obtained values differed greatly from the expected values and were far from our ideal editing result. We think this was mainly caused by erroneous editing.
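The rows listed on this page are consistent with a simple mapping from a matrix row to a note value, sketched below. The mapping is inferred from those rows and is not stated by the team: a base value of 3, offsets of +1, -1, +2, and -2 for the first four columns, and an error when two opposite edits occur together. The last four columns are never set in the rows shown, so the sketch leaves them undetermined. All names are hypothetical.

```python
from typing import List, Optional

BASE_NOTE = 3              # inferred: an unedited row decodes to 3
OFFSETS = (1, -1, 2, -2)   # inferred offsets for columns 0-3


def row_to_note(row: List[int]) -> Optional[int]:
    """Decode one matrix row to a note value, or None for an error or unknown row."""
    if any(row[len(OFFSETS):]):
        return None  # columns 4-7 are not covered by the rows shown
    if (row[0] and row[1]) or (row[2] and row[3]):
        return None  # opposite edits at once, reported as "error"
    return BASE_NOTE + sum(off for bit, off in zip(row, OFFSETS) if bit)


note_a = row_to_note([1, 0, 1, 0, 0, 0, 0, 0])
note_b = row_to_note([1, 1, 0, 0, 0, 0, 0, 0])
note_c = row_to_note([0, 0, 0, 0, 0, 0, 0, 0])
```

Under these assumptions the three example rows decode to 6, an error, and 3 respectively, which matches the actual values listed for the same rows.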

Second generation decoding

Figure 3: The second decoding of Ode to Joy. All notes are edited correctly.

Therefore, we developed the second-generation decoding program. It builds the matrix by comparing the experimental sequence with the original sequence and records only the specified changes, which reduces the impact of erroneous editing on the decoding:

[0,0,1,0,0,0,0,0] //Expected value: 5 actual value: 5
[0,1,0,0,0,0,0,0] //Expected value: 2 actual value: 2
[0,1,0,0,0,0,0,0] //Expected value: 2 actual value: 2
[0,1,0,0,0,0,0,0] //Expected value: 2 actual value: 2
[1,0,0,0,0,0,0,0] //Expected value: 4 actual value: 4
[0,0,1,0,0,0,0,0] //Expected value: 5 actual value: 5
[0,0,0,0,0,0,0,0] //Expected value: 5 actual value: 3
[0,1,0,0,0,0,0,0] //Expected value: 2 actual value: 2
[0,0,0,1,0,0,0,0] //Expected value: 5 actual value: 1
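Recording only the specified changes can be sketched as follows. This is an illustration under assumptions: each editing window is paired with the bases it should contain after a successful edit, and the function name, window coordinates, and toy sequences are hypothetical.

```python
from typing import List, Tuple


def second_generation_row(observed: str,
                          expected_edits: List[Tuple[int, int, str]]) -> List[int]:
    """Mark a window 1 only if it shows exactly the specified edited bases."""
    return [1 if observed[start:end] == edited else 0
            for start, end, edited in expected_edits]


# Toy read with three windows; each tuple is (start, end, expected edited bases).
observed = "AAATCCCCGGAG"
expected_edits = [(0, 4, "AAAT"), (4, 8, "CTCC"), (8, 12, "GGGA")]
row = second_generation_row(observed, expected_edits)
```

In this toy input the third window has changed relative to an unedited "GGGG", but not to the specified bases, so it is recorded as 0 rather than as an edit.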

With this change, our results nearly matched our expectations, and we attribute the remaining differences to editing efficiency. Both error areas are caused by unsuccessful editing, and we could fix them by aligning other sequences to obtain the expected results.