Document Document

Model

Model Overview



On this page, we will explore what kind of encoding within nucleotides is most suitable for micro Nuwa. Most DNA storage methods store information in the form of nucleotide sequences. The theoretical Shannon information capacity [1] of DNA is 2 bits/nt, although this limit is difficult to reach due to inherent errors in the DNA storage channel in the form of sequencing and synthesis error rates and biochemical constraints in sequence space [2,3]. However, for micro Nuwa, we have more aspects to consider, as follows:
1 Requirements for DNA data sequence
  1.1 PAM site
  1.2 The editing window contains cytosine
  1.3 No duplication between each gRNA
2 Requirements for CBE
  2.1 Minimize mismatch and homology
  2.2 Edit site spacing affected by steric hindrance effect of Cas9
3 Limitations of DNA synthesis and sequencing technology

Besides, in order to evaluate different codes, the following parameter are considered:
1  Information density (bits/mm3) [4]
2  Information entropy production of an editing event (bits)
3  Editing potential, refers to the maximum editing limit of DNA data. For instance, one sequence can be changed from 1 to 100, and the other can only be changed from 1 to 10. Thus, the editing potential of the former is 10 times that of the latter.

With the coding requirements and parameters in mind, let's start our modeling.

Basic Principles



Note: This section only discusses the case that there is only one cytosine in the editing window.


For nCDA1Δ198-BE3, two strands of DNA can be edited. We assume that AGTAGTACG is an data editing unit (Figure 1a). The sequence state is represented by x. There are four values of x (Figure 1).rated our project.


Figure 1: Four states of data editing unit.

When we assign meanings to fixed sequences, we can assign values to such variable data editing units, define different operations, and such fixed modified sequence can represent any meaning. For example, fixed sequence C is assigned a value of 0, and the value F(x) of the overall composite sequence is defined as the sum of fixed sequence C and variable data editing unit f (x1, x2, x3..., xn). The data structure in Figure 2b can be edited from 0 to a positive integer of 1~11.


Figure 2: The meaning of each part of the DNA sequence. (a)Using invariant sequences as gRNA binding sites. (b) Three edit units decorate one fixed sequence.

First Generation Music Coding





Now we have a fixed sequence-editing unit composite sequence and an operation method f(x) that defines the sequence. How to use these to achieve data editing? Considering the limited number of editing units for each fixed sequence, the values that can be represented by such a composite sequence must be limited. In such a limited number of values changes is our goal. Considering the group in the mathematical definition, we can think that the finite number constitutes a group, and the process of editing corresponds to the closed binary operation of the group. Arbitrary editing is similar to error correction. The wrong value of the data is the element of the group, and it is an error correction process to modify the element into the correct one. Since the error value may be arbitrary, error correction editing is the editing algorithm we need. Therefore, we associate the principle of Reed Solomon coding. For Figure 2b, group G can be built:

  • C : The value of fixed sequence;
  • f(xn) : The value of edit unit Xn. Each unit has four values;
  • n : edit units in total;
  • F : Final value of composite sequence

For a composite sequence with 8 states, G is a Galois field GF[23]. At present, C can be assigned any value artificially. If C ∈ {0,1,2,3,4,5,6,7}, n=2, f(x1,x2) ∈ {0,1,2,3,4,5,6,7}, Defined under binary F = G⊕ f(x1,x2) It can be proved that F ∈ {0,1,2,3,4,5,6,7}, table 1 lists all cases.


For a C major within an octave, it contains eight values (C, D, E, F, G, A, B). Assigning each note of the C major scale a positive integer from 0 to 7 (the value of C) in turn, we can modify the pitch arbitrarily through a composite sequence of n=2. When it comes to Sharp,Flat, Staccato,Pizz and so on, we just need to increase the value of n and give meaning to the resulting F. For instance, n=3, f(x3) =0 means ground state; f(x3) =1 means sharp; f(x3) =2 means ornaments; n=4, f(x4)=0 means ground state; f(x4)=1 means up an octive; f(x4)=2 means down an octive.


This coding method has two sides.

    Advantages:
  • 1.This kind of coding has infinite potential. As long as there are enough editing units, the original information can be modified into any information.
  • 2.The idea of Galois field is used in coding, which can ensure sufficient redundancy and self correction capability
    Disadvantages:
  • 1.The potential of coding is acquired at the expense of information density. In the above example, supposing that each unit needs 20 bp and the fixed sequence needs 10 bp, then a note needs to occupy at least 50 bp for storage. Such information density is unacceptable.
  • 2.If the editing information is very complex, a lot of editing cells need to be reserved. This will lead to a substantial increase in the cost of DNA synthesis and a decrease in the efficiency of DNA utilization.


Based on the above problems, we try to maintain a balance between coding potential and information density. The second generation refers to Midi file format for music coding. This is an important engineering iteration of our project, which will be described in detail below.

Second Generation Music Coding



In order to solve the problem of encoding density, we consider using the heap in the data structure to build a new structure. The largest structure of MIDI files is a data structure called Chunk, which is divided into HeaderChunk (hereinafter referred to as HC) and TrackChunk (hereinafter referred to as TC). HC mainly stores the format type, TC quantity and time format of MIDI files; TC stores a series of MIDI events and data.

Data Structure


We have defined a structure, as shown in the figure below. We will introduce each part in turn. <Edit Window> <chunk name> <PAM> <Rhythm type> <event> <parameter> <Musical Symbol Information>


Figure 3 The data structure of HeaderChunk designed for music.

In order to facilitate the unification of base information and numbers, We assign values to each base: A=0, T=1, G=2, C=3. Therefore, we can digitize all bases in the structure, and the data value of each base is ai. The whole data unit can be regarded as a value of a1 as the element.

<Edit Window>

The editing window is 5 bp. According to the above the value of each base, we think that each base in the editing window is a numerical value, which is respectively. According to its internal value, we can establish a sequence that can be modified, and this region corresponds to a one-dimensional vector B:

Def the value of edit window is F(a):
If a1=3.
All elements in B with a value of 3 are assigned a value of 1, and each element of D is 0 or 2. Def D':
Only 0 and 1 of the elements in D' can be determined by which element has changed. Then the value in F'(a) will be change. According to the F'(a), corresponding modification rules will be designed and applied in reading.


<PAM>

The pam site, NGG, is a fragment recognized by dCas9.


<Chunk Name>

Chunk name is a structural identification unit, which is composed of 20 bases and serves as a segment for modifying identification and recognition.


<Rhythm type>

Rhythm typeis composed of two bases. The encoding value of this segment is a6,a7. Def G(a6,a7),the value of G represents different rhythm patterns.

Table 3: The values of G and DNA Sequences Given by Different Rhythm


<event><parameter>

Like Rhythm type, we use this method to create more music events.
Def m=G(a7,a8), represents the functional relationship between the actual frequency and the data signal. Def vector M, which the number of element is m+1. Each element represents the continuity between notes. Both <event> and <parameter> determine the value of

Table 4 : Assignment to M with different values of E(a6,a7,a8)


<Musical Symbol Information>

Generally, the pitch in audio files is described by different wave functions. Digitizing the pitch greatly simplifies the sound storage. Although many details are lost, it shows great advantages in reducing the file size. As mentioned earlier, the value of m represents the number of notes contained in the unit. The length of this part is a function of m. Considering factors such as the length of the base, the universality of the range, etc., we will code all pitches of the commonly used five octaves. If each fragment is 3bp, Def vector: The value of Musical Symbol Information :

Table 5 : Codes corresponding to all pitches

All the above vectors form a matrix, which is the characteristic matrix B of all the parameters of a note.

How to edit


Recalling the previous article, a large number of bases and their corresponding numbers are generated in multiple data units, while the actual information part is mainly the region after the <Rhythm type>. We can think of the actual information as stored values. The data in the edit window of matrix B is changed, which is similar to the pointer type in C language.


Figure 4 Linear table formed by vector B

Let's suppose that a pointer was originally created to point to the original vector B. However, when data unit B is modified by foreign vector D, B changes and a new B ` is generated. So the original B data is abandoned, but our pointer does not point to the new data B `, but to the space left by the original B. At this time, in order to prevent errors, there are two methods: one is to clean up the pointer directly, so that the data unit will disappear forever, and the other is to create a new data unit to fill the original vacancy. You only need to create a new data unit at the position of the original vector B, and it can be considered that the unit has been completely changed. The original data is' invisible ', and we can no longer identify it. Although it remains in DNA, it can no longer be observed because there is no pointer to it.

When establishing a data structure, it is necessary to record the value D of each vector B. When the data structure needs to be changed, modify the position to be modified, as shown in the following figure:

Figure 5 Pointer movement and matrix B update in case of editing.

Finally, when we need to read, we just need to read the value F (a) of B in order from small to large to get the complete data.
This has three main advantages:
1 The density of data has been greatly improved, and more modifications can be made in a single edit.
2 A chunk like data structure is adopted, which greatly ensures the independence of each module.
3 It has its own identification sequence, and its resolution has no great influence on the position of the DNA chain.

Disadvantages:
1 CRISPR editing technology itself may also edit bases outside the editing window, which may lead to data structure damage.
2 During modification, new DNA fragments need to be synthesized and linked to the original fragments. If a small amount of modification is made, the cost will be too high. Therefore, it is more suitable for use after a large number of modifications.

GIF Coding





We first needed to transform the image Microvenus[5] into DNA sequence, in a way that can be edited by base editor. We used one 35bp base sequence to encode one pixel. In the 35bp sequence, “TAG” located on 1-3, “TGA” located on 10-12 and “NGG” located on 33-35 were the positioning sequences of information. When they were recognized by our decoding program from the sequencing results, the program regarded this DNA fragment to contain image information, then it proceeded to decode. The bases located on 4-5 represented the default color for this pixel, considering that our base editor might convert C to T or G to A on non-modifying sections, we set the rule when these two bases are both pyrimidines or purines, the pixel showed up as white; when one of them is pyrimidine and the other purine, the pixel showed up as black. Follow-up base sequences on 6-7 and 8-9 represented the row and column values of the pixel, this was designed to assure our encoding system can be used to store information on the genome scale: We were able to correctly locate each pixel encoded by each base sequence, even under the circumstance where a large number of gene rearrangements happen. Sequence on 13-32 was where gRNAs targeted on and our modifications took place, the modifying site consisted of 5bp and at least one C, we coded that for each one of the C edited, the color changed once. Last three bases on 33-35 functioned as the positioning sequence and the PAM locus, necessity for base editor to take effect.


Figure 6 The structure of the information sequence.

After all the editing sites were successfully edited, we got the first image - "microvenus"



Figure 7 The generation of the first pixelmap.

Next step, we wanted to modify Microvenus to generate a series of images, and finally concatenate them to form a GIF. The original encoding system was slightly adjusted, bases on 4-5 no longer represents color of the pixel, but the information of which modification led to which image. For example, bases on 4-5 are AT meant original pixel changed color once and generated the second image; if they were AC, then it meant that the exact pixel on the second image was changed once again and generated the third image, and so on. What last image's modification resulted in becomes the next image to be further modified, therefore we acquired seven different images in the end. These seven images were concatenated to be a GIF — Dancing Venus. (Encoding-decoding program download available on our Gitlab)


Figure 8 Generation of the last six pixel maps."AT" in blue is for generating the first pixel map. “AC” in orange is for generating the second pixel map. “AG” in pink is for generating the third map. “TC” in purple is for generating the fourth map. “TG” in cyan is for generating the fifth map. “CG” in brown is for generating the sixth map.

References


[1]C.E. Shannon. A mathematical theory of communication Bell Syst. Tech. J.,1948, 27, 379-423
[2]Heckel, R.; Mikutis, G.; Grass, R. N. A Characterization of the DNA Data Storage Channel. Sci Rep 2019, 9 (1), 9663. DOI: 10.1038/s41598-019-45832-6
[3]Organick, L.; Chen, Y. J.; Dumas Ang, S.; Lopez, R.; Liu, X.; Strauss, K.; Ceze, L. Probing the physical limits of reliable DNA data retrieval. Nat Commun 2020, 11 (1), 616. DOI: 10.1038/s41467-020-14319-8
[4]Church, G. M.; Gao, Y.; Kosuri, S. Next-generation digital information storage in DNA. Science 2012, 337 (6102), 1628. DOI: 10.1126/science.1226355
[5]Davis, Joe. Contemporary Art and the Genetic Code. Art Journal, vol. 55, no. 1,1996, pp. 70-74.