On this page, we will explore what kind of encoding within nucleotides is most suitable for
micro Nuwa. Most DNA storage methods store information in the form of nucleotide sequences.
The theoretical Shannon information capacity [1]
of DNA is 2 bits/nt, although this limit
is
difficult to reach due to inherent errors in the DNA storage channel in the form of
sequencing and synthesis error rates and biochemical constraints in sequence space [2,3].
However, for micro Nuwa, we have more aspects to consider, as follows:
1 Requirements for DNA data sequence
1.1 PAM site
1.2 The editing window contains cytosine
1.3 No duplication between each gRNA
2 Requirements for CBE
2.1 Minimize mismatch and homology
2.2 Edit site spacing affected by steric hindrance effect of Cas9
3 Limitations of DNA synthesis and sequencing technology
Besides, in order to evaluate different codes, the following parameter are considered:
1 Information density (bits/mm3) [4]
2 Information entropy production of an editing event (bits)
3 Editing potential, refers to the maximum editing limit of DNA data.
For instance, one
sequence can be changed from 1 to 100, and the other can only be changed from 1 to
10.
Thus,
the editing potential of the former is 10 times that of the latter.
With the coding requirements and parameters in mind, let's start our modeling.
For nCDA1Δ198-BE3, two strands of DNA can be edited. We assume that AGTAGTACG is an data editing unit (Figure 1a). The sequence state is represented by x. There are four values of x (Figure 1).rated our project.
Now we have a fixed sequence-editing unit composite sequence and an operation method f(x) that
defines the sequence. How to use these to achieve data editing? Considering the limited number of
editing units for each fixed sequence, the values that can be represented by such a composite
sequence must be limited. In such a limited number of values changes is our goal. Considering the
group in the mathematical definition, we can think that the finite number constitutes a group, and
the process of editing corresponds to the closed binary operation of the group. Arbitrary editing is
similar to error correction. The wrong value of the data is the element of the group, and it is an
error correction process to modify the element into the correct one. Since the error value may be
arbitrary, error correction editing is the editing algorithm we need. Therefore, we associate the
principle of Reed Solomon coding.
For Figure 2b, group G can be built:
For a composite sequence with 8 states, G is a Galois field GF[23]. At present, C can be assigned any value artificially. If C ∈ {0,1,2,3,4,5,6,7}, n=2, f(x1,x2) ∈ {0,1,2,3,4,5,6,7}, Defined under binary F = G⊕ f(x1,x2) It can be proved that F ∈ {0,1,2,3,4,5,6,7}, table 1 lists all cases.
For a C major within an octave, it contains eight values (C, D, E, F, G, A, B). Assigning each note of the C major scale a positive integer from 0 to 7 (the value of C) in turn, we can modify the pitch arbitrarily through a composite sequence of n=2. When it comes to Sharp,Flat, Staccato,Pizz and so on, we just need to increase the value of n and give meaning to the resulting F. For instance, n=3, f(x3) =0 means ground state; f(x3) =1 means sharp; f(x3) =2 means ornaments; n=4, f(x4)=0 means ground state; f(x4)=1 means up an octive; f(x4)=2 means down an octive.
This coding method has two sides.
Based on the above problems, we try to maintain a balance between coding potential and information density. The second generation refers to Midi file format for music coding. This is an important engineering iteration of our project, which will be described in detail below.
In order to solve the problem of encoding density, we consider using the heap in the data structure to build a new structure. The largest structure of MIDI files is a data structure called Chunk, which is divided into HeaderChunk (hereinafter referred to as HC) and TrackChunk (hereinafter referred to as TC). HC mainly stores the format type, TC quantity and time format of MIDI files; TC stores a series of MIDI events and data.
We have defined a structure, as shown in the figure below. We will introduce each part in turn. <Edit Window> <chunk name> <PAM> <Rhythm type> <event> <parameter> <Musical Symbol Information>
In order to facilitate the unification of base information and numbers, We assign values to each base: A=0, T=1, G=2, C=3. Therefore, we can digitize all bases in the structure, and the data value of each base is ai. The whole data unit can be regarded as a value of a1 as the element.
The editing window is 5 bp. According to the above the value of each base, we think that each base in the editing window is a numerical value, which is respectively. According to its internal value, we can establish a sequence that can be modified, and this region corresponds to a one-dimensional vector B:
The pam site, NGG, is a fragment recognized by dCas9.
Chunk name is a structural identification unit, which is composed of 20 bases and serves as a segment for modifying identification and recognition.
Rhythm typeis composed of two bases. The encoding value of this segment is a6,a7. Def G(a6,a7),the value of G represents different rhythm patterns.
Like Rhythm type, we use this method to create more music events.
Def m=G(a7,a8), represents the functional relationship between the actual
frequency and the data signal. Def
vector M, which the number of element is m+1. Each element represents the continuity between
notes.
Both <event> and <parameter> determine the value of
Generally, the pitch in audio files is described by different wave functions. Digitizing the pitch greatly simplifies the sound storage. Although many details are lost, it shows great advantages in reducing the file size. As mentioned earlier, the value of m represents the number of notes contained in the unit. The length of this part is a function of m. Considering factors such as the length of the base, the universality of the range, etc., we will code all pitches of the commonly used five octaves. If each fragment is 3bp, Def vector: The value of Musical Symbol Information :
Recalling the previous article, a large number of bases and their corresponding numbers are generated in multiple data units, while the actual information part is mainly the region after the <Rhythm type>. We can think of the actual information as stored values. The data in the edit window of matrix B is changed, which is similar to the pointer type in C language.
Let's suppose that a pointer was originally created to point to the original vector B. However,
when
data unit B is modified by foreign vector D, B changes and a new B ` is generated. So the
original B
data is abandoned, but our pointer does not point to the new data B `, but to the space left by
the
original B. At this time, in order to prevent errors, there are two methods: one is to clean up
the
pointer directly, so that the data unit will disappear forever, and the other is to create a new
data
unit to fill the original vacancy. You only need to create a new data unit at the position of
the
original vector B, and it can be considered that the unit has been completely changed. The
original
data
is' invisible ', and we can no longer identify it. Although it remains in DNA, it can no longer
be
observed because there is no pointer to it.
When establishing a data structure, it is necessary to record the value D of each vector B. When
the
data structure needs to be changed, modify the position to be modified, as shown in the
following
figure:
Finally, when we need to read, we just need to read the value F (a) of B in order from small to
large
to
get the complete data.
This has three main advantages:
1 The density of data has been greatly improved, and more modifications can be made in a
single
edit.
2 A chunk like data structure is adopted, which greatly ensures the independence of each
module.
3 It has its own identification sequence, and its resolution has no great influence on the
position of
the
DNA chain.
Disadvantages:
1 CRISPR editing technology itself may also edit bases outside the editing window, which
may
lead to
data structure damage.
2 During modification, new DNA fragments need to be synthesized and linked to the original
fragments. If
a small amount of modification is made, the cost will be too high. Therefore, it is more
suitable
for
use after a large number of modifications.
We first needed to transform the image Microvenus[5] into DNA sequence, in a way that can be edited by
base editor. We used one 35bp base sequence to encode one pixel. In the 35bp sequence, “TAG” located
on 1-3, “TGA” located on 10-12 and “NGG” located on 33-35 were the positioning sequences of
information. When they were recognized by our decoding program from the sequencing results, the
program regarded this DNA fragment to contain image information, then it proceeded to decode. The
bases located on 4-5 represented the default color for this pixel, considering that our base editor
might convert C to T or G to A on non-modifying sections, we set the rule when these two bases are
both pyrimidines or purines, the pixel showed up as white; when one of them is pyrimidine and the
other purine, the pixel showed up as black. Follow-up base sequences on 6-7 and 8-9 represented the
row and column values of the pixel, this was designed to assure our encoding system can be used to
store information on the genome scale: We were able to correctly locate each pixel encoded by each
base sequence, even under the circumstance where a large number of gene rearrangements happen.
Sequence on 13-32 was where gRNAs targeted on and our modifications took place, the modifying site
consisted of 5bp and at least one C, we coded that for each one of the C edited, the color changed
once. Last three bases on 33-35 functioned as the positioning sequence and the PAM locus, necessity
for base editor to take effect.
After all the editing sites were successfully edited, we got the first image - "microvenus"
[1]C.E. Shannon. A mathematical theory of communication Bell Syst.
Tech.
J.,1948, 27, 379-423
[2]Heckel, R.; Mikutis, G.; Grass, R. N. A Characterization of the DNA Data Storage Channel.
Sci
Rep 2019, 9 (1), 9663. DOI: 10.1038/s41598-019-45832-6
[3]Organick, L.; Chen, Y. J.; Dumas Ang, S.; Lopez, R.; Liu, X.; Strauss, K.; Ceze, L. Probing
the
physical limits of reliable DNA data retrieval. Nat Commun 2020, 11 (1), 616. DOI:
10.1038/s41467-020-14319-8
[4]Church, G. M.; Gao, Y.; Kosuri, S. Next-generation digital information storage in DNA.
Science
2012, 337 (6102), 1628. DOI: 10.1126/science.1226355
[5]Davis, Joe. Contemporary Art and the Genetic Code. Art Journal, vol. 55, no. 1,1996, pp.
70-74.