Modeling

Abstract

Using dengue prevalence data and climate data from Singapore, we developed and evaluated models to predict the prevalence of each of the four dengue serotypes. We implemented several machine learning and regression methods and compared them. Using the results of these models, we also examined how climate-related features affect each serotype.

Introduction

As the number of people infected with dengue fever increases and the endemic area expands, systems that predict the number of dengue infections are becoming more important[1]. Given the characteristics of dengue fever, however, a system that predicts epidemics for each serotype is more effective for epidemic prevention, and it is also useful for distributing vaccines effectively. Few existing systems predict epidemics by serotype[2]. To address this gap, we built a system that predicts the proportion of infected people by serotype using data from Singapore.

The relationship between the Dry lab's prediction system and the Wet lab and Hardware lab is shown in Figure 1.

Data and Methods

About Data

Many datasets simply aggregate the number of infected people, but very few record serotypes. Most of the few datasets that do are old, or are annual data collected at the scale of a single study[3][4][5][6]. Singapore, however, conducts serotyping on a national scale. We used prevalence data from Singapore[7] to develop a model that predicts the prevalence of each of the four serotypes. This data is referred to as data 1 below. We chose Singapore for the following reasons.

  • The data must be resolved to the serotype level.
  • The scale of the data must be large enough.
  • The data must be published monthly.
  • Since the country is small, treating the epidemic dynamics as homogeneous across the country has little impact on prediction[8].

The period and contents of the data used are shown in Table 1 and Figure 2.

The climate data contain many features, but we used four: monthly average rainfall, average temperature, maximum temperature, and minimum temperature[10]. This data is referred to as data 2 below. Many of these features are used in other prediction systems, and because this project requires continuous operation, we restricted ourselves to these features, which are easy to collect[2][11]. For the climate data, we used data from the Ang Mo Kio district, where many clusters of dengue fever have occurred and the population density is high[12][13]. Table 2 and Figure 3 show the period and contents of the data used. Figure 4 shows the climate data after Min-Max normalization. Missing values in the climate data were imputed according to Table 3.
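As a rough illustration of the Min-Max normalization applied to the climate data, the sketch below rescales each feature column to the range 0 to 1. The function name `min_max_normalize` and the sample values are hypothetical and are not taken from data 2.

```python
import numpy as np

def min_max_normalize(x):
    """Scale each column of x to [0, 1]; constant columns map to 0."""
    x = np.asarray(x, dtype=float)
    lo = x.min(axis=0)
    hi = x.max(axis=0)
    # Use a span of 1 for constant columns to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / span

# Hypothetical monthly rows: rainfall, mean temp, max temp, min temp.
climate = np.array([
    [120.0, 27.5, 31.0, 24.5],
    [250.0, 28.0, 32.5, 25.0],
    [60.0, 28.5, 33.0, 25.5],
])
scaled = min_max_normalize(climate)
print(scaled)
```

In practice the minimum and maximum would be taken from the training period only and reused for later months, so that the test data do not leak into the scaling.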

Figure 5 shows the correlation coefficient between the number of infected people and the climate data.
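A correlation coefficient of this kind can be computed as in the following sketch, which assumes the Pearson coefficient (the text does not state which coefficient was used). The arrays are hypothetical values, not the Singapore data.

```python
import numpy as np

# Hypothetical monthly values for one climate feature and the case count.
rainfall = np.array([120.0, 250.0, 60.0, 180.0, 90.0])
cases = np.array([300.0, 520.0, 210.0, 400.0, 260.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two series.
r = np.corrcoef(rainfall, cases)[0, 1]
print(round(r, 3))
```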

About forecasting methods

Lasso and RF were implemented to examine the impact of the four features, a simple NN was implemented to examine the effect of model complexity, and LSTM was implemented to examine the duration over which climate data affect the proportion of infected individuals. These four methods were then compared. The prediction methods used are shown in Table 4.

Training data for each prediction method were created according to Table 5.

Table 6 summarizes the parameters related to learning for each method.

Results

Lasso

We trained Lasso in four configurations, according to whether or not Min-Max normalization was applied to data 1 and to data 2. The model names and the corresponding normalization settings are listed in Table 7.

GridSearch was conducted for alpha. Within the range $10^{-6} \leq \mathrm{alpha} \leq 10^{-1}$, the optimal alpha was one at which all four features were ignored (all coefficients were zero), so GridSearch was conducted again over the same range with the condition that the coefficients of all four features are non-zero. The loss for each of the four serotypes, using the parameters determined by GridSearch, is shown in Table 8. Figure 6 is a graphical representation of Table 8.
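The constrained search over alpha can be sketched as follows. This is a minimal illustration under our own assumptions, not the exact procedure used: it scores each alpha on a hold-out split with MSE, and the helper `search_alpha` and the synthetic data are hypothetical. Only `sklearn.linear_model.Lasso` and `mean_squared_error` are real scikit-learn APIs.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

def search_alpha(X_train, y_train, X_val, y_val, alphas):
    """Return (alpha, model, mse) with the lowest validation MSE among
    alphas whose fitted coefficients are all non-zero, or None."""
    best = None
    for alpha in alphas:
        model = Lasso(alpha=alpha, max_iter=100_000).fit(X_train, y_train)
        if np.any(model.coef_ == 0.0):
            continue  # this alpha drops at least one feature
        mse = mean_squared_error(y_val, model.predict(X_val))
        if best is None or mse < best[2]:
            best = (alpha, model, mse)
    return best

# Synthetic stand-in for four normalized climate features.
rng = np.random.default_rng(0)
X = rng.random((60, 4))
y = X @ np.array([0.5, -0.3, 0.2, 0.1]) + 0.01 * rng.standard_normal(60)

alphas = np.logspace(-6, -1, 11)  # 1e-6 ... 1e-1
best = search_alpha(X[:48], y[:48], X[48:], y[48:], alphas)
print(best[0], best[1].coef_)
```

Skipping alphas that zero out a coefficient keeps all four features in the model, which is what allows their coefficients to be compared across serotypes afterwards.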

Of the four Lasso models, Lasso-3 was adopted because it had the smallest sum of loss functions. The prediction results using Lasso-3 are shown in Figure 7. The coefficients and alpha values for each feature are shown in Table 9. Figure 8 is a graphical representation of Table 9.
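The selection rule, taking the model with the smallest sum of per-serotype losses, can be illustrated as below. The sketch assumes the loss is MSE, and the model names and numbers are made up for the example.

```python
import numpy as np

def summed_mse(y_true, y_pred):
    """Sum over serotype columns of the per-column mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_serotype = ((y_true - y_pred) ** 2).mean(axis=0)
    return per_serotype.sum()

# Rows are months, columns are the four serotype proportions.
y_true = np.array([[0.5, 0.3, 0.1, 0.1],
                   [0.4, 0.4, 0.1, 0.1]])
predictions = {
    "model-a": np.array([[0.5, 0.3, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]]),
    "model-b": np.array([[0.2, 0.2, 0.3, 0.3], [0.2, 0.2, 0.3, 0.3]]),
}
scores = {name: summed_mse(y_true, p) for name, p in predictions.items()}
best_name = min(scores, key=scores.get)
print(best_name, scores)
```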

RandomForest

As with Lasso, we trained RF in four configurations, according to whether or not Min-Max normalization was applied to data 1 and to data 2. The model names and the corresponding normalization settings are listed in Table 10.

GridSearch was performed for each parameter in the range shown in Table 11.
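A grid search of this kind could look like the following sketch, using scikit-learn's `GridSearchCV` with `RandomForestRegressor`. The parameter ranges here are small placeholders rather than the ranges in Table 11, the data are synthetic, and the 3-fold cross-validation is our assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for four climate features and one serotype target.
rng = np.random.default_rng(0)
X = rng.random((48, 4))
y = 0.6 * X[:, 0] + 0.2 * X[:, 1] + 0.05 * rng.standard_normal(48)

# Placeholder ranges; the real search space is the one in Table 11.
param_grid = {"n_estimators": [10, 30], "max_depth": [2, 4]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    cv=3,
)
search.fit(X, y)

# Impurity-based importance of each feature in the best model.
importances = search.best_estimator_.feature_importances_
print(search.best_params_, importances)
```

For monthly time series, a splitter that respects time order, such as `sklearn.model_selection.TimeSeriesSplit`, may be a better choice for `cv` than plain k-fold.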

The loss of each RF model for each of the four serotypes, using the parameters determined by GridSearch, is shown in Table 12. Figure 9 is a graphical representation of Table 13.

Of the four RF models, RF-4 had the smallest sum of loss functions. The prediction results using RF-4 are shown in Figure 10. The parameters used in training are shown in Table 14.

The importance of each feature in RF-4 is also shown in Table 14 and Figure 11.

Simple-NN

For training data, data 1 was divided by 100, and data 2 was Min-Max normalized. The number of nodes in the hidden layer of each NN is listed in Table 15.

These models were trained with a batch size of 3; the resulting MSE for each model is shown in Figure 12, Figure 13, and Figure 14, respectively.
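The text does not name the framework used for the NN, so the sketch below uses scikit-learn's `MLPRegressor` as a stand-in for a simple NN with one hidden layer and a batch size of 3. The hidden layer size of 8, the iteration count, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((36, 4))                    # stand-in for normalized data 2
raw = rng.random((36, 4))
y = raw / raw.sum(axis=1, keepdims=True)   # serotype proportions in [0, 1]

# One hidden layer; the number of nodes is the quantity varied between NNs.
model = MLPRegressor(hidden_layer_sizes=(8,), batch_size=3,
                     max_iter=200, random_state=0)
model.fit(X[:30], y[:30])

# Per-serotype MSE on the held-out months.
val_pred = model.predict(X[30:])
val_mse = ((val_pred - y[30:]) ** 2).mean(axis=0)
print(val_mse)
```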

Table 15 shows the val-loss of each serotype at the epoch where the sum of the val-losses over the four serotypes was smallest.

The loss function on the test data is shown in Figure 16.

Among the three NNs, NN-1 was adopted because it had the smallest sum of loss functions. The prediction results using NN-1 are shown in Figure 16.
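Picking the epoch by the summed val-loss can be sketched as follows, assuming the per-epoch val-loss of each serotype has been recorded in an array of shape (epochs, serotypes). The helper `best_epoch` and the numbers are hypothetical.

```python
import numpy as np

def best_epoch(val_loss):
    """Return (epoch index, per-serotype val-loss) for the epoch whose
    val-loss summed over serotypes is smallest."""
    val_loss = np.asarray(val_loss, dtype=float)
    idx = int(val_loss.sum(axis=1).argmin())
    return idx, val_loss[idx]

# Hypothetical history: 3 epochs x 4 serotypes.
history = np.array([
    [0.050, 0.040, 0.030, 0.020],
    [0.030, 0.035, 0.020, 0.015],
    [0.032, 0.030, 0.025, 0.018],
])
epoch, losses = best_epoch(history)
print(epoch, losses)
```

Note that the selected epoch need not be the best one for every individual serotype: in this example the second serotype has a lower val-loss at the last epoch.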

LSTM

For training data, data 1 was divided by 100, and data 2 was Min-Max normalized. The LSTM models corresponding to each bind_size are listed in Table 17.

These models were trained with a batch size of 2; the resulting MSE for each model is shown in Figure 17, Figure 18, Figure 19, and Figure 20, respectively.
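One plausible reading of bind_size is the number of consecutive months of climate data bundled into a single LSTM input sequence. Under that assumption, and assuming the target is the prevalence in the last month of each window, the training samples could be built as in this sketch. The helper `make_windows` and the toy arrays are hypothetical.

```python
import numpy as np

def make_windows(climate, prevalence, bind_size):
    """Build LSTM inputs of shape (samples, bind_size, features).
    Sample i covers months i .. i + bind_size - 1 and is paired with
    the prevalence of the last month in that window (an assumption)."""
    climate = np.asarray(climate, dtype=float)
    prevalence = np.asarray(prevalence, dtype=float)
    n = len(climate) - bind_size + 1
    X = np.stack([climate[i:i + bind_size] for i in range(n)])
    y = prevalence[bind_size - 1:]
    return X, y

climate = np.arange(24, dtype=float).reshape(6, 4)  # 6 months x 4 features
prevalence = np.full((6, 4), 25.0) / 100.0          # percentages divided by 100
X, y = make_windows(climate, prevalence, bind_size=3)
print(X.shape, y.shape)
```

A larger bind_size gives the model a longer climate history per sample but leaves fewer samples, since the first bind_size - 1 months cannot form a full window.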

Table 18 shows the val-loss of each serotype at the epoch where the sum of the val-losses over the four serotypes was smallest.

The loss function on the test data is shown in Figure 21.

Among the four LSTMs, LSTM-4 was adopted because it had the smallest sum of loss functions. The prediction results using LSTM-4 are shown in Figure 22.

Discussion

None of the models were able to learn the 2021 upward trend of DENV-3. This may be because the trend is driven not by climate but by epidemiology: few people in Singapore have antibodies to DENV-3, since the country has had few past DENV-3 outbreaks[19].

The Lasso and RF results reveal which features strongly influence each serotype. Comparing Figures 8 and 11, we can see that the influential features differ from serotype to serotype. This is consistent with studies showing that temperature and rainfall (humidity) affect the range of mosquito behavior and the incubation period of the virus differently for each serotype[20][21][22][23].

For the four LSTMs, the longer the period of weather features supplied, the smaller the loss function. Although some studies suggest that the prevalent serotype begins to emerge 2-6 months in advance, the LSTM results do not show whether an even longer period would improve prediction accuracy, and further research is needed[24].

Synergies with Wet lab/Hardware lab for this project

Once the test kits developed by the Wet lab and Hardware lab become widely used, we expect to be able to collect serotype-specific data on a more regular basis. The forecasting system developed by our Dry lab will then be able to make more accurate forecasts at a higher frequency. It will also become possible to analyze infection trends at a finer time resolution, which may lead to epidemiological discoveries: for example, how long a change in genotype affects the trend of infection, how the trend of infection moves from one city to another, and how the four serotypes interact with each other. Indeed, it has been shown that genotype variation had a strong influence on past pandemics and that subtypes spread throughout the world[25][26][27].

If our test kits allow data to be collected continuously, they may not only improve our understanding of the epidemiological transmission dynamics described above, but may also reduce the risk of dengue fever more directly. For example, the risk of severe disease differs between serotypes and also changes with genotype variation[25][28][29]. We believe that by continuously collecting data from test kits and monitoring trends in serotypes and genotypes, we will be able to warn the public in advance. Ultimately, we expect this to have a positive impact on vaccine uptake for dengue fever.

List of Abbreviations

  • Lasso : least absolute shrinkage and selection operator
  • RF : Random Forest
  • NN : Neural Network
  • LSTM : Long Short Term Memory
  • MSE : Mean Squared Error

References

[1] WHO. Dengue guidelines for diagnosis, treatment, prevention and control : new edition
[2] Baharom M, Ahmad N, Hod R, Abdul Manaf MR. Dengue Early Warning System as Outbreak Prediction Tool: A Systematic Review
[3] Steven T. Stoddard, Helen J. Wearing, Robert C. Reiner Jr, Amy C. Morrison, Helvio Astete, Stalin Vilcarromero, Carlos Alvarez, Cesar Ramal-Asayag, Moises Sihuincha, Claudio Rocha, Eric S. Halsey, Thomas W. Scott, Tadeusz J. Kochel, Brett M. Forshey. Long-Term and Seasonal Dynamics of Dengue in Iquitos, Peru
[4] Eric S. Halsey, Morgan A. Marks, Eduardo Gotuzzo, Victor Fiestas, Luis Suarez, Jorge Vargas, Nicolas Aguayo, Cesar Madrid, Carlos Vimos, Tadeusz J. Kochel, V. Alberto Laguna-Torres. Correlation of Serotype-Specific Dengue Virus Infection with Clinical Manifestations
[5] Brett M. Forshey, Carolina Guevara, V. Alberto Laguna-Torres, Manuel Cespedes, Jorge Vargas, Alberto Gianella, Efrain Vallejo, César Madrid, Nicolas Aguayo, Eduardo Gotuzzo, Victor Suarez, Ana Maria Morales, Luis Beingolea, Nora Reyes, Juan Perez, Monica Negrete, Claudio Rocha, Amy C. Morrison, Kevin L. Russell, Patrick J. Blair, James G. Olson, Tadeusz J. Kochel, for the NMRCD Febrile Surveillance Working Group. Arboviral Etiologies of Acute Febrile Illnesses in Western South America, 2000–2007
[6] MINISTRY OF HEALTH. Communicable Diseases Surveillance SINGAPORE 2018
[7] National Environment Agency. Quarterly Dengue Surveillance Data
[8] Tibutius T. P. Jayadas, Thirunavukarasu Kumanan, Laksiri Gomes, Chandima Jeewandara, Gathsaurie N. Malavige, Diyanath Ranasinghe, Ramesh S. Jadi, Ranjan Ramasamy, Sinnathamby N. Surendran. Regional Variation in Dengue Virus Serotypes in Sri Lanka and Its Clinical and Epidemiological Relevance
[9] National Environment Agency. Surveillance and Epidemiology Programme
[10] Meteorological Service Singapore. Historical Daily Records
[11] Michael J, Karyn Apfeldorf. An open challenge to advance probabilistic forecasting for dengue epidemics
[12] National Environment Agency. Dengue Clusters
[13] Singapore Department of Statistics.
[14] Meteorological Service Singapore. Climate of Singapore
[15] Robert Tibshirani. Regression Shrinkage and Selection via the Lasso
[16] S Hochreiter, J Schmidhuber. Long Short-Term Memory
[17] scikit learn org. sklearn.linear_model.Lasso
[18] scikit learn org. sklearn.ensemble.RandomForestRegressor
[19] Jacqueline Teoh. The National Centre for Infectious Diseases. Epidemic Dengue in Singapore During COVID-19 Pandemic
[20] Wayne A. Rowley, Charles L. Graham. The effect of temperature and relative humidity on the flight performance of female Aedes aegypti
[21] A Rohani, Y C Wong, I Zamre, H L Lee, M N Zurainee. The effect of extrinsic incubation temperature on development of dengue serotype 2 and 4 viruses in Aedes aegypti (L.)
[22] Rebecca C. Christofferson, Christopher N. Mores. Potential for Extrinsic Incubation Temperature to Alter Interplay between Transmission Potential and Mortality of Dengue-Infected Aedes aegypti
[23] Fang-Zhen Xiao, Yi Zhang, Yan-Qin Deng, Si He, Han-Guo Xie, Xiao-Nong Zhou, Yan-Sheng Yan. The effect of temperature on the extrinsic incubation period and infection rate of dengue virus serotype 2 infection in Aedes albopictus
[24] Jayanthi Rajarethinam, Li-Wei Ang, Janet Ong, Joyce Ycasas, Hapuarachchige Chanditha Hapuarachchi, Grace Yap, Chee-Seng Chong, Yee-Ling Lai, Jeffery Cutter, Derek Ho, Vernon Lee, Lee-Ching Ng. Dengue in Singapore from 2004 to 2016: Cyclical Epidemic Patterns Dominated by Serotypes 1 and 2
[25] Hasitha A. Tissera, Eng Eong Ooi, Duane J. Gubler, Ying Tan, Barathy Logendra, Wahala M.P.B. Wahala, Aravinda M. de Silva, M.R. Nihal Abeysinghe, Paba Palihawadana, Sunethra Gunasena, Clarence C. Tam, Ananda Amarasinghe, G. William Letson, Harold S. Margolis, Aruna Dharshan De Silva. New Dengue Virus Type 1 Genotype in Colombo, Sri Lanka
[26] Mya Myat Ngwe Tun, Rohitha Muthugala, Lakmali Rajamanthri, Takeshi Nabeshima, Corazon C. Buerano, Kouichi Morita. Emergence of Genotype I of Dengue Virus Serotype 3 during a Severe Dengue Epidemic in Sri Lanka in 2017
[27] William B. Messer, Duane J. Gubler, Eva Harris, Kamalanayani Sivananthan, Aravinda M. de Silva. Emergence and Global Spread of a Dengue Serotype 3, Subtype III Virus
[28] M Gittens-St Hilaire, Nicole Clarke-Greenidge. An analysis of the subtypes of dengue fever infections in Barbados 2003–2007 by reverse transcriptase polymerase chain reaction
[29] Ananda Nisalak, Timothy P Endy, Suchitra Nimmannitya, Siripen Kalayanarooj, Usa Thisayakorn, Robert M Scott, Donald S Burke, Charles H Hoke, Bruce L Innis, David W Vaughn. Serotype-specific dengue virus circulation and dengue disease in Bangkok, Thailand from 1973 to 1999