Dietary preferences and diabetic risk in China: A large-scale nationwide Internet data-based study
Funding information: Ministry of Science and Technology of the People's Republic of China, Grant/Award Numbers: 2016YFC0901200, 2016YFC1304904, 2016YFC1305600, 2017YFC1310700, 2018YFC1311800; National Natural Science Foundation of China, Grant/Award Numbers: 81500610, 81500660, 81621061, 81622011; Shanghai Municipal Commission of Health and Family Planning, Grant/Award Number: 20174Y0014; the Program for Professor of Special Appointment (Younger Eastern Scholar) at Shanghai Institutions of Higher Learning, Grant/Award Number: QD2016007
Abstract
Background
An unhealthy diet is an important risk factor for diabetes, which is a major public health problem in China. Internet tools provide large-scale, passively collected data that reveal people's dietary preferences and their relationship with diabetes risk.
Methods
Dietary preference labels for 212 341 708 individuals were created from Internet data from online search and shopping software. Metabolic data from the 2010 China Noncommunicable Disease Surveillance, which had 98 658 participants, were used to estimate the relation between the geographical distribution of dietary preferences and diabetes risk.
Results
Chinese dietary preferences had different geographical distributions, which were related to the local climate and consumption level. The proportion distribution of fried food preference was significantly positively correlated with diabetes prevalence, hypertension prevalence, and body mass index (BMI). Similarly, the proportion distribution of grilled food preference was significantly positively correlated with the prevalence of diabetes and hypertension. In contrast, the proportion distribution of spicy food preference was negatively correlated with diabetes prevalence. The proportion distribution of sweet food preference was positively related to diabetes prevalence. When dietary preference data were used to predict regional diabetes prevalence, hypertension prevalence, and BMI, the average errors (95% CI) between the three paired predicted and observed values were 9.8% (6.9%-12.7%), 7.5% (5.0%-10.0%), and 1.6% (1.2%-2.0%), respectively.
Conclusions
Fried food, grilled food, and sweet food preferences were positively related to diabetes risk whereas spicy food preference was negatively correlated with diabetes risk. Dietary preferences based on passively collected Internet data could be used to predict regional prevalence of diabetes, hypertension, and BMI and showed good value for public health monitoring.
1 INTRODUCTION
Reducing the prevalence of diabetes is one of the major public health challenges in China. According to two recent nationwide surveys, the prevalence of diabetes in Chinese adults has reached 10%, indicating that more than 114 million adults have diabetes.1, 2 An unhealthy diet is an important risk factor not only for diabetes but also for other diabetes risk factors such as hypertension and obesity. Dietary types and preferences can affect energy intake and metabolism. Previous studies in Europe and the United States have shown that fried food consumption is positively associated with several metabolic disorders, including hypertension,3 type 2 diabetes,4 and obesity.5-7 Meanwhile, consumption of fresh chili pepper, which has a spicy taste, has been associated with lower mortality from diabetes.8 Because of China's large population, complex climate, and cultural diversity, Chinese people have a variety of dietary preferences, many of which differ from those in western countries. However, the distribution of dietary preferences and its relationship with diabetes risk in Chinese people remain uncertain.
The Internet has developed rapidly in recent years and has gradually become linked to all aspects of people's lives. It has spawned several sources of big data, such as Google, Facebook, Twitter, Instagram, Tumblr, and Amazon. These online tools, being fully digital, provide a wealth of passively collected data that can reveal individual lifestyle behaviors and may be mined for public health purposes such as disease surveillance and risk factor assessment.9 They can collect timelier, more widely available, and more cost-effective data that yield important insights into current disease trends.10 Taking advantage of these public data, recent studies have developed models for early warning of influenza activity11 and respiratory illnesses12 and for prediction of obesity prevalence.13 As more and more Chinese people become accustomed to using the Internet, online meal searching and ordering are becoming very common, especially among young and middle-aged people, so it is possible to understand and analyze Chinese dietary behavior through Internet data. The data collected by a search engine and online meal ordering software allowed us to describe the distribution of Chinese dietary preferences from a big-data perspective. It has been documented that such analyses can help provide a whole picture of nationwide features and their correlations with disease risk.9, 13
In this study, we used the large-scale nationwide Internet data to analyze the relation between Chinese dietary preferences and diabetes risk.
2 METHODS
2.1 Dietary preference labels from Internet data
Dietary preference labels were created from descriptions of user behavior in Baidu Inc.'s Internet data. This company has the biggest search engine14 and one of the top three online meal ordering apps in China. In this study, the Internet data on user behavior had two major sources. The first was the user query logs and click logs of food ordering/shopping apps, which recorded users' choices when they searched for, bought, or ordered food on those apps. The other was the user query logs and click logs of the search engine, which recorded user behavior when users searched for cooking recipes or bought food through the search engine and its links. Data from both sources came from computers and mobile phones; data from different devices under the same internal ID were treated as data for a single user. We integrated all these sources of user data and used a combination of rule-based and machine learning methods to remove spam data, such as advertisements, commercial promotional information, and political statements. Given the quality and comprehensiveness of the data, we selected the data for the whole year of 2016 for analysis.
In this study, we used labels for six dietary preferences: fried food, grilled food, instant-boiled food, spicy food, sweet food, and tongue-numbing food. Frying, grilling, and instant boiling are three of the most popular cooking methods, and spicy, sweet, and tongue-numbing are the three taste preferences that show apparent geographic variation in China and can be easily categorized by food name. To categorize the behaviors in user query and click logs into the six dietary preference groups, we built an ensemble of dietary preference classification models (the methods for building the classification models and the definitions of dietary preferences are in the Supporting Information). We then calculated, for each user, the aggregate number of clicks and queries that fell within each of the six dietary preference groups. If the proportion of a user's clicks/queries that fell in one particular dietary preference group, relative to all of his/her clicks/queries, exceeded the threshold (25% in this study), we labeled that group as a dietary preference of that user.
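The labeling rule described above can be illustrated with a minimal sketch. It is not the study's pipeline: the function name `label_preferences`, the input format (a dictionary of per-group counts plus the user's total number of clicks/queries), and the strict "greater than" comparison are assumptions made for illustration. Only the six preference groups and the 25% threshold come from the text.

```python
# Six dietary preference groups named in the text.
PREFERENCE_GROUPS = (
    "fried", "grilled", "instant_boiled", "spicy", "sweet", "tongue_numbing",
)
THRESHOLD = 0.25  # the 25% threshold stated in the text


def label_preferences(group_counts, total_events, threshold=THRESHOLD):
    """Return the groups whose share of a user's events exceeds the threshold.

    group_counts: dict mapping group name -> clicks/queries classified into it
    total_events: all of the user's clicks/queries (hypothetical input)
    """
    if total_events <= 0:
        return []  # no activity, so no label can be assigned
    return [
        group for group in PREFERENCE_GROUPS
        if group_counts.get(group, 0) / total_events > threshold
    ]


# Hypothetical user: 30% spicy, 25% fried, 10% sweet out of 100 events.
print(label_preferences({"spicy": 30, "fried": 25, "sweet": 10}, total_events=100))
```

Under these assumptions a user can carry several labels at once, and a share of exactly 25% does not qualify because the text says the proportion must exceed the threshold.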
Data statistics and analysis conformed to the privacy protection declaration signed with the users of the software (Supporting Information). Only the count data of dietary preference labels in each region were used, and no personal information was involved. No data exchange involved personal privacy information, and ID numbers were encrypted.
2.2 Metabolic data source and definitions
Data on diabetes prevalence, hypertension prevalence, body mass index (BMI), and glucose levels were taken from the 2010 China Noncommunicable Disease Surveillance,2 which was designed to select a nationally representative sample of the general population, covering major geographic areas of all 31 provinces in mainland China. The total number of study participants was 98 658. Diabetes was defined according to the American Diabetes Association 2010 criteria or self-reported previous diagnosis. Hypertension was defined as systolic blood pressure greater than 139 mm Hg or diastolic blood pressure greater than 89 mm Hg. The BMI was defined as the body weight (in kilograms) divided by the square of the body height (in meters).
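The hypertension and BMI definitions above translate directly into code. The following is a small sketch; the function names are hypothetical, and the diabetes definition is omitted because the American Diabetes Association 2010 criteria are not spelled out in the text.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by squared height in meters."""
    return weight_kg / height_m ** 2


def has_hypertension(systolic_mmhg, diastolic_mmhg):
    """Hypertension as defined in the text: SBP > 139 mm Hg or DBP > 89 mm Hg."""
    return systolic_mmhg > 139 or diastolic_mmhg > 89


# Hypothetical participant: 70 kg, 1.75 m, blood pressure 142/85 mm Hg.
print(round(bmi(70, 1.75), 1), has_hypertension(142, 85))
```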
2.3 The geographical distribution analysis
In this study, all proportion distributions of dietary preferences were weighted to represent the overall Chinese adult population aged 18 years or older. The geographical distribution data obtained from the Internet were calculated for 334 prefecture-level administrative areas nationwide and then pooled into provincial distributions. The data were divided into 372 subgroups by province, gender, and age group (31 provinces × 2 genders × 6 age groups), and the proportion of each dietary preference was calculated within each subgroup. Every subgroup's proportion was weighted according to the age and sex composition of the 2010 China population census, and the regional distribution of dietary preferences was then obtained by summing the weighted subgroup data. Metabolic status data, including the prevalence of diabetes and hypertension, BMI, fasting plasma glucose (FPG), and postprandial plasma glucose (PPG), were divided into subgroups and weighted by age and sex composition in the same way and then analyzed together with the proportion distributions of dietary preferences.
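The weighting step can be sketched as a direct standardization within one province. This is an illustration under assumptions, not the study's code: the function name `standardized_proportion`, the dictionary layout, and the toy numbers are hypothetical, and the weights are assumed to be census population shares of each age-sex stratum normalized to sum to 1.

```python
def standardized_proportion(stratum_proportions, census_counts):
    """Weight stratum-specific proportions by census population shares.

    stratum_proportions: dict (sex, age_group) -> proportion with a preference
    census_counts: dict (sex, age_group) -> census population of that stratum
    """
    total = sum(census_counts[key] for key in stratum_proportions)
    return sum(
        proportion * census_counts[key] / total
        for key, proportion in stratum_proportions.items()
    )


# Toy example with two strata; the numbers are illustrative only.
proportions = {("men", "18-24"): 0.20, ("women", "18-24"): 0.40}
census = {("men", "18-24"): 600, ("women", "18-24"): 400}
print(standardized_proportion(proportions, census))
```

In the toy example the Internet sample could contain far more women than men without affecting the result, because each stratum contributes according to its census share rather than its share of the sample.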
2.4 Economy and climate classification
We used provincial consumption level data, obtained from the China Statistical Yearbook of the National Bureau of Statistics, to represent the economic level. The climate analysis used average temperatures in January and July, which are commonly used to represent winter and summer temperatures in a region. The monthly temperature data were obtained from the monthly data set of standard ground climate values of the China International Exchange Stations (1971-2000) on the website of the China Meteorological Administration Data Center. For each province, temperature data from the central city with the largest share of the population were used to represent the province. The average latitude and altitude of each province were used for geographic analysis. We used tertiles to define high, medium, and low levels of economic status, climate status, latitude, and altitude.
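As a rough illustration of the tertile classification, the sketch below splits a list of provincial values into three rank-based groups of near-equal size. The function name `tertile_labels` and the tie handling (ties keep their input order) are assumptions; the text does not say how the tertile cut points were computed.

```python
def tertile_labels(values):
    """Assign 'low', 'medium', or 'high' to each value by rank-based tertiles."""
    n = len(values)
    # Indices of the values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: values[i])
    labels = [None] * n
    for rank, index in enumerate(order):
        # Ranks in the first third map to 0, the second to 1, the last to 2.
        labels[index] = ("low", "medium", "high")[min(3 * rank // n, 2)]
    return labels


# Hypothetical January temperatures (degrees Celsius) for six provinces.
print(tertile_labels([5, 1, 9, 3, 7, 11]))
```

With 31 provinces this rule yields groups of 11, 10, and 10.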
2.5 Statistical analysis
Pearson's correlation was used to analyze the correlations between the proportions of dietary preferences and the values of metabolic state at the provincial level. ANOVA was used to examine the effect of geographical, climatic, and economic factors on dietary preferences. A generalized linear model was used to fit the observed values of diabetes prevalence, hypertension prevalence, and BMI on the basis of age, gender, and dietary preferences. Paired t tests and intraclass correlation coefficients were used to evaluate the difference between predicted and observed values. All P values were 2-tailed, and P < .05 was considered statistically significant. All statistical analyses were conducted using the SAS system, version 9.3.
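For readers who want to reproduce the provincial correlation analysis outside SAS, a minimal pure-Python sketch of Pearson's r and its t statistic follows. The function names are hypothetical. The t statistic uses the standard relation t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom; converting it to a P value requires a t distribution table or a statistics library, which is left out here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# With 31 provinces, a coefficient of 0.581 gives a t statistic near 3.84.
print(round(t_statistic(0.581, 31), 2))
```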
3 RESULTS
3.1 The geographical distribution of Chinese dietary preferences
In this study, Internet users who had dietary preference labels and were aged 18 years or older were selected as study participants. The total number of study participants was 212 341 708, which accounted for more than a sixth of the adult population in China. The majority of participants were younger than 45 years. The proportion of men was 53.9% (Table 1).
Age subgroup | Total | Men | Women |
---|---|---|---|
18-24 | 54 900 778 (25.9%) | 29 823 449 | 25 077 329 |
25-34 | 99 078 036 (46.7%) | 53 835 994 | 45 242 042 |
35-44 | 47 988 376 (22.6%) | 25 412 642 | 22 575 734 |
45-54 | 8 069 157 (3.8%) | 4 094 010 | 3 975 147 |
55-64 | 1 874 889 (0.9%) | 1 022 002 | 852 887 |
≥65 | 430 472 (0.2%) | 240 165 | 190 307 |
- a Data are numbers and percentages (%).
The data showed that fried food preference was mainly distributed in and around the capital Beijing and in northeast China (Figure 1A). The overall distribution was high in the east and low in the west. The regions with the highest proportions of grilled food preference were Beijing and the northeast; Hainan province in the south also had a high proportion (Figure 1B). Spicy food preference was mainly located in the region centered on Sichuan province in West China. In North and Central China, there were also some regions with a high proportion (Figure 1C). The region with the highest proportion of tongue-numbing food preference was also centered on Sichuan province, and this preference was widely distributed in northwest and northeast China (Figure 1D). Most people who preferred instant-boiled food were distributed in hot-pot culture areas such as Beijing and Sichuan (Figure 1E). Sweet food preference was mainly distributed in the southeast coastal areas and in large modern cities in the north such as Beijing and Tianjin (Figure 1F). The provincial proportions of dietary preferences are shown in Supporting Information Table S1. The dietary preference data derived from the Internet correlated significantly with epidemiological survey data (Supporting Information Figure S1).

3.2 Stratified analysis of different dietary preferences
Participants in different age groups had different dietary preferences. In general, these dietary preferences were more common among young people, and the proportions with each preference gradually declined with increasing age (Figure 2A). Instant-boiled food and grilled food preferences decreased more in older people than the other preferences did. The sexes also showed different preferences. Fried food and sweet food were more popular among women: the proportions in women were 33% and 35% higher, respectively, than in men (Figure 2B). Spicy food preference was slightly more common in women, whereas the other preferences did not differ significantly between men and women.

Combining these results with the distribution map of Chinese dietary preferences, we examined the effects of geographical, climatic, and economic factors on food preferences. Residents of higher latitude areas preferred fried food (Figure 2C, P = .003). The proportions of grilled food and sweet food preference decreased in higher altitude areas, whereas the proportions of spicy food and tongue-numbing food preference increased (Figure 2C, P = .042, P < .001, P = .041, and P = .002, respectively). We analyzed the effect of temperature in the coldest and the warmest month on dietary preferences. A higher proportion of residents preferred fried food in areas with a lower average temperature in January (Figure 2C, P = .002). Meanwhile, in areas with a higher average temperature in July, fewer people preferred spicy food and tongue-numbing food (Figure 2C, P = .043 and P = .001) but more people preferred sweet food (Figure 2C, P = .004). In addition, as the economic consumption level increased, the proportions of grilled food and sweet food preference increased (Figure 2C, P = .033 and P = .009).
3.3 Correlation analysis of the geographical distribution of dietary preferences and diabetes risk
We studied the correlations between the proportions of dietary preferences and the values of metabolic state at the provincial level (Table 2). The proportion of fried food preference was significantly positively correlated with diabetes prevalence, hypertension prevalence, and BMI (r = 0.581, r = 0.715, and r = 0.667, respectively; all P < .001). It was also positively correlated with FPG and PPG (r = 0.355 and r = 0.360, respectively; both P < .05). Similarly, the proportion of grilled food preference was significantly positively correlated with diabetes prevalence, hypertension prevalence, BMI, FPG, and PPG (r = 0.467, r = 0.435, r = 0.391, r = 0.422, and r = 0.372, respectively; all P < .05). In contrast, the proportion of spicy food preference was negatively correlated with diabetes prevalence (r = −0.393, P < .05) and with FPG and PPG (r = −0.653, P < .001, and r = −0.425, P < .05, respectively). The proportion of sweet food preference was positively related to diabetes prevalence and FPG (r = 0.381 and r = 0.398, respectively; both P < .05). Tongue-numbing food preference was negatively correlated with FPG (r = −0.614, P < .001), as was instant-boiled food preference (r = −0.395, P < .05). Most of the significant correlations also held for the prevalence of the different abnormal blood glucose states defined by prediabetes criteria (Supporting Information Table S2).
Food preferences | Diabetes prevalence | Hypertension prevalence | BMI | FPG | PPG |
---|---|---|---|---|---|
Fried food | .581*** | .715*** | .667*** | .355* | .360* |
Grilled food | .467** | .435* | .391* | .422* | .372* |
Spicy food | −.393* | .039 | −.036 | −.653*** | −.425* |
Tongue-numbing food | −.295 | .132 | .100 | −.614*** | −.348 |
Instant-boiled food | −.288 | −.276 | −.135 | −.395* | −.190 |
Sweet food | .381* | −.289 | −.156 | .398* | .259 |
- Note. Data are correlation coefficients after age and gender standardization. N = 31.
- Abbreviations: BMI, body mass index; FPG, fasting plasma glucose; PPG, postprandial plasma glucose.
- * P < .05.
- ** P < .01.
- *** P < .001.
We used the regional proportions of the six dietary preferences, plus the average age and the sex ratio, to predict the prevalence of diabetes, the prevalence of hypertension, and the mean BMI in each area with linear models (Figure 3). The coefficients of the linear models for predicting these three values were .923, .908, and .863, respectively. There was no significant difference between the predicted values and the observed values (P values of the paired t tests were .939, .995, and .992, respectively), and the intraclass correlation coefficients of the three paired values were .92, .91, and .86, respectively. The average errors between the three paired predicted and observed values were 9.8% (95% CI 6.9% to 12.7%), 7.5% (5.0% to 10.0%), and 1.6% (1.2% to 2.0%), respectively. After dietary preference data were added to the predictive models, both the average errors between predicted and observed values and the model residuals decreased significantly (Supporting Information Figure S2). These results show that dietary preferences fit the diabetes-related metabolic values very well and suggest strong correlations.
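The text does not give the formula behind the average error and its confidence interval. One plausible reading, shown here only as a sketch, is the mean absolute relative error across regions with a normal-approximation 95% CI. The function name `mean_relative_error`, the error definition |predicted − observed| / observed, the z value of 1.96, and the toy numbers are all assumptions.

```python
import math


def mean_relative_error(predicted, observed, z=1.96):
    """Mean absolute relative error with a normal-approximation CI.

    Returns (mean, lower, upper). The error definition is an assumption:
    |predicted - observed| / observed for each region.
    """
    errors = [abs(p - o) / o for p, o in zip(predicted, observed)]
    n = len(errors)
    mean = sum(errors) / n
    variance = sum((e - mean) ** 2 for e in errors) / (n - 1)  # sample variance
    half_width = z * math.sqrt(variance / n)  # z times the standard error
    return mean, mean - half_width, mean + half_width


# Toy prevalence values (%) for three regions; illustrative only.
mean, lower, upper = mean_relative_error([11.0, 9.0, 10.5], [10.0, 10.0, 10.0])
print(f"{mean:.1%} ({lower:.1%} to {upper:.1%})")
```

With only 31 provinces, a t-based interval would be slightly wider than this normal approximation.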

4 DISCUSSION
In this study, we used large-scale nationwide Internet data to show the geographical distribution of Chinese dietary preferences. Residents of higher latitudes preferred fried food. Residents of higher altitudes preferred spicy food and tongue-numbing food but were less likely to prefer grilled food and sweet food. Grilled food and sweet food were more popular in areas with a high consumption level. People who live in areas with cold winters preferred fried food, whereas those who live in areas with hot summers were less likely to prefer spicy food and more likely to prefer sweet food. Further, we found that fried food and grilled food preferences were positively related to diabetes prevalence, hypertension prevalence, and BMI and that spicy food preference was negatively correlated with diabetes prevalence. Sweet food preference was also positively related to diabetes prevalence. Dietary preference data could be used to predict regional values of diabetes prevalence, hypertension prevalence, and BMI accurately. This work suggests the potential of online data for diabetes risk factor surveillance and for prediction of disease prevalence in populations across regions.
Integrating big-data sources into the practice of public health surveillance is vital for this century's epidemiology.15, 16 Big data from the Internet can be used to predict lifestyle behavior and health-related data for epidemiologic needs. To our knowledge, our study is the first to use large-scale, nationwide, passively collected online data to draw a geographical distribution map of Chinese dietary preferences and to link these online data to real-world health outcomes. The results showed that a variety of dietary preferences are associated with diabetes-related metabolic states. This method may avoid statistical biases such as recall bias and reporting bias. Further, we used regional dietary preference proportions, plus age and sex data, to predict three diabetes-related metabolic states by area: diabetes prevalence, hypertension prevalence, and BMI. The average errors of the three predictive models were all less than 10%, and that of the BMI model was less than 2%, suggesting that the three predictive models have good practical value. These new online data analyses and predictive models have the potential to be tools for public health monitoring and for the early warning of and intervention in chronic diseases.
In addition, the results supported the relationship between dietary preferences and diabetes. The frying process modifies both the food and the frying oil. It may reduce water content, increase energy density, change fatty acid composition, and, through oxidation and hydrogenation, degrade the frying oil, especially when the oil is reused.5 Frying also makes food crunchy and aromatic and improves palatability, which may lead to excess energy intake.6 Accordingly, more frequent consumption of fried foods has been linked to a higher risk of overweight and obesity in cohort studies.6, 7 Another study showed that frequent fried-food consumption was significantly associated with a higher incidence of type 2 diabetes in two large prospective cohorts in the United States.4 This association was mediated in part by BMI and hypertension.4 Our study found that fried food preference was positively related to the prevalence of diabetes and hypertension and to BMI, which is consistent with the previous studies. The same correlations were found for grilled food, probably because both fried and grilled foods are cooked at high temperature. Food cooked at high temperature contains several harmful substances, including trans-fatty acids (TFAs) and advanced glycation end-products (AGEs), which have been documented to promote metabolic diseases.17, 18 Epidemiological studies have reported that intake of sucrose and fructose may be one of the underlying causes of metabolic diseases.19 This study showed a similar result in that sweet food preference was positively correlated with diabetes prevalence and FPG.
Capsaicin levels have been shown to be associated with decreased FPG levels and the maintenance of insulin levels.20 The same association has been reported for PPG.21 A large prospective cohort study showed that increasing the amount of fresh pepper in the diet could reduce diabetes-caused mortality.8 Our study found that diabetes prevalence was negatively related to the preference for spicy food, which contains a lot of capsaicin. This result supports the hypothesis that spicy food could improve glucose metabolism. Tongue-numbing food and instant-boiled food are two distinctive Chinese dietary styles. Tongue-numbing food preference is similar to spicy food preference in taste, and it also showed a strong negative correlation with FPG in our study. Instant-boiled food preference was related to hot-pot culture and was negatively related to FPG, which may reflect its relatively low cooking temperature, which produces fewer harmful substances.17, 18 Most previous dietary studies have focused on the western diet and on categories such as the Mediterranean diet and fast food. Our study provided new evidence from a big-data perspective and supported the correlation between dietary preferences and diabetes-related metabolic states. It also suggested that, even though the Chinese diet and the western diet differ in many ways, similar cooking styles might have the same effects on diabetes-related metabolic states in Chinese adults.
Our study showed that the distribution of dietary preferences differed by geographical, climatic, and economic factors. These results suggest that healthy diet strategies should be tailored to people living in different geographical, climatic, and economic conditions. Regions at high latitude and with cold winters need to limit frying. Attention should be paid to the increase in grilled food and sweet food preference in areas with low altitude or rapid economic development. These results may also serve as a reference for other regions with similar conditions, such as developing countries with rapid economic development and a lack of healthy diet education.22 In China, as the economy develops, the grilled food and sweet food preferences that were positively related to diabetes or hypertension prevalence may increase further, which indicates a need for the government to provide targeted dietary guidance to the public. Dietary preferences were more pronounced among young and middle-aged people, suggesting that dietary health education may have more benefits for younger people.
A main strength of the present study is that we built a new link between online dietary data and real-world health outcomes. In addition, the study had an extremely large sample size, many times larger than that of traditional cohorts or studies, and covered all provinces nationwide. This large sample size could greatly reduce sampling bias, and the study can be considered a real-world study that reflects the actual dietary distribution and its relationship with metabolic status. However, this study has some limitations. First, to comply with ethical and privacy protections, the relation analysis was not made at the individual level. Instead, we analyzed the proportion data of dietary preferences at the regional level. These relations were adjusted by stratification for age and sex, but we were unable to adjust for some related confounding factors such as ethnicity. Nevertheless, predicting the prevalence of chronic diseases from incompletely adjusted dietary preference data is still of practical value. Second, the subjects of this study were Internet users. Because the Internet has developed only recently, young and middle-aged people were the dominant users and older people were relatively few, although more than 400 000 participants in this study were aged 65 years or older. To address this, the final data were weighted on the basis of the age and sex composition of the national census data. Third, because the databases with enough detail for our analysis came from two different years (the Internet data from 2016 and the metabolic disease survey from 2010), there was a time interval between the two databases. However, a comparison of the two national chronic disease surveys1, 2 shows no significant change in the average values of the metabolic parameters of Chinese adults from 2010 to 2013 (FPG: 100.5 mg/dL vs 100.5 mg/dL, PPG: 112.3 mg/dL vs 114.2 mg/dL, BMI: 23.7 kg/m² vs 24.0 kg/m²), which indicates that the influence of the time interval was limited. Finally, this is an ecological study that does not explain causality.
Our study drew a geographical distribution map of Chinese dietary preferences on the basis of large-scale online data and showed the effects of geographical, climatic, and economic factors on dietary preferences. We found that fried food, grilled food, and sweet food preferences were positively related to diabetes risk and that spicy food preference was negatively correlated with diabetes risk. The fact that greater preference for sweet food and grilled food coincided with a higher consumption level signals that the Chinese government should pay attention to healthy diet education when the economy develops rapidly. Using passively collected Internet data to analyze and predict diabetes risk could be timely and accurate. This approach showed good application potential and represents a promising use of Internet data.
ACKNOWLEDGMENTS
This work is funded by National Natural Science Foundation of China (No.81500660, 81500610, 81622011, 81621061). It is also supported by National Key R&D Program of Ministry of Science and Technology of the People's Republic of China (No.2016YFC1305600, 2017YFC1310700, 2016YFC0901200, 2016YFC1304904, 2018YFC1311800), the Program for Professor of Special Appointment (Younger Eastern Scholar) at Shanghai Institutions of Higher Learning (No.QD2016007) and Shanghai Municipal Commission of Health and Family Planning (No.20174Y0014).
CONFLICT OF INTEREST
No potential conflicts of interest relevant to this article are reported.
ETHICAL APPROVAL
This study was approved by the Ruijin Hospital Ethics Committee, Shanghai JiaoTong University School of Medicine.