Substantially underestimated global health risks of current ozone pollution

0
Substantially underestimated global health risks of current ozone pollution

In this study, all variates are first imputed for missing values and re-sampled before being considered as inputs. Next, the processed data and ground truths (output) should be spatiotemporally collocated and fed into the GL-CEF model for training. A total of two validation schemes are then exploited to verify the performance of modelled results. Eventually, the global exposure levels and all-cause mortality burden due to ambient O₃ are carefully assessed and discussed.

Datasets

Previous works widely considered O₃ precursors as key inputs of the model8,19,20,21,22, including NO₂ and HCHO. In our study, the high-resolution (~ 5 km) NO₂ and HCHO tropospheric vertical column density (TroVCD) from TROPOMI50,51 are adopted as the primary variates. The O₃ profile from CAMS52 is also introduced as a primary variate to provide the apriori information of ambient O₃. Meanwhile, multiple frequently used factors are selected as the auxiliary variates to improve the performance of the model, which consist of solar radiation intensity (necessary conditions for photochemical reactions53,54,55), meteorological fields, and geographic elements. The in-situ measurements of MDA8 O₃ from the Open Air Quality (OpenAQ), China National Environmental Monitoring Center, US Environmental Protection Agency, and European Environment Agency are used as the ground truths (output). At last, the latest replay ambient O₃ product from GEOS-CF56 is applied for comparison to the modelled results. More specific details are given as follows.

OpenAQ can provide globally distributed ambient concentrations of major air pollutants, which came from various sources over more than 100 countries. The air quality records from OpenAQ have been broadly utilized in worldwide studies of recent years24,56,57,58,59. In the present study, the global in-situ O₃ measurements during 2019–2021 are collected from OpenAQ and calculated to MDA8 ambient O₃ concentrations (regarded as the ground truths). It’s worth noting that the duration of in-situ measurements per day should exceed 20 h (from the first to the last). Furthermore, only the source of “governments” is employed to guarantee the data quality. The units of all values are transformed to ppb based on corresponding references. Supplementary Fig. 9a illustrates the spatial locations of in situ stations in the globe, using the symbols of red circles. A total of > 7,000 in situ stations (by 2021) are considered in this study, which densely cover China, Europe, the US, etc. Since the data from OpenAQ could be irregularly missing, the in-situ O₃ measurements from the China National Environmental Monitoring Center, US Environmental Protection Agency, and European Environment Agency are selected for supplement.

TROPOMI devised a self-appropriate atmospheric NO₂ retrieval algorithm based on that from the Ozone Monitoring Instrument, which can generate NO₂ TroVCD worldwide51. The differential optical absorption spectroscopy method, a chemical transport model (or data reanalysis system), and an air-mass factor lookup table were all introduced in the TROPOMI NO₂ retrieval algorithm. Related steps are as below: (1) Retrieve NO₂ total slant column density via the differential optical absorption spectroscopy method. (2) Separate the stratospheric and tropospheric parts using the chemical transport model (or data reanalysis system). (3) Transform the tropospheric slant column density to TroVCD according to the air-mass factor lookup table. In our study, the record of “nitrogendioxide_tropospheric_column” is applied as a primary variate for modelling global ambient O₃ concentrations from 2019 to 2021. Supplementary Table 6 lists more information about the TROPOMI NO₂ TroVCD.

TROPOMI applied a method based on differential optical absorption spectroscopy and combined ultraviolet spectral bands to produce HCHO TroVCD globally50. The detailed procedures of TROPOMI HCHO retrieval algorithm were similar to those of NO₂. In the present study, the global record of “formaldehyde_tropospheric_vertical_column” during 2019–2021 is adopted as a primary variate. Supplementary Table 6 provides more details about the TROPOMI HCHO TroVCD.

CAMS was the fourth generation of globally gridded atmospheric reanalysis product from the European Centre for Medium-Range Weather Forecasts52. Based on the mechanisms of physics and chemistry, CAMS can provide multiple chemical components by integrating simulations of chemical transport model with worldwide measured data. The general spatial and temporal resolutions for CAMS were 0.75° and 3-hour, respectively. CAMS reanalysis product has been extensively exploited for previous atmospheric works over the globe52, which suggested its reliable data quality. In this study, the global record of “ozone_mass_mixing_ratio” (O₃ profile) from 2019 to 2021 is selected as a primary variate to introduce the apriori information of ambient O₃ into the model. Considering that ground truths were MDA8 ambient O₃ concentrations, the daily maximum 9-hour averaged O₃ profile is employed. Supplementary Table 6 shows the specific information of the CAMS O₃ profile.

Similar to CAMS, ERA5 reanalysis product was also devised by the European Centre for Medium-Range Weather Forecasts60, which involved simulations of chemical transport model and actual measured data. In general, ERA5 can generate atmospheric/surficial parameters with spatial and temporal resolutions of 0.25° and hourly, respectively. In our study, the records of solar radiation intensity (i.e., “surface_solar_radiation_downwards”) and several meteorological fields are regarded as auxiliary variates for modelling global ambient O₃ concentrations during 2019–2021. Attributed to that ground truths are MDA8 ambient O₃ concentrations, the MDA8 solar radiation intensity, air temperature, and dew point temperature are utilized. As for other meteorological fields, the hourly values for each day are averaged in this study. Detailed information can be referred to in Supplementary Table 6.

Previous works broadly adopted the geographic elements as auxiliary inputs of the model8,9,20,21,61 due to their significant association with the spatial distribution of ambient O₃. In this study, the normalized differential vegetation index (NDVI)62 and land use classes63 from the moderate resolution imaging spectroradiometer (MODIS) with LandScan population density64 worldwide are deemed as auxiliary variates to increase the robustness of the model. Supplementary Table 6 lists the specific details of geographic elements.

GEOS-CF was developed by NASA in 2021, including two versions: forecast and replay (improved through reanalysed meteorological fields)56. It can produce various global atmospheric chemical components based on simulations of chemical transport model, with spatial and temporal resolutions of 0.25° and hourly, respectively. In the present study, the replay record of “surface_ozone” is exploited in the validation for comparison with the modelled MDA8 ambient O₃ concentrations globally. More information about GEOS-CF can be found in Keller et al. 56.

Data preprocessing

The surface emissions of O₃ precursors had continuous and complicated impact on ambient O₃ distribution65,66,67. Therefore, the NO₂ and HCHO TroVCD are averaged to acquire monthly data, which reflect the conditions for surface emissions of O₃ precursors in our study. Meanwhile, the average by month can reduce data noise (especially for HCHO TroVCD68) and improve the coverage of available values.

Afterward, the data interpolating empirical orthogonal functions (DINEOF) method69 is employed to recover the missing information in monthly NO₂ TroVCD, HCHO TroVCD, and NDVI, relying on their spatiotemporal self-correlation. The brief procedures of the DINEOF method are as follows: 1) Initialize missing values and unfold the 3-dimensional origin data to a 2-dimensional matrix (M) along the spatial dimension; 2) Decompose M using the singular value decomposition; 3) Reconstruct a matrix Mr with top-k singular values and replace the missing values of M with those of Mr; and 4) Repeat the procedures of 2) and 3) until the errors reach a pre-determined threshold and reshape M to final results according to the origin dimensions. More details of the DINEOF method can be referred to in Alvera-Azcárate et al. 69. Supplementary Table 7 shows the simulated experiment results of the DINEOF method. The missing masks for validation in the simulation are acquired from the real scenes of origin products. As listed, the imputed results in monthly NO₂ TroVCD, HCHO TroVCD, and NDVI present expected accuracy via the DINEOF method, with the correlation coefficient (CC) of 0.85, 0.8, and 0.91, respectively.

Furthermore, the spatial resolutions of various variates need to be consistent in our study. Considering the spatial resolution of TROPOMI products (~ 5 km), a global grid of 3600 × 7200 (0.05°) is adopted. To be specific, the monthly NO₂ and HCHO TroVCD are re-sampled to 0.05° through the nearest neighbouring interpolation70. The re-sampling methods for other variates are inverse distance weighted interpolation71 and area-weighted aggregation72 (Supplementary Table 6).

Remote-sensing and reanalysis products were gridded data with various spatial and temporal resolutions, while the applicable ranges of in-situ measurements only focused on small regions. Therefore, it is required to unify the spatial and temporal dimensions between gridded data and ground truths. Initially, the variates with multiple temporal resolutions, such as monthly NO₂ TroVCD and daily meteorological fields, are mutually aligned. Next, all the ground truths falling on the same grid are averaged to collocate with the aligned gridded data.

Model description

The GL-CEF model includes global, local, and global-local modules, which can adopt global and local knowledge over station-sparse (or no-station) and station-dense regions, respectively. Specific information about the three modules is shown in the following parts.

Global module: as depicted in Supplementary Fig. 9b, a certain number of collocated grids are firstly given (blue circles). Next, a transition zone is set in the regions that are [lmin, lmax] from the collocated grids. If the distance between the target grid i (yellow square) and its h-th nearest collocated grid is greater than lmax, i is deemed as station-sparse, and the global module should be exploited for the estimation (G-value). In our study, the global module utilizes all collocated samples in the modelling, which adopts the deep forest73 (see the Supplementary Methods for details) as the global sub-model. The general expression of the global sub-model is defined in Eq. (1).

$$VG_O3=F_G(V_THCHO,V_TNO2,V_CAMS,V_ESRI,V_EMF,V_GE,V_TC)$$

(1)

where \(VG_O3\) represents the modelled MDA8 ambient O₃ concentrations through the global sub-model (G-value). \(F_G\) stands for the global sub-model based on the deep forest. \(V_THCHO\), \(V_TNO2\), \(V_CAMS\), \(V_ESRI\), and \(V_EMF\) indicate the TROPOMI HCHO TroVCD, TROPOMI NO₂ TroVCD, CAMS O₃ profile, ERA5 solar radiation intensity, and EAR5 meteorological fields. \(V_GE\) denotes geographic elements, including NDVI, land use classes, and population density. \(V_TC\) signifies the temporal encoding74. Supplementary Table 8 lists the parameters of the global sub-model designed in this study.

Local module: as illustrated in Supplementary Fig. 9c, if the distance between the target grid i and its h-th nearest collocated grid is less than lmin, i is regarded as station-dense, and the local module ought to be adopted for the estimation. The local module involves two highlights: sliding block strategy and variable local sub-model.

Sliding block strategy. Previous localized methods15,19,20 normally built independent sub-model for each target grid, which yielded commendable performance but required large time consumptions. These methods were likely unfit for the global modelling task. As a result, the sliding block strategy is proposed for the fast training of local sub-models in our study. Related procedures are as follows. A sliding block with the radius of r is first selected and then traverses all the target grids with the step size of s. To ensure that collocated samples can smoothly change in adjacent sliding blocks, a buffer distance of db is introduced, which is larger than r. If the collocated samples in some sliding blocks are too few (<cm), they should be discarded. Next, the collocated samples in each sliding block are utilized for the independent modelling, which can generate several well-trained local sub-models. The local sub-models trained by the nearest N sliding blocks from the target grids are adopted for their intermediate estimation (Ls-values). Finally, the Ls-values are aggregated to acquire the modelled result (L-value). In addition, if the collocated samples included in a few adjacent sliding blocks remain unchanged, they need to be discarded. This can avoid multiple identical local sub-models that likely lead to discontinuous modelled results. Supplementary Table 8 shows the parameters of the sliding block strategy devised in this study.

Variable local sub-models. Generally, the complexity of the model is supposed to be positively correlated with the number of collocated samples. In our study, the variable local sub-models are developed deriving from the light gradient boosting machine75 (see the Supplementary Methods for details). The parameters of variable local sub-models can automatically vary depending on the number of collocated samples, whose key parameters (see bold fonts in Supplementary Table 8) are determined as provided in Eq. (2).

$$p=\left\{\beginarraycp_\min ,cou \, < \, q_\min \\ Rou\left(\fraccou-q_\min q_\max -q_\min \times \left(p_\max -p_\min \right)+p_\min \right),q_\min \, < \, cou \, < \, q_\max \\ p_\max ,cou\ge q_\max \endarray\right.$$

(2)

where \(p_\min \) and \(p_\max \) stand for the minimum and maximum of key parameters, respectively. \(q_\min \) and \(q_\max \) represent the minimum and maximum of number thresholds, respectively. \(cou\) is the number of collocated samples. \(Rou(\bullet )\) indicates the rounding function. The general expression of the variable local sub-models is defined in Eq. (3).

$$VL_O3=\frac1N\mathop\sum \limits_n=1^NF_VLn(V_THCHO,V_TNO2,V_CAMS,V_ESRI,V_EMF,V_TC)$$

(3)

where \(N\) denotes the number of variable local sub-models. \(VL_O3\) signifies the modelled MDA8 ambient O₃ concentrations (L-value) aggregated from the intermediate estimation through the variable local sub-models (Ls-values). \(F_VLn\) indicates the n-th VLSM deriving from the light gradient boosting machine. It is worth noting that the inputs of variable local sub-models discard geographic elements, which ensures their collocated samples can present sufficient temporal differences.

Global-local module: as displayed in Supplementary Fig. 9d, a transition zone is set to smooth the modelled results in the connections of global and local modules. Regarding the target grid i in the transition zone, the modelled results of global (G-value) and local (L-value) modules are merged using a geospatial weighting method. Related weights are defined in Eqs. (4)–(6).

$$VW_O3=VG_O3\times w_1+VL_O3\times w_2$$

(4)

$$w_1=\fracl_1^2l_1^2+l_2^2$$

(5)

$$w_2=\fracl_2^2l_1^2+l_2^2$$

(6)

where \(VW_O3\) indicates the MDA8 ambient O₃ concentrations after geospatial weighting (W-value). \(l_1\) and \(l_2\) stand for the distances between i and two boundaries (lmin and lmax) of the transition zone, respectively.

Validation scheme

The spatial performance of the GL-CEF model need to be emphatically validated. Hence, the SICV and TESICV schemes of 5 folds are utilized in the present study, which focus on spatial accuracy and spatiotemporally predictive ability, respectively. As depicted in Supplementary Fig. 6a, all the collocated samples are first divided into 5 folds based on spatial locations in the SICV scheme. Next, the GL-CEF model will be trained and validated using 80% (4 folds) and 20% (1 fold) of the collocated samples, respectively. Finally, the above step should be repeated 4 times until each fold has been adopted. As for the TESICV scheme, the only difference is that the training and validation sets came from 2020–2021 and 2019, respectively. In our study, the global imputed and modelled results are verified with the help of 5 metrics: R2, as defined in Eq. (7); CC, as defined in Eq. (8); RMSE, as defined in Eq. (9); relative percentage error (RPE), as defined in Eq. (10); and mean bias, as defined in Eq. (11).

$$\rmR^2=1-\frac\sum (\hatv-v)^2\sum (\barv-v)^2$$

(7)

$$\rmCC=\frack\sum v\hatv-\sum v\sum \hatv\sqrtk\sum v^2-\left(\sum v\right)^2\sqrtk\sum \hatv^2-\left(\sum \hatv\right)^2$$

(8)

$$\rmRMSE=\sqrt\frac1k\sum (\hatv-v)^2$$

(9)

$$\rmRPE=\frac\rmRMSE\barv$$

(10)

$$\rmmean\; bias=\frac1k\sum (\hatv-v)$$

(11)

where \(k\) represents the number of collocated samples. \(\hatv\), \(v\), and \(\barv\) stand for the modelled, benchmark, and mean benchmark values, respectively.

Population exposure estimation

In this study, the population-weighted MDA8 ambient O₃ concentrations are acquired using Eq. (12)13,76.

$$v_pw=\frac\sum pop_i\times \hatv_i\sum pop_i$$

(12)

where \(\hatv_i\) indicates the modelled MDA8 ambient O₃ concentrations on grid i. \(pop_i\) stands for the population density on grid i. \(v_pw\) represents the population-weighted MDA8 ambient O₃ concentrations over target region. In the meantime, the standard deviation of \(v_pw\) (i.e., \(std_pw\)) is computed with Eq. (13)57,77.

$$std_pw=\sqrt{\frac{\sum pop_i\times (\hatv_i-v_pw)}\sum pop_i\times (e-1)/e}$$

(13)

where \(e\) signifies the grid count of the population density exceeding 0. Moreover, the exposure levels during a period are calculated by Eq. (14)77,78, which temporally cumulates the MDA8 ambient O₃ concentrations on each grid.

$$EL_U,D\left(U_x,D_y\right)=\frac\sum pop_i\times I_U,D\left(U_i \, > \, U_x,D_i \; > \; D_y\right)\sum pop_i\times 100\%$$

(14)

where \(EL_U,D(U_x,D_y)\) reflects the fraction of the population exposed to MDA8 ambient O₃ concentrations > \(U_x\) ppb for more than \(D_y\) days. \(I_U,D(U_i \, > \, U_x,D_i \, > \, D_y)\) denotes positive as the day count of MDA8 ambient O₃ concentrations > \(U_x\) ppb surpasses \(D_y\) days. Otherwise, it is configured to negative.

Mortality burden estimation

In the present study, the long-term all-cause mortality burden due to O₃ exposure is computed according to a log-linear exposure-response function between ambient O₃ and premature deaths from epidemiological researches. The all-cause diseases are defined by A00 to R99 in the International Classification of Diseases 10. The full procedures can be expressed as Eqs. (15)–(18)2,11,34,79.

$$RR_i=e^\gamma (v_mi-v_s)$$

(15)

where \(RR_i\) represents the pooled relative risk on grid i. \(\gamma\) indicates the pooled effect value of long-term ambient O₃ exposure for all-cause deaths from a worldwide meta-analysis involving 226 million participants36, with a value of 1.39 × 10−3 (95% CI: 8.9597 × 10−4, 1.8822 × 10−3) per ppb. This suggests that an increment of 10 ppb in warm-season MDA8 ambient O₃ concentrations is associated with a pooled relative risk of 1.014 (95% CI: 1.009, 1.019) for long-term exposure. \(v_mi\) is the warm-season modelled MDA8 ambient O₃ concentrations on grid i. \(v_s\) stands for the long-term threshold of counterfactual concentrations advised by WHO in 202180, which denotes its AQG (30.07 ppb) as the starting point.

$$paf_i=1-\frac1RR_i$$

(16)

where \(paf_i\) reflects the PAF on grid i, indicating the proportion of mortality burden that will be eliminated when ambient O₃ is decreased to the threshold of counterfactual concentrations (30.07 ppb for long term).

$$tpaf=\frac\sum pop_i\times paf_i\sum pop_i$$

(17)

$$dea=\sum {pop}_i\times tpaf\times mor$$

(18)

where \(tpaf\) signifies the PAF over target region. \(dea\) denotes the annual long-term O₃-attributable deaths over target region. \(mor\) indicates the annual baseline mortality from the GBD 202112 over target region. For different land use classes, \(tpaf\) is acquired using the population of a single class.

As for short-term O₃ exposure, the associated all-cause deaths are also calculated through Eqs. (15)–(18). By comparison, \(\gamma\) can be obtained from another worldwide meta-analysis for short-term all-cause deaths, with a value of 8.5627 × 10−4 (95% CI: 6.7736 × 10−4, 1.035 × 10−3)35 per ppb. This demonstrates that an increment of 10 ppb in MDA8 ambient O₃ concentrations is associated with a pooled relative risk of 1.0086 (95% CI: 1.0068, 1.0104) for short-term exposure. \(v_mi\) stands for the modelled MDA8 ambient O₃ concentrations on grid i. \(v_s\) indicates the short-term threshold of counterfactual concentrations provided by WHO in 202180, which deems its AQG (50.11 ppb) as the starting point. The daily PAF is first accumulated and then averaged to acquire the annual value for the short-term O₃ exposure.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

link

Leave a Reply

Your email address will not be published. Required fields are marked *