Mapping disparities in education across low- and middle-income countries

listen audio

Study Justification:
– Educational attainment is an important social determinant of maternal, newborn, and child health.
– Gender equity and education have gained increasing attention in popular media, international aid strategies, and global agenda-setting.
– The global health agenda is focused on evidence of precision public health, which includes the subnational distribution of disease and illness.
– However, to achieve future equity, there is a need for comparable evidence on the distribution of social determinants of health, including educational attainment.
Study Highlights:
– The study estimates the subnational distribution of educational attainment across low- and middle-income countries from 2000 to 2017.
– It provides subnational estimates of mean years of education and the proportion of the population achieving key levels of education for women of reproductive age and equivalent male age groups.
– The analysis uses a Bayesian model-based geostatistical framework and synthesizes data from 528 household and census datasets.
– The study identifies inequalities in educational attainment across geography and within populations.
Study Recommendations:
– The study highlights the need for targeted interventions to address disparities in educational attainment.
– Policy makers should prioritize investments in education to improve maternal, newborn, and child health outcomes.
– Efforts should be made to improve data collection systems to track progress in educational attainment and quality of schooling.
– Further research is needed to understand the complex relationships between education and health outcomes.
Key Role Players:
– Researchers and data analysts to analyze and interpret the data.
– Policy makers and government officials to implement recommendations and allocate resources.
– Education professionals and teachers to improve educational systems and quality of schooling.
– Non-governmental organizations and international aid agencies to support interventions and programs.
Cost Items for Planning Recommendations:
– Investments in education infrastructure, including schools and classrooms.
– Teacher training and professional development programs.
– Educational materials and resources.
– Data collection systems and technology.
– Monitoring and evaluation of educational programs.
– Awareness campaigns and community engagement initiatives.

The strength of evidence for this abstract is 8 out of 10.
The evidence in the abstract is strong because it describes a comprehensive analysis using a Bayesian model-based geostatistical framework and synthesizing data from 528 household and census datasets. The analysis provides subnational estimates of educational attainment across low- and middle-income countries. However, to improve the evidence, the abstract could provide more information on the validation procedures and results, as well as the limitations of the analysis.

Educational attainment is an important social determinant of maternal, newborn, and child health1–3. As a tool for promoting gender equity, it has gained increasing traction in popular media, international aid strategies, and global agenda-setting4–6. The global health agenda is increasingly focused on evidence of precision public health, which illustrates the subnational distribution of disease and illness7,8; however, an agenda focused on future equity must integrate comparable evidence on the distribution of social determinants of health9–11. Here we expand on the available precision SDG evidence by estimating the subnational distribution of educational attainment, including the proportions of individuals who have completed key levels of schooling, across all low- and middle-income countries from 2000 to 2017. Previous analyses have focused on geographical disparities in average attainment across Africa or for specific countries, but—to our knowledge—no analysis has examined the subnational proportions of individuals who completed specific levels of education across all low- and middle-income countries12–14. By geolocating subnational data for more than 184 million person-years across 528 data sources, we precisely identify inequalities across geography as well as within populations.

Using a Bayesian model-based geostatistical framework and synthesizing geolocated data from 528 household and census datasets, this analysis provides subnational estimates of mean numbers years of education and the proportion of the population who attained key levels of education for women of reproductive age (15–49 years), women aged 20–24 years, and equivalent male age bins between 2000 and 2017 in 105 countries across all low- and middle-income countries (LMICs). Countries were selected for inclusion in this analysis using the socio-demographic index (SDI) published in the Global Burden of Disease (GBD) study27. The SDI is a measure of development that combines education, fertility, and poverty. Countries in the middle, lower-middle, or low SDI quintiles were included, with several exceptions. Albania, Bosnia, and Moldova were excluded despite middle SDI status due to geographical discontinuity with other included countries and lack of available survey data. Libya, Malaysia, Panama, and Turkmenistan were included despite higher-middle SDI status to create better geographical continuity. We did not analyse American Samoa, Federated States of Micronesia, Fiji, Kiribati, Marshall Islands, Samoa, Solomon Islands, or Tonga, where no available survey data could be sourced. Analytical steps are described below, and additional details can be found in the Supplementary Information. We compiled a database of survey and census datasets that contained geocoding of subnational administrative boundaries or GPS coordinates for sampled clusters. These included datasets from 528 sources (see Supplementary Table 2). These sources comprised at least one data source for all but two countries on our list of LMICs: Western Sahara and French Guiana. We chose to exclude these two countries from our analysis; 42 of 105 included countries have only subnational administrative level data. We extracted demographic, education, and sample design variables. The coding of educational attainment varies across survey families. In some surveys, the precise number of years of attainment is not provided, with attainment instead aggregated into categories such as ‘primary completion’ or ‘secondary completion’. In such cases, individuals who report ‘primary completion’ may have gone on to complete some portion of secondary education, but these additional years of education are not captured in the underlying dataset. Previous efforts to examine trends in mean years of education have either assumed that no additional years of education were completed (that is, primary education only) or have used the midpoint between primary and secondary education as a proxy28. Trends in the single-year data, however, demonstrate that such assumptions introduce bias in the estimation of attainment trends over time and space, as differences in actual drop-out patterns or binning schema can lead to biased mean estimates29. For this analysis, we used a recently developed method that selects a training subset of similar surveys across time and space to estimate the unobserved single-year distribution of binned datasets29. In comprehensive tests of cross-validation that leveraged data for which the single-year distributions are observed, this algorithmic approach significantly reduces bias in summary statistics estimated from datasets with binned coding schemes compared to alternatives such as the standard-duration method28. The years in all coding schemes were mapped to the country- and year-specific references in the UNESCO International Standard Classification of Education (ISCED) for comparability30. We used a top coding of 18 years on all data; this is a common threshold in many surveys that have a cap and it is reasonable to assume that the importance of education for health outcomes (and other related SDGs) greatly diminishes after what is the equivalent of 2 to 3 years of graduate education in most systems. Data were aggregated to mean years of education attained and the proportions achieving key levels of education. The levels chosen were proportion with zero years, proportion with less than primary school (1–5 years of education), proportion with at least primary school (6–11 years of education), and proportion achieving secondary school or higher (12 or more years of education). A subset of the data for a smaller age bin (20–24 years) was also examined to more closely track temporal shifts. Equivalent age bins were aggregated for both women and men to examine disparities in mean years of attainment by sex. Where GPS coordinates were available, data were aggregated to a specific latitude and longitude assuming a simple-random sample, as the cluster is the primary sampling unit for the stratified design survey families, such as the Demographic and Health Survey (DHS) and Multiple Indicator Cluster Survey (MICS). Where only geographical information was available at the level of administrative units, data were aggregated with appropriate weighting according to their sample design. Design effects were estimated using a package for analysing complex survey data in R31. To leverage strength from locations with observations to the entire spatiotemporal domain, we compiled several 5 × 5-km2 raster layers of possible socioeconomic and environmental correlates of education (Supplementary Table 5 and Supplementary Fig. 6). Acquisition of temporally dynamic datasets, where possible, was prioritized to best match our observations and thus predict the changing dynamics of educational attainment. We included nine covariates indexed at the 5 × 5-km2 level: access to roads, nighttime lightstv, populationtv, growing season, ariditytv, elevation, urbanicitytv, irrigation, and yeartv (tv, time-varying covariates). More details, including plots of all covariates, can be found in the Supplementary Information. Our primary goal is to provide educational attainment predictions across LMICs at a high (local) resolution, and our methods provide the best out-of-sample predictive performance at the expense of inferential understanding. To select covariates and capture possible nonlinear effects and complex interactions between them, an ensemble covariate modelling method was implemented32. For each region, three submodels were fitted to our outcomes using all of our covariate data: generalized additive models, boosted regression trees, and lasso regression. Each submodel was fit using fivefold cross-validation to avoid overfitting and the out-of-sample predictions from across the five folds were compiled into a single comprehensive set of predictions from that model. Additionally, the same submodels were also run using 100% of the data and a full set of in-sample predictions were created. The five sets of out-of-sample submodel predictions were fed into the full geostatistical model as predictors when performing the model fit. The in-sample predictions from the submodels were used as the covariates when generating predictions using the fitted full geostatistical model. This methodology maximizes out-of-sample predictive performance at the expense of the ability to provide statistical inference on the relationships between the predictors and the outcome. A recent study has shown that this ensemble approach can improve predictive validity by up to 25% over an individual model32. More details on this approach can be found in the Supplementary Information. The primary goal of using the stacking procedure in our analyses was to maximize the predictive power of the raster covariates by capturing the nonlinear effects and complex interactions between covariates to optimize the model performance. It has previously been suggested32 that the primary purpose of the submodel predictions is to improve the mean function of the Gaussian process. Although we have determined a way to include the uncertainty from two of our submodels (lasso regression and generalized additive models (GAM)), we have not determined a way to include uncertainty from the boosted regression tree (BRT) submodel into our final estimates. Whereas GAM and lasso regression seek to fit a single model that best describes the relationship between response variable and some set of predictors, BRT method fits a large number of relatively simple models for which the predictions are then combined to give robust estimates of the response. Although this feature of the BRT model makes it a powerful tool for analysing complex data, quantifying the relative uncertainty contributed by each simple model as well as uncertainty from the complex interactions of the predictor variables is challenging33,34. It is worth noting, however, that our out-of-sample validation indicates that the 95% coverage is fairly accurate (for example, closely ranges around 95%) as shown in the figures and table of Supplementary Information section 4.3.2. This indicates that we are not misrepresenting the uncertainty in our final estimates. Gaussian and binomial data are modelled within a Bayesian hierarchical modelling framework using a spatially and temporally explicit hierarchical generalized linear regression model to fit the mean number years of education attainment and the proportion of the population who achieved key bins of school in 14 regions across all LMICs as defined in the GBD study (Extended Data Fig. ​Fig.1).1). This means we fit 14 independent models for each indicator (for example, the proportion of women with zero years of schooling). GBD study design sought to create regions on the basis of three primary criteria: epidemiological homogeneity, sociodemographic similarity, and geographical contiguity27. Fitting our models by these regions has the advantage of allowing for some non-stationarity and non-isotropy in the spatial error term, compared to if we modelled one spatiotemporal random-effect structure over the entire modelling region of all LMICs. Modelling regions were defined as follows. Andean South America, Central America and the Caribbean, central sub-Saharan Africa, East Asia, eastern sub-Saharan Africa, Middle East, North Africa, Oceania, Southeast Asia, South Asia, southern sub-Saharan Africa, Central Asia, Tropical South America, and western sub-Saharan Africa. Regions in grey were not included in our models due to high-middle and high SDIs27. The map was produced using ArcGIS Desktop 10.6. For each Gaussian indicator, we modelled the mean number of years of attainment in each survey cluster, d. Survey clusters are precisely located by their GPS coordinates and year of observation, which we map to a spatial raster location i at time t. We model the mean number of years of attainment as Gaussian data given fixed precision τ and a scaling parameter sd (defined by the sample size in the observed cluster). As we may have observed multiple data clusters within a given location i at time t, we refer to the mean attainment, μ, within a given cluster d by its indexed location i, and time t as μi(d),t(d). For each binomial indicator, we modelled the number of individuals at a given attainment level in each survey cluster, d. We observed the number of individuals reporting a given attainment level as binomial count data Cd among an observed sample size Nd. As we may have observed multiple data clusters within a given location i at time t, we refer to the probability of attaining that level, p, within a given cluster d by its indexed location i and time t as pi(d),t(d). We used a continuation-ratio modelling approach to account for the ordinal data structure of the binomial indicators35. To do this, the proportion of the population with zero years of education was modelled using a binomial model. The proportion with less than primary education was modelled as those with less than primary education of those that have more than zero years of education. The same method followed for the proportion of population completing primary education. The proportion achieving secondary school or higher was estimated as the complement of the sum of the three binomial models. The remaining parameter specification was consistent between all indicators in both binomial and Gaussian models: For indices d, i, and t, *(index) is the value of * at that index. The probabilities pi,t represent both the annual proportions at the space–time location and the probability that an individual had that level of attainment given that they lived at that particular location. The annual probability pi,t of each indicator (or μi,t for the mean indicators) was modelled as a linear combination of the three submodels (GAM, BRT, and lasso regression), rasterized covariate values Xi,t, a correlated spatiotemporal error term Zi,t, country random effects ϵctr(i) with one unstructured country random effect fit for each country in the modelling region and all sharing a common variance parameter, γ2, and an independent nugget effect ϵi,t with variance parameter σ2. Coefficients βh in the three submodels h = 1, 2, 3 represent their respective predictive weighting in the mean logit link, while the joint error term Zi,t accounts for residual spatiotemporal autocorrelation between individual data points that remains after accounting for the predictive effect of the submodel covariates, the country-level random effect ϵctr(i), and the nugget independent error term, ϵi,t. The purpose of the country-level random effect is to capture spatially unstructured, unobserved country-specific variables, as there are often sharp discontinuities in educational attainment between adjacent countries due to systematic differences in governance, infrastructure, and social policies. The residuals Zi,t are modelled as a three-dimensional Gaussian process (GP) in space–time centred at zero and with a covariance matrix constructed from a Kronecker product of spatial and temporal covariance kernels. The spatial covariance Σspace is modelled using an isotropic and stationary Matérn function36, and temporal covariance Σtime as an annual autoregressive (AR1) function over the 18 years represented in the model. In the stationary Matérn function, Γ is the Gamma function, Kv is the modified Bessel function of order v > 0, κ > 0 is a scaling parameter, D denotes the Euclidean distance, and ω2 is the marginal variance. The scaling parameter, κ, is defined to be κ=8v/δ where δ is a range parameter (which is about the distance for which the covariance function approaches 0.1) and v is a scaling constant, which is set to 2 rather than fit from the data37,38. This parameter is difficult to reliably fit, as documented by many other analyses37,39,40 that set this parameter to 2. The number of rows and the number of columns of the spatial Matérn covariance matrix are equal to the number of spatial mesh points for a given modelling region. In the AR1 function, ρ is the autocorrelation function (ACF), and k and j are points in the time series where |k − j| defines the lag. The number of rows and the number of columns of the AR1 covariance matrix are equal to the number of temporal mesh points (18). The number of rows and the number of columns of the space–time covariance matrix, Σspace ⊗ Σtime, for a given modelling region are equal to: the number of spatial mesh points × the number of temporal mesh points. This approach leveraged the residual correlation structure of the data to more accurately predict estimates for locations with no data, while also propagating the dependence in the data through to uncertainty estimates41. The posterior distributions were fit using computationally efficient and accurate approximations in R-integrated nested Laplace approximation (INLA) with the stochastic partial differential equations (SPDE) approximation to the Gaussian process residuals using R project version 3.5.142–45. The SPDE approach using INLA has been demonstrated elsewhere, including the estimation of health indicators, particulate air matter, and population age structure10,11,46,47. Uncertainty intervals were generated from 1,000 draws (that is, statistically plausible candidate maps)48 created from the posterior-estimated distributions of modelled parameters. Additional details regarding model and estimation processes can be found in the Supplementary Information. To transform grid cell-level estimates into a range of information that is useful to a wide constituency of potential users, these estimates were aggregated from the 1,000 candidate maps up to district, provincial, and national levels using 5 × 5-km2 population data49. This aggregation also enabled the calibration of estimates to national GBD estimates for 2000–2017. This was achieved by calculating the ratio of the posterior mean national-level estimate from each candidate map draw in the analysis to the posterior mean national estimates from GBD, and then multiplying each cell in the posterior sample by this ratio. National-level estimates from this analysis with GBD estimates can be found in Supplementary Table 44. To illustrate how subnational progress has contributed differentially to national progress (Fig. ​(Fig.3),3), we decomposed the improvement in the national rate of secondary completion since 2000 for each country into the additive contributions of rate changes at the second administrative level, where C is the national secondary rate change, N is the total number of second-level administrative units, ci is the population proportion in administrative unit i, and ri is the rate of secondary attainment in administrative unit i. Although the model can predict at all locations covered by available raster covariates, all final model outputs for which land cover was classified as ‘barren or sparsely vegetated’ were masked, on the basis of the most recently available Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (2013), as well as areas in which the total population density was less than 10 individuals per 1 × 1-km2 pixel in 201550. This step has led to improved understanding when communicating with data specialists and policy-makers. Models were validated using source-stratified fivefold cross-validation. To offer a more stringent analysis by respecting some of the source and spatial correlation in the data, holdout sets were created by combining sets of data sources (for example, entire survey- or census-years). Model performance was summarized by the bias (mean error), total variance (root-mean-square error) and 95% data coverage within prediction intervals, and the correlation between observed data and predictions. All validation metrics were calculated on the predictions from the fivefold cross-validation. Where possible, estimates from these models were compared against other existing estimates. Furthermore, measures of spatial and temporal autocorrelation pre- and post-modelling were examined to verify correct recognition, fitting, and accounting for the complex spatiotemporal correlation structure in the data. All validation procedures and corresponding results are provided in the Supplementary Information. Our analysis is not without several important limitations. First, almost all data collection tools conflate gender and sex and we therefore do not capture the full distribution of sex or gender separately in our data. We refer throughout to the measurement of ‘gender (in)equality’, following the usage in SDG 5. Second, it is extremely difficult to quantify quality of education on this scale in a comparable way. Quality is ultimately a large part of the SDG agenda and of utmost importance to achieving equity in opportunity for social mobility. However, many studies across diverse low- and middle-income settings have linked attainment, even very low levels, to measurable improvement in maternal and child health17. As our analysis highlights with the proportional indicators, there are still many subnational regions across the world where large proportions do not complete primary school. A third limitation is that we are unable to measure or account for migration. A concept note released from the forthcoming Global Education Monitoring Report 2019 focuses on how migration and displacement affects schooling51. Our estimates of the modelled outcome, educational attainment for a particular space–time–age–sex, are demonstrated to be statistically unbiased (Supplementary Information section 4.3); however, interpretation of any change in attainment as a change in the underlying education system could potentially be biased by the effects of migration. It is possible that geographical disparities reflect changes in population composition rather than changes in the underlying infrastructure or education system. Pathways for this change are complex and may be voluntary. Those who manage to receive an education in a low-attainment area may have an increased ability to migrate and choose to do so. This change may also be involuntary, particularly in politically unstable areas where displacement may make geographical changes over time difficult to estimate. A shifting population composition is a general limitation of many longitudinal ecological analyses, but the spatially granular nature of the analyses used here may be more sensitive to the effects of mobile populations. Our analysis is purely predictive but draws heavily in its motivation from a rich history of literature on the role of education in reducing maternal mortality, improving child health, and increasing human capital. Studies have also demonstrated complex relationships between increased education and a myriad of positive health outcomes, such as HIV risk reductions and spillover effects to other household members52,53. The vast majority of these studies are associational and recent attempts at causal analyses have provided more-mixed evidence54–56. Although causal analyses of education are very difficult and often rely on situational quasi-experiments, associational analyses using the most comprehensive datasets demonstrate consistent support for the connection between education and health17,57. Looking towards future analyses, it will be important to study patterns of change in these data and how they overlap with distributions of health. Lastly, our estimates cannot be seen as a replacement for proper data collection systems, especially for tracking contemporaneous change. Our analysis of uncertainty at a high-resolution may be used to inform investment in more robust data systems and collection efforts, especially if the ultimate goal is to measure and track progress in the quality of schooling. Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

The innovation described in the text is the use of a Bayesian model-based geostatistical framework to estimate the subnational distribution of educational attainment across low- and middle-income countries. This innovation allows for the precise identification of inequalities in educational attainment across geography and within populations. The analysis synthesizes geolocated data from 528 household and census datasets to provide subnational estimates of mean years of education and the proportion of the population who have attained key levels of education. The estimates are disaggregated by gender and age groups and cover the period from 2000 to 2017. The innovation also incorporates raster layers of socioeconomic and environmental correlates of education to improve the accuracy of the estimates. The estimates are validated using source-stratified fivefold cross-validation and compared against other existing estimates. However, there are limitations to the analysis, including the conflation of gender and sex in the data, the difficulty in quantifying the quality of education in a comparable way, and the inability to measure or account for migration.
AI Innovations Description
The recommendation to improve access to maternal health based on the provided description is to use the mapping disparities in education across low- and middle-income countries. This involves estimating the subnational distribution of educational attainment, including the proportions of individuals who have completed key levels of schooling, across all low- and middle-income countries from 2000 to 2017. By geolocating subnational data and using a Bayesian model-based geostatistical framework, this analysis can provide subnational estimates of mean years of education and the proportion of the population who attained key levels of education. This information can help identify inequalities in educational attainment and inform targeted interventions to improve access to maternal health services.
AI Innovations Methodology
The provided text describes a methodology used to estimate the subnational distribution of educational attainment across low- and middle-income countries. While the text does not directly address innovations to improve access to maternal health, it does provide insights into the methodology used to simulate the impact of recommendations on improving access to maternal health. Here is a brief summary of the methodology described:

1. Data Compilation: A database of survey and census datasets containing geocoding of subnational administrative boundaries or GPS coordinates was compiled. These datasets included information on demographic, education, and sample design variables.

2. Educational Attainment Coding: The coding of educational attainment varied across survey families. A method was used to estimate the unobserved single-year distribution of binned datasets, which reduced bias in summary statistics estimated from datasets with binned coding schemes.

3. Covariate Modelling: Several raster layers of socioeconomic and environmental correlates of education were compiled. An ensemble covariate modelling method was implemented, which included fitting three submodels (generalized additive models, boosted regression trees, and lasso regression) to the outcomes using all the covariate data.

4. Bayesian Hierarchical Modelling: Gaussian and binomial data were modelled within a Bayesian hierarchical modelling framework. The mean number of years of education attainment and the proportion of the population achieving key levels of education were modelled as Gaussian and binomial data, respectively.

5. Spatial and Temporal Modelling: A spatially and temporally explicit hierarchical generalized linear regression model was used to fit the mean number of years of education attainment and the proportion of the population achieving key levels of education. The model included covariates, spatiotemporal error terms, country random effects, and an independent nugget effect.

6. Model Validation: The models were validated using source-stratified fivefold cross-validation. Validation metrics such as bias, total variance, coverage within prediction intervals, and correlation between observed data and predictions were calculated.

7. Aggregation and Calibration: The grid cell-level estimates were aggregated to district, provincial, and national levels using population data. The estimates were calibrated to national estimates from the Global Burden of Disease study.

8. Limitations: The methodology has several limitations, including the conflation of gender and sex in the data, difficulty in quantifying the quality of education, inability to measure or account for migration, and the purely predictive nature of the analysis.

While this methodology focuses on educational attainment, a similar approach could be used to simulate the impact of recommendations on improving access to maternal health. Data on maternal health indicators, such as maternal mortality rates and access to prenatal care, could be compiled and modelled using a Bayesian hierarchical modelling framework. Covariates related to healthcare infrastructure, socioeconomic factors, and geographic variables could be included in the models. The models could then be validated and used to estimate the impact of different recommendations on improving access to maternal health at subnational levels.

Partilhar isto:
Facebook
Twitter
LinkedIn
WhatsApp
Email