Geostatistical linkage of national demographic and health survey data: a case study of Tanzania

listen audio

Study Justification:
– Combining Service Provision Assessment (SPA) surveys with Demographic and Health Surveys (DHS) can provide valuable information on health service availability in low- and middle-income countries.
– Existing linkage methods have limitations and can only be established at aggregate levels.
– This study aims to develop a geostatistical approach to link SPA and DHS data at a disaggregated level, addressing the limitations of existing methods.
Study Highlights:
– The geostatistical linkage method developed in this study outperformed existing methods in terms of prediction accuracy and classification error.
– The method was validated using information on the general service readiness of health facilities in Tanzania.
– The geostatistical linkage method was also found to be more robust in handling the displacement of DHS clusters.
Study Recommendations:
– The proposed geospatial approach can be used in various public health research applications where facility and population-based data need to be combined at a fine spatial scale.
– Further research can explore the application of this method in other countries and regions.
– Policy makers can consider implementing the geostatistical linkage method to improve the accuracy and precision of health service availability assessments.
Key Role Players:
– Ministry of Health and Social Welfare
– Prime Minister’s Office
– National Bureau of Statistics
– ICF International
– Field teams for data collection
– Researchers and statisticians for data analysis and modeling
Cost Items for Planning Recommendations:
– Data collection costs (including fieldwork, survey administration, and data management)
– Technical assistance costs
– Software and computing resources
– Research and analysis costs
– Training and capacity building costs
– Dissemination and communication costs

The strength of evidence for this abstract is 8 out of 10.
The evidence in the abstract is strong because it presents a detailed description of the methods used and the results obtained. The authors clearly explain how they developed a geostatistical approach to link two datasets at a disaggregated level, and they provide validation criteria for the results. The abstract also highlights the advantages of the proposed approach over existing methods. To improve the evidence, the authors could provide more information on the sample size and representativeness of the datasets used, as well as any potential limitations or biases in the data collection process.

Background: When Service Provision Assessment (SPA) surveys on primary health service delivery are combined with the nationally representative household survey—Demographic and Health Survey (DHS), they can provide key information on the access, utilization, and equity of health service availability in low- and middle-income countries. However, existing linkage methods have been established only at aggregate levels due to known limitations of the survey datasets. Methods: For the linkage of two data sets at a disaggregated level, we developed a geostatistical approach where SPA limitations are explicitly accounted for by identifying the sites where health facilities might be present but not included in SPA surveys. Using the knowledge gained from SPA surveys related to the contextual information around facilities and their spatial structure, we made an inference on the service environment of unsampled health facilities. The geostatistical linkage results on the availability of health service were validated using two criteria—prediction accuracy and classification error. We also assessed the effect of displacement of DHS clusters on the linkage results using simulation. Results: The performance evaluation of the geostatistical linkage method, demonstrated using information on the general service readiness of sampled health facilities in Tanzania, showed that the proposed methods exceeded the performance of the existing methods in terms of both prediction accuracy and classification error. We also found that the geostatistical linkage methods are more robust than existing methods with respect to the displacement of DHS clusters. Conclusions: The proposed geospatial approach minimizes the methodological issues and has potential to be used in various public health research applications where facility and population-based data need to be combined at fine spatial scale.

There are 20 regions in mainland Tanzania and we focused on the two regions: Iringa and Njombe. These were selected because two of the authors have extensive contextual knowledge of these regions. The two regions, located in the Southern Highlands in the southwestern part of the country, are comprised of multiple districts and town councils as shown in Fig. 1. The census of health facility overlaid with SPA sampled facility within the administrative boundaries of Iringa and Njombe (shaded with green and yellow color, respectively). The gray line denotes the boundary of districts (administrative level 2) within each region. The DHS clusters are denoted by red circles In mainland Tanzania, the public health care system is organized as follows. Primary health care services are the most common and comprise community-based health activities (disease control programs, community outreach, etc.), village health services (often health services offered in home by village health workers), dispensaries and health centers. Dispensaries are generally run by a clinical assistant and can provide preventive and curative services, while health centers can admit patients and provide some surgical services planning [20]. For example, dispensaries provide maternal and child health care, treat simple medical problems during pregnancy such as anemia, assist with normal deliveries, and offer basic outpatient curative care to between 6,000 and 10,000 people. Health centers are normally run by clinical offers and serve population of approximately 50,000 people. Health centers are intended to provide preventive care, but also often have between 10 and 20 beds and offer reproductive health services and minor surgery. Clinics are another category of health facilities at the primary level which are privately owned and offer general and specialist examinations and outpatient treatment services. At the secondary level, council hospitals provide health care and medical and basic surgical services, while regional hospitals provide specialist medical care. Finally, at the tertiary level, zonal and national hospitals provide advanced medical care and training in medical, paramedical, and nursing care. The two entities jointly responsible for the delivery of public health services are the Ministry of Health and Social Welfare and the Prime Minister’s Office. Data used in the present study came from multiple and publicly available sources. The 2014–2015 Tanzania SPA Provision Assessment (TSPA) provides information on availability of basic and essential health care services and readiness to provide quality services [20]. The TSPA is a sample of formal-sector health facilities in Tanzania. The sampling frame for the survey is a master list of 7102 facilities in Tanzania and Zanzibar, including hospitals, health centers, clinics, and dispensaries. These include government, private for-profit, parastatal, and faith-based facilities. A total of 1200 facilities were selected for the survey to represent a nationally representative sample by facility type and managing authority. Hospitals were oversampled as they exist in comparatively small numbers. In each facility, data were collected using a facility inventory questionnaire, health provider interview, observation protocols, and exit interviews for antenatal care and family planning clients. Fieldwork was carried out between October 2014 and February 2015. For the 2015–2016 Tanzania Demographic and Health Survey and Malaria Indicator Survey (TDHS-MIS), National Bureau of Statistics (NBS) carried out and supervised fieldwork activities, while ICF International provided technical assistance. Sampling was carried out in two stages. In the first stage, 608 clusters corresponding to enumeration areas delineated for the 2012 census were selected. For the second stage, a complete listing of households in all 608 selected clusters was carried out, and then 22 households were randomly selected from each cluster, yielding a representative probability sample of 13,376 [22]. GPS data were collected at the cluster level, accurate to less than 15 m, although their latitude/longitude positions were randomly displaced to ensure respondent confidentiality. The range of displacement in rural clusters is 0 to 5 km, and then 1% of clusters are displaced with a range of 0 to 10 km. Data collection was carried out by 16 field teams across Tanzania between August 2015 and February 2016. Census data on all health facilities in Tanzania were obtained from the Ministry of Health, Community Development, Gender, Elderly and Children (MoHCDEC) website (https://www.moh.go.tz/hfrportal). These data include facility name, type, ownership, and GPS coordinates. Data were obtained from the website in 2019 as an up-to-date census of existing facilities. Census data were unavailable for the same year as the TSPA was sampled, although the yearly rate of change in health facility placement is usually very low. We sourced the gridded population data for Iringa and Njombe regions at 100 m spatial resolution projected for years of 2014 and 2015 from WorldPop database (www.worldpop.org). The road network data were obtained from Humanitarian Data Exchange (https://data.humdata.org/). We classified this dataset originally derived from OpenStreetMap into one of the following classes: primary, secondary tertiary roads, and tracks. We collected the elevation information from Shuttle Radar Topography Mission (SRTM) at the resolution of 3-arc sec, which was produced by National Aeronautics and Space Administration (NASA). We downloaded the digital elevation model (DEM) dataset from Jet Propulsion Laboratory (https://www2.jpl.nasa.gov/srtm/). The land use product was obtained from AFRICOVER project (Di Gregorio & Latham, 2009), which we downloaded from Food and Agriculture Organization of the United Nations (http://www.fao.org/geonetwork/srv/en/main.home). The land cover data originally derived from Landsat satellite images were regrouped into ten types of land uses: built area, woodland, forest, tree savannah, shrub land, sparse herbaceous vegetation, herbaceous crops, shrub crop, flooded vegetation, and water bodies. We processed the raw data for modeling, including the elevation data being resampled at the spatial resolution of 500 m via a bilinear interpolation in Fig. 2D, and other datasets, such as road network in Fig. 2A, land cover information in Fig. 2B, and population density in Fig. 2C. (A) Road networks, (B) Land uses, (C) Population density, and (D) Elevation To estimate service readiness, we were guided by WHO standards outlined in the Service Availability and Readiness Assessment Reference Manual [35]. For each facility, we used data from SPA to create a score summarizing service readiness comprising various items related to its operational capacity (see Appendix A for details). If the facility had the specified item, we coded the item as 1 and 0 otherwise. Then for each dimension (amenities, basic equipment, infection prevention, diagnostic capacity, and essential medicines), we created a sub-score ranging from 0 to 1 indicating the average items the facility had for that dimension. Then we created the general service readiness (SR) score by averaging the sub-scores by facility and then multiplying by 100. Thus, the SR score has a potential range from 0 to 100. We aimed to account for both the location-specific conditions, such as elevation, land use types, and population density where the facility is located, as well as the spatial correlation among the general SR scores of health facilities. We assume that health facilities in the study area share common geographic, socio-economic, and demographic characteristics that attain spatial efficiency in the spatial distribution of health facilities [24], given that the cost of developing new health facilities in LMICs is high. For example, both Mishra et al. [21] and Pu et al. [25] identified key factors that are important to consider for optimal locations for new health facilities in LMICs. They include the distance to the nearest health facility, population ratio, connection to road networks, land use type, and elevation. We expected that the key determinants of health facility sites used in the present study, which were not necessarily exhaustive, would be shared by health facilities that were both included and not included in the SPA survey. Here we addressed the challenge of making a link based only on a sample data in SPA instead of a census of health facilities by borrowing information from the SPA sampled health facilities and their surrounding environmental characteristics. We would also account for both the geographic proximity to health care facilities from prediction locations, including DHS clusters, and the distance between any two SPA facilities. To achieve this goal, we used two kriging algorithms: simple kriging with varying local mean [9] and kriging with an external drift (also referred to as universal kriging) [4, 7, 27, 36]. Both algorithms can represent the spatial variation of the general service readiness via a stochastic surface. The general SR of each facility was considered a regionalized variable whose mean varied in space and was modeled by a local trend, while the remaining variation not captured by the local trend followed a second-order stationarity [12]. Variogram (or covariance) characterizes the second-order stationarity of the regionalized variable where the geographic proximity to health facilities can be realistically modeled. Hereafter, we use the following notation to explain the two kriging models: u denotes a location in the study area u∈A. The response variable, general SR, is denoted as Z(u), which is viewed as a regionalized variable and the observed general SR scores at the 77 SPA facilities [z(ui),i=1,…,77] correspond to a realization of the regionalized variable. Generalized linear models for trend estimation: We estimated the local trend of the general SR using a multivariate linear regression with covariates that summarize the physical environment, socio-economic conditions, and demographic characteristics surrounding each health facility. Specifically, the local trend was modeled as where the four covariates consist of elevation x1(u), log-transformed population density x2(u), road density x3(u), and the distance from the location u to the nearest major road x4(u). These covariates were selected based on an exploratory analysis where land use derived variables, such as the proportions of land use types per 500 m, were found to be insignificant in explaining the variation of the general SR scores of SPA facilities. We have used the four variables for both the model fit and the prediction: a linear model was fitted at SPA facility sites to estimate the local trend and prediction was made over a grid with a cell size 500 m imposed on the study area. We expect that the local trend model in Eq. (1) predict high general SR score at locations when a similar set of physical conditions are met, and thus overcome the challenges of using a subset of health facility data. Estimating spatial structure via variogram analysis: The variogram analysis was performed using the residuals of the general SR scores obtained at SPA health facility sites. The residual values were computed by subtracting the trend estimate from the general SR scores as r(ui)=z(ui)-m^SK(ui),i=1,…,77. In variogram, the spatial variation of general SR scores depends only on the relative locations of health facilities. That is, the similarities of general SR scores of two health facilities vary as a function of their distance regardless their absolute locations: if a pair of health facilities are away from each other, their SR scores likely to differ, but their SR scores are similar if they are close together. Spatial prediction of general service readiness: Under the model decision of second-order stationarity, the mean of the general SR does not depend on the location and it represents global information shared to all unsampled locations. In the simple kriging with varying local means (SKLM), we replaced the mean m by known varying means m^SK(u) to account for the set of covariates available at each location, which led to the following form of the simple kriging with varying local mean estimator Z^SKlm(u): where λiSK(u) denotes the kriging weight assigned to the general SR score at the i-th health facility in the SPA survey. The varying means m^SK(u) was estimated using a generalized linear models with a Gaussian distribution assumption in the present work. The residual values at locations (grid cells imposed on the study region) were estimated using the simple kriging with varying local means estimator with the residual data r(ui),i=1,…,77. The final estimate of general SR scores were obtained by adding the trend estimate m^SK(u) to the SK estimates of the residual r^(u). Kriging with an external drift (KED) is similar to kriging with the simple kriging with varying local means in Eq. (2) in that the trend is modeled as a linear function of a smoothly varying auxiliary variables, but it is different because the mean m(u) is not estimated through a regression process prior to the kriging of Z. The KED estimator is The kriging weights λiKED(u) are the solution of the following system of (77+5) linear equations where the trend m(u) is modeled as a linear function of smoothly varying variables, the covariance function of separation vector between any pair of residuals estimated at SPA facility sites CR(ui-uj), and the covariance of a pair of facility in the survey and the prediction location CR(ui-u). The four covariates xk,k=1,…,4 denote the elevation, log-transformed population density, road density, and the distance to the nearest road, respectively. Statistical Calibration: In both the local trend estimation and kriging models, some model predictions exceeded the range of general SR scores. Although it is possible that some dispensaries may have a lower or higher general SR score than measured values in the SPA facility survey, any model prediction outside the range of 0 and 100 is unacceptable. To avoid these spurious results and rectify a systematic bias if present, we performed a stochastic calibration using a subset of the census data, which hereafter referred to as testing sites. Note that the census data contain the locational information of complete health facilities in the study area, although their SR scores are unavailable except at SPA facility sites. However, the information on the facility type—Hospital, Health Center, Dispensary, and Clinic—was known at all health facilities registered in the census data. We created testing sites as a subset of the census data by applying a criterion based on the geographic proximity to the same type of facilities in SPA sample. Here, we assumed that two adjacent health facilities that were classified as the same type of health facility likely shared equivalent or similar general SR scores. To operationalize this concept, we first examined each facility in the census if there was a SPA facility of the same type located within 5 km. If so, the facility in the census data would be included in the ‘testing sites’ and we assigned the general SR score of the nearest SPA facility to the testing site. Here we limited the search radius of the nearest same type SPA facility within 5 km based on a sensitivity analysis (see Appendix B for details). We examined the correlation between a pair of the same type of two nearest SPA health facilities’ SR scores with respect to a maximum search range, which are summarized in Table 4. Sensitivity analysis: correlation coefficients of general SR scores In the second step, we extracted predicted values at testing sites from a local trend model, SKLM, and KED models. Based on both the modeled and reference values at the testing sites, we developed a simple linear model to minimize the gap between model predictions and reference values. The fitted model was used to calibrate the model predictions across the study area. Performance Evaluation of Linkage Methods: We evaluated the performance of the geostatistical linkage methods using the census of health facilities. Because general SR scores were available only at the subset of the census (77 facilities included in SPA), we could not directly use the census data in the evaluation. Instead, we used the information on the facility type as a proxy variable and quantified two evaluation metrics: (1) the prediction error at a subset of the census health facilities, i.e., testing sites identified in the calibration; (2) the classification error per facility type over the census of health facilities. To calculate prediction error of the three algorithms of a generalized linear regression model, SKLM, and KED, we fitted the models with SPA data and obtained the predicted SR scores at testing sites. We also estimated general SR scores at testing sites using three commonly used linkage methods: administrative boundary link, Euclidean buffer link, and kernel density estimation methods. The details of the three linkage methods can be found in Skiles et al. [28], but we linked the two data sets within a district boundary (i.e., the administrative level 2 mapped in Fig. 1) for the administrative boundary link, and we created a 5 km buffer zone at each facility in testing sites for Euclidean buffer link. Lastly, the kernel density was created at a grid with cell size of 500 m using general SR scores of SPA data as a density variable. We calculated the absolute mean prediction error by taking the differences between linkage method specific predictions and reference values at testing sites and summarized them by calculating their mean. For the classification error, we assessed if the predicted general SR score at each health facility is within the facility type specific range of general SR scores inferred from SPA. For each health facility in the census, we compared the general SR score predictions obtained from different linkage methods with the corresponding lower and upper bounds specific to the facility type. The comparison results were quantified as classification error across all the 563 health facility sites. If a model predicts a general SR score that is within the range of lower and upper bounds per facility type in Table 2, a value of 1 was assigned and 0, otherwise. We conducted this evaluation across all health facilities and computed the total success rate (0 to 100%) for each model. The range of general service readiness scores per health facility type *SD: Standard Deviation Based on the proposed geostatistical linkage approach, we estimated the general SR at 40 DHS clusters. Under the consideration of the displacement each DHS cluster, we generated 100 sets of alternative and equally probable cluster locations. We simulated locations that are within 5 km for urban clusters and 10 km for rural clusters [2]. At each simulated DHS cluster, we estimated the general SR scores from the geostatistical methods and quantified their variability originating from the displacement. For the purpose of comparison, we also computed the general SR scores at the simulated cluster locations using other traditional methods and compared the absolute differences with the proposed geostatistical methods, as well as the variability. All statistical analysis was conducted with R software (version 4.0.2) using glm for a trend estimation, gstat package for variogram modeling and spatial prediction, and ks package for kernel density estimation.

N/A

The innovation described in the title and description is the geostatistical linkage of national demographic and health survey data. This approach combines data from Service Provision Assessment (SPA) surveys and Demographic and Health Surveys (DHS) to provide information on the access, utilization, and equity of health service availability in low- and middle-income countries. The geostatistical approach accounts for limitations in the survey datasets and uses contextual information and spatial structure to infer the service environment of unsampled health facilities. This approach has been shown to improve prediction accuracy and classification error compared to existing methods. It has the potential to be used in various public health research applications where facility and population-based data need to be combined at a fine spatial scale.
AI Innovations Description
The recommendation described in the provided information is to use a geostatistical linkage approach to combine the Service Provision Assessment (SPA) surveys with the nationally representative household survey, the Demographic and Health Survey (DHS), in order to improve access to maternal health. This approach addresses the limitations of existing linkage methods by linking the two datasets at a disaggregated level. It takes into account the contextual information around health facilities and their spatial structure to make inferences on the service environment of unsampled health facilities. The geostatistical linkage method has been shown to exceed the performance of existing methods in terms of prediction accuracy and classification error. It is also more robust with respect to the displacement of DHS clusters. This approach has the potential to be used in various public health research applications where facility and population-based data need to be combined at a fine spatial scale.
AI Innovations Methodology
Based on the provided information, the methodology used to simulate the impact of recommendations on improving access to maternal health is a geostatistical approach. Here is a brief description of the methodology:

1. Data Collection: Data from the Service Provision Assessment (SPA) survey and the Demographic and Health Survey (DHS) in Tanzania were collected. The SPA survey provides information on the availability and readiness of health services, while the DHS survey provides demographic and health-related data.

2. Linkage of Data Sets: The geostatistical approach involves linking the SPA and DHS data sets at a disaggregated level. This is done by identifying the sites where health facilities might be present but not included in the SPA survey. The approach takes into account the contextual information around facilities and their spatial structure to infer the service environment of unsampled health facilities.

3. Validation of Linkage Results: The geostatistical linkage results on the availability of health services are validated using two criteria: prediction accuracy and classification error. The performance of the geostatistical approach is compared to existing methods in terms of these criteria.

4. Estimation of Service Readiness: Service readiness is estimated using the World Health Organization (WHO) standards outlined in the Service Availability and Readiness Assessment Reference Manual. A score summarizing service readiness is created based on various items related to operational capacity. The geostatistical approach takes into account location-specific conditions and spatial correlation among the service readiness scores of health facilities.

5. Estimation of Spatial Structure: The spatial variation of service readiness scores is estimated using variogram analysis. The analysis characterizes the second-order stationarity of the regionalized variable and models the spatial variation based on the distance between health facilities.

6. Spatial Prediction: Two kriging algorithms, simple kriging with varying local mean and kriging with an external drift, are used to predict service readiness scores at unsampled locations. The algorithms represent the spatial variation of service readiness scores via stochastic surfaces.

7. Statistical Calibration: To avoid spurious results and rectify systematic bias, a stochastic calibration is performed using a subset of the census data. A linear model is developed to minimize the gap between model predictions and reference values.

8. Performance Evaluation: The performance of the geostatistical linkage methods is evaluated using the census of health facilities. Prediction error and classification error are quantified to assess the accuracy of the predictions.

9. Simulation of Impact: The impact of recommendations on improving access to maternal health is simulated by estimating the general service readiness scores at simulated DHS clusters. The variability of the scores due to displacement is quantified, and the results are compared to traditional methods.

Overall, the geostatistical approach combines data from different sources, accounts for spatial variation, and provides estimates of service readiness at fine spatial scales. This methodology can be used to assess the impact of recommendations on improving access to maternal health in Tanzania and potentially in other public health research applications where facility and population-based data need to be combined at a fine spatial scale.

Partilhar isto:
Facebook
Twitter
LinkedIn
WhatsApp
Email