Unlike water and sanitation infrastructures or socio-economic indicators, landscape features are seldomly considered as predictors of diarrhoea. In contexts of rapid urbanisation and changes in the physical environment, urban planners and public health managers could benefit from a deeper understanding of the relationship between landscape patterns and health outcomes. We conducted an ecological analysis based on a large ensemble of open-access data to identify specific landscape features associated with diarrhoea. Designed as a proof-of-concept study, our research focused on Côte d’Ivoire. This analysis aimed to (i) build a framework strictly based on open-access data and open-source software to investigate diarrhoea risk factors originating from the physical environment and (ii) understand whether different types and forms of urban settlements are associated with different prevalence rates of diarrhoea. We advanced landscape patterns as variables of exposure and tested their association with the prevalence of diarrhoea among children under the age of five years through multiple regression models. A specific urban landscape pattern was significantly associated with diarrhoea. We conclude that, while the improvement of water, sanitation, and hygiene infrastructures is crucial to prevent diarrhoeal diseases, the health benefits of such improvements may be hampered if the overall physical environment remains precarious.
This cross-sectional, ecological study combined open-access data on health, land cover, and basic infrastructures, readily obtained from different sources (Table 1). Geospatial data on the occurrence of diarrhoea among children under the age of five years can be obtained from the Demographic and Health Surveys (DHS) programme in vector format [28]. The DHS provides anonymised survey data at the individual and household levels that include cases of diarrhoea that had occurred in the 2 weeks preceding their survey. The DHS also provides data on access to WASH facilities and education, which were relevant to the current analysis. DHS data can be georeferenced by linking the observations to the point locations of survey clusters. To ensure anonymity, these locations do not correspond to the precise locations of participating households, but rather to the mean location of households belonging to a same cluster—i.e., for every cluster of surveyed households, there is one single GPS position that is attributed to all households belonging to the same cluster. For the datasets used in this analysis [29,30,31,32], there are 351 clusters distributed across Côte d’Ivoire, for a total of 9686 surveyed households. These clusters contain a median number of 27 households, or 134 people. Models of weather conditions can be obtained from Terraclimate in raster format, at a spatial resolution of ~4 × 4 km [33], and these data were included in our study because climatic conditions can be associated with diarrhoea [34]. Night illumination was used as a proxy for the presence of urban infrastructures, and can be obtained in raster format (500 × 500 m) from the National Aeronautics and Space Administration’s (NASA) Earth Observatory [35]. Land cover data can be obtained in raster format (300 × 300 m) from the European Space Agency’s Land Cover Climate Change Initiative Project (ESA Land Cover CCI) [36]. Population estimates at high spatial resolutions are provided by WorldPop [37], also in raster format (100 × 100 m). Finally, vector data on mobility infrastructures can be obtained from OpenStreetMap. List of open-access datasets used to explore relations between landscape and diarrhoea in Côte d’Ivoire. 1 shapefile. Combining health and environmental data is often a challenge, considering the differences in terms of the spatial and temporal resolutions of openly available data [38]. In this study, the areal units used to aggregate data were based on the geolocations of DHS clusters, which had the coarsest spatial resolution. All data processing steps were done in Python language, using the Jupyter computing environment. The following packages were used to process and visualise the data: PyLandStats 2.3.0 [39], rasterstats 0.15.0 [40], rasterio 1.2.9 [41], earthpy 0.9.2 [42], Fiona 1.8.20 [43], pandas 1.3.4 [44], geopandas 0.9.0 [45], numpy 1.20.3 [46], statsmodels 0.13.0 [47], pysal 2.5.0 [48], matplotlib 3.4.3 [49], and seaborn 0.11.2 [50]. The different data layers were harmonised spatially and temporally based on the buffer areas generated from DHS clusters, hereinafter called “spatial units”. Based on previous studies [51], we generated circular buffer zones originating from each cluster centroid, with radii based on the geographic blur determined by the DHS data anonymisation protocol: urban clusters had a buffer radius of 2 km, while rural clusters had a buffer radius of 5 km (Figure 1). The different data layers were aggregated at the cluster level: environmental data were aggregated based on the respective buffer areas, while DHS survey data were aggregated based on the clusters’ unique identifiers. In this way, for each spatial unit we obtained the following: (i) the prevalence of diarrhoea among children under the age of five; (ii) the proportion of the cluster’s population with access to at least “basic” water and sanitation, as defined by the World Health Organisation (WHO) and the United Nations Children’s Fund (UNICEF) Joint Monitoring Programme [25]; (iii) the proportion of the cluster’s female population who never attended school; (iv) local climatic conditions; (v) a list of landscape metrics derived from remotely sensed data (NASA and ESA Land Cover CCI), hybrid models (WorldPop), and volunteered geographic information (OpenStreetMap). Table 2 shows the variables obtained from this pre-processing in more detail. Buffer areas (red circles generated from DHS clusters’ centroids) were used as reference areas to calculate landscape metrics. Elaborated by the authors with QGIS, from: DHS [28] and ESA Land Cover CCI [36]. © ESA Climate Change Initiative—Land Cover led by UCLouvain (2017). List of variables calculated for each spatial unit (variables aggregated by buffer area). 1 As defined by the WHO and UNICEF Joint Monitoring Programme. 2 Aged between 15 and 49 years. 3 Applied to all land cover categories. Landscape metrics have been widely used to quantify and describe spatial patterns of “patches” of similar land cover categories [11,14,39], and thus are useful to analyse spatial patterns of urban settlements. These metrics were key features to our study, allowing us to relate the prevalence of diarrhoea (dependent variable) to indicators describing the form and composition of urban settlements (independent variables). We used the Python package PyLandStats [39] to calculate a series of landscape metrics for each land cover class contained within each spatial unit’s perimeter (Table 2), based on the data provided by ESA’s Land Cover Climate Change Initiative Project [36]. The metrics employed here—i.e., the proportion, edge and shape index of land cover patches—were based on previous studies that referred to land cover data to analyse spatial patterns of urban settlements [14] and environmental determinants of disease [18]. We added a level of detail to our landscape metrics by reclassifying patches originally categorised as “urban”, based on the levels of night illumination and demographic density (Figure 2 and Figure 3). Our study used the intensity of night illumination as a proxy for the presence of urban infrastructures—or, in other words, for the “quality” of urbanisation. Moreover, the quality of illumination affects the use of WASH amenities, especially by females [52], and thus may impact the risk of diarrhoea. We defined “precarious” and “regular” urban areas, building on the hypothesis that, if basic infrastructures are present, the level of illumination follows the level of demographic density. In this sense, urban pixels were considered “precarious” if they presented a high demographic density but a low (or relatively low) intensity of night illumination; on the other hand, “regular” urban patches had night illumination levels that matched the demographic density. This categorisation was done based on quantiles (Figure A1, Appendix A): for each urban pixel, if its density value was situated in a higher quantile than its night illumination value, it was considered “precarious”. Layers used to reclassify urban patches based on demographic density and night illumination (zoom on Abidjan’s metropolitan area). Elaborated by the authors with QGIS, from: NASA [35], ESA Land Cover CCI [36] and WorldPop [37]. © ESA Climate Change Initiative—Land Cover led by UCLouvain (2017). Reclassified urban patches (zoom on Abidjan’s metropolitan area). Elaborated by the authors with QGIS, from: NASA [35], ESA Land Cover CCI [36] and WorldPop [37]. © ESA Climate Change Initiative—Land Cover led by UCLouvain (2017). Another indicator of the quality of urbanisation was the presence of roads, for which we calculated two indicators: road availability (km of roads per ha of built-up area) and linearity (ratio between edge length and linear distance between the two vertices of the same edge). The latter also served as an indicator of the urban form. Finally, based on the literature, a selection of features that have been associated with diarrhoea were included as control variables. Basic water and sanitation services are key to prevent diarrhoea [2], and were therefore included. Access to these services was measured using the DHS household datasets [30]. Maternal education has been associated with a lower risk of diarrhoea [53]; at the same time, it has been related with reporting bias, as households with higher levels of maternal education have shown increased rates of reported child diarrhoea [54]. A proxy variable of maternal education (i.e., women’s educational attainment) was therefore added, based on individual DHS data [31]. Climatic conditions also have been associated with diarrhoea [34], but here they showed no significant correlation (see Table A3, Appendix D). Hence, these data were discarded (for more details, see the computer code section at the end of this article). To preserve some level of detail in the data, the variables resulting from the aggregation at the cluster level consisted of proportions (percentage) rather than simple means or medians. For example, demographic density in each spatial unit was given by the ratio between the number of built-up pixels classified as “dense” (with a value higher than the statistical series’ median) and the total number of built-up pixels contained in the respective spatial unit. By the end of the pre-processing, we obtained a geo-dataset with 351 spatial units. Each spatial unit had three control variables (water, sanitation, and women’s educational attainment) and a total of 90 independent variables, i.e., the landscape metrics described in Table 2 (the full list of variables is given in Table S1, see Supplementary Materials). We used four different regression models to assess the significance of the associations between the calculated landscape metrics and diarrhoea, while accounting for the effects of access to water, sanitation, and education, as well as spatial autocorrelation. The input data were standardised with a min/max scaler function, so that the effects of the different features could be compared through their coefficients. We started by running a feature selection algorithm to identify the most important variables to be included in the regression models. Given the high number of independent variables that were derived from the landscape metrics (n = 90), a preliminary feature selection was necessary to avoid multicollinearity and to clarify the scope of the analysis. The feature selection algorithm was composed of two “filters”. The first filter was a stepwise selection, which is a process that adds variables from a predefined list to the model, one-by-one, rechecking at each step the importance of all previously included variables [55]. In other words, the stepwise process combines forward and backward feature selection processes, consisting of iterative linear regressions allowing to identify the “best” features based on predefined significance thresholds (maximum p-value was set to 0.1) and model performance (residual sum of squares). The second and final filter was based on Spearman’s rank correlations (ρ): we used the Python package statsmodels [47] to calculate bivariate correlations between each feature selected with the stepwise method and the dependent variable (i.e., prevalence of diarrhoea at cluster level); only features with a p-value smaller than 0.1 were kept. We purposively opted for relative high thresholds of p-values because of the exploratory nature of this study. Once we concluded the feature selection, we ran both weighted and unweighted regression models. First, we built an unweighted ordinary least squares (OLS) model containing the dependent, control, and independent variables, as well as a constant. Then, we built a weighted model with the same features using the cluster weights given by the DHS. In fact, when conducting country-level analyses, the DHS suggests using cluster weights to adjust for eventual biases resulting from their sampling method. Given the infectious nature of diarrhoeal diseases, the analysis also needed to account for spatial dependence [56]. To this end, we used the Python package pysal [48] to run two models of spatial regression with the same features, spatial lag and spatial error, as explained in Section 2.4. Spatial dependence, or spatial autocorrelation, is the phenomenon by which values of observations are associated with each other through geographic distance (e.g., high values close to other high values) [57]. Accounting for spatial dependence is essential because linear regressions assume a normal, random distribution of error terms and the absence of spatial autocorrelation in the dependent variable. We estimated the probability of spatial autocorrelation in the dependent variable by calculating the global Moran’s I, which indicated whether the observed values of prevalence of diarrhoea were clustered, or randomly distributed, in space. As for the error terms in the OLS regression, we detected the probability of spatial dependence through the Lagrange multiplier test for spatial error. Contrary to an OLS model, spatial regressions can account for the spatial autocorrelation of the dependent variable (spatial lag dependence) and of the error term (spatial error dependence) [58]. The spatial lag model used in this study incorporates the spatial autocorrelation of the dependent variable by introducing the average values of neighbours as an additional variable into the regression specification (Equation (1)): where y is an N × 1 vector of observations on a dependent variable taken at each of N locations, α is the intercept, ρ is a scalar spatial lag parameter, W is an N × N matrix of weights indicating the spatial framework of neighbourhood effects among the dependent variable, X is an N × k matrix of explanatory variables, β is a k × 1 vector of parameters, and ε is an N × 1 vector of error terms. Similarly, the spatial error model used in this study also incorporates spatial autocorrelation by introducing the average values of neighbours as an additional variable into the regression specification, but this time using the values of the error terms (Equation (2)): where u is the vector of spatially autocorrelated residuals with constant variance and covariance terms, specified by an N × N matrix of weights indicating the spatial framework of neighbourhood effects among the error terms (W) and a spatial error coefficient (λ). The unit of analysis was the buffer area generated from each DHS cluster’s centroid (spatial unit). Out of the 351 spatial units, 10 were excluded as they did not have valid geographic coordinates. Furthermore, because our analysis focused on human settlements, we opted to keep only those spatial units with at least 1 “urban” pixel (300 × 300 m). Hence, we excluded 74 spatial units where no human occupation was detected—including units with settlements not sufficiently large to be detected at the spatial resolution used here. In the end, 267 spatial units (out of 351) were included in our regression analyses. Details about the discarded units are given in Table A1 (Appendix B). To determine whether the size and proportion of urban areas affected the association between landscape features and diarrhoea, the processes described in Section 2.3 were stratified into two levels. First, we conducted the regression analyses with all the 267 spatial units that met our inclusion criteria. Then, we conducted the same analyses with an “urban” subset, which contained 105 spatial units. The criterion for a spatial unit to be classified as “urban” was to have a proportion of urbanised area (ratio between the surface of “urban” pixels and the spatial unit’s total area) that was above the average of the retained 267 spatial units. Figure 4 shows the location of the 267 spatial units included in the analysis, specified by subset. Location of DHS clusters, by subset. Elaborated by the authors with QGIS, from: DHS [28], GADM and OpenStreetMap.