Background: Sustainable Development Goal (SDG) 4 aims to ensure inclusive and equitable access to education for all by 2030, leaving no one behind. One indicator selected to measure progress towards this goal is the participation rate of youth in education (SDG 4.3.1). Here we aim to understand the drivers of school attendance, using one country in East Africa as an example.

Methods: Nationally representative household survey data (2015–16 Tanzania Demographic and Health Survey) were used to explore individual, household and contextual factors associated with secondary school attendance in Tanzania. These included age, gender, the household head's level of education, the household wealth index and the total number of children under five. Contextual factors such as the average pupil to qualified teacher ratio and geographic access to school were also tested at cluster level. A two-level random intercept logistic regression model was used to explore the association of these factors with attendance in a multilevel framework.

Results: The age of the household head, the educational attainment of the household head or a parent, and child characteristics such as gender were important predictors of secondary school attendance. Living in a richer household and having fewer siblings under the age of 5 were associated with increased odds of attendance (OR = 0.91, 95% CI: 0.86; 0.96). Contextual factors were less likely to be associated with secondary school attendance.

Conclusions: Individual and household level factors are likely to affect secondary school attendance rates more than contextual factors, suggesting that interventions should focus more on these levels. Future studies should explore the impact of interventions targeting these levels. Policies should ideally promote gender equality in access to secondary school and support families where the dependency ratio is high. Strategies to reduce poverty will also increase the likelihood of attending school.
We used the Demographic and Health Surveys (DHS) data for Tanzania (2015–16 DHS, n = 595 clusters) [24, 42], downloaded from the MEASURE DHS website, and extracted a list of key individual-level (age and sex of the child, educational attainment of the head of the household and parents) and household-level (household wealth index, number of children under the age of 5 in the household) characteristics that could affect secondary school attendance. The DHS program collects nationally representative household surveys in over 90 countries, which provide indicators on a wide range of topics including population, wealth, maternal and child health, fertility and family planning, nutrition, and education. The DHS uses a two-stage (or sometimes three-stage) stratified sampling design with censuses as sampling frames. During the first stage of selection, enumeration areas (EAs), also known as clusters, are selected with probability proportional to size (EA size). During the second stage, households are usually sampled from a complete household listing in the selected EAs using systematic sampling. Specific details on the sampling procedures can be found in the DHS final report [24] and the DHS Sampling Manual [42]. Clusters (also used interchangeably with primary sampling units (PSUs) and enumeration areas (EAs)) are defined as a group of households in the same area, or a block in urban areas, selected for interview within the complex survey design used by the DHS.

The adjusted net attendance rate used as the outcome variable in this study was defined as the total number of students of the official secondary school age group attending primary, secondary or higher education in a reference academic year, following indications from UIS UNESCO and the DHS methods [43, 44].
It therefore included children of official school age who accessed school earlier or later than the normal enrolment age and was expressed as a percentage of the corresponding population [45], giving a more precise picture of participation in school. The designated age range for secondary school in Tanzania is 14 to 19 years. The numerator was the de facto population of secondary school age attending secondary school (or primary, secondary or higher education in the case of the adjusted rate), while the denominator was the total de facto population of secondary school age. The age at the start of the academic year was used to determine the eligible secondary school age population used in the numerators and denominators of the net attendance rate [44]. To establish these age ranges, full information on the date of birth of each child was triangulated with the start of the academic year, to account for the temporal gap between the interviews and the start of the academic year. The out-of-school rate for secondary school was calculated by subtracting the adjusted net attendance rate for secondary education from 100%.

Alongside individual and household level factors, contextual level factors such as travel time to the nearest secondary school (a proxy for access to school) and the pupil to qualified teacher ratio (PQTR, a proxy for the school service offered and its quality) [46] were constructed. Travel time to the nearest secondary school was extracted from gridded estimates as the average travel time at each cluster and used as a contextual variable. The methodology for computing travel time has been documented in previous studies [47–50].
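The attendance rates defined above can be illustrated with a minimal sketch. It follows the definitions as stated in the text (age 14 to 19 at the start of the academic year; the adjusted rate counts attendance at primary, secondary or higher level). The `Child` record, its field names and the `weight` field are hypothetical and are not DHS variable names; the sketch does not reproduce the survey-design adjustments used in the study.

```python
from dataclasses import dataclass
from typing import Optional

# Official secondary school ages in Tanzania as stated in the text: 14 to 19 inclusive.
SECONDARY_AGES = range(14, 20)
# Levels counted in the adjusted rate, following the definition given in the text.
ADJUSTED_LEVELS = {"primary", "secondary", "higher"}


@dataclass
class Child:
    """Hypothetical record for one de facto household member."""
    age_at_year_start: int          # completed years at the start of the academic year
    attending_level: Optional[str]  # "primary", "secondary", "higher" or None
    weight: float = 1.0             # sampling weight (assumed field, defaults to 1)


def attendance_rates(children):
    """Return net, adjusted net and out-of-school rates as percentages."""
    eligible = [c for c in children if c.age_at_year_start in SECONDARY_AGES]
    denominator = sum(c.weight for c in eligible)
    if denominator == 0:
        raise ValueError("no children of secondary school age")
    net = sum(c.weight for c in eligible if c.attending_level == "secondary")
    adjusted = sum(c.weight for c in eligible if c.attending_level in ADJUSTED_LEVELS)
    adjusted_rate = 100 * adjusted / denominator
    return {
        "net": 100 * net / denominator,
        "adjusted_net": adjusted_rate,
        "out_of_school": 100 - adjusted_rate,
    }


# Illustrative data only: four eligible children and one child who is too young.
sample = [
    Child(14, "secondary"),
    Child(15, "primary"),
    Child(16, None),
    Child(19, "higher"),
    Child(12, "primary"),
]
rates = attendance_rates(sample)
```

In this toy sample, three of the four eligible children attend some level of education, so the adjusted rate is higher than the net rate, which counts only the child attending secondary school.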
Firstly, school locations were triangulated with ancillary spatial data on elevation (DEM), obtained from the HydroSHEDS dataset [51], land cover, obtained from MERIS GlobCover [52], and road networks, assembled from OpenStreetMap (OSM) and other online resources such as the National Geospatial-Intelligence Agency (NGA) [53] and MapCruzin [54], using AccessMod version 5 software [55]. Secondly, a raster surface of travel times to the school locations, which included walking across land cover and motorised travel along major roads, was generated and used in the analysis. The focal statistics tool available in ArcGIS Spatial Analyst (ESRI ArcGIS 10.7) was employed to calculate mean travel times within a 2 km or 5 km buffer around each cluster, depending on whether the cluster was urban or rural, in order to take into account the urban/rural split in survey sampling and the respective 2 km and 5 km displacement of DHS clusters.

The PQTR is defined as the average number of pupils per qualified teacher at a given level of education, based on headcounts of both pupils and teachers [44], and it was also tested as a contextual variable. The PQTR indicates how many pupils there are per qualified teacher in a school, and therefore how much care and attention can be given to each individual pupil. A qualified teacher is one who has at least the minimum academic qualifications required for teaching his or her subjects at the relevant level in a given country [44]. Information on the PQTR was extracted from an online education database for Tanzania (www.africaopendata.org [46]). This included information on the number of children enrolled at each school and the number of teachers in every classroom of each secondary school. The higher the PQTR, the lower the relative access of pupils to qualified teachers: a high ratio suggests that each teacher is responsible for a large number of pupils.
Conversely, it is generally assumed that a low pupil-qualified teacher ratio signifies smaller classes, which enables the teacher to pay more attention to individual students [44]. A GIS inverse distance weighting (IDW) interpolation technique was used to create a continuous surface of PQTR in Tanzania. S1 Fig and S1 Text show the distribution of the PQTR in Tanzania for each school and the corresponding interpolated surface. PQTR values for each DHS cluster were extracted using focal statistics with a buffer of 2 km around urban clusters and 5 km around rural clusters. Information about children of secondary school age was linked to the PQTR quality indicator, and the cluster-level values were employed in the modelling framework as a contextual variable to understand factors associated with access to secondary school.

Based on data availability and on the theoretical relationship between determinants and school attendance discussed in previous studies [among others: 8, 21], a full list of individual, family, household and contextual characteristics was derived, and their association with the outcome variable was tested using bivariate analysis and a forward stepwise covariate selection process. The bivariate analysis was performed to identify demographic and socio-economic characteristics associated with the adjusted secondary school attendance ratio. This descriptive analysis tested for differences within groups using the F test and the t-test for equality of means (for continuous variables), adjusting for sample design, with a significance level of p<0.05. Data were analysed with Stata/SE 16.0 for Windows [56] and adjusted for the survey sampling design. Additionally, a forward stepwise covariate selection procedure with an alpha level of 0.05 was implemented to identify a parsimonious set of covariates, while collinearity between independent variables was explored using the variance inflation factor (VIF).
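As a minimal sketch of the VIF check, the example below covers only the special case of two predictors, where the VIF reduces to 1 / (1 - r²), with r the Pearson correlation between them. In general each covariate is regressed on all the others and the resulting R² is used in place of r². The function names and the toy data are hypothetical, and this is not the Stata routine used in the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r_squared = pearson_r(x, y) ** 2
    return 1.0 / (1.0 - r_squared)


def se_inflation(vif):
    """Factor by which a coefficient's standard error is inflated."""
    return math.sqrt(vif)


# Illustrative scores only (hypothetical wealth and education measures).
wealth = [1, 2, 3, 4, 5]
education = [2, 1, 4, 3, 5]
vif = vif_two_predictors(wealth, education)
```

Here the two toy variables have a correlation of 0.8, which gives a VIF of about 2.8.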
Collinearity was considered high for covariates with a VIF greater than 4, which indicates a twofold increase in the standard error of a regression coefficient in the presence of collinearity. In the case of two collinear variables (with high VIF), the variable with the highest R2 statistic when compared to the outcome variable was retained. Interaction terms between the age and the level of education of the household head were first tested outside the modelling stage and then assessed within it; they were not significant and were therefore not included in the final model.

A two-level (multilevel) random intercept logistic regression analysis for the probability of attending secondary school was conducted, with individuals nested within primary sampling units (clusters) [39, 40]. The notation for a two-level random intercept model for binary responses is as follows:

$$\operatorname{logit}(\pi_{ij}) = \log\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2j} + u_j$$

where $u_j \sim N(0, \sigma_u^2)$, and $\pi_{ij} = \Pr(y_{ij} = 1)$ is the probability of the event occurring for the $i$th level 1 unit in the $j$th level 2 unit; $\beta_0$ is the log-odds that $y = 1$ when $x = 0$ and $u = 0$; $\beta_1$ is the effect on the log-odds of a 1-unit increase in $x$ for individuals in the same group; $\beta_2$ is the analogous effect for a level two variable; $u_j$ is the effect of being in group $j$ on the log-odds that $y = 1$, also known as a level 2 residual; $\sigma_u^2$ is the level 2 (residual) variance, or the between-group variance in the log-odds that $y = 1$ after accounting for $x$; $x_{1ij}$ is a generic level one independent variable (measured on unit $i$ nested within level two unit $j$); and $x_{2j}$ is a level two independent variable.

The response variable "School attendance" was binary, with a value of 0 when the eligible child of secondary school age was not attending school and a value of 1 when the child was attending school. The analysis aimed at describing factors associated with children attending school. Level one comprised child, parent and household characteristics; level two was the cluster (community/contextual) level.
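To make the random intercept model concrete, the following sketch converts a linear predictor on the log-odds scale into a probability of attendance and a coefficient into an odds ratio. The coefficient on the level two variable is called `beta2` here as an assumption, and all numeric values are illustrative rather than estimates from the study.

```python
import math


def inv_logit(eta):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def attendance_probability(beta0, beta1, x1_ij, beta2, x2_j, u_j=0.0):
    """Probability that y = 1 for child i in cluster j.

    beta2 is the assumed name for the coefficient on the level two
    variable x2_j; u_j is the cluster-level random intercept.
    """
    eta = beta0 + beta1 * x1_ij + beta2 * x2_j + u_j
    return inv_logit(eta)


def odds_ratio(beta):
    """Odds ratio for a 1-unit increase in the covariate."""
    return math.exp(beta)


# Illustrative values only: the same child placed in an average cluster
# (u_j = 0) and in a cluster with a positive random intercept.
p_average = attendance_probability(-0.5, 0.3, 1.0, -0.1, 2.0, u_j=0.0)
p_better = attendance_probability(-0.5, 0.3, 1.0, -0.1, 2.0, u_j=0.8)
```

Two children with identical covariates can therefore have different predicted probabilities solely because their clusters differ in the random intercept.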
Due to the sample size, there was no rationale for having either parents or household as the second level in the model. Interaction terms were also explored outside the modelling framework and tested within the full multilevel models to assess their significance. Log-likelihood tests for goodness of fit were performed between a simple logistic regression and a null model with a random intercept at level two. Adding a random intercept at cluster level proved to be statistically significant, and random intercepts were therefore retained. Finally, to find the best possible full multilevel model, log-likelihood tests for goodness of fit were also performed by starting from the null model with random intercept and adding one independent variable at a time. For ease of interpretation, our results for the multilevel models are presented as odds ratios, obtained by exponentiating the log-odds, with 95% confidence intervals. Intraclass correlation coefficients (ICC) were calculated for the final model. The ICC measures the correlation between observations of children belonging to the same cluster (community), and it is defined as the variance between clusters divided by the total variance, where the total variance is the sum of the variance between groups and the variance within groups [41]. Finally, adjusted mean predictions for the fixed portion of the model were calculated after running the multilevel logistic model, to compute the probability of accessing secondary school for selected characteristics in the model, holding all the other independent variables in the model at their mean values.

University of Southampton number: 45660.
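The ICC definition given in the methods can be sketched for a random intercept logistic model. The sketch assumes the common latent-variable formulation, in which the within-cluster (level one) variance on the logit scale is fixed at π²/3; the text does not state which formulation was used, and the function name is hypothetical.

```python
import math

# Level one residual variance of the standard logistic distribution,
# used as the within-cluster variance under the latent-variable assumption.
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def icc_logistic(sigma_u2):
    """Between-cluster variance divided by total variance on the logit scale."""
    if sigma_u2 < 0:
        raise ValueError("variance must be non-negative")
    return sigma_u2 / (sigma_u2 + LOGISTIC_RESIDUAL_VARIANCE)


# Illustrative value only, not an estimate from the study.
example_icc = icc_logistic(0.5)
```

Under this assumption, an ICC of 0 means that clustering explains none of the variation in attendance, and larger between-cluster variances push the ICC towards 1.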