Background: In low-income countries, studies demonstrate greater access to and utilization of maternal and neonatal health services, yet mortality rates remain high, with poor quality of care increasingly scrutinized as a potential point of failure in achieving expected goals. Comprehensive measures reflecting the multi-dimensional nature of quality of care could prove useful for quality improvement. However, existing tools often lack a systematic approach reflecting all aspects of quality considered relevant to maternal and newborn care. We aim to address this gap by illustrating the development of a composite index, using a step-wise approach, to evaluate the quality of maternal obstetric and neonatal healthcare in low-income countries. Methods: The following steps were employed in creating the composite index: 1) developing a theoretical framework; 2) metric selection; 3) imputation of missing data; 4) initial data analysis; 5) normalization; 6) weighting and aggregating; 7) uncertainty and sensitivity analysis of the resulting composite score; and 8) deconstruction of the index into its components. Based on this approach, we developed a base composite index and tested alternatives by altering the decisions taken at different stages of the construction process to account for missing values, normalization, and aggregation. The resulting single composite scores, representing overall maternal obstetric and neonatal healthcare quality, were used to create facility rankings and further disaggregated into sub-composites of quality of care. Results: The resulting composite scores varied considerably in absolute values and ranges depending on method choice. However, the coefficients produced by Spearman rank correlations comparing facility rankings across method choices showed a high degree of correlation. Differences in the method of aggregation produced the greatest variation in facility rankings compared to the base case.
Z-score standardization aligned most closely with the base case but limited comparability at disaggregated levels. Conclusions: This paper illustrates the development of a composite index reflecting the multi-dimensional nature of maternal obstetric and neonatal healthcare. We employ a step-wise process applicable to a wide range of obstetric quality of care assessment programs in low-income countries and adaptable to setting and context. In exploring alternative approaches, we highlight the decisions that most influence the interpretation of a given index.
To illustrate the development of this obstetric quality of care score, we used data from the baseline assessment of the Results Based Financing for Maternal and Newborn Health (RBF4MNH) program in Malawi [28]. This evaluation included a sample of 33 Emergency Obstetric Care (EmOC) facilities (five hospitals, 18 health centers) offering obstetric and newborn care services in four districts: Balaka, Dedza, Mchinji, and Ntcheu. Baseline data were collected in 2013, prior to the start of the implementation of RBF4MNH, using four different data collection tools: a facility inventory, structured patient-provider observations, a structured interview with health workers, and a structured exit interview with women who had recently delivered at the facility. All data were collected by trained research assistants. The facility inventory assessed the availability of equipment, essential medications, guidelines, emergency transportation, and human resources. The provider-patient sample consisted of a total of 82 direct observations of uncomplicated delivery cases and assessed birth attendants’ adherence to clinical guidelines during routine obstetric care. Interviews were conducted with a total of 81 midwives and midwifery nurses, assessing health worker satisfaction in the workplace and their experiences with supervision and training. The exit interview sample consisted of 204 women who delivered at these facilities; interviews assessed women’s experience of receiving obstetric care at the facility and their perceptions of the quality of care received. We employed the step-wise approach outlined in the Organisation for Economic Co-operation and Development (OECD) guidelines for composite index development [12]. Although developed for high-income countries, the identified standards are fully applicable to the context of LICs.
The OECD guidelines include the following steps, with slight modifications: 1) developing a theoretical framework; 2) metric selection; 3) imputation of missing data; 4) initial data analysis; 5) normalization; 6) weighting and aggregating of selected variables; 7) uncertainty and sensitivity analysis of the resulting composite score; and 8) deconstruction of the score into its components [12, 14]. Based on this approach, we developed a base composite index, resulting in a composite score for each facility, and tested alternatives by altering the decisions taken at different stages of the construction process [12, 14]. Table 1 provides an overview of the different approaches at each step, further illustrating the base and alternative index scenarios used to formulate the composite scores. [Table 1, “Steps in developing base case composite indicator with alternative methods”, not reproduced here; Alternative D refers to geometric aggregation of cell scores.] The conceptual framework, which provided the basis for choosing the single indicators contributing to the composite index, was slightly modified from a multidimensional matrix measuring quality of care first introduced by Maxwell [29] and later refined by Profit et al. [14] (see Table 2). We consider this matrix ideal for the purpose of measuring quality of care, as it incorporates two complementary approaches to measuring quality of care. This results in a quality matrix which sufficiently reflects the dynamic process of healthcare delivery [14, 31]. The matrix includes the six key dimensions of quality of care as initially outlined by the Institute of Medicine (IOM) [32] and subsequently adapted by the World Health Organization (WHO): effective, efficient, accessible, acceptable/patient-centered, equitable, and safe [30]. These are complemented by the three quality of care elements first described by Donabedian: structure, process, and outcome [33].
We felt that the definitions of the WHO dimensions corresponded best to the contextual environment of LICs, with the aspect of timeliness included under the WHO quality dimension of accessibility, which also considers that healthcare services need to occur in a setting equipped with adequate resources to meet the needs of the community. [Table 2: Conceptual framework; not reproduced here. Adapted from Profit J, Typpo KV, et al. Improving benchmarking by using an explicit framework for the development of composite indicators: an example using pediatric quality of care. Implement Sci. 2010;5(1):13 [14], and from WHO, editor. Quality of care: a process for making strategic choices in health systems. Geneva: WHO; 2006 [30].] Guided by this conceptual framework, the indicator selection process was based on a literature review focused on obstetric and neonatal care quality indicators. The starting point was the recent WHO publication “Standards for Improving Quality of Maternal and Newborn Care in Health Facilities” [34], whose set of quality of care indicators was identified through literature review, expert consultations, and a consensus-building Delphi process involving 116 maternal health experts in 46 countries. We further examined additional sources of maternal and neonatal quality of care indicators [4, 35–45] to identify indicators that had not been specified in the WHO document. Using these multiple sources in combination with the WHO document, we identified an initial set of indicators most relevant to obstetric and neonatal care quality. Starting from this selection, we reviewed the content and definition of each indicator, removing duplicated indicators and combining redundant ones (e.g. adequate supervision available vs. number of supervisory visits). We then mapped the resulting indicators by assigning them to the cells of the conceptual quality of care matrix (Table 2).
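To make this mapping concrete, a small sketch in Python: each indicator is keyed to exactly one (Donabedian element, WHO dimension) cell of the matrix. The indicator names below are hypothetical illustrations, not the study's final set.

```python
# Hypothetical mapping of indicators to matrix cells, keyed by
# (Donabedian element, WHO dimension); indicator names are illustrative only.
matrix_cells = {
    ("structure", "safe"): ["clean_water_available", "sterile_gloves_in_stock"],
    ("process", "effective"): ["partograph_used", "oxytocin_given_after_birth"],
    ("process", "patient_centered"): ["woman_informed_of_progress"],
    ("outcome", "safe"): ["maternal_death_reviewed"],
}

# Each indicator belongs to exactly one cell, so the reverse lookup is unique.
cell_of = {ind: cell for cell, inds in matrix_cells.items() for ind in inds}
print(cell_of["clean_water_available"])  # → ('structure', 'safe')
```

This reverse lookup is where the "clean water" assignment decision described above is recorded: moving the indicator between cells is a one-line change to the dictionary.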
Generally, there was little to no overlap in assigning indicators to single matrix cells. In situations where an indicator could be assigned to more than one cell, consensus between co-authors was sought for the most appropriate assignment, given both the dimension definition and the content suggested by the reviewed literature. For example, “availability of clean water” could conceptually fall under “accessible” to represent access to water or “safe” to highlight the importance of clean water. Ultimately, the indicator was assigned to the safe dimension to represent “sanitation and hygiene”. For the following steps, we transition from the literature to the existing data from Malawi described above. Generally, data quality in terms of completeness was high. Most missing values were due to certain data collection tools not being applied at certain facilities. As our aim was to develop a composite score incorporating information from each of the different data sources, we included only facilities where all four data collection tools were actually applied, resulting in a final sample of 26 of the 33 EmOC facilities. The vast majority of missing values occurred in variables stemming from direct observations, where observers were asked to enter “1” if they observed a certain task and “0” if they did not. Supervision and debriefings during data collection revealed that the latter tended to be an issue, with observers not aware of the implications of not entering zeros for non-observed behavior at the end of the observation. We are, therefore, highly confident that missing values on these variables actually reflect non-observation of behavior, and accordingly replaced them with “0”.
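A minimal sketch of this recoding step (the observation records and task names below are invented for illustration): missing entries in the direct-observation variables are replaced with 0 in the base case, or with 1 in the alternative described next.

```python
import numpy as np
import pandas as pd

# Hypothetical direct-observation data: one row per observed delivery case,
# columns are clinical tasks coded 1 = observed, NaN = not recorded.
obs = pd.DataFrame({
    "facility_id": [1, 1, 2],
    "hand_washing": [1.0, np.nan, 1.0],
    "partograph_used": [np.nan, 1.0, np.nan],
})
task_cols = ["hand_washing", "partograph_used"]

# Base case: missing entries reflect non-observed behavior, so recode to 0.
obs_base = obs.copy()
obs_base[task_cols] = obs_base[task_cols].fillna(0)

# Second alternative: recode missing entries to 1 to bound the other extreme.
obs_alt = obs.copy()
obs_alt[task_cols] = obs_alt[task_cols].fillna(1)
print(obs_base)
```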
We are further confident that the small remaining number of missing values can be assumed to be missing at random; these were replaced with the respective sample mode (or the sample mean for the one continuous variable with missing values). Given how the data were collected and the pattern of missing values, multiple imputation would not have been an appropriate alternative [46]. Instead, we searched for a proxy variable that would be a close substitute for the missing data in the original variable [47]. When it was not possible to identify an appropriate proxy variable, we used the mode for binary variables as an alternative for missing data in the direct observations. In addition, we provided a second alternative by coding the missing values in the direct observations as “1”, thus bounding the full range of possible outcomes (detailed information on missing data and the results of the alternative imputation methods is provided in Additional file 1). As the composite index was intended to be calculated at the facility level, we aggregated the data from the individual-level data collection tools to the facility level. We did this by averaging, for each variable and facility, across all individual-level observations (i.e. cases, interviews). This resulted in scores between 0 and 1. For simplicity, these proportions were then retransformed into binary variables using a 0.5 cut-off (i.e. “0” for less than 0.5, “1” for 0.5 or greater). The few continuous variables were simply averaged. This resulted in one observation for each variable and each facility, which was necessary to combine the data sets. In the following step, we matched the variables contained in the available datasets with the mapped matrix indicators. Once matched, we analyzed the variables contained in each matrix cell for internal consistency by correlating each variable pair within the cells.
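The individual-to-facility aggregation and the 0.5 cut-off can be sketched as follows, with invented exit-interview items standing in for the study variables:

```python
import pandas as pd

# Hypothetical individual-level records (e.g. exit interviews): one row per
# respondent, with binary quality items and the facility attended.
df = pd.DataFrame({
    "facility_id": [1, 1, 1, 2, 2],
    "treated_respectfully": [1, 1, 0, 0, 0],
    "privacy_provided":     [1, 0, 0, 1, 1],
})

# Average each item across all observations per facility (proportions 0-1)...
facility = df.groupby("facility_id").mean()

# ...then re-binarize with the 0.5 cut-off used in the base case:
# 1 if at least half of observations scored the item, else 0.
facility_bin = (facility >= 0.5).astype(int)
print(facility_bin)
```

The subsequent internal-consistency check then amounts to correlating the variable pairs assigned to each matrix cell, e.g. via `DataFrame.corr()` on the columns belonging to one cell.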
Variable pairs within a given cell with correlations > 0.7 were re-evaluated and merged into a single variable in cases where the variables measured approximately the same quality construct and the merge was consistent with the conceptual framework. Because aggregation requires a uniform scale, normalization of indicator values is needed when different units of scale exist [12, 14]. As the vast majority of variables were binary, in the base case we transformed the few remaining non-binary variables into binary form using cut-off values supported by standards reported in the literature. To define the number of skilled birth attendants per facility, a cut-off value of at least 3 was used, based on the literature and the requirements of the program [48]. For the other continuous variable, time from arrival to contact with the provider, we used the median time of 20 min as the cut-off value. The remaining variables were ordinal, with the median used as the cut-off value. For Alternative A (Table 1), we instead rescaled the few non-binary variables to a range of values between 0 and 1 (see below). To identify weights, we considered data-driven methods (e.g. principal component analysis) relatively inappropriate, given that our variables mainly represented measures of adherence to universally established quality of care standards [12]. Statistically derived weights might have assigned more importance to input or process measures that are readily measurable or easily achievable in the observed context but independent of the defined standards, making comparability across settings difficult. Therefore, we sought to identify weights using the expert ratings from the WHO Delphi study [34]. However, as these indicator ratings varied only minimally and thus did not sufficiently support a clear weighting pattern for the indicators identified by the matrix, we applied equal weights. Additional publications on quality of care indicator weights almost uniformly suggested the use of equal weights [12, 49, 50]. For the base case scenario, the indicators within each matrix cell were then aggregated using an additive approach: the values of the indicators within a cell were summed to a raw cell score, with the maximum attainable sum varying between cells depending on the total number of indicators in a given cell. In the respective Alternative B (Table 1), we used geometric instead of additive aggregation (see below). In a next step, we combined the cell scores into a single composite score. As the maximum cell scores (ranging from 6 to 19) differed depending on the number of indicators identified for each cell, we rescaled each cell score to a range from 0 to 1, except in Alternative C, where Z-score standardization was used instead [12]. Rescaling the cell scores ensured that each cell contributes equally to the overall composite score. The rescaled scores were subsequently aggregated using equal weights to obtain an overall composite score ranging from zero to twelve. In the respective Alternative D (Table 1), we replaced the additive aggregation of cell scores with geometric aggregation (see below). A number of uncertainties arise from decisions, such as the normalization and aggregation methods, taken at various steps, and these can influence the outcome of a composite score. We therefore recalculated the outcomes under theoretically equally valid but different decisions to evaluate whether they make a practically relevant difference [51]. Given the many steps and decisions taken in response to the underlying data, we further explored possible uncertainties introduced by not opting for an alternative approach at a given step [51]. To this end, we created a set of alternative composite indices that each differed from the base composite index in one decision step and compared them to it.
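A minimal numeric sketch of this two-stage base-case aggregation; the raw cell scores and per-cell maxima below are invented, with four cells instead of twelve for brevity:

```python
import numpy as np

# Invented raw cell scores for three facilities across four matrix cells;
# the maximum attainable score differs by cell (number of indicators in it).
raw = np.array([
    [5.0, 10.0, 3.0, 12.0],
    [6.0, 14.0, 1.0,  9.0],
    [2.0,  7.0, 4.0, 15.0],
])
cell_max = np.array([6.0, 19.0, 6.0, 16.0])

# Rescale each cell score to 0-1 so every cell contributes equally,
# then aggregate additively with equal weights across cells.
rescaled = raw / cell_max
composite = rescaled.sum(axis=1)  # ranges from 0 to the number of cells
print(composite.round(2))  # → [2.61 2.47 2.31]
```

With twelve cells, as in the study, the composite accordingly ranges from zero to twelve.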
The four alternative approaches are as follows (see also Table 1):
A) Instead of transforming non-binary variables (ordinal or continuous) into binary form, we rescaled them to a range between 0 and 1. This alternative could increase distortion by extreme values, but at the same time widens the contribution of variables with a narrow range of values across the sample, thus better reflecting the actually measured information and the underlying variance of these variables [12].
B) Geometric aggregation (i.e. multiplying indicator values) of indicators to obtain a cell score, rather than arithmetic aggregation. With this alternative, “0” values in single indicators can no longer be compensated by the remaining indicators, which has a larger effect on the outcome in the case of binary measurements.
C) Standardization using Z-scores in each cell to achieve normalization, converting cell score values to a normally distributed scale with a mean of 0 and a standard deviation of 1. Under standardization, cells with extreme values have a greater effect on the resulting composite score [12].
D) Geometric aggregation to combine cell scores into a composite score, decreasing the extent to which low cell score values are compensated by high ones.
Our sensitivity analysis consisted of a descriptive comparison of the scores and of the ranks assigned to each studied facility using the base and alternative scores. Robustness of the facility ranking under the base index compared to the alternative indices was determined using Spearman rank correlation. We deconstructed the base and alternative composite scores by evaluating each cell within the matrix, comparing sample means and confidence intervals (95% confidence intervals, +/− 2 standard deviations) between base and alternative scores. Furthermore, we applied the same methods to evaluate the elements of structure, process, and outcome.
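The sensitivity comparison for Alternative D can be sketched as follows; the facility and cell counts follow the text, but the cell scores are simulated and the rank correlation is computed assuming no ties:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated rescaled cell scores for 26 facilities across 12 matrix cells.
cells = rng.uniform(0.1, 1.0, size=(26, 12))

# Base case: additive aggregation of cell scores.
base = cells.sum(axis=1)
# Alternative D: geometric aggregation, where a low cell score can no
# longer be compensated by high scores in other cells.
alt_d = cells.prod(axis=1) ** (1.0 / cells.shape[1])

def spearman_rho(x, y):
    """Spearman rank correlation for untied data: Pearson r of the ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

rho = spearman_rho(base, alt_d)  # high rho = robust facility ranking
print(round(rho, 2))
```

A rho close to 1 indicates that, despite different absolute scores, the alternative method leaves the facility ranking largely intact, which is the robustness criterion used in the text.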