Background: Two neonatal mortality prediction models, the Neonatal Essential Treatment Score (NETS), which uses treatments prescribed at admission, and the Score for Essential Neonatal Symptoms and Signs (SENSS), which uses basic clinical signs, were derived in high-mortality, low-resource settings to make use of data more likely to be available in these settings. In this study, we evaluate the predictive accuracy of these two models for all-cause in-hospital mortality.

Methods: We used retrospectively collected routine clinical data, recorded by duty clinicians at admission in 16 Kenyan hospitals, to externally validate and update the SENSS and NETS models, which were initially developed to predict in-hospital mortality from data from the largest Kenyan maternity hospital. Model performance was evaluated by assessing discrimination and calibration. Discrimination, the ability of the model to differentiate between those with and without the outcome, was measured using the c-statistic. Calibration, the agreement between predictions from the model and what was observed, was measured using the calibration intercept and slope (with values of 0 and 1 denoting perfect calibration).

Results: At initial external validation, the estimated mortality risks from the original SENSS and NETS models were markedly overestimated, with calibration intercepts of −0.703 (95% CI −0.738 to −0.669) and −1.109 (95% CI −1.148 to −1.069), and too extreme, with calibration slopes of 0.565 (95% CI 0.552 to 0.577) and 0.466 (95% CI 0.451 to 0.480), respectively. After model updating, the calibration of both models improved. The updated SENSS and NETS models had calibration intercepts of 0.311 (95% CI 0.282 to 0.350) and 0.032 (95% CI −0.002 to 0.066) and calibration slopes of 1.029 (95% CI 1.006 to 1.051) and 0.799 (95% CI 0.774 to 0.823), respectively, while showing good discrimination, with c-statistics of 0.834 (95% CI 0.829 to 0.839) and 0.775 (95% CI 0.768 to 0.782), respectively. The overall calibration performance of the updated SENSS and NETS models was better than that of existing neonatal in-hospital mortality prediction models externally validated for settings comparable to Kenya.

Conclusion: Few prediction models undergo rigorous external validation. We show how external validation using data from multiple locations enables model updating, improving performance and potential value. The improved models indicate that it is possible to predict in-hospital mortality using either treatments or signs and symptoms derived from routine neonatal data in low-resource hospital settings, which also makes their use possible for case-mix adjustment when comparing similar hospital settings.
The reporting of this study follows the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines, a set of recommendations for reporting studies that develop, validate, or update prediction models for prognostic purposes [17]. The Scientific and Ethics Review Unit of the Kenya Medical Research Institute (KEMRI) approved the collection of the de-identified data that provide the basis for this study as part of the Clinical Information Network (CIN). The CIN is run in partnership with the Ministry of Health (MoH) and participating hospitals. Individual consent for access to the de-identified patient data was not required.

The study used data on all patients admitted to the New-Born Units (NBUs) of 16 public hospitals representative of different malaria transmission zones in Kenya, purposefully selected in partnership with the MoH. As shown in Fig. 1, the hospitals clustered in the west of the map are in a moderate to high malaria transmission zone, while those clustered at the centre of the map are in moderate to low malaria transmission zones. These hospitals largely provide maternal care services to the immediately surrounding populations, including accepting referrals from smaller rural clinics. They were purposefully selected to have moderately sized NBUs, with an interquartile range of annual NBU inpatient admissions of 550 to 1640 (Fig. 1).

Fig. 1 Hospitals providing data for model derivation and external validation, represented by dots. The hospitals clustered in the west of the map are in a moderate to high malaria transmission zone, while those clustered at the centre of the map are in moderate to low malaria transmission zones.

De-identified patient-level data were obtained after being recorded by clinicians as part of routine care. This data collection system, linked to the CIN, includes data quality assurance procedures and is described in detail elsewhere [11, 18, 19]. In brief, structured paper newborn admission record (NAR) and NBU exit forms endorsed by the Kenyan MoH are the primary data sources for the CIN. CIN supports one data clerk in each hospital to abstract data from the paper hospital records each day for all patients after discharge, with the data entered directly into a non-proprietary Research Electronic Data Capture (REDCap) tool [20] with inbuilt range and validity checks. Data entry is guided by a standard operating procedure manual that forms the basis of the data clerks’ training, supported by automated error-checking systems. To ensure no record is missed, the research team benchmarks the admission numbers entered in the CIN database against the aggregate statistics submitted to the MoH. External data quality assurance is done by KEMRI research assistants who perform an on-site concordance check every 3 months, comparing 5% of randomly selected records, which they re-enter into REDCap, against the data clerks’ entries. The overall concordance of the external data quality audits has ranged between 87 and 92% over time, with feedback given to the data clerks and any challenges addressed for continuous improvement of data quality.

This study included neonates admitted to the NBUs between August 2016 and March 2020 from 16 hospitals representing different regions of the country, with 15 hospitals providing the external validation dataset (n = 53,909) (Fig. 1) and the 16th hospital’s dataset used for model derivation and temporal validation.
For objective 2, the data used for model updating (i.e. re-estimating all the original SENSS and NETS regression coefficients) consisted of the derivation stage dataset (April 2014 to December 2015: n = 5427), the temporal validation stage dataset (January 2016 to July 2016: n = 1627), and additional data collected from August 2016 to December 2020 (n = 8848), all from the same hospital (the 16th hospital). Model updating is typically required where there is an observed deterioration in model performance in a new population (e.g. during external validation) [21]. We provide explanations of the meaning and significance of the different datasets in Additional file 1: Table S1.

The outcome was all-cause in-hospital neonatal unit mortality. Outcome assessment was blind to predictor distribution, as the hospital data clerks were unaware of the study [9].

No new predictors were considered for the SENSS and NETS models’ external validation and updating; only those used in the derivation and temporal validation study were included [13, 21]. For the NETS model, the predictors were the use/non-use of supplementary oxygen, enteral feeds, intravenous fluids, first-line intravenous antibiotics (penicillin and gentamicin), and parenteral phenobarbital [10]. For the SENSS model, the predictors were the presence or absence of difficulty feeding, convulsions, indrawing, central cyanosis, and floppiness/inability to suck, as assessed at admission [10, 13]. The neonate’s birth weight by category (< 1 kg, 1.0 to < 1.5 kg, 1.5 to 4 kg) and sex were also included in both models. Weight was treated as a categorical rather than a continuous predictor, despite categorisation likely causing information loss, based on a priori clinical consensus [9, 10]. Detailed descriptions and arguments for the selection of these variables are provided in the derivation study [13] and in Additional file 1: Tables S2 and S3.

The proportion of predictor missingness is consistent with previous work in Kenyan hospitals [22]. Sample size guidance for external validation of prediction models suggests a minimum of 100 events and 100 non-events [23]. For the SENSS and NETS models, there were 7486/53,909 (13.89%) and 6482/45,090 (14.38%) events (deaths), respectively, and 46,358/53,909 (85.99%) and 38,576/45,090 (85.55%) non-events (survived), respectively. Based on an outcome prevalence of 508/5427 (9.36%) and 447/4840 (9.24%) for the SENSS and NETS derivation datasets, respectively, 10 predictor parameters, and R-squared values of 0.453 and 0.380, the required sample sizes for SENSS and NETS model updating, calculated using the pmsampsize library in R, were 323 and 341 patients with 31 and 32 deaths, respectively [24]. There were 7486 deaths (from 53,909 patients) and 6482 deaths (from 45,090 patients) observed for the SENSS and NETS models, respectively, which exceeds the required sample sizes [24, 25].

Predictor missingness in the SENSS external validation dataset (Additional file 1: Table S4) ranged from 1.19% (sex) to 14.63% (floppy/inability to suck). The derivation model assumed a missing at random (MAR) mechanism for the observed missingness and performed multiple imputation using the chained equations (MICE) approach [26]. Therefore, for external validation before updating, the same mechanism was assumed. As in the derivation study, mode of delivery, outborn status, Apgar score at 5 min, HIV exposure, and the outcome were used as auxiliary variables in the imputation process [13, 27].
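As an illustration of this sample size calculation, a minimal sketch using the pmsampsize R package is shown below with the SENSS values reported above; the argument names are assumptions that may differ between package versions, and the call is not the authors’ exact script.

```r
# Minimum sample size for (re)developing a logistic prediction model,
# using the values reported above for the SENSS model. This is a sketch:
# argument names are assumed and may differ between pmsampsize versions.
library(pmsampsize)

pmsampsize(
  type       = "b",        # binary outcome (logistic regression model)
  rsquared   = 0.453,      # anticipated R-squared from the derivation study
  parameters = 10,         # number of candidate predictor parameters
  prevalence = 508 / 5427  # outcome prevalence in the SENSS derivation data (9.36%)
)
```

The corresponding NETS calculation would use rsquared = 0.380 and prevalence = 447/4840.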
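A minimal sketch of the multiple imputation step under the MAR assumption is given below, assuming the mice R package; the data frame senss_data and its column names (the SENSS predictors, the auxiliary variables, and the outcome) are hypothetical placeholders rather than the study’s actual variable names.

```r
# Multiple imputation by chained equations (MICE) under a MAR assumption.
# senss_data and its column names are hypothetical placeholders for the
# SENSS predictors, the auxiliary variables (mode of delivery, outborn,
# Apgar score at 5 min, HIV exposure), and the outcome.
library(mice)

predictor_vars <- c("difficulty_feeding", "convulsions", "indrawing",
                    "central_cyanosis", "floppy_inability_to_suck",
                    "birth_weight_cat", "sex")
auxiliary_vars <- c("mode_of_delivery", "outborn", "apgar_5min",
                    "hiv_exposure", "died")

imp <- mice(
  data = senss_data[, c(predictor_vars, auxiliary_vars)],
  m    = 33,    # number of imputed datasets (mirroring the 33 used in this study)
  seed = 2016   # arbitrary seed for reproducibility
)

# Estimates obtained on each imputed dataset (e.g. via with(imp, ...)) can then
# be pooled across imputations using Rubin's rules with pool().
```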
Consistent with the NETS model derivation approach, 8819 (16.36%) observations in the external dataset with missing treatment sheets in the patient files were excluded, leaving 45,090 observations with 6482 (14.38%) in-hospital deaths. Multiple imputation was considered inappropriate, and therefore not done, for NETS where the entire treatment sheet was missing (i.e. no information on any of the treatment predictors was available), because such missing treatment data were judged to be systematically missing due to factors not reflected in the dataset. Therefore, all patients with no treatment sheets (8819/53,909 in the external dataset and 2238/8848 in the model updating dataset) and those missing data in any treatment variable in the resultant NETS dataset (9440/45,090 in the NETS external dataset and 941/6610 in the NETS model updating dataset) were dropped from the NETS model analyses (Additional file 1: Table S5). Consequently, NETS analyses were complete case analyses with respect to the treatment predictors, the patient’s sex, and birth weight.

The overall recommended process of predictive modelling is well articulated in the scientific literature [9, 14, 17, 24, 28]. To externally validate the performance of the original SENSS and NETS models, these models were applied to the CIN dataset from 15 hospitals (geographical external validation). For external validation before updating (i.e. objective one), the model coefficients obtained at the model derivation stage were applied to the external validation data [13]. The models and coefficients are presented in Table 1. The SENSS model was fit on each of the 33 imputed datasets (the number of imputations was based on 33% of observations missing at least one variable [29]), with parameter estimates combined using Rubin’s rule [30].

Table 1 Logistic regression models for NETS and SENSS from the derivation study. For each variable, the presence of the indicator takes a value of 1 and its absence a value of 0. The coefficients are summed to give the linear predictor, which is then converted to the predicted probability of in-hospital mortality [13]. ELBW extremely low birth weight, LBW low birth weight, LP linear predictor, NETS Neonatal Essential Treatment Score, SENSS Score for Essential Neonatal Symptoms and Signs, VLBW very low birth weight.

Model calibration was assessed both by plotting the predicted probability of in-hospital death against the observed proportion and by calculating the calibration slope and calibration-in-the-large [16]. Discrimination was assessed by the c-statistic (equivalent to the area under the receiver operating characteristic curve) [23, 28]. The confidence intervals for the c-statistic and the calibration slope and intercept were calculated through bootstrapping (i.e. iterative sampling with replacement). Additionally, to facilitate comparison of SENSS and NETS model performance with the Neonatal Mortality Rate (NMR)-2000 score findings [12], we also report the Brier score, which reflects combined model discrimination and calibration. These metrics are briefly described in Table 2 and explained in detail elsewhere [31].

Table 2 Measures for assessing model performance (definitions adapted from Riley et al. [31]).
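As a minimal sketch of how these performance measures can be computed on the external dataset, the code below assumes the pROC R package and two hypothetical vectors, p_hat (predicted in-hospital mortality probabilities from the original model) and y (observed deaths coded 0/1); the bootstrap confidence intervals and the pooling across imputed datasets are omitted for brevity.

```r
# External validation metrics (illustrative sketch).
# p_hat: predicted probabilities from the fixed, previously derived model;
# y: observed in-hospital deaths (0/1). Both are hypothetical objects.
library(pROC)

lp <- qlogis(p_hat)  # linear predictor (log-odds of the predicted probability)

# Calibration-in-the-large: intercept of a logistic model with the LP as offset
cal_intercept <- coef(glm(y ~ offset(lp), family = binomial))[1]

# Calibration slope: coefficient of the LP in a logistic recalibration model
cal_slope <- coef(glm(y ~ lp, family = binomial))["lp"]

# Discrimination: c-statistic (area under the receiver operating characteristic curve)
c_statistic <- as.numeric(roc(response = y, predictor = p_hat, quiet = TRUE)$auc)

# Overall accuracy: Brier score (mean squared difference between prediction and outcome)
brier_score <- mean((p_hat - y)^2)

c(intercept = unname(cal_intercept), slope = unname(cal_slope),
  c_statistic = c_statistic, brier = brier_score)
```

Confidence intervals would then be obtained by repeating these calculations over bootstrap resamples of the external validation dataset.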
For objective 2 (i.e. model updating), given that simple recalibration did not resolve the poor model performance (Additional file 2 [21, 32]), we refit the SENSS and NETS models and re-estimated the coefficients while applying regularisation (a technique for reducing model overfitting), using data from the 16th hospital (i.e. the models’ derivation study site). Model overfitting occurs when a model fits the training dataset too closely, making it unable to generalise well to new datasets. We used elastic net regularisation, which combines L1 regularisation (introducing sparsity by shrinking the coefficients of less important covariates towards zero) and L2 regularisation (minimising biased estimates due to highly correlated independent variables) [33]. Also, to minimise model overfitting arising from the selection of the elastic net tuning parameters, we applied tenfold internal cross-validation repeated 20 times [34]. Cross-validation is a re-sampling procedure in which the model development dataset is randomly split into a number of equally sized partitions (i.e. folds); one partition is left out during model fitting for use as the internal validation dataset, the model is built on the remaining portion of the development dataset, and predictive performance is evaluated on the left-out partition. This process is repeated, with each iteration using a different partition as the validation data source. It can also include optional extra iterations that repeat the random splitting of the development dataset, generating different folds [34]. The goal of cross-validation is to assess how accurately a predictive model might perform in practice given, for example, the different elastic net thresholds used during model fitting, thereby aiding the selection of optimal model hyperparameters such as the regularisation parameters [34].

The SENSS and NETS models were fit on data collected between August 2016 and December 2020 from the 16th hospital. The performance of the updated SENSS and NETS models was evaluated on data from the other 15 hospitals (Additional file 1: Table S6). All cases included in the NETS model are a subset of those included in the SENSS model, but with a treatment sheet present. Given that the models were developed independently of each other, this has no substantive implication for the interpretation of findings. We provide explanations of the meaning and significance of the different datasets in Additional file 1: Table S1.

To examine heterogeneity in model performance, we compared the updated models’ internal–external cross-validation performance: we omitted one hospital at a time to use as the validation dataset, built the model on the remaining hospitals, and evaluated the model’s discrimination and calibration performance on the hospital left out. We repeated this process with each iteration using a different hospital as the validation data source [35].
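The sketch below shows one way to implement the elastic net refit with repeated tenfold cross-validation described above, assuming the caret and glmnet R packages; x_update (the predictor matrix from the 16th hospital’s updating data) and y_update (a two-level outcome factor, e.g. "survived"/"died") are hypothetical objects, so this is an illustration of the technique rather than the authors’ pipeline.

```r
# Elastic net refit of the model with repeated tenfold cross-validation to
# select the regularisation hyperparameters (alpha and lambda).
# x_update / y_update are hypothetical: the predictor matrix and the outcome
# factor ("survived"/"died") from the model updating dataset (16th hospital).
library(caret)
library(glmnet)

ctrl <- trainControl(
  method          = "repeatedcv",
  number          = 10,              # tenfold cross-validation...
  repeats         = 20,              # ...repeated 20 times
  classProbs      = TRUE,
  summaryFunction = twoClassSummary
)

set.seed(2020)
enet_fit <- train(
  x          = x_update,
  y          = y_update,
  method     = "glmnet",             # elastic net: combines L1 and L2 penalties
  family     = "binomial",
  metric     = "ROC",                # choose hyperparameters by cross-validated AUC
  tuneLength = 10,                   # grid of candidate alpha/lambda values
  trControl  = ctrl
)

# Updated (shrunken) coefficients at the selected penalty
coef(enet_fit$finalModel, s = enet_fit$bestTune$lambda)
```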
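Finally, a rough sketch of the internal–external cross-validation loop is given below: it leaves one hospital out at a time, refits an elastic net model on the remaining hospitals, and evaluates discrimination and calibration on the held-out hospital. The data frame dat (with a hospital_id column, a binary died outcome, and the predictors) and the fixed alpha of 0.5 are illustrative assumptions, not the study’s actual objects or settings.

```r
# Internal-external cross-validation: leave one hospital out at a time, refit
# on the remaining hospitals, and evaluate on the held-out hospital.
# `dat` (columns: hospital_id, died, and the predictors) is hypothetical, and
# alpha is fixed at 0.5 here purely for brevity.
library(glmnet)
library(pROC)

iecv <- lapply(unique(dat$hospital_id), function(h) {
  train <- dat[dat$hospital_id != h, ]
  test  <- dat[dat$hospital_id == h, ]

  x_train <- model.matrix(died ~ . - hospital_id, data = train)[, -1]
  x_test  <- model.matrix(died ~ . - hospital_id, data = test)[, -1]

  # Elastic net logistic model; lambda chosen by tenfold CV within the
  # training hospitals
  fit   <- cv.glmnet(x_train, train$died, family = "binomial",
                     alpha = 0.5, nfolds = 10)
  p_hat <- as.numeric(predict(fit, newx = x_test, s = "lambda.min",
                              type = "response"))
  lp    <- qlogis(p_hat)

  data.frame(
    hospital      = h,
    c_statistic   = as.numeric(roc(test$died, p_hat, quiet = TRUE)$auc),
    cal_intercept = unname(coef(glm(test$died ~ offset(lp), family = binomial))[1]),
    cal_slope     = unname(coef(glm(test$died ~ lp, family = binomial))["lp"])
  )
})
do.call(rbind, iecv)  # one row of performance estimates per held-out hospital
```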