Using verbal autopsy to measure causes of death: The comparative performance of existing methods


Study Justification:
– The study aims to investigate the validity of different methods for assigning cause of death from verbal autopsy (VA) forms.
– The use of VA is important for monitoring progress with disease and injury reduction in populations.
– The reliability of existing VA methods is uncertain, limiting their application.
Study Highlights:
– Three automated diagnostic methods (Tariff, Simplified Symptom Pattern, and Random Forest) performed better than physician review in all age groups and study sites.
– Tariff method performed as well or better than other methods and should be widely applied in routine mortality surveillance systems with poor cause of death certification practices.
– Physician review of VA questionnaires is less accurate than automated methods in determining individual and population causes of death.
Study Recommendations:
– The Tariff method should be widely used in routine mortality surveillance systems.
– Further research and development of automated VA methods should be conducted to improve accuracy and reliability.
Key Role Players:
– Researchers and scientists in the field of mortality surveillance and cause of death determination.
– Public health officials and policymakers responsible for implementing and improving mortality surveillance systems.
Cost Items for Planning Recommendations:
– Research and development costs for improving automated VA methods.
– Training and capacity building costs for implementing the Tariff method in routine mortality surveillance systems.
– Costs for data collection, analysis, and reporting in mortality surveillance systems.
– Costs for monitoring and evaluation of the effectiveness of the implemented recommendations.

The strength of evidence for this abstract is 8 out of 10.
The evidence in the abstract is strong because it presents the results of a study comparing multiple methods for assigning cause of death from a verbal autopsy. The study includes a large sample size (12,535 cases) and assesses the performance of the methods across different age groups and study sites. The results show that three automated methods (Tariff, SSP, and RF) outperformed physician review in all age groups and for most causes of death. The abstract also provides specific performance measures (sensitivity, specificity, Kappa, CCC, and CSMF accuracy) for each method. To improve the evidence, the abstract could include more information about the study design, such as the sampling method and any potential limitations or biases.

Background: Monitoring progress with disease and injury reduction in many populations will require widespread use of verbal autopsy (VA). Multiple methods have been developed for assigning cause of death from a VA, but their application is restricted by uncertainty about their reliability.

Methods: We investigated the validity of five automated VA methods for assigning cause of death: InterVA-4, Random Forest (RF), Simplified Symptom Pattern (SSP), Tariff method (Tariff), and King-Lu (KL), in addition to physician review of VA forms (PCVA), based on 12,535 cases from diverse populations for which the true cause of death had been reliably established. For adults, children, neonates and stillbirths, performance was assessed separately for individuals using sensitivity, specificity, Kappa, and chance-corrected concordance (CCC), and for populations using cause-specific mortality fraction (CSMF) accuracy, with and without additional diagnostic information from prior contact with health services. A total of 500 train-test splits were used to ensure that results are robust to variation in the underlying cause of death distribution.

Results: Three automated diagnostic methods, Tariff, SSP, and RF, but not InterVA-4, performed better than physician review in all age groups, study sites, and for the majority of causes of death studied. For adults, CSMF accuracy ranged from 0.764 to 0.770, compared with 0.680 for PCVA and 0.625 for InterVA; CCC varied from 49.2% to 54.1%, compared with 42.2% for PCVA and 23.8% for InterVA. For children, CSMF accuracy was 0.783 for Tariff, 0.678 for PCVA, and 0.520 for InterVA; CCC was 52.5% for Tariff, 44.5% for PCVA, and 30.3% for InterVA. For neonates, CSMF accuracy was 0.817 for Tariff, 0.719 for PCVA, and 0.629 for InterVA; CCC varied from 47.3% to 50.3% for the three automated methods, 29.3% for PCVA, and 19.4% for InterVA. The method with the highest sensitivity for a specific cause varied by cause.

Conclusions: Physician review of verbal autopsy questionnaires is less accurate than automated methods in determining both individual and population causes of death. Overall, Tariff performs as well or better than other methods and should be widely applied in routine mortality surveillance systems with poor cause of death certification practices. © 2014 Murray et al.; licensee BioMed Central Ltd.

The design, implementation, and broad findings from the PHMRC Gold Standard Verbal Autopsy validation study are described elsewhere [13]. Briefly, the study collected VAs in six sites: Andhra Pradesh and Uttar Pradesh in India, Bohol in the Philippines, Mexico City in Mexico, and Dar es Salaam and Pemba Island in Tanzania. Gold standard (GS) clinical diagnostic criteria were specified by a committee of physicians for 53 adult, 27 child and 13 neonatal causes plus stillbirths prior to data collection. Deaths fulfilling the GS criteria were identified in each of the sites. It is important to note that the stringent diagnostic criteria used in this validation study differ from those of traditional validation studies, which frequently use physician judgment to certify deaths based on available clinical records. Even if independent clinicians are used to certify the cause of death, the diagnosis is subjective, non-standardized, and further limited by the biases of the individual clinician and the availability of diagnostic tests. Once the GS deaths meeting the criteria were identified, VA interviews were conducted with household members by interviewers who had no knowledge of the cause of death. Separate modules were used for adults, children and neonates [13]. The PHMRC instrument was based on the WHO-recommended VA instrument with some limited modifications [13]. At the end of the study, 12,535 verbal autopsies on deaths with GS diagnoses had been collected (7,846 adults, 2,064 children, 1,620 neonates and 1,005 stillbirths). This is seven fewer than previously published, owing to a final revision of the preliminary dataset. Additional revisions include recoding several items in the dataset, including the question 'Did decedent suffer from an injury?', whose affirmative response is now treated as an endorsement only if the injury occurred within thirty days of death.
Questions not directly related to cause of death, such as 'Was care sought outside the home?', are no longer used, in order to avoid potential bias when analyzing datasets from other populations. Additional files 1, 2 and 3: Tables S1a to S1c provide information on the number of GS deaths collected for adults, children and neonates by cause and by diagnostic level. The study protocol defined three levels of cause of death assignment based on the diagnostic documentation: Levels 1, 2A and 2B. Level 1 diagnoses are the highest level of diagnostic certainty possible for that condition, consisting of either an appropriate laboratory test or X-ray with positive findings, as well as medically observed and documented illness signs. Level 2A diagnoses are of moderate certainty, consisting of medically observed and documented illness signs. Level 2B was used rarely, in place of Level 2A, if medically observed and documented illness signs were not available but records nonetheless existed for treatment of a particular condition. Details of the clinical and diagnostic criteria for each cause have been published [13]. Of all GS deaths collected, 88% met Level 1 criteria, which we used for all primary analyses. Sensitivity analyses showed that the results do not differ when only Level 1 deaths are used rather than all deaths. Because of the small numbers of deaths collected for some causes, we were able to estimate causes of death and evaluate the methods for 34 causes for adults, 21 causes for children and 5 causes for neonates plus stillbirths [13]. The choice of the causes used in the study is elaborated elsewhere [13]. The number of neonatal causes evaluated was reduced from 10 to 5, excluding stillbirths, because of the use of combinations of causes that do not map to the International Classification of Diseases and Injuries (ICD).
Results from these analyses are presented based on the Global Burden of Disease (GBD) 2010 cause list, which divides causes of death into three broad groups: communicable, maternal, neonatal and nutritional disorders; non-communicable diseases; and injuries [22]. The VA data, consisting of both the interview and the open narrative, were sent to physicians at each data collection site who were trained to fill out standardized death certificates for each VA interview. Substantial efforts were taken to standardize PCVA across sites, including the use of standardized training material and the same trainers; these efforts are described in detail elsewhere [14]. In addition to the standard VA, we sent VAs excluding the open narrative and information on the recall of health care experience to a different set of physicians, to test the performance of PCVA in settings where decedents have had limited contact with health services. A well-known problem with the analysis of VAs is that performance may vary as a function of the true cause of death composition in the population studied. To avoid this limitation, as part of the PHMRC study, 500 train-test data analysis datasets were generated. Each train-test pair has a different true cause of death composition. Figure 1 illustrates how the validation data were used to generate each train-test pair. This procedure ensures that in each train-test dataset pair, the train set and test set contain no deaths in common. It further guarantees that there is no correlation between the CSMF composition in the train set and the test set. This is important because some automated methods can yield exaggerated performance when the test and train datasets have similar cause compositions [20,23].

Figure 1. Process of generating 500 test and train validation datasets: a flowchart illustrating how 500 populations with different cause distributions were simulated in order to validate the analytical models in 500 separate scenarios.

As noted, the process of separating the data into test and train datasets was repeated 500 times to eliminate the influence of cause composition on the results of our analysis. Each of the 500 test datasets has a different cause composition, and analysis of all 500 datasets yields a distribution of each performance metric, from which we can calculate overall metrics and their uncertainty intervals. By analyzing the performance of methods across multiple pairs of train-test datasets, we can ensure that conclusions about comparative performance are not biased by the particular cause composition of the test dataset. All methods except InterVA-4 have been compared using exactly the same train-test datasets and exactly the same cause lists; InterVA-4 yields cause assignments for a different list of causes than the list developed for the PHMRC study [21]. Since the publication of the study on the comparative performance of InterVA 3.2 [15], InterVA-4 [21] has been released. InterVA-4 includes a longer list of possible cause assignments than InterVA 3.2, including maternal and stillbirth causes, and we use InterVA-4 for comparison in this study. Because the cause list changed slightly between InterVA 3.2 and InterVA-4, the mapping of the PHMRC cause list to the InterVA-4 cause list has also been revised. This new cause mapping is described in Additional files 4, 5 and 6: Tables S2a to S2c. The new mapping requires a 'joint cause list,' which is shorter than the PHMRC cause list. A method will usually perform better on a shorter cause list than on a longer one, so performance for InterVA-4 may be exaggerated.
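The split-generation scheme described above can be sketched as follows. This is an illustrative reconstruction, not the study's code: it assumes the test-set cause composition is drawn from an uninformative (uniform) Dirichlet distribution, independently of the train set, and that the test pool is then resampled with replacement to match that draw; the function and parameter names are hypothetical.

```python
import random
from collections import defaultdict

def make_split(deaths, test_fraction=0.25, rng=None):
    """One train/test pair: the deaths are partitioned (no deaths in
    common), then the test set is resampled to a cause composition drawn
    independently of the train set, so train and test CSMFs are
    uncorrelated. `deaths` is a list of dicts with a 'cause' key."""
    rng = rng or random.Random()
    pool = deaths[:]
    rng.shuffle(pool)
    n_test = int(len(pool) * test_fraction)
    test_pool, train = pool[:n_test], pool[n_test:]

    by_cause = defaultdict(list)
    for d in test_pool:
        by_cause[d["cause"]].append(d)
    causes = list(by_cause)

    # Draw a random cause composition from a uniform Dirichlet
    # (normalized unit exponentials), independent of the train CSMFs.
    weights = [rng.expovariate(1.0) for _ in causes]
    total = sum(weights)
    csmf = [w / total for w in weights]

    # Resample test deaths with replacement to match that composition.
    test = []
    for cause, frac in zip(causes, csmf):
        k = round(frac * len(test_pool))
        test.extend(rng.choices(by_cause[cause], k=k))
    return train, test

# e.g. splits = [make_split(deaths, rng=random.Random(i)) for i in range(500)]
```

Repeating this 500 times with different seeds yields the 500 train-test pairs, each with its own test-set cause composition.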
The Tariff method has also been updated so that only tariffs that are statistically significant are used to generate a tariff score for a death. This revision, along with other slight modifications, is explained in detail in Additional file 7. RF and SSP use tariff scores as an input into their algorithms, so the revisions to Tariff slightly modify the performance of these automated methods as well. Performance of each method has been assessed in two dimensions: how well methods assign the cause of death correctly for individual deaths, and how accurately CSMFs are estimated for populations. For assessing the performance of each method at assigning the true cause of death for individual deaths from specific causes, we report sensitivity and specificity. Because these measures, particularly specificity, are a function of the underlying cause of death structure in the population, we report the median value of each metric and provide further detail across the 500 splits in additional tables. To summarize the performance of each method at assigning deaths to the correct cause across all causes, we report two overall measures: chance-corrected concordance (CCC) and Cohen's Kappa [20]. Both summary measures count how often a cause is correctly assigned and then adjust for how often this is expected on the basis of chance. Chance-corrected concordance for cause j (CCCj) is measured as:

CCCj = (TP / (TP + FN) − 1/N) / (1 − 1/N)

where TP is true positives, FN is false negatives, and N is the number of causes. TP plus FN equals the true number of deaths from cause j. For many purposes, it is more important to assess how well a VA method does at estimating CSMFs. For individual causes, we compare the true CSMF and the estimated CSMF by regressing the estimated CSMF on the true CSMF, and report the slope, intercept and root mean square error (RMSE) of this regression. If a method perfectly predicted the true CSMF, the slope would be 1.0 and the intercept and RMSE would both be zero.
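A minimal sketch of per-cause chance-corrected concordance, using the definition CCCj = (TP/(TP+FN) − 1/N)/(1 − 1/N) with TP, FN and N as defined in [20] (illustrative only, not the study's implementation):

```python
def ccc(true_causes, predicted_causes, cause):
    """Chance-corrected concordance for one cause:
    (TP/(TP+FN) - 1/N) / (1 - 1/N), where TP is true positives,
    FN is false negatives, and N is the number of causes."""
    n = len(set(true_causes))
    tp = sum(1 for t, p in zip(true_causes, predicted_causes)
             if t == cause and p == cause)
    fn = sum(1 for t, p in zip(true_causes, predicted_causes)
             if t == cause and p != cause)
    if tp + fn == 0:
        return None  # no deaths from this cause in the test set
    return (tp / (tp + fn) - 1 / n) / (1 - 1 / n)
```

Note that CCC is zero when the fraction of cause-j deaths correctly assigned equals what random assignment over N causes would achieve, and 1.0 when every cause-j death is correctly assigned.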
This concept is illustrated in Figure 2, which shows a comparison of estimated CSMFs and the true CSMF for one cause for one method. Each point represents the true and estimated CSMF for a cause from one of the 500 splits. The red line represents the circumstances where the true CSMF would equal the estimated CSMF. The blue line is the linear regression line fit to the observed relationship between the true and estimated CSMFs. As indicated, the slope will tend to be less than one when sensitivity is reduced; likewise, the intercept will tend to be non-zero when specificity is reduced. The exact impact, however, is also a function of the correlation structure of misclassification errors.

Figure 2. Estimated cause-specific mortality fraction (CSMF) versus true CSMF: a graphical example of the regression of the estimated on the true CSMF, here for the estimation of epilepsy using SSP without HCE. Each dot represents a single split, or simulated population, and SSP's estimate of the fraction of epilepsy deaths in the population compared to the true fraction. The red line represents a perfect estimate, while the blue line is the line of best fit for the data. HCE, health care experience; SSP, Simplified Symptom Pattern.

To summarize the overall performance of a method across causes, we compute the CSMF accuracy for each of the 500 test datasets, defined as [20]:

CSMF accuracy = 1 − Σj |CSMFj(true) − CSMFj(estimated)| / (2 (1 − min j CSMFj(true)))

As defined, CSMF accuracy will be 1 when the CSMF for every cause is predicted with no error, and zero when the summed errors across causes reach the maximum possible. To summarize the overall performance of a method in predicting CSMFs in a way that is robust to variation in the cause composition of the population, we report the median CSMF accuracy across the 500 splits. Performance was also assessed with and without household recall of health care experience (HCE), if any, prior to death.
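Both population-level summaries can be sketched as follows: an ordinary least-squares regression of estimated on true CSMF for one cause across splits, and CSMF accuracy defined as 1 − Σj|CSMFj,true − CSMFj,estimated| / (2(1 − minj CSMFj,true)) [20]. This is an illustrative reconstruction, not the study's code:

```python
import math

def csmf_regression(true_vals, est_vals):
    """OLS of estimated CSMF on true CSMF across splits: returns
    (slope, intercept, rmse). A perfect method gives (1.0, 0.0, 0.0)."""
    n = len(true_vals)
    mx = sum(true_vals) / n
    my = sum(est_vals) / n
    sxx = sum((x - mx) ** 2 for x in true_vals)
    sxy = sum((x - mx) * (y - my) for x, y in zip(true_vals, est_vals))
    slope = sxy / sxx
    intercept = my - slope * mx
    rmse = math.sqrt(sum((y - (intercept + slope * x)) ** 2
                         for x, y in zip(true_vals, est_vals)) / n)
    return slope, intercept, rmse

def csmf_accuracy(true_csmf, estimated_csmf):
    """CSMF accuracy = 1 - sum_j |true_j - est_j| / (2 * (1 - min_j true_j)).
    Equals 1 for a perfect prediction, 0 at the maximum possible error.
    Both arguments are dicts mapping cause -> fraction."""
    causes = set(true_csmf) | set(estimated_csmf)
    abs_error = sum(abs(true_csmf.get(j, 0.0) - estimated_csmf.get(j, 0.0))
                    for j in causes)
    return 1.0 - abs_error / (2.0 * (1.0 - min(true_csmf.values())))
```

The 2(1 − min CSMF) denominator is the largest total absolute error any method could make for that true composition, which is what pins the metric between 0 and 1.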
HCE includes information about the cause of death or other characteristics of the illness told to the family by health care professionals and transmitted in the open section of the instrument, evidence from medical records retained by the family, and the responses to questions specifically related to disease history, including all questions from section 1 of the Adult module, such as 'Did the deceased have any of the following: Cancer' [13]. The open text information was parsed and tokenized using the Text Mining Package in R version 2.14.0 [24]. The resulting information is a series of dichotomous variables indicating whether a certain word was included in the open text. By excluding information on the household experience of health care from the analysis, the applicability of various methods in populations with limited or no access to care may be approximated. However, it is possible that the process of contact with health services may also change responses to other items in the instrument. Performance of methods varies depending on the underlying CSMF composition in the test population; in other words, for a given CSMF composition one method may outperform another even if in most cases the reverse is true. To quantify this, we assess which method performs best on CCC and CSMF accuracy for each of the 500 test datasets (which have different cause compositions). We also compute which method has the smallest absolute CSMF error for each cause across the 500 splits. This provides an evaluation of how often the assessment of which method works best is a function of the true CSMF composition of the test data, and of which method performs best for a specific cause.
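The parse-and-tokenize step that turns the open narrative into dichotomous word indicators can be sketched in Python (the study used the Text Mining Package in R; this analogue is illustrative only, and the `min_count` frequency threshold is an assumption, not part of the study):

```python
import re

def tokenize_narratives(narratives, min_count=2):
    """Turn free-text narratives into dichotomous word indicators:
    one dict per narrative mapping word -> 0/1 presence flag.
    Words appearing in fewer than `min_count` narratives are dropped
    (an assumed pruning step, analogous to sparse-term removal)."""
    docs = [set(re.findall(r"[a-z]+", text.lower())) for text in narratives]
    vocab = sorted({w for doc in docs for w in doc
                    if sum(w in d for d in docs) >= min_count})
    return [{w: int(w in doc) for w in vocab} for doc in docs]
```

Each resulting indicator can then be treated like any other dichotomous VA item by the automated methods.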

Based on the provided information, one potential innovation to improve access to maternal health is the use of verbal autopsy methods for measuring causes of death. Verbal autopsy involves conducting interviews with household members to gather information about the circumstances surrounding a person’s death. This method can be used to determine the cause of death when medical records or other official documentation are not available.

In the study described, several automated verbal autopsy methods were compared to physician review of verbal autopsy forms. The automated methods, including Tariff, Simplified Symptom Pattern, and Random Forest, performed better than physician review in assigning cause of death for adults, children, and neonates. These methods showed higher accuracy in estimating cause-specific mortality fractions compared to physician review.

The findings suggest that automated verbal autopsy methods, such as Tariff, could be widely applied in routine mortality surveillance systems with poor cause of death certification practices. By using these methods, accurate information about the causes of maternal deaths can be obtained, which can help inform public health interventions and improve access to maternal health services.
AI Innovations Description
The recommendation to improve access to maternal health is to use verbal autopsy as a method to measure causes of death. Verbal autopsy involves conducting interviews with household members to gather information about the deceased individual’s symptoms and medical history. This information is then used to determine the cause of death.

Multiple methods have been developed for assigning cause of death from a verbal autopsy, including InterVA-4, Random Forest (RF), Simplified Symptom Pattern (SSP), Tariff method (Tariff), and King-Lu (KL). A study compared the performance of these methods to physician review of verbal autopsy forms (PCVA) based on a large dataset of 12,535 cases.

The study found that three automated diagnostic methods, Tariff, SSP, and RF, performed better than physician review in all age groups and for the majority of causes of death studied. For example, for adults, cause-specific mortality fraction (CSMF) accuracy ranged from 0.764 to 0.770 for the automated methods, compared to 0.680 for PCVA and 0.625 for InterVA. Chance-corrected concordance (CCC) varied from 49.2% to 54.1% for the automated methods, compared to 42.2% for PCVA and 23.8% for InterVA.

Based on these findings, the study recommends that the Tariff method be widely applied in routine mortality surveillance systems with poor cause of death certification practices. This method has been shown to perform as well as or better than other methods in accurately determining both individual and population causes of death.

By implementing verbal autopsy with the Tariff method, healthcare providers can improve access to maternal health by accurately identifying the causes of maternal deaths. This information can then be used to develop targeted interventions and policies to prevent future maternal deaths and improve overall maternal health outcomes.
AI Innovations Methodology
The study mentioned in the description focuses on the use of verbal autopsy (VA) methods to determine the cause of death in maternal health. The goal is to improve access to maternal health by accurately identifying the causes of death and implementing appropriate interventions. The study compares the performance of five automated VA methods (InterVA-4, Random Forest, Simplified Symptom Pattern, Tariff method, and King-Lu) with physician review of VA forms.

The study's methodology illustrates how the impact of these recommendations on access to maternal health could be assessed. The study collected verbal autopsies in six different sites and established gold standard clinical diagnostic criteria for various causes of death. The VA interviews were conducted with household members who had no knowledge of the cause of death. The data collected included information on the cause of death, demographic factors, and health care experience.

The methodology involved generating 500 train-test data analysis datasets to eliminate the influence of cause composition on the results. Each train-test pair had a different cause composition, ensuring that there was no correlation between the cause composition in the train set and the test set. This allowed for a robust evaluation of the performance of the VA methods across different cause compositions.

The performance of the VA methods was assessed based on their ability to assign cause of death correctly for individual deaths and accurately estimate cause-specific mortality fractions (CSMFs) for populations. Measures such as sensitivity, specificity, chance-corrected concordance (CCC), and Cohen’s Kappa were used to evaluate the performance of the methods at the individual level. CSMF accuracy was used to assess the performance at the population level.

The study found that three automated diagnostic methods (Tariff, Simplified Symptom Pattern, and Random Forest) performed better than physician review in all age groups and for the majority of causes of death studied. The Tariff method performed as well as or better than the other methods and was recommended for routine mortality surveillance systems with poor cause of death certification practices.

In summary, the methodology used in the study involved collecting verbal autopsies, establishing gold standard diagnostic criteria, and comparing the performance of different VA methods. The impact of these recommendations on improving access to maternal health was simulated by evaluating the performance of the methods across different cause compositions.
