Can we design the next generation of digital health communication programs by leveraging the power of artificial intelligence to segment target audiences, bolster impact and deliver differentiated services? A machine learning analysis of survey data from rural India

Study Justification:
– The study aims to explore the use of artificial intelligence and machine learning to segment target audiences in digital health communication programs.
– The current programs provide the same content to large beneficiary populations, but segmenting populations into distinct clusters may improve reach and impact.
– The study focuses on rural India, where there are significant disparities in literacy, phone access, and health indicators.
Study Highlights:
– The study used survey data from pregnant women with access to a phone and their husbands in four districts of Madhya Pradesh, India.
– Through an iterative process of K-Means clustering and Lasso regression, the researchers identified three distinct clusters of couples based on their digital access and skills.
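– Differences in health outcomes between those exposed and not exposed to the direct to beneficiary (D2B) program ‘Kilkari’ were greatest in Cluster 2: 8% for use of reversible modern contraceptives, 7% for child immunization at 10 weeks, 3% for child immunization at 9 months, and 4% for timeliness of immunization at 10 weeks and 9 months.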
Study Recommendations:
– The findings suggest that segmenting populations into distinct clusters for differentiated program design and delivery may improve reach and impact.
– The study recommends leveraging artificial intelligence and machine learning to tailor digital health communication programs to the specific needs and characteristics of different clusters.
– Policymakers should consider implementing targeted interventions based on the identified clusters to improve reproductive, maternal, neonatal, and child health outcomes.
Key Role Players:
– Researchers and data analysts experienced in artificial intelligence, machine learning, and data analysis.
– Public health professionals with expertise in digital health communication and program design.
– Policy makers and government officials responsible for implementing and funding health programs.
– Community health workers and healthcare providers who can deliver the tailored digital health communication programs.
Cost Items for Planning Recommendations:
– Research and data analysis costs, including personnel, software, and hardware.
– Program design and development costs, including content creation, platform development, and user interface design.
– Training and capacity building costs for healthcare providers and community health workers.
– Monitoring and evaluation costs to assess the impact and effectiveness of the tailored digital health communication programs.
– Communication and dissemination costs to promote the programs and reach the target audience.
– Infrastructure costs, such as mobile network coverage and access to smartphones, to ensure the delivery of the programs.

The strength of evidence for this abstract is 8 out of 10.
The evidence in the abstract is rated 8 because it provides a clear description of the study objectives, methodology, and findings. The study used machine learning to segment populations of women and their husbands in rural India into distinct clusters based on their digital access and skills. Its findings suggest that segmenting populations into clusters may improve the reach and impact of a mobile health communication program. The abstract provides specific details about the study design, data collection, and analysis methods. However, it could be improved by including more information about the limitations of the study and the generalizability of the findings.

OBJECTIVES: Direct to beneficiary (D2B) mobile health communication programmes have been used to provide reproductive, maternal, neonatal and child health information to women and their families in a number of countries globally. Programmes to date have provided the same content, at the same frequency, using the same channel to large beneficiary populations. This manuscript presents a proof of concept approach that uses machine learning to segment populations of women with access to phones and their husbands into distinct clusters to support differential digital programme design and delivery.
SETTING: Data used in this study were drawn from a cross-sectional survey conducted in four districts of Madhya Pradesh, India.
PARTICIPANTS: Study participants included pregnant women with access to a phone (n=5095) and their husbands (n=3842).
RESULTS: We used an iterative process involving K-Means clustering and Lasso regression to segment couples into three distinct clusters. Cluster 1 (n=1408) tended to be poorer, less educated men and women, with low levels of digital access and skills. Cluster 2 (n=666) had a mid-level of digital access and skills among men but not women. Cluster 3 (n=1410) had high digital access and skill among men and moderate access and skills among women. Exposure to the D2B programme ‘Kilkari’ showed the greatest difference in Cluster 2, including an 8% difference in use of reversible modern contraceptives, 7% in child immunisation at 10 weeks, 3% in child immunisation at 9 months and 4% in the timeliness of immunisation at 10 weeks and 9 months.
CONCLUSIONS: Findings suggest that segmenting populations into distinct clusters for differentiated programme design and delivery may serve to improve reach and impact.
TRIAL REGISTRATION NUMBER: NCT03576157.

Kilkari is an outbound service that makes weekly, stage-based, prerecorded calls about reproductive, maternal, neonatal and child health (RMNCH) directly to families’ mobile phones, starting from the second trimester of pregnancy until the child is 1 year old. Kilkari comprises 90 min of RMNCH content sent via 72 once-weekly voice calls (average call duration: 1 min, 15 s). Approximately 18% of cumulative call content is on family planning; 13% on child immunisation; 13% on nutrition; 12% on infant feeding; 10% on pregnancy care; 7% on entitlements; 7% on diarrhoea; 7% on postnatal care; and the remainder on a range of topics including intrapartum care, water and sanitation, and early childhood development. BBC Media Action designed and piloted Kilkari in the Indian state of Bihar in 2012–2013, and then redesigned and scaled it in collaboration with the Ministry of Health and Family Welfare between 2015 and 2019. Evidence on the evaluation design and programme impact is reported elsewhere.20

Data used in this analysis were collected from four districts of the central Indian state of Madhya Pradesh as part of the impact evaluation of Kilkari described elsewhere.3 19 Madhya Pradesh (population 75 million) is home to an estimated 20% of India’s population and falls below national averages for most sociodemographic and health indicators.21 Wide differences by gender and between urban and rural areas persist for a wide range of indicators including literacy, phone access and health seeking behaviours.
Among men and women 15–49 years of age, 59% of women (78% urban and 51% rural) were literate as compared with 82% of men in 2015–2016.21 Among literate women, 23% had 10 or more years of schooling (44% urban and 14% rural).21 Despite near universal access to phones at a household level, only 19% of women in rural areas and 50% in urban areas had access to a phone that they themselves could use in 2015.21 Over half (52%) of pregnant women received the recommended four antenatal care (ANC) visits in urban areas as compared with only 30% in rural areas.21 Despite high rates of institutional delivery (94%) in urban areas, only 76% of women in rural areas reported delivering in a health facility in 2015.21 These disparities underscore the population heterogeneity within and across Madhya Pradesh.

The samples for this study were obtained through cross-sectional surveys administered between 2018 and 2020 to women (n=5095) with access to a mobile phone and their husbands (n=3842) in four districts of Madhya Pradesh.20 At the time of the first survey (2018–2019), the women were 4–7 months pregnant; the latter survey (2019–2020) reinterviewed the same women at 12 months post partum. Their husbands were interviewed only once, during the latter survey round. The surveys lasted 1.5 hours. In this analysis, modules on household assets and member characteristics, and on phone access and use, including observed digital skills (navigating interactive voice response (IVR) prompts, giving a missed call, storing contacts on a phone, opening SMS, reading SMS), were used to develop models.
Data on maternal and child health practices, including infant and young child feeding, family planning, pregnancy and postpartum care, were used to explore the differential impact of Kilkari across clusters but were not used in the development of clusters.20

Figure 1 presents a framework used for developing homogeneous clusters of men and women in four districts of rural Madhya Pradesh, India. Box 1 describes the steps undertaken at each point in the framework in detail. We started with data elements on phone access and use, as well as population sociodemographic characteristics, collected as part of a cross-sectional survey described elsewhere.3 22 Unsupervised learning was undertaken using K-Means clustering, and strong signals were identified. Strong signals were defined as variables that had a prevalence of at least 70% in one or more clusters and differed from another cluster by 50% or more. For example, 6% of men own a smartphone in Cluster 1, 88% in Cluster 2 and 75% in Cluster 3. Therefore, having a smartphone can be considered a strong signal. Additional details are summarised in box 1. Once the clusters were defined, we explored differences in healthcare practices between those exposed and not exposed to Kilkari within each cluster.

Data collected from special surveys like the couples’ dataset used here are relatively small in sample size but large in the number of data elements available. In such high-dimensional data, there are many irrelevant dimensions that can mask existing clusters in noisy data, making it more difficult to develop effective clustering methods.3 31 Several approaches have been proposed to address this problem.
They can be grouped into two categories: static or adaptive dimensionality reduction, including principal components analysis,32 33 and subspace clustering, which consists of selecting a small number of original dimensions (features) in some unsupervised way or using expert knowledge so that clusters become more obvious in the subspace.34 35 In this study, we combined subspace clustering using expert knowledge and adaptive dimensionality reduction (online supplemental figure 1) to find the subspace in which clusters are best separated and best defined. Therefore, as part of subspace clustering, we chose to start with couples’ survey data, including variables related to sociodemographic characteristics, phone ownership, use and literacy (online supplemental table 1). The emergent clusters overlapped, so we decided to use men’s survey data on phone access and use as a starting point.

Analyses started with a predefined set of data elements captured as part of the men’s cross-sectional survey, including sociodemographic characteristics and phone access and use. K-Means clustering was used to identify clusters and the elbow method was used to define the optimal number of clusters. Strong signals were then identified. Variables that had a prevalence of at least 70% in one or more clusters and differed from another cluster by 50% or more were considered to have a strong signal.

Once an initial model was developed from the predefined set of data from the men’s survey and strong signals were identified, we reviewed the available data from the combined dataset (data from the men’s survey and women’s survey). Signal strength was used as the outcome variable, or target, in a linear regression with L1 regularisation, also known as Lasso regression (Least Absolute Shrinkage and Selection Operator). Regularisation is a technique used in supervised learning to avoid overfitting. Lasso regression adds the absolute value of the magnitude of the coefficients as a penalty term to the loss function.
The loss function becomes:

$$\text{Loss} = \text{Error}(y, \hat{y}) + \alpha \sum_{i=1}^{N} |\omega_i|$$

where $\omega_i$ are the coefficients of the linear regression $y = \omega_1 x_1 + \omega_2 x_2 + \dots + \omega_N x_N + b$. Lasso regression works well for selecting features in very large datasets as it shrinks the coefficients of the less important features to 0.36 37 Merged women’s survey and men’s survey data were used as predictors for the regression, excluding variables related to health knowledge and practices. We ended up with a sample of 3484 rows and 1725 variables after data preprocessing. We then reran K-Means clustering with three clusters (K=3) using the important features selected by Lasso regression. This methodology was used to refine the clusters and subsequently identify new strong signals. After step 3, we repeated steps 2 and 3 iteratively until there was no further gain in strong signals.

Data preparation and results formatting were conducted in R V.4.1.1,38 and K-Means clustering was performed in Python V.3.8.5.39

Figure 1. Framework for segmentation analysis.

Patients were first engaged on identification in their households as part of a household listing carried out in mid/late 2018. Those meeting eligibility criteria were interviewed as part of the baseline survey, and ultimately randomised to the intervention and control arms. Prior to the administration of the baseline, a small number of patients were involved in the refinement of survey tools through qualitative interviews, including cognitive interviews, which were carried out to optimise survey questions, including the language and translation used. Finalised tools were administered to patients at baseline and endline, and for a subsample of the study population, additional interviews were carried out over the phone and via qualitative interviews between the baseline and endline surveys. Unfortunately, because of travel restrictions associated with COVID-19, findings were not disseminated back to community members.
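As an illustration of the strong-signal rule and the Lasso feature-selection step described in the methods above, here is a minimal sketch assuming Python with NumPy and scikit-learn. It is not the authors' code. The data are synthetic, and the helper `is_strong_signal`, the target `signal_strength` and the penalty value `alpha=0.1` are hypothetical choices: the source does not state how the target was constructed or how the penalty was tuned, and the 50% gap is read here as percentage points.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def is_strong_signal(prevalence_by_cluster, min_prevalence=0.70, min_gap=0.50):
    """Strong-signal rule (assumed reading): prevalence of at least 70% in
    one cluster and a gap of at least 50 percentage points to another."""
    highest = max(prevalence_by_cluster)
    lowest = min(prevalence_by_cluster)
    return highest >= min_prevalence and (highest - lowest) >= min_gap


# Smartphone ownership among men in Clusters 1-3, as quoted in the text.
smartphone_is_strong = is_strong_signal([0.06, 0.88, 0.75])
print("smartphone ownership is a strong signal:", smartphone_is_strong)

# Synthetic stand-in for the merged survey matrix
# (rows = couples, columns = variables); not the study data.
rng = np.random.default_rng(0)
n_rows, n_vars = 300, 40
X = rng.normal(size=(n_rows, n_vars))

# Hypothetical numeric target that depends on the first three columns only.
signal_strength = (
    2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2]
    + rng.normal(scale=0.1, size=n_rows)
)

# Standardise so the L1 penalty treats every variable on the same scale.
X_std = StandardScaler().fit_transform(X)

# alpha plays the role of the penalty weight in the Lasso loss; note that
# scikit-learn's Lasso scales the squared-error term by 1 / (2 * n_samples).
lasso = Lasso(alpha=0.1).fit(X_std, signal_strength)

# Variables whose coefficients were not shrunk to zero are kept.
selected = np.flatnonzero(lasso.coef_)
print("selected columns:", selected.tolist())

# Re-run K-Means with K=3 on the selected variables only.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_std[:, selected])
print("cluster sizes:", np.bincount(labels).tolist())
```

In a real analysis the penalty weight would normally be chosen by cross-validation rather than fixed, and the re-clustered groups would be checked again for new strong signals before the next iteration.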
As part of steps 1 and 3, K-Means algorithms were used (box 1). We chose the K-Means algorithm because of its simplicity and its speed in handling large datasets compared with hierarchical clustering.23 A K-Means algorithm is one method of cluster analysis designed to uncover natural groupings within a heterogeneous population by minimising the Euclidean distance between observations and their cluster centre.24 When using a K-Means algorithm, the first step is to choose the number of clusters K that will be generated. The algorithm starts by selecting K points randomly as the initial centres (also known as cluster means or centroids) and then assigns each observation to the nearest centre. Next, the algorithm computes the new mean value (centroid) of each cluster’s new set of observations. K-Means then repeats this process, reassigning observations to the nearest centre, until a new iteration no longer reassigns any observation to a different cluster (convergence).

Four metrics were used to validate the clustering: within-cluster sum of squares, silhouette index, Ray-Turi criterion and Calinski-Harabasz criterion. The elbow method was used to find the right K (number of clusters).25 Figure 2 is a chart showing the within-cluster sum of squares (or inertia) by the number of groups (k value) chosen for several executions of the algorithm. Inertia is a metric that shows how dissimilar the members of a group are. The less inertia there is, the more similarity there is within a cluster (compactness). The main purpose of clustering is not to achieve 100% compactness but rather to find a reasonable number of groups that satisfactorily explains a considerable part of the data (k=3 in this case).

Figure 2. Elbow method used to help decide the ultimate number of clusters appropriate for the data.

Silhouette analysis helped to evaluate the goodness of clustering (clustering validation) (figure 3). It can be used to study the separation distance between the resulting clusters.
The silhouette plot displays a measure of how close each point in one cluster is to points in the neighbouring clusters. This measure has a range of [−1, 1]. Silhouette coefficients near +1 indicate that the sample is far from the neighbouring clusters. A value of 0 indicates that the sample is very close to the decision boundary between two neighbouring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. Figure 3 shows that choosing three clusters performed better than four for the data from the available surveys for two reasons: (1) there were fewer points with negative silhouettes and (2) the cluster size (thickness) was more uniform for three groupings.

Other criteria used to evaluate the quality of clustering are obtained by combining the ‘within-cluster compactness index’ and the ‘between-cluster spacing index’.26 The Calinski-Harabasz criterion is given by

$$C(k) = \frac{\mathrm{Trace}(B)}{\mathrm{Trace}(W)} \times \frac{n-k}{k-1}$$

and the Ray-Turi criterion is given by

$$r(k) = \frac{\mathrm{distance}(W)}{\mathrm{distance}(B)}$$

where B is the between-cluster covariance matrix (so high values of B denote well-separated clusters), W is the within-cluster covariance matrix (so low values of W correspond to compact clusters), n is the number of observations and k is the number of clusters. Both criteria led to the same conclusion: three clusters were the best choice for the data we had. Online supplemental table 2 gives the different metrics used and the values obtained for various numbers of clusters.

Figure 3. Silhouette analysis for three and four clusters.
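The four validation metrics discussed above can be computed side by side for a range of k values. The following is a minimal sketch assuming scikit-learn, run on synthetic data with three well-separated groups rather than the survey data. The Ray-Turi value is hand-computed here as the mean squared distance to the assigned centroid divided by the smallest squared distance between two centroids, which is one common formulation and an assumption about what the authors used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data with three well-separated groups (not the survey data).
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # Ray-Turi (assumed formulation): within-cluster spread divided by the
    # smallest squared distance between two centroids; lower is better.
    centres = km.cluster_centers_
    sq_dist = ((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    min_between = sq_dist[np.triu_indices(k, k=1)].min()

    results[k] = {
        "inertia": km.inertia_,  # within-cluster sum of squares (elbow plot)
        "silhouette": silhouette_score(X, km.labels_),
        "calinski_harabasz": calinski_harabasz_score(X, km.labels_),
        "ray_turi": (km.inertia_ / len(X)) / min_between,
    }

for k, m in results.items():
    print(
        f"k={k}  inertia={m['inertia']:.1f}  silhouette={m['silhouette']:.3f}  "
        f"CH={m['calinski_harabasz']:.1f}  Ray-Turi={m['ray_turi']:.4f}"
    )

# Higher silhouette is better; lower Ray-Turi is better.
best_by_silhouette = max(results, key=lambda k: results[k]["silhouette"])
best_by_ray_turi = min(results, key=lambda k: results[k]["ray_turi"])
print("best k by silhouette:", best_by_silhouette)
print("best k by Ray-Turi:", best_by_ray_turi)
```

Inertia always tends to fall as k grows, which is why it is read through the elbow of the curve rather than by its minimum, whereas the other three metrics can be compared directly across k.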

AI Innovations Description
The recommendation proposed in the publication is to use machine learning and segmentation analysis to develop differentiated digital health communication programs for improving access to maternal health. The study conducted in rural India used K-Means clustering and Lasso regression to segment populations of pregnant women and their husbands into three distinct clusters based on their digital access and skills. The findings showed that exposure to the D2B program ‘Kilkari’ had a greater impact on Cluster 2, including increased use of reversible modern contraceptives, child immunization rates, and timeliness of immunization.

The innovation suggested is to leverage the power of artificial intelligence (AI) to design the next generation of digital health communication programs. By using machine learning algorithms, target audiences can be segmented into distinct clusters based on their specific needs and characteristics. This segmentation allows for the development of tailored and differentiated services that can effectively reach and impact different population groups. The use of AI can enhance the effectiveness and efficiency of maternal health interventions by delivering personalized and relevant information to individuals and their families.

Overall, the recommendation is to utilize machine learning and AI to develop innovative digital health communication programs that can improve access to maternal health services and information. This approach has the potential to address the heterogeneity within populations and deliver targeted interventions that meet the specific needs of different groups.
AI Innovations Methodology
To simulate the impact of the recommendations proposed in the abstract on improving access to maternal health, you can follow these steps:

1. Collect Data: Gather data on pregnant women and their husbands from rural areas, similar to the study conducted in rural India. This data should include information on digital access, skills, healthcare practices, and exposure to digital health communication programs.

2. Preprocess Data: Clean and preprocess the collected data to ensure its quality and consistency. This may involve handling missing values, standardizing variables, and transforming data if necessary.

3. Segment the Population: Use machine learning algorithms, such as K-Means clustering, to segment the population into distinct clusters based on their digital access and skills. This step will help identify different population groups with specific needs and characteristics.

4. Evaluate Impact: Analyze the impact of digital health communication programs, such as the ‘Kilkari’ program, on different clusters. Compare healthcare practices and outcomes between clusters that were exposed to the program and those that were not. Measure indicators such as the use of modern contraceptives, child immunization rates, and timeliness of immunization.

5. Assess Differential Impact: Determine the differential impact of the digital health communication program across clusters. Calculate the percentage difference in healthcare practices and outcomes between exposed and non-exposed groups within each cluster. This will help identify which clusters benefit the most from the program.

6. Compare Findings: Compare the findings from the simulation with the results reported in the publication. Assess whether the simulated impact aligns with the findings of the study conducted in rural India.

By following these steps, you can simulate the impact of using machine learning and segmentation analysis to develop differentiated digital health communication programs for improving access to maternal health. This simulation will provide insights into the potential effectiveness of these recommendations in different populations and help inform the design of future interventions.
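Steps 4 and 5 above can be sketched in a few lines of plain Python. This is a simple unadjusted comparison on made-up records, not the statistical model used in the evaluation; the function name `differential_impact` and the record layout are hypothetical, and the sketch assumes every cluster contains both exposed and non-exposed respondents.

```python
from collections import defaultdict

# Hypothetical records: (cluster, exposed_to_programme, practised_behaviour).
records = [
    (1, True, True), (1, True, False), (1, False, True), (1, False, False),
    (2, True, True), (2, True, True), (2, False, True), (2, False, False),
    (3, True, True), (3, True, True), (3, True, True), (3, True, False),
    (3, False, True), (3, False, True), (3, False, False), (3, False, False),
]


def differential_impact(rows):
    """Percentage-point difference in a practice between exposed and
    non-exposed respondents, computed separately within each cluster."""
    # For each cluster and exposure group, keep [practised, total] counts.
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for cluster, exposed, practised in rows:
        counts[cluster][exposed][0] += int(practised)
        counts[cluster][exposed][1] += 1

    impact = {}
    for cluster, by_exposure in sorted(counts.items()):
        rate_exposed = by_exposure[True][0] / by_exposure[True][1]
        rate_unexposed = by_exposure[False][0] / by_exposure[False][1]
        impact[cluster] = round(100 * (rate_exposed - rate_unexposed), 1)
    return impact


impact = differential_impact(records)
print(impact)  # {1: 0.0, 2: 50.0, 3: 25.0}

# Cluster with the largest exposed-minus-unexposed difference.
most_responsive = max(impact, key=impact.get)
print("largest difference in cluster:", most_responsive)
```

With real survey data, differences like these would also need confidence intervals and adjustment for confounders before any cluster is described as benefiting most.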
