OBJECTIVES: Direct-to-beneficiary (D2B) mobile health communication programmes have been used to provide reproductive, maternal, neonatal and child health information to women and their families in a number of countries globally. Programmes to date have provided the same content, at the same frequency, using the same channel to large beneficiary populations. This manuscript presents a proof-of-concept approach that uses machine learning to segment populations of women with access to phones and their husbands into distinct clusters to support differential digital programme design and delivery.
SETTING: Data used in this study were drawn from a cross-sectional survey conducted in four districts of Madhya Pradesh, India.
PARTICIPANTS: Study participants included pregnant women with access to a phone (n=5095) and their husbands (n=3842).
RESULTS: We used an iterative process involving K-Means clustering and Lasso regression to segment couples into three distinct clusters. Cluster 1 (n=1408) tended to be poorer, less educated men and women, with low levels of digital access and skills. Cluster 2 (n=666) had a mid-level of digital access and skills among men but not women. Cluster 3 (n=1410) had high digital access and skills among men and moderate access and skills among women. Exposure to the D2B programme ‘Kilkari’ showed the greatest difference in Cluster 2, including an 8% difference in use of reversible modern contraceptives, 7% in child immunisation at 10 weeks, 3% in child immunisation at 9 months and 4% in the timeliness of immunisation at 10 weeks and 9 months.
CONCLUSIONS: Findings suggest that segmenting populations into distinct clusters for differentiated programme design and delivery may serve to improve reach and impact.
TRIAL REGISTRATION NUMBER: NCT03576157.
Kilkari is an outbound service that makes weekly, stage-based, prerecorded calls about reproductive, maternal, neonatal and child health (RMNCH) directly to families’ mobile phones, starting from the second trimester of pregnancy until the child is 1 year old. Kilkari comprises 90 min of RMNCH content sent via 72 once-weekly voice calls (average call duration: 1 min, 15 s). Approximately 18% of cumulative call content is on family planning; 13% on child immunisation; 13% on nutrition; 12% on infant feeding; 10% on pregnancy care; 7% on entitlements; 7% on diarrhoea; 7% on postnatal care; and the remainder on a range of topics including intrapartum care, water and sanitation, and early childhood development. BBC Media Action designed and piloted Kilkari in the Indian state of Bihar in 2012–2013, and then redesigned and scaled it in collaboration with the Ministry of Health and Family Welfare between 2015 and 2019. Details of the evaluation design and programme impact are reported elsewhere.20

Data used in this analysis were collected from four districts of the central Indian state of Madhya Pradesh as part of the impact evaluation of Kilkari described elsewhere.3 19 Madhya Pradesh (population 75 million) is home to an estimated 20% of India’s population and falls below national averages for most sociodemographic and health indicators.21 Wide differences by gender and between urban and rural areas persist for a wide range of indicators, including literacy, phone access and health-seeking behaviours.
Among men and women 15–49 years of age, 59% of women (78% urban and 51% rural) were literate as compared with 82% of men in 2015–2016.21 Among literate women, 23% had 10 or more years of schooling (44% urban and 14% rural).21 Despite near universal access to phones at a household level, only 19% of women in rural areas and 50% in urban areas had access to a phone that they themselves could use in 2015.21 Over half (52%) of pregnant women received the recommended four antenatal care (ANC) visits in urban areas as compared with only 30% in rural areas.21 Despite high rates of institutional delivery (94%) in urban areas, only 76% of women in rural areas reported delivering in a health facility in 2015.21 These disparities underscore the population heterogeneity within and across Madhya Pradesh.

The samples for this study were obtained through cross-sectional surveys administered between 2018 and 2020 to women (n=5095) with access to a mobile phone and their husbands (n=3842) in four districts of Madhya Pradesh.20 At the time of the first survey (2018–2019), the women were 4–7 months pregnant; the later survey (2019–2020) reinterviewed the same women at 12 months post partum. Their husbands were interviewed only once, during the later survey round. Each survey took 1.5 hours to administer. In this analysis, the modules on household assets and member characteristics and on phone access and use, including observed digital skills (navigating interactive voice response (IVR) prompts, giving a missed call, storing contacts on a phone, opening SMS and reading SMS), were used to develop models.
Data on maternal and child health practices, including infant and young child feeding, family planning, pregnancy and postpartum care, were used to explore the differential impact of Kilkari across clusters but were not used in the development of clusters.20

Figure 1 presents the framework used for developing homogeneous clusters of men and women in four districts of rural Madhya Pradesh, India. Box 1 describes the steps undertaken at each point in the framework in detail. We started with data elements on phone access and use as well as population sociodemographic characteristics collected as part of a cross-sectional survey described elsewhere.3 22 Unsupervised learning was undertaken using K-Means clustering and strong signals were identified. Strong signals were defined as variables that had a prevalence of at least 70% in one or more clusters and differed from another cluster by 50% or more. For example, 6% of men own a smartphone in Cluster 1, 88% in Cluster 2 and 75% in Cluster 3. Therefore, having a smartphone can be considered a strong signal. Additional details are summarised in box 1. Once clusters were defined, we explored differences in healthcare practices between those exposed and not exposed to Kilkari within each cluster.

Data collected from special surveys like the couples’ dataset used here are relatively small in sample size but large in the number of data elements available. In such high-dimensional data, many irrelevant dimensions can mask existing clusters in noisy data, making it more difficult to develop effective clustering methods.3 31 Several approaches have been proposed to address this problem.
These approaches can be grouped into two categories: static or adaptive dimensionality reduction, including principal components analysis,32 33 and subspace clustering, which consists of selecting a small number of original dimensions (features) in some unsupervised way or using expert knowledge so that clusters become more obvious in the subspace.34 35 In this study, we combined subspace clustering using expert knowledge and adaptive dimensionality reduction (online supplemental figure 1) to find the subspace in which clusters are best separated and defined. As part of subspace clustering, we therefore chose to start with the couples’ survey data, including variables related to sociodemographic characteristics, phone ownership, use and literacy (online supplemental table 1). The emergent clusters overlapped. We therefore decided to use the men’s survey data on phone access and use as a starting point.

Analyses started with a predefined set of data elements captured as part of the men’s cross-sectional survey, including sociodemographic characteristics and phone access and use. K-Means clustering was used to identify clusters and the elbow method was used to define the optimal number of clusters. Strong signals were then identified. Variables that had a prevalence of at least 70% in one or more clusters and differed from another cluster by 50% or more were considered to have a strong signal.

Once an initial model had been developed from the predefined set of data from the men’s survey and strong signals had been identified, we reviewed the available data from the combined dataset (data from the men’s survey and the women’s survey). Signal strength was used as the outcome variable (target) in a linear regression with L1 regularisation, that is, Lasso (Least Absolute Shrinkage and Selection Operator) regression. Regularisation is a technique used in supervised learning to avoid overfitting. Lasso regression adds the sum of the absolute values of the coefficients as a penalty term to the loss function.
The loss function becomes:

$$\mathrm{Loss} = \mathrm{Error}(y, \hat{y}) + \alpha \sum_{i=1}^{N} |\omega_i|$$

where $\omega_i$ are the coefficients of the linear regression $y = \omega_1 x_1 + \omega_2 x_2 + \dots + \omega_N x_N + b$. Lasso regression works well for selecting features in very large datasets because it shrinks the coefficients of the less important features to 0.36 37 Merged women’s survey and men’s survey data were used as predictors for the regression, excluding variables related to health knowledge and practices. After data preprocessing, the sample consisted of 3484 rows and 1725 variables.

We then reran K-Means clustering with three clusters (K=3) using the important features selected by Lasso regression. This methodology was used to refine the clusters and subsequently identify new strong signals. After step 3 was conducted, we repeated step 2, and continued iterating between steps 2 and 3 until there was no further gain in strong signals. Data preparation and results formatting were conducted in R V.4.1.1,38 and K-Means clustering was performed in Python V.3.8.5.39

Figure 1. Framework for segmentation analysis.

Patients were first engaged when they were identified in their households as part of a household listing carried out in mid/late 2018. Those meeting eligibility criteria were interviewed as part of the baseline survey, and ultimately randomised to the intervention and control arms. Prior to the administration of the baseline, a small number of patients were involved in the refinement of survey tools through qualitative interviews, including cognitive interviews, which were carried out to optimise survey questions, including the language and translation used. Finalised tools were administered to patients at baseline and endline and, for a subsample of the study population, additional phone and qualitative interviews were carried out between the baseline and endline surveys. Unfortunately, because of travel restrictions associated with COVID-19, findings were not disseminated back to community members.
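To make the strong-signal rule and the Lasso feature-selection step described for box 1 more concrete, the following is a minimal Python sketch rather than the study's own code. It assumes that "differed by 50% or more" means a gap of 50 percentage points, and it uses scikit-learn's `Lasso` on synthetic data in place of the merged survey data. The function name `strong_signals`, the variable names, the `household_has_phone` row, the value of `alpha` and the synthetic target are all illustrative assumptions; only the smartphone percentages (6%, 88%, 75%) come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def strong_signals(prevalence, min_prev=70.0, min_gap=50.0):
    """List variables whose prevalence (%) reaches min_prev in at least one
    cluster and differs from another cluster by at least min_gap points."""
    peak = prevalence.max(axis=1)
    gap = peak - prevalence.min(axis=1)
    return prevalence.index[(peak >= min_prev) & (gap >= min_gap)].tolist()


# Rows are variables, columns are clusters. The smartphone row uses the
# percentages quoted in the text; the second row is invented for contrast.
prevalence = pd.DataFrame(
    {
        "cluster_1": [6.0, 90.0],
        "cluster_2": [88.0, 95.0],
        "cluster_3": [75.0, 97.0],
    },
    index=["man_owns_smartphone", "household_has_phone"],
)
signals = strong_signals(prevalence)

# Synthetic stand-in for the merged survey data: 200 rows and 20 candidate
# predictors, of which only x0 and x3 actually drive the (invented) target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 20)), columns=[f"x{i}" for i in range(20)])
y = 3.0 * X["x0"] - 2.0 * X["x3"] + rng.normal(scale=0.1, size=200)

# Standardise so the L1 penalty treats all predictors on the same scale.
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Predictors whose coefficients were not shrunk to exactly zero are kept.
selected = X.columns[lasso.coef_ != 0].tolist()

print(signals)
print(selected)
```

In a workflow like the one described, the selected columns would then be passed back to K-Means and the strong-signal check repeated on the new clusters; how the penalty strength was tuned in the study is not stated, so `alpha=0.1` here is only a placeholder.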
As part of steps 1 and 3, K-Means algorithms were used (box 1). We chose the K-Means algorithm because of its simplicity and its speed in handling large datasets compared with hierarchical clustering.23 K-Means is a method of cluster analysis designed to uncover natural groupings within a heterogeneous population by minimising the Euclidean distance between each observation and the centre of its cluster.24 When using a K-Means algorithm, the first step is to choose the number of clusters K that will be generated. The algorithm starts by selecting K points randomly as the initial centres (also known as cluster means or centroids) and then assigns each observation to the nearest centre. Next, the algorithm computes the new mean value (centroid) of each cluster’s new set of observations. K-Means then repeats this process, reassigning observations to the nearest centre, until a new iteration no longer reassigns any observation to a different cluster (convergence).

Four metrics were used to validate the clustering: within-cluster sum of squares, silhouette index, Ray-Turi criterion and Calinski-Harabasz criterion. The elbow method was used to find the appropriate K (number of clusters).25 Figure 2 shows the within-cluster sum of squares (or inertia) by the number of groups (k value) chosen for several executions of the algorithm. Inertia is a metric that shows how dissimilar the members of a group are. The lower the inertia, the more similarity there is within a cluster (compactness). The main purpose of clustering is not to achieve 100% compactness but to find a reasonable number of groups that satisfactorily explains a considerable part of the data (k=3 in this case).

Figure 2. Elbow method used to help decide the number of clusters appropriate for the data.

Silhouette analysis helped to evaluate the goodness of clustering, that is, clustering validation (figure 3). It can be used to study the separation distance between the resulting clusters.
The silhouette plot displays a measure of how close each point in one cluster is to points in the neighbouring clusters. This measure has a range of [−1, 1]. Silhouette coefficients near +1 indicate that the sample is far from the neighbouring clusters. A value of 0 indicates that the sample is very close to the decision boundary between two neighbouring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. Figure 3 shows that three clusters fitted the data from the available surveys better than four for two reasons: (1) there were fewer points with negative silhouettes and (2) the cluster size (thickness) was more uniform for three groupings.

Other criteria used to evaluate the quality of clustering are obtained by combining the ‘within-cluster compactness index’ and the ‘between-cluster spacing index’.26 The Calinski-Harabasz criterion is given by:

$$C(k) = \frac{\mathrm{Trace}(B)\,(n-k)}{\mathrm{Trace}(W)\,(k-1)}$$

and the Ray-Turi criterion is given by:

$$r(k) = \frac{\mathrm{distance}(W)}{\mathrm{distance}(B)}$$

where $n$ is the number of observations, $k$ is the number of clusters, $B$ is the between-cluster covariance matrix (so high values of $B$ denote well-separated clusters) and $W$ is the within-cluster covariance matrix (so low values of $W$ correspond to compact clusters). Both criteria led to the same conclusion: three clusters were the best choice for the available data. Online supplemental table 2 gives the different metrics used and the values obtained for various numbers of clusters.

Figure 3. Silhouette analysis for three and four clusters.
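The validation metrics described above can be illustrated with a short Python sketch using scikit-learn. This is an illustration under stated assumptions, not the study's code: the data are synthetic blobs with three known groups rather than the survey data, and the Ray-Turi index is computed in one common formulation (mean squared distance of points to their assigned centroid, divided by the smallest squared distance between two centroids), which is an assumption because the text gives only the ratio distance(W)/distance(B).

```python
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data with three well-separated groups (not the survey data).
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Ray-Turi (assumed form): within-cluster compactness over the smallest
    # squared separation between centroids; lower values are better.
    compactness = model.inertia_ / len(X)
    separation = pdist(model.cluster_centers_, "sqeuclidean").min()
    results[k] = {
        "inertia": model.inertia_,  # within-cluster sum of squares (elbow plot)
        "silhouette": silhouette_score(X, model.labels_),  # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, model.labels_),  # higher is better
        "ray_turi": compactness / separation,
    }

best_by_silhouette = max(results, key=lambda k: results[k]["silhouette"])
best_by_ch = max(results, key=lambda k: results[k]["calinski_harabasz"])
best_by_ray_turi = min(results, key=lambda k: results[k]["ray_turi"])

for k, metrics in results.items():
    print(k, {name: round(value, 3) for name, value in metrics.items()})
```

Plotting `inertia` against k gives the elbow chart, while the other three metrics each point to a single preferred k. On real survey data the metrics may disagree, which is one reason to inspect several of them together.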