On Expectation-Maximization Algorithm with Split and Merge in R



Abstract
This paper presents a comprehensive study on the implementation of the Expectation-Maximization (EM) algorithm for a 4-component bivariate Gaussian Mixture Model (GMM), with a focus on incorporating split and merge techniques. The bivariate GMM is widely utilized in various fields, such as pattern recognition, image processing, and data clustering, due to its flexibility in capturing complex data distributions. Our proposed implementation leverages the versatility of the R programming language to create a robust and efficient framework for modeling and optimizing the parameters of the 4-component GMM. The EM algorithm is employed as a powerful tool for iteratively estimating the model parameters, ensuring convergence to a local maximum of the likelihood function. To enhance the model's flexibility, we introduce split and merge strategies, enabling the algorithm to adapt to diverse data structures and efficiently manage the complexity of the mixture components. The split operation allows for the subdivision of components when needed, while the merge operation facilitates the combination of components that represent similar patterns. This adaptive approach contributes to the model's ability to capture intricate data patterns and improves convergence during the optimization process. The implementation is validated through extensive experimentation on synthetic data, demonstrating its effectiveness in accurately estimating the parameters of the 4-component bivariate GMM. The proposed methodology proves to be a valuable addition to the existing tools available for GMM-based modeling, providing researchers and practitioners with a flexible and powerful framework for analyzing complex data structures in diverse applications.

Introduction

Normal mixture models have found wide application, for example in brain MR image segmentation and bias field correction (Cheng et al., 2022). Normal mixture models also play a role in developing robust estimators, addressing challenges such as sensitivity to outliers under mixtures with light-tailed distributions like the normal distribution (Sugasawa et al., 2022).
Cluster analysis, especially via the mixture likelihood approach, has witnessed growing use of mixtures of distributions, normal or otherwise. The approach assumes that observations on the entities to be clustered come from a mixture of a specified number of populations or groups in varying proportions. By adopting a parametric form for the density function in each group, a likelihood is formed, and the unknown parameters are estimated through this likelihood. Probabilistic clustering is then obtained based on the estimated posterior probabilities of group membership, and entities can be assigned to groups by allocating them to the group with the highest estimated posterior probability of belonging (Ganesalingam and McLachlan, 1979). While decomposing a finite mixture of distributions is a challenging problem, the advent of high-speed computers has shifted attention to likelihood estimation of parameters in mixture distributions. The Expectation-Maximization (EM) algorithm has become a widely applicable approach for the iterative computation of maximum likelihood estimates in incomplete-data problems, where traditional methods may be more complicated (Dempster et al., 1977). The EM algorithm, with its Expectation step (E-step) and Maximization step (M-step), has proven valuable in various problems involving incomplete data, offering an intuitive and natural approach to estimation even before its formalization in the seminal DLR paper (Dempster et al., 1977).
The Expectation-Maximization (EM) algorithm and its applications span from theoretical studies on convergence, such as the analysis of EM and its variant, deterministic annealing EM (DA-EM) (Yu and Yang, 2018), to diverse modifications tailored for specific purposes, including image matching (Ma et al., 2019), parameter estimation (Liu et al., 2019; Du and Gui, 2019), malaria diagnosis (Pagès-Zamora et al., 2019), mixture simplification (Yu and Chan, 2018), and audio-visual scene analysis (Gebru et al., 2016).
The surge in EM's popularity is largely attributed to its effectiveness in estimating the parameters of mixture models (MM) (McLachlan, 2000; McLachlan and Krishnan, 2007). Mixture models are probabilistic models in which the modeled population consists of several sub-populations. Each sub-population is assumed to follow a simple parametric distribution, referred to as a component of the MM. The type of the MM is identified by the distribution family of its components, with Gaussian Mixture Models (GMM) being a notable example in which the components follow a Gaussian distribution. Once estimated, MM parameters find applications in density estimation (Yu et al., 2018; Bäcklin et al., 2018) and clustering tasks (Pagès-Zamora et al., 2019; Yang et al., 2012; Celeux and Govaert, 1995). In the context of MM parameter estimation, the EM algorithm serves as a clustering algorithm that maximizes the missing-data log-likelihood function, acting as a maximum-likelihood estimator. EM possesses favorable properties, such as monotonic convergence and adherence to the probabilistic constraints, making it particularly appealing in the context of MM parameter estimation.
However, the EM algorithm comes with documented drawbacks (Baudry and Celeux, 2015). It requires initial parameters for the first iteration, known as initialization, which can be challenging for MMs with numerous parameters. Moreover, EM is highly sensitive to initialization due to the multi-modality of the likelihood function of the MM. As a strictly hill-climbing algorithm with a monotonic convergence property, EM can get trapped in local optima. Consequently, the initialization of the EM algorithm is of significant importance, and it has been the subject of extensive research. Effective initialization not only enhances the results of the estimation problem but can also expedite the estimation process. Despite its advantages, the EM algorithm is noted for its slow linear convergence rate (Baudry and Celeux, 2015; Ng and McLachlan, 2004); yet, with good initial parameters close to a local maximum of the likelihood function, fewer computationally expensive EM iterations are needed to complete the process.
In their paper, Zhang et al. (2004) introduced a novel Competitive EM (CEM) algorithm designed for finite mixture models, aiming to address the primary limitations of the EM algorithm, namely susceptibility to entrapment in local maxima and occasional convergence to the boundary of the parameter space. The proposed algorithm can autonomously determine the number of clusters and efficiently execute "split" or "merge" operations, leveraging a new competitive mechanism. Notably, it is insensitive to the initial number of mixture components and the initial model parameters.
The aim of this paper is to implement the EM algorithm for a multivariate Gaussian mixture model with "split" and "merge" operations on the R platform (R Core Team, 2010).

Methods
For convenience, we provide the necessary definitions and mathematical derivations for the finite mixture model, the EM algorithm, and the Competitive EM algorithm that are used throughout this work.

Learning finite mixture models
A d-dimensional random variable $x = [x_1, x_2, \ldots, x_d]^T$ is said to follow a k-component finite mixture distribution if its probability density function can be written as

$$p(x \mid \Theta) = \sum_{m=1}^{k} \pi_m \, p(x \mid \theta_m),$$

where $\pi_m$ is the prior probability of the mth component and satisfies

$$\pi_m \ge 0, \qquad \sum_{m=1}^{k} \pi_m = 1,$$

$\theta_m$ is the parameter of the mth density model, and $\Theta = \{\pi_1, \ldots, \pi_k, \theta_1, \ldots, \theta_k\}$ is the parameter set of the mixture model. For our study we used a 4-component 2-dimensional Gaussian mixture model.
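As a concrete reference point, a minimal R sketch of this mixture density might look as follows; the parameter container (a list with fields pi, mu, and sigma) is our own convention, and dmvnorm() comes from the mvtnorm package:

```r
library(mvtnorm)  # provides dmvnorm() for the multivariate normal density

# Mixture density p(x | Theta) = sum_m pi_m * N(x | mu_m, Sigma_m).
# 'theta' is a list with fields: pi (length-k weight vector),
# mu (list of k mean vectors), sigma (list of k covariance matrices).
gmm_density <- function(x, theta) {
  k <- length(theta$pi)
  dens <- 0
  for (m in seq_len(k)) {
    dens <- dens + theta$pi[m] *
      dmvnorm(x, mean = theta$mu[[m]], sigma = theta$sigma[[m]])
  }
  dens
}
```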

E-Step
We apply Bayes' rule, since the posterior is not yet known. The unnormalised responsibility is given by

$$\tilde{r}_{nm} = \pi_m \, \mathcal{N}(x_n \mid \mu_m, \Sigma_m),$$

and therefore the normalised responsibility is given by

$$r_{nm} = \frac{\tilde{r}_{nm}}{\sum_{l=1}^{k} \tilde{r}_{nl}}.$$
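A direct R translation of these two formulas, under the same list-based parameter convention as above, could be:

```r
library(mvtnorm)  # for dmvnorm()

# E-step: responsibility r[n, m] of component m for observation x_n.
e_step <- function(X, theta) {
  n <- nrow(X)
  k <- length(theta$pi)
  r <- matrix(0, n, k)
  for (m in seq_len(k)) {                      # unnormalised responsibilities
    r[, m] <- theta$pi[m] *
      dmvnorm(X, mean = theta$mu[[m]], sigma = theta$sigma[[m]])
  }
  r / rowSums(r)                               # normalise each row to sum to 1
}
```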

Maximising with respect to $\Theta$

Maximising the expected complete-data log-likelihood with respect to $\Theta$ yields the closed-form updates

$$\pi_m = \frac{1}{N} \sum_{n=1}^{N} r_{nm}, \qquad \mu_m = \frac{\sum_{n=1}^{N} r_{nm} \, x_n}{\sum_{n=1}^{N} r_{nm}}, \qquad \Sigma_m = \frac{\sum_{n=1}^{N} r_{nm} \, (x_n - \mu_m)(x_n - \mu_m)^T}{\sum_{n=1}^{N} r_{nm}}.$$

The E and M steps are performed iteratively until convergence.
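The corresponding M-step can be sketched in R as follows, again assuming the list-based parameter convention and the responsibility matrix produced by e_step() above:

```r
# M-step: closed-form updates of the weights, means, and covariances
# from the responsibility matrix r produced by e_step().
m_step <- function(X, r) {
  n <- nrow(X)
  k <- ncol(r)
  Nm <- colSums(r)                             # effective size of each component
  theta <- list(pi = Nm / n,
                mu = vector("list", k),
                sigma = vector("list", k))
  for (m in seq_len(k)) {
    mu_m <- colSums(r[, m] * X) / Nm[m]        # responsibility-weighted mean
    Xc <- sweep(X, 2, mu_m)                    # centred data
    theta$mu[[m]] <- mu_m
    theta$sigma[[m]] <- t(Xc) %*% (r[, m] * Xc) / Nm[m]  # weighted scatter
  }
  theta
}
```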

The Competitive EM (CEM)
When EM encounters local maxima, the components usually over-populate some regions, i.e. the model overfits the data there, while under-populating other regions. The difficulty of passing through low-likelihood regions prevents the components from reaching their expected positions. To overcome this problem, CEM performs a split or merge operation once EM has converged to a local maximum, so that the distribution of the components can self-adapt: a split takes place where the components are too few and a merge where they are too many.

CEM uses a local Kullback divergence to measure the distance between the local data density $f_m(x)$ and the model density $p_m(x \mid \theta_m)$ of the mth component, defined as

$$J(m; \Theta) = \int f_m(x) \log \frac{f_m(x)}{p_m(x \mid \theta_m)} \, dx. \qquad (19)$$

The split probability of the mth component is in direct proportion to $J(m;\Theta)$, and the merge probability of the mth and lth components is in inverse proportion to $J(m';\Theta')$, where $m'$ and $\Theta'$ denote, respectively, the index of the merged component and the new parameters of the mixture model if the mth and lth components were merged; the local data density of the merged component is

$$f_{m'}(x) = \frac{\pi_m f_m(x) + \pi_l f_l(x)}{\pi_m + \pi_l}.$$

The two probabilities can thus be written, up to normalisation over all candidates, as

$$p_{\mathrm{split}}(m;\Theta) \propto \exp\big(\beta \, J(m;\Theta)\big), \qquad p_{\mathrm{merge}}(m,l;\Theta) \propto \exp\big(-\beta \, J(m';\Theta')\big),$$

where β is a constant.
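Equation (19) has no closed form when $f_m$ is an empirical density, so any implementation must approximate it. The sketch below evaluates $J(m;\Theta)$ on the sample using a responsibility-weighted Gaussian kernel density estimate of $f_m$; the kernel estimator and the rule-of-thumb bandwidth are our assumptions, not prescriptions from the CEM paper:

```r
# Approximate J(m; Theta) of Eq. (19) on the sample. The local data density
# f_m is represented by the responsibility weights and, inside the logarithm,
# estimated with a weighted Gaussian kernel density (O(n^2); illustration only).
local_kl <- function(X, r, theta, m) {
  w <- r[, m] / sum(r[, m])                    # weights of f_m on the sample
  h <- apply(X, 2, sd) * nrow(X)^(-1/6)        # rule-of-thumb bandwidths
  fm <- sapply(seq_len(nrow(X)), function(i) { # weighted KDE of f_m at x_i
    sum(w * dmvnorm(sweep(X, 2, X[i, ]), sigma = diag(h^2)))
  })
  pm <- dmvnorm(X, mean = theta$mu[[m]], sigma = theta$sigma[[m]])
  sum(w * log(fm / pm))                        # ~ integral of f_m log(f_m / p_m)
}
```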

Operations of Split and Merge
The operation type and the candidate components for a split or merge are sampled according to $p_{\mathrm{split}}(m;\Theta)$ and $p_{\mathrm{merge}}(m,l;\Theta)$.

Split operation: Suppose that the mth component is chosen to split. CEM generates two new components from the current samples in the mth component. We initialise the parameters of the two new components by a random initialisation method and optimise them by the basic EM algorithm.
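As an illustration, one plausible way to realise the split in R is to perturb the parent mean along its principal axis and halve its weight; this particular initialisation is our own choice, since the paper only specifies that the children are initialised randomly and then refined by EM:

```r
# Split component m into two children: perturb the parent mean along its
# principal axis, copy the covariance, and halve the mixing weight. The
# children are then refined by running the basic EM steps on the full model.
split_component <- function(theta, m) {
  eig <- eigen(theta$sigma[[m]])
  shift <- eig$vectors[, 1] * sqrt(eig$values[1])  # offset along principal axis
  new_mu <- theta$mu[[m]] - shift
  theta$mu[[m]] <- theta$mu[[m]] + shift           # child 1 replaces the parent
  theta$mu <- c(theta$mu, list(new_mu))            # child 2 is appended
  theta$sigma <- c(theta$sigma, list(theta$sigma[[m]]))
  theta$pi[m] <- theta$pi[m] / 2                   # children share the weight
  theta$pi <- c(theta$pi, theta$pi[m])
  theta
}
```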
Merge operation: Suppose that the mth and lth components are selected to merge. The parameters of the merged component can be calculated directly from the parameters of the original mth and lth components as follows:

$$\pi_{m'} = \pi_m + \pi_l, \qquad \mu_{m'} = \frac{\pi_m \mu_m + \pi_l \mu_l}{\pi_m + \pi_l},$$

$$\Sigma_{m'} = \frac{\pi_m \big(\Sigma_m + (\mu_m - \mu_{m'})(\mu_m - \mu_{m'})^T\big) + \pi_l \big(\Sigma_l + (\mu_l - \mu_{m'})(\mu_l - \mu_{m'})^T\big)}{\pi_m + \pi_l}. \qquad (25)$$
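A direct R implementation of these moment-matching updates, under the same parameter convention as the earlier sketches, might read:

```r
# Merge components m and l via the moment-matching updates of Eq. (25):
# the merged component keeps the combined weight and matches the pooled
# mean and covariance of its parents.
merge_components <- function(theta, m, l) {
  pm <- theta$pi[m]; pl <- theta$pi[l]
  mu <- (pm * theta$mu[[m]] + pl * theta$mu[[l]]) / (pm + pl)
  dm <- theta$mu[[m]] - mu
  dl <- theta$mu[[l]] - mu
  sig <- (pm * (theta$sigma[[m]] + dm %*% t(dm)) +
          pl * (theta$sigma[[l]] + dl %*% t(dl))) / (pm + pl)
  theta$pi[m] <- pm + pl                      # merged component replaces m
  theta$mu[[m]] <- mu
  theta$sigma[[m]] <- sig
  theta$pi <- theta$pi[-l]                    # component l is removed
  theta$mu <- theta$mu[-l]
  theta$sigma <- theta$sigma[-l]
  theta
}
```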

Probability of Acceptance
After the operation type and the operation candidates have been chosen by sampling according to the split and merge probabilities, an acceptance probability is calculated to prevent poor operations. It is calculated as

$$p_{\mathrm{accept}} = \min\left(\exp\left(\frac{L(\Theta^{(t+1)}, X) - L(\Theta^{(t)}, X)}{\gamma}\right),\ 1\right), \qquad (26)$$

where γ is a constant and $L(\Theta, X)$ denotes the log-likelihood of the mixture on the data.
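Equation (26) translates directly into R; the helper below reuses gmm_density() from the earlier sketch, and the default gamma = 1 is only a placeholder for the tuning constant:

```r
# Acceptance probability of Eq. (26); reuses gmm_density() defined earlier.
log_lik <- function(X, theta) sum(log(gmm_density(X, theta)))

accept_prob <- function(X, theta_new, theta_old, gamma = 1) {
  min(exp((log_lik(X, theta_new) - log_lik(X, theta_old)) / gamma), 1)
}

# A proposed split/merge is kept when runif(1) < accept_prob(X, th_new, th_old).
```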

Results and Discussion
In this section, the same synthetic data set is used to define the distribution for all experiments. The results of the experiments, i.e. the parameter estimates together with randomly selected values of the variables, are presented and discussed.

Synthetic data
We use 1000 samples from a 4-component GMM. In this GMM, two of the components (kernels 1 and 2) share a common mean but have different covariance matrices. The prior probability of kernel 4 is slightly lower than that of the other kernels. The parameters of the GMM are given as follows:
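For illustration, the following R sketch draws 1000 samples from a 4-component bivariate GMM whose parameter values are hypothetical stand-ins chosen to match the description above (kernels 1 and 2 share a mean but differ in covariance, and kernel 4 has a slightly lower prior); they are not the paper's exact values:

```r
library(MASS)  # provides mvrnorm() for multivariate normal sampling
set.seed(1)

# Hypothetical parameters consistent with the description above (not the
# paper's exact values): kernels 1 and 2 share the mean c(0, 0) but differ
# in covariance, and kernel 4 carries a slightly lower prior weight.
pi_true <- c(0.27, 0.27, 0.27, 0.19)
mu_true <- list(c(0, 0), c(0, 0), c(4, 4), c(-4, 4))
sigma_true <- list(diag(2),
                   matrix(c(2, 1.5, 1.5, 2), 2, 2),
                   0.5 * diag(2),
                   diag(2))

# Draw 1000 samples: pick a kernel by its prior, then sample from it.
z <- sample(1:4, 1000, replace = TRUE, prob = pi_true)
X <- t(sapply(z, function(m) mvrnorm(1, mu_true[[m]], sigma_true[[m]])))
```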
Figure 2: Plot of simulated data showing clustering.

Figure 3: Plot of fitted model.

Table 1: Estimates of the 2-component Gaussian mixture model for test data.

Conclusion
This work showed how the EM algorithm can be implemented for a bivariate 4-component GMM with 'split' and 'merge' operations in R. A slight modification of the same code can be applied to multivariate cases with more than two variables. It can be seen that the split and merge operations helped in overcoming the challenges of the EM algorithm and aided the program in reaching a global solution.