Random forest clustering

2/10/2024

Copulas are one of the popular tools for analyzing nonlinear dependencies. The Copula-Based Clustering technique, called CoClust, examines dependencies using copulas and is an alternative to classical clustering techniques. In this technique, the strength and type of multivariate dependency between sets are modeled with a copula function and a dependency parameter, which overcomes the constraints of linear dependency. In the feature selection step, the determination of nonlinear dependency is emphasized, so copulas are preferred, and CoClust gives effective results by clustering variables that show nonlinear dependency. The main advantage of the proposed approach using CoClust is that it achieves high accuracy on big data in a short time. Side benefits of the study are that the relationship between variables is determined correctly under nonlinear dependency and that the clustering method operates efficiently in that setting.
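To illustrate why a rank-based view of dependence differs from linear correlation, here is a minimal sketch in plain Python. It is not CoClust; it only compares Pearson's correlation with Kendall's tau on a made-up monotone but nonlinear relation (y = exp(x)). Kendall's tau depends only on the ranks of the data, which is the same information a copula describes. The helper names `pearson` and `kendall_tau` and the toy data are assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumes there are no ties in x or y.
    """
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return score / (n * (n - 1) / 2)


# Toy data (illustrative only): y is a strictly increasing, nonlinear function of x.
x = [i / 10 for i in range(1, 51)]
y = [math.exp(v) for v in x]

r = pearson(x, y)
tau = kendall_tau(x, y)
print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")
```

Because y increases whenever x does, every pair is concordant and tau reaches its maximum, while Pearson's r stays below 1 because the relation is not a straight line.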

The random forest (RF) technique is an effective and popular method, based on decision trees, for solving classification and regression problems. It is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. RF has been used in biology and medicine, for example on high-dimensional genetic or tissue microarray data and on MIMIC-III. Because of its simplicity, it is designed to operate quickly and efficiently over large datasets, and in classification settings it is reported to offer the highest prediction accuracy compared to other models.

The main contribution of this study is to increase the speed and accuracy of RF by adding a new feature selection step. Especially when working with big data, choosing a suitable clustering method is very important for speed and accuracy, and correctly determining the dependency between variables in the feature selection step is one of the most critical steps of the study. Although studies often expect linear dependence, nonlinear dependence is also frequently encountered. As the dependency structure is mostly nonlinear, a tool that accounts for nonlinearity is the more beneficial choice. The Copula-Based Clustering technique (CoClust) clusters variables with copulas according to their nonlinear dependency.

We work with two large datasets: the MIMIC-III Sepsis Dataset and the SMS Spam Collection Dataset. The first is large in terms of rows, which refer to individual IDs, while the second is an example of wide data with many variables to consider. In the proposed approach, random forest is first employed without the CoClust step and then repeated on the clusters obtained with CoClust. The results are compared in terms of CPU time, accuracy and the ROC (receiver operating characteristic) curve. CoClust clustering results are also compared with K-means and hierarchical clustering techniques, and the Random Forest, Gradient Boosting and Logistic Regression results obtained with these clusters, together with the success of RF and CoClust working in combination, are examined. We show that adding the CoClust-based feature selection step to the random forest technique can yield a remarkable improvement in CPU time and accuracy.
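The comparison described above (random forest alone, then random forest after a dependency-based feature selection step, judged on CPU time, accuracy and ROC) can be sketched as follows. This is a rough illustration under several assumptions, not the study's implementation: CoClust itself is not used here. As a plainly named stand-in, the hypothetical helper `select_by_rank_clusters` groups columns by Spearman rank correlation with hierarchical clustering and keeps one column per group. The synthetic data from `make_classification`, the 0.5 distance threshold and the forest settings are also illustrative choices.

```python
import time

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def select_by_rank_clusters(X, threshold=0.5):
    """Stand-in for CoClust (NOT the CoClust algorithm).

    Groups columns whose absolute Spearman rank correlation is high and
    returns the index of the first column in each group.
    """
    corr, _ = spearmanr(X)                        # (n_features, n_features) matrix
    dist = np.clip(1.0 - np.abs(corr), 0.0, None)  # strong dependence -> small distance
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=threshold, criterion="distance")
    return sorted(int(np.flatnonzero(labels == c)[0]) for c in np.unique(labels))


def fit_and_score(X_tr, X_te, y_tr, y_te):
    """Train a random forest and report CPU time, accuracy and ROC AUC."""
    start = time.process_time()
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_tr, y_tr)
    proba = rf.predict_proba(X_te)[:, 1]
    cpu_seconds = time.process_time() - start
    return {
        "cpu_s": cpu_seconds,
        "accuracy": accuracy_score(y_te, rf.predict(X_te)),
        "roc_auc": roc_auc_score(y_te, proba),
    }


# Synthetic data with deliberately redundant columns (illustrative only).
X, y = make_classification(n_samples=1000, n_features=30, n_informative=5,
                           n_redundant=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

cols = select_by_rank_clusters(X_tr)              # chosen on training data only
baseline = fit_and_score(X_tr, X_te, y_tr, y_te)
reduced = fit_and_score(X_tr[:, cols], X_te[:, cols], y_tr, y_te)

print(f"all {X.shape[1]} features: {baseline}")
print(f"{len(cols)} selected features: {reduced}")
```

The numbers this prints on synthetic data say nothing about the gains reported in the study; the sketch only shows the shape of the two-stage comparison. Selecting columns on the training split alone keeps the test split untouched by the feature selection step.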

With a well-designed and organized feature selection phase, the random forest algorithm can be enhanced to produce better results. The dependency structure between the variables is considered the most important criterion for selecting which variables the algorithm uses during that phase.
