Publication date: 2 March

Abstract

Purpose
The purpose of this paper is to present and analyze the current literature on developing and improving the Mahalanobis-Taguchi system (MTS) and to identify the shortcomings of this method as directions for future research. For this purpose, 46 articles are considered for classification from to on the basis of: MTS contribution area, description of the issue, and results.

Findings
This paper reviews the concepts and operations of the MTS as a new method in the field of pattern recognition, multivariable diagnosis, and forecasting. The analysis of the articles showed which fields of MTS have more potential for future study and development. The comparison of the MTS to other methods and the selection of the normal group for constructing the Mahalanobis space have received the most attention from researchers.
|Published (Last):||12 September 2008|
Unfortunately, MTS lacks a method for determining an efficient threshold for the binary classification. MMTS outperforms the benchmarked algorithms, especially when the imbalance ratio is greater than A real-life case study in the manufacturing sector is used to demonstrate the applicability of the proposed model and to compare its performance with the Mahalanobis Genetic Algorithm (MGA).

Introduction

Classification is one of the supervised learning approaches in which a new observation needs to be assigned to one of the predetermined classes or categories.
If the number of the predetermined classes is more than two, it is a multiclass classification problem; otherwise, the problem is known as the binary classification problem. At present, these problems have found applications in different domains such as product quality [ 1 ] and speech recognition [ 2 ]. The classification accuracy depends on both the classifier and the data types.
The classifier types can be categorized according to supervised versus unsupervised learning, linear versus nonlinear hyperplane, and feature selection versus feature extraction based approach [ 3 ]. On the other hand, Sun et al. If the data distribution of one class is different from the distributions of the others, then the data is considered imbalanced. The border that separates balanced from imbalanced data is vague; for example, the imbalance ratio, which is the ratio of majority-class to minority-class observations, is reported from small values of to 1 to : 1 [ 5 ].
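The imbalance ratio defined above can be computed directly from the class labels. The following is a minimal sketch; the function name `imbalance_ratio` and the label values are hypothetical and chosen only for illustration.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return max(counts.values()) / min(counts.values())

# Made-up labels: 95 negative (majority) and 5 positive (minority) observations.
labels = ["neg"] * 95 + ["pos"] * 5
ratio = imbalance_ratio(labels)
print(ratio)  # 19.0, i.e. an imbalance ratio of 19 : 1
```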
The assumption of an equal number of observations in each class is fundamental to the common classification methods such as decision tree analysis, Support Vector Machines, discriminant analysis, and neural networks [ 6 ]. Imbalanced data occurs often in real-life applications such as text classification [ 7 ]. Treating applications that have imbalanced data with the common classifiers leads to bias in the classification accuracy i.
To handle the classification of imbalanced data, the research community uses data approaches, algorithmic approaches, or both. For the data approach, the main idea is to balance the class density randomly or informatively i. For the algorithmic approach, the main idea is to adapt the classifier algorithms towards the small class. A combination of the data and algorithmic level approaches is also used and is known as cost-sensitive learning.
The problems reported [ 4 ] with the data approach are as follows: deleting significant information for certain instances in the case of downsampling, bringing noise into the original data in the case of oversampling, determining the appropriate sample size in within-class concept data, specifying the ideal class distribution, and using clear criteria for selecting samples. The problem reported [ 4 ] with the algorithmic approach is that it needs a deep understanding of the classifier itself and of the application area i.
Finally, the problem with the cost-sensitive learning approach is that it assumes previous knowledge of the costs of the many error types and imposes a higher cost on the minority class to improve the prediction accuracy. Knowing the cost matrices is practically difficult in most cases. While data and algorithmic approaches constitute the majority of the efforts in the area of imbalanced data, several other approaches have also been proposed, which will be reviewed in Literature Review.
To overcome the pitfalls of the data and algorithmic approaches to imbalanced data classification, the classification algorithm needs to be capable of dealing with imbalanced data directly, without resampling, and should have a systematic foundation for determining the cost matrices or the threshold. One of the promising classifiers is the Mahalanobis Taguchi System (MTS), which has shown good classification results for imbalanced data without resampling. It does not require any distribution assumption for the input variables, and it can be used to measure the degree of abnormality i.
Three operating point selection criteria, shortest distance, harmonic mean, and antiharmonic mean, have been compared, and the results in [ 9 ] showed that there is no difference among the classifiers' performances. The aim of this work is to enhance the performance of the Mahalanobis Taguchi System (MTS) classifier by providing a scientific, rigorous, and systematic method that uses the ROC curve to determine the threshold that discriminates between the classes.
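As an illustration of choosing a threshold from the ROC curve, the sketch below picks the candidate threshold whose ROC point lies closest to the ideal corner (false positive rate 0, true positive rate 1). It is not the method proposed in the paper: it assumes that "shortest distance" means Euclidean distance to that corner and that a larger score (for example, a scaled MD) indicates a more abnormal observation. The function names and the scores are hypothetical.

```python
import math

def roc_point(threshold, neg_scores, pos_scores):
    """Return (FPR, TPR) when scores above the threshold are called positive."""
    fpr = sum(s > threshold for s in neg_scores) / len(neg_scores)
    tpr = sum(s > threshold for s in pos_scores) / len(pos_scores)
    return fpr, tpr

def shortest_distance_threshold(neg_scores, pos_scores):
    """Pick the observed score whose ROC point is closest to (FPR=0, TPR=1)."""
    candidates = sorted(set(neg_scores) | set(pos_scores))

    def distance(t):
        fpr, tpr = roc_point(t, neg_scores, pos_scores)
        return math.hypot(fpr, 1.0 - tpr)

    return min(candidates, key=distance)

# Made-up validation scores for the negative and positive classes.
neg = [0.4, 0.7, 0.9, 1.1, 1.3, 2.5]
pos = [2.0, 3.1, 4.8, 6.0]
best = shortest_distance_threshold(neg, pos)
print(best, roc_point(best, neg, pos))
```

With these made-up scores the selected threshold is 1.3, which accepts one false positive out of six negatives in exchange for detecting every positive observation.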
The organization of the paper is as follows: Section 2 reviews the previous work on imbalanced data classification methods, the Mahalanobis Taguchi System, and its applications. Section 5 presents a case study to demonstrate the applicability of the proposed research. Section 6 summarizes the results obtained from this research.

Literature Review

In this section, an overview of the imbalanced classification approaches, the Mahalanobis Taguchi System concept, its different areas of application, its weak points, and its variants is presented.
Solutions to the imbalanced learning problem can be summarized into the following approaches [ 10 ]: sampling (sometimes called the data level approach), algorithmic, and cost-sensitive approaches. The data level approach [ 11 ] mainly restores a balanced distribution between the classes through resampling techniques.
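The simplest resampling techniques are random undersampling of the majority class and random oversampling of the minority class. The sketch below shows both under stated assumptions: the function names, the `seed` parameter, and the data are hypothetical, and informed (non-random) resampling is not covered.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Drop random majority observations until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, seed=0):
    """Duplicate random minority observations until both classes have equal size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

majority = list(range(20))        # made-up majority-class observations
minority = [100, 101, 102, 103]   # made-up minority-class observations

under_major, under_minor = random_undersample(majority, minority)
over_major, over_minor = random_oversample(majority, minority)
print(len(under_major), len(under_minor))  # 4 4
print(len(over_major), len(over_minor))    # 20 20
```

Undersampling discards 16 of the 20 majority observations here, and oversampling only repeats existing minority observations, which is why both variants can lose information or distort the original data.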
Algorithmic level solutions are based upon biasing the algorithm towards the positive class. The algorithmic level approach has been used in many popular classifiers such as decision trees, Support Vector Machines (SVMs), association rule mining, back-propagation (BP) neural networks, one-sample learning, active learning methods, and the Mahalanobis Taguchi System (MTS).
The decision tree classifier can be adapted to suit imbalanced data by adjusting the probabilistic estimate of the tree leaf or by developing new trimming approaches [ 14 ]. Support Vector Machines (SVMs) showed good classification results for slightly imbalanced data [ 15 ], while for highly imbalanced data researchers [ 16 , 17 ] reported poor classification performance, since SVMs try to reduce the total error, which produces results shifted towards the negative (majority) class.
To handle imbalanced data, there are proposals such as using penalty constants for different classes found in Lin et al. Therefore, in this paper, SVM was selected as one of the benchmarked algorithms to compare with ours; the results showed that SVM classification performance degrades considerably at a high imbalance ratio, which supports the previous findings of the researchers (more details will be presented in Results). Association rule mining is a recent classification approach that combines association mining and classification into one approach [ 20 – 22 ].
To handle imbalanced data, multiple minimum supports must be determined for the different classes to reflect their varied frequencies [ 23 ]. On the other hand, one-class learning [ 24 , 25 ] uses the target class only to determine whether a new observation belongs to this class or not.
The BP neural network [ 26 ] and SVMs [ 27 ] have been examined as one-class learning approaches. In the case of highly imbalanced data, one-class learning showed good classification results [ 28 ]. Unfortunately, the drawbacks of one-class learning algorithms are that the training data must be relatively larger than for multiclass approaches and that it is hard to reduce the dimension of the features used for separation.
The active learning approach is used to handle the problems related to unlabeled training data. Research on active learning for imbalanced data reported by Ertekin et al. Unfortunately, one of the pitfalls of this approach is that it can be computationally expensive [ 30 ]. The problem with the algorithmic approach is that it needs extensive knowledge of the specific classifier i.
Cost-sensitive methods use both data and algorithmic approaches, where the objective is to optimize i. Cost-sensitive methods use different costs or penalties for different misclassification types. For example, one cost is assigned to wrongly classifying a positive instance as a negative one, and another cost is assigned to the contrary case.
In imbalanced data classification, detecting the positive instance is usually more important than detecting the negative one; hence, the cost of misclassifying a positive instance outweighs the cost of misclassifying a negative one i. Different types of cost-sensitive approaches have been reported in the literature: (i) Modifying the weights of the data space: in this approach, the training data density is modified using the misclassification cost criteria, in a way that the density is adjusted towards the costly class.
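The effect of unequal misclassification costs can be illustrated with the standard expected-cost decision rule, which is a generic textbook device and not one of the specific approaches surveyed here. The sketch assumes that correct decisions cost nothing and that the classifier outputs a probability of the positive class; the names `cost_fn` (positive misclassified as negative) and `cost_fp` (negative misclassified as positive), along with all numbers, are hypothetical.

```python
def cost_sensitive_threshold(cost_fn, cost_fp):
    """Probability above which predicting 'positive' has the lower expected cost.

    Predict positive when p * cost_fn >= (1 - p) * cost_fp,
    which rearranges to p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """Sum the misclassification costs; correct decisions are assumed free."""
    false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    false_pos = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return false_neg * cost_fn + false_pos * cost_fp

# Assumption: missing a positive is nine times as costly as a false alarm.
threshold = cost_sensitive_threshold(cost_fn=9.0, cost_fp=1.0)

probs = [0.05, 0.2, 0.6, 0.08]   # made-up predicted probabilities of 'positive'
y_true = [0, 1, 1, 0]
y_pred = [int(p >= threshold) for p in probs]
default_pred = [int(p >= 0.5) for p in probs]

print(total_cost(y_true, y_pred, 9.0, 1.0))        # 0.0
print(total_cost(y_true, default_pred, 9.0, 1.0))  # 9.0
```

The rule only works when the two costs are known, which is the practical difficulty with cost-sensitive learning noted in the text.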
The problem with the cost-sensitive approach is that it is based on previous knowledge of the cost matrix for the misclassification kinds, which in most cases is unavailable.

Mahalanobis Taguchi System (MTS)

MTS is a multivariate supervised learning approach, which aims to classify a new observation into one of the two classes i. MTS was used previously in predicting weld quality [ 3 ], exploring the influence of chemical constitution on hot rolling manufactured products [ 34 ], and selecting the significant features in automotive handling [ 35 ].
The MTS approach starts with collecting a considerable number of observations from the investigated dataset, followed by separating the unhealthy dataset i.
The Mahalanobis Distance (MD) is calculated first using the negative observations, followed by scaling i. The scaled MD for the positive dataset is supposed to be different from the MD for the negative dataset. Since many features are used to calculate the MD, the probability of having significant features in the multivariable dataset is high, so a Taguchi orthogonal array is used to screen these features.
The criterion for selecting the appropriate features is to select the features that possess high MD values for the positive observations. It is worth noting that MTS constructs a continuous scale from the observations of a single class, whereas in other classification techniques learning is done directly from both the positive and negative observations. This characteristic helps the MTS classifier to deal with imbalanced data problems.
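The scale construction described above can be sketched as follows. This is a minimal illustration and not the paper's implementation: it assumes the common MTS convention in which features are standardized with the mean and standard deviation of the normal (negative) group and the MD is divided by the number of features k, so that the average scaled MD of the normal group is 1. The function names and the data are hypothetical, and the orthogonal array screening step is not shown.

```python
import math

def fit_normal_group(normal):
    """Estimate per-feature mean, standard deviation and the correlation
    matrix from the normal (negative) observations only."""
    n, k = len(normal), len(normal[0])
    means = [sum(row[j] for row in normal) / n for j in range(k)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in normal) / n)
            for j in range(k)]
    z = [[(row[j] - means[j]) / stds[j] for j in range(k)] for row in normal]
    corr = [[sum(zi[a] * zi[b] for zi in z) / n for b in range(k)]
            for a in range(k)]
    return means, stds, corr

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    k = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def scaled_md(x, means, stds, corr):
    """Scaled MD: z^T C^-1 z divided by the number of features k."""
    k = len(x)
    z = [(x[j] - means[j]) / stds[j] for j in range(k)]
    return sum(zi * wi for zi, wi in zip(z, solve(corr, z))) / k

# Made-up normal group with two strongly correlated features.
normal = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
means, stds, corr = fit_normal_group(normal)

normal_mds = [scaled_md(row, means, stds, corr) for row in normal]
abnormal_md = scaled_md([5.0, 2.0], means, stds, corr)  # breaks the correlation
print(sum(normal_mds) / len(normal_mds), abnormal_md)
```

Only the normal group is used to fit the scale; the abnormal observation is merely measured against it, which is why the class sizes do not need to be balanced.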
Determining the optimal threshold is a critical step for an effective MTS classifier. To determine the appropriate threshold, a loss function approach was proposed by [ 36 ]; however, it is not a practical approach because of the difficulty in specifying the relative cost [ 37 ].
It has been shown in [ 6 ] that the PTM classifier outperformed the MTS classifier; therefore, it has been selected as a benchmark for the proposed classifier. Unfortunately, the PTM method is based on previously assumed parameters, and the accuracy of its classification results was lower than that of the benchmarked classifiers (this is one of the findings of this research, which will be discussed in Results).
The other research area in the MTS is related to the modification of the Taguchi method, not to the threshold determination. Both the MGA and the MTS Particle Swarm Optimization methods deal with the orthogonal array part of the Taguchi system, while the threshold determination still lacks a solid foundation or is hard to determine in practice. Finally, the aim of this research is to enhance the performance of the Mahalanobis Taguchi System (MTS) classifier by providing a scientific, rigorous, and systematic method of determining the binary classification threshold that discriminates between the two classes, which can be applied to the MTS and its variants i.
The currently used approaches either are difficult to use in practice, such as the loss function [ 36 ], due to the difficulty in evaluating the cost in each case, or are based on previously assumed parameters [ 6 ].

Prerequisite: Obtain healthy (negative) and unhealthy (positive) observations.
Split the obtained data into two groups: training and validation.
Initialization, let:
Mahalanobis Taguchi system: a review
Modified Mahalanobis Taguchi System for Imbalance Data Classification