Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented). These terms are used both in statistical sampling, survey design methodology and in machine learning. Oversampling and undersampling are opposite and roughly equivalent techniques. There are also more complex oversampling techniques, including the creation of artificial data points
Quote from Wikipedia.
Oversampling
Oversampling consists of adding instances to a minority class, either by duplicating existing instances or by generating synthetic ones, instead of just deleting instances from a majority class.
There are several algorithms that perform oversampling:
SMOTE
ADASYN
Random Oversampling
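To make the simplest of these concrete, here is a minimal sketch of random oversampling in plain Python. The function name `random_oversample` and the toy data are made up for illustration; this is not code from any particular library.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # All original rows that belong to this class.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy dataset: four rows of class "B" and a single row of class "A".
rows = [[1.0], [1.1], [0.9], [1.2], [5.0]]
labels = ["B", "B", "B", "B", "A"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))  # both classes now have 4 rows
```

Random oversampling only repeats rows that already exist, whereas SMOTE and ADASYN create new points.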
Oversampling with Azure Machine Learning
SMOTE takes the entire dataset as an input, but it increases the percentage of only the minority cases. For example, suppose you have an imbalanced dataset where just 1% of the cases have the target value A (the minority class), and 99% of the cases have the value B. To increase the percentage of minority cases to twice the previous percentage, you would enter 200 for SMOTE percentage in the module's properties.
Quote from the official Microsoft documentation.
Thus, you only need to provide the number of nearest neighbors, the SMOTE percentage and your label column. The module then generates a new dataset that includes the synthetic instances.
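To give an intuition for what the nearest-neighbors setting controls, here is a rough sketch of the core SMOTE idea: each synthetic point lies on the segment between a minority instance and one of its k nearest minority neighbors. It is not the Azure module's implementation. The names `smote_sketch`, `k` and `n_synthetic` are hypothetical, and `n_synthetic` only loosely plays the role of the SMOTE percentage, since the sketch does not reproduce how the module turns a percentage into a count.

```python
import math
import random


def smote_sketch(minority, k=2, n_synthetic=4, seed=0):
    """Create n_synthetic points, each interpolated between a random minority
    point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbors of the base point among the other minority points.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class with two numeric features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, k=2, n_synthetic=4)
print(new_points)
```

Because every synthetic point is an interpolation between two real minority points, it always stays inside the region already covered by the minority class.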
Limitations
However, in Azure Machine Learning Studio you can only apply it to a binary-class dataset, which is quite disappointing. Fortunately, there is a workaround using regular expressions.
Implementation
1) Add your unbalanced data to the experiment
2) Detect the majority class. This will serve as the basis for your normalization experiment.
3) Add a split data module
4) Set splitting mode to "Regular Expression"
5) Enter the following expression
\"Your label column" (your majority class|your minority class 1)
This will select only the majority class and the first minority class.
6) Add SMOTE module and configure it according to your needs
7) Add a new split data module, and again set splitting mode to "Regular Expression"
8) Enter the following expression
\"Your label column" (your minority class 1)
This is needed to exclude your majority class from the newly generated dataset. Because you are going to repeat this procedure N times (where N is the number of minority classes), without this step all the instances of the majority class would be repeated N times in the final dataset, which is wasteful and would inflate the majority class again.
9) At this step your experiment should look like this
10) Repeat steps 3-8 for each minority class (N times in total, where N is the number of minority classes)
11) At this point your experiment should be something like this:
12) The only thing left is to bind all this data together (with the help of the Add Rows module) and finally add the majority class (again using Split Data and Add Rows).
13) To speed up the experiment I have also created a custom R module that binds 5 datasets simultaneously. Here is the code (R file and XML).
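For readers who want to see the logic outside the Studio designer, the steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `binary_oversample` is a made-up stand-in for the SMOTE module (it duplicates rows instead of interpolating), `oversample_multiclass` and `factor` are hypothetical names, and Python's `re` module mimics the two "Regular Expression" splits rather than reproducing the exact Studio syntax.

```python
import random
import re
from collections import Counter


def binary_oversample(rows, minority_label, factor=2, seed=0):
    """Stand-in for the SMOTE module (an assumption): it duplicates minority
    rows, whereas real SMOTE would interpolate new ones."""
    rng = random.Random(seed)
    minority = [r for r in rows if r["label"] == minority_label]
    extra = [dict(rng.choice(minority)) for _ in range(len(minority) * (factor - 1))]
    return rows + extra


def oversample_multiclass(rows, factor=2):
    counts = Counter(r["label"] for r in rows)
    majority = counts.most_common(1)[0][0]  # step 2: detect the majority class
    # Step 12: the majority class is added to the result exactly once.
    result = [r for r in rows if r["label"] == majority]
    for minority in counts:
        if minority == majority:
            continue
        # Steps 3-5: keep only the majority class and the current minority class.
        pair = re.compile(f"({re.escape(majority)}|{re.escape(minority)})")
        subset = [r for r in rows if pair.fullmatch(r["label"])]
        # Step 6: run the binary oversampler on that two-class subset.
        oversampled = binary_oversample(subset, minority, factor)
        # Steps 7-8: drop the majority rows so they are not repeated N times.
        only_minority = re.compile(re.escape(minority))
        result += [r for r in oversampled if only_minority.fullmatch(r["label"])]
    return result


# Toy dataset: majority class "B" and two minority classes "A" and "C".
rows = (
    [{"x": i, "label": "B"} for i in range(8)]
    + [{"x": i, "label": "A"} for i in range(2)]
    + [{"x": i, "label": "C"} for i in range(3)]
)
balanced = oversample_multiclass(rows, factor=2)
print(Counter(r["label"] for r in balanced))
```

With `factor=2`, each minority class doubles in size while the majority class keeps its original 8 rows, which is the effect the second split in steps 7-8 is meant to guarantee.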
Hope you will find this helpful