MeanNearestNeighbors (MNN) - an algorithm for balancing datasets - In progress #1
One of the challenges in classification problems is unbalanced datasets. When I was a Data Science Intern, the company I worked for assigned me exactly such a challenge: a classification task on an unbalanced dataset.

I later realized that unbalanced datasets are a common thing in real life.
I tried most of the undersampling and oversampling algorithms, such as SMOTE, NearMiss, CondensedNearestNeighbour, RandomUnderSampler, RandomOverSampler, KMeansSMOTE and the rest of them. They didn't help in that case. On the contrary, they made my model worse.
I was like: "But, but, you were supposed to help me build the predictive model!"
So, I'm trying to create another algorithm for balancing datasets, based on the undersampling concept.
I called it Mean Nearest Neighbors (MNN).
What's the initial idea?
It's simple. Actually, the algorithm is just a modification of the other undersampling algorithms.
Within the majority class, we aggregate records in pairs by taking their mean. Two records form a pair when the Euclidean distance between them is the smallest. The default metric in scikit-learn's NearestNeighbors model is 'minkowski', and its default parameter is p=2, which is exactly the Euclidean distance. As a result, the majority class is halved while keeping the context in the data, because every new record is the mean of two close neighbors.
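The pairing idea can be sketched roughly as follows. This is only an illustration under my own assumptions, not the final MNN code: the function name mean_nearest_neighbors is hypothetical, pairs are formed greedily starting from the closest ones, and a record whose nearest neighbor was already used is kept unchanged, so the majority class shrinks to roughly half rather than exactly half. It also assumes the majority class has at least two records.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def mean_nearest_neighbors(X, y, majority_label):
    """Undersample the majority class by averaging nearest-neighbor pairs.

    Hypothetical sketch: greedy pairing, unpaired records are kept as-is.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_major = X[y == majority_label]
    X_minor, y_minor = X[y != majority_label], y[y != majority_label]

    # n_neighbors=2 because the closest point to each record is the record itself.
    # Default metric is 'minkowski' with p=2, i.e. Euclidean distance.
    nn = NearestNeighbors(n_neighbors=2).fit(X_major)
    dist, idx = nn.kneighbors(X_major)

    order = np.argsort(dist[:, 1])  # visit the closest pairs first
    used = np.zeros(len(X_major), dtype=bool)
    merged = []
    for i in order:
        # Pick the returned neighbor that is not the record itself.
        j = idx[i, 1] if idx[i, 1] != i else idx[i, 0]
        if used[i] or used[j]:
            continue
        merged.append((X_major[i] + X_major[j]) / 2.0)  # mean of the pair
        used[i] = used[j] = True

    # Assumption: records left without a free partner stay unchanged.
    X_new_major = np.vstack([np.asarray(merged), X_major[~used]])
    X_res = np.vstack([X_minor, X_new_major])
    y_res = np.concatenate([y_minor, np.full(len(X_new_major), majority_label)])
    return X_res, y_res
```

Calling it again on the result would shrink the majority class further, which is one possible way to get closer to a 50/50 balance.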
This is the initial idea, but there is still a lot to improve.
I created a synthetic dataset with four features, each following a standard normal distribution, and of course a target class with 90% ones and 10% zeros.
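A dataset of this shape could be generated along these lines. Note that the rule linking the features to the target is not described above, so the noisy linear score, its weights, the sample size and the name make_synthetic are all hypothetical choices made only for this sketch.

```python
import numpy as np


def make_synthetic(n_samples=1000, minority_share=0.10, seed=0):
    """Four standard-normal features and a 90% / 10% binary target."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, 4))

    # Hypothetical label rule: a noisy linear score of the features.
    weights = np.array([1.0, -1.0, 0.5, 0.0])
    score = X @ weights + rng.standard_normal(n_samples)

    # The lowest 10% of scores become zeros, the rest become ones.
    threshold = np.quantile(score, minority_share)
    y = (score > threshold).astype(int)
    return X, y


X, y = make_synthetic()
```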
The test compared different balancing algorithms, and the predictive model was Logistic Regression.
Here are the results:
Unbalanced dataset: (AUC ROC: 0.5)
RandomOverSampler: (AUC ROC: 0.52)
SMOTE: (AUC ROC: 0.54)
CondensedNearestNeighbour: (AUC ROC: 0.5)
EditedNearestNeighbours: (AUC ROC: 0.5)
NearMiss: (AUC ROC: 0.55)
ClusterCentroids: (AUC ROC: 0.61)
MNN: (AUC ROC: 0.63)
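A comparison of this kind can be set up roughly like this. It is a sketch under assumptions, not the script behind the numbers above: the data comes from scikit-learn's make_classification, random_undersample is a small hypothetical stand-in for any balancer, the 70/30 split is my choice, and the AUC is computed from predicted probabilities. Only the training part is balanced, so the test set keeps the original class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def random_undersample(X, y, seed=0):
    """Stand-in balancer: keep a random subset of each class, sized to the rarest class."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
        for label in labels
    ])
    return X[keep], y[keep]


def evaluate(balancer, X, y, seed=0):
    """Balance the training split only, fit Logistic Regression, return ROC AUC."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    if balancer is not None:
        X_train, y_train = balancer(X_train, y_train)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


# Assumed toy data: 4 features, about 10% zeros and 90% ones.
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.1, 0.9], random_state=0,
)
auc_unbalanced = evaluate(None, X, y)
auc_undersampled = evaluate(random_undersample, X, y)
print(f"Unbalanced: {auc_unbalanced:.2f}, undersampled: {auc_undersampled:.2f}")
```

Any function that takes (X, y) and returns a resampled (X, y) can be passed as the balancer, so other algorithms can be plugged into the same loop.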
The next round of testing was on different real datasets and more models (metric: F1 score):




