machine learning - Training on imbalanced data using TensorFlow

January 15, 2011

The situation:

I am wondering how to use TensorFlow optimally when my training data is imbalanced in label distribution between two labels. For instance, suppose the MNIST tutorial is simplified to distinguish only between 1's and 0's, where all images available to us are either 1's or 0's. This is straightforward to train using the provided TensorFlow tutorials when we have roughly 50% of each type of image to train and test on. But what about the case where 90% of the images available in our data are 0's and only 10% are 1's? I observe that in this case, TensorFlow routinely predicts my entire test set to be 0's, achieving a meaningless accuracy of 90%.

One strategy I have used with some success is to pick random batches for training that have an even distribution of 0's and 1's. This approach ensures that I can still use all of my training data, and it produced decent results: less than 90% accuracy, but a much more useful classifier. Since accuracy is of little use to me in this case, my metric of choice is typically the area under the ROC curve (AUROC), and this approach produces a result respectably higher than 0.50.
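The balanced-batch idea can be sketched in plain Python as follows. This is only an illustration under assumptions: the helper name `balanced_batch`, the toy label list, and the batch size of 64 are hypothetical, and both classes are sampled with replacement, which may differ from what was actually done in the question.

```python
import random

def balanced_batch(labels, batch_size, rng):
    """Return example indices for one batch with equal numbers of 0s and 1s.

    Both classes are sampled with replacement, so minority examples repeat
    often across batches while every majority example stays eligible.
    """
    zeros = [i for i, y in enumerate(labels) if y == 0]
    ones = [i for i, y in enumerate(labels) if y == 1]
    half = batch_size // 2
    batch = [rng.choice(zeros) for _ in range(half)]
    batch += [rng.choice(ones) for _ in range(half)]
    rng.shuffle(batch)  # avoid feeding all 0s first, then all 1s
    return batch

# Toy data: 90% zeros and 10% ones, as in the scenario above.
labels = [0] * 900 + [1] * 100
rng = random.Random(0)
batch = balanced_batch(labels, 64, rng)
n_ones = sum(labels[i] for i in batch)
print(n_ones, len(batch) - n_ones)  # prints "32 32"
```

The returned indices would then be used to pick the images and labels fed to each training step.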

Questions:

(1) Is the strategy I have described an accepted or optimal way of training on imbalanced data, or is there one that might work better?

(2) Since the accuracy metric is not as useful in the case of imbalanced data, is there another metric that can be maximized by altering the cost function? I can certainly calculate AUROC post-training, but can I train in such a way as to maximize AUROC?

(3) Is there some other alteration I can make to my cost function to improve my results for imbalanced data? Currently, I am using a default suggestion given in the TensorFlow tutorials:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

I have heard this may be possible by up-weighting the cost of miscategorizing the smaller label class, but I am unsure of how to do this.
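To make the up-weighting idea concrete, here is a minimal sketch of a class-weighted binary cross-entropy in plain Python. The function name and the choice of `pos_weight = 9.0` (the ratio of 0's to 1's in the 90/10 scenario) are assumptions for illustration, not a recommendation from the original post. TensorFlow offers `tf.nn.weighted_cross_entropy_with_logits`, which takes a `pos_weight` argument and applies the same kind of weighting to logits rather than probabilities.

```python
import math

def weighted_binary_cross_entropy(labels, probs, pos_weight):
    """Mean cross-entropy in which errors on label 1 count pos_weight times."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Nine 0s and one 1, scored by a model that outputs P(label=1) = 0.1 for everything.
labels = [0] * 9 + [1]
always_low = [0.1] * 10
unweighted = weighted_binary_cross_entropy(labels, always_low, pos_weight=1.0)
weighted = weighted_binary_cross_entropy(labels, always_low, pos_weight=9.0)
print(unweighted, weighted)  # the weighted loss is several times larger
```

With the weight applied, a model that leans towards 0 for every input is penalized much more heavily, so predicting all 0's is no longer a cheap solution.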

I'm one of those struggling with imbalanced data too. My strategy for countering imbalanced data is below.

1) Use a cost function that calculates the 0 and 1 labels at the same time, as below.

cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(_pred) + (1-y)*tf.log(1-_pred), reduction_indices=1))

2) Use SMOTE, an oversampling method that makes the number of 0 and 1 labels similar. Refer here: http://comments.gmane.org/gmane.comp.python.scikit-learn/5278

Both strategies worked when I tried to make a credit rating model.

Logistic regression is a typical method for handling imbalanced data and binary classification, such as predicting a default rate. AUROC is one of the best metrics for evaluating models on imbalanced data.
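For readers unfamiliar with SMOTE, the core step can be sketched in a few lines of plain Python: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority-class neighbours. The function name, the toy points, and `k=2` are assumptions for illustration; a maintained library implementation is preferable in practice.

```python
import random

def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class feature vectors (two features each).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2, rng=random.Random(0))
print(len(new_points))  # prints 6
```

The synthetic points are added to the training set alongside the real minority samples until the two classes are roughly the same size.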
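As an illustration of why AUROC suits this setting, it can be computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting as half. The sketch below uses made-up labels and scores; the pairwise loop is quadratic and only meant for small examples.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
print(auroc(labels, scores))     # prints 0.875
print(auroc(labels, [0.0] * 6))  # prints 0.5
```

The second call shows that a model giving every example the same score gets an AUROC of 0.5, however high its accuracy is on imbalanced data.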
