Hobo Chen

Top 10 Data Mining Algorithms (TBD)

These are my notes from the Data Mining course at XJTU.

Confusion Matrix

| Actual class \ Predicted class | C1 | not C1 |
| --- | --- | --- |
| C1 | TP (True Positive) | FN (False Negative) |
| not C1 | FP (False Positive) | TN (True Negative) |

$$ \text{Accuracy} = \frac{TP + TN}{\text{All}} $$
$$ \text{Precision} = \frac{TP}{TP + FP}$$
$$ \text{Recall (Sensitivity)} = \frac{TP}{TP + FN}$$
$$ \text{Specificity} = \frac{TN}{TN+FP} $$
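
As a quick check of the four formulas above, here is a minimal Python sketch. The function name `classification_metrics` and the counts are made up for illustration, and "All" is taken to mean TP + FN + FP + TN.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the four metrics above from confusion-matrix counts."""
    total = tp + fn + fp + tn  # "All" in the accuracy formula
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty denominators (for example, no predicted positives), which a real implementation would need to handle.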

10 Algorithms

Decision Tree

C4.5

C4.5 is an algorithm, developed by Ross Quinlan, that generates a decision tree. It is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.

Entropy

$$ Info(D) = -\sum_{i=1}^{m}p_i\log_2(p_i), \quad 0\log_2 0=0$$
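
A small Python sketch of the entropy formula, assuming p_i is estimated as the fraction of samples in D that belong to class i. The function name `entropy` and the sample labels are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) in bits, with p_i estimated from class frequencies."""
    n = len(labels)
    # Counter only stores classes that occur, so the 0*log(0) case never arises.
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

# Hypothetical label lists.
balanced = ["yes", "yes", "no", "no"]
pure = ["yes", "yes", "yes"]
print(entropy(balanced), entropy(pure))
```

A 50/50 split of two classes gives the maximum of 1 bit, and a pure set gives 0.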

Gain Ratio

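Here attribute A splits D into v partitions D_1, ..., D_v.

$$ Info_A(D) = \sum_{j=1}^{v}\frac{|D_j|}{|D|}Info(D_j) $$
$$ Gain(A) = Info(D) - Info_A(D) $$
$$ SplitInfo(A) = -\sum_{j=1}^{v}\frac{|D_j|}{|D|}\log_2(\frac{|D_j|}{|D|}) $$
$$ GainRatio(A) = Gain(A) / SplitInfo(A) $$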
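
The gain ratio can be sketched in Python as follows. This assumes the usual information gain, Gain(A) = Info(D) minus the size-weighted entropy of the partitions that A induces, and it uses the fact that SplitInfo(A) is the entropy of A's own value distribution. The names `gain_ratio`, `outlook`, and `play` are hypothetical, and returning 0.0 when SplitInfo is zero is a choice made for this sketch, not part of C4.5 as described here.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(attribute_values, labels):
    """GainRatio(A) for one categorical attribute A."""
    n = len(labels)
    # Group the class labels by attribute value: these are the partitions D_j.
    partitions = defaultdict(list)
    for a, y in zip(attribute_values, labels):
        partitions[a].append(y)
    # Expected entropy after splitting on A, weighted by |D_j| / |D|.
    info_after = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy(labels) - info_after
    # SplitInfo(A) is the entropy of the attribute's value distribution.
    split_info = entropy(attribute_values)
    # Assumption: an attribute with a single value gets ratio 0.0.
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: the attribute separates the classes perfectly.
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
print(gain_ratio(outlook, play))
```

Dividing by SplitInfo penalizes attributes with many distinct values, which plain information gain tends to favor.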

CART

$$
Gini = 1 - \sum_i p_i^2
$$

$$
G_{split} = \sum_j \frac{|D_j|}{|D|} Gini(D_j)
$$
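
A minimal Python sketch of the Gini index and of the weighted Gini of a split, reading the split formula as the size-weighted average of each partition's Gini index. The function names and the toy partitions are illustrative assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum_i p_i^2, with p_i from class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(partitions):
    """Weighted Gini of a split: sum_j |D_j| / |D| * Gini(D_j)."""
    total = sum(len(p) for p in partitions)
    # Empty partitions carry no weight, so they are skipped.
    return sum(len(p) / total * gini(p) for p in partitions if p)

# Hypothetical partitions of four samples with classes "a" and "b".
clean_split = [["a", "a"], ["b", "b"]]   # each side is pure
mixed_split = [["a", "b"], ["a", "b"]]   # each side is 50/50
print(gini_split(clean_split), gini_split(mixed_split))
```

CART prefers the split with the lowest weighted Gini, so `clean_split` would be chosen over `mixed_split` here.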

K-means

SVM

Apriori

PageRank

AdaBoost

K-NN

Naive Bayes