Ch1

Data Mining Tasks
- Prediction Methods
  - Use some variables to predict unknown or future values of other variables.
- Description Methods
  - Find human-interpretable patterns that describe the data.
- classification / regression / clustering / Association Rule Discovery / Anomaly Detection
Challenges
- Scalability
- High Dimensionality
- Heterogeneous and Complex Data
- Data Ownership and Distribution
- Non-traditional Analysis

Ch2 data

what is data
- Collection of data objects and their attributes
- An attribute is a property or characteristic of an object
- A collection of attributes describe an object
Attribute Values
- Attribute values are numbers or symbols assigned to an attribute
- Distinction between attributes and attribute values
  - Same attribute can be mapped to different attribute values（不同单位）
- Different attributes can be mapped to the same set of values
Types of Attributes
- Nominal：不可比
- Ordinal：可以比较顺序
- Interval（日期，摄氏度）：可以加减
- Ratio（长度，时间）：可以加减乘除
Data sets
- Record / Graph / Ordered
- Characteristics
  - Dimensionality / Sparsity / Resolution / Size
Data quality
- Noise and outliers
- missing values
  - Eliminate Data Objects
  - Estimate Missing Values
  - Ignore the Missing Value During Analysis
  - Replace with all possible values (weighted by their probabilities)
- duplicate data
  - Data cleaning
Data Preprocessing
- Aggregation
  - for Data reduction / Change of scale（城市聚合为国家） / More “stable” data（方差更小）
- Sampling
  - for data selection
  - target: representative
- Simple Random Sampling / Sampling without replacement / Sampling with replacement / Stratified sampling
Dimensionality Reduction
- Purpose
  - Avoid curse of dimensionality
  - Reduce amount of time and memory required by data mining algorithms
  - Allow data to be more easily visualized
  - May help to eliminate irrelevant features or reduce noise
- Techniques
  - Principle Component Analysis
  - Singular Value Decomposition
  - supervised and non-linear techniques
Feature Subset Selection
- reduce dimensionality of data
- Techniques
  - Brute-force approch：寻找feature的所有子集
  - Embedded approaches：特征选择是数据挖掘算法的一步
  - Filter approaches：在算法之前对特征进行选择
  - Wrapper approaches：将数据挖掘算法看成是选择特征的黑箱
Feature Creation
- 创造新的属性，能比原属性更好的抓住数据的特点
- Techniques
  - Feature Extraction：domain-specific
  - Mapping Data to New Space：Fourier transform
  - Feature Construction：dividing mass by volume to get density
Discretization
- 基于频率 / 等宽划分 / 聚类算法 / 基于熵的划分
Attribute Transformation
- Standardization and Normalization
Similarity and Dissimilarity
- 度量相似性（0-1），和不相似性（0- $\infty$ ）
- 对于不同类型的数据，有不同的方法
- Binary Vectors
  - Simple Matching / Jaccard Coefficients / Cosine Similarity
  - SMC = number of matches / number of attributes = $\frac{M_{11}+ M_{00}}{M_{01}+M_{10}+M_{11}+M_{00}}$
  - J = number of 11 matches / number of not-both-zero attributes values = $\frac{M_{11}}{M_{01}+M_{10}+M_{11}}$
  - $\cos \left(d_{1}, d_{2}\right)=\left(d_{1} \bullet d_{2}\right) /\left\|d_{1}\right\|\left\|d_{2}\right\|$
- 连续属性
  - $E J(\mathbf{x}, \mathbf{y})=\frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|^{2}+\|\mathbf{y}\|^{2}-\mathbf{x} \cdot \mathbf{y}}$
- 距离度量（metric）
  - Euclidean Distance
  - Minkowski Distance（generalization）
  - 无穷范式距离（最大值）
- 相关性
  - $p_{k}^{\prime}=\left(p_{k}-\operatorname{mean}(p)\right) / \operatorname{std}(p)$
  - $q_{k}^{\prime}=\left(q_{k}-\operatorname{mean}(q)\right) / \operatorname{std}(q)$
  - correlation $(p, q)=p^{\prime} \bullet q^{\prime}$

Ch3 Exploring Data

Summary Statistics
- Frequency and Mode
- Percentiles
Mean and Median
Range and Variance：sensitive to outliers
Visualization
- objects -> points
- attribute values -> color, size, shape
- relationships -> form group / outlier
- Techniques
  - Histograms: Usually shows the distribution of values of a single variable
  - Box Plots: can be used to compare attributes
  - Scatter plots: graphically show the relationship between two attributes.
  - Matrix plots:

Ch4 Classification

Confusion matrix
- false negative: 错误的认为没有患病
- false positive：错误的认为患病了

Decision Tree

Hunt's algorithm
- 如果Dt中所有的记录都属于同一个类yt，则结点t是叶子结点，用yt标记；
- 如果Dt中包含多个类的记录，则选择一个属性测试条件，将记录划分为较小的子集。对于测试条件的每个输出，创建一个子女结点，并根据测试结果将Dt中的记录分布到子女结点中，然后对每个子女结点递归地调用该算法

如何根据属性划分子集？
- 多路划分：根据值的不同划分
- 二元划分：选择最佳partition进行划分
如何选择最佳属性划分？
- Gini index（越小越好）
  - $G I N I(t)=1-\sum_{j}[p(j | t)]^{2}$
  - $p(j|t)$ 是在t节点上，j类的相对频率
  - 最大值为 $1-\frac{1}{n}$ ，最小值为0（最纯，只属于一个类）
  - 最后计算划分的所有子节点的Gini系数： $G I N I_{\text { split }}=\sum_{i=1}^{k} \frac{n_{i}}{n} G I N I(i)$
  - 如何计算？
    - 对于连续属性，首先sort，然后选取每个值作为划分，选择Gini系数最小的值作为阈值划分点。
      - 改进：只对label突变的点进行计算，因为这时候才可能Gini变得最小
- Entropy
  - Entropy $(t)=-\sum_{j} p(j | t) \log p(j | t)$
  - $GAIN_{split} = Entropy(p) - (\sum \frac{n_i}{n}Entropy(i))$
  - 熵越小，越纯，选择信息增益最大的作为划分
  - 缺点
    - 倾向于选择划分更多的，例如1/3而不是1/2，造成每个子节点个数过少，容易过拟合
- Gain Ratio
  - $SplitINFO=-\sum_{i=1}^{k} \frac{n_{i}}{n} \log \frac{n_{i}}{n}$
  - $GainRatio_{split} = \frac{Gain_{split}}{SplitINFO}$
  - 可以避免倾向于选择large number of small partitions的问题
什么时候停止分裂？
- 当某个节点的所有记录都属于一个类别时
- 当某个节点的所有记录都具有相同属性时
- 剪枝
  - 预剪枝
    - 如果所有的样本都是一个类/属性都相同
    - 如果样本数小于阈值
    - 如果继续分割，不能提高纯度
  - 后剪枝
    - 子树提升：将经常用到的子树提高到上层
    - 子树替代：从下到上进行剪枝，如果泛化误差减少，则用节点代替
优缺点
- 优点
  - 非参数方法，易解释性，抗噪声，很好的处理冗余属性
- 缺点
  - 寻找树的结构是NP难问题，通常用贪心算法不能求得最优解。
  - 叶子节点的样本可能过少，在统计意义上不显著
  - 没有考虑属性之间的交互作用

模型选择

欠拟合
- 模型过于简单，训练集和测试集上的误差都很大。模型还没有学习到数据本身的结构
过拟合
- 模型过于复杂，训练集上的误差很小，但测试集上很大。
处理过拟合的基本方法
- 由于噪声，会导致过拟合
- 由于数据lack of representative samples。增加样本量可以减少过拟合。
模型选择
- 使用验证集
  - 用来估计泛化误差（但训练的数据变少了）
- 考虑模型复杂度
  - 简单的模型更好，使用L2正则项等
- 估计统计的误差上界
模型评估
- holdout方法
  - 没有用到所有的数据集
  - 有些样本可能没有被使用，有的被重复使用
  - 训练集和测试集不相互独立
- 交叉验证
  - 最大限度的用到了所有数据
  - 测试集之间相互独立，而且覆盖到了所有的数据集
- Bootstrap
  - 每次抽样后会放回，因此一个样本可能多次出现在训练集中。

Ch5.6 Ensemble

使用多个基分类器构建一个更强的分类器，但每个基分类器的分类能力高于随机猜测时，ensemble得到的分类器有更强的分类能力。

ensemble对于unstable classifiers有更好的分类能力。

Bias-Variance Decomposition

$d_{f, \theta}(y, t)=\operatorname{Bias}_{\theta}+ Variance _{f}+ Noise _{t}$

Error反映的是整个模型的准确度

Bias：反映的是模型在样本上的输出与真实值之间的误差，即模型本身的精准度

Variance：反映的是模型每一次输出结果与模型输出期望之间的误差，即模型的稳定性

Bias和模型复杂度的关系

当模型复杂度上升时，Bias减小。当模型复杂度降低时，Bias增加。（反比关系）

Variance和模型复杂度的关系

当模型复杂度低时，Variance更低，当模型复杂度高时，Variance更高。（正比关系）

例如，k-neast neighbor方法的bias比决策树低，但variance比决策树高。

Bagging

使用boostrap对数据进行抽样，并训练一个分类器。

最后使用投票的方式决定最终样本的label。

bagging方法通过减少基模型的variance来提高模型的泛化误差。
bagging的好坏取决于基分类器的稳定性
- 如果基分类器不稳定，那么bagging可以通过降低数据中的随机误差来降低误差
- 如果分类器很稳定，那么bagging的误差主要由基分类器的bias决定。而bagging减少了训练集，构建的模型可能更差。
在存在noise时，更容易避免overfitting。

Boosting

通过关注于前面错分的样本，来实现自适应的分类器。

给定每个样本一个权重，初始相同
- 可用于使用boostrap抽样，也可用于基分类器的模型学习
预测正确的样本的权重会降低，而错分的样本权重会增加

Boosting算法可分为两类：

如何更新样本的权重
如何根据基分类器进行预测

AdaBoost

Error rate： $\epsilon_{i}=\frac{1}{N}\left[\sum_{j=1}^{N} w_{j} I\left(C_{i}\left(\mathbf{x}_{j}\right) \neq y_{j}\right)\right]$
Importance of a classifier： $\alpha_{i}=\frac{1}{2} \ln \left(\frac{1-\varepsilon_{i}}{\varepsilon_{i}}\right)$
Weight update：

$w_{i}^{(j+1)}=\frac{w_{i}^{(j)}}{Z_{j}} \times\left\{\begin{array}{ll}{\exp ^{-\alpha_{j}}} & {\text { if } C_{j}\left(\mathbf{x}_{\mathbf{i}}\right)=y_{i}} \\ {\exp ^{\alpha_{j}}} & {\text { if } C_{j}\left(\mathbf{x}_{\mathbf{i}}\right) \neq y_{i}}\end{array}\right.$

如果某个中间阶段的预测误差大于50%，那么权重会重新变为1/n
- Classification： $C^{*}(x)=\underset{y}{\arg \max } \sum_{j=1}^{T} \alpha_{j} \delta\left(C_{j}(x)=y\right)$

由于boosting方法主要关注于错分样本，因此有可能会造成过拟合。

Random forests

对样本和feature都进行bootstrap抽样，然后使用决策树进行学习，最后进行投票。

Ch7 Imbalanced class

某些问题中，数据的类别有较大的倾向性，例如欺诈检测，不合格商品抽查等。

使用准确率来衡量，不是一个好的方式。

Matrices

例如，如果一个数据集中只有10个正例，且全部未被识别，但准确率还是很高
Precision： $p = \frac{TP}{TP+FP}$ ，被预测为正类的有多少是positive
Recall(True Positive Rate)： $r = \frac{TP}{TP+FN}$ ，本来是正类的，有多少被预测正确
False Positive Rate： $\frac{FP}{FP+TN}$
$F_1 = \frac{2}{\frac{1}{r} + \frac{1}{p}}$

ROC

通过TPR和FPR，来衡量模型好坏，越靠近左上角越好。

做法是：使用分类器得到每个样本分类的概率，然后遍历阈值得到。

AUC：ROC曲线下的面积。

特点：当测试集中的正负样本的分布变换的时候，ROC曲线能够保持不变。在实际的数据集中经常会出现样本类不平衡，即正负样本比例差距较大，而且测试数据中的正负样本也可能随着时间变化。

Cost Matrix

通过对不同的错误率分配不同的权重，从而对unbalanced数据的cost进行衡量。

对于二分类来说：

Cost(+) = p(+|x) C(+,+) + p(+|x) C(+,-)
Cost(-) = p(-|x) C(-,+) + p(-|x) C(-,-)
选择cost小的作为样本的label

sampling

可以通过重采样来改变样本数据的分布，来更好的代表数据

对majority进行undersample
- 某些代表性的样本可能未被选中
对rare class进行oversample
- 可能导致过拟合，增加训练时间

ch6 Association analysis

基本概念
- support count：某个itemset出现的频数
- support：transaction中包含某个itemset的频率： $s(X \longrightarrow Y)=\frac{\sigma(X \cup Y)}{N}$
- frequent itemset：超过最小支持度的itemset
- confidence：衡量在Y中的itemset出现在X中： $c(X \longrightarrow Y)=\frac{\sigma(X \cup Y)}{\sigma(X)}$
算法
1. Frequent Itemset Generation
  - 首先从所有的itemset中得到support大于阈值的集合
2. Rule Generation
  - 然后从每个itemset中得到高置信度的集合

Frequent Itemset Generation

暴力枚举
- 复杂度 $O(N\times2^d \times w)$
- 其中N是transaction的个数，w是每个itemset的长度
有两种减少复杂度的方式
Reduce the number of candidates ( $2^d$ $2^{d}$ )
- Apriori算法
  - 如果一个itemset是频繁的，那么它的所有子集也都是频繁的
  - 如果一个itemset是不平凡的，那包含它的超集也都是不频繁的
  - 因此，从1-itemsets开始，每次将小于min sup的子集删除，然后再对2-itemset搜索
Candidate Generation
- 如何有效生成子集？
- Merget $F_{k-1}$ $F_{k - 1}$ 和 $F_1$ $F_{1}$
  - 有些生成的集合是不必要的，因为其子集是非频繁的
  - 可能会生成很多相同的子集、同时还是有很多非必须的候选集合
- $F_{k-1} \times F_{k-1}$ $F_{k - 1} \times F_{k - 1}$
  - 只合并那些， $k-2$ $k - 2$ 个前面的item是相同的
    - Merge(ABC,ABD) = ABCD
  - 还有一种实现的方式是将第一个 $F_{k-1}$ $F_{k - 1}$ 的后 $k-2$ $k - 2$ 个与另一个 $F_{k-1}$ $F_{k - 1}$ 的前 $k-2$ $k - 2$ 个合并
    - Merge(ABC,BCD) = ABCD
Reducing Number of Comparisons
- 如何查找其子集是否为频繁的？
- 最简单的方式是一个个去数据库中查找
- 使用hash table进行比较
  - 定义hash function
  - 定义最大叶子节点容纳的itemset数量，也就是

Rule Generation

现在已经选出满足支持度的子集，但对于每个子集，其置信度不同
若 $X'\in X$ ，那么如果 $X -> Y- X$ 不满足置信度，则所有的 $X'->Y-X'$ 都不满足
从同一个子集生成的置信度规则满足单调递减性质：
- $\mathrm{c}(\mathrm{ABC} \rightarrow \mathrm{D}) \geq \mathrm{c}(\mathrm{AB} \rightarrow \mathrm{CD}) \geq \mathrm{c}(\mathrm{A} \rightarrow \mathrm{BCD})$
- 因此，若 $\mathrm{c}(\mathrm{ABC} \rightarrow \mathrm{D})$ 不满足，则后面的都不满足
Maximal Frequent Itemset
- 提供了一个compact representation of frequent itemsets
- 没有超集是frequent的
- 没有提供子集的支持度信息
Closed itemset
- 如果没有它的超集和他的置信度一样，则为closed itemset
- {b,c} is closed, {b}->{d,e} is redundant, as it has the same support and confidence as {b,c}->{d,e}
Frequent itemsets > closed frequent itermset > maximal frequent itemsets

FP-growth Algorithm

用类似于trie tree的结构表示item，每个节点存储当前items的个数，同时用指针指向同样值的节点

Measure for Association Rules

置信度可能会造成误解
- 需要保证 Confidence(X -> Y) > support(Y)
interset factor
- $L i f t=\frac{c(X \rightarrow Y)}{s(Y)}$ $L i f t = s ( Y ) c ( X \to Y )$
  - 小于1，表示负相关
  - 缺点？
- iterest factor $=\frac{s(X, Y)}{s(X) * s(Y)}=\frac{N f_{11}}{f_{1+f}+1}$
If P(X,Y) > P(X) P(Y) : X & Y are positively correlated
If P(X,Y) < P(X) P(Y) : X & Y are negatively correlated
度量的性质
- Symmetric measures:
  - support, lift, collective strength, cosine, Jaccard,
- Asymmetric measures:
  - confidence, conviction, Laplace, J-measure
- Invariant measures（增加一些与计算无关的数据不会改变计算的结果，增加了两者都不出现的概率，不影响计算两者同时出现的概率）
  - support, cosine, Jaccard, etc
- Non-invariant measures
  - correlation, Gini, mutual information, odds ratio, etc
$\theta$ $θ$ 系数
- $\frac{P(A, B)-P(A) P(B)}{\sqrt{P(A) P(1-P(A))(1-P(B))}}$
- 适用场景：0和1同等重要的情况下，该方法适用，算出来的值是一样的
- 但是不适合用于某一个变量更加重要的情况，即不适用于有侧重的情况

Ch7 Advanced

如何将关联分析应用到非binary的属性呢？
- one-hot向量化
  - 对于同一个属性，每次只需要考虑有1的那一列
- 将一些属性出现比较少的全部合并起来
  - 减少列的个数
  - 更容易找到关联规则
- 如果数据有偏，则丢弃频繁项
处理连续属性
- 离散化处理
  - 等宽 / 等深 / 聚类
  - 间距如果太宽
    - 可能会merge很多其他属性
    - 可能会丢失一些属性
  - 间距如果太窄
    - 某些属性可能会分成多个
    - 某些窗口可能太小，不满足阈值
- 统计方法
  - 计算统计特征：方差 / 均值 / 中位数
  - 使用卡方统计量
- Min-apriori
  - 如果直接进行01化，会丢失数量信息
  - 对每一个属性进行标准化，使得其总support为1
  - $\sup (C) = \sum \min D(i,j)$ $sup (C) = \sum min D (i, j)$
    - 对每一个属性取最小的，然后加起来
    - 随着属性增多，support减小（单调递减）
Multi-level Association Rules
- 为什么用层次关系来表示
  - 某些底层的关系可能支持度不够
  - 某些高层的关系可能太过泛化
- 实现方式
  - 在transaction中加入higher level的属性
    - 这样会增加higher level的支持度
    - 增加数据的维度
    - 会产生冗余的规则
  - 首先在highest level中产生规则，然后递归求 next highest
    - 数据处理时间增加
    - 可能会错过某些cross-level的规则

Sequence data

一个sequence是一系列element的集合，每一个element是一个events的集合
sequence是element的个数
看是否contain
- 保持相对顺序，在subsequence中的每个element是否为data sequence的子集

Generalized Sequential Pattern (GSP)

Candidate Generation
- k-subsequence
  - 可以在同一个item中出现，也可以在不同item中出现，序列中考虑相同的item在前后两次中重复出现
- Merge sequence（不能用 $F_{k-1}\times F_{k-1}$ $F_{k - 1} \times F_{k - 1}$ 的方法，因为需要保证顺序）
  - w 1 =<{1} {2 3} {4}> and w 2 =<{2 3} {4 5}>
    - < {1} {2 3} {4 5}> （最后两个event属于相同的element）
  - w 1 =<{1} {2 3} {4}> and w 2 =<{2 3} {4} {5}>
    - < {1} {2 3} {4} {5}>
  - w 1 =<{1} {2 6} {4}> and w 2 =<{1} {2} {4 5}>
    - < {1} {2 6} {4 5}>
Candidate Pruning

Subgraph mining

基本概念
- Labeled graph：节点和边构成
- Subgraph：节点和边都在原图的节点和边的子集中
- Induced subgraph：节点所连接的边，若出现在原图中，则一定出现在induced subgraph
可以用transaction来表示图，每一个item是边和两点
support
- 是多少个图的子图
  Apriori-like approach
- 子图增长方式
  - Vertex growing
  - Edge growing
- candidate generation
  - merge会产生多个子图（看PPT）
  - identical vertex labels
  - Core contains identical labels
  - Core multiplicity（相同的结构有多个）
- 拓扑等价
  - 根据拓扑等价来决定生成的边

Ch10 Anomaly Detection（不考）

异常
- 数量并不一定少（只是相对于正常数据的比例低）
- 上下文很重要，判断什么是异常
Cause
- Data from different classes
- Natural variation
- Data errors
Distinction Between Noise and Anomalies
- Noise is erroneous, perhaps random, values or contaminating objects
  - not interesting
- Anomalies may be interesting if they are not a result of noise
General Issues
- 某些变量单看是正常的，但合起来看就不是正常的（例如身高年龄）
- anomaly scoring
  - 如果只是binary，数据被分为正常或者不正常（要求有标签）
  - 打分，判断是异常的程度
- 可以一次只找出一个异常点，或多个异常点（聚类）
- 如何评估
  - 在有监督的情况下，直接根据Confusion Matrix评判
  - 将检测出的异常点去除，看是否效果更好
Visual Approaches
- Boxplots or scatter plots
- Limitations
  - Not automatic
  - Subjective
Statistical Approaches
- Normal Distributions
- Grubbs’ Test
- Likelihood Approach
- Strengths/Weaknesses
  - Firm mathematical foundation
  - Can be very efficient
  - Good results if distribution is known
    - In many cases, data distribution may not be known
  - For high dimensional data, it may be difficult to estimate the true distribution
  - Anomalies can distort the parameters of the distribution
Distance-Based Approaches
- The outlier score of an object is the distance to its kth nearest neighbor
- Strengths/Weaknesses
  - Simple
  - Expensive – $O(n^2 )$
  - Sensitive to parameters
  - Sensitive to variations in density
  - Distance becomes less meaningful in highdimensional space
Density-Based Approaches
- Consider the density of a point relative to that of its k nearest neighbors
  - 也就是比较当前点和邻近点density之间的差异
- 优缺点和Distance-Based Approaches一样。
Clustering-Based Approaches
- An object is a cluster-based outlier if it does not strongly belong to any cluster

据挖掘课程大纲回顾

Ch1