04/02/2022 KevinZonda
Expectation-Maximization is EM.
- Segment data into clusters, such that there is:
- high intra-cluster similarity
  i.e., within a group, similarity is high
- low inter-cluster similarity
  i.e., between groups, similarity is low
- Informally, finding natural groupings among objects
Data:
Each data point is m-dimensional:
Define a distance function (i.e. similarity measures) between data:
Goal: segment the data into clusters
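Euclidean: $d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$
Manhattan: $d(x, y) = \sum_{i=1}^{m} |x_i - y_i|$
Chebyshev: $d(x, y) = \max_{1 \le i \le m} |x_i - y_i|$
Minkowski (order $p \ge 1$): $d(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^p \right)^{1/p}$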
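
A minimal Python sketch of the four distances named above, assuming each data point is a plain tuple of m numbers. The function names and the sample points are illustrative and do not come from any particular library.

```python
import math

def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    """Generalisation of order p: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Hypothetical 2-dimensional points, used only for illustration.
x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))     # 5.0
print(manhattan(x, y))     # 7.0
print(chebyshev(x, y))     # 4.0
print(minkowski(x, y, 1))  # 7.0
```

Minkowski distance covers the others as special cases: order 1 is Manhattan, order 2 is Euclidean, and Chebyshev is the limit as the order grows without bound.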
- Partitional clustering. e.g. K-means, K-medoids
- Hierarchical clustering
Bottom-up (agglomerative)
Top-down (divisive)
- Density-based clustering, e.g. DBSCAN
- Mixture density based clustering
- Fuzzy theory based
- Graph theory based
- Grid based
- etc.
- Create a hierarchical decomposition of the set of objects using some criterion
- Produce a dendrogram (a tree-shaped diagram)
- Place each data point into its own singleton group
- Repeat: iteratively merge the two closest groups
- Until: all the data are merged into a single cluster
In other words: treat each data point as its own group, compute the distances between data points, and merge the two closest.
Output:
Height
8          +-------+-------+
6      +---+---+           |
4    +-+-+     |           |
2    |   |     |           |
     *   *     *           *
     a1  a2    a3          a4
Distance between:
d(a1, a2) = 4
d(a3, (a1, a2)) = 6
d((a1, a2, a3), a4) = 8
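
The merge loop and the example distances above can be sketched in Python as follows. Two things here are assumptions, not part of the notes: the 1-D coordinates (chosen so that the merge heights come out as 4, 6, 8) and the use of single linkage, where the distance between clusters is the smallest distance between any two of their members. The names `agglomerate` and `single_linkage` are illustrative.

```python
from itertools import combinations

def single_linkage(c1, c2, dist):
    """Cluster distance = smallest distance between any two members (assumed linkage)."""
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerate(points, dist, linkage=single_linkage):
    """Bottom-up clustering; returns the merges as (cluster_a, cluster_b, height)."""
    def cluster_distance(pair):
        a, b = pair
        return linkage([points[n] for n in a], [points[n] for n in b], dist)

    # Start: every point is its own singleton cluster (a tuple of labels).
    clusters = [(name,) for name in points]
    merges = []
    # Repeat until all the data are merged into a single cluster.
    while len(clusters) > 1:
        # Find the two closest clusters and merge them.
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((a, b, cluster_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical 1-D coordinates, chosen to reproduce the heights 4, 6, 8.
points = {"a1": 0, "a2": 4, "a3": 10, "a4": 18}
merges = agglomerate(points, dist=lambda p, q: abs(p - q))
for a, b, height in merges:
    print(a, "+", b, "at height", height)
```

The recorded heights are the levels at which the horizontal bars of the dendrogram are drawn. The sketch recomputes every cluster distance on each pass, which is simple to read but slower than practical implementations.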
- Output: a dendrogram
- Relies on: a distance metric between clusters
- Provides deterministic results
- No need to specify the number of clusters beforehand
- Can create clusters of arbitrary shapes
- Does not scale to large datasets: time complexity is at least O(n^2)
- Different decisions about similarity can lead to vastly different dendrograms
- The algorithm imposes a hierarchical structure on the data, even on data for which such a structure is not appropriate
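
To illustrate how the choice of cluster distance changes the dendrogram, here is a small Python sketch that runs the same bottom-up procedure with two linkages. The 1-D values and the helper name `merge_heights` are hypothetical; `min` stands in for single linkage (closest pair of members) and `max` for complete linkage (farthest pair).

```python
def merge_heights(values, linkage):
    """Agglomerate 1-D values; return the height of each merge, in order."""
    clusters = [[v] for v in values]
    heights = []
    while len(clusters) > 1:
        best = None
        # Scan every pair of clusters for the smallest linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(p - q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return heights

values = [0, 4, 10, 18]  # hypothetical data
print(merge_heights(values, min))  # single linkage:   [4, 6, 8]
print(merge_heights(values, max))  # complete linkage: [4, 8, 18]
```

With single linkage, the point at 10 joins the group {0, 4} at height 6. With complete linkage, it joins the point at 18 first, at height 8, so the two trees differ in shape as well as in height.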