Communication-Efficient Privacy-Preserving Clustering
Geetha Jagannathan(a),(*), Krishnan Pillaipakkamnatt(b), Rebecca N. Wright(a), Daryl Umano(b)
Transactions on Data Privacy 3:1 (2010) 1 - 25
(a) Department of Computer Science, Rutgers University, New Brunswick, NJ, USA.
(b) Department of Computer Science, Hofstra University, Hempstead, NY,USA.
e-mail:geetha @cs.rutgers.edu; csckzp @hofstra.edu; Rebecca.Wright @rutgers.edu; dumano33 @hotmail.com
The ability to store vast quantities of data and the emergence of high speed networking
have led to intense interest in distributed data mining. However, privacy concerns, as well as regulations,
often prevent the sharing of data between multiple parties. Privacy-preserving distributed
data mining allows the cooperative computation of data mining algorithms without requiring the
participating organizations to reveal their individual data items to each other.
This paper makes several contributions. First, we present a simple, deterministic, I/O-efficient kclustering
algorithm that was designed with the goal of enabling an efficient privacy-preserving version
of the algorithm. Our algorithm examines each item in the database only once and uses only
sequential access to the data. Our experiments show that this algorithm produces cluster centers
that are, on average, more accurate than the ones produced by the well known iterative k-means
algorithm, and compares well against BIRCH.
Second, we present a distributed privacy-preserving protocol for k-clustering based on our new clustering
algorithm. The protocol applies to databases that are horizontally partitioned between two
parties. The participants of the protocol learn only the final cluster centers on completion of the protocol.
Unlike most of the earlier results in privacy-preserving clustering, our protocol does not reveal
intermediate candidate cluster centers. The protocol is also efficient in terms of communication and
does not depend on the size of the database. Although there have been other clustering algorithms
that improve on the k-means algorithm, ours is the first for which a communication efficient cryptographic
privacy-preserving protocol has been demonstrated.
* Corresponding author.