Privacy in Multiple On-line Social Networks -- Re-identification and Predictability
David F. Nettleton(a),(*), Vladimir Estivill-Castro(a), Julián Salas(b)
Transactions on Data Privacy 12:1 (2019) 29 - 56
(a) Web Science and Social Computing Research Group, Department of Information and Communications Technology (DTIC), Universitat Pompeu Fabra, UPF Tanger Building, 08018 Barcelona, Catalonia, Spain.
(b) Internet Interdisciplinary Institute (IN3), Universitat Oberta de Catalunya (UOC), Parc Mediterrani de la Tecnologia (Edifici B3), Av. Carl Friedrich Gauss, 5, 08860 Castelldefels (Barcelona), Spain.
e-mail: david.nettleton@upf.edu; vladimir.estivill@upf.edu; jsalaspi@uoc.edu
We consider the re-identification of users who participate in several on-line social networks, potentially under several different accounts. Re-identification serves several purposes: (i) commercial use, such as avoiding redundant mailings to the same user; (ii) enriching the information available about users by unifying data from different sources; (iii) consolidation of accounts by on-line social network providers; and (iv) identification of potentially malicious users and/or bots. We stress that all of this should occur within the bounds of data protection and privacy laws, and in line with users' expectations on such matters, to avoid backlash. In this paper, we first formalize the situation using the SAN (Social-Attribute Network) model, which structures information conceptually as a graph containing both user nodes and attribute nodes. This formalization enables us to reason about two issues: first, how to determine that two or more user accounts belong to the same user; second, what gains in predictability are obtained after re-identification. For the first issue, we show that a set-difference approach is remarkably effective. For the second, we explore the impact of re-identification on predictability using two machine learning algorithms: C4.5 (decision tree induction) and SVM-SMO (a Support Vector Machine trained with Sequential Minimal Optimization). Our results show that as predictability improves, in some cases different SAN metrics emerge as predictors.
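To make the set-difference idea concrete, the following is a minimal sketch and not the procedure evaluated in the paper. It assumes that each account in a SAN-style graph can be summarized by the set of attribute nodes it links to, and that two accounts are candidates for the same user when the symmetric difference of those sets is small. The account names, attribute labels, and the functions `set_difference_distance` and `best_matches` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy data: each account maps to the set of attribute
# nodes it is linked to in a SAN-style graph (one dict per network).
network_a = {
    "a1": {"city:Barcelona", "job:engineer", "hobby:chess"},
    "a2": {"city:Madrid", "job:teacher", "hobby:running"},
}
network_b = {
    "b1": {"city:Madrid", "job:teacher", "hobby:cycling"},
    "b2": {"city:Barcelona", "job:engineer", "hobby:chess"},
}


def set_difference_distance(attrs_x, attrs_y):
    """Number of attributes held by exactly one of the two accounts.

    Assumption: the symmetric difference is used as the distance;
    the paper's exact definition may differ.
    """
    return len(attrs_x ^ attrs_y)


def best_matches(net_x, net_y):
    """For each account in net_x, find the closest account in net_y."""
    matches = {}
    for x, attrs_x in net_x.items():
        best = min(
            net_y,
            key=lambda y: set_difference_distance(attrs_x, net_y[y]),
        )
        matches[x] = (best, set_difference_distance(attrs_x, net_y[best]))
    return matches


matches = best_matches(network_a, network_b)
for account, (candidate, distance) in matches.items():
    print(f"{account} -> {candidate} (distance {distance})")
```

On this toy data, `a1` pairs with `b2` at distance 0 and `a2` pairs with `b1` at distance 2. A practical system would also need a distance threshold, so that accounts with no plausible counterpart are left unmatched.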