20 20

Transactions on
Data Privacy
Foundations and Technologies


Articles in Press

Accepted articles here

Latest Issues

Year 2019

Volume 12 Issue 3
Volume 12 Issue 2
Volume 12 Issue 1

Year 2018

Volume 11 Issue 3
Volume 11 Issue 2
Volume 11 Issue 1

Year 2017

Volume 10 Issue 3
Volume 10 Issue 2
Volume 10 Issue 1

Year 2016

Volume 9 Issue 3
Volume 9 Issue 2
Volume 9 Issue 1

Year 2015

Volume 8 Issue 3
Volume 8 Issue 2
Volume 8 Issue 1

Year 2014

Volume 7 Issue 3
Volume 7 Issue 2
Volume 7 Issue 1

Year 2013

Volume 6 Issue 3
Volume 6 Issue 2
Volume 6 Issue 1

Year 2012

Volume 5 Issue 3
Volume 5 Issue 2
Volume 5 Issue 1

Year 2011

Volume 4 Issue 3
Volume 4 Issue 2
Volume 4 Issue 1

Year 2010

Volume 3 Issue 3
Volume 3 Issue 2
Volume 3 Issue 1

Year 2009

Volume 2 Issue 3
Volume 2 Issue 2
Volume 2 Issue 1

Year 2008

Volume 1 Issue 3
Volume 1 Issue 2
Volume 1 Issue 1

Volume 5 Issue 3

t-Plausibility: Generalizing Words to Desensitize Text

Balamurugan Anandan(a), Chris Clifton(a), Wei Jiang(b), Mummoorthy Murugesan(c), Pedro Pastrana-Camacho(a),(*), Luo Si(a)

Transactions on Data Privacy 5:3 (2012) 505 - 534

Abstract, PDF

(a) Department of Computer Science; Purdue University; 305 N University St; West Lafayette; IN 47907-2107; USA.

(b) Department of Computer Science; Missouri University of Science and Technology; 310 Computer Science Building; 500 W 15th St; Rolla; MO 65409-0350; USA.

(c) Teradata; 100 N Sepulveda Blvd; El Segundo; CA 92045; USA.

e-mail:banandan @purdue.edu; clifton @purdue.edu; wjiang @mst.edu; Mummoorthy.Murugesan @teradata.com; ppastran @purdue.edu; lsi @purdue.edu


De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in anonymization of structured data, anonymization of textual information is in it infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information (names, addresses, dates), they fail on two counts. The first is that complete text redaction may not be necessary to prevent re-identification, since this can affect the readability and usability of the text. More serious is that identifying information, as well as sensitive information, can be quite subtle and still be present in the text even after the removal of obvious identifiers.

Observe that a diagnosis ``tuberculosis'' is sensitive, but in some situations it can also be identifying. Replacing it with the less sensitive term ``infectious disease'' also reduces identifiability. That is, instead of simply removing sensitive terms, these terms can be hidden by more general but semantically related terms to protect sensitive and identifying information, without unnecessarily degrading the amount of information contained in the document. Based on this observation, the main contribution of this paper is to provide a novel information theoretic approach to text sanitization and develop efficient heuristics to sanitize text documents.

* Corresponding author.

Follow us


ISSN: 1888-5063; ISSN (Digital): 2013-1631; D.L.:B-11873-2008; Web Site: http://www.tdp.cat/
Contact: Transactions on Data Privacy; Vicenç Torra; U. of Skövde; PO Box 408; 54128 Skövde; (Sweden); e-mail:tdp@tdp.cat
Note: TDP's web site does not use cookies. TDP does not keep information neither on IP addresses nor browsers. For the privacy policy access here.


Vicenç Torra, Last modified: 10 : 44 June 27 2015.