Knowledge hiding in emerging application domains

Abul, Osman

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.11851/5812

Title:	Knowledge hiding in emerging application domains
Authors:	Abul, Osman
Publisher:	CRC Press
Abstract:	Data is undoubtedly among the most valuable assets of many for-profit and non-profit organizations, including government agencies, corporations and non-governmental institutions. Apart from daily operational use, this is because data is the source of the precise information and knowledge that organizations need to stay in business, i.e., their success depends heavily on better data utilization. Fast-paced technological advances let organizations easily collect massive volumes of data, the utilization of which is organization-dependent, ranging from simple querying to complex analysis such as data mining. However, the process is not straightforward: data management per se has long been known to be difficult, as it encompasses several non-trivial aspects, e.g., indexing, integrity, consistency and security. Fortunately, database management systems provide tools that simplify these processes to some extent by means of automation. The security aspect can be cast as enforcing a set of data protection policies to limit unauthorized access to individual data elements and their aggregations. In a broader sense, security measures extend to limiting the disclosure of knowledge, i.e., the patterns and rules derivable from the data. Clearly, security measures are indispensable whenever sensitivity issues exist. Sensitive data, information or knowledge, when disclosed either intentionally or accidentally, can lead to privacy, anonymity or secrecy violations. This is a serious issue, as such violations may have severe consequences, e.g., unauthorized disclosure of sensitive personal data is a crime in many countries. In the typical database publishing scenario, an organization maintains a database and releases it, in whole or in part, to third parties, mainly for do-it-yourself queries and analysis. We distinguish two basic models of database release: physical and logical.
In the physical release model, interested parties can obtain the released database as a whole. In the logical release model, third parties cannot get their own copy but are allowed to query the shared database, with restrictions based on their privileges. In the former, since the to-be-released database may contain sensitive raw data (e.g., the salary of the general manager), the data publisher simply erases or mixes such entries before the release. The challenge here is to maintain data integrity and statistics. We call this operation data-level sanitization. In the latter, third parties are not allowed to query sensitive data directly but only some aggregate information (e.g., the average salary of managers) over the sensitive and nonsensitive data. Unfortunately, even simple inferences have been shown to be enough to extract sensitive information that cannot be queried directly. To see this, suppose that the salary of individual managers is sensitive but their average salary may be queried. If there is only one manager, then anybody can learn the sensitive value by issuing the aggregate query. This is known as the inference problem and has been studied in the context of statistical databases; see [18] for a survey. In this setting, selectively refusing to answer some queries can be seen as sanitization since it prevents the disclosure. We call this operation information-level sanitization. The main objective of data-level sanitization and information-level sanitization is to protect the privacy of individuals. However, the privacy of individuals is not the only sensitivity issue. Consider, for instance, a database that is not necessarily about individuals but whose content implies valuable patterns that must be kept inaccessible to third parties. In this case, we speak of the “privacy of patterns” and knowledge-level sanitization. The positive side of data mining has long been known.
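The single-manager inference described above can be illustrated with a small sketch. The table contents, the function names and the `min_group_size` parameter are all hypothetical and chosen only for illustration; the refusal rule shown is one simple form of information-level sanitization, not necessarily the one treated in the chapter.

```python
from statistics import mean

# Hypothetical toy table; names and salaries are invented for illustration.
EMPLOYEES = [
    {"name": "A", "role": "manager", "salary": 9000},
    {"name": "B", "role": "clerk", "salary": 3000},
    {"name": "C", "role": "clerk", "salary": 3400},
]


def average_salary(rows, role):
    """Unrestricted aggregate query: average salary over one role."""
    return mean(r["salary"] for r in rows if r["role"] == role)


def restricted_average_salary(rows, role, min_group_size=2):
    """Same aggregate, but the query is denied (None) when fewer than
    min_group_size rows match, so one person's value is never returned."""
    salaries = [r["salary"] for r in rows if r["role"] == role]
    if len(salaries) < min_group_size:
        return None  # selectively refuse to answer
    return mean(salaries)


# With a single manager, the "average" is exactly that manager's salary.
leaked = average_salary(EMPLOYEES, "manager")

# The restricted variant refuses that query but still answers larger groups.
denied = restricted_average_salary(EMPLOYEES, "manager")
allowed = restricted_average_salary(EMPLOYEES, "clerk")
```

A size threshold of this kind is only a first line of defense: combinations of individually permitted queries can still narrow down a single value, which is part of why the inference problem has its own literature.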
However, data mining is now also a serious threat to database security, as its techniques allow extraction of almost any derivable knowledge, including sensitive data. Privacy-aware data mining, i.e., the study of the side effects of data mining on privacy, has rapidly become a hot research area [16, 8, 49, 12] since its introduction in 1991 by O’Leary [35]. Since then, many different problem formulations with differing objectives and techniques have been introduced and studied in established data mining domains. Knowledge hiding is one such formulation, aimed at hiding knowledge tagged as sensitive from shared databases. Other approaches include data obfuscation/perturbation, secure multi-party computation, secure knowledge sharing and k-anonymity [12]. Many different approaches to knowledge hiding have emerged over the years, mainly in the context of frequent itemset and association rule mining. But the data demands (and associated knowledge demands) of emerging real-world applications are diverse and span a wide spectrum, from unstructured (e.g., text) databases to structured (e.g., graph) databases. This in turn calls for advanced data analysis and corresponding formulations of sensitive knowledge hiding. In this chapter, we introduce the generic knowledge hiding problem, which is to be used as a template in the development of concrete knowledge hiding tasks. We consider at least three dimensions of data mining activities: (i) the kind of dataset, e.g., relational, structured and text; (ii) the kind of data mining task, e.g., associations, clustering and classification; and (iii) the kind of knowledge format, e.g., itemsets, rules, patterns and clusters. Since knowledge is the focal point in knowledge hiding, we select it as the main dimension by which to organize the content. To this end, we first present the generic knowledge hiding problem, obtain specializations for certain knowledge formats and discuss the respective sanitization approaches.
Then, we present the frequent itemset hiding problem, the classical knowledge hiding domain, and the related association rule hiding problem. After that, we continue with the (relatively new) problem of sequential knowledge hiding. Following this, we visit other emerging domains for which the knowledge hiding task has not been addressed at all or is immature. For those domains, we present our view of the problem, definitions and possible approaches. Finally, we provide conclusions along with challenging future research directions. © 2011 by Taylor and Francis Group, LLC.
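To make the notion of frequent itemset hiding concrete, the following is a minimal sketch under stated assumptions. It is a naive greedy sanitizer written for illustration only and is not taken from the chapter: the function names, the toy database `DB` and the choice of which item to delete are all hypothetical. The idea shown is that a sensitive itemset counts as hidden once its support in the released database falls below the mining threshold.

```python
def support(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def hide_itemset(transactions, sensitive, threshold):
    """Return a sanitized copy in which support(sensitive) < threshold.

    Naive strategy (an assumption of this sketch): delete one fixed item
    of the sensitive itemset from supporting transactions, one at a time,
    until the support drops below the threshold.
    """
    sanitized = [set(t) for t in transactions]  # leave the input untouched
    victim = sorted(sensitive)[0]  # arbitrary but deterministic choice
    for t in sanitized:
        if support(sanitized, sensitive) < threshold:
            break
        if sensitive <= t:
            t.discard(victim)
    return sanitized


# Hypothetical toy database of four transactions.
DB = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"b", "c"}]
SENSITIVE = {"a", "b"}

released = hide_itemset(DB, SENSITIVE, threshold=2)
```

In this toy run the sensitive itemset drops from three supporting transactions to one, but the support of the non-sensitive item `a` drops as well. Limiting such side effects on non-sensitive knowledge is what separates a careful sanitization approach from this naive one.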
URI:	https://doi.org/10.1201/b10373 https://hdl.handle.net/20.500.11851/5812
ISBN:	9781439803660; 9781439803653
Appears in Collections:	Bilgisayar Mühendisliği Bölümü / Department of Computer Engineering; Scopus İndeksli Yayınlar Koleksiyonu / Scopus Indexed Publications Collection
