

**Introduction to Statistical Database Security**

Statistical databases are used mainly to produce statistics about various populations. The database may contain confidential data about individuals, which should be protected from user access. However, users are permitted to retrieve statistical information about the populations, such as averages, sums, counts, maximums, minimums, and standard deviations. The techniques that have been developed to protect the privacy of individual information are beyond the scope of this book. We will illustrate the problem with a very simple example, which refers to the relation shown in Figure 24.3. This is a PERSON relation with the attributes Name, Ssn, Income, Address, City, State, Zip, Sex, and Last_degree.

A **population** is a set of tuples of a relation (table) that satisfy some selection condition. Hence, each selection condition on the PERSON relation will specify a particular population of PERSON tuples. For example, the condition Sex = ‘M’ specifies the male population; the condition ((Sex = ‘F’) AND (Last_degree = ‘M.S.’ OR Last_degree = ‘Ph.D.’)) specifies the female population that has an M.S. or Ph.D. degree as their highest degree; and the condition City = ‘Houston’ specifies the population that lives in Houston.

Statistical queries involve applying statistical functions to a population of tuples. For example, we may want to retrieve the number of individuals in a population or the average income in the population. However, statistical users are not allowed to retrieve individual data, such as the income of a specific person. **Statistical database security** techniques must prohibit the retrieval of individual data. This can be achieved by prohibiting queries that retrieve attribute values and by allowing only queries that involve statistical aggregate functions such as COUNT, SUM, MIN, MAX, AVERAGE, and STANDARD DEVIATION. Such queries are sometimes called **statistical queries**.
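The aggregate-only restriction described above can be sketched as a small gatekeeper in front of the database. This is a minimal illustration, not a complete security mechanism: the sample rows, the helper names `make_person_db` and `statistical_query`, and the choice of SQLite are all assumptions made for the example, and only a subset of the PERSON attributes is modeled.

```python
import sqlite3

# Hypothetical sample rows; only a subset of the PERSON attributes is modeled.
SAMPLE_ROWS = [
    ("Ann Lee", "F", "Ph.D.", "Houston", 90000.0),
    ("Bob Ray", "M", "M.S.", "Houston", 70000.0),
    ("Cy Dunn", "M", "B.S.", "Dallas", 50000.0),
    ("Di Park", "F", "M.S.", "Dallas", 80000.0),
]

# Only these aggregate expressions may appear in the SELECT list.
# STANDARD DEVIATION is omitted because SQLite has no built-in function for it.
ALLOWED_AGGREGATES = {
    "COUNT": "COUNT(*)",
    "SUM": "SUM(Income)",
    "MIN": "MIN(Income)",
    "MAX": "MAX(Income)",
    "AVG": "AVG(Income)",
}


def make_person_db():
    """Create an in-memory PERSON table filled with the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PERSON "
        "(Name TEXT, Sex TEXT, Last_degree TEXT, City TEXT, Income REAL)"
    )
    conn.executemany("INSERT INTO PERSON VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def statistical_query(conn, aggregate, condition, params=()):
    """Run one permitted aggregate over the population selected by `condition`."""
    if aggregate not in ALLOWED_AGGREGATES:
        raise PermissionError(f"{aggregate!r} is not a permitted statistical function")
    # Assumption: `condition` comes from a trusted, validated source.
    # A real system would parse and check it instead of interpolating text.
    sql = f"SELECT {ALLOWED_AGGREGATES[aggregate]} FROM PERSON WHERE {condition}"
    return conn.execute(sql, params).fetchone()[0]


conn = make_person_db()
male_count = statistical_query(conn, "COUNT", "Sex = ?", ("M",))
female_grad_avg = statistical_query(
    conn, "AVG", "Sex = 'F' AND (Last_degree = 'M.S.' OR Last_degree = 'Ph.D.')"
)
```

Because the caller never supplies the SELECT list, a request such as "return the Income column" cannot be expressed at all; only the selection condition, and therefore the population, is under the user's control.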

It is the responsibility of a database management system to ensure the confidentiality of information about individuals, while still providing useful statistical summaries of data about those individuals to users. Provision of **privacy protection** of users in a statistical database is paramount; its violation is illustrated in the following example.

In some cases it is possible to **infer** the values of individual tuples from a sequence of statistical queries. This is particularly true when the conditions result in a population consisting of a small number of tuples. As an illustration, consider the following statistical queries:

Q1: SELECT COUNT (*) FROM PERSON WHERE <condition>;

Q2: SELECT AVG (Income) FROM PERSON WHERE <condition>;

Now suppose that we are interested in finding the income of Jane Smith, and we know that she has a Ph.D. degree and that she lives in the city of Bellaire, Texas. We issue the statistical query Q1 with the following condition:

(Last_degree=‘Ph.D.’ AND Sex=‘F’ AND City=‘Bellaire’ AND State=‘Texas’)

If we get a result of 1 for this query, we can issue Q2 with the same condition: the average over a population of one tuple is that tuple's Income value, so the result is the income of Jane Smith. Even if the result of Q1 on the preceding condition is not 1 but is a small number, say 2 or 3, we can issue statistical queries using the functions MAX, MIN, and AVERAGE to identify the possible range of values for the income of Jane Smith.
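The inference described above can be reproduced on a toy table. The sketch below assumes SQLite and a handful of invented rows (all names other than Jane Smith, and every income figure, are made up for illustration); `q1`, `q2`, and `income_range` are hypothetical helpers that issue only aggregate queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PERSON "
    "(Name TEXT, Sex TEXT, Last_degree TEXT, City TEXT, State TEXT, Income REAL)"
)
# Invented sample data; the incomes are illustrative values only.
conn.executemany(
    "INSERT INTO PERSON VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Jane Smith", "F", "Ph.D.", "Bellaire", "Texas", 95000.0),
        ("Mary Jones", "F", "M.S.", "Bellaire", "Texas", 60000.0),
        ("John Brown", "M", "Ph.D.", "Bellaire", "Texas", 88000.0),
        ("Lisa White", "F", "Ph.D.", "Houston", "Texas", 72000.0),
    ],
)

# What the attacker already knows about Jane Smith, written as a condition.
CONDITION = (
    "Last_degree = 'Ph.D.' AND Sex = 'F' "
    "AND City = 'Bellaire' AND State = 'Texas'"
)


def q1(condition):
    """Q1: size of the population selected by the condition."""
    return conn.execute(
        f"SELECT COUNT(*) FROM PERSON WHERE {condition}"
    ).fetchone()[0]


def q2(condition):
    """Q2: average income of the population selected by the condition."""
    return conn.execute(
        f"SELECT AVG(Income) FROM PERSON WHERE {condition}"
    ).fetchone()[0]


def income_range(condition):
    """MIN and MAX bound every individual income in a small population."""
    return conn.execute(
        f"SELECT MIN(Income), MAX(Income) FROM PERSON WHERE {condition}"
    ).fetchone()


population_size = q1(CONDITION)
# With a population of exactly one tuple, the "average" is that person's income.
leaked_income = q2(CONDITION) if population_size == 1 else None
# A slightly larger population still yields bounds on each member's income.
bounds = income_range("Sex = 'F' AND City = 'Bellaire' AND State = 'Texas'")
```

No query here returns an attribute value directly, yet the combination of a count of 1 and an average over the same condition discloses one person's income.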

The possibility of inferring individual
information from statistical queries is reduced if no statistical queries are
permitted whenever the number of tuples in the population specified by the
selection condition falls below some threshold. Another technique for
prohibiting retrieval of individual information is to prohibit sequences of
queries that refer repeatedly to the same population of tuples. It is also
possible to introduce slight inaccuracies or *noise* into the results of statistical queries deliberately, to make
it difficult to deduce individual information from the results. Another
technique is partitioning of the database. Partitioning implies that records
are stored in groups of some minimum size; queries can refer to any complete
group or set of groups, but never to subsets of records within a group. The
interested reader is referred to the bibliography at the end of this chapter
for a discussion of these techniques.
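Two of the countermeasures above, the minimum population threshold and deliberate noise, can be sketched together. The threshold of 3, the uniform noise of up to 500, the sample rows, and the helper names are all assumptions chosen for illustration; the text does not prescribe specific values or a specific noise distribution.

```python
import random
import sqlite3

MIN_POPULATION = 3    # assumed threshold; the text does not fix a value
NOISE_SCALE = 500.0   # assumed maximum absolute perturbation of the result


def make_person_db():
    """Create an in-memory PERSON table with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE PERSON (Name TEXT, City TEXT, Income REAL)")
    conn.executemany(
        "INSERT INTO PERSON VALUES (?, ?, ?)",
        [
            ("Ann Lee", "Houston", 90000.0),
            ("Bob Ray", "Houston", 70000.0),
            ("Cy Dunn", "Houston", 50000.0),
            ("Jane Smith", "Bellaire", 95000.0),
        ],
    )
    return conn


def guarded_avg_income(conn, condition, rng=None):
    """Average income, refused for small populations and perturbed otherwise."""
    count = conn.execute(
        f"SELECT COUNT(*) FROM PERSON WHERE {condition}"
    ).fetchone()[0]
    if count < MIN_POPULATION:
        # Threshold rule: no statistics at all for very small populations.
        raise PermissionError("population is below the minimum size")
    true_avg = conn.execute(
        f"SELECT AVG(Income) FROM PERSON WHERE {condition}"
    ).fetchone()[0]
    # Noise rule: return a slightly inaccurate value instead of the exact one.
    rng = rng or random.Random()
    return true_avg + rng.uniform(-NOISE_SCALE, NOISE_SCALE)


conn = make_person_db()
houston_avg = guarded_avg_income(conn, "City = 'Houston'", random.Random(0))
```

A call such as `guarded_avg_income(conn, "City = 'Bellaire'")` raises `PermissionError` here because that population contains a single tuple. The uniform perturbation only illustrates the idea of noise and is not meant as a calibrated privacy guarantee.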

