In computer science, uncertain data is data that contains noise causing it to deviate from the correct, intended, or original values. In the era of big data, uncertainty, or its inverse, veracity, is one of the defining characteristics of data: data is constantly growing in volume, variety, velocity, and uncertainty (1/veracity). Uncertain data is now abundant on the Internet, in sensor networks, and in enterprises, in both structured and unstructured sources. Examples include uncertainty about a customer's address in a company database, or temperature readings that drift as a sensor ages. In 2012, IBM highlighted "managing uncertain data at scale" in its Global Technology Outlook report [1], a comprehensive analysis that looks three to ten years into the future to identify significant, disruptive technologies that will change the world. To make confident business decisions based on real-world data, an analysis must account for the many different kinds of uncertainty present in large volumes of data. Because analysis based on uncertain data affects the quality of subsequent decisions, the degree and type of inaccuracy in that data cannot be ignored.
Uncertain data arises in sensor networks; in noisy text, which is abundant on social networks, on the Internet, and in enterprises, where structured and unstructured data may be old, outdated, or simply incorrect; and in modeling, where a mathematical model can only approximate the real process. When such data is represented in a database, the probability that each of the various values is correct should also be recorded.
There are three main models of uncertain data in databases. In attribute uncertainty, each uncertain attribute in a tuple has its own independent probability distribution. [2] For example, if temperature and wind speed readings are taken, each reading is described by its own probability distribution, because knowing the value of one measurement gives no information about the other.
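As a minimal sketch of attribute uncertainty, the following Python fragment stores one discrete distribution per attribute and multiplies them to obtain the probability of a combined assignment, which is valid only because the attributes are assumed independent. The attribute names, values, and probabilities are hypothetical and chosen purely for illustration.

```python
from itertools import product

# Hypothetical sensor tuple: each uncertain attribute carries its own
# independent discrete distribution (value -> probability).
reading = {
    "temperature_c": {20: 0.7, 21: 0.3},
    "wind_speed_ms": {3: 0.6, 4: 0.4},
}


def joint_probability(tuple_dists, assignment):
    """Probability of a full assignment, assuming independent attributes."""
    p = 1.0
    for attr, value in assignment.items():
        p *= tuple_dists[attr].get(value, 0.0)
    return p


def possible_worlds(tuple_dists):
    """Enumerate every combination of attribute values with its probability."""
    attrs = list(tuple_dists)
    worlds = []
    for values in product(*(tuple_dists[a] for a in attrs)):
        assignment = dict(zip(attrs, values))
        worlds.append((assignment, joint_probability(tuple_dists, assignment)))
    return worlds


worlds = possible_worlds(reading)
for assignment, p in worlds:
    print(assignment, round(p, 4))
```

Because the distributions are independent, the probabilities of the enumerated combinations sum to 1 without any extra bookkeeping.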
In correlated uncertainty, several attributes are described by a joint probability distribution. For example, if the position of an object is recorded as x- and y-coordinates, the probability of each candidate position may depend on its distance from the recorded coordinates. Because that distance depends on both coordinates, the coordinates are not independent, and it may be advisable to use a joint distribution over them.
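The position example can be illustrated with a small sketch. It assumes, purely for illustration, that the true position is equally likely to be any integer grid point within a fixed radius of the recorded coordinates; the recorded point, the radius, and the function names are hypothetical. Comparing the joint probability of a corner point with the product of its marginals shows why separate per-coordinate distributions would misrepresent the data.

```python
from collections import defaultdict


def joint_within_radius(recorded, radius):
    """Uniform joint distribution over grid points within `radius` of `recorded`."""
    rx, ry = recorded
    points = [
        (rx + dx, ry + dy)
        for dx in range(-radius, radius + 1)
        for dy in range(-radius, radius + 1)
        if dx * dx + dy * dy <= radius * radius  # keep points inside the disc
    ]
    p = 1.0 / len(points)
    return {point: p for point in points}


def marginal(joint, axis):
    """Marginal distribution of one coordinate (axis 0 = x, axis 1 = y)."""
    out = defaultdict(float)
    for point, p in joint.items():
        out[point[axis]] += p
    return dict(out)


joint = joint_within_radius((10, 20), 2)
px = marginal(joint, 0)
py = marginal(joint, 1)

# The corner (12, 22) lies outside the disc, so its joint probability is zero,
# yet the product of the marginals would assign it a positive probability.
p_joint = joint.get((12, 22), 0.0)
p_if_independent = px[12] * py[22]
print(p_joint, p_if_independent)
```

Any distance-based weighting that does not factor into a function of x times a function of y leads to the same conclusion; the uniform disc is simply the easiest case to check by hand.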
In tuple uncertainty, all attributes of a tuple are subject to a joint probability distribution. This covers the case of correlated uncertainty and also the case in which the tuple may not belong to the relation at all, which is indicated when the probabilities sum to less than 1. For example, suppose we have the following tuple from a probabilistic database, where each pair gives a possible value and its probability:
| (a, 0.4) | (b, 0.5) |
The probabilities sum to 0.9, so there is a 10% chance that the tuple does not exist in the database.
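The arithmetic can be checked with a short sketch. It assumes that an uncertain tuple is stored as a list of (value, probability) alternatives, as in the example above; the function names are hypothetical, and `None` is used here only as an illustrative marker for the world in which the tuple is absent.

```python
def absence_probability(alternatives, tolerance=1e-9):
    """Probability that the tuple is absent: 1 minus the sum of its alternatives."""
    total = sum(p for _, p in alternatives)
    if total > 1.0 + tolerance:
        raise ValueError("alternative probabilities must not exceed 1 in total")
    return max(0.0, 1.0 - total)


def possible_worlds(alternatives):
    """List every outcome for the tuple, using None for 'tuple does not exist'."""
    worlds = list(alternatives)
    p_absent = absence_probability(alternatives)
    if p_absent > 0.0:
        worlds.append((None, p_absent))
    return worlds


x_tuple = [("a", 0.4), ("b", 0.5)]
p_absent = absence_probability(x_tuple)  # approximately 0.1
worlds = possible_worlds(x_tuple)
print(round(p_absent, 2), worlds)
```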
Notes
- Global Technology Outlook, 2012, <http://www.zurich.ibm.com/pdf/isl/infoportal/GTO_2012_Booklet.pdf>
- Prabhakar, Sunil. ORION: Managing Uncertain (Sensor) Data. Archived July 20, 2011.
Literature
- Volk, Habich. "Error-Aware Density-Based Clustering of Imprecise Measurement Values." Seventh IEEE International Conference on Data Mining Workshops, 2007 (ICDM Workshops 2007), IEEE. Retrieved 2008-08-01.
- Rosentahl, Volk. "Clustering Uncertain Data With Possible Worlds." Proceedings of the 1st Workshop on Management and Mining of Uncertain Data, in conjunction with the 25th International Conference on Data Engineering, 2009, IEEE. Retrieved 2008-08-01.