Big data (English: big data, [ˈbɪɡ ˈdeɪtə]) is the designation for structured and unstructured data of enormous volume and considerable variety that can be processed effectively by horizontally scalable software tools, which emerged in the late 2000s as alternatives to traditional database management systems and Business Intelligence-class solutions [1] [2] [3] .
In a broad sense, “big data” refers to a socio-economic phenomenon associated with the emergence of technological capabilities to analyze vast amounts of data (in some problem areas, the entire worldwide volume of data) and the transformational consequences that follow from this [4] .
Three defining characteristics, the “three V's”, are traditionally distinguished for big data: volume (in the sense of physical volume), velocity (both the rate of growth and the need for high-speed processing and delivery of results), and variety (the ability to simultaneously process different kinds of structured and semi-structured data) [5] [6] ; various further variations and interpretations of this set of attributes arose later.
From the point of view of information technology , the set of approaches and tools initially included massively parallel processing of weakly structured data, above all database management systems of the NoSQL category, MapReduce algorithms, and the software frameworks and libraries of the Hadoop project that implement them [7] . Subsequently, a wide range of other information technology solutions came to be counted among big data technologies, insofar as they provide, to one degree or another, similar capabilities for processing extremely large data arrays.
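The MapReduce model mentioned above can be illustrated with a minimal, single-process Python sketch of the canonical word-count example; this is an illustration only, not code from any cited source, and in Hadoop the same map, shuffle and reduce phases run distributed across cluster nodes.

```python
# Minimal single-process sketch of the MapReduce model (word count).
# Illustrative only: a real Hadoop job distributes these phases over a cluster.
from collections import defaultdict

def map_phase(document):
    # map: emit (key, value) pairs, here (word, 1) for every word
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # shuffle: group all emitted values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # reduce: aggregate the values collected for one key
    return key, sum(values)

documents = ["Big data is big", "data about data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```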
* 2002 marked the turning point at which the ratio of the world's analog and digital data volumes shifted in favor of digital data, whose volume grew exponentially (avalanche-like).
* In 2007, the volume of digital data exceeded that of analog data almost 15-fold, with 280 exabytes of digital data versus 19 exabytes of analog data.
History
The broad introduction of the term “big data” is associated with Clifford Lynch , editor of the journal Nature , who prepared a special issue dated September 3, 2008, on the theme “How can technologies that open up opportunities for working with large amounts of data affect the future of science?”, devoted to the phenomenon of explosive growth in the volume and variety of processed data and to the technological prospects of a possible leap “from quantity to quality”; the term was proposed by analogy with the metaphors “big oil” and “big ore”, common in the English-speaking business environment [9] [10] .
Although the term was introduced in an academic setting and primarily addressed the growth and variety of scientific data, from 2009 onward it spread widely in the business press, and by 2010 the first products and solutions dealing exclusively and directly with the problem of processing big data had appeared. By 2011, most of the largest vendors of information technology for organizations were using the concept of big data in their business strategies, including IBM [11] , Oracle [12] , Microsoft [13] , Hewlett-Packard [14] and EMC [15] , while the leading information technology market analysts devoted dedicated research to the concept [5] [16] [17] [18] .
In 2011, Gartner named big data the number two trend in information technology infrastructure (after virtualization and as more significant than energy saving and monitoring ) [19] . At the same time, it was predicted that the adoption of big data technologies would have the greatest impact on information technology in manufacturing , healthcare , trade and public administration , as well as in fields and industries where individual resource movements are recorded [20] .
Since 2013, big data has been studied as an academic subject in newly emerging university programs in data science [21] and in computational science and engineering [22] .
In 2015, Gartner excluded big data from its hype cycle for emerging technologies and stopped publishing the separate big data technology hype cycle that had appeared in 2011-2014, citing the transition from the hype stage to practical application. The technologies featured in the dedicated cycle mostly moved into specialized cycles for advanced analytics and data science, for BI and data analysis, for enterprise information management, for in-memory computing , and for information infrastructure [23] .
VVV
The set of attributes VVV ( volume, velocity, variety ) was originally formulated by the Meta Group in 2001, outside the context of big data as a particular set of information technology methods and tools; against the background of the growing popularity of the central data warehouse concept for organizations, it noted the equal importance of data management issues in all three aspects [24] . Later, interpretations appeared with “four V's” ( veracity was added - reliability, used in IBM marketing materials [25] ), “five V's” (in this version viability - vitality - and value were added [26] ), and even “seven V's” (adding, among other things, variability and visualization [27] ). IDC interprets the “fourth V” as value , stressing the economic feasibility of processing the corresponding volumes under appropriate conditions, which is also reflected in IDC's definition of big data [28] . In all cases, these attributes emphasize that the defining characteristic of big data is not only its physical volume but also other categories essential to understanding the complexity of the task of processing and analyzing the data.
Sources
The Internet of things and social media are regarded as classic sources of big data; it is also believed that big data can come from the internal information of enterprises and organizations (generated in information environments but not previously stored or analyzed), from medicine and bioinformatics , and from astronomical observations [29] .
Examples of big data sources include continuously arriving data from measuring devices, events from radio frequency identifiers , message flows from social networks , meteorological data , Earth remote sensing data, data streams on the location of subscribers of cellular networks , and recordings from audio and video devices [30] [31] . The development and widespread adoption of these sources is expected to drive the penetration of big data technologies into research and development as well as into the commercial sector and public administration.
Analysis Methods
Analysis methods and techniques applicable to big data, highlighted in the McKinsey report [32] :
- methods of the Data Mining class: association rule learning , classification (methods for categorizing new data based on principles previously derived from existing data), cluster analysis , regression analysis (an illustrative sketch of cluster analysis and regression follows this list);
- crowdsourcing - categorization and enrichment of data by a broad, indefinite circle of people engaged on the basis of a public offer, without entering into an employment relationship;
- data fusion and integration - a set of techniques for integrating heterogeneous data from a variety of sources for deep analysis; examples of techniques in this class include digital signal processing and natural language processing (including sentiment analysis );
- machine learning , including supervised and unsupervised learning , as well as ensemble learning - the use of models built on the basis of statistical analysis or machine learning to obtain complex forecasts from constituent base models (cf. the statistical ensemble in statistical mechanics);
- artificial neural networks , network analysis , optimization , including genetic algorithms ;
- pattern recognition ;
- predictive analytics ;
- simulation modeling ;
- spatial analysis - a class of methods that use topological , geometric and geographical information in data;
- statistical analysis , with A/B testing and time series analysis cited as examples of its methods;
- visualization of analytical data - presentation of information in the form of figures and diagrams, using interactive features and animation, both to present results and as input for further analysis.
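As a concrete illustration of two methods from the list above, cluster analysis and regression analysis, the following minimal sketch runs both on synthetic data; it assumes NumPy and scikit-learn are installed and is not tied to any specific big data platform or to the McKinsey report.

```python
# Hedged sketch: cluster analysis (k-means) and regression analysis on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Cluster analysis: group 2-D points drawn around three centers into three clusters.
points = np.vstack([rng.normal(loc=center, scale=0.5, size=(100, 2)) for center in (0, 5, 10)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

# Regression analysis: recover the linear relationship y = 2x + 1 from noisy observations.
x = rng.uniform(0, 10, size=(200, 1))
y = 2 * x.ravel() + 1 + rng.normal(scale=0.3, size=200)
model = LinearRegression().fit(x, y)

print("cluster sizes:", np.bincount(labels))
print("estimated slope and intercept:", model.coef_[0], model.intercept_)
```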
Technology
Horizontal scalability is most often cited as the basic principle of big data processing: data distributed over hundreds or thousands of computing nodes must be processed without degradation of performance; in particular, this principle is included in the NIST definition of big data [33] . In addition to the NoSQL, MapReduce, Hadoop and R technologies considered by most analysts, McKinsey also includes Business Intelligence technologies and relational database management systems with SQL support in the context of applicability to big data processing [34] .
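The horizontal scalability principle described above can be sketched in miniature: the data set is split into shards, each shard is processed by an independent worker (local processes standing in here for shared-nothing cluster nodes), and only the small partial results are merged. This is a hedged illustration under those assumptions, not the architecture of any particular product named in this article.

```python
# Sketch of shared-nothing, horizontally scaled processing: each worker
# aggregates only its own shard; partial results are merged at the end.
from multiprocessing import Pool

def process_shard(shard):
    # A "node" sees only the records it owns.
    return sum(record["amount"] for record in shard)

if __name__ == "__main__":
    records = [{"amount": i} for i in range(1_000_000)]
    num_workers = 4
    # Partition the data set into disjoint shards, one per worker.
    shards = [records[i::num_workers] for i in range(num_workers)]
    with Pool(num_workers) as pool:
        partial_sums = pool.map(process_shard, shards)
    print("total:", sum(partial_sums))  # merge step
```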
NoSQL
MapReduce
Hadoop
R
Hardware Solutions
There are a number of hardware-software systems that provide pre-configured solutions for big data processing: the Aster MapReduce appliance ( Teradata ), the Oracle Big Data appliance , and the Greenplum appliance ( EMC , based on the solutions of the acquired company Greenplum). These systems are supplied as telecommunications cabinets ready for installation in data centers , containing a cluster of servers and control software for massively parallel processing.
Hardware solutions for in-memory computing , above all for in-memory databases and in-memory analytics - in particular the HANA hardware-software complexes (a pre-configured hardware-software solution from SAP ) and Exalytics (an Oracle complex based on the TimesTen relational system and the multidimensional Essbase ) - are also sometimes counted among big data solutions [35], [36] , even though such processing is not initially massively parallel and the RAM of a single node is limited to a few terabytes.
In addition, big data solutions sometimes also include hardware-software systems based on traditional relational database management systems - Netezza , Teradata , Exadata - capable of efficiently processing terabytes and exabytes of structured information and handling tasks of fast search and analytical processing of enormous volumes of structured data. It is noted that the first massively parallel hardware-software solutions for processing extremely large volumes of data were the Britton Lee machines, first released in 1983 , and Teradata (produced from 1984 ; moreover, in 1990 Teradata absorbed Britton Lee) [37] .
Hardware DAS solutions - data storage systems directly attached to nodes - are also sometimes counted among big data technologies, given the independence of the processing nodes in the shared-nothing architecture. The emergence of the big data concept is associated with the surge of interest in DAS solutions in the early 2010s , after they had been displaced in the 2000s by network solutions of the NAS and SAN classes [38] .
Notes
- ↑ Preimesberger, 2011 , “Big data refers to the volume, variety and velocity of structured and unstructured data pouring through networks into processors and storage devices, along with the conversion of such data into business advice for enterprises.”
- ↑ PwC, 2010 , “The term ‘big data’ refers to data sets with possible exponential growth that are too large, too unformatted, or too unstructured for analysis by traditional methods”, p. 42.
- ↑ McKinsey, 2011 , “Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze, p. 1.
- ↑ Mayer-Schoenberger, 2014 .
- ↑ 1 2 Gartner, 2011 .
- ↑ Kanaracus, Chris. Big Data Machine . Networks , No. 04, 2011 . Open Systems (November 1, 2011). - “…big data as ‘three V's’: volume (petabytes of stored data), velocity (ingesting, transforming, loading, analyzing and querying data in real time) and variety (processing structured and semi-structured data of different types).” Retrieved November 12, 2011. Archived September 3, 2012.
- ↑ PwC, 2010 , By the beginning of 2010, Hadoop, MapReduce, and their associated open source technologies were the driving force behind a whole new phenomenon that O'Reilly Media, The Economist, and other publications dubbed big data, p. 42.
- ↑ The World's Technological Capacity to Store, Communicate, and Compute Information . MartinHilbert.net . Retrieved April 13, 2016.
- ↑ Chernyak, 2011 , Big Data is one of the few names that have a completely reliable date of birth - September 3, 2008, when a special issue of the oldest British scientific journal Nature was published, devoted to finding an answer to the question “How can technologies that open up opportunities for working with large amounts of data affect the future of science?” [...] realizing the scale of the coming changes, Nature editor Clifford Lynch proposed the special name Big Data for the new paradigm, chosen by analogy with metaphors such as Big Oil, Big Ore and the like, which reflect not so much the quantity of something as the transition from quantity to quality.
- ↑ An example of the use of the “Big Oil” metaphor (in English); cf. also the short story “Big Ore” and the film “Big Oil”.
- ↑ Dubova, Natalia. Big Conference on Big Data . Open Systems (November 3, 2011). “At the IBM Information on Demand forum, which brought together more than 10,000 participants, Big Data analytics became the central topic.” Retrieved November 12, 2011. Archived September 3, 2012.
- ↑ Henschen, Doug. Oracle Releases NoSQL Database, Advances Big Data Plans . InformationWeek (October 24, 2011). Retrieved November 12, 2011. Archived September 3, 2012.
- ↑ Finley, Klint. Steve Ballmer on Microsoft's Big Data Future and More in This Week's Business Intelligence Roundup . ReadWriteWeb (July 17, 2011). Retrieved November 12, 2011. Archived September 3, 2012.
- ↑ Shah, Agam. HP is changing personal computers to Big Data . Open Systems (August 19, 2011). Retrieved November 12, 2011. Archived September 3, 2012.
- ↑ EMC Tries To Unify Big Data Analytics . InformationWeek (September 21, 2011). Retrieved November 12, 2011. Archived September 3, 2012.
- ↑ Woo, Benjamin et al. IDC's Worldwide Big Data Taxonomy . International Data Corporation (October 1, 2011). Retrieved November 12, 2011. Archived September 3, 2012.
- ↑ Evelson, Boris and Hopkins, Brian. How Forrester Clients Are Using Big Data . Forrester Research (September 20, 2011). Retrieved November 12, 2011. Archived September 3, 2012.
- ↑ McKinsey, 2011 .
- ↑ Thibodeau, Patrick. Gartner's Top 10 IT challenges include exiting baby boomers, Big Data . Computerworld (October 18, 2011). Retrieved November 12, 2011. Archived September 3, 2012.
- ↑ Chernyak, 2011 , According to expert estimates, for example from the McKinsey Institute, under the influence of Big Data the greatest transformation will take place in manufacturing, healthcare, trade, public administration and the monitoring of individual movements.
- ↑ MSc in Data Science . School of Computing . Dundee University (January 1, 2013). — “A data scientist is a person who excels at manipulating and analysing data, particularly large data sets that don't fit easily into tabular structures (so-called ‘Big Data’)”. Retrieved January 18, 2013. Archived January 22, 2013.
- ↑ Master of Science degree. Harvard's first degree program in Computational Science and Engineering is an intensive year of coursework leading to the Master of Science . Institute for Applied Computational Science . Harvard University (January 1, 2013). — “…Many of the defining questions of this era in science and technology will be centered on 'big data' and machine learning. This master's program will prepare students to answer those questions…”. Retrieved January 18, 2013. Archived January 22, 2013.
- ↑ Simon Sharwood. Forget Big Data hype, says Gartner as it cans its hype cycle . The Register (August 21, 2015). Retrieved February 19, 2017.
- ↑ Doug Laney. 3D Data Management: Controlling Data Volume, Velocity, and Variety . Meta Group (February 6, 2001). Retrieved February 19, 2017.
- ↑ The Four V's of Big Data . IBM (2011). Retrieved February 19, 2017.
- ↑ Neil Biehn. The Missing V's in Big Data: Viability and Value . Wired (May 1, 2013). Retrieved February 19, 2017.
- ↑ Eileen McNulty. Understanding Big Data: The Seven V's . Dataconomy (May 22, 2014). Retrieved February 19, 2017.
- ↑ Chen et al., 2014 , p. 4.
- ↑ Chen et al., 2014 , pp. 19-23.
- ↑ McKinsey, 2011 , pp. 7—8.
- ↑ Chernyak, 2011 .
- ↑ McKinsey, 2011 , pp. 27—31.
- ↑ Chen et al., 2014 , “Big data shall mean the data of which the data volume, acquisition speed, or data representation limits the capacity of using traditional relational methods to conduct effective analysis or the data which may be effectively processed with important horizontal zoom technologies”, p. 4.
- ↑ McKinsey, 2011 , pp. 31—33.
- ↑ Chernyak, 2011 , The next step could be SAP HANA (High Performance Analytic Appliance) technology, the essence of which is to place the data to be analyzed in main memory.
- ↑ Darrow, Barb. Oracle launches Exalytics, an appliance for big data . GigaOM (October 2, 2011). Retrieved November 12, 2011. Archived September 3, 2012.
- ↑ Chernyak, 2011 , …the first to create a “database machine” was Britton-Lee in 1983, on the basis of a multiprocessor configuration of Zilog Z80 family processors. Britton-Lee was subsequently bought by Teradata, which from 1984 produced MPP-architecture computers for decision-support systems and data warehouses.
- ↑ Leonid Chernyak. Big data revives DAS . Computerworld Russia , No. 14, 2011 . Open Systems (May 5, 2011). Retrieved November 12, 2011. Archived September 3, 2012.
Literature
- Min Chen, Shiwen Mao, Yin Zhang, Victor C. M. Leung. Big Data. Related Technologies, Challenges, and Future Prospects. — Springer, 2014. — 100 p. — ISBN 978-3-319-06244-0 . — DOI : 10.1007/978-3-319-06245-7 .
- Viktor Mayer-Schönberger, Kenneth Cukier. Большие данные. Революция, которая изменит то, как мы живём, работаем и мыслим = Big Data. A Revolution That Will Transform How We Live, Work, and Think / translated from English by Inna Gaidyuk. — M.: Mann, Ivanov and Ferber, 2014. — 240 p. — ISBN 978-5-91657-936-9.
- Preimesberger, Chris. Hadoop, Yahoo, 'Big Data' Brighten BI Future . eWeek (August 15, 2011). Retrieved November 12, 2011. Archived May 17, 2012.
- Leonid Chernyak. Big Data: a new theory and practice (in Russian) // Открытые системы. СУБД . — 2011. — No. 10 . — ISSN 1028-7493 .
- Alan Morrison et al. Big Data: how to extract information from it (in Russian). Технологический прогноз . Quarterly journal, Russian edition, 2010, issue 3 . PricewaterhouseCoopers (December 17, 2010). Retrieved November 12, 2011. Archived March 11, 2012.
- Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data . Gartner (June 27, 2011). Retrieved November 12, 2011. Archived May 17, 2012.
- James Manyika et al. Big data: The next frontier for innovation, competition, and productivity (PDF). McKinsey Global Institute, June 2011 . McKinsey (August 9, 2011). Retrieved November 12, 2011. Archived December 11, 2012.