This paper presents an overview of the recently evolved technology of Big Data, aiming to reveal the latest trends in the management of these huge volumes of data. While the topic of Big Data is broad and encompasses many trends and new technology developments, we have tried to give a brief idea of what this newly emerging technology is and of the approaches that help users cope with and handle Big Data in a cost-effective manner. The paper focuses mainly on big data analytics in different spheres of information storage and extraction, discusses the scope and future of Big Data in India, and gives a brief insight into the challenges faced in handling and working with Big Data.
Keywords- Big Data, 4 V's of Big Data, Data Acquisition and Recording, Metadata, NoSQL, Hadoop, Information Extraction and Cleaning, Data Integration and Aggregation, Data Interpretation.
We are drowning in a flood of data today. In a broad range of application areas, data is being collected at unprecedented scale. Decisions that were previously based on guesswork or manual evaluation can now be made from the data itself. Data analysis at this scale now drives nearly every aspect of modern society, including mobile services, retail, manufacturing, financial services, the life sciences, and the physical sciences.
Millions of bank accounts are created and operated every day. Millions of users log in to social networking sites such as Facebook and Twitter, and new email accounts are created constantly. Whether it is a railway reservation system or a simple school attendance management system, Big Data comes into existence.
The term became a hot IT buzzword in 2012. Basically, Big Data is data that exceeds the processing capacity of conventional database systems: the data is too big, moves too fast, or does not fit the structures of existing database architectures. The challenges of dealing with Big Data include capture, storage, search, sharing, transfer, analysis, and visualization. The trend toward larger data sets is driven by the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data.
Here we can easily visualize the basic sources of data formation: in a matter of seconds, huge chunks of data originate. Big Data is commonly characterized by four V's.
- Volume- Far more data is produced now than ever before. Every day we create 2.5 quintillion bytes of data, and 90% of the data in existence today has been generated in the last two years alone.
- Velocity- All this data is created very fast, arriving within very short time spans.
- Variety- All this data comes in many forms, including social data, i.e. information generated and held in social networks such as Facebook and Twitter. In addition, much of it is unstructured, i.e. not organized in a database, which complicates analysis.
- Veracity- The data varies in quality and trustworthiness and is generally mingled together, some organized and some unorganized.
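The variety problem above can be made concrete with a toy sketch: the same fact held as a structured record versus buried in unstructured text. All field names and strings here are invented for the example.

```python
import re

# Structured: the fact is already organized into named fields.
structured = {"user": "alice", "action": "purchase", "amount": 42.50}

# Unstructured: the same fact buried in free text (e.g. a social post).
unstructured = "alice just bought something for $42.50 on our site!"

# Querying structured data is trivial.
amount_structured = structured["amount"]

# Extracting the same value from unstructured text needs parsing,
# and the parse may fail, which is exactly the analysis problem.
match = re.search(r"\$(\d+(?:\.\d+)?)", unstructured)
amount_unstructured = float(match.group(1)) if match else None

print(amount_structured, amount_unstructured)  # 42.5 42.5
```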
Big Data Analytics:
Big data analysis involves the following:
- Examining large amounts of data.
- Deriving the appropriate information from a one-time examination of the data.
- Identification of hidden patterns and unknown correlations.
- Better business decisions: strategic and operational.
- Effective marketing, increased customer satisfaction, and increased revenue.
- The analysis of Big Data involves multiple distinct phases as shown in the figure below:
- Data Acquisition and Recording- Big Data does not arise out of a vacuum: it is recorded from some data-generating source. For example, scientific experiments and simulations can easily produce petabytes of data today. Much of this data is of no interest, and is first filtered and then compressed. The second challenge is generating the right metadata automatically. Metadata is "data about data". The term is ambiguous, as it is used for two fundamentally different concepts. Structural metadata is about the design and specification of data structures and is more properly called "data about the containers of data"; descriptive metadata, on the other hand, is about individual instances of application data, the data content.
- Information Extraction and Cleaning- Frequently, the information collected will not be in a format ready for analysis. We require an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis. Data cleaning refers to the identification of possible errors and their removal.
- Data Integration, Aggregation, and Representation- It is not enough merely to record the data and throw it into a repository. The data has to be organized and represented in a sensible and simple form.
- Query Processing, Data Modeling, and Analysis- Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis on small samples. Mining requires integrated, cleaned, trustworthy, and efficiently accessible data, declarative query and mining interfaces, scalable mining algorithms, and big-data computing environments. At the same time, data mining itself can also be used to help improve the quality and trustworthiness of the data, understand its semantics, and provide intelligent querying functions.
- Interpretation- Having the ability to analyze Big Data is of limited value if users cannot understand the analysis. Ultimately, a decision-maker, provided with the result of analysis, has to interpret these results. This interpretation cannot happen in a vacuum. Usually, it involves examining all the assumptions made, retracing the analysis, and debugging possible errors.
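As a minimal, self-contained illustration of the phases above, the following Python sketch walks toy log data through acquisition, extraction and cleaning, aggregation, and a simple query. The record format and field names are invented for the example, not a prescribed pipeline.

```python
import re
from collections import defaultdict

# 1. Acquisition and recording: raw lines from a hypothetical source,
#    filtered to keep only the events of interest.
raw = [
    "2024-01-01 login user=alice",
    "2024-01-01 heartbeat",            # noise: filtered out
    "2024-01-01 login user=bob",
    "2024-01-02 login user=alice",
]
acquired = [line for line in raw if "login" in line]

# 2. Extraction and cleaning: pull structured fields out of the text,
#    discarding lines that do not parse.
pattern = re.compile(r"(\d{4}-\d{2}-\d{2}) login user=(\w+)")
records = []
for line in acquired:
    m = pattern.match(line)
    if m:
        records.append({"date": m.group(1), "user": m.group(2)})

# 3. Integration, aggregation, and representation: organize records
#    into a simple representation, here logins per user.
logins = defaultdict(int)
for r in records:
    logins[r["user"]] += 1

# 4. Query and analysis: which user logged in most often?
top_user = max(logins, key=logins.get)

# 5. Interpretation: a human reads the result in context.
print(top_user, logins[top_user])  # alice 2
```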
Why are current decisions data-driven?
The value of big data to an organization falls into two categories: analytical use, and enabling new products. Big data analytics can reveal insights hidden previously by data too costly to process, such as peer influence among customers, revealed by analyzing shoppers' transactions, social and geographical data. Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports.
The past decade's successful web startups are prime examples of big data used as an enabler of new products and services. For example, by combining a large number of signals from a user's actions and those of their friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business.
Successfully exploiting the value in big data requires experimentation and exploration. It also gives organizations the opportunity to create new products and gain competitive advantage in business.
Applications of Big Data:
· SMARTER HEALTHCARE
· 80% of medical data is unstructured, yet clinically relevant.
· Data resides in multiple places like individual EMRs, lab and imaging systems, physician notes, medical correspondence, claims etc.
· Leveraging big data can build sustainable healthcare systems and improve access to healthcare.
· MULTI-CHANNEL SALES
Using integrated Big Data approaches, organizations now build a holistic view of their data to gain the fullest possible understanding of consumer interactions, intent, and value. The current shift is that customer intelligence across channels is not just used for insights but acted on rapidly to power multi-channel targeting and personalization, delivered through dynamic digital messaging. From insight to action, consistent and relevant messaging approaches now provide cohesive consumer experiences. With so many opportunities for insight and learning, a 360-degree view of each individual in the database can be created.
· FINANCE- The financial services sector has gone through unprecedented change in the last few years. Customers are expecting a more personalized service from their banks. Regulators have reacted to the credit crunch with significant changes to regulation with more intrusive and granular supervision. While it is crucial to ensure the integrity of data provided to executive management and regulators, unlocking the insights in the data to better understand customers, competitors and employees represents a significant opportunity to gain competitive advantage. While regulatory pressure is forcing organizations to improve the integrity of the data, many financial institutions are seeing improved data quality and the use of analytics as an opportunity to fundamentally change the way decisions are made and to use the data for commercial gain.
· LOG ANALYSIS- Using Hadoop successfully to analyze log data is not in itself a predictor of success in a typical enterprise scenario: the factors that make Hadoop a good fit for log analytics can mask what is required for real enterprise use, and log data is fairly structured. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is an Apache project sponsored by the Apache Software Foundation.
Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative. The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising.
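The programming model Hadoop popularized, MapReduce, can be sketched in a few lines of plain Python. This simulates the map, shuffle, and reduce steps of a word count in one process; real Hadoop jobs distribute these steps across nodes and are typically written against Hadoop's Java API or via Hadoop Streaming.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map step: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # Reduce step: sum all counts emitted for one word.
    return word, sum(counts)

lines = ["big data is big", "data moves fast"]

# Shuffle step: group mapped pairs by key, as the framework would
# before handing each key's values to a reducer.
groups = defaultdict(list)
for word, count in chain.from_iterable(mapper(l) for l in lines):
    groups[word].append(count)

result = dict(reducer(w, c) for w, c in groups.items())
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

Because map and reduce operate on independent keys, the framework can run them on thousands of nodes in parallel, which is the property the paragraph above describes.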
· SECURITY ISSUES- The convergence of data science with security analytics was not an overnight event, not least because it was not a creation of the information security world to begin with. The path of convergence first came with an overlapping field, fraud detection and investigation, where data analytics has for many years been a key driver in identifying what constitutes normal and abnormal patterns of activity. For anyone who has ever found their debit card locked out after a transaction they considered normal, that is data analytics in action, and the same analytics can help us gain information about serious data threats.
SCOPE OF BIG DATA IN INDIA- The 'Big Data' industry, the ability to access, analyze and use humongous volumes of data through specific technology, will require a whole new army of data workers globally. India itself will require a minimum of 1,00,000 data scientists over the next couple of years, in addition to scores of data managers and data analysts, to support the fast-emerging Big Data space. Big Data is giving rise to an interesting collaboration among diverse disciplines of computer science, communication networks and devices, and behavioral science. The advent of data science as a mainstream subject is the outcome of these cross-domain efforts.
Applying Big Data solutions, enterprises can now translate mountains of digital data into effective business insights in real time. They can avoid risks, cut costs, analyze patterns to follow trends and customers' preferences and suggest better choices for the customers and increase revenue. A recent study conducted by EMC among Indian companies revealed that 91 per cent of Indian businesses were aware of the potential benefits of Big Data, but 26 per cent of companies had no current plans to utilize Big Data technology. Meanwhile, according to another study by technology researcher International Data Corp., the Big Data market in India is expected to grow at nearly 38 per cent annually, reaching $153.1 million in 2014.
FUTURE OF BIG DATA IN INDIA
· Roughly $15 billion has been spent on software firms specializing only in big data management and analytics. This industry is now worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.
· In February 2012, the open source analyst firm Wikibon released the first market forecast for big data, listing $5.1B in revenue for 2012 and projecting growth to $53.4B by 2017.
· The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44 times in between 2009 and 2020.
· Big Data is expected to create a job boom in the IT market.
Challenges in Big Data Analysis-
· Heterogeneity and Incompleteness- Even after data cleaning and error correction, some incompleteness and some errors in data are likely to remain. This incompleteness and these errors must be managed during data analysis. Doing this correctly is a challenge. Recent work on managing probabilistic data suggests one way to make progress.
· Scale- The first thing that comes to mind regarding Big Data is its size. After all, the word "big" is there in the very name. Managing large and rapidly increasing volumes of data has been a challenging issue for many decades. In the past, this challenge was mitigated by processors getting faster, following Moore's law, to provide the resources needed to cope with increasing volumes of data. But there is a fundamental shift underway now: data volume is scaling faster than compute resources, and CPU speeds are static.
· Human Collaboration- In spite of the tremendous advances made in computational analysis, there remain many patterns that humans can easily detect but computer algorithms have a hard time finding. Indeed, CAPTCHAs exploit precisely this fact to tell human web users apart from computer programs. Ideally, analytics for Big Data will not be purely computational; rather, it will be designed explicitly to have a human in the loop. Big Data analysis therefore involves the effective participation of humans in the handling procedure.
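The incompleteness problem from the first challenge above can be illustrated with a small sketch. The records and the mean-imputation strategy here are illustrative choices for one simple way to manage missing values during analysis, not a prescribed method; probabilistic approaches are another.

```python
# Records with missing values (None) that survived cleaning.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # incomplete: age unknown
    {"id": 3, "age": 28},
]

# One simple strategy: impute the missing value with the mean of the
# observed values, while flagging it so later analysis can discount it.
observed = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(observed) / len(observed)

for r in records:
    r["imputed"] = r["age"] is None
    if r["age"] is None:
        r["age"] = mean_age

print([(r["age"], r["imputed"]) for r in records])
# [(34, False), (31.0, True), (28, False)]
```

Keeping the `imputed` flag alongside the filled-in value is what lets the incompleteness be "managed during data analysis" rather than silently hidden.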
We have entered an era of Big Data. Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances in many scientific disciplines and improving the profitability and success of many enterprises. However, many technical challenges described in this paper must be addressed before this potential can be realized fully. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains, and therefore not cost-effective to address in the context of one domain alone. Furthermore, these challenges will require transformative solutions, and will not be addressed naturally by the next generation of industrial products. We must support and encourage fundamental research towards addressing these technical challenges if we are to achieve the promised benefits of Big Data.