CS.Lectures: Introduction to BigData

Every day, we create roughly 2.5 quintillion bytes of data

500 million tweets are sent
294 billion emails are sent
4 petabytes of data are created on Facebook
4 terabytes of data are created from each connected car
65 billion messages are sent on WhatsApp
5 billion searches are made

Big Data is a collection of data that is huge in volume, yet growing exponentially with time.

It is a data with so large size and complexity that none of traditional data management tools can store it or process it efficiently. Big data is also a data but with huge size.

Data: The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

Examples Of Big Data

* The New York Stock Exchange generates about one terabyte of new trade data per day.

* Social Media: The statistic shows that 500+terabytes of new data get ingested into the databases of social media site Facebook, every day. This data is mainly generated in terms of photo and video uploads, message exchanges, putting comments etc.

* A single Jet engine can generate 10+terabytes of data in 30 minutes of flight time. With many thousand flights per day, generation of data reaches up to many Petabytes.

Characteristics Of Big Data

Big data can be described by the following characteristics:

Volume (Scale of data)
Variety (Forms of data)
Velocity (Analysis of data flow)
Veracity (Uncertainty of data)
Value
Unfortunately, sometimes volatility isn’t within our control. The volatility, sometimes referred to as another “V” of big data, is the rate of change and lifetime of the data

Volume

The name Big Data itself is related to a size which is enormous.
Size of data plays a very crucial role in determining value out of data.
Also, whether a particular data can actually be considered as a Big Data or not, is dependent upon the volume of data.
Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.

As an example of a high-volume dataset, think about Facebook. The world’s most popular social media platform now has more than 2.2 billion active users, many of whom spend hours each day posting updates, commenting on images, liking posts, clicking on ads, playing games, and doing a zillion other things that generate data that can be analyzed. This is high-volume big data in no uncertain terms.

Variety

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data considered by most of the applications.
Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis applications.
This variety of unstructured data poses certain issues for storage, mining and analyzing data.

Facebook, of course, is just one source of big data. Imagine just how much data can be sourced from a company’s website traffic, from review sites, social media (not just Facebook, but Twitter, Pinterest, Instagram, and all the rest of the gang as well), email, CRM systems, mobile data, Google Ads – you name it. All these sources (and many more besides) produce data that can be collected, stored, processed and analyzed. When combined, they give us our second characteristic – variety. Variety, indeed, is what makes it really, really big. Data scientists and analysts aren’t just limited to collecting data from just one source, but many. And this data can be broken down into three distinct types – structured, semi-structured, and unstructured.

Velocity

The term 'velocity' refers to the speed of generation of data.
How fast the data is generated and processed to meet the demands, determines real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, and social media sites, sensors, Mobile devices, etc.
The flow of data is massive and continuous.

Facebook messages, Twitter posts, credit card swipes and ecommerce sales transactions are all examples of high velocity data.

Veracity

This refers to the inconsistency which can be shown by the data at times, thus hampering the process of being able to handle and manage the data effectively.

Veracity refers to the quality, accuracy and trustworthiness of data that’s collected. As such, veracity is not necessarily a distinctive characteristic of big data (as even little data needs to be trustworthy), but due to the high volume, variety and velocity, high reliability is of paramount importance if a business is draw accurate conclusions from it. High veracity data is the truly valuable stuff that contributes in a meaningful way to overall results. And it needs to be high quality.

Value
Value sits right at the top of the pyramid and refers to an organization’s ability to transform those tsunamis of data into real business. With all the tools available today, pretty much any enterprise can get started with big data

Types Of Big Data

Following are the types of Big Data:

Structured
Unstructured
Semi-structured

Structured

By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored and accessed from a database by simple search engine algorithms. For instance, the employee table in a company database will be structured as the employee details, their job positions, their salaries, etc., will be present in an organized manner.

Unstructured

Unstructured data refers to the data that lacks any specific form or structure whatsoever. This makes it very difficult and time-consuming to process and analyze unstructured data. Email is an example of unstructured data. Structured and unstructured are two important types of big data.

Semi-structured

Semi structured is the third type of big data. Semi-structured data pertains to the data containing both the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to the data that although has not been classified under a particular repository (database), yet contains vital information or tags that segregate individual elements within the data.

Importance

Ability to process Big Data brings in multiple benefits, such as-

Businesses can utilize outside intelligence while taking decisions- Access to social data from search engines and sites like facebook, twitter are enabling organizations to fine tune their business strategies.

Improved customer service- Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.

Early identification of risk to the product/services, if any

Better operational efficiency- Big Data technologies can be used for creating a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of Big Data technologies and data warehouse helps an organization to offload infrequently accessed data.

Advantages of Big Data (Features)

- Opportunities to Make Better Decisions. ...
- Increasing Productivity and Efficiency. ...
- Reducing Costs. ...
- Improving Customer Service and Customer Experience. ...
- Fraud and Anomaly Detection. ...
- Greater Agility and Speed to Market. ...
- Questionable Data Quality. ...
- Heightened Security Risks.
One of the biggest advantages of Big Data is predictive analysis. Big Data analytics tools can predict outcomes accurately, thereby, allowing businesses and organizations to make better decisions, while simultaneously optimizing their operational efficiencies and reducing risks.
By harnessing data from social media platforms using Big Data analytics tools, businesses around the world are streamlining their digital marketing strategies to enhance the overall consumer experience. Big Data provides insights into the customer pain points and allows companies to improve upon their products and services.
Being accurate, Big Data combines relevant data from multiple sources to produce highly actionable insights. Almost 43% of companies lack the necessary tools to filter out irrelevant data, which eventually costs them millions of dollars to hash out useful data from the bulk. Big Data tools can help reduce this, saving you both time and money.
Big Data analytics could help companies generate more sales leads which would naturally mean a boost in revenue. Businesses are using Big Data analytics tools to understand how well their products/services are doing in the market and how the customers are responding to them. Thus, the can understand better where to invest their time and money.
With Big Data insights, you can always stay a step ahead of your competitors. You can screen the market to know what kind of promotions and offers your rivals are providing, and then you can come up with better offers for your customers. Also, Big Data insights allow you to learn customer behavior to understand the customer trends and provide a highly ‘personalized’ experience to them.

Applications

1) Healthcare

Big Data has already started to create a huge difference in the healthcare sector.
With the help of predictive analytics, medical professionals and HCPs are now able to provide personalized healthcare services to individual patients.
Apart from that, fitness wearables, telemedicine, remote monitoring – all powered by Big Data and AI – are helping change lives for the better.

2) Academia

Big Data is also helping enhance education today.
Education is no more limited to the physical bounds of the classroom – there are numerous online educational courses to learn from.
Academic institutions are investing in digital courses powered by Big Data technologies to aid the all-round development of budding learners.

3) Banking

The banking sector relies on Big Data for fraud detection.
Big Data tools can efficiently detect fraudulent acts in real-time such as misuse of credit/debit cards, archival of inspection tracks, faulty alteration in customer stats, etc.

4) Manufacturing

According to TCS Global Trend Study, the most significant benefit of Big Data in manufacturing is improving the supply strategies and product quality.
In the manufacturing sector, Big data helps create a transparent infrastructure, thereby, predicting uncertainties and incompetencies that can affect the business adversely.

5) IT

One of the largest users of Big Data, IT companies around the world are using Big Data to optimize their functioning, enhance employee productivity, and minimize risks in business operations.
By combining Big Data technologies with ML and AI, the IT sector is continually powering innovation to find solutions even for the most complex of problems.

6) Retail

Big Data has changed the way of working in traditional brick and mortar retail stores.
Over the years, retailers have collected vast amounts of data from local demographic surveys, POS scanners, RFID, customer loyalty cards, store inventory, and so on.
Now, they’ve started to leverage this data to create personalized customer experiences, boost sales, increase revenue, and deliver outstanding customer service.

Retailers are even using smart sensors and Wi-Fi to track the movement of customers, the most frequented aisles, for how long customers linger in the aisles, among other things. They also gather social media data to understand what customers are saying about their brand, their services, and tweak their product design and marketing strategies accordingly.
7) Transportation
Big Data Analytics holds immense value for the transportation industry. In countries across the world, both private and government-run transportation companies use Big Data technologies to optimize route planning, control traffic, manage road congestion, and improve services. Additionally, transportation services even use Big Data to revenue management, drive technological innovation, enhance logistics, and of course, to gain the upper hand in the market.
Use cases: Fraud detection patterns
Traditionally, fraud discovery has been a tedious manual process. Common methods of discovering and preventing fraud consist of investigative work coupled with computer support. Computers can help in the alert using very simple means, such as flagging all claims that exceed a prespecified threshold. The obvious goal is to avoid large losses by paying close attention to the larger claims.
Big Data can help you identify patterns in data that indicate fraud and you will know when and how to react.
Big Data analytics and Data Mining offers a range of techniques that can go well beyond computer monitoring and identify suspicious cases based on patterns that are suggestive of fraud. These patterns fall into three categories.
1. Unusual data. For example, unusual medical treatment combinations, unusually high sales prices with respect to comparatives, or unusual number of accident claims for a person.
2. Unexplained relationships between otherwise seemingly unrelated cases. For example, a real estate sales involving the same group of people, or different organizations with the same address.
3. Generalizing characteristics of fraudulent cases. For example, intense shopping or calling behavior of items/locations that have not happened in the past, or a doctor’s billing for treatments and procedures that he rarely billed for in the past. These patterns and their discovery are detailed in the following sections. It should be mentioned, that most of these approaches attempt to deal with "non-revealing fraud". This is the common case. Only in cases of "self-revealing" fraud (such as stolen credit cards) will it become known at some time in the future that certain transactions had been fraudulent. At that point only a reactive approach is possible, the damage has already occurred; this may, however also set the basis for attempting to generalize from these cases and help detect fraud when it re-occurs in similar settings (see Generalizing characteristics of fraud, below).
Unusual data
The unusual data refer to three different situations: unusual combinations of other quite acceptable entries, a value that is unusual with respect to a comparison group, and an unusual value of and by itself. The latter case is probably the easiest to deal with and is an example of "outlier analysis". We are interested here only in outliers that are unusual, but are still acceptable values; an entry of a negative number for the number of staplers that a procurement clerk purchased would simply be a data error, and presumably bear no relationship to fraud. An unusual high value could be detected simply by employing descriptive statistics tools, such as measures of mean and standard deviation, or a box plot; for categorical values the same measures for the frequency of occurrence would be a good indicator.
Somewhat more difficult is the detection of values that are unusual only with respect to a reference group. In a case of real estate sales price, the price as such may not be very high, but it may be high for dwellings of the same size and type in a given location and economic market. The judgment that the value is unusual would only become apparent through specific data analysis techniques..
Unexplained Relationships
The unexplained relationships may refer to two or more seemingly unrelated records having unexpectedly the same values for some of the fields. As an example, in a money laundering scheme funds may be transferred between two or more companies; it would be unexpected if some of the companies in question have the same mailing address. Assuming that the stored transactions consist of hundreds of variables and that there may be a large number of transactions, the detection of such commonalities is unlikely if left uninvestigated. When applying this technique to many variables and/or variable combinations, the presence of an automated tool is indispensable. Again positive findings do not necessarily indicate fraud but are suggestive for further investigation.
Generalizing Characteristics of Fraud
Once specific cases of fraud have been identified we can use them in order to find features and commonalities that may help predict which other transactions are likely to be fraudulent. These other transactions may already have happened and been processed, or they may occur in the future. In both cases, this type of analysis is called "predictive data mining".
The potential advantage of this method over all alternatives previously discussed is that its reliability can be statistically assessed and verified. If the reliability is high, then most of the investigative efforts can be concentrated on handling the actual fraud cases, rather than screening many cases, which may or may not be fraudulent.

Monday, June 14, 2021

Introduction to BigData