BIG DATA AND ANALYTICS

INTRODUCTION
These notes aim to provide an introduction to big data and analytics. Big data is not a new technology; however, its visibility within organisations, and their need to understand and leverage the insight it offers, has made it a priority. Together, big data and analytics have the ability to transform not just processes but organisations and even our society.
LEARNING OUTCOMES
On completion of this unit you should be able to:
– Define big data and the implications for analytics
– Understand the three defining characteristics: volume, variety and velocity
– Describe the analytics lifecycle
– Understand big data technologies such as MapReduce.
PRESCRIBED READING
Chapter 7, Big Data Concepts and Tools, in Sharda, R., Delen, D., & Turban, E. (2018). Business intelligence, analytics, and data science: A managerial perspective (Fourth edition; Global ed.). Harlow, England: Pearson.
https://contentstore.cla.co.uk/secure/link?id=b14c407b-6b22-ea11-80cd-
005056af4099
Watch the TedTalk: How we’ll predict the next refugee crisis with Rana Novack
https://www.ted.com/talks/rana_novack_how_we_ll_predict_the_next_refugee_crisis
10.1 Introduction to Big Data
Big data is data that is unstructured, time sensitive or too large to be processed using traditional data mining and handling techniques; it therefore requires a different processing approach, known as big data processing, which uses massive parallelism on readily available hardware.
Typically, big data has three defining characteristics: volume, variety and velocity. Volume refers to the sheer quantity of data, variety to the poly-structured nature of the data (a mixture of structured, semi-structured and unstructured data such as text, audio and video), and velocity to the rate at which it is generated.
Because big data is not suitable for analysis using traditional relational database
management systems (RDBMSs), a variety of scalable data analysis tools have
evolved including:
Apache’s open-source Hadoop distributed data processing system
NoSQL non-relational databases
Hadoop includes the HBase database and Hive data warehouse system. Examples
of NoSQL technologies include: Couchbase, Cassandra (Apache), DynamoDB (Amazon), MongoDB, and Neo4j.
The role of a data scientist is to integrate and analyse data from disparate big data
sources and present the results to decision makers. However, there is a skills gap
that means many organisations that cannot pay for expensive consultancy remain data-rich but information-poor. Unsurprisingly, there is much interest in developing tools that integrate enterprise-centric SQL/RDBMS and internet-centric Hadoop/NoSQL technologies and that can be used by non-specialists.
Applications for this kind of analysis, in addition to traditional transactional data in
enterprise data warehouses, include:
Surveillance footage, useful in crime, retail and military applications.
Data from embedded and medical devices, useful for real-time epidemiological
studies, for example.
Information from entertainment and social media, useful for mining the views and
tastes of the public.
You may wish to add to this list the increasing amounts of data generated by all
manner of sensors in the fast-developing Internet of Things. The continuous growth
of the Internet of Things (IoT) has provided several new resources for Big Data.
Where does big data come from?
The glib answer to the question is “Everywhere”; however, the bulk of it comes from
social data, machine data and transactional data. Some of this data will be
generated internally and some is imported from external sources. Some of the data
will be structured and some of it unstructured.
Unstructured data does not have a
pre-defined data model and therefore requires more resources to make sense of it.
Social data is garnered from sources such as the Web, Tweets, Video uploads,
Facebook, and Instagram. This sort of data can provide invaluable insights into
consumer behaviour and can be enormously influential in marketing analytics. Tools
like Google Trends can be used to good effect to add to the volume of big data available for analysis.
Machine data is defined as information generated by sensors embedded in machinery and by web logs that track user behaviour. Sensors can be found in satellites, smart meters, road cameras, games, medical devices, scientific experiments (in fields such as atmospheric science, biology, genomics, astronomy, nuclear physics and biochemical research), military surveillance, and the Internet of Things (IoT), which delivers, and will continue to deliver, data of high velocity, value, volume and variety. This data is expected to grow rapidly as the IoT grows ever more pervasive.
Transactional data is generated from all the daily transactions that take place both
online and offline such as contractual fees, lending and borrowing, interest received,
interest paid, sales and purchases of assets, dividends, purchases of raw materials, sales of finished goods and salary payments. These activities generate bills
of sale, delivery receipts, delivery notes, invoices, credit notes, payroll data and
much more. However, the data alone is not very helpful until it has been properly
analysed and put to good use.
Watch the video at
https://youtu.be/u8w0e_kelX8
10.2 Big Data in Modern Business
Currently there are Yottabytes of usable data available for big data analysis, but are
businesses using big data? The evidence [Foote 2018] suggests that many
businesses, particularly those online, consider big data a mainstream business
practice. The increasing use of big data is driving the development of new, more sophisticated technologies for managing it. These technologies change not only how business intelligence is gathered, but how business is done. What follows are a few examples.
Streaming the IoT for Machine Learning
There are currently efforts to use the Internet of Things (IoT) to combine Streaming
Analytics and Machine Learning. In this model streaming data provides information
from the IoT to facilitate machine-learning in real time in a less controlled
environment. A major aim of this process is the provision of more flexible, more
appropriate responses to a variety of business situations, with a special focus on
communicating with humans. Once the system has been migrated from the training environment, machine learning continues to train it, so that it predicts outcomes with reasonable accuracy.
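To make the idea concrete, here is a minimal Python sketch of incremental ("online") learning, assuming the scikit-learn library; the stream_batches() generator and the fault-detection labels are hypothetical stand-ins for a real IoT feed, not part of any product mentioned above.

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()                  # a linear classifier that supports incremental learning
classes = np.array([0, 1])               # assumed labels: 0 = normal reading, 1 = fault

def stream_batches(n_batches=100, batch_size=32, n_features=4):
    # Stand-in for an IoT stream: yields small batches of labelled sensor readings.
    rng = np.random.default_rng(42)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic labelling rule
        yield X, y

for X_batch, y_batch in stream_batches():
    # partial_fit updates the model from each batch without revisiting earlier data,
    # which is what makes it suitable for high-velocity streams.
    model.partial_fit(X_batch, y_batch, classes=classes)

print("Example prediction:", model.predict(np.zeros((1, 4))))

The same pattern applies when each batch arrives from a message queue rather than a local generator.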
Ted Dunning, the Chief Application Architect at MapR said:
“… we will see more and more businesses treat computation in terms of
data flows rather than data that is just processed and landed in a
database. These data flows capture key business events and mirror
business structure. A unified data fabric will be the foundation for building
these large-scale flow-based systems…” [ARN 2017]

AI Platforms
Part of the drive to utilise big data is the development of Artificial Intelligence (AI)
platforms. According to Anil Kaul, CEO and Co-Founder of Absolutdata:
“…we started an email campaign, which I think everybody uses Analytics
for, but because we used AI, we created a 51 percent increase in sales.
While Analytics can figure out who you should target, AI recommends
and generates what campaigns should be run…” [Foote 2018]
A well-designed AI platform can help reduce costs by automating basic tasks,
eliminating duplication of effort, and taking over simple but time-consuming activities such as copying, data processing, and constructing customer profiles.
Artificial Intelligence platforms are arranged into five layers of logic:
1. The Data & Integration Layer
2. The Experimentation Layer
3. The Operations & Deployment Layer
4. The Intelligence Layer
5. The Experience Layer
The Data & Integration Layer gives access to the data. The rules are “learned” by the
AI process, not hand-coded by developers.
The Experimentation Layer lets data scientists develop, test, and prove their
theories.
The Operations & Deployment Layer supports model governance and deployment
and provides tools to manage the deployment of various “containerised” models and
components.
The Intelligence Layer organises and delivers intelligent services and supports the
AI.
The Experience Layer is designed to interact with users using augmented reality,
conversational user interfaces, and gesture control.
Data Curator
The role of a data curator (DC) will be the management of an organisation's metadata, data protection, data governance, and data quality. Their role may also encompass determining best practices for working with that data.
The DC will need to interact and collaborate with researchers and communicate with
users across the organisation so good communication and people skills are a distinct
plus.
According to Kelly Stirman, vice president of strategy and CMO of Dremio:
“The Data Curator is responsible for understanding the types of analysis
that need to be performed by different groups across the organization,
what datasets are well suited for this work, and the steps involved in taking
the data from its raw state to the shape and form needed for the job a data
consumer will perform. The data curator uses systems such as self-service
data platforms to accelerate the end-to-end process of providing data
consumers access to essential datasets without making endless copies of
data.” [Stirman 2018].
Politics and GDPR
The European Union's General Data Protection Regulation (GDPR) went into effect on May 25, 2018. While GDPR is concentrated in Europe, outside Europe many corporations are seeking to maximise the private data they "can" gather. Although GDPR fines can be as high as 20,000,000 Euros or four percent of annual global turnover, whichever is higher, many businesses, especially in the United States, are still not prepared.
5G
Moving to 5G is expensive and comes with potential issues. Although the US
Federal Government supports 5G, some communities have passed legislation
halting the installation of a 5G infrastructure.
An additional impediment to the uptake of 5G is a decision by the United States FCC to eliminate regulations supporting net neutrality, the principle that offered internet providers, and their users, a level playing field and promoted competition.
Hybrid Clouds
Hybrid Clouds combine an organisation’s private Cloud with the rental of a public
Cloud, offering the advantages of both. Clouds and Hybrid Clouds have been
steadily gaining in popularity and while an organisation may want to keep some data
secure in its own data storage, the tools and benefits of a hybrid system make it
worth the expense.
Typically, the data and applications in a hybrid cloud can be transferred between an
organisation’s on-premises cloud and Infrastructure as a service (IaaS) public
clouds. For example, a low-security, high-volume workload such as email could be deployed to
a public cloud, while high security projects such as financial reports and research
and development could be kept in-house on the organisation’s private cloud.
"Cloud bursting" is a feature of hybrid cloud systems. This is a technology whereby an application runs within the on-premises cloud until there is a spike in demand, at which point the application "bursts" into the public cloud and taps into additional resources.
The Four Essential Vs
Data is only as valuable as the business outcomes it makes possible. How the data
is measured helps determine its value. To achieve this requires a big data analytics
platform. Once that platform is in place, you can measure the 4Vs – volume, velocity,
variety, and veracity.
Companies have invested heavily in building data warehouses and business
intelligence systems, but for the most part that data is structured. Unlike BI systems, where you know what information you want and design systems to deliver those specific types of information, exploring big data is about establishing correlations between things you do not yet know, which may lead to new possibilities.
Global corporations continue to produce ever-increasing volumes of data. Failing to harness this data can seriously impair effective decision-making and reduce efficiency and profitability. By using a big data analytics platform and carefully considering volume, velocity, variety and veracity, organisations can achieve robust and rapid reporting that supports successful compliance audits.
Understanding the customer, their tastes, buying habits and likely future needs is
paramount. The 4 Vs provide management criteria for customer lifetime value (CLV) that enhance and improve customer relationship management:
Volume-based value: The more comprehensive your integrated view of the
customer and the more historical data you have on them, the more insight you can
extract from it. These insights can be used to improve decision making when it
comes to acquiring, retaining, growing and managing customer relationships.
Velocity-based value: The faster you can process information into your data and
analytics platform, the more current the data in your reports, queries and dashboards
will be, leading to timely and correct decisions to achieve your customer relationship
management objectives.
Variety-based value: The more varied customer data you can collect from customer
relationship systems (CRM), social media, call centre logs, point of sale devices
(POS) and so on, the more multi-dimensional your view of your customers. This will help you develop a deeper understanding of current and potential customers and how to engage with them more profitably.

Figure 10.1: The Vs of Big Data (source: https://www.ibmbigdatahub.com/infographic/four-vs-big-data)
Veracity-based value: No amount of data is useful unless it is clean and accurate.
If you are going to make the correct decisions based upon data, that data must remain consolidated, cleansed, consistent, and current.
Example Applications
A big data analytics solution for a large-scale citizen identification program gathered
150 TB of data and in the process detected 3500 instances of fraud among 1.5
million enrolments. This may have gone undiscovered without big data analytics
capabilities.
A large Internet Service Provider (ISP) used a web analytics solution to process
unstructured data in real-time in order to identify top performing channels and
improve customer engagement and retention opportunities. The ISP gained insights
that resulted in increased revenue and greater customer retention.
By building a stable, cost-efficient and highly responsive cloud-based data
warehousing and analytics solution, a leading pharmaceutical and consumer goods
company achieved several unexpected benefits that improved the management of
the company’s day-to-day operations across sales, planning and promotions, and
enabled next-generation data mining, including big data processing and analytics
capabilities. This translated into new strategic promotions that took advantage of
unexpected market shifts and kept the organisation ahead of the competition.
Big data used to be about lower total cost of ownership (TCO), but as the technology
improves, big data is now being targeted at revenue growth or new market creation
opportunities.
The path to gaining greater value from big data starts by deciding what problems you
are trying to solve. If the biggest challenges are within IT, then the issues will be
largely driven by operational efficiency and increased performance.
However, if there are business problems that need to be solved, then the issues will
take a different perspective, such as customer journey mapping. Either way, by
applying volume, velocity, variety and veracity based values to big data
measurement, companies are now transforming big data analytics from a cost centre
to a profit centre.
The Ten Vs of Big Data
We have looked at the 4 Vs of big data, but for some commentators this doesn't go far enough [Firican 2019]. According to George Firican, there are 10 Vs:
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Variability
6. Validity
7. Vulnerability
8. Volatility
9. Visualisation
10. Value
We have already considered volume, velocity, variety and veracity, so now we will look at the additional six Vs.
Variability refers to the number of inconsistencies in the data and the multitude of data
dimensions resulting from multiple disparate data types and sources. It can also refer to the
inconsistent speed at which big data is loaded into the organisation’s database.
Validity refers to the accuracy and correctness of the data for its intended use. Big data analytics obeys the GIGO (garbage in, garbage out) principle: the benefit from big data analytics is only as good as
its underlying data. An organisation must adopt good data governance practices to ensure
consistent data quality, common definitions, and metadata.
Vulnerability Big data brings new security concerns. A big data breach is, by its very nature, a serious breach. Examples include the Ashley Madison hack [Hackett 2015] and the case of a hacker who called themself "Peace" and posted for sale, on the dark web, information on 167 million LinkedIn accounts and "… 360 million emails and passwords for MySpace users."
These are by no means rare occurrences. Visit
https://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/ for
more information. You might even have second thoughts about using social media.
Volatility Ultimately data will be too out of date to be useful. In the past, relatively small
amounts of data didn’t cost very much to store, but the huge amounts of data that are
common today can have significant storage costs. Because of the velocity and volume of
big data, its currency needs to be carefully considered. An organisation needs carefully
established rules for data currency so that out of date data doesn’t add to data storage
costs.
Visualization Current big data visualization tools are constrained by the limitations of in-memory technology: poor scalability, functionality, and response time. Traditional graphical
techniques are not up to plotting a billion data points. Different ways of representing data
are required such as tree maps, sunbursts, parallel coordinates, circular network diagrams,
or cone trees.
The multitude of variables resulting from big data's variety and velocity, and the complex relationships between them, make developing a meaningful visualization non-trivial.
Value Arguably this is the most important of the Vs. The other features of big data are
meaningless if you don’t derive business value from the data.

10.3 Data Analytics
Data Analytics is the science of taking raw data and analysing it in order to draw
conclusions about the information contained therein. Many of the processes and
techniques of data analytics are now automated because of the vast amounts and
types of data currently available. Done well, data analytics can reveal business
trends that would otherwise be lost in the mass of raw data. The information
revealed can then be used to optimise operations to increase the overall efficiency
and profitability of a business.
Understanding Data Analytics
Data analytics encompasses a diverse range of data analysis. By and large, most types of data are amenable to data analytical techniques. For example, a supermarket can
monitor when checkouts are open or closed and patterns of queuing, then analyse
the data to better plan the workloads so the checkouts operate closer to peak
capacity.
The process of data analysis involves several different steps, including:
1. Determining how the data is structured
2. Data Collection
3. Organising the data
4. Cleaning the data
Step one involves determining the structure of the data and how it is grouped. Data
may be categorised by age, demographic, income, or gender. Data may be numeric
or take some other form.
Data collection can be achieved by using environmental sensors, mining databases,
Internet sources, cameras and social media.
Once the data has been collected, it must be organised so that it can be analysed.
The data is cleaned to ensure that it is free of errors, omissions and duplications
before it goes on to a data analyst to be analysed.
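As an illustration of steps 3 and 4, here is a minimal Python sketch assuming the pandas library; the file name checkout_log.csv and its columns are hypothetical examples rather than a real dataset.

import pandas as pd

df = pd.read_csv("checkout_log.csv")                     # step 2: the collected data

# Step 3: organise - parse types and sort so the data can be analysed.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df = df.sort_values("timestamp")

# Step 4: clean - remove duplicates, drop rows missing a key field,
# and fill gaps in a numeric measure with a robust default.
df = df.drop_duplicates()
df = df.dropna(subset=["checkout_id"])
df["queue_length"] = df["queue_length"].fillna(df["queue_length"].median())

print(df.describe())                                     # quick sanity check before analysis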
By implementing data analytics, a business can optimise its performance through
better decision making leading to reduced costs, and improved customer satisfaction
through the introduction of new and better products and services.
In week one’s notes we covered types of analytics in some detail. However, as a
reminder data analytics can be broken down into three types: descriptive, predictive
and prescriptive.
Descriptive analytics describes what has happened over a given period of time.
Predictive analytics deals with what is likely going to happen.
Prescriptive analytics suggests a course of action that might be taken. For more
information refer to your previous materials.
Data analytics underpins many financial quality control systems, including the Six
Sigma program, which is covered in your week 3 notes on Business Process
Management.
If you aren’t properly measuring something it is virtually impossible to optimise it.

Who is Using Data Analytics?
The travel and hospitality industries, where rapid turnaround is required, have embraced data analytics. The NHS uses data analytics to help it utilise its limited
resources to the best effect. The retail industry uses large amounts of data to meet
the ever-changing demands of shoppers.
The Analytics Lifecycle
A big data analytics cycle can be described by the following stages:
Business Problem Definition
Research
Human Resources Assessment
Data Acquisition
Data Munging
Data Storage
Exploratory Data Analysis
Data Preparation for Modelling and Assessment
Modelling
Implementation
Business Problem Definition: It is a self-evident proposition that you cannot solve
a problem until you know what the problem is and understand it. This is a non-trivial
stage of any project and needs to be evaluated in terms of the costs and benefits of
solving the particular problem.
Research: Look at what other companies have done in the same situation. Don’t reinvent the wheel; if there is already a solution out there use it, even if you have to
adapt it to your particular needs. You can also read trade journals or arrange for
presentations from likely solution vendors. In this stage, a methodology for the
future stages should be defined.
Human Resources Assessment: You need to make sure you have the expertise to
deliver the project on time and on budget. If not, you may need to retrain staff, take on new staff or outsource part of the project.
Data Acquisition: This is a vitally important non-trivial step in the process. It will
involve gathering unstructured data from a multitude of different sources including
the Web. It could involve dealing with text in different languages that will need to be
translated.
Data Munging: Once the data is retrieved, for example from the web, it needs to be stored in an easy-to-use format. Many reviewers rate something in terms of one to five stars, which you can map to a response variable y ∈ {1, 2, 3, 4, 5}. Other reviewers may use two arrows, one for up-voting and the other for down-voting, implying a response variable of the form y ∈ {positive, negative}. In order to combine both data sources, a decision has to be made to make these two response representations equivalent.
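A minimal Python sketch of that munging decision follows; the rule that four or more stars counts as positive is an illustrative assumption, not a standard.

def stars_to_response(stars):
    # Map a one-to-five star review onto the {positive, negative} scale.
    return "positive" if stars >= 4 else "negative"

def vote_to_response(vote):
    # Up/down votes already carry two levels; just normalise the labels.
    return "positive" if vote == "up" else "negative"

combined = (
    [stars_to_response(s) for s in [5, 3, 4, 1]] +
    [vote_to_response(v) for v in ["up", "down", "up"]]
)
print(combined)   # both sources now share one response representation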
Data Storage: Once the data is processed, it sometimes needs to be stored. One
storage solution is the Hadoop Distributed File System (HDFS), on top of which the Hive data warehouse provides users with a limited version of SQL known as the Hive Query Language. Other options include MongoDB, Redis, and Spark.
Modified versions of traditional data warehouses are still being used in large scale
applications. Teradata and IBM offer SQL databases that can handle terabytes of
data. Open-source solutions include PostgreSQL and MySQL. Most solutions
provide a SQL API, so a good understanding of SQL is still highly desirable.
Interestingly, although you might think data storage the most important stage, it is possible to implement a big data solution that collects and models the data in real time, in which case there is no need to store the data at all.
Exploratory Data Analysis: The objective of this stage is to understand the data; this is typically done using statistical analysis. This is also a good stage at which to re-evaluate whether the problem definition makes sense or is feasible.
Data Preparation: This stage is about taking the cleaned data and applying statistical pre-processing: feature selection and extraction, imputation of missing values, normalisation and outlier detection.
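A minimal sketch of this stage, assuming scikit-learn and NumPy; the small array stands in for a real cleaned dataset.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],    # missing value to impute
              [3.0, 250.0],
              [40.0, 240.0]])   # possible outlier in the first feature

prep = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),   # fill missing values
    ("scale", StandardScaler()),                    # normalise each feature
])
print(prep.fit_transform(X))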
Modelling: The data preparation stage should have produced several datasets that
can be used for training and testing. The best model or combination of models is
selected and tested on an unused dataset.
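A minimal sketch of the modelling stage, assuming scikit-learn; the synthetic dataset and the two candidate models are placeholders for whatever the data preparation stage actually produced.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Select the best model by cross-validation on the training data only...
scores = {name: cross_val_score(m, X_train, y_train, cv=5).mean()
          for name, m in candidates.items()}
best_name = max(scores, key=scores.get)

# ...then confirm it on a dataset the model has never seen.
best = candidates[best_name].fit(X_train, y_train)
print(best_name, "test accuracy:", best.score(X_test, y_test))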
Implementation: This is where the analytics product is implemented in the
organisation’s data pipeline. A method of validation will be required to track the
performance of the product to make sure that it is working properly. In the case of
implementing a predictive model, this stage would involve applying the model to new
data and once the response is available, evaluating the accuracy of the model’s
predictions.
Data Mining Process
Five Data Management Best Practices to prepare Data for Analytics
1. Simplify access to traditional and emerging data. More data generally means better
predictions.
2. Strengthen the data scientist's arsenal with advanced analytic techniques. For example, frequency analysis helps identify outliers and missing values that can skew measures such as the mean and median (see the sketch after this list).
3. Cleanse data to build quality into existing processes. Up to 40 percent of all strategic
processes fail because of poor data.
4. Shape data using flexible manipulation techniques. Preparing data for analytics
requires merging, transforming, de-normalizing and sometimes aggregating your
source data from multiple tables into one very wide table.
5. Share metadata across data management and analytics domains. This promotes
collaboration, provides lineage information on the data preparation process, and
makes it easier to deploy models.
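The sketch below, in Python and assuming pandas, illustrates the frequency analysis mentioned in point 2; the small series of ages is hypothetical.

import pandas as pd

ages = pd.Series([34, 35, 34, 36, 34, None, 34, 340])   # 340 is a likely typo, None is missing

print(ages.value_counts(dropna=False))    # the frequency table exposes the odd values
print("missing values:", ages.isna().sum())
print("mean (skewed by the outlier):", ages.mean())
print("median (more robust):", ages.median())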
Most common standard processes
The Cross-Industry Standard Process for Data Mining (CRISP-DM) was created in 1996 and consists of six phases that occur in a cyclical process: business understanding, data understanding, data preparation, modelling, evaluation and deployment. Although it was created specifically for data mining, and although technology has moved on, the stages still have relevance for analytics.
Figure 10.2: CRISP-DM (source: datasciencecentral.com)
Business Understanding: Focuses on understanding the project objectives
and requirements from a business perspective, and then converting this
knowledge into a data mining problem definition and a preliminary plan.
Data Understanding: This phase starts with initial data collection and
proceeds to getting familiar with the data: for example, assessing its quality and detecting interesting subsets that suggest hypotheses about hidden information.
Data Preparation: This phase covers all activities to construct the final dataset
from the initial raw data.
Modelling: Modelling techniques are selected and applied. Since some
techniques like neural networks have specific requirements regarding the form
of the data, there can be a loop back here to data prep.
Evaluation: Once the model(s) have been built, they need to be tested to
ensure they generalise against unseen data and that all key business issues
have been sufficiently considered.
Deployment: Typically this will mean deploying a code representation of the
model into an operational system to score or categorise new unseen data as it
arises and to create a mechanism for the use of that new information in the
solution of the original business problem.

The CRISP-DM offers a uniform framework to create documentation and guidelines. In
addition, the CRISP-DM can be applied to various industries with different types of data.
For more information read “CRISP-DM – a Standard Methodology to Ensure a Good
Outcome” by William Vorhies at
https://www.datasciencecentral.com/profiles/blogs/crisp-dma-standard-methodology-to-ensure-a-good-outcome
Sample, Explore, Modify, Model, and Assess (SEMMA)
The steps in this process are as follows: sample, explore, modify, model and assess.
Figure 10.3: SEMMA (source: sas.com)
Sample: In the Sample phase, the sample must be large enough so that
hidden relationships and patterns can be detected, but small enough to be
manageable.
Explore: In the Explore phase, techniques like clustering, classification and
regression look for relationships to study during the process. Anomalies and
outliers are also examined.
Modify: The Modify phase selects and transforms variables for the next
phases using various analytical tools to determine the best model for
predicting outcomes.
Model: This stage consists of modelling the data by allowing the software to
search for data that predicts a desired outcome.
Assess: The Assess phase studies the reliability and usefulness of the results.
Modifications might be necessary and some of the steps might need to be
repeated.

Hadoop and MapReduce
Hadoop is an open-source framework, from the Apache Foundation, which has the capability to process large amounts of heterogeneous data across clusters of hardware.
Architecture of Multi-Node Hadoop Cluster
Hadoop works in a master-worker / master-slave fashion. Hadoop has two core components: the Hadoop Distributed File System (HDFS) and MapReduce.

HDFS is a very reliable distributed storage system. When data is pushed to HDFS it is automatically split into blocks that are replicated across various data nodes. This ensures high levels of availability and fault tolerance.
MapReduce is an analytical system that can perform complex computations on large
datasets and is responsible for performing all the computations. It works by breaking
down a large complex computation into multiple tasks, which it assigns to individual
worker/slave nodes. It also takes care of coordinating and consolidating the results.
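To illustrate the idea without a cluster, here is a minimal Python sketch of the map, shuffle and reduce steps applied to a word count; Hadoop's contribution is to distribute exactly these steps across many worker nodes.

from collections import defaultdict

documents = ["big data needs big ideas", "data drives decisions"]

# Map: each document is turned into (word, 1) pairs, independently of the others.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: consolidate each group into a single result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, 'needs': 1, ...}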
In the master / slave and master / worker relationships, the master contains the
Namenode and Job Tracker components.
Namenode is a repository of information about all the other nodes in the Hadoop
Cluster, files present in the cluster, constituent blocks of files and their locations in
the cluster together with other information useful for the operation of the Hadoop
Cluster. Job Tracker keeps track of the individual tasks/jobs assigned to each of the
nodes and coordinates the exchange of information and results.
Each Worker / Slave contains the Task Tracker and Datanode components. Task
Tracker runs the task / computation assigned to it. Datanode holds the data.
The computers in the cluster can be located anywhere; there is no dependency on the location of the physical servers.
Characteristics of Hadoop
The main features of Hadoop are that it offers a reliable shared storage and analytics
package. It is linearly scalable and can therefore contain tens, hundreds, or even
thousands of servers. Hadoop is a cost-effective solution as it doesn’t require
expensive high-end hardware. It can process both structured and
unstructured data.

Because data is replicated across multiple nodes, Hadoop is fault-tolerant. In the
event of a node failing, the required data can be read from another node which has
the copy of that data. It even ensures that the replication factor is maintained, when
a node fails, by replicating the data to other available nodes.
Hadoop is optimised for large and very large data sets. If a small amount of data is
fed to Hadoop it will probably take longer to run than on a traditional system.
When to Use Hadoop
Hadoop is not a universal solution. It is good at some things and works less well for
others. Hadoop works best in scenarios such as:
Analytics
Search
Data Retention
Log file processing
Analysis of Text, Image, Audio, & Video content
Recommendation systems like in E-Commerce Websites
When Not to Use Hadoop
There are a few scenarios in which Hadoop is not the right fit:
Low-latency or near real-time data access.
When there is a large number of small files to process. Namenode holds the file
system metadata in memory and as the number of files increases, the amount of
memory required to hold the metadata increases.
Scenarios requiring multiple writes, arbitrary writes, or writes between files.
For more information on Hadoop framework and the features of the latest Hadoop
release, visit the Apache Website:
http://hadoop.apache.org.
Other Open Source Options
Apart from Hadoop, there are a huge number of other options; too many to cover all
of them so we will take a brief look at some of them and leave you to fill in the
details.
Apache Spark
Apache Spark promises to run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster from disk. Spark's DAG (directed acyclic graph) execution engine supports acyclic data flow and in-memory computing. Spark powers a stack of
libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and
Spark Streaming. For more information visit
https://spark.apache.org/
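A minimal PySpark sketch, assuming the pyspark package is installed; the file sales.csv and its region column are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.groupBy("region").count().show()   # a distributed aggregation expressed like SQL

spark.stop()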
Talend
Talend offers commercial products as well as free products. The free and open-source product is called Talend Open Studio. Open Studio comprises Open Studio
for Big Data, Open Studio for Data Integration, Open Studio for Data Quality, Open
Studio for ESB and Open Studio for MDM. For more information visit
https://www.talend.com/
MongoDB
MongoDB is a free and open-source non-relational data storage solution, one of the best-known NoSQL databases. Companies that use MongoDB, as mentioned on its website, include Expedia, Forbes, MetLife, OTTO, BOSCH and the City of Chicago. For
more information visit
https://www.mongodb.com/
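A minimal sketch using the pymongo driver, assuming a MongoDB server is running locally; the database, collection and documents are hypothetical.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
reviews = client["shop"]["reviews"]        # database "shop", collection "reviews"

reviews.insert_one({"product": "kettle", "stars": 4, "text": "Boils quickly"})

# Documents are schema-free, JSON-like structures queried by example:
for doc in reviews.find({"stars": {"$gte": 4}}):
    print(doc["product"], doc["stars"])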
SpagoBI
SpagoBI is an open source business intelligence and big data
analytics platform. SpagoBI offers a variety of tools for reporting, multidimensional
analysis (OLAP), charts, location intelligence, data mining, ETL and much more. For
more information visit
https://www.spagobi.org/
Apache Storm
Apache Storm is a free and open-source data analysis system known for its real-time processing. It can be used with any programming language, for purposes such as real-time data analytics, online machine learning, distributed RPC, continuous computation, ETL and more. It is scalable and fault-tolerant, has fast processing capabilities and is easy to operate and deploy. For more information visit
https://storm.apache.org/
Apache Drill
Apache Drill is a schema-free SQL Query Engine for Hadoop, NoSQL and Cloud
Storage. Apache Drill supports multifarious NoSQL databases and file systems such
as Google Cloud Storage, Swift, NAS, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage and local files. For more information visit
https://drill.apache.org/
Seven Steps to Successful Big Data Analytics
There are seven factors that will help to make a Big Data Analytics project
successful. These are: a clear business need, strong committed sponsorship,
alignment between the business and IT strategies, a fact-based decision-making culture, a strong data infrastructure, the right analytical tools, and personnel with
advanced analytical skills.

A clear business need: Big data analytics can deliver massive value, but too often companies let technology guide their efforts. Instead, decisions must be based on business priorities.

Strong committed sponsorship: For any project to be successful, it needs strong, committed sponsorship from all of the stakeholders, starting from the top.

Alignment between the business and IT strategy: The alignment of business strategy with data management ensures that technology investments provide the insight needed for organisations to make the right decisions in a timely fashion. This requires selecting tools that support current and future strategic and business-oriented goals. This is a non-trivial task, but the way data is managed is essential.
A fact-based decision-making culture: A fact-driven culture is one in which an organisation's operational progress is measured using data rather than intuition and guesswork. Scientists refer to this as evidence-based decision making. Transparency and accountability are nurtured around data, and team members test hypotheses whose results ultimately drive the decisions.
A strong data infrastructure: If implemented correctly a strong data infrastructure
should reduce operational costs, boost supply chains and serve as a baseline for
developing a progressive global economy.
The right analytical tools: In order to choose the right big data analysis tools, it’s
important to understand the transactional and analytical data processing
requirements of your systems and choose accordingly.
Personnel with advanced analytical skills: Includes knowledge of the problem area
and the ability to translate business questions into a set of analytical tasks that
produce the output required by decision makers. In addition, staff should have:
An up-to-date knowledge of current and emerging technologies
The ability to understand the relationships that exist between the data
The ability to properly clean and prepare the data for analysis
Communication and interpersonal skills
Big Data Technologies
We have looked at Hadoop and MapReduce. To round off, here are a few more big data
technologies to consider.
Google
Google claim that their cloud-based platform, Google Cloud Platform (GCP), allows
organisations to
“Realize the benefits of serverless, integrated, and end-to-end data
analytics services that surpass conventional limitations on scale,
performance, and cost efficiency” [Google 2019]
BigQuery, a highly scalable enterprise data warehouse, allows users to analyse
petabytes of data using ANSI SQL. As there is no infrastructure to manage, users
can focus on uncovering meaningful insights using SQL without the need for a
database administrator. For more information on BigQuery read the article at
https://cloud.google.com/bigquery/
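A minimal sketch using the google-cloud-bigquery client library, assuming a GCP project and credentials are already configured; the query runs against one of Google's public sample datasets.

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(sql).result():   # result() waits for the query job to finish
    print(row.name, row.total)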
Representational State Transfer (REST) is an architectural style for web service APIs used for integrating with other applications. Developers can create analytics applications using a wide range of programs. For more information about Representational State Transfer
read the article at
https://www.codecademy.com/articles/what-is-rest
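A minimal sketch of calling a REST API from Python, assuming the requests package; the endpoint URL and parameters are hypothetical placeholders rather than a real Google service.

import requests

response = requests.get(
    "https://api.example.com/v1/sales",        # hypothetical endpoint
    params={"region": "EMEA", "limit": 10},
    timeout=10,
)
response.raise_for_status()                    # fail loudly on HTTP errors
for record in response.json():                 # REST APIs typically exchange JSON
    print(record)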
Cloud Pub/Sub provides a simple and reliable staging location for your event data on
its journey towards processing, storage, and analysis allowing an organisation to
ingest millions of events per second from anywhere in the world via an open API and
publish it anywhere in the world. For more information about Cloud Pub/Sub read
the article at
https://cloud.google.com/pubsub/
Cloud Dataflow is a fully-managed service for transforming and enriching data in
stream (real time) and batch (historical) modes with equal reliability and
expressiveness. It enables rapid streaming and batch data pipeline development
without compromising robustness, accuracy, or functionality. For more information
on Cloud Dataflow read the article at
https://cloud.google.com/dataflow/
GCP’s open architecture can be fully integrated with popular open source tools.
Google claim that an organisation can increase the performance of Apache Spark
and Hadoop workloads and reduce costs by moving to Cloud Dataproc, which it is
claimed can be used to quickly create and resize clusters of anywhere from 3 to many hundreds of nodes. This means that data pipelines shouldn't outgrow their clusters. Each cluster action takes less than 90 seconds on average. For more details on Cloud Dataproc read the article at
https://cloud.google.com/dataproc/
Integration with open-source tools such as Apache Kafka and data formats (such as
Apache Avro) is straight-forward. Data pipelines built on Apache Beam within GCP
work on a choice of open-source runtimes such as Spark or Flink. For more
information on Apache Beam read the article at
https://beam.apache.org/.
Cloud Data Fusion delivers managed data integration through a graphical interface
and a broad open-source library of preconfigured connectors and transformations
based on CDAP. For more information about Cloud Data Fusion and CDAP read the
articles at
https://cloud.google.com/data-fusion/ and
https://medium.com/cdapio/building-a-data-lake-on-google-cloud-platform-with-cdapaf9aece7a2bc.
Cloud Composer, built on Apache Airflow, is used to orchestrate workflows across public clouds and to take advantage of visualization tools like Tableau, Qlik, Looker, Data Studio, and
BigQuery BI Engine. For more information on Cloud Composer, Apache Airflow and
Data Studio visit the websites at
https://cloud.google.com/composer/,
https://airflow.apache.org and https://marketingplatform.google.com/about/datastudio/.
Amazon Web Services (AWS)
AWS is a platform consisting of a variety of cloud computing services offered by
Amazon.com.
CloudFront is a content-delivery network (CDN) that mirrors resources at “edge
locations” in order to improve page loading time. For more information visit:
https://aws.amazon.com/cloudfront/
Relational Database Service (RDS) is a scalable database service that, as well as supporting Amazon's Aurora implementation of MySQL, also supports
MySQL/MariaDB, PostgreSQL, Oracle, and Microsoft SQL Server. For more
information see:
https://aws.amazon.com/rds/
DynamoDB offers scalable NoSQL database support. For more information visit:
https://aws.amazon.com/dynamodb/
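A minimal sketch using the boto3 SDK, assuming AWS credentials are configured and that a hypothetical Orders table with an order_id partition key already exists.

import boto3

dynamodb = boto3.resource("dynamodb")
orders = dynamodb.Table("Orders")

orders.put_item(Item={"order_id": "1001", "customer": "A. Smith", "total": 42})
item = orders.get_item(Key={"order_id": "1001"}).get("Item")
print(item)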
Elastic Beanstalk allows users to quickly deploy and manage applications in the
cloud from preconfigured container images. For more information visit:
https://aws.amazon.com/elasticbeanstalk/
Other specialised resources offered by AWS include Elastic Transcoder which
permits videos stored on S3 to be easily transcoded for mobile devices. For more
information visit:
https://aws.amazon.com/elastictranscoder/
Amazon Connect is a cloud-based contact centre service delivered through AWS,
allowing businesses to scale to thousands of customer support agents. For more
information visit:
https://www.techrepublic.com/article/amazon-connect-is-a-cloudbased-call-center-you-can-launch-within-minutes/.
Oracle
At the Kscope19 event in Seattle, Oracle announced a major upgrade to Oracle
Analytics. Oracle Business Intelligence Enterprise Edition (OBIEE) is to be replaced
by Oracle Analytics Server (OAS).
OAC and OAS Overview: Rather than having separate terms for in-house and cloud
analytics products, Oracle groups everything under the umbrella of Oracle Analytics; this includes Oracle Analytics Cloud (OAC) and Oracle Analytics Server (OAS). OAC is their flagship product. To operate, it requires:
Oracle Cloud Storage (OCS) for backups, log files, etc.
Oracle Database Cloud Instance (DBC) which is used for Repository Creation Utility
(RCU) schemas.
Oracle Analytics Cloud Instance (OAC)
Oracle Cloud Storage provides cloud storage through a global network of Oracle Corporation-managed data centres. These services are provisioned on demand over
the Internet. For more information visit:
https://www.oracle.com/cloud/storage/
Oracle Database Cloud Service provides a platform to create full Oracle database
instances in a virtual machine (VM). Oracle provides access to the features and
operations available with Oracle Database, and optionally provides some database maintenance and management operations automatically.
https://www.oracle.com/database/cloud-services.html
Oracle Analytics Cloud is a scalable and secure public cloud service that provides a
full set of capabilities to explore and perform collaborative analytics for workgroups and the enterprise. OAC also provides flexible service management capabilities,
including fast setup, easy scaling and patching, and automated lifecycle
management. For more information visit:
https://docs.oracle.com/en/cloud/paas/analytics-cloud/acsgs/what-is-oracleanalytics-cloud.html#GUID-E68C8A55-1342-43BB-93BC-CA24E353D873
OAS
Oracle Analytics Server brings the capabilities of Oracle Analytics Cloud to
customers requiring in-house deployments. OAS is designed to allow customers in highly regulated industries, or with multi-cloud architectures, to have analytics capabilities on their own terms and on their preferred deployment architecture. OAS allows
heritage systems to be maintained while offering a path to the cloud if required.
The Oracle Corporate Perspective
Oracle's vision for the future of analytics has three precepts:
Augmented
Integrated
Collaborative
Augmented: Oracle sees a future in which machine learning and AI will create a
world in which there will be less human involvement. The new augmented world will
empower the end user, helping to achieve that goal of 100 percent data literacy.
Integrated: Oracle wants Analytics to be integrated within everyday workflow, on the
desktop and mobile.
Collaborative: This involves fostering collaboration throughout an organisation,
getting rid of what Oracle see as the disconnection of having different analytic tools
for each different department across an organisation.
For OAC and OAS, analytics will be governed, self-service, and augmented, addressing the needs of business users, developers, and IT.
Governed services include:
Corporate Dashboards
Pixel Perfect Report
Semantic Models
Role-based Access Control
Query Federation
Self-Service services include:
Data Preparation
Data Visualization
Storytelling
Sharing and Collaboration
Mobile Apps
Augmented services include:
Natural Language Processing
Voice & Chatbot
Data Enrichment
One Click “Explain”
Adaptive Personalization
Though OAC and OAS will look almost the same, OAS will not include the Natural
Language Generator feature, which Oracle argues is a powerful feature that
generates explanations of your visualizations and works in 28 different languages.
Cloudera
Cloudera Data Platform (CDP) manages data with a suite of multi-function analytics
to ingest, transform, query, optimize and predict, together with sophisticated and granular security and governance policies. CDP manages the entire platform, on multiple public clouds, in-house, or in any combination of these, from a single pane.
For more information visit:
https://www.cloudera.com/products/cloudera-dataplatform.html
Cloudera Data Warehouse (CDW) is an auto-scaling, highly concurrent and cost-effective analytics service that ingests data anywhere, from structured, unstructured
and edge sources. CDW supports hybrid and multi-cloud infrastructure models by
seamlessly moving workloads between in house and any cloud for reports,
dashboards, ad-hoc and advanced analytics, including AI, with consistent security
and governance. Cloudera Data Warehouse offers zero query wait times, reduced IT
costs and agile delivery.
Cloudera Machine Learning (CML) brings the agility and economics of cloud to self-service machine learning workflows with governed business data and tools that data
science teams need, anywhere.
Cloudera DataFlow (CDF), is a scalable, real-time streaming analytics platform that
ingests, curates, and analyses data for key insights and immediate actionable
intelligence. DataFlow addresses the key challenges enterprises face with data-in-motion:
Processing real-time data streaming at high volume
Tracking data provenance and lineage of streaming data
Managing and monitoring edge applications and streaming sources
Cloudera Data Science Workbench (CDSW) accelerates machine learning from research to production with a secure, self-service data science platform built for the enterprise.
Free/Open source tools
Here are a few additional open source tools.
Orange
Orange is an open-source machine learning and data visualization platform for performing simple data analysis, with data visualizations that support box plots and scatter plots, decision trees, hierarchical clustering, heatmaps, MDS and linear projections. You can get more information at
https://orange.biolab.si/home/interactive_data_visualization/
The Orange graphic user interface (GUI) allows the user to undertake fast
prototyping of a data analysis workflow, place widgets on the canvas, connect them,
load datasets and analyse the data. For more information see:
https://orange.biolab.si/home/visual-_programming/ and
https://www.youtube.com/watch?v=lb-x36xqJ-E%3fstart%3d6&autoplay=1
Orange provides a suite of add-ons for mining data from external data sources,
performing natural language processing and text mining, and conducting network analysis. Additionally, bioinformaticians and molecular biologists can use Orange to
rank genes by their differential expression and perform enrichment analysis. For
more information see
https://www.youtube.com/watch?v=OANsA6fMJKg
Google Charts
Google chart tools are simple to use, and free. Users can choose from a variety of
charts, from scatter plots to hierarchical tree-maps. For more information on the
Chart gallery, visit
https://developers.google.com/chart/interactive/docs/gallery
Users can create their own charts to match the look and feel of a website. See:
https://developers.google.com/chart/interactive/docs/customizing_charts
Google charts supports cross-browser compatibility and cross-platform portability to
iOS and new Android releases without the use of plug-ins. See
https://developers.google.com/chart/interactive/docs
Easily connect charts and controls into an interactive dashboard. See:
https://developers.google.com/chart/interactive/docs/gallery/controls
Connect to data in real-time using a variety of data connection tools and protocols.
See:
https://developers.google.com/chart/interactive/docs/queries
BIRT
BIRT is a top-level not-for-profit software project within the Eclipse Foundation. It is
an open source software project for creating data visualizations and reports that can
be embedded into client and web applications, especially those based on Java and
Java EE.
BIRT has two main components: a visual report designer for creating designs, and a
runtime component for generating those designs that can be deployed to any Java
environment. There is also a charting engine that can be used to integrate charts
into an application.
BIRT designs can access a number of different data sources including JDO
datastores, JFire Scripting Objects, POJOs, SQL databases, Web Services and
XML. You can get more information on BIRT architecture at:
https://www.eclipse.org/birt/about/architecture.php
For more information on BIRT generally visit https://www.eclipse.org/birt/about/
Data-Driven Documents (D3.js)
D3.js is a JavaScript library for manipulating documents based on data. D3 helps
bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards
supports the full capabilities of modern browsers without reliance on a proprietary
framework.
You can discover more about D3 at
https://d3js.org/
Cytoscape.js
Cytoscape.js is an open-source graph theory library written in JS for graph analysis
and visualisation created by The Donnelly Centre at the University of Toronto.
Cytoscape.js allows the client to hook into user events and is easily integrated into
an app, particularly as Cytoscape.js supports desktop browsers, and mobile
browsers.
The Cytoscape.js library contains many useful functions in graph theory. Cytoscape.js
can be used on Node.js to do graphical analysis in the terminal or on a web server.
Cytoscape.js is an open-source project, and anyone is free to contribute. For more
information, refer to
https://github.com/cytoscape/cytoscape.js and
https://js.cytoscape.org/
10.4 Changing business environments
Businesses need to evolve and adapt to the ever-changing business environment they operate within. Organisations must work in agile ways, quickly adapting and making strategic and operational decisions frequently. To make these decisions in an efficient and effective manner requires understanding and analysis of vast amounts of data.
Innovation: Innovation is when an organisation introduces new processes, services, or products in order to effect positive change in its business, either by improving existing methods or practices or by introducing something new, with the ultimate aim of reinvigorating the business, boosting growth and/or productivity.
For a business to thrive, it is crucial to be continually innovating and improving,
finding new revenue opportunities, optimising existing channels and, ultimately,
generating higher profits, giving the organisation an advantage over their
competitors.
Innovation Models: There are a number of ways in which an organisation can
innovate. Businesses at different stages of maturity and sizes have different reasons
for embarking on a process of business innovation. For some it will involve
streamlining and improving current operations. For others it may involve
diversification into other industries.
Revenue Model Innovation: If increasing profits is the main driver for business
innovation, an organisation may opt to change their revenue model. This might
involve revising the company’s pricing strategy or re-assessing the products and/or
services they offer. Innovation does not have to be radical, sometimes changing just
one element can yield significant results.
Business Model Innovation: This model involves identifying which processes,
products or services could be improved to boost profitability. This might mean
implementing new technologies, outsourcing specific tasks or forming new
partnerships.
Industry Model Innovation: This is perhaps the most radical model. An
organisation may decide to change industry completely or create a whole new
industry for themselves. For example, Virgin’s move into broadband.
Industries embracing Innovation
Examples of industries embracing innovation are the law, health, and packaging.
The Law: Taylor Vinters has formed a partnership with the artificial intelligence-focused companies Pekama and ThoughtRiver, while at the same time selling off its
regional real estate business. The company’s focus is now on entrepreneurship and
innovation, with technology forming an important part of the firm’s business, opening
doors to a whole new client base.
Packaging: Current environmental pressures mean that public opinion is turning
against plastics and one-use plastics in particular. This has encouraged
organisations to explore alternative materials such as biodegradable seaweed-based
pouches for ketchup. The on-line food delivery service Just Eat is trialling seaweed
sachets, developed by Skipping Rocks Lab, which are biodegradable within six
weeks. See
https://www.raconteur.net/sustainability/sustainable-packaging-foodsector
Healthcare: The NHS is one of the UK’s largest employers and the increase in
demand placed on it by the UK’s aging population means that it must find ways to
cut costs and improve services. Recent innovations include the introduction of
artificially intelligent chatbots to help patients self-serve. In the area of organ
donation, the NHS has begun working with digital consultancy T-Impact to automate
and improve the process of matching donated hearts with recipients. See
https://www.raconteur.net/technology/breathing-new-life-into-the-nhs. The new
digitised process has eliminated forty steps that were performed manually by staff
resulting in a 68% reduction in NHS administration time, freeing up clinicians to do
what they do best – treating their patients.
Technologies driving business innovation
Technologies that are driving innovation include artificial intelligence (AI) and ultra-high-speed internetworking.
The power and potential of AI cannot be overstated. Just about every industry and every other walk of life will be transformed by it in some way. It has been suggested that AI could add $15.7 trillion to the global economy by 2030. Randy Dean of Launchpad.AI has argued that "Everything invented in the past 150 years will be reinvented using AI within the next 15 years" [Pickup 2018].
Forecasts suggest that by 2020 95% of all customer interaction will involve some
form of AI. My electricity supplier requests meter readings by email. Typically, my
bill will be emailed to me within two hours of my submitting the readings. This is
because the company makes extensive use of AI in running its business.
If you apply to a bank to open an account online, you will be dealt with by a machine.
The Japanese investment bank Daiwa Securities found that customer purchases increased by approximately two and a half times after it implemented AI to detect and react to consumer emotions [Woollacott 2018].
In healthcare and pharmaceuticals, AI tools have been built which can sort and accumulate medical knowledge and data on a vast scale.
In the fight against cancer, AI has brought the time and cost of sequencing someone's genome down to just 24 hours and $1,000 respectively [Vella 2017].
It has long been acknowledged that time is money, and the most important tool for business innovation is one which can help organisations move faster: enter ultrafast internetworking. UK provider Gigaclear offers speeds of 900 Mbps, but in South Korea 2,500 Mbps has already been achieved. To put this in the context of business innovation, it will now be possible to restore a medium-sized corporate server in a little over an hour, compared with 28 days previously.
Prolabs has production facilities in Gloucestershire and California. Their chief
technology officer says:
“Having ultrafast fibre connectivity has enabled our group to harmonise production by instantly sharing data, such as test reports and production templates, which are key to the production facility. This allows significant operational savings. None of this would have been possible on a traditional copper internet connection.”
Who is responsible for business innovation?
A study of top company executives conducted by PwC in 2017 found that innovation was at the top of their list of priorities. To achieve this, it's vital that business leaders foster an
environment where innovation is a natural part of company culture. However,
although top-down leadership is vital, there are key roles and departments whose
collaboration and expert knowledge are necessary to effect the changes. For
example, the IT Department, Chief Data Officer and Chief Transformation Officer.
IT Department: A recent study https://www.raconteur.net/business-innovation/it-thedriving-force-behind-business-innovation suggested that over a quarter of
respondents saw IT as the main driver of innovation. The argument was that as
technology is at the core of business, those with the ability to master it have the
power to initiate change. Moreover, the IT Department typically has a close working
relationship with every part of a business, which allows them to drive innovation and
improve collaboration across the organisation.
Chief Data Officer (CDO): The CDO is an emerging role. CDOs are responsible for highlighting where opportunities and threats lie. The CDO's job is to look for efficiencies, simplify needs, demonstrate cost-benefits, and encourage businesses to be open and transparent [Cowan 2018].
Chief Transformation Officer (CTO): The role of CTO has emerged over the last
ten years.
The term CTO means different things to different people. Some CTOs
regard themselves as “visionaries”, while others see themselves as project
managers, appointed to overhaul a business’s processes, often through harnessing
technological change. Either way, a successful CTO has both the ability and the responsibility to drive business innovation.
Regardless of job title a progressive and successful company needs people who can
identify what products and services their customers need, now and in the future, and
can harness technology to drive successful innovation to deliver those products and
services to their current and future customer base.
REFLECTIVE EXERCISE 1
Big data and analytics are a hugely important area for business. Consider why big
data is important and what the future of this growing trend might be.

REFLECTIVE EXERCISE 2
Read the article:
Zhaohao Sun, Lizhe Sun & Kenneth Strang (2018) Big Data Analytics Services for
Enhancing Business Intelligence, Journal of Computer Information Systems, 58:2,
162-169, DOI: 10.1080/08874417.2016.1220239
https://eu.alma.exlibrisgroup.com/view/action/uresolver.do;jsessionid=1B5A987F1A0
5C35EFB33866C8F109AB0.app05.eu00.prod.alma.dc03.hosted.exlibrisgroup.com:1
801?operation=resolveService&package_service_id=3745289420002111&institution
Id=2111&customerId=2110
This article finds that big data analytics services can enhance BI – consider the
findings from this article and the implications for business.
FURTHER READING LIST
Elisabetta Raguseo & Claudio Vitari (2018) Investments in big data analytics and firm
performance: an empirical investigation of direct and mediating effects, International
Journal of Production Research, 56:15, 5206-5221, DOI:
10.1080/00207543.2018.1427900
https://eu.alma.exlibrisgroup.com/view/action/uresolver.do?operation=resolveService
&package_service_id=3745289390002111&institutionId=2111&customerId=2110

END OF UNIT SUMMARY
To summarise, this unit has examined sources of big data and developed your
understanding of the 4 Vs – volume, velocity, variety and veracity. You have
examined data analytics and the analytics lifecycle. You have explored the data
mining process and examined the ever changing business environment within which
enterprises operate.