“Big data” is a very broad term for structured and unstructured data sets that cannot be opened or processed locally on a traditional computer or application because of memory restrictions. Big data sets need to he processed using distributed computing technologies such as Hadoop before being put through a data mining or predictive analytics process.


Predictive modeling is a process in which input data is put through statistical or data mining / machine learning techniques to find factors that are able to help make predictions or inferences about future events. Predictive modeling can be used to identify and quantify both risk and opportunity.


Data mining is the application of statistical methods to extract implicit, previously unknown, and potentially useful information from data. Advances in computing technology and the ability to create, store, and access increasingly larger volumes of data have created new opportunities for companies to implement predictive strategies. Such methods can be applied to a diverse set of real world problems within most industries. The primary objective of the research may be the prediction of future events or advancement of learning.



Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

Amazon Web Services

Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to quickly and cost-effectively process vast amounts of data. Amazon EMR securely and reliably handles your big data use cases, including log analysis, web indexing, data warehousing, machine learning, financial analysis, and scientific simulation.


Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991.


R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software. R is currently the most used programming language in data mining.