The Hadoop MapReduce Framework

[Developing an Analysis Process Using the Hadoop MapReduce Framework]

Author: Adwoa Osei-Yeboah

Introduction:

This study describes the development of an analysis process using the Hadoop MapReduce framework for GDey Corporation, a hypothetical online and local retail company. The analysis process will help GDey acquire insights from the big data the company has gathered over time, and a description of the specific use case is provided. The design of a big data analytics process using the MapReduce framework is explored and justified with references. The Map function and the Reduce function, together with their order of operation in GDey's use case scenario, are also described. Finally, the specific insights the analysis process will produce are outlined, along with the implications of the different practices.

Hadoop MapReduce Framework:

Hadoop MapReduce is a framework for the distributed, scalable processing of big data (Rahman, 2018). Apache Hadoop can distribute and store large amounts of data across numerous servers that run simultaneously, and MapReduce facilitates simultaneous processing by dividing petabytes of data into smaller chunks and processing them in parallel on Hadoop commodity servers (Hashem et al., 2016). The MapReduce API has three main classes: the Mapper class, the Reducer class, and the Job class. The Mapper class maps the input key/value pairs to a set of intermediate key/value pairs. The Reducer class reduces the set of intermediate values that share a key to a smaller set of values. The Job class configures the job, submits it, controls its execution, and queries its state.
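As an illustration only, the sketch below shows where each of these three classes sits in a job written against the standard org.apache.hadoop.mapreduce API. The class name GDeyProductCount, the job name, and the command-line input/output paths are hypothetical, and the map() and reduce() bodies are filled in later in this paper.

import java.io.IOException;   // needed once the map()/reduce() bodies below are filled in
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GDeyProductCount {

  // Mapper class: maps input key/value pairs to intermediate key/value pairs
  // (body sketched in the design section below).
  public static class ProductMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> { }

  // Reducer class: reduces the intermediate values that share a key to a smaller set
  // (body sketched in the design section below).
  public static class ProductReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> { }

  // Job class: configures the job, submits it, controls execution, and queries state.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "gdey product count");
    job.setJarByClass(GDeyProductCount.class);
    job.setMapperClass(ProductMapper.class);
    job.setReducerClass(ProductReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. sales logs in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // results directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // block until done, report state
  }
}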

Use Case Scenario:

A study on ‘Big data initiatives in retail environments’ emphasized that, considering the amount of data created by customers’ transactions online and in retail stores, it is no surprise that most retailers are interested in leveraging big data to provide their customers with superior services (Aloysius et al., 2018). As the number of GDey consumers has increased over time, the data associated with them has grown as well, and GDey’s traditional database system can no longer scale up to match the growth in its customer base.

Since this retailer needs to analyze huge volumes of log data for marketing campaigns, sales management, and inventory management, the old data storage and analytics system could be replaced with Hadoop due to its fast analytics support and low storage cost. GDey could also employ the Amazon Elastic MapReduce service to create a Hadoop cluster. This cluster could be used to store and analyze data to estimate consumer behavior and to support search recommendations, product placement, inventory management, targeted marketing, and product promotion.

Designing/Developing a Big Data Analytics Process Using MapReduce:

Research on ‘Handling Big Data Efficiently by Using MapReduce Technique’ highlights how valuable insights lie hidden in large databases and elaborates on the considerable excitement around the MapReduce paradigm for large-scale data analysis. The researchers evaluated its advantages and disadvantages, as well as how MapReduce can be integrated with other technologies (Maitrey & Jha, 2015). Several steps need to be followed in designing the big data analytics process using MapReduce.

The MapReduce library first splits the input data into several pieces and then starts many copies of the user program on a cluster of machines. One copy, the master, controls all the functions, including assigning work to the rest of the nodes, which are called workers. When a worker is idle, the master picks it and assigns it either a Map task or a Reduce task. A worker assigned a Map task reads the content of the corresponding input split, parses key/value pairs out of the input data, and passes each pair to the user-defined Map function.
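As a hedged sketch of this step, suppose (purely for illustration) that each line of GDey's sales log is a comma-separated record whose third field is a product ID. The Mapper below, which fills in the ProductMapper stub from the earlier skeleton, receives the byte offset of a line as the input key and the line text as the input value, parses the line, and hands an intermediate (productId, 1) pair to the framework.

  // Hypothetical sketch: the log layout (timestamp,customerId,productId,...) is assumed.
  public static class ProductMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text productId = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length > 2) {               // skip malformed log lines
        productId.set(fields[2].trim());     // assumed position of the product ID
        context.write(productId, ONE);       // emit an intermediate key/value pair
      }
    }
  }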

Next, the Map function produces the intermediate key/value pairs and buffers them in memory (Ramírez-Gallego et al., 2018). These buffered pairs are periodically written to the local disk, where the partition function divides them into R regions. The local disk stores the Map output only temporarily and removes it once the Reduce phase has completed. This temporary storage supports fault tolerance and recovery: if a Reduce task fails, the stored Map output can be re-read without restarting the Map phase. The master forwards the locations of these regions to the Reduce workers, which invoke remote procedure calls to read the buffered data from the local disks of the Map workers.
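The partition step can be sketched with a custom Partitioner. By default, Hadoop routes each intermediate key to a region by hashing it modulo the number of Reduce tasks (the R regions above); the hypothetical class below simply makes that routing explicit and would be registered in the driver.

  // Sketch of a partition function: routes each intermediate (key, value) pair
  // to one of R regions, where R is the configured number of Reduce tasks.
  // Requires: import org.apache.hadoop.mapreduce.Partitioner;
  public static class ProductPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      // Same scheme as Hadoop's default HashPartitioner: hash of the key modulo R.
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }

  // In the driver (hypothetical value for R):
  //   job.setPartitionerClass(ProductPartitioner.class);
  //   job.setNumReduceTasks(4);   // R = 4 regions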

Each Reduce worker reads the intermediate data for its partition and sorts it by the intermediate keys so that all occurrences of the same key are grouped together (Ramírez-Gallego et al., 2018). This sorting is needed because many different keys typically map to the same Reduce task. When the intermediate data volume is too large to fit in memory, an external sort is used. The Reduce worker then iterates over the sorted intermediate data and passes each key, together with its set of values, to the user's Reduce function. The output of the Reduce function is appended to the final output file. When a system failure occurs, completed Reduce tasks do not need to be re-executed because their output is stored in a global file system. The master wakes up the user program when all Map and Reduce tasks have completed.
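To make the Reduce side concrete, the sketch below fills in the ProductReducer stub from the earlier skeleton. For each sorted intermediate key it iterates over the grouped values and writes one aggregated record to the final output; the aggregation shown (a count of purchases per product) is an assumed example.

  // Sketch: sums the grouped intermediate values for one product ID.
  public static class ProductReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text productId, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {   // values arrive grouped by key, via an iterator
        sum += count.get();
      }
      total.set(sum);
      context.write(productId, total);     // appended to the job's final output file
    }
  }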

Description of Model Functions: Map Function and Reduce Function

MapReduce operates through two functions: the Map function and the Reduce function. The Map function takes an input key/value pair and maps it to zero or more intermediate key/value pairs (Maitrey & Jha, 2015). In the next step, the MapReduce library groups together all intermediate values associated with the same intermediate key before passing them to the Reduce function. The input keys and values may belong to a different domain than the intermediate keys and values.

The Reduce function takes each unique key and groups its associated values into a single key/value set. It accepts an intermediate key from Map together with the set of values for that key (Maitrey & Jha, 2015), and it merges those values to form a smaller set of values. Each Reduce invocation typically produces zero or one output value. The intermediate values are supplied to the user's Reduce function through an iterator, which allows it to handle lists of values that are too large to fit in memory. Here the intermediate keys and values have the same domain as the output keys and values, whereas the input keys and values generally belong to a different domain from the final output keys and values.
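To illustrate this grouping outside of Hadoop, here is a tiny plain-Java simulation of the same data flow on a few made-up GDey purchase records; the product IDs are invented purely for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceTrace {
  public static void main(String[] args) {
    // "Map" phase: each input record becomes an intermediate (productId, 1) pair.
    String[] purchases = {"P17", "P03", "P17", "P42", "P17"};   // invented product IDs
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (String productId : purchases) {
      grouped.computeIfAbsent(productId, k -> new ArrayList<>()).add(1);
    }
    // "Reduce" phase: each unique key and its grouped values become one output pair.
    grouped.forEach((productId, ones) ->
        System.out.println(productId + "\t" + ones.size()));
    // Prints P03 1, P17 3, P42 1 -- the same result the Hadoop job would write.
  }
}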

Description of Insights Produced:

With the help of MapReduce, GDey can evaluate consumer buying patterns based on customers' interests and historical purchasing behavior. It can also support product recommendation methods for GDey consumers by analyzing purchase history and user interaction logs. In addition, GDey can use the MapReduce programming model to identify popular products based on customer preferences or purchasing behavior, drawing on GDey's inventory, website records, purchase histories, and user interaction logs.
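As one hedged example of how such a recommendation insight could be expressed in the same model, assume (hypothetically) that each input line lists the product IDs of a single GDey order, separated by commas. The Mapper below emits every co-purchased pair of products, so that a summing Reducer like the one sketched earlier would count how often two products are bought together.

  // Hypothetical sketch: emits (productA|productB, 1) for every pair of products
  // appearing in the same order line; a summing Reducer then counts each pair.
  public static class CoPurchaseMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable offset, Text orderLine, Context context)
        throws IOException, InterruptedException {
      String[] products = orderLine.toString().split(",");
      for (int i = 0; i < products.length; i++) {
        for (int j = i + 1; j < products.length; j++) {
          pair.set(products[i].trim() + "|" + products[j].trim());
          context.write(pair, ONE);
        }
      }
    }
  }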

Implications of Using Hadoop MapReduce in Big Data Analysis:

MapReduce has several advantages. It abstracts away the complexity of distributed computing and runs on commodity hardware, which makes processing large datasets affordable. It is designed to scale horizontally, allowing it to handle substantial datasets across thousands of nodes; this scalability is critical for big data processing and analytics. Its built-in fault tolerance features, automatic data replication and task re-execution, ensure that a job continues without manual intervention even when a node fails during processing. Because MapReduce is flexible, both structured and unstructured data can be processed. MapReduce also uses data locality to its advantage, processing data on the node that stores it to minimize latency and improve performance.

Despite these advantages, Hadoop MapReduce has high latency and is not well suited for real-time data processing; it is designed predominantly for batch processing. Given the complexity of the modern data GDey is gathering, writing efficient Map and Reduce functions for its data processing tasks can become challenging. If GDey needs machine learning or graph processing tasks to be carried out on its consumer dataset, that can also be a challenge for MapReduce, since it does not natively support iterative algorithms.

Security Issues of MapReduce:

According to a research paper on the vulnerabilities, security issues, and attacks of big data, Hadoop ecosystem components such as Sentry, Flink, and Storm are all exposed to attacks triggered by different vulnerabilities. These vulnerabilities can be associated with the software, with the configuration interface, or with network policies, and they can be categorized along the architecture dimension, the data life-cycle dimension, and the data value dimension (Bhathal & Singh, 2019).

Security was a late addition to Hadoop, and as a result Hadoop lacks a consistent security model; moreover, there is currently no systematic evaluation of its security modules (Bhathal & Singh, 2019). The huge volume, rapid growth, and diversity of data are unstoppable, and existing security solutions, which were not designed or built with big data in mind, are not adequate. Each application in the environment requires hardening to add security capabilities, and those capabilities must scale with the data; bolt-on security does not scale well.

While big data modeling is of great importance, the privacy and security of the individuals whose data are being modeled must also be considered. The GDPR expects companies to comply with the main standards of conduct for direct marketing, applying transparency and lawfulness. The stakeholders involved in big data modeling should also consider all of the security compliance requirements and guidelines that need to be implemented on-site or off-site (Bhathal & Singh, 2019).

Conclusion:

As more complex and extremely large datasets emerge within the retail industry, big data analytics using the Hadoop MapReduce framework has been instrumental in changing the retail landscape. GDey is seeking to excel by getting a firm grasp of its consumers' inclinations through data. The framework will allow retailers like GDey to make data-driven decisions and gain a competitive edge while improving the security and privacy of their consumers during the data processing and analysis phases.

References:

Aloysius, J. A., Hoehle, H., Goodarzi, S., & Venkatesh, V. (2018). Big data initiatives in retail environments: Linking service process perceptions to shopping outcomes. Annals of Operations Research, 270(1-2), 25-51. 10.1007/s10479-016-2276-3

Bhathal, G. S., & Singh, A. (2019). Big Data: Hadoop framework vulnerabilities, security issues and attacks. Array, 1-2, 100002.

Hashem, I. A. T., Anuar, N. B., Gani, A., Yaqoob, I., Xia, F., & Khan, S. U. (2016). MapReduce: Review and open challenges. Scientometrics, 109(1), 389-422. 10.1007/s11192-016-1945-y

Maitrey, S., & Jha, C. K. (2015). Handling Big Data Efficiently by Using Map Reduce Technique. Paper presented at the 2015 IEEE International Conference on Computational Intelligence & Communication Technology (CICT), 703-708. 10.1109/CICT.2015.140 https://ieeexplore.ieee.org/document/7078794

Rahman, N. (2018). Data warehousing and business intelligence with big data. IGI Global. 10.4018/IJBIR.2015070104

Ramírez-Gallego, S., Fernández, A., García, S., Chen, M., & Herrera, F. (2018). Big Data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce. Information Fusion, 42, 51-61. 10.1016/j.inffus.2017.10.001