Cost Efficient Batch Processing in Amazon Cloud

Tamrakar, Kabin

Master thesis

Year

2016

Abstract

Cloud computing provides computing and storage resources at an economical price, with flexibility, mobility, and availability. These resources range from small to very high capacity compute instances. Cloud providers also offer spare compute instances at significantly lower prices. Amazon runs a popular bidding scheme for its spare capacity, called spot instances, which are requested with a bid price. Spot instances can be terminated at any time if the spot market price exceeds the bid price. Amazon also rents on-demand instances, which are persistent and have a fixed price. The spot instance price can be up to 90% lower than the price of an on-demand instance. In this project, spot instances are used in the task instance group of an Amazon EMR cluster to process batch jobs with a deadline. Amazon EMR makes it convenient to process big data with the aid of a managed Hadoop framework. If spot instances are terminated, the intermediate results processed on the cluster's task nodes are lost, which can delay processing. Cost efficiency can be realized by exploiting the non-real-time nature of batch computing for big data. Two algorithms are devised to achieve cost-efficient processing in Hadoop MapReduce. Both algorithms process data in divisions, so that an abrupt termination of spot instances affects only the division being processed. Based on the progress measured at intervals and at checkpoints, the task group's capacity is resized to complete processing in time. Progress is measured as the number of divisions of work completed. The first algorithm begins with an estimated number of spot instances. To complete processing of all data in time, on-demand instances are employed after a threshold time. The second algorithm starts with more spot instances than are required to complete the work within the deadline. It has a higher probability of using only spot instances because of the faster progress. On-demand instances are deployed only in case of slow progress. 
The experiments show that both algorithms reduce the cost of processing. The second algorithm reduces the cost further in most cases.
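
The resizing idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the `overprovision` factor, and the assumption that every node completes divisions at a constant rate (`divisions_per_node_hour`) are all hypothetical and chosen only for illustration. A factor of 1.0 loosely corresponds to the first algorithm's estimated starting quantity, and a factor above 1.0 to the second algorithm's larger starting group.

```python
import math


def nodes_needed(remaining_divisions, hours_left, divisions_per_node_hour):
    """Smallest node count that can finish the remaining divisions in time.

    Assumes (hypothetically) that each node completes divisions at a
    constant rate, which real Hadoop jobs will only approximate.
    """
    if remaining_divisions <= 0:
        return 0
    if hours_left <= 0:
        raise ValueError("deadline has already passed")
    return math.ceil(remaining_divisions / (divisions_per_node_hour * hours_left))


def initial_spot_count(total_divisions, deadline_hours,
                       divisions_per_node_hour, overprovision=1.0):
    """Starting size of the spot task group.

    overprovision=1.0 starts with the estimated quantity; a value above
    1.0 starts with more spot instances than strictly required.
    """
    base = nodes_needed(total_divisions, deadline_hours, divisions_per_node_hour)
    return math.ceil(base * overprovision)


def on_demand_to_add(total_divisions, done_divisions, elapsed_hours,
                     deadline_hours, running_spot, divisions_per_node_hour):
    """At a checkpoint, how many on-demand nodes close the capacity gap."""
    remaining = total_divisions - done_divisions
    needed = nodes_needed(remaining, deadline_hours - elapsed_hours,
                          divisions_per_node_hour)
    # Only add capacity when the running spot instances cannot meet the deadline.
    return max(0, needed - running_spot)
```

For example, with 100 divisions, a 10-hour deadline, and an assumed rate of 2 divisions per node-hour, the sketch starts with 5 spot instances, or 8 with an over-provisioning factor of 1.5. If only 40 divisions are done after 4 hours and 3 spot instances are still running, it would add 2 on-demand instances.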