Big-Data NoSQL Databases: A Comparison and Analysis of “Big-Table”, “DynamoDB”, and “Cassandra”

Sultana Kalid, Ali Syed, Azeem Mohammad, Malka N. Halgamuge
School of Computing and Mathematics, Charles Sturt University, Melbourne, Victoria 3000, Australia

Abstract- The growth of technology in the corporate world has led to data storage and confidentiality issues. The problem arises from managing the trillions of data items generated every second in corporations, commonly known as “Big Data”. Larger companies need to store and manage Big Data, yet many lack suitable storage systems, and no fully adequate system has been developed so far. The aim of this paper is to work toward a solution to this growing problem by analyzing gaps in the literature and evaluating possible solutions. This study analyzed content from top-reviewed scientific publications, comparing and contrasting data from the articles to highlight gaps. The reviewed literature addresses this problem by contrasting the Big Data management approaches of three NoSQL databases: BigTable, DynamoDB, and Cassandra. The findings from these publications are summarized, and the main features of all three databases and their applications are presented. System performance is analyzed in terms of consistency, availability, and partition tolerance. The study concluded that Google’s BigTable and Amazon’s DynamoDB are each critical and efficient on their own, and that a combination of the two led to the development of Cassandra. Cassandra is now the primary focus of numerous companies developing different applications. Furthermore, all three systems are NoSQL storage systems; BigTable is based on a single-master-node approach, whereas Dynamo and Cassandra follow a peer-to-peer design.
BigTable, however, with some additional features from DynamoDB, has helped the development of Cassandra, which is the basis of various modern applications, both open-source and social.

Keywords- Big Data, BigTable, DynamoDB, Cassandra, RDBMS, CAP

I. INTRODUCTION

Data is generated and processed more rapidly than ever; in recent times, more than 2.5 trillion items of data are generated every day. Data generation will continue to increase in volume at an exponential rate. This shows how useful Big Data technology is in today’s context of data generation, as it helps store and manage the vast amounts of data produced every second of each day. One key technological problem the world faces in data storage and management is that, over the last few decades, millions of data items have come to be generated at intervals of less than a nanosecond. Managing such large amounts of data is therefore a serious challenge. With the growth of population, there is a need for more advanced data collection and management technology. Although relational database management systems (RDBMS) have existed for some years for persistent data storage, they have yet to tackle scalability, consistency, system efficiency, data collection, and the integration of data extraction. To manage this phenomenal increase in data generation and to address the challenges mentioned above, an alternative solution must be brought into practice. NoSQL databases have been introduced for this purpose; nonetheless, they remain inefficient under high volumes of data production. Real-time data processing is in demand in today’s world, and it must be carried out in a way that does not adversely affect the system’s efficiency.
Real-time data processing requires: (i) processing large amounts of data while sustaining high system performance, (ii) high availability and tolerance to network partitioning, (iii) a simple storage access API, (iv) distributed systems that replicate data across globally distributed data centres, and (v) support for requirements that vary between high throughput and low latency.

In response to these data storage issues, and with the rise of technological enhancement, this study analyzed content from 12 top-reviewed scientific publications, comparing and contrasting them to highlight gaps in the literature. The purpose of this research is to provide a detailed analysis of Google’s BigTable, Amazon’s Dynamo, and Apache’s Cassandra by comparing and contrasting top-reviewed articles. Google and Amazon use two different NoSQL approaches (BigTable and Dynamo), whereas Cassandra combines the main features of the BigTable and Dynamo databases. This study gives an understanding of how companies currently manage Big Data in place of “traditional data management” systems; the paper then discusses the Facebook inbox search feature, which motivated the introduction of the Cassandra NoSQL database.

II. NOSQL DATABASES OVERVIEW

A. Google’s BigTable

BigTable is a column-oriented storage system for high-performance applications. As a distributed storage system, BigTable handles “structured” data. System requirements such as scalability, performance, and availability, which BigTable achieves across Google products and projects, need to be considered. Google developed this NoSQL-based data management system, titled “BigTable”, for internal use. It provides the Big Data scalability that a number of Google applications depend on.
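To make the idea of column-oriented storage for structured data more concrete, the following Python sketch models a BigTable-style table as a sparse map from (row key, column key, timestamp) to a value, with rows kept in sorted order. It is an in-memory illustration only, not Google's implementation or API: the class name `SparseSortedTable`, its methods `put`, `get`, and `scan`, and the sample row and column keys are all hypothetical names chosen for this example.

```python
import bisect
import time


class SparseSortedTable:
    """Toy in-memory model of a BigTable-style sorted map (illustrative only)."""

    def __init__(self):
        self._row_keys = []  # row keys, kept in sorted order for range scans
        self._rows = {}      # row key -> {column key -> {timestamp -> value}}

    def put(self, row_key, column_key, value, timestamp=None):
        """Store one versioned cell; only cells that are written take up space."""
        if timestamp is None:
            timestamp = time.time_ns()
        if row_key not in self._rows:
            bisect.insort(self._row_keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].setdefault(column_key, {})[timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self._rows.get(row_key, {}).get(column_key, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def scan(self, start_row, end_row):
        """Yield (row_key, columns) for start_row <= row_key < end_row, in order."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, end_row)
        for key in self._row_keys[lo:hi]:
            yield key, self._rows[key]


# Hypothetical sample data: two versions of one cell, plus a second sparse row.
table = SparseSortedTable()
table.put("com.example/index", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example/index", "contents:html", "<html>v2</html>", timestamp=2)
table.put("org.sample/home", "anchor:ref", "Sample", timestamp=1)
```

In this sketch, reading a cell without a timestamp returns the most recent version, and a row that never wrote a given column simply has no entry for it, which is what "sparse" refers to. Keeping row keys sorted is what makes range scans over adjacent rows cheap.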
Because it uses a “multi-dimensional sorted map”, BigTable is very flexible in many respects, such as its data model. The Google File System (GFS), which is highly scalable, is the storage platform for BigTable. Files are divided into chunks that are replicated across multiple machines in the system; Chubby, by contrast, is the distributed lock service that BigTable relies on for coordination, not a unit of storage. The purpose of this replication is to increase the availability of records and the reliability of the system during real-time processing. The key features are: (i) distributed storage system for “structured data”, (ii) hierarchical namespace, (iii) sparse, (iv) high scalability, (v) strong consistency, (vi) persistent, (vii) multi-dimensional sorted map, and (viii) a map indexed by a unique “row key”, “column key”, and “timestamp”.

B. Amazon’s Dynamo Database

Amazon uses DynamoDB, a fully managed NoSQL database that provides fast, predictable performance with seamless scalability. It provides a fundamental storage platform within Amazon’s systems. Various services on Amazon’s platform have high reliability requirements and need only “primary-key” access to data storage; for such services a relational database is not ideal, because it limits scalability and chooses consistency over availability. In contrast with an RDBMS, Dynamo is designed to be eventually consistent but highly scalable. The key features are: (i) peer-to-peer system (structured and unstructured), (ii) distributed file system, (iii) scalable and decentralized, (iv) support for a key-value API only (no hierarchical namespaces or relational schema), (v) efficient latencies (no multi-hop routing), (vi) highly available key-value storage system, (vii) availability, consistency, and performance, (viii) data partitioned using consistent hashing, (ix) consistency facilitated by object versioning, (x) trusted network with no authentication, (xi) incremental scalability, (xii) symmetry, and (xiii) heterogeneity and load distribution.
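Feature (viii), partitioning by consistent hashing, is what allows nodes to be added incrementally (feature xi) without reshuffling most of the data. The Python sketch below is a minimal illustration under stated assumptions, not Amazon's implementation: the class `ConsistentHashRing`, the helper `ring_position`, the node names, the use of MD5, and the choice of eight virtual nodes per physical node are all hypothetical choices made for this example.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a ring position using MD5 (chosen here for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), virtual_nodes=8):
        self._virtual_nodes = virtual_nodes
        self._positions = []  # sorted ring positions of all virtual nodes
        self._owner = {}      # ring position -> physical node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place several virtual nodes for `node` on the ring."""
        for i in range(self._virtual_nodes):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove_node(self, node):
        """Drop every virtual node that belongs to `node`."""
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def preference_list(self, key, replicas=3):
        """Walk clockwise from the key's position and collect distinct nodes."""
        if not self._positions:
            return []
        start = bisect.bisect_right(self._positions, ring_position(key))
        found = []
        for step in range(len(self._positions)):
            pos = self._positions[(start + step) % len(self._positions)]
            node = self._owner[pos]
            if node not in found:
                found.append(node)
            if len(found) == replicas:
                break
        return found


# Hypothetical cluster: record key owners, then add a fourth node.
ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(200)]
before = {k: ring.preference_list(k, replicas=1)[0] for k in keys}
ring.add_node("node-d")
after = {k: ring.preference_list(k, replicas=1)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch, every key in `moved` is now owned by the newly added node, while all other keys keep their previous owner. That property is the reason consistent hashing suits a decentralized, peer-to-peer store: no central master has to reassign the whole key space when membership changes.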
C. Apache Cassandra Database

Apache Cassandra is a highly scalable NoSQL database. Cassandra’s approach allows corporations to analyze data effectively, owing to its ability to handle the data volumes of Big Data. Open-source Cassandra was established in 2009, for the big data management ability of big compan
, this approach is in the form of Cassandra. Although Cassandra is a combination of BigTable and DynamoDB, its system orientation and most of its other features derive from the properties of Google’s BigTable. Cassandra offers customers a simple model that allows dynamic control over data format and layout. It was first introduced for Facebook inbox search; now numerous applications such as Instagram, Twitter, Rackspace, and gaming platforms also use it. In addition, investigating various security features could be an interesting avenue to explore in the future to protect Big Data.

V. CONCLUSION

Our study reviewed databases based on top-reviewed scientific publications from 2010 to 2016 and found that the NoSQL databases BigTable, DynamoDB, and Cassandra are applicable to managing big data. The comparison of these systems also demonstrates that their distinct features are potentially applicable to open-source systems. The observations from the reported studies clearly show that Cassandra is mainly based on BigTable’s system architecture and data model. The key features of BigTable, with some modifications, provided the basis for the development of Cassandra, which is well suited to managing big data. Cassandra was first used for Facebook inbox search; however, it is now used by a number of social and private applications. Unlike other studies, which do not consider all three systems together, this comparison shows that all three are non-relational storage systems. BigTable is based on a single-master-node approach, whereas Dynamo and Cassandra follow a peer-to-peer design. BigTable, however, with some additional features from DynamoDB, has helped in the development of Cassandra, which is the basis of various modern applications, both open-source and social.

REFERENCES

G.
Weintraub, “Dynamo and BigTable – Review and comparison”, IEEE 28th Convention of Electrical & Electronics Engineers in Israel (IEEEI), 5(4), 133-140, 2014.
L. Wang and J. Zhan, “BigDataBench: A big data benchmark suite from internet services”, IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014.
P. C. Neves, “Big Data in the Cloud: A Survey”, OJBD, vol 1, no. 2, pp. 1-18, 2015.
A. K. Kala, “BigTable, Dynamo & Cassandra – A Review”, International Journal of Electronics and Computer Science Engineering, 2(1), pp. 133-140, 2010.
D. Pearson, “Amazon DynamoDB – Fast, Predictable, Highly-Scalable NoSQL Database”, 2012.
G. Weintraub, “Dynamo and BigTable – Review and comparison”, IEEE 28th Convention of Electrical & Electronics Engineers in Israel (IEEEI), 1(2), 1-5, 2014.
Datastax, Cassandra Summit, 2014.
A. Kashlev, “KDM: An Automated Data Modeling Tool for Apache Cassandra”, 2015.
F. Chang, “Bigtable”, ACM Trans. Comput. Syst., 26(2), pp. 1-26, 2008.
J. Barr, Amazon SQS | AWS Blog, 2016.
A. Lakshman and P. Malik, “Cassandra”, ACM SIGOPS Operating Systems Review, 44(2), 35, 2010.
M. Leppich, “Building a Mars Rover Application with DynamoDB”, 2015. Retrieved from http://www.infoq.com/articles/mars-rover-application-DynamoDB
M. P. Chary and S. Kumar, “A Survey on Implementation of Column-Oriented NoSQL Data Stores (Bigtable and Cassandra)”, 2015.
P. Yeh, “Distributed Database and BigTable”, Lecture Slides, Google, 2009.
A. B. Moniruzzaman and S. A. Hossain, “NoSQL Database: New Era of Databases for Big data Analytics-Classification, Characteristics, and Comparison”, IJDTA, 6(4), pp. 1-14, 2013.
F. Heinze, “DynamoDB vs. BigTable comparison | vs Chart.com”.
S. Ramanathan and S. Goel, “Comparison of Cloud database: Amazon’s SimpleDB and Google’s BigTable”, International Conference on Recent Trends in Information Systems, 1(2), pp. 165-168, 2011.
J.
Ellis, Jbellis presentations, 2014.
D. V. Pham, A. Syed, A. Mohammad and M. N. Halgamuge, “Threat Analysis of Portable Hack Tools from USB Storage Devices and Protection Solutions”, International Conference on Information and Emerging Technologies, pp. 1-5, Karachi, Pakistan, 14-16 June 2010.
D. V. Pham, A. Syed, and M. N. Halgamuge, “Universal serial bus based software attacks and protection solutions”, Digital Investigation 7 (3), pp. 172-184, 2011.
D. V. Pham, M. N. Halgamuge, A. Syed and P. Mendis, “Optimizing windows security features to block malware and hack tools on USB storage devices”, Progress in electromagnetics research symposium, pp. 350-355, 2010.
V. Vargas, A. Syed, A. Mohammad, and M. N. Halgamuge, “Pentaho and Jaspersoft: A Comparative Study of Business Intelligence Open Source Tools Processing Big Data to Evaluate Performances”, International Journal of Advanced Computer Science and Applications (IJACSA), vol 7, no 10, pp. 20-29, November 2016.