Data Lake Storage System: Which One Should You Use?
When you create a Data Lake, one of the most overlooked questions is, “What storage technology should back the lake?” Most companies simply go with the tech stack they already know or the one a vendor is selling them. In reality, the Data Lake storage system should be chosen using the same questions you ask when you build out any other piece of the system:
1. Does the system cover all requirements and SLAs that are currently known?
2. Can the system be easily expanded if more functionality (or space) is needed?
3. Is the system in line with budgetary and engineering talent constraints?
Once these questions have been reviewed and answered, the selection of a storage technology can begin.
Five storage systems are widely used for Data Lakes. Each has both pros and cons as the basis for a lake.
Data Storage System Pros and Cons
|Type of System|Pro|Con|
|Hadoop-Based System|Easily expandable and cheaper storage|Slower data retrieval times|
|Non-Hadoop-Based Storage + Hadoop / Non-Hadoop Compute, e.g. S3 + Hive / Spark|Decouples storage and compute, optimized for cloud platforms|More difficult to implement on-prem|
|Massively Parallel Processing (MPP) System, e.g. HP Vertica or IBM Netezza|Fast record retrieval and ease of setup|High cost|
|NoSQL System (Cassandra, HBase)|Easily expandable and fast|Tech community less familiar with NoSQL systems|
|SQL Database (SQL Server, Oracle, MySQL)|Well-defined technology|Cannot handle large amounts of data without high cost|
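One way to put the table to work is to encode it as data and filter it against the strengths your requirements call for. The Python sketch below is only an illustration: the `StorageOption` class, the `shortlist` helper, and the tag names (`expandable`, `fast`, and so on) are hypothetical labels derived from the Pro column, not part of any product or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageOption:
    name: str
    pro: str
    con: str
    tags: frozenset  # hypothetical labels summarizing the "Pro" column


# The five rows of the table above, with assumed tags for each listed strength.
OPTIONS = [
    StorageOption("Hadoop-based system",
                  "Easily expandable and cheaper storage",
                  "Slower data retrieval times",
                  frozenset({"expandable", "low_cost"})),
    StorageOption("Non-Hadoop storage + Hadoop / non-Hadoop compute",
                  "Decouples storage and compute, optimized for cloud platforms",
                  "More difficult to implement on-prem",
                  frozenset({"decoupled", "cloud"})),
    StorageOption("MPP system",
                  "Fast record retrieval and ease of setup",
                  "High cost",
                  frozenset({"fast", "easy_setup"})),
    StorageOption("NoSQL system",
                  "Easily expandable and fast",
                  "Tech community less familiar with NoSQL systems",
                  frozenset({"expandable", "fast"})),
    StorageOption("SQL database",
                  "Well-defined technology",
                  "Cannot handle large amounts of data without high cost",
                  frozenset({"well_defined"})),
]


def shortlist(required):
    """Return the names of options whose tags include every required tag."""
    required = frozenset(required)
    return [option.name for option in OPTIONS if required <= option.tags]


if __name__ == "__main__":
    # Print each candidate that is both expandable and fast, with its trade-off.
    for name in shortlist({"expandable", "fast"}):
        con = next(o.con for o in OPTIONS if o.name == name)
        print(f"{name}: watch out for -> {con}")
```

A shortlist like this only narrows the field. The cons column, budget, and available engineering talent still have to be weighed against the three questions above before a final choice is made.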
Questions? Contact Mark Kidwell of our Big Data consulting team. Learn how to get started now by requesting our white paper.