Choosing a Data Lake Storage System


Data Lake Storage System: Which One Should You Use?

When you create a Data Lake, one of the most overlooked questions is, “What storage technology should back the lake?” Most companies simply go with whatever tech stack they are familiar with, or whatever a vendor is selling them. In reality, the Data Lake storage system should be chosen by asking the same questions you would ask when building out any other piece of the system:

1. Does the system cover all requirements and SLAs that are currently known?
2. Can the system be easily expanded if more functionality (or space) is needed?
3. Is the system in line with budgetary and engineering talent constraints?

Once these questions have been reviewed and answered, the selection of a storage technology can begin.

Five storage systems are widely used for Data Lakes. Each has both pros and cons as the basis for a Lake.


Data Storage System Pros and Cons

| Type of System | Pro | Con |
|---|---|---|
| Hadoop Based System | Easily expandable and cheaper storage | Slower data retrieval times |
| Non-Hadoop Based Storage + Hadoop / non-Hadoop Compute, e.g. S3 + Hive / Spark | Decouples storage and compute, optimized for cloud platforms | More difficult to implement on-prem |
| Massively Parallel Processing System (MPP), e.g. HP Vertica or IBM Netezza | Fast record retrieval and ease of setup | High cost |
| NoSQL System (Cassandra, HBase) | Easily expandable and fast | Tech community less familiar with NoSQL systems |
| SQL Database (SQL Server, Oracle, MySQL) | Well defined technology | Cannot handle large amounts of data without high cost |
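
One way to put the comparison above to work is to encode it as data and filter out any option whose drawback is a deal breaker for your project. The sketch below is only an illustration: the `StorageOption` class, the `shortlist` function, and the drawback tags (such as `high_cost`) are hypothetical names introduced here as a rough reading of the "Con" column, not part of any product or of the DesignMind pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageOption:
    name: str
    pro: str
    con: str
    # Hypothetical tags summarizing the "Con" column; adjust to your own reading.
    drawbacks: frozenset


OPTIONS = [
    StorageOption("Hadoop Based System",
                  "Easily expandable and cheaper storage",
                  "Slower data retrieval times",
                  frozenset({"slow_retrieval"})),
    StorageOption("Non-Hadoop Based Storage + Hadoop / non-Hadoop Compute",
                  "Decouples storage and compute, optimized for cloud platforms",
                  "More difficult to implement on-prem",
                  frozenset({"hard_on_prem"})),
    StorageOption("Massively Parallel Processing System (MPP)",
                  "Fast record retrieval and ease of setup",
                  "High cost",
                  frozenset({"high_cost"})),
    StorageOption("NoSQL System",
                  "Easily expandable and fast",
                  "Tech community less familiar with NoSQL systems",
                  frozenset({"unfamiliar_tech"})),
    StorageOption("SQL Database",
                  "Well defined technology",
                  "Cannot handle large amounts of data without high cost",
                  frozenset({"high_cost_at_scale"})),
]


def shortlist(options, deal_breakers):
    """Return the names of options that carry none of the given deal breakers."""
    blocked = set(deal_breakers)
    return [o.name for o in options if not (o.drawbacks & blocked)]


if __name__ == "__main__":
    # Example: a team with a tight budget rules out the costly options.
    for name in shortlist(OPTIONS, {"high_cost", "high_cost_at_scale"}):
        print(name)
```

A filter like this only narrows the field. The three questions above (requirements and SLAs, room to expand, budget and talent) still have to be answered for each option that remains.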

DesignMind Data Lake Storage System White Paper

At DesignMind, we have developed a proprietary pattern that not only ingests large amounts of data, but:

  1. Makes data available to users at all levels of the system
  2. Allows data to be accessed in multiple formats
  3. Allows for simplified schema evolution management

Read more in our white paper, “Data Lake Storage Systems That Work”. Questions? Contact us and we’ll get back to you promptly.