Conventional Big Data Solutions are Inefficient


Conventional big data solutions are expensive to own and operate.

Most big data solutions use replication for resiliency:

Replication usually keeps three copies of the data, so costs in hardware, space, power, and cooling grow commensurately. The operational complexity and cost of administering three times as many servers can exceed the original purchase price.
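
As a back-of-the-envelope illustration, here is a minimal sketch of how raw capacity and server count scale with the replication factor. The dataset size and per-server capacity are assumed figures, not measurements:

```python
# Rough cost scaling under three-way replication (all figures hypothetical).

REPLICATION_FACTOR = 3        # three copies of every block
usable_data_tb = 1_000        # assumed dataset size
server_capacity_tb = 50       # assumed raw capacity per server

raw_capacity_tb = usable_data_tb * REPLICATION_FACTOR
servers_needed = raw_capacity_tb / server_capacity_tb

print(f"Usable data:  {usable_data_tb} TB")
print(f"Raw capacity: {raw_capacity_tb} TB")
print(f"Servers:      {servers_needed:.0f} "
      f"(hardware, space, power, and cooling scale with this count)")
```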

Relying on data replication for fault tolerance also decreases data ingestion throughput, since every ingested byte must be written once per replica, and it makes inefficient use of the available storage capacity. Replication has the advantage of not requiring data regeneration after a failure, but the surviving copies are usually remote from the intended compute platform.
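
The throughput effect can be sketched the same way; the bandwidth figure below is an assumption, not a measurement:

```python
# Effective ingest rate under three-way replication (figures hypothetical).

REPLICATION_FACTOR = 3
aggregate_write_bw_gbps = 30.0   # assumed total cluster write bandwidth

# Every ingested byte is written once per replica, so the cluster absorbs
# new data at only 1/REPLICATION_FACTOR of its aggregate write bandwidth.
effective_ingest_gbps = aggregate_write_bw_gbps / REPLICATION_FACTOR
print(f"Effective ingest rate: {effective_ingest_gbps:.1f} Gb/s "
      f"of {aggregate_write_bw_gbps:.1f} Gb/s raw")
```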

Most big data solutions use dedicated servers for persistent filesystem metadata:

Only these servers know the filesystem hierarchy and which nodes store every block (including replicas) of every file. The loss of these servers results in the loss of all data on the cluster. For this reason, metadata servers are usually paired with a backup server for high availability, which, though costly and complex, still leaves a weak point in cluster resiliency.

These metadata servers must hold the entire namespace in memory, which limits the number of entries the filesystem can track. Mechanisms have been proposed for dividing the cluster namespace among several namenodes, but each namenode then needs its own backup namenode for high availability, adding further cost and complexity.
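
A minimal sketch of why in-memory metadata caps the namespace: the heap size below is an assumed figure, and the per-entry cost is an assumption in the range commonly cited for HDFS-style namenodes.

```python
# Rough namespace ceiling for an in-memory metadata server (illustrative).

heap_gb = 128            # assumed metadata-server heap size
bytes_per_entry = 150    # assumed per-object cost, in the range commonly
                         # cited for HDFS-style namenodes

max_entries = (heap_gb * 1024**3) // bytes_per_entry
print(f"~{max_entries / 1e6:.0f} million entries fit in {heap_gb} GB of memory")
```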

Big data storage solutions that use encoding for resiliency have significant limitations:

Erasure coding policies are typically limited to just three Reed-Solomon options: 3-2, 6-3, and 10-4. These three options can be inappropriate depending upon the application's performance and resiliency requirements.
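
For reference, the storage overhead and fault tolerance of each fixed policy follow directly from its data/parity block counts; a minimal sketch:

```python
# Storage overhead and fault tolerance implied by the three fixed
# Reed-Solomon policies: RS(k data blocks, m parity blocks).

policies = {"3-2": (3, 2), "6-3": (6, 3), "10-4": (10, 4)}

for name, (k, m) in policies.items():
    overhead = (k + m) / k   # raw bytes stored per usable byte
    print(f"RS-{name}: {overhead:.2f}x storage overhead, "
          f"tolerates {m} lost blocks, needs a {k + m}-node stripe")
```

An application that needs, for example, lower overhead than 1.40x or a stripe narrower than five nodes has no policy available at all.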

Most encoding solutions are unable to encode data as it is being ingested, due to performance and architectural limitations. Data is not encoded until a full data set (block) has been received; until then, the incoming data is replicated for resiliency.

Most encoding solutions cannot truncate files. Depending upon the application, truncation can be critical. For example, a failed database transaction may have to be rolled back by truncating a file. Without the ability to truncate files, database rollback can be very complex.
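
A minimal sketch of the rollback pattern on an ordinary filesystem (the log name and transaction framing are hypothetical); this is precisely the operation that stores without truncation cannot offer:

```python
import os

# Truncation-based rollback: record the log size before a transaction,
# and undo a failed transaction by truncating back to that size.

LOG_PATH = "transactions.log"
open(LOG_PATH, "ab").close()          # ensure the log exists for this demo

def begin_transaction(path: str) -> int:
    """Record the log size so a failed transaction can be undone."""
    return os.path.getsize(path)

def rollback(path: str, checkpoint: int) -> None:
    """Discard everything written after the checkpoint."""
    os.truncate(path, checkpoint)     # impossible on stores without truncate

checkpoint = begin_transaction(LOG_PATH)
try:
    with open(LOG_PATH, "ab") as log:
        log.write(b"partial transaction data...")
    raise RuntimeError("transaction failed")   # simulate a failure
except RuntimeError:
    rollback(LOG_PATH, checkpoint)    # log is back to its pre-transaction state
```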

Most encoding solutions cannot append to files. Appending is needed, for example, to aggregate streamed data that arrives in real time. Without the ability to append, a new file must be created for each batch, which can pollute a limited namespace with small files.
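
A minimal sketch of the append pattern; the file and function names are hypothetical:

```python
# Aggregate streamed records by appending to a single file, instead of
# creating one small file per batch.

AGGREGATE_PATH = "events.log"

def handle_batch(records: list[bytes]) -> None:
    # With append support: one file, one namespace entry, unlimited batches.
    with open(AGGREGATE_PATH, "ab") as f:
        for record in records:
            f.write(record + b"\n")

handle_batch([b"event-1", b"event-2"])

# Without append support, each batch becomes its own file
# (events-000001.log, events-000002.log, ...),
# consuming one namespace entry per batch.
```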