Source: ENKI Blog

Making Big Data work in the public cloud

An increasing number of ENKI's clients are monetizing their transactional data by storing large amounts of it permanently for the revenue opportunities that come from mining that data. Because of this they are starting to run two databases: one for high-speed transactional work and the other for data-mining queries against their data warehouse. However, making both of these databases work successfully in the cloud requires radically different skills, and sometimes even different technologies, and we see our clients struggling with the challenge, especially if they are short on database administration skills or have administrators used to smaller deployments.

To quickly review transactional database performance management in the cloud, it comes down to three critical points: query efficiency, cache size, and I/O speed. Aside from query optimization, there is a discussion of the other points in Optimizing Machine Performance and Costs in a Provisioned Storage Performance Cloud. Query optimization is critical: you must have an application DBA on staff or on call if it looks like adjusting I/O performance and caching isn't doing anything. In fact, we saw this as we migrated customers to our new standard storage system in our latest datacenter: a number of them showed little of the expected increase in application performance because they were still running inefficient queries that were maxing out I/O on their virtual machines. The access time to storage had decreased by 30% and the throughput had increased by at least 3x, but their queries, being so dependent on access time due to all the random disk accesses they created, and not using cache efficiently, only sped up slightly. To help these customers, we placed our provisioned-IOPS storage tier on an array from Nimble Storage, which optimizes access time by caching a large amount of data in SSD storage. Even so, these clients would best be served by query optimization to avoid unnecessary I/O performance charges.

Big data operations follow a different rule. Typically a big data query will sift through large portions of a database looking for patterns. This requires more I/O throughput to complete the query (often MUCH more!) but is less dependent on access times, since accesses are sequential, which is perfect for rotating disk storage. Cache, so important for transactional database loads, is almost irrelevant because the amounts of data are so large that they cannot be cached in virtual machine memory, whether due to cost or to hardware architecture limitations. Instead, both the database and the storage, as well as the cloud itself, must be optimized for passing large amounts of data through the VM for analysis. To make this possible, the entire ENKI cloud runs on dual 10GbE links that connect servers to storage, and we offer very wide-stripe RAID 10 with 15kRPM drives for massive data transfer speeds to the VM.

In order to use this kind of storage throughput effectively, the database must actually perform sequential data access when a big data query is run. For a relational database, the queries and the schema must be tuned to allow this. One of the key requirements (as with transactional queries) is to set the indexes up properly so that they can be cached or quickly retrieved, which means they have to be small. If very fast storage is available, putting the index on it may significantly improve overall response time. Also, the database VM must have enough RAM to cache the indexes if your performance plan requires it. The two sketches below illustrate the query-diagnosis and index-sizing points.
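First, a minimal sketch of the kind of query diagnosis described above, using PostgreSQL's EXPLAIN ANALYZE via psycopg2. The connection string and the table and column names (orders, customer_id) are hypothetical placeholders, not any particular customer's schema:

    import psycopg2

    # Hypothetical DSN; substitute your own database and credentials.
    conn = psycopg2.connect("dbname=appdb user=dba")
    cur = conn.cursor()

    # Ask the planner how it actually executed the query.
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) "
                "SELECT * FROM orders WHERE customer_id = %s", (42,))
    for (line,) in cur.fetchall():
        print(line)

    # A sequential scan over a large table here means every execution pays
    # for I/O across the whole heap; a small index turns the query into a
    # few cached index probes plus targeted row fetches.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_orders_customer "
                "ON orders (customer_id)")
    conn.commit()

A query stuck on a sequential scan like this is exactly the kind that showed little benefit from the faster storage in the migration described above.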
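Second, a back-of-the-envelope check on whether a warehouse's indexes can live in VM RAM. Every number here is an assumption for illustration, not a measurement from a real deployment:

    # Assumed figures: a 2-billion-row fact table with roughly 16 bytes of
    # index footprint per row (8-byte key plus per-entry overhead).
    rows = 2_000_000_000
    bytes_per_entry = 16
    index_gib = rows * bytes_per_entry / 2**30
    print(f"index size ~ {index_gib:.1f} GiB")   # ~29.8 GiB

    # Candidate VM size; leave at least half of RAM for the OS, the
    # database's own buffers, and sort/work memory.
    vm_ram_gib = 64
    print("indexes cacheable in RAM:", index_gib < vm_ram_gib * 0.5)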
Also, the data being accessed for queries must be stored in the largest possible chunks to maximize sequential read performance. This is why many big data projects have turned to NoSQL databases like Mongo, which retrieve larger objects using relatively small indices. However, the choice of database also depends on the application architecture and the data itself, since practical use of an object database for data warehousing may require grouping or agglomerating smaller chunks of data from the transactional records in such a way that they can no longer be individually retrieved; the first sketch below shows one such grouping. If an object database is not appropriate, traditional techniques for reducing complexity and file size have to be applied to the SQL database to keep it maintainable and reliable, including sharding and spreading storage provisioning across multiple storage arrays for both performance and reliability (a sharding sketch follows below). ENKI can accommodate these requirements. The one area to watch out for when using object storage directly is aggregate throughput: some object storage systems simply cannot deliver the aggregate throughput of block- or file-based storage, especially if writes are involved, because the common object store architecture spreads data across multiple separate hardware systems.

Finally, some cloud providers offer pure SSD storage, usually within the hardware hosting the cloud instance. While the speed of this SSD is hard to equal, the challenges of keeping the SSD current from longer-term object storage, and of regularly flushing the SSD back to the object storage when writes occur, are not solved automatically; your application or a third-party file system would have to be configured to maintain the local SSD "cache" (the last sketch below shows the basic pattern). In addition, these SSD deployments are in clouds with ephemeral instances, so your application will have to take into account that a server failure may cause a query to return incomplete results, or even to crash. On-instance SSD may pay off for the largest big data projects, but for the bulk of 10-100 terabyte deployments, the techniques above should prove adequate.
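Here is a minimal sketch of the agglomeration idea, using pymongo. All names (the warehouse database, the transactions and daily_activity collections, and the record fields) are assumptions for illustration; the point is that each read then fetches one large document through one small index entry:

    from collections import defaultdict
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")  # assumed connection
    db = client["warehouse"]

    # Group individual transactions by (customer, day) into one large
    # document per bucket; many small rows become one big sequential read.
    buckets = defaultdict(list)
    for txn in db.transactions.find({}, {"_id": 0}):
        buckets[(txn["customer_id"], txn["date"])].append(txn)

    for (customer_id, date), txns in buckets.items():
        db.daily_activity.insert_one(
            {"customer_id": customer_id, "date": date, "txns": txns})

    # The index stays small: one entry per bucket, not one per transaction.
    db.daily_activity.create_index(
        [("customer_id", ASCENDING), ("date", ASCENDING)])

Note the trade-off mentioned above: once a day's transactions are folded into a single document, they can no longer be fetched individually without reading the whole bucket.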
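For the SQL path, sharding can be as simple as a stable hash from a shard key to one of several databases, each provisioned on a different storage array. The connection strings below are hypothetical:

    import hashlib

    # Hypothetical shard connection strings, one per storage array.
    SHARDS = [
        "dbname=warehouse0 host=array0.example.com",
        "dbname=warehouse1 host=array1.example.com",
        "dbname=warehouse2 host=array2.example.com",
    ]

    def shard_for(key: str) -> str:
        """Stable mapping from a shard key to a connection string."""
        digest = hashlib.md5(key.encode()).digest()
        return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

    print(shard_for("customer:42"))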
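And here is the basic read-through / write-back pattern for an on-instance SSD "cache" in front of an object store, sketched with boto3 against an S3-style API. The bucket name and mount point are assumptions, and a production version would also need locking, eviction, and crash recovery:

    import os
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "warehouse-archive"        # assumed bucket name
    CACHE_DIR = "/mnt/local-ssd/cache"  # assumed on-instance SSD mount
    os.makedirs(CACHE_DIR, exist_ok=True)

    def read_through(key: str) -> str:
        """Return a local path, fetching from object storage on a miss."""
        path = os.path.join(CACHE_DIR, key.replace("/", "_"))
        if not os.path.exists(path):
            s3.download_file(BUCKET, key, path)
        return path

    def write_back(key: str, path: str) -> None:
        """Flush a locally written file back to object storage."""
        s3.upload_file(path, BUCKET, key)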
