Apache Druid Query Processing

Apache Druid provides HTTP query APIs (a native JSON query API and a SQL API) that enable data, including data still being ingested, to be queried on demand. This document explains how queries are processed by Apache Druid's storage and query-processing services, how to optimise query performance through caching, and how these pieces fit together when building real-time dashboards on top of Druid. Diagrams and examples are used to illustrate the functionality and terminology involved.

Query execution on Apache Druid

Druid queries are submitted to the Broker. For SQL, the Broker first plans the statement into a native Druid query; it then determines which segments are relevant, fans the query out to the Historical processes and real-time ingestion tasks that serve those segments, and merges their partial results.

The query engine uses Druid's column-oriented segment layout to execute queries very efficiently: filters and aggregations are pushed down to the data-serving processes so that only the relevant columns and rows are read and processed.

The result of a Druid SQL query is a table with columns and rows; native queries return JSON whose shape depends on the query type.

Query caching in Apache Druid

When data is queried in Apache Druid, the Broker routes the parts of the query that cover recent, still-mutable data to real-time ingestion tasks and the parts that cover published segments to Historical processes; the partial results are then merged and returned to the client. Caching can be applied at several points in this pipeline to avoid recomputing results that have already been produced.

Apache Druid query cache types

A Druid query may touch two kinds of data: historical data that has been published as immutable segments, and real-time data that is still being ingested.
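Whichever kind of data a query touches, it enters Druid the same way: the client POSTs it to the Broker over HTTP. As a minimal sketch, the following Python helper builds (without sending) a request for the Broker's SQL endpoint, /druid/v2/sql; the localhost address and default Broker port 8082 in the usage comment are assumptions about a local test deployment.

```python
import json
import urllib.request


def build_sql_request(broker_url: str, sql: str) -> urllib.request.Request:
    """Build, but do not send, a POST request for Druid's SQL endpoint."""
    payload = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/druid/v2/sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Once a Broker is reachable (here: a local one on the default port 8082):
# with urllib.request.urlopen(build_sql_request("http://localhost:8082",
#                                               "SELECT COUNT(*) FROM sales")) as resp:
#     rows = json.load(resp)  # SQL results arrive as a JSON array of rows
```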
Queries over historical data read from segments that Historical processes serve from local storage; segments are memory-mapped, so only the columns a query actually needs are paged in, and there is no requirement to load a whole table into memory before executing. Queries over real-time data read from the in-memory indexes of the ingestion tasks, so recent events become queryable within seconds of arriving.

Per-segment caching

One of the benefits of Apache Druid is that it supports per-segment caching: the partial result computed for each individual segment is cached, which can significantly improve the performance of repeated queries. When configuring per-segment caching, there are a few things to keep in mind: • Published segments are immutable, so their cache entries do not go stale; the cache key includes the segment identifier, and a segment that is overwritten by reindexing gets a new identifier and therefore a new entry. • Results for segments still being built by real-time tasks change continuously, which is why caching is handled separately (and is disabled by default) on the services that execute ingestion tasks. • The cache backend and its size are set with the druid.cache.type and druid.cache.sizeInBytes runtime properties; the default local backend (caffeine) evicts entries by size. If a very large number of segments is being queried, not every per-segment result will fit in the cache at once; sizing druid.cache.sizeInBytes to your working set keeps the hit rate high without exhausting memory.
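Concretely, per-segment caching is configured through service runtime properties. A minimal sketch for a Historical process follows; the 512 MiB cache size is an arbitrary example, not a recommendation:

```properties
# runtime.properties on a Historical process.
# Read from, and write new results to, the per-segment cache:
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true

# Local cache backend and its byte budget (512 MiB, as an example):
druid.cache.type=caffeine
druid.cache.sizeInBytes=536870912
```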
There is no separate entry-count limit to tune: the byte budget set by druid.cache.sizeInBytes bounds how many per-segment results stay cached at any given time.

Whole-query caching

A query cache is a mechanism for storing the results of frequently run queries so that future executions can avoid the overhead of re-running them. In addition to the per-segment cache, Apache Druid supports whole-query (result-level) caching, in which the entire merged result set of a query is cached, not just the per-segment pieces. This is helpful when running large, complex queries whose results are used multiple times. For example, say we want the total sales by month and year for each customer. It is easy to imagine needing this result more than once in an analysis pipeline. We could store it in an external database, but then every consumer would have to fetch it from that database again; it is much more efficient to serve it straight from Druid.

How to enable caching in Apache Druid

Caching is a critical performance optimisation in Apache Druid. Per-segment caching is enabled by default on Historical processes, which allows the same query to be answered repeatedly without re-reading segment data from deep storage. Cache settings live in each service's runtime.properties file, and the local cache size is controlled by druid.cache.sizeInBytes, which should be set per service based on the memory available. Sizing the cache is a tradeoff between the memory it consumes and the hit rate it delivers; there is no single right answer. An additional factor that could affect your decision is whether you plan to query current data at day granularity.
If so, size the Historical cache so that it can hold roughly two days' worth of results.

Enabling query caching on Historical

Query caching can be extremely helpful in increasing the performance of Apache Druid, especially when querying large data sets. On a Historical process, per-segment caching is controlled by two properties in its runtime.properties file: druid.historical.cache.useCache (whether cached results are read) and druid.historical.cache.populateCache (whether newly computed results are written to the cache).

Two points are worth understanding about what gets cached. First, entries are keyed per segment, and published segments are immutable, so cached results do not silently go stale. Second, because results are cached per process, a warm cache on one Historical does not help queries that are served by a different one.

Enabling query caching on task executor services

By default, the services that execute ingestion tasks do not cache query results for the real-time data they serve, since that data changes continuously. To enable caching there, set druid.realtime.cache.useCache=true and druid.realtime.cache.populateCache=true in the relevant service's configuration. Cached results are held in memory by the configured cache backend and evicted when the cache is full; with the default caffeine backend, eviction is size-based within the budget set by druid.cache.sizeInBytes, and a time-based expiry can be configured as well. If a result cannot be cached, the query is simply executed as usual.

Enabling query caching on Brokers

When query caching is enabled on a Broker, the Broker keeps cached results for queries that have been run against each data source.
If a new query comes in that is identical to a cached query, the Broker will simply return the cached results instead of running the query again. This can dramatically improve performance for dashboards or other applications that run the same queries repeatedly. On the Broker, per-segment caching is enabled with the druid.broker.cache.useCache and druid.broker.cache.populateCache properties, and whole-query (result-level) caching with druid.broker.cache.useResultLevelCache and druid.broker.cache.populateResultLevelCache.

Enabling caching in the query context

Query processing in Apache Druid is often thought of as a two-step process: first the query is planned (a SQL statement is translated into a native Druid query), and then that plan is executed against the data. However, there is a step in between: query optimisation, which reorders or restructures parts of the plan to reduce the number of operations needed to execute it. Caching fits into the same gap: before recomputing a result, Druid checks whether a cached per-segment or whole-query result can be reused. While caching is configured per service, it can also be controlled per query through the query context: the useCache, populateCache, useResultLevelCache, and populateResultLevelCache context flags override the service-level defaults for an individual query, which is useful, for example, for bypassing the cache while benchmarking.
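The context flags just mentioned can be attached to any native query (or passed in the "context" field of a SQL API request). The following Python sketch, which assumes a hypothetical "sales" datasource, shows how they might be toggled per query:

```python
def with_cache_context(native_query: dict,
                       use_cache: bool = True,
                       populate_cache: bool = True,
                       use_result_level_cache: bool = True) -> dict:
    """Return a copy of a native Druid query with per-query cache flags set.

    useCache / populateCache control the per-segment cache;
    useResultLevelCache / populateResultLevelCache control the whole-query
    cache. The flags only take effect where caching is enabled on the
    services themselves.
    """
    query = dict(native_query)                  # shallow copy; original untouched
    context = dict(query.get("context", {}))
    context.update({
        "useCache": use_cache,
        "populateCache": populate_cache,
        "useResultLevelCache": use_result_level_cache,
        "populateResultLevelCache": use_result_level_cache,
    })
    query["context"] = context
    return query


# Example: a timeseries query over a hypothetical "sales" datasource that
# opts out of all caching for a single run (e.g. while benchmarking).
base = {
    "queryType": "timeseries",
    "dataSource": "sales",
    "granularity": "day",
    "intervals": ["2024-01-01/2024-02-01"],
    "aggregations": [
        {"type": "doubleSum", "name": "revenue", "fieldName": "amount"},
    ],
}
uncached = with_cache_context(base, use_cache=False, populate_cache=False,
                              use_result_level_cache=False)
```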