Workload scheduling
When ClickHouse execute multiple queries simultaneously, they may be using shared resources (e.g. disks). Scheduling constraints and policies can be applied to regulate how resources are utilized and shared between different workloads. For every resource a scheduling hierarchy can be configured. Hierarchy root represents a resource, while leafs are queues, holding requests that exceed resource capacity.
Currently only remote disk IO can be scheduled using described method. For CPU scheduling see settings about thread pools and concurrent_threads_soft_limit_num
. For flexible memory limits see Memory overcommit
Disk configuration
To enable IO scheduling for a specific disk, you have to specify read_resource
and/or write_resource
in storage configuration. It says ClickHouse what resource should be used for every read and write requests with given disk. Read and write resource can refer to the same resource name, which is useful for local SSDs or HDDs. Multiple different disks also can refer to the same resource, which is useful for remote disks: if you want to be able to allow fair division of network bandwidth between e.g. "production" and "development" workloads.
Example:
<clickhouse>
<storage_configuration>
...
<disks>
<s3>
<type>s3</type>
<endpoint>https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/root-path/</endpoint>
<access_key_id>your_access_key_id</access_key_id>
<secret_access_key>your_secret_access_key</secret_access_key>
<read_resource>network_read</read_resource>
<write_resource>network_write</write_resource>
</s3>
</disks>
<policies>
<s3_main>
<volumes>
<main>
<disk>s3</disk>
</main>
</volumes>
</s3_main>
</policies>
</storage_configuration>
</clickhouse>
Workload markup
Queries can be marked with setting workload
to distinguish different workloads. If workload
is not set, than value "default" is used. Note that you are able to specify the other value using settings profiles. Setting constraints can be used to make workload
constant if you want all queries from the user to be marked with fixed value of workload
setting.
It is possible to assign a workload
setting for background activities. Merges and mutations are using merge_workload
and mutation_workload
server settings correspondingly. These values can also be overridden for specific tables using merge_workload
and mutation_workload
merge tree settings
Let's consider an example of a system with two different workloads: "production" and "development".
SELECT count() FROM my_table WHERE value = 42 SETTINGS workload = 'production'
SELECT count() FROM my_table WHERE value = 13 SETTINGS workload = 'development'
Resource scheduling hierarchy
From the standpoint of scheduling subsystem a resource represents a hierarchy of scheduling nodes.
Possible node types:
inflight_limit
(constraint) - blocks if either number of concurrent in-flight requests exceedsmax_requests
, or their total cost exceedsmax_cost
; must have a single child.bandwidth_limit
(constraint) - blocks if current bandwidth exceedsmax_speed
(0 means unlimited) or burst exceedsmax_burst
(by default equalsmax_speed
); must have a single child.fair
(policy) - selects the next request to serve from one of its children nodes according to max-min fairness; children nodes can specifyweight
(default is 1).priority
(policy) - selects the next request to serve from one of its children nodes according to static priorities (lower value means higher priority); children nodes can specifypriority
(default is 0).fifo
(queue) - leaf of the hierarchy capable of holding requests that exceed resource capacity.
To be able to use the full capacity of the underlying resource, you should use inflight_limit
. Note that a low number of max_requests
or max_cost
could lead to not full resource utilization, while too high numbers could lead to empty queues inside the scheduler, which in turn will result in policies being ignored (unfairness or ignoring of priorities) in the subtree. On the other hand, if you want to protect resources from too high utilization, you should use bandwidth_limit
. It throttles when the amount of resource consumed in duration
seconds exceeds max_burst + max_speed * duration
bytes. Two bandwidth_limit
nodes on the same resource could be used to limit peak bandwidth during short intervals and average bandwidth for longer ones.
The following example shows how to define IO scheduling hierarchies shown in the picture:
<clickhouse>
<resources>
<network_read>
<node path="/">
<type>inflight_limit</type>
<max_requests>100</max_requests>
</node>
<node path="/fair">
<type>fair</type>
</node>
<node path="/fair/prod">
<type>fifo</type>
<weight>3</weight>
</node>
<node path="/fair/dev">
<type>fifo</type>
</node>
</network_read>
<network_write>
<node path="/">
<type>inflight_limit</type>
<max_requests>100</max_requests>
</node>
<node path="/fair">
<type>fair</type>
</node>
<node path="/fair/prod">
<type>fifo</type>
<weight>3</weight>
</node>
<node path="/fair/dev">
<type>fifo</type>
</node>
</network_write>
</resources>
</clickhouse>
Workload classifiers
Workload classifiers are used to define mapping from workload
specified by a query into leaf-queues that should be used for specific resources. At the moment, workload classification is simple: only static mapping is available.
Example:
<clickhouse>
<workload_classifiers>
<production>
<network_read>/fair/prod</network_read>
<network_write>/fair/prod</network_write>
</production>
<development>
<network_read>/fair/dev</network_read>
<network_write>/fair/dev</network_write>
</development>
<default>
<network_read>/fair/dev</network_read>
<network_write>/fair/dev</network_write>
</default>
</workload_classifiers>
</clickhouse>
See also
- system.scheduler
- merge_workload merge tree setting
- merge_workload global server setting
- mutation_workload merge tree setting
- mutation_workload global server setting