Creating a batch compute service

Overview

A batch service runs jobs on a homogeneous cluster managed by a batch scheduler. The batch scheduler receives requests, each of which asks for a number of compute nodes, a number of cores per compute node, and a duration. Requests wait in a queue and are dispatched to the requested compute resources in a space-sharing manner, according to one of several available batch scheduling algorithms. A job submitted to the service therefore experiences a “queue waiting time” period (the length of which depends on the load on the service) followed by an “execution time” period. In typical batch-scheduler fashion, a running job is forcefully terminated when it reaches its requested duration (i.e., the job fails). If, instead, the job completes before the requested duration, it succeeds. In both cases, the job’s allocated compute resources are reclaimed by the batch scheduler.
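The queueing behavior described above can be illustrated with a small standalone sketch. It is not WRENCH code and does not reflect how WRENCH implements its schedulers: the types JobRequest and JobOutcome and the function simulate_fcfs are hypothetical names introduced here. The sketch assumes strict first-come-first-served dispatching without backfilling, that all jobs are submitted at time 0, and that only whole nodes are allocated (cores per node are ignored).

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical request type for this sketch (not a WRENCH class).
struct JobRequest {
    std::string name;
    int num_nodes;              // requested number of compute nodes
    double requested_duration;  // requested duration, in seconds
    double actual_runtime;      // time the job really needs, in seconds
};

// Hypothetical result type for this sketch (not a WRENCH class).
struct JobOutcome {
    std::string name;
    double start_time;  // equals the queue waiting time, since submission is at t = 0
    double end_time;
    bool succeeded;     // false if the job was terminated at its requested duration
};

// Toy model: strict FCFS, no backfilling, all jobs submitted at time 0 in queue order.
std::vector<JobOutcome> simulate_fcfs(int total_nodes, const std::vector<JobRequest>& queue) {
    // Min-heap of (end time, nodes held) for jobs that have been started.
    using Running = std::pair<double, int>;
    std::priority_queue<Running, std::vector<Running>, std::greater<Running>> running;

    std::vector<JobOutcome> outcomes;
    int free_nodes = total_nodes;
    double now = 0.0;  // start time of the most recently dispatched job

    for (const auto& job : queue) {
        if (job.num_nodes > total_nodes) {
            // The request can never be satisfied; record it as failed without running it.
            outcomes.push_back({job.name, -1.0, -1.0, false});
            continue;
        }
        // Advance time, reclaiming nodes from finished jobs, until the request fits.
        while (free_nodes < job.num_nodes) {
            Running finished = running.top();
            running.pop();
            now = std::max(now, finished.first);
            free_nodes += finished.second;
        }
        // A job succeeds only if it completes before its requested duration;
        // otherwise it is terminated when that duration is reached.
        bool ok = job.actual_runtime < job.requested_duration;
        double end = now + std::min(job.actual_runtime, job.requested_duration);
        running.push({end, job.num_nodes});
        free_nodes -= job.num_nodes;
        outcomes.push_back({job.name, now, end, ok});
    }
    return outcomes;
}
```

For instance, on a 4-node cluster with the queue A (2 nodes, 100 s requested, 80 s needed), B (2 nodes, 50 s requested, 70 s needed), and C (4 nodes, 60 s requested, 30 s needed), jobs A and B start immediately, B is terminated at time 50 and fails, and C waits in the queue until time 80, when A releases its nodes.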

A batch service also supports so-called “pilot jobs”, i.e., jobs that are submitted to the service with requested resources and a requested duration, but without specifying at submission time which workflow tasks/operations the job should perform. Instead, once the job starts, it exposes a bare-metal compute service to its submitter. This service is available only for the requested duration, and the submitter can use it in any manner. This allows late binding of workflow tasks to compute resources.

Creating a batch compute service

In WRENCH, a batch service is defined by the wrench::BatchComputeService class. An instantiation of a batch service requires the following parameters:

The name of a host on which to start the service;
A list (std::vector) of hostnames (all cores and all RAM of each host are available to the batch service);
A mount point (corresponding to a disk attached to the host) for the scratch space, i.e., storage local to the batch service (used to store workflow files, as needed, during job executions); and
Maps (std::map) of configurable properties (wrench::BatchComputeServiceProperty) and configurable message payloads (wrench::BatchComputeServiceMessagePayload).

The example below creates an instance of a batch service that runs on host Gateway and provides access to 4 hosts (using all their cores and RAM), with scratch space on the disk mounted at path /scratch/ on host Gateway. Furthermore, the service is configured to use the FCFS (First-Come-First-Served) batch scheduling algorithm, and the payload of the message with which the service answers resource description requests is configured to be 1 KiB (1024 bytes):

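auto batch_cs = simulation->add(
    new wrench::BatchComputeService(
        // Host on which the service runs
        "Gateway",
        // Compute hosts managed by the service
        {"Node1", "Node2", "Node3", "Node4"},
        // Mount point of the scratch space
        "/scratch/",
        // Configurable properties
        {{wrench::BatchComputeServiceProperty::BATCH_SCHEDULING_ALGORITHM, "fcfs"}},
        // Configurable message payloads (in bytes)
        {{wrench::BatchComputeServiceMessagePayload::RESOURCE_DESCRIPTION_ANSWER_MESSAGE_PAYLOAD, 1024}}));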

See the documentation of wrench::BatchComputeServiceProperty and wrench::BatchComputeServiceMessagePayload for all possible configuration options.

Also see the simulators in the examples/workflow_api/basic-examples/batch-*/ and examples/action_api/batch-*/ directories, which use batch compute services.