Jobs Usage

Use the Jobs API (available as Client.jobs) to start a job, kill it, get a list of running jobs, and so on.

This chapter describes the most common scenarios; see Jobs API Reference for the full list of job namespace methods.

Start a Job

To start a job, use the Jobs.start() method.

The method accepts an image and the name of a resource preset as parameters and returns a JobDescription with information about the started job:

from neuro_sdk import get

async with get() as client:
    job = await client.jobs.start(
        image=client.parse.remote_image("ubuntu:latest"),
        preset_name="cpu-small",
        command="sleep 30",
    )

The example above starts a job using the ubuntu:latest public image and the cpu-small resource preset, and executes the sleep 30 command inside the started container.

Note

After the call returns, a new job is scheduled for execution, but usually its status is pending: the Neuro Platform needs time to prepare resources for the job, pull the image from the registry, etc. Startup time can vary from seconds for a hot start to minutes for a cold one.
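
For example, inspecting the returned description right away typically shows the pending state (a minimal sketch continuing the example above):

print(job.status)  # most likely JobStatus.PENDING at this point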

Check Job Status

After spawning a job we have a JobDescription instance that describes the job status (and other details, such as the executed command line).

A job takes time to deploy, and it can be terminated for various reasons, e.g. the requested image name doesn't exist.

The following snippet waits until the job either starts executing or fails:

import asyncio

from neuro_sdk import JobStatus

# job is a JobDescription returned by the former client.jobs.start() call

while True:
    job = await client.jobs.status(job.id)
    if job.status in (JobStatus.RUNNING, JobStatus.SUCCEEDED):
        break
    elif job.status == JobStatus.FAILED:
        raise RuntimeError(f"Job {job.id} failed with {job.reason}: "
                           f"{job.history.description}")
    else:
        await asyncio.sleep(1)

Mount Neuro Storage folders

The Neuro Platform provides access to the Neuro storage (storage://) via folders mounted inside a container (volumes in Docker terminology).

If you have a directory storage:folder and want to mount it inside a container under the /var/data path, create a Volume and use it in the Container definition:

from neuro_sdk import Volume
from yarl import URL

volume = Volume(
    storage_uri=URL("storage:folder"),
    container_path="/var/data",
)

from neuro_sdk import Container, Resources

job = await client.jobs.run(
    Container(
        image=client.parse.remote_image("ubuntu:latest"),
        resources=Resources(memory_mb=100, cpu=0.5),
        command="sleep 30",
        volumes=[volume],
    )
)

There is a parsing helper that can construct a Volume instance from a string in the format used by the CLI:

volume = client.parse.volume("storage:folder:/var/data")

To specify a read-only mount point, pass read_only=True to the Volume constructor. E.g., the following code mounts the publicly shared storage://neuro/public folder in read-only mode:

public_volume = Volume(
    storage_uri=URL("storage:neuro/public"),
    container_path="/mnt/public",
    read_only=True,
)

The same effect can be achieved by using the parser API:

public_volume = client.parse.volume(
    "storage:neuro/public:/mnt/public:ro")

Pass a list of volumes into the container to support multiple mount points:

Container(
    image=...,
    resources=...,
    command=...,
    volumes=[volume, public_volume],
)

See also

Storage Usage for the storage manipulation API.

Kill a Job

Use Jobs.kill() to force a job to stop:

await client.jobs.kill(job.id)

Expose a job’s TCP ports locally

Sometimes you need to access, from your local workstation, TCP endpoints exposed by a job executed on the Neuro Platform.

For example, you’ve started a gRPC server inside a container on TCP port 12345 and want to access this service from your laptop.

You need to bridge the remote port 12345 into the local TCP namespace (e.g. to the local port 23456) with the Jobs.port_forward() method:

from grpclib.client import Channel

# generated by protoc
from .helloworld_pb2 import HelloRequest, HelloReply
from .helloworld_grpc import GreeterStub

async with client.jobs.port_forward(job.id, 23456, 12345):
    # open gRPC client and use it

    channel = Channel('127.0.0.1', 23456)
    greeter = GreeterStub(channel)

    reply: HelloReply = await greeter.SayHello(HelloRequest(name='Dr. Strange'))
    print(reply.message)

    channel.close()

The example uses the grpclib library to make client gRPC requests.

Job preemption

Job preemption means that, unlike normal jobs, preemptible ones can be stopped by the kernel when the system lacks resources, and restarted later. All memory and local disk changes are lost, but data written to the Neuro Storage (see Mount Neuro Storage folders) is persistent.

To support preemption, a job's code should be organized in the following way: it periodically dumps snapshots to disk, and on restart it checks for the last saved snapshot and continues the work from that point.
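
A minimal sketch of this pattern (the snapshot path and the unit of work are hypothetical, not part of the SDK; the path points at a mounted Neuro Storage volume so the snapshot survives preemption):

import os
import pickle

SNAPSHOT = "/var/data/snapshot.pkl"  # hypothetical path on a mounted storage volume

# On restart, pick up the last saved snapshot if one exists
if os.path.exists(SNAPSHOT):
    with open(SNAPSHOT, "rb") as f:
        state = pickle.load(f)
else:
    state = {"step": 0}

while state["step"] < 100:
    # ... do one unit of work here ...
    state["step"] += 1
    # Dump a snapshot so a preempted job can resume from this point
    with open(SNAPSHOT, "wb") as f:
        pickle.dump(state, f)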

AI frameworks usually support snapshots out of the box; see Saving and Loading Models in PyTorch or the Keras ModelCheckpoint callback.
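
For instance, with PyTorch the snapshot is a checkpoint file; a sketch under the same assumptions (the model, optimizer, and checkpoint path are illustrative stand-ins):

import os
import torch

model = torch.nn.Linear(10, 1)  # illustrative stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
CKPT = "/var/data/checkpoint.pt"  # hypothetical path on a mounted storage volume

start_epoch = 0
if os.path.exists(CKPT):  # resume from the last checkpoint after a restart
    ckpt = torch.load(CKPT)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 100):
    # ... one epoch of training goes here ...
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT,
    )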

A preemptible job is not as convenient as a regular one, but its computation time is much cheaper (exact numbers vary across cloud providers, e.g. Google Compute, AWS, or Azure).

Jobs are non-preemptible by default; you can change this by passing the preemptible_node=True flag to Jobs.run().
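
For example (a sketch using Jobs.start() from the first example; the assumption here is that it accepts the same preemptible_node keyword):

job = await client.jobs.start(
    image=client.parse.remote_image("ubuntu:latest"),
    preset_name="cpu-small",
    command="sleep 30",
    preemptible_node=True,  # allow scheduling on a preemptible node
)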