.. currentmodule:: neuro_sdk

.. _jobs-usage:

==========
Jobs Usage
==========

Use the Jobs API (available as :attr:`Client.jobs`) to start a job, kill it,
list running jobs, and so on. This chapter covers the most common scenarios;
see :ref:`jobs-reference` for the full list of job namespace methods.

Start a Job
===========

To start a job use the :meth:`Jobs.start` method. The method accepts an image
and the name of a required resources preset as parameters and returns a
:class:`JobDescription` with information about the started job::

   from neuro_sdk import get

   async with get() as client:
       job = await client.jobs.start(
           image=client.parse.remote_image("ubuntu:latest"),
           preset_name="cpu-small",
           command="sleep 30",
       )

The example above starts a job using the ``ubuntu:latest`` public image and
the ``cpu-small`` resources preset, and executes the ``sleep 30`` command
inside the started container.

.. note::

   After the call returns, the new job is *scheduled* for execution but its
   status is usually still *pending*. The Neuro Platform needs time to prepare
   resources for the job, pull the image from the registry, etc. Startup time
   can vary from seconds for a *hot* start to minutes for a *cold* one.

Check Job Status
================

After spawning a job we have a :class:`JobDescription` instance that
describes the job status (and other things like the executed command line).
A job takes time to deploy, and it can be terminated for various reasons,
e.g. the requested image name doesn't exist. The following snippet waits
until the job either *starts execution* or *fails*::

   import asyncio

   from neuro_sdk import JobStatus

   # job is a JobDescription returned by the client.jobs.start() call above
   while True:
       job = await client.jobs.status(job.id)
       if job.status in (JobStatus.RUNNING, JobStatus.SUCCEEDED):
           break
       elif job.status == JobStatus.FAILED:
           raise RuntimeError(f"Job {job.id} failed with {job.reason}: "
                              f"{job.history.description}")
       else:
           await asyncio.sleep(1)
.. _jobs-usage-mounts:

Mount Neuro Storage folders
===========================

The Neuro Platform provides access to Neuro storage (``storage://``) via
folders *mounted* inside a container (*volumes* in Docker terms). If you
have a directory ``storage:folder`` and want to mount it inside a container
under the ``/mnt/data`` path, create a :class:`Volume` and use it in the
:class:`Container` definition::

   from yarl import URL

   from neuro_sdk import Container, Resources, Volume

   volume = Volume(
       storage_uri=URL("storage:folder"),
       container_path="/mnt/data",
   )
   job = await client.jobs.run(
       Container(
           image=client.parse.remote_image("ubuntu:latest"),
           resources=Resources(memory_mb=100, cpu=0.5),
           command="sleep 30",
           volumes=[volume],
       )
   )

There is a parsing helper that can construct a :class:`Volume` instance
from a string in the format used by the :term:`CLI`::

   volume = client.parse.volume("storage:folder:/mnt/data")

To specify a *read-only* mount point, pass ``read_only=True`` to the
:class:`Volume` constructor. For example, the following code mounts the
public shared ``storage://neuro/public`` folder in read-only mode::

   public_volume = Volume(
       storage_uri=URL("storage:neuro/public"),
       container_path="/mnt/public",
       read_only=True,
   )

The same effect can be achieved with the parser API::

   public_volume = client.parse.volume(
       "storage:neuro/public:/mnt/public:ro")

Pass a list of *volumes* to the container to support multiple mount
points::

   Container(
       image=...,
       resources=...,
       command=...,
       volumes=[volume, public_volume],
   )

.. seealso::

   :ref:`storage-usage` for the storage manipulation API.

Kill a Job
==========

Use :meth:`Jobs.kill` to force a job to stop::

   await client.jobs.kill(job.id)

Expose a job's TCP ports locally
================================

Sometimes you need to access TCP endpoints exposed by a job running on the
Neuro Platform from your local workstation. For example, you've started a
gRPC server inside a container on TCP port ``12345`` and want to access
this service from your laptop.
You need to bridge this *remote* ``12345`` port into the local TCP
namespace (e.g. to the *local* port ``23456``) using the
:meth:`Jobs.port_forward` method::

   from grpclib.client import Channel

   # generated by protoc
   from .helloworld_pb2 import HelloRequest, HelloReply
   from .helloworld_grpc import GreeterStub

   async with client.jobs.port_forward(job.id, 23456, 12345):
       # open a gRPC client and use it
       channel = Channel('127.0.0.1', 23456)
       greeter = GreeterStub(channel)

       reply: HelloReply = await greeter.SayHello(HelloRequest(name='Dr. Strange'))
       print(reply.message)

       channel.close()

The example uses the ``grpclib`` library to make client gRPC requests.

.. _job-preemption:

Job preemption
==============

Job preemption means that, unlike normal jobs, preemptible ones can be
stopped by the kernel when the system lacks resources, and restarted later.
All memory and local disk changes are lost, but data written to Neuro
Storage (see :ref:`jobs-usage-mounts`) is persistent.

To support preemption, a job's code should be organized as follows: it
dumps *snapshots* to disk periodically, and on restart it checks for the
last saved snapshot and continues the work from that point. AI frameworks
usually support snapshots out of the box, e.g. saving and loading models in
PyTorch or the Keras ``ModelCheckpoint`` callback.

A preemptible job is not as convenient as a regular one, but its
computational time is much cheaper (exact numbers vary by cluster provider,
e.g. Google Compute, AWS, or Azure).

Jobs are *non-preemptible* by default; you can change this by passing the
``preemptible_node=True`` flag to :meth:`Jobs.run`.