Backend.AI Documentation
Latest API version: v6.20220615
Backend.AI is an enterprise-grade development and service backend for a wide range of AI-powered applications. Its core technology is tailored for operating high density computing clusters including GPUs and heterogeneous accelerators.
From the user’s perspective, Backend.AI is a cloud-like GPU-powered HPC/DL application host (“Google Colab on your machine”). It runs arbitrary user code safely in resource-constrained containers. It hosts various programming languages and runtimes, such as Python 2/3, R, PHP, C/C++, Java, JavaScript, Julia, Octave, Haskell, Lua, and Node.js, as well as AI-oriented libraries such as TensorFlow, Keras, Caffe, and MXNet.
From the admin’s perspective, Backend.AI streamlines the process of assigning computing nodes, GPUs, and storage space to individual research team members. With detailed policy-based idle checks and resource limits, you no longer have to worry about exceeding the capacity of the cluster under high demand.
Using its plugin architecture, Backend.AI also offers more advanced features, such as fractional sharing of GPUs and site-specific SSO integrations, for enterprise customers of various sizes.
Backend.AI Concepts
Here we describe the key concepts that are required to understand and follow this documentation.
The diagram of a typical multi-node Backend.AI server architecture
Fig. 1 shows a brief Backend.AI server-side architecture where the components are what you need to install and configure.
Each border-connected group of components is intended to be run on the same server, but you may split them into multiple servers or merge different groups into a single server as you need. For example, you can run separate servers for the nginx reverse-proxy and the Backend.AI manager or run both on a single server. In the development setup, all these components run on a single PC such as your laptop.
Service Components
Public-facing services
Manager and Webserver
Backend.AI manager is the central governor of the cluster. It accepts user requests, creates and destroys sessions, and routes code execution requests to appropriate agents and sessions. It also collects the output of sessions and returns it to the users.
Backend.AI agent is a small daemon installed onto individual worker servers to control them. It manages and monitors the lifecycle of kernel containers and mediates the input/output of sessions. Each agent also reports the resource capacity and status of its server, so that the manager can assign new sessions to idle servers for load balancing.
The primary networking requirements are:
The manager server (the HTTPS 443 port) should be exposed to the public Internet or the network that your client can access.
The manager, agents, and all other database/storage servers should reside in the same local private network where any traffic between them is transparently allowed.
For high-volume big-data processing, you may want to separate the network for the storage using a secondary network interface on each server, such as InfiniBand and RoCE adaptors.
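For example, on an Ubuntu manager node these rules might translate to ufw commands like the following sketch (the 10.20.30.0/24 subnet is an assumption; substitute your own private cluster network):

$ sudo ufw allow 443/tcp                # expose the manager's HTTPS port to clients
$ sudo ufw allow from 10.20.30.0/24     # allow all traffic within the private cluster network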
App Proxy
Backend.AI App Proxy is a proxy to mediate the traffic between user applications and clients like browsers. It provides the central place to set the networking and firewall policy for the user application traffic.
It has two operation modes:
Port mapping: Individual app instances are mapped to a TCP port taken from a pre-configured port range.
Wildcard subdomain: Individual app instances are mapped to a system-generated subdomain under the given top-level domain.
Depending on the session type and application launch configurations, it may require an authenticated HTTP session for HTTP-based applications. For instance, you may enforce authentication for interactive development apps like Jupyter while allowing anonymous access for AI model service APIs.
Storage Proxy
Backend.AI Storage Proxy is a proxy that offloads large file transfers from the manager. It also provides an abstraction over the underlying storage vendors’ acceleration APIs, since many storage vendors offer vendor-specific APIs for filesystem operations such as scanning directories with millions of files. Using the Storage Proxy, we apply our abstraction models for such filesystem operations and quota management, specialized for each vendor API.
FastTrack (Enterprise only)
Backend.AI FastTrack is an add-on service running on top of the manager that features a slick GUI to design and run pipelines of computation tasks. It makes it easier to monitor the progress of various MLOps pipelines running concurrently, and allows sharing of such pipelines in portable ways.
Resource Management
Sokovan Orchestrator
Backend.AI Sokovan is the central cluster-level scheduler running inside the manager. It monitors the resource usage of agents and assigns new containers from the job queue to the agents.
Each resource group may have its own scheduling policy and options. The scheduling algorithm may be extended using a common abstract interface. A scheduler implementation accepts the list of currently running sessions, the list of pending sessions in the job queue, and the current resource usage of the target agents. It then outputs the choice of a pending session to start and the assignment of an agent to host it.
Agent
Backend.AI Agent is a small daemon running on each compute node, such as a GPU server. Its main job is to control and monitor containers via Docker, but it also includes an abstraction of various “compute process” backends. It publishes various types of container-related events so that the manager can react to status updates of containers.
When the manager assigns a new container, the agent decides the device-level resource mappings for the container considering optimal hardware layouts such as NUMA and the PCIe bus locations of accelerator and network devices.
Internal services
Event bus
Backend.AI uses Redis to keep track of various real-time information and notify system events to other service components.
Control Panel (Enterprise only)
Backend.AI Control Panel is an add-on service to the manager for advanced management and monitoring. It provides a dedicated superadmin GUI, featuring batch creation and modification of users, detailed configuration of various resource policies, and more.
Forklift (Enterprise only)
Backend.AI Forklift is a standalone service that eases building new container images from scratch or importing existing ones that are compatible with Backend.AI.
Reservoir (Enterprise only)
Backend.AI Reservoir is an add-on service to provide open source package mirrors for air-gapped setups.
Container Registry
Backend.AI supports integration with several common container registry solutions, while open-source users may also rely on our official registry service with prebuilt images at https://cr.backend.ai:
Docker’s vanilla open-source registry
It is the simplest to set up, but does not provide advanced access controls and namespacing over container images.
Harbor v2 (recommended)
It provides a full-fledged container registry service including ACLs with project/user memberships, cloning from/to remote registries, on-premise and cloud deployments, and security analysis.
Computing
Sessions and kernels
Backend.AI spawns sessions to host various kinds of computation with associated computing resources. Each session may have one or more kernels. We call sessions with multiple kernels “cluster sessions”.
A kernel represents an isolated unit of computation such as a container, a virtual machine, a native process, or even a Kubernetes pod, depending on the Agent’s backend implementation and configurations. The most common form of a kernel is a Docker container. Container- or VM-based kernels are also associated with base images; the most common form of a base image is an OCI container image.
Kernel roles in a cluster session
In a cluster session with multiple kernels, each kernel has a role. By default, the first container takes the “main” role while the others take the “sub” role. All kernels are given unique hostnames such as “main1”, “sub1”, “sub2”, …, and “subN” (the cluster size is N+1 in this case). A non-cluster session has one “main1” kernel only.
All interactions with a session are routed to its “main1” kernel, while the “main1” kernel is allowed to access all other kernels via a private network.
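For example, from the “main1” kernel of a cluster session you can reach the other kernels by hostname (a sketch; the passwordless SSH setup between kernels is described in the Cluster Networking section below):

$ # run inside the "main1" kernel of a 3-kernel cluster session
$ ssh sub1 hostname
$ ssh sub2 hostname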
See also
Session templates
A session template is a predefined set of parameters to create a session, which can be overridden by the caller. It may define additional kernel roles for a cluster session, with different base images and resource specifications.
Session types
There are several classes of sessions for different purposes, each with different features.
| Feature | Compute (interactive) | Compute (batch) | Inference | System |
|---|---|---|---|---|
| Code execution | ✓ | ✗ | ✗ | ✗ |
| Service port | ✓ | ✓ | ✓ | ✓ |
| Dependencies | ✗ | ✓ | ✗ | ✗ |
| Session result | ✗ | ✓ | ✗ | ✗ |
| Clustering | ✓ | ✓ | ✓ | ✓ |
Compute session is the most generic form of session to host computations. It has two operation modes: interactive and batch.
Interactive compute session
Interactive compute sessions are used to run various interactive applications and development tools, such as Jupyter Notebooks and web-based terminals. Users are expected to control their lifecycles (e.g., terminating them), while Backend.AI offers configuration knobs for administrators to set idle timeouts with various criteria.
There are two major ways to interact with an interactive compute session: service ports and the code execution API.
Service ports
TODO: port mapping diagram
Code execution
TODO: execution API state diagram
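Although the execution API state diagram is omitted here, the open-source client CLI exercises this API under the hood; a minimal sketch, assuming a configured endpoint and keypair:

$ # create (or reuse) an interactive session and run a code snippet in it
$ backend.ai run python -c 'print("hello, Backend.AI")'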
Batch compute session
Batch compute sessions are used to host a “run-to-completion” script with a finite execution time. There are two result states, SUCCESS and FAILED, determined by whether the main program’s exit code is zero or not.
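For example, the main program of a batch compute session could be a script like this sketch (train.py is a hypothetical user program); the session result is derived solely from its exit code:

#!/bin/bash
# main program of a batch compute session
python train.py --epochs 10   # hypothetical user workload
# exit code 0        -> session result: SUCCESS
# non-zero exit code -> session result: FAILED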
Dependencies between compute sessions
Pipelining
Inference session
Service endpoint and routing
Auto-scaling
System session
SFTP access
Scheduling
Backend.AI keeps track of sessions using a state machine that represents their various lifecycle stages.
TODO: session/kernel state diagram
TODO: two-level scheduler architecture diagram
See also
Session selection strategy
Heuristic FIFO
The default session selection strategy is the heuristic FIFO. It mostly works like a FIFO queue to select the oldest pending session, but offers an option to enable a head-of-line (HoL) blocking avoidance logic.
The HoL blocking problem happens when the oldest pending session requires too many resources to be scheduled while subsequent pending sessions would fit within the available cluster resources. Those subsequent pending sessions never get a chance to start until the oldest pending session (the “blocker”) is either cancelled or more running sessions terminate and release cluster resources.
When enabled, the HoL blocking avoidance logic keeps track of the retry count of scheduling attempts of each pending session and pushes back the pending sessions whose retry counts exceed a certain threshold. This option should be explicitly enabled by the administrators or during installation.
Dominant resource fairness (DRF)
Agent selection strategy
Concentrated
Dispersed
Custom
Resource Management
Resource slots
Backend.AI abstracts each different type of computing resource as a “resource slot”. Resource slots are distinguished by their names, which consist of two parts: the device name and the slot name.
| Resource slot name | Device name | Slot name |
|---|---|---|
| cpu | cpu | (implicitly defined as the device name) |
| mem | mem | (implicitly defined as the device name) |
| cuda.device | cuda | device |
| cuda.shares | cuda | shares |
| rocm.device | rocm | device |
Each resource slot has a slot type as follows:
| Slot type | Meaning | Examples |
|---|---|---|
| count | The value of the resource slot is an integer or decimal representing how many of the device(s) are available/allocated. It may also represent fractions of devices. | cpu, cuda.device, cuda.shares |
| bytes | The value of the resource slot is an integer representing how many bytes of the resource are available/allocated. | mem |
| unique | Only “each one” of the device can be allocated to each different kernel, exclusively. | (devices that cannot be shared or split) |
Compute plugins
Backend.AI administrators may install one or more compute plugins on each agent. Without any plugin, only the intrinsic cpu and mem resource slots are available.
Each compute plugin may declare one or more resource slots. The plugin is invoked upon startup of the agent to get the list of devices and the resource slots to report. Administrators can inspect the per-agent accelerator details provided by the compute plugins in the control panel.
The most well-known compute plugin is cuda_open, which is included in the open-source version. It declares the cuda.device resource slot, which represents each NVIDIA GPU as one unit.
There is a special compute plugin to simulate non-existent devices: mock. Developers may put a local configuration to declare an arbitrary set of devices and resource slots to test the schedulers and the frontend. It is useful for developing integrations with new hardware devices before you get the actual devices in your hands.
Resource groups
A resource group is a logical group of Agents with independent schedulers. Each agent belongs to a single resource group only. It self-reports which resource group to join when sending heartbeat messages, but the specified resource group must already exist.
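For example, an agent declares the resource group it joins in its agent.toml local configuration (the key is named scaling-group, an older term for resource groups; see the installation guide below):

[agent]
scaling-group = "default"   # the resource group this agent joins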
See also
User Management
Users
Backend.AI user accounts have two types of authentication modes: session and keypair. The session mode uses a normal username and password with browser-based sessions (e.g., when using the Web UI), while the keypair mode uses a pair of access and secret keys for programmatic access.
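For the keypair mode, the client SDK typically reads the endpoint and keypair from environment variables; a sketch (the endpoint URL and key values are placeholders):

$ export BACKEND_ENDPOINT="https://manager.example.com:8081"
$ export BACKEND_ACCESS_KEY="AKIA..."   # placeholder
$ export BACKEND_SECRET_KEY="..."       # placeholder
$ backend.ai ps   # list my sessions using the keypair credentials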
Projects
There may be multiple projects created by administrators and users may belong to one or more projects. Administrators may configure project-level resource policies such as storage quota shared by all project vfolders and project-level artifacts.
When a user creates a new session and belongs to multiple projects, they must choose which project to use, so that the corresponding resource policies apply.
Cluster Networking
Single-node cluster session
If a session is created with multiple containers with the single-node option, all containers are created on a single agent. The containers share a private bridge network in addition to the default network, so that they can interact with each other privately. There are no firewall restrictions within this private bridge network.
Multi-node cluster session
For even larger-scale computation, you may create a multi-node cluster session that spans multiple agents. In this case, the manager auto-configures a private overlay network so that the containers can interact with each other. There are no firewall restrictions within this private overlay network.
Detection of clustered setups
There is a concept called cluster role.
The current version of Backend.AI creates homogeneous cluster sessions by replicating the same resource configuration and the same container image, but we have plans to add heterogeneous cluster sessions that have different resource and image configurations for each cluster role. For instance, a Hadoop cluster may have two types of containers, name nodes and data nodes, which could be mapped to the main and sub cluster roles.
All interactive apps are executed only in the main1 container, which is always present in both cluster and non-cluster sessions. It is the user application’s responsibility to connect with and utilize the other containers in a cluster session. To ease the process, Backend.AI injects the following environment variables into the containers and sets up randomly generated SSH keypairs between the containers, so that each container can ssh into the others without additional prompts:
| Environment Variable | Meaning | Examples |
|---|---|---|
| BACKENDAI_CLUSTER_SIZE | The number of containers in this cluster session. | 4 |
| BACKENDAI_CLUSTER_HOSTS | A comma-separated list of container hostnames in this cluster session. | main1,sub1,sub2,sub3 |
| BACKENDAI_CLUSTER_REPLICAS | A comma-separated list of key:value pairs of cluster roles and the replica counts for each role. | main:1,sub:3 |
| BACKENDAI_CLUSTER_HOST | The container hostname of the current container. | sub2 |
| BACKENDAI_CLUSTER_IDX | The one-based index of the current container among the containers sharing the same cluster role. | 2 |
| BACKENDAI_CLUSTER_ROLE | The name of the current container’s cluster role. | sub |
| BACKENDAI_CLUSTER_LOCAL_RANK | The zero-based global index of the current container within the entire cluster session. | 2 |
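For example, a startup script inside a kernel may branch on these variables to bootstrap a distributed job (a sketch based on the table above):

$ echo "$BACKENDAI_CLUSTER_HOST ($BACKENDAI_CLUSTER_ROLE)"   # e.g., "sub2 (sub)"
$ # iterate over all peer kernels, e.g., from the "main1" kernel
$ for host in ${BACKENDAI_CLUSTER_HOSTS//,/ }; do ssh "$host" hostname; done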
Storage Management
Virtual folders
Backend.AI abstracts network storage as a set of “virtual folders” (aka “vfolders”), which provide persistent file storage to users and projects.
When creating a new session, users may connect vfolders to it with read-only or read-write permissions.
If a shared vfolder is limited to read-only permission, the user may connect it with the read-only permission only.
Virtual folders are mounted into compute session containers at /home/work/{name} so that user programs can access the virtual folder contents like a local directory. The mounted path inside containers may be customized (e.g., /workspace) for compatibility with existing scripts and code.
Currently it is not possible to unmount or delete a vfolder while any running session is connected to it.
For cluster sessions with multiple kernels (containers), the connected vfolders are mounted to all kernels at the same location with the same permission.
For a multi-node setup, the storage volume mounts must be synchronized across all Agent nodes and the Storage Proxy node(s) using the same mount path (e.g., /mnt or /vfroot). For a single-node setup, you may simply use an empty local directory, like our install-dev.sh script does.
From the perspective of the storage, all vfolders from different Backend.AI users and projects share a single UID and GID. This allows flexible permission sharing between users and projects, while keeping the Linux ownership of the files and directories consistent when they are accessed by multiple different Backend.AI users.
User-owned vfolders
Users may create one or more of their own virtual folders to store data files, libraries, and program code. Superadmins may limit the maximum number of vfolders owned by a user.
Project-owned vfolders
The project admins and superadmins may create a vfolder that is automatically shared to all members of the project, with a specific read-only or read-write permission.
Note
If allowed, users and projects may create and access vfolders in multiple different storage volumes, but the vfolder names must be unique in all storage volumes, for each user and project.
VFolder invitations and permissions
Users and project administrators may invite other users to collaborate on a vfolder. Once the invitee accepts the request, they get the designated read-only or read-write permission on the shared vfolder.
Volume-level permissions
The superadmin may set additional action privileges to each storage volume, such as whether to allow or block mounting the vfolders in compute sessions, cloning the vfolders, etc.
Auto-mount vfolders
If a user-owned vfolder’s name starts with a dot, it is automatically mounted under /home/work for all sessions created by the user. A good use case is the .config and .local directories, which keep your local configurations and user-installed packages (e.g., pip install --user) persistent across all your sessions.
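For example, you may create a dot-prefixed vfolder once with the client CLI, and user-level package installations then persist across your sessions (a sketch; the storage volume name “local” is an assumption):

$ backend.ai vfolder create .local local
$ # in any later session, packages land under /home/work/.local and persist
$ pip install --user requests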
Quota scopes
Added in version 23.03.
Quota scopes implement per-user and per-project storage usage limits. Currently, only hard limits specified in bytes are supported. There are two main schemes to set up this feature.
Storage with per-directory quota
Quota scopes and vfolders with storage solutions supporting per-directory quota
For each storage volume, each user and project has their own dedicated quota scope directories as shown in Fig. 2. The storage solution must support per-directory quota, at least for a single-level (like NetApp’s QTree). We recommend this configuration for filesystems like CephFS, Weka.io, or custom-built storage servers using ZFS or XFS where Backend.AI Storage Proxy can be installed directly onto the storage servers.
Storage with per-volume quota
Quota scopes and vfolders with storage solutions supporting per-volume quota
Unfortunately, there are many cases where we cannot rely on per-directory quota support in storage solutions, due to limitations of the underlying filesystem implementation or the lack of direct access to the storage vendor APIs.
For this case, we may assign dedicated storage volumes to each user and project like Fig. 3, which naturally limits the space usage by the volume size. Another option is not to configure quota limits, but we don’t recommend this option in production setups.
The shortcoming is that we may need to frequently mount/unmount the network volumes when we create or remove users and projects, which may cause unexpected system failures due to stale file descriptors.
Note
For shared vfolders, the quota usage is accounted to the original owner of the vfolder, either a user or a project.
Warning
For both schemes, the administrator should take care of the storage solution’s system limits, such as the maximum number of volumes and quota sets, because such limits may impose a hidden cap on the maximum number of users and projects in Backend.AI.
Configuration
Local config
Each service component has a TOML-based local configuration. It defines node-specific settings such as the agent name, the resource group where it belongs, specific system limits, and the IP address and TCP port(s) to bind their service traffic to.
The configuration files are named after the service components, like manager.toml, agent.toml, and storage-proxy.toml. The search paths are: the current working directory, ~/.config/backend.ai, and /etc/backend.ai.
See also
The sample configurations in our source repository.
Inside each component directory, sample.toml contains the full configuration schema and descriptions.
Monitoring
Dashboard (Enterprise only)
Backend.AI Dashboard is an add-on service that displays various real-time and historical performance metrics, including the number of sessions, cluster power usage, and GPU utilization.
Alerts (Enterprise only)
Administrators may configure automatic alerts based on thresholds on the monitored metrics, delivered via external messaging services like email and SMS.
FAQ
vs. Notebooks
| Product | Role | Value |
|---|---|---|
| Apache Zeppelin, Jupyter Notebook | Notebook-style document + code frontends | Familiarity from data scientists and researchers, but hard to avoid insecure host resource sharing |
| Backend.AI | Pluggable backend to any frontends | Built for multi-tenancy: scalable and better isolation |
vs. Orchestration Frameworks
| Product | Target | Value |
|---|---|---|
| Amazon ECS, Kubernetes | Long-running interactive services | Load balancing, fault tolerance, incremental deployment |
| Amazon Lambda, Azure Functions | Stateless, light-weight, short-lived functions | Serverless, zero-management |
| Backend.AI | Stateful batch computations mixed with interactive applications | Low-cost high-density computation, maximization of hardware potential |
vs. Big-data and AI Frameworks
| Product | Role | Value |
|---|---|---|
| TensorFlow, Apache Spark, Apache Hive | Computation runtime | Difficult to install, configure, and operate at scale |
| Amazon ML, Azure ML, GCP ML | Managed MLaaS | Highly scalable but dependent on each platform, still requires system engineering backgrounds |
| Backend.AI | Host of computation runtimes | Pre-configured, versioned, reproducible, customizable (open-source) |
(All product names and trademarks are the properties of their respective owners.)
Installation Guides
Install from Source
Note
For production deployments, we recommend creating separate virtualenvs for individual services and installing the pre-built wheel distributions, following Install from Packages.
Setting Up Manager and Agent (single node, all-in-one)
Check out Development Setup.
Setting Up Additional Agents (multi-node)
Updating manager configuration for multi-nodes
Since scripts/install-dev.sh assumes a single-node all-in-one setup, it configures the etcd and Redis addresses to be 127.0.0.1.
You need to update the etcd configuration of the Redis address so that additional agent nodes can connect to the Redis server using the address advertised via etcd:
$ ./backend.ai mgr etcd get config/redis/addr
127.0.0.1:xxxx
$ ./backend.ai mgr etcd put config/redis/addr MANAGER_IP:xxxx # use the port number read above
where MANAGER_IP is an IP address of the manager node accessible from the other agent nodes.
Installing additional agents in different nodes
First, you need to initialize a working copy of the core repository for each additional agent node.
As our scripts/install-dev.sh does not yet provide an “agent-only” installation mode, you need to manually perform the same repository cloning along with the pyenv, Python, and Pants setup procedures as the script does.
Note
Since we use the mono-repo for the core packages, there is no way to separately clone the agent sources only. Just clone the entire repository and configure/execute the agent only. Ensure that you also pull the LFS files and submodules when you manually clone it.
Once your pants is up and working, run pants export to populate virtualenvs and install dependencies. Then start configuring agent.toml by copying it from configs/agent/halfstack.toml and replacing the following fields:

- [etcd].addr.host: replace with MANAGER_IP
- [agent].rpc-listen-addr.host: replace with AGENT_IP
- [container].bind-host: replace with AGENT_IP
- [watcher].service-addr.host: replace with AGENT_IP

where AGENT_IP is an IP address of this agent node accessible from the manager, and MANAGER_IP is an IP address of the manager node accessible from this agent node.
Now execute ./backend.ai ag start-server to connect this agent node to an existing manager.
We assume that the agent and manager nodes reside in the same local network, where all TCP ports are open to each other. If this is not the case, you should configure firewalls to open all the port numbers appearing in agent.toml.
There are more complicated setup scenarios such as splitting network planes for control and container-to-container communications, but we provide assistance with them for enterprise customers only.
Setting Up Accelerators
Ensure that your accelerator is properly set up using vendor-specific installation methods.
Clone the accelerator plugin package into the plugins directory if necessary, or just use one of the existing ones in the mono-repo. You also need to configure agent.toml’s [agent].allow-compute-plugins with the full package path (e.g., ai.backend.accelerator.cuda_open) to activate them.
Configuring Overlay Networks for Multi-node Training (Optional)
Note
All other features of Backend.AI except multi-node training work without this configuration. The Docker Swarm mode is used to configure overlay networks to ensure privacy between cluster sessions, while the container monitoring and configuration is done by Backend.AI itself.
Currently the cross-node inter-container overlay routing is controlled via Docker Swarm’s overlay networks. On the manager node, you need to create a Swarm. On the agent nodes, you need to join the Swarm. Then restart all manager and agent daemons to make it work.
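A minimal sketch using standard Docker commands (replace MANAGER_IP with an address reachable from the agent nodes, and use the worker join token printed by the second command):

$ # on the manager node
$ docker swarm init --advertise-addr MANAGER_IP
$ docker swarm join-token worker
$ # on each agent node
$ docker swarm join --token <worker-join-token> MANAGER_IP:2377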
Install from Packages
This guide covers how to install Backend.AI from the official release packages. You can build a fully-functional Backend.AI cluster with open-source packages.
Backend.AI consists of a variety of components, including open-source core components, pluggable extensions, and enterprise modules. Some of the major components are:
Backend.AI Manager : API gateway and resource management. Manager delegates workload requests to Agent and storage/file requests to Storage Proxy.
Backend.AI Agent : Installed on compute nodes (usually GPU nodes) to start and manage workload execution. It sends periodic heartbeat signals to the Manager in order to register itself as a worker node. Even if the connection to the Manager is temporarily lost, the pre-initiated workloads continue to be executed.
Backend.AI Storage Proxy : Handles requests relating to storage and files. It offloads the Manager’s burden of handling long-running file I/O operations. It embeds a plugin backend structure that provides dedicated features for each storage type.
Backend.AI Webserver : A web server that provides persistent user web sessions. Users can use the Backend.AI features without subsequent authentication upon initial login. It also serves the statically built graphical user interface in an Enterprise environment.
Backend.AI Web UI : Web application with a graphical user interface. Users can enjoy the easy-to-use interface to launch their secure execution environment and use apps like Jupyter and Terminal. It can be served as statically built JavaScript via Webserver. It also offers desktop applications for many operating systems and architectures.
Most components can be installed on a single management node, except Agent, which is usually installed on dedicated computing nodes (often GPU servers). However, this is not a rule, and Agent can also be installed on the management node.
It is also possible to configure a high-availability (HA) setup with three or more management nodes, although this is not the focus of this guide.
Setup OS Environment
Backend.AI and its associated components share common requirements and configurations for proper operation. This section explains how to configure the OS environment.
Note
This section assumes the installation on Ubuntu 20.04 LTS.
Create a user account for operation
We will create a user account bai to install and operate Backend.AI services. Set the UID and GID to 1100 to prevent conflicts with other users or groups. sudo privilege is required, so add bai to the sudo group.
$ username="bai"
$ password="secure-password"
$ sudo adduser --disabled-password --uid 1100 --gecos "" $username
$ echo "$username:$password" | sudo chpasswd
$ sudo usermod -aG sudo bai
If you do not want to expose your password in the shell history, remove the --disabled-password option and interactively enter your password.
Login as the bai user and continue the installation.
Install Docker engine
Backend.AI requires Docker Engine to create compute sessions with the Docker container backend. Also, some service components are deployed as containers, so installing Docker Engine is required. Ensure docker-compose-plugin is installed as well to use the docker compose command.
After the installation, add the bai user to the docker group so that you do not have to prefix every Docker command with sudo.
$ sudo usermod -aG docker bai
Logout and login again to apply the group membership change.
Optimize sysctl/ulimit parameters
This is not an essential but a recommended step to optimize the performance and stability of operating Backend.AI. Refer to the guide in the Manager repository for the details of the kernel parameters and the ulimit settings. Depending on the Backend.AI services you install, the optimal values may vary. Each service installation section provides the recommended values where needed.
Note
Modern systems may have already set the optimal parameters. In that case, you can skip this step.
To cleanly separate the configurations, you may follow the steps below.
Save the resource limit parameters in /etc/security/limits.d/99-backendai.conf:

root hard nofile 512000
root soft nofile 512000
root hard nproc 65536
root soft nproc 65536
bai hard nofile 512000
bai soft nofile 512000
bai hard nproc 65536
bai soft nproc 65536
Logout and login again to apply the resource limit changes.
Save the kernel parameters in /etc/sysctl.d/99-backendai.conf:

fs.file-max=2048000
net.core.somaxconn=1024
net.ipv4.tcp_max_syn_backlog=1024
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_window_scaling=1
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_early_retrans=1
net.ipv4.ip_local_port_range="10000 65000"
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem=4096 12582912 16777216
net.ipv4.tcp_wmem=4096 12582912 16777216
vm.overcommit_memory=1
Apply the kernel parameters with sudo sysctl -p /etc/sysctl.d/99-backendai.conf.
Prepare required Python versions and virtual environments
Prepare a Python distribution whose version meets the requirements of the target package. Backend.AI 22.09, for example, requires Python 3.10. The latest information on Python version compatibility can be found here.
There can be several ways to prepare a specific Python version. Here, we will be using a standalone static built Python.
Use a standalone static built Python (Recommended)
Obtain a standalone static-built Python distribution matching the required Python version and target machine architecture. Then extract the distribution to a directory of your choice.
$ curl -L "https://github.com/indygreg/python-build-standalone/releases/download/${PYTHON_RELEASE_DATE}/cpython-${PYTHON_VERSION}+${PYTHON_RELEASE_DATE}-${TARGET_MACHINE_ARCHITECTURE}-${ARCHIVE_FLAVOR}.tar.gz" > cpython-${PYTHON_VERSION}+${PYTHON_RELEASE_DATE}-${TARGET_MACHINE_ARCHITECTURE}-${ARCHIVE_FLAVOR}.tar.gz
$ tar -xf "cpython-${PYTHON_VERSION}+${PYTHON_RELEASE_DATE}-${TARGET_MACHINE_ARCHITECTURE}-${ARCHIVE_FLAVOR}.tar.gz"
$ mkdir -p "/home/${USERNAME}/.static-python/versions"
$ mv python "/home/${USERNAME}/.static-python/versions/${PYTHON_VERSION}"
For example,
$ curl -L "https://github.com/indygreg/python-build-standalone/releases/download/20231002/cpython-3.12.2+20240224-x86_64-unknown-linux-gnu-install_only.tar.gz" > cpython-3.12.2+20240224-x86_64-unknown-linux-gnu-install_only.tar.gz
$ tar -xf "cpython-3.12.2+20240224-x86_64-unknown-linux-gnu-install_only.tar.gz"
$ mkdir -p "/home/bai/.static-python/versions"
$ mv python "/home/bai/.static-python/versions/3.12.2"
Then, you can create multiple virtual environments per service. To create a virtual environment for Backend.AI Manager and activate it, for example, you may run:
$ mkdir "${HOME}/manager"
$ cd "${HOME}/manager"
$ ~/.static-python/versions/3.12.2/bin/python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -U pip setuptools wheel
You also need to make pip available to the Python installation with the latest wheel and setuptools packages, so that any non-binary extension packages can be compiled and installed on your system.
(Alternative) Use pyenv to manually build and select a specific Python version
If you prefer, there is no problem using pyenv and pyenv-virtualenv.
Install pyenv and pyenv-virtualenv. Then, install the Python version that is needed:
$ pyenv install "${YOUR_PYTHON_VERSION}"
Note
You may need to install the suggested build environment to build Python with pyenv.
Then, you can create multiple virtual environments per service. To create a virtual environment for Backend.AI Manager 22.09.x and automatically activate it, for example, you may run:
$ mkdir "${HOME}/manager"
$ cd "${HOME}/manager"
$ pyenv virtualenv "${YOUR_PYTHON_VERSION}" bai-22.09-manager
$ pyenv local bai-22.09-manager
$ pip install -U pip setuptools wheel
You also need to make pip available to the Python installation with the latest wheel and setuptools packages, so that any non-binary extension packages can be compiled and installed on your system.
Configure network aliases
Although not required, using network aliases instead of IP addresses can make setup and operation easier. Edit the /etc/hosts file on each node and append contents like the example below to access each server with a network alias.
##### BEGIN for Backend.AI services #####
10.20.30.10 bai-m1 # management node 01
10.20.30.20 bai-a01 # agent node 01 (GPU 01)
10.20.30.22 bai-a02 # agent node 02 (GPU 02)
##### END for Backend.AI services #####
Note that the IP addresses should be accessible from other nodes, if you are installing on multiple servers.
Setup accelerators
If there are accelerators (e.g., GPU) on the server, you have to install the vendor-specific drivers and libraries to make sure the accelerators are properly set up and working. Please refer to the vendor documentation for the details.
To integrate NVIDIA GPUs,
Install the NVIDIA driver and CUDA toolkit.
Install the NVIDIA container toolkit (nvidia-docker2).
Pull container images
For compute nodes, you need to pull some container images that are required for creating a compute session. Lablup provides a set of open container images and you may pull the following starter images:
$ docker pull cr.backend.ai/stable/filebrowser:21.02-ubuntu20.04
$ docker pull cr.backend.ai/stable/python:3.9-ubuntu20.04
$ docker pull cr.backend.ai/stable/python-pytorch:1.11-py38-cuda11.3
$ docker pull cr.backend.ai/stable/python-tensorflow:2.7-py38-cuda11.3
Prepare Database
Backend.AI makes use of PostgreSQL as its main database. Launch the service using docker compose by generating the file $HOME/halfstack/postgres-cluster-default/docker-compose.yaml and populating it with the following YAML. Feel free to adjust the volume paths and port settings. Please refer to the latest configuration (it’s a symbolic link, so follow the filename in it) if needed.
version: "3"
x-base: &base
logging:
driver: "json-file"
options:
max-file: "5"
max-size: "10m"
services:
backendai-pg-active:
<<: *base
image: postgres:15.1-alpine
restart: unless-stopped
command: >
postgres
-c "max_connections=256"
-c "max_worker_processes=4"
-c "deadlock_timeout=10s"
-c "lock_timeout=60000"
-c "idle_in_transaction_session_timeout=60000"
environment:
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=develove
- POSTGRES_DB=backend
- POSTGRES_INITDB_ARGS="--data-checksums"
healthcheck:
test: ["CMD", "pg_isready", "-U", "postgres"]
interval: 10s
timeout: 3s
retries: 10
volumes:
- "${HOME}/.data/backend.ai/postgres-data/active:/var/lib/postgresql/data:rw"
ports:
- "8100:5432"
networks:
half_stack:
cpu_count: 4
mem_limit: "4g"
networks:
half_stack:
Execute the following commands to start the service container. The project ${USER} is added for operational convenience.
$ cd ${HOME}/halfstack/postgres-cluster-default
$ docker compose up -d
$ # -- To terminate the container:
$ # docker compose down
$ # -- To see the container logs:
$ # docker compose logs -f
Prepare Cache Service
Backend.AI makes use of Redis as its main cache service. Launch the service using docker compose by generating the file $HOME/halfstack/redis-cluster-default/docker-compose.yaml and populating it with the following YAML. Feel free to adjust the volume paths and port settings. Please refer to the latest configuration (it’s a symbolic link, so follow the filename in it) if needed.
version: "3"
x-base: &base
logging:
driver: "json-file"
options:
max-file: "5"
max-size: "10m"
services:
backendai-halfstack-redis:
<<: *base
image: redis:6.2-alpine
restart: unless-stopped
command: >
redis-server
--requirepass develove
--appendonly yes
volumes:
- "${HOME}/.data/backend.ai/redis-data:/data:rw"
healthcheck:
test: ["CMD", "redis-cli", "--raw", "incr", "ping"]
interval: 10s
timeout: 3s
retries: 10
ports:
- "8110:6379"
networks:
- half_stack
cpu_count: 1
mem_limit: "2g"
networks:
half_stack:
Execute the following commands to start the service container. The project ${USER} is added for operational convenience.
$ cd ${HOME}/halfstack/redis-cluster-default
$ docker compose up -d
$ # -- To terminate the container:
$ # docker compose down
$ # -- To see the container logs:
$ # docker compose logs -f
Prepare Config Service
Backend.AI makes use of Etcd as its main config service. Launch the service using docker compose by generating the file $HOME/halfstack/etcd-cluster-default/docker-compose.yaml and populating it with the following YAML. Feel free to adjust the volume paths and port settings. Please refer to the latest configuration (it’s a symbolic link, so follow the filename in it) if needed.
version: "3"
x-base: &base
logging:
driver: "json-file"
options:
max-file: "5"
max-size: "10m"
services:
backendai-halfstack-etcd:
<<: *base
image: quay.io/coreos/etcd:v3.4.15
restart: unless-stopped
command: >
/usr/local/bin/etcd
--name etcd-node01
--data-dir /etcd-data
--listen-client-urls http://0.0.0.0:2379
--advertise-client-urls http://0.0.0.0:8120
--listen-peer-urls http://0.0.0.0:2380
--initial-advertise-peer-urls http://0.0.0.0:8320
--initial-cluster etcd-node01=http://0.0.0.0:8320
--initial-cluster-token backendai-etcd-token
--initial-cluster-state new
--auto-compaction-retention 1
volumes:
- "${HOME}/.data/backend.ai/etcd-data:/etcd-data:rw"
healthcheck:
test: ["CMD", "etcdctl", "endpoint", "health"]
interval: 10s
timeout: 3s
retries: 10
ports:
- "8120:2379"
# - "8320:2380" # listen peer (only if required)
networks:
- half_stack
cpu_count: 1
mem_limit: "1g"
networks:
half_stack:
Execute the following commands to start the service container. The project ${USER} is added for operational convenience.
$ cd ${HOME}/halfstack/etcd-cluster-default
$ docker compose up -d
$ # -- To terminate the container:
$ # docker compose down
$ # -- To see the container logs:
$ # docker compose logs -f
Install Backend.AI Manager
Refer to Prepare required Python versions and virtual environments to set up Python and a virtual environment for the service.
Install the latest version of Backend.AI Manager for the current Python version:
$ cd "${HOME}/manager"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-manager
If you want to install a specific version:
$ pip install -U backend.ai-manager==${BACKEND_PKG_VERSION}
Local configuration
Backend.AI Manager uses a TOML file (manager.toml) to configure the local service. Refer to the manager.toml sample file for a detailed description of each section and item. A configuration example would be:
[etcd]
namespace = "local"
addr = { host = "bai-m1", port = 8120 }
user = ""
password = ""
[db]
type = "postgresql"
addr = { host = "bai-m1", port = 8100 }
name = "backend"
user = "postgres"
password = "develove"
[manager]
num-proc = 2
service-addr = { host = "0.0.0.0", port = 8081 }
# user = "bai"
# group = "bai"
ssl-enabled = false
heartbeat-timeout = 30.0
pid-file = "/home/bai/manager/manager.pid"
disabled-plugins = []
hide-agents = true
# event-loop = "asyncio"
# importer-image = "lablup/importer:manylinux2010"
distributed-lock = "filelock"
[docker-registry]
ssl-verify = false
[logging]
level = "INFO"
drivers = ["console", "file"]
[logging.pkg-ns]
"" = "WARNING"
"aiotools" = "INFO"
"aiopg" = "WARNING"
"aiohttp" = "INFO"
"ai.backend" = "INFO"
"alembic" = "INFO"
[logging.console]
colored = true
format = "verbose"
[logging.file]
path = "./logs"
filename = "manager.log"
backup-count = 10
rotation-size = "10M"
[debug]
enabled = false
enhanced-aiomonitor-task-info = true
Save the contents to ${HOME}/.config/backend.ai/manager.toml. Backend.AI will automatically recognize the location. Adjust each field to conform to your system.
Global configuration
Etcd (cluster) stores globally shared configurations for all nodes. Some of them should be populated prior to starting the service.
Note
It might be a good idea to create a backup of the current Etcd configuration before modifying the values. You can do so by simply executing:
$ backend.ai mgr etcd get --prefix "" > ./etcd-config-backup.json
To restore the backup:
$ backend.ai mgr etcd delete --prefix ""
$ backend.ai mgr etcd put-json "" ./etcd-config-backup.json
The commands below should be executed in the ${HOME}/manager directory.
To list a specific key from Etcd, for example, the config key:
$ backend.ai mgr etcd get --prefix config
Now, configure Redis access information. This should be accessible from all nodes.
$ backend.ai mgr etcd put config/redis/addr "bai-m1:8110"
$ backend.ai mgr etcd put config/redis/password "develove"
Set the container registry. The following is Lablup’s open registry (cr.backend.ai). You can set your own registry with a username and password if needed. This can be configured via the GUI as well.
$ backend.ai mgr etcd put config/docker/image/auto_pull "tag"
$ backend.ai mgr etcd put config/docker/registry/cr.backend.ai "https://cr.backend.ai"
$ backend.ai mgr etcd put config/docker/registry/cr.backend.ai/type "harbor2"
$ backend.ai mgr etcd put config/docker/registry/cr.backend.ai/project "stable"
$ # backend.ai mgr etcd put config/docker/registry/cr.backend.ai/username "bai"
$ # backend.ai mgr etcd put config/docker/registry/cr.backend.ai/password "secure-password"
Also, populate the Storage Proxy configuration to the Etcd:
$ # Allow project (group) folders.
$ backend.ai mgr etcd put volumes/_types/group ""
$ # Allow user folders.
$ backend.ai mgr etcd put volumes/_types/user ""
$ # Default volume host. The name of the volume proxy here is "bai-m1" and volume name is "local".
$ backend.ai mgr etcd put volumes/default_host "bai-m1:local"
$ # Set the "bai-m1" proxy information.
$ # User (browser) facing API endpoint of Storage Proxy.
$ # Cannot use host alias here. It should be user-accessible URL.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/client_api "http://10.20.30.10:6021"
$ # Manager facing internal API endpoint of Storage Proxy.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/manager_api "http://bai-m1:6022"
$ # Random secret string which is used by Manager to communicate with Storage Proxy.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/secret "secure-token-to-authenticate-manager-request"
$ # Option to disable SSL verification for the Storage Proxy.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/ssl_verify "false"
Check if the configuration is properly populated:
$ backend.ai mgr etcd get --prefix volumes
Note that you have to change the secret to a unique random string for secure communication between the Manager and Storage Proxy. The most recent set of parameters can be found in sample.etcd.volumes.json.
To enable access to the volumes defined by the Storage Proxy from every user, you need to update the allowed_vfolder_hosts column of the domains table to hold the storage volume reference (e.g., bai-m1:local). You can do this by issuing a SQL statement directly inside the PostgreSQL container:
$ vfolder_host_val='{"bai-m1:local": ["create-vfolder", "modify-vfolder", "delete-vfolder", "mount-in-session", "upload-file", "download-file", "invite-others", "set-user-specific-permission"]}'
$ docker exec -it bai-backendai-pg-active-1 psql -U postgres -d backend \
-c "UPDATE domains SET allowed_vfolder_hosts = '${vfolder_host_val}' WHERE name = 'default';"
Populate the database with initial fixtures
You need to prepare an alembic.ini file under ${HOME}/manager to manage the database schema. Copy the sample halfstack.alembic.ini and save it as ${HOME}/manager/alembic.ini. Adjust the sqlalchemy.url field if the database connection information differs from the default. You may need to change localhost to bai-m1.
Populate the database schema and initial fixtures. Copy the example JSON files (example-keypairs.json and example-resource-presets.json) as keypairs.json and resource-presets.json, and save them under ${HOME}/manager/. Customize them to have unique keypairs and passwords for your initial superadmin and sample user accounts for security.
$ backend.ai mgr schema oneshot
$ backend.ai mgr fixture populate ./keypairs.json
$ backend.ai mgr fixture populate ./resource-presets.json
Sync the information of container registry
You need to scan the image catalog and metadata from the container registry to the Manager. This is required to display the list of compute environments in the user web GUI (Web UI). You can run the following command to sync the information with Lablup’s public container registry:
$ backend.ai mgr image rescan cr.backend.ai
Run Backend.AI Manager service
You can run the service:
$ cd "${HOME}/manager"
$ python -m ai.backend.manager.server
Check if the service is running. The default Manager API port is 8081, but it can be configured in manager.toml:
$ curl bai-m1:8081
{"version": "v6.20220615", "manager": "22.09.6"}
Press Ctrl-C to stop the service.
Register systemd service
The service can be registered as a systemd daemon. It is recommended to automatically run the service after rebooting the host machine, although this is entirely optional.
First, create a runner script at ${HOME}/bin/run-manager.sh:
#! /bin/bash
set -e
if [ -z "$HOME" ]; then
export HOME="/home/bai"
fi
# -- If you have installed using static python --
source .venv/bin/activate
# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"
if [ "$#" -eq 0 ]; then
exec python -m ai.backend.manager.server
else
exec "$@"
fi
Make the script executable:
$ chmod +x "${HOME}/bin/run-manager.sh"
Then, create a systemd service file at /etc/systemd/system/backendai-manager.service:
[Unit]
Description= Backend.AI Manager
Requires=network.target
After=network.target remote-fs.target
[Service]
Type=simple
ExecStart=/home/bai/bin/run-manager.sh
PIDFile=/home/bai/manager/manager.pid
User=1100
Group=1100
WorkingDirectory=/home/bai/manager
TimeoutStopSec=5
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072
[Install]
WantedBy=multi-user.target
Finally, enable and start the service:
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-manager
$ # To check the service status
$ sudo systemctl status backendai-manager
$ # To restart the service
$ sudo systemctl restart backendai-manager
$ # To stop the service
$ sudo systemctl stop backendai-manager
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-manager -f
Install Backend.AI Agent
If there are dedicated compute nodes (often, GPU nodes) in your cluster, Backend.AI Agent service should be installed on the compute nodes, not on the management node.
Refer to Prepare required Python versions and virtual environments to set up Python and a virtual environment for the service.
Install the latest version of Backend.AI Agent for the current Python version:
$ cd "${HOME}/agent"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-agent
If you want to install a specific version:
$ pip install -U backend.ai-agent==${BACKEND_PKG_VERSION}
Setting Up Accelerators
Note
You can skip this section if your system does not have H/W accelerators.
Backend.AI supports various H/W accelerators. To integrate them with Backend.AI, you need to install the corresponding accelerator plugin package. Before installing the package, make sure that the accelerator is properly set up using vendor-specific installation methods.
The most popular accelerators today are NVIDIA GPUs. To install the open-source CUDA accelerator plugin, run:
$ pip install -U backend.ai-accelerator-cuda-open
Note
Backend.AI’s fractional GPU sharing is available only in the enterprise version and is not supported in the open-source version.
Local configuration
Backend.AI Agent uses a TOML file (agent.toml) to configure the local service. Refer to the agent.toml sample file for a detailed description of each section and item. A configuration example would be:
[etcd]
namespace = "local"
addr = { host = "bai-m1", port = 8120 }
user = ""
password = ""
[agent]
mode = "docker"
# NOTE: You cannot use network alias here. Write the actual IP address.
rpc-listen-addr = { host = "10.20.30.10", port = 6001 }
# id = "i-something-special"
scaling-group = "default"
pid-file = "/home/bai/agent/agent.pid"
event-loop = "uvloop"
# allow-compute-plugins = ["ai.backend.accelerator.cuda_open"]
[container]
port-range = [30000, 31000]
kernel-uid = 1100
kernel-gid = 1100
bind-host = "bai-m1"
advertised-host = "bai-m1"
stats-type = "docker"
sandbox-type = "docker"
jail-args = []
scratch-type = "hostdir"
scratch-root = "./scratches"
scratch-size = "1G"
[watcher]
service-addr = { host = "bai-a01", port = 6009 }
ssl-enabled = false
target-service = "backendai-agent.service"
soft-reset-available = false
[logging]
level = "INFO"
drivers = ["console", "file"]
[logging.console]
colored = true
format = "verbose"
[logging.file]
path = "./logs"
filename = "agent.log"
backup-count = 10
rotation-size = "10M"
[logging.pkg-ns]
"" = "WARNING"
"aiodocker" = "INFO"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"
[resource]
reserved-cpu = 1
reserved-mem = "1G"
reserved-disk = "8G"
[debug]
enabled = false
skip-container-deletion = false
asyncio = false
enhanced-aiomonitor-task-info = true
log-events = false
log-kernel-config = false
log-alloc-map = false
log-stats = false
log-heartbeats = false
log-docker-events = false
[debug.coredump]
enabled = false
path = "./coredumps"
backup-count = 10
size-limit = "64M"
You may need to configure [agent].allow-compute-plugins with the full package path (e.g., ai.backend.accelerator.cuda_open) to activate them.

Save the contents to ${HOME}/.config/backend.ai/agent.toml. Backend.AI will automatically recognize the location. Adjust each field to conform to your system.
Run Backend.AI Agent service
You can run the service:
$ cd "${HOME}/agent"
$ python -m ai.backend.agent.server
You should see a log message like “started handling RPC requests at ...”.
There is an add-on service, Agent Watcher, that can be used to monitor and manage the Agent service. It is not required for running the Agent service, but it is recommended for production environments.
$ cd "${HOME}/agent"
$ python -m ai.backend.agent.watcher
Press Ctrl-C to stop both services.
Register systemd service
The service can be registered as a systemd daemon. It is recommended to automatically run the service after rebooting the host machine, although this is entirely optional.
It is better to set [container].stats-type = "cgroup" in agent.toml for better metric collection, which is only available with root privileges.
First, create a runner script at ${HOME}/bin/run-agent.sh:
#! /bin/bash
set -e
if [ -z "$HOME" ]; then
export HOME="/home/bai"
fi
# -- If you have installed using static python --
source .venv/bin/activate
# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"
if [ "$#" -eq 0 ]; then
exec python -m ai.backend.agent.server
else
exec "$@"
fi
Create a runner script for Watcher at ${HOME}/bin/run-watcher.sh:
#! /bin/bash
set -e
if [ -z "$HOME" ]; then
export HOME="/home/bai"
fi
# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"
if [ "$#" -eq 0 ]; then
exec python -m ai.backend.agent.watcher
else
exec "$@"
fi
Make the scripts executable:
$ chmod +x "${HOME}/bin/run-agent.sh"
$ chmod +x "${HOME}/bin/run-watcher.sh"
Then, create a systemd service file at /etc/systemd/system/backendai-agent.service:
[Unit]
Description= Backend.AI Agent
Requires=backendai-watcher.service
After=network.target remote-fs.target backendai-watcher.service
[Service]
Type=simple
ExecStart=/home/bai/bin/run-agent.sh
PIDFile=/home/bai/agent/agent.pid
WorkingDirectory=/home/bai/agent
TimeoutStopSec=5
KillMode=process
KillSignal=SIGINT
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072
[Install]
WantedBy=multi-user.target
And for Watcher at /etc/systemd/system/backendai-watcher.service:
[Unit]
Description= Backend.AI Agent Watcher
After=network.target remote-fs.target
[Service]
Type=simple
ExecStart=/home/bai/bin/run-watcher.sh
WorkingDirectory=/home/bai/agent
TimeoutStopSec=3
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Finally, enable and start the service:
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-watcher
$ sudo systemctl enable --now backendai-agent
$ # To check the service status
$ sudo systemctl status backendai-agent
$ # To restart the service
$ sudo systemctl restart backendai-agent
$ # To stop the service
$ sudo systemctl stop backendai-agent
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-agent -f
Install Backend.AI Storage Proxy
Refer to Prepare required Python versions and virtual environments to set up Python and a virtual environment for the service.
Install the latest version of Backend.AI Storage Proxy for the current Python version:
$ cd "${HOME}/storage-proxy"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-storage-proxy
If you want to install a specific version:
$ pip install -U backend.ai-storage-proxy==${BACKEND_PKG_VERSION}
Local configuration
Backend.AI Storage Proxy uses a TOML file (storage-proxy.toml) to configure the local service. Refer to the storage-proxy.toml sample file for a detailed description of each section and item. A configuration example would be:
[etcd]
namespace = "local"
addr = { host = "bai-m1", port = 8120 }
user = ""
password = ""
[storage-proxy]
node-id = "i-bai-m1"
num-proc = 2
pid-file = "/home/bai/storage-proxy/storage_proxy.pid"
event-loop = "uvloop"
scandir-limit = 1000
max-upload-size = "100g"
# Used to generate JWT tokens for download/upload sessions
secret = "secure-token-for-users-download-upload-sessions"
# The download/upload session tokens are valid for:
session-expire = "1d"
user = 1100
group = 1100
[api.client]
# Client-facing API
service-addr = { host = "0.0.0.0", port = 6021 }
ssl-enabled = false
[api.manager]
# Manager-facing API
service-addr = { host = "0.0.0.0", port = 6022 }
ssl-enabled = false
# Used to authenticate managers
secret = "secure-token-to-authenticate-manager-request"
[debug]
enabled = false
asyncio = false
enhanced-aiomonitor-task-info = true
[logging]
# One of: "NOTSET", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"
# Set the global logging level.
level = "INFO"
# Multi-choice of: "console", "logstash", "file"
# For each choice, there must be a "logging.<driver>" section
# in this config file as exemplified below.
drivers = ["console", "file"]
[logging.pkg-ns]
"" = "WARNING"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"
[logging.console]
# If set true, use ANSI colors if the console is a terminal.
# If set false, always disable the colored output in console logs.
colored = true
# One of: "simple", "verbose"
format = "simple"
[logging.file]
path = "./logs"
filename = "storage-proxy.log"
backup-count = 10
rotation-size = "10M"
[volume]
[volume.local]
backend = "vfs"
path = "/vfroot/local"
# If there are NFS volumes
# [volume.nfs]
# backend = "vfs"
# path = "/vfroot/nfs"
Save the contents to ${HOME}/.config/backend.ai/storage-proxy.toml
. Backend.AI
will automatically recognize the location. Adjust each field to conform to your
system.
Run Backend.AI Storage Proxy service
You can run the service:
$ cd "${HOME}/storage-proxy"
$ python -m ai.backend.storage.server
Press Ctrl-C to stop the service.
Register systemd service
The service can be registered as a systemd daemon. This is entirely optional, but recommended so that the service starts automatically after the host machine reboots.
First, create a runner script at ${HOME}/bin/run-storage-proxy.sh
:
#! /bin/bash
set -e
if [ -z "$HOME" ]; then
export HOME="/home/bai"
fi
# -- If you have installed using static python --
# (keep only the activation block matching your installation method)
source .venv/bin/activate
# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"
if [ "$#" -eq 0 ]; then
exec python -m ai.backend.storage.server
else
exec "$@"
fi
Make the scripts executable:
$ chmod +x "${HOME}/bin/run-storage-proxy.sh"
Then, create a systemd service file at
/etc/systemd/system/backendai-storage-proxy.service
:
[Unit]
Description=Backend.AI Storage Proxy
Requires=network.target
After=network.target remote-fs.target
[Service]
Type=simple
ExecStart=/home/bai/bin/run-storage-proxy.sh
PIDFile=/home/bai/storage-proxy/storage-proxy.pid
WorkingDirectory=/home/bai/storage-proxy
User=1100
Group=1100
TimeoutStopSec=5
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072
[Install]
WantedBy=multi-user.target
Finally, enable and start the service:
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-storage-proxy
$ # To check the service status
$ sudo systemctl status backendai-storage-proxy
$ # To restart the service
$ sudo systemctl restart backendai-storage-proxy
$ # To stop the service
$ sudo systemctl stop backendai-storage-proxy
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-storage-proxy -f
Install Backend.AI Webserver
Refer to Prepare required Python versions and virtual environments to set up Python and a virtual environment for the service.
Install the latest version of Backend.AI Webserver for the current Python version:
$ cd "${HOME}/webserver"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-webserver
If you want to install a specific version:
$ pip install -U backend.ai-webserver==${BACKEND_PKG_VERSION}
Local configuration
Backend.AI Webserver uses a config file (webserver.conf) to configure the local service. Refer to the
webserver.conf sample file
for a detailed description of each section and item. A configuration example
would be:
[service]
ip = "0.0.0.0"
port = 8080
# Not active in open-source edition.
wsproxy.url = "http://10.20.30.10:10200"
# Set or enable it when using reverse proxy for SSL-termination
# force_endpoint_protocol = "https"
mode = "webui"
enable_signup = false
allow_signup_without_confirmation = false
always_enqueue_compute_session = false
allow_project_resource_monitor = false
allow_change_signin_mode = false
mask_user_info = false
enable_container_commit = false
hide_agents = true
directory_based_usage = false
[resources]
open_port_to_public = false
allow_non_auth_tcp = false
allow_preferred_port = false
max_cpu_cores_per_container = 255
max_memory_per_container = 1000
max_cuda_devices_per_container = 8
max_cuda_shares_per_container = 8
max_shm_per_container = 256
# Maximum per-file upload size (bytes)
max_file_upload_size = 4294967296
[environments]
# allowlist = ""
[ui]
brand = "Backend.AI"
menu_blocklist = "pipeline"
[api]
domain = "default"
endpoint = "http://bai-m1:8081"
text = "Backend.AI"
ssl-verify = false
[session]
redis.host = "bai-m1"
redis.port = 8110
redis.db = 5
redis.password = "develove"
max_age = 604800 # 1 week
flush_on_startup = false
login_block_time = 1200 # 20 min (in sec)
login_allowed_fail_count = 10
max_count_for_preopen_ports = 10
[license]
[webserver]
[logging]
# One of: "NOTSET", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"
# Set the global logging level.
level = "INFO"
# Multi-choice of: "console", "logstash", "file"
# For each choice, there must be a "logging.<driver>" section
# in this config file as exemplified below.
drivers = ["console", "file"]
[logging.console]
# If set true, use ANSI colors if the console is a terminal.
# If set false, always disable the colored output in console logs.
colored = true
# One of: "simple", "verbose"
format = "verbose"
[logging.file]
# The log file path and filename pattern.
# All messages are wrapped in single-line JSON objects.
# Rotated logs may have additional suffixes.
# For production, "/var/log/backend.ai" is recommended.
path = "./logs"
filename = "webserver.log"
# The maximum number of rotated log files to keep.
# The oldest log files are deleted when there are more backups than this number.
backup-count = 10
# The log file size to begin rotation.
rotation-size = "10M"
[logging.logstash]
# The endpoint to publish logstash records.
endpoint = { host = "localhost", port = 9300 }
# One of: "zmq.push", "zmq.pub", "tcp", "udp"
protocol = "tcp"
# SSL configs when protocol = "tcp"
ssl-enabled = true
ssl-verify = true
# Specify additional package namespaces to include in the logs
# and their individual log levels.
# Note that the actual logging level applied is the conjunction of the global logging level and the
# logging levels specified here for each namespace.
[logging.pkg-ns]
"" = "WARNING"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"
[debug]
enabled = false
[plugin]
[pipeline]
Save the contents to ${HOME}/.config/backend.ai/webserver.conf
.
Run Backend.AI Webserver service
You can run the service by specifying the config file path with -f
option:
$ cd "${HOME}/webserver"
$ python -m ai.backend.web.server -f ${HOME}/.config/backend.ai/webserver.conf
Press Ctrl-C to stop the service.
Register systemd service
The service can be registered as a systemd daemon. This is entirely optional, but recommended so that the service starts automatically after the host machine reboots.
First, create a runner script at ${HOME}/bin/run-webserver.sh
:
#! /bin/bash
set -e
if [ -z "$HOME" ]; then
export HOME="/home/bai"
fi
# -- If you have installed using static python --
# (keep only the activation block matching your installation method)
source .venv/bin/activate
# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"
if [ "$#" -eq 0 ]; then
exec python -m ai.backend.web.server -f "${HOME}/.config/backend.ai/webserver.conf"
else
exec "$@"
fi
Make the scripts executable:
$ chmod +x "${HOME}/bin/run-webserver.sh"
Then, create a systemd service file at
/etc/systemd/system/backendai-webserver.service
:
[Unit]
Description=Backend.AI Webserver
Requires=network.target
After=network.target remote-fs.target
[Service]
Type=simple
ExecStart=/home/bai/bin/run-webserver.sh
PIDFile=/home/bai/webserver/webserver.pid
WorkingDirectory=/home/bai/webserver
User=1100
Group=1100
TimeoutStopSec=5
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072
[Install]
WantedBy=multi-user.target
Finally, enable and start the service:
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-webserver
$ # To check the service status
$ sudo systemctl status backendai-webserver
$ # To restart the service
$ sudo systemctl restart backendai-webserver
$ # To stop the service
$ sudo systemctl stop backendai-webserver
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-webserver -f
Check user GUI access via web
You can check the access to the web GUI by opening the URL
http://<host-ip-or-domain>:8080
in your web browser. If all goes well, you
will see the login page.

Enter the email and password you set in the previous step to check login.

You can use almost every feature from the web GUI, but launching compute session apps such as Terminal and Jupyter Notebook is not possible from the web in the open-source edition. You can instead use the GUI desktop client to access the full set of GUI features.
You can download the GUI desktop client from the Summary page of the web GUI. Please use the “Download Backend.AI Web UI App” link at the bottom of the page.

Or, you can download from the following release page: https://github.com/lablup/backend.ai-webui/releases
The Web UI (user GUI) guide can be found at https://webui.docs.backend.ai/.
Install on Clouds
The minimal instance configuration:
1x SSL certificate with a private key for your own domain (for production)
1x manager instance (e.g., t3.xlarge on AWS)
For an HA setup, you may replicate multiple manager instances running in different availability zones and put a load balancer in front of them.
Nx agent instances (e.g., t3.medium / p2.xlarge on AWS – for minimal testing)
If you spawn multiple agents, it is recommended to use a placement group to improve locality for each availability zone.
1x PostgreSQL instance (e.g., AWS RDS)
1x Redis instance (e.g., AWS ElastiCache)
1x etcd cluster
For an HA setup, it should consist of 5 separate instances distributed across availability zones.
1x cloud file system (e.g., AWS EFS, Azure FileShare)
All should be in the same virtual private network (e.g., AWS VPC).
Install on Premise
The minimal server node configuration:
1x SSL certificate with a private key for your own domain (for production)
1x manager server
Nx agent servers
1x PostgreSQL server
1x Redis server
1x etcd cluster
For an HA setup, it should consist of 5 separate server nodes.
1x network-accessible storage server (NAS) with NFS/SMB mounts
All should be in the same private network (LAN).
Depending on the cluster size, several service/database daemons may run on the same physical server.
Install monitoring and logging tools
Backend.AI can use several third-party monitoring and logging services. Using them is completely optional.
Guide variables
⚠️ Prepare the values of the following variables before working with this page, and replace their occurrences as you follow the guide.
Name | Description
---|---
{DDAPIKEY} | The Datadog API key
{DDAPPKEY} | The Datadog application key
{SENTRYURL} | The private Sentry report URL
Install Datadog agent
Datadog is a 3rd-party service to monitor the server resource usage.
$ DD_API_KEY={DDAPIKEY} bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/dd-agent/master/packaging/datadog-agent/source/install_agent.sh)"
Install Raven (Sentry client)
Raven is the official client package of Sentry, which reports detailed contextual information such as stack traces and package versions when an unhandled exception occurs.
$ pip install "raven>=6.1"
Upgrade existing Backend.AI cluster
Note
Ideally, terminate every workload (including compute sessions) before initiating an upgrade. There may be unexpected side effects when performing a rolling upgrade.
Note
Unless you know how each component interacts with the others, it is best to keep a single version installed across every part of the Backend.AI cluster.
Performing minor upgrade
A minor upgrade means upgrading a Backend.AI cluster while keeping the major version the same (e.g., 24.03.0 to 24.03.1). Changes in minor upgrades are usually meant to fix critical bugs rather than introduce new features. In general, there should be only trivial changes between minor versions that do not affect how users interact with the software. To plan the upgrade, first check the following:
Read every bit of the release changelog.
Run the minor upgrade consecutively, version by version.
Do not skip intermediate versions, even when trying to upgrade an outdated cluster.
Check if there is a change to the database schema.
As mentioned at the beginning, it is best to keep the database schema unchanged, but in rare situations altering it is inevitable.
Make sure every mission-critical workload is shut down when performing a rolling upgrade.
Upgrading Backend.AI Manager
Stop the manager process running on the server.
Upgrade the Python package by executing pip install -U backend.ai-manager==<target version>.
Match the database schema with the latest revision by executing alembic upgrade head.
Restart the process.
A consolidated example follows below.
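For example, assuming the manager runs from ${HOME}/manager and was registered as a systemd unit named backendai-manager following the same conventions as the other services in this guide (both are assumptions), a minor upgrade session might look like:
$ sudo systemctl stop backendai-manager
$ cd "${HOME}/manager"
$ # Activate the manager's virtual environment if needed.
$ pip install -U backend.ai-manager==24.03.1
$ alembic upgrade head
$ sudo systemctl start backendai-manager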
Upgrading other Backend.AI components
Stop the ongoing server process.
Upgrade the Python package by executing pip install -U backend.ai-<component name>==<target version>.
Restart the process.
Others
Depending on the situation, additional steps may be required that must be performed manually by the system administrator. Always check the release changelog to find out whether this is the case.
Performing major upgrade
A major upgrade involves significant feature additions and structural changes. DO NOT perform rolling upgrades in any case. Make sure to shut down every workload of the cluster and notify users of a relatively prolonged downtime.
To plan the upgrade, first check the following:
Upgrade the Backend.AI cluster to the very latest minor version of the prior release before starting the major version upgrade.
By policy, it is not allowed to upgrade to the latest major release on a cluster with an outdated minor version installed.
Do not skip intermediate major versions.
You cannot skip the stop-gap version!
Example of allowed upgrade paths
23.09.10 (latest in the previous major) -> 24.03.0
23.09.10 (latest in the previous major) -> 24.03.5
23.09.9 -> 23.09.10 (latest in the previous major) -> 24.03.0
23.03.11 -> 23.09.0 -> 23.09.1 -> … -> 23.09.10 (latest in the previous major) -> 24.03.0
…
Example of forbidden upgrade paths
23.09.9 (a non-latest minor version of the prior release) -> 24.03.0
23.03.0 (not a direct prior release) -> 24.03.0
…
Upgrading Backend.AI Manager
Stop the manager process running on the server.
Upgrade the Python package by executing pip install -U backend.ai-manager==<target version>.
Match the database schema with the latest revision by executing alembic upgrade head.
Fill out any missing DB revisions by executing backend.ai mgr schema apply-missing-revisions <version number of previous Backend.AI software>.
Start the process again.
A consolidated example follows below.
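Again assuming the conventions above (systemd unit backendai-manager, installation under ${HOME}/manager) and the example upgrade path 23.09.10 to 24.03.0, a major upgrade session might look like:
$ sudo systemctl stop backendai-manager
$ cd "${HOME}/manager"
$ pip install -U backend.ai-manager==24.03.0
$ alembic upgrade head
$ backend.ai mgr schema apply-missing-revisions 23.09.10
$ sudo systemctl start backendai-manager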
Upgrading other Backend.AI components
Stop the ongoing server process.
Upgrade the Python package by executing pip install -U backend.ai-<component name>==<target version>.
Restart the process.
Others
Depending on the situation, additional steps may be required that must be performed manually by the system administrator. Always check the release changelog to find out whether this is the case.
Environment specifics: WSL v2
Backend.AI supports running on WSL (Windows Subsystem for Linux) version 2. However, you need to configure some special options so that the WSL distribution can interact with the Docker Desktop service.
Configuration of Docker Desktop for Windows
Turn on WSL Integration under Settings → Resources → WSL INTEGRATION. In most cases, this is already configured when you install Docker Desktop for Windows.
Configuration of WSL
Create or modify /etc/wsl.conf using sudo in the WSL shell, writing the following content and saving it:
[automount]
root = /
options = "metadata"
Run wsl --shutdown in a PowerShell prompt to restart the WSL distribution and ensure that your wsl.conf updates are applied.
Enter the WSL shell again. If the change is applied, your paths should appear like /c/some/path instead of /mnt/c/some/path.
Run sudo mount --make-rshared / in the WSL shell. Otherwise, container creation from Backend.AI will fail with an error message like aiodocker.exceptions.DockerError: DockerError(500, 'path is mounted on /d but it is not a shared mount.').
Installation of Backend.AI
Now you may run the installer in the WSL shell.
User Guides
Install User Programs in Session Containers
Sometimes you need new programs or libraries that are not installed in your environment. If so, you can install them into your environment yourself.
NOTE: Newly installed programs are not environment-dependent. They are installed in the user directory.
Install packages with linuxbrew
If you are a macOS user, or a researcher or developer who occasionally installs Unix programs, you may be familiar with Homebrew (https://brew.sh). You can install new programs using linuxbrew in Backend.AI.
Creating a user linuxbrew directory
Directories that begin with a dot are automatically mounted when the session starts. Create a linuxbrew directory that will be automatically mounted so that programs you install with linuxbrew can be used in all sessions.
Create .linuxbrew in the Storage section.
With CLI:
$ backend.ai vfolder create .linuxbrew
Let’s check if they are created correctly.
$ backend.ai vfolder list
Also, you can create a directory with the same name using the GUI console.
Installing linuxbrew
Start a new session for installation. Choose your environment and allocate the necessary resources. Generally, you don’t need to allocate a lot of resources, but if you need to compile or install a GPU-dependent library, you need to adjust the resource allocation to your needs.
In general, 1 CPU / 4GB RAM is enough.
$ sh -c "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install.sh)"
Testing linuxbrew
Enter the brew command to verify that linuxbrew is installed. In general, to use linuxbrew
you need to add the path where linuxbrew
is installed to the PATH variable.
Enter the following command to temporarily add the path and verify that it is installed correctly.
$ brew
Setting linuxbrew environment variables automatically
To correctly reference the binaries and libraries installed by linuxbrew, add the configuration to .bashrc. You can add the settings from the Settings tab.
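For reference, the .bashrc lines typically look like the following. This is a sketch assuming linuxbrew is installed under ~/.linuxbrew; adjust the prefix to match your installation:
# Make linuxbrew-installed binaries, man pages, and info pages discoverable.
export PATH="$HOME/.linuxbrew/bin:$HOME/.linuxbrew/sbin:$PATH"
export MANPATH="$HOME/.linuxbrew/share/man:$MANPATH"
export INFOPATH="$HOME/.linuxbrew/share/info:$INFOPATH"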
Example: Installing and testing htop
To test the program installation, let’s install a program called htop
. htop
is a program that extends the top command, allowing you to monitor the running computing environment in a variety of ways.
Let’s install it with the following command:
$ brew install htop
If there are any libraries needed for the htop
program, they will be installed automatically.
Now let’s run:
$ htop
From the run screen, you can press q to return to the terminal.
Deleting the linuxbrew Environment
To reset all programs installed with linuxbrew, just delete everything in the .linuxbrew directory.
Note: If you want to remove a specific program only, use the brew uninstall [PROGRAM_NAME] command.
$ rm -rf ~/.linuxbrew/*
Install packages with miniconda
Some environments support miniconda. In this case, you can use miniconda (https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to install the packages you want.
Creating a user miniconda-required directory
Directories that begin with a dot are automatically mounted when the session starts. Create .conda and .continuum directories that will be automatically mounted so that packages you install with miniconda can be used in all sessions.
Create .conda and .continuum in the Storage section.
With CLI:
$ backend.ai vfolder create .conda
$ backend.ai vfolder create .continuum
Let’s check if they are created correctly.
$ backend.ai vfolder list
Also, you can create a directory with the same name using the GUI console.
Testing miniconda
Make sure you have miniconda installed in your environment. Package installation using miniconda is only available if miniconda is preinstalled in your environment.
$ conda
Example: Installing and testing htop
To test the program installation, let’s install a program called htop
. htop
is a program that extends the top command, allowing you to monitor the running computing environment in a variety of ways.
Let’s install it with the following command:
$ conda install -c conda-forge htop
If there are any libraries needed for the htop
program, they will be installed automatically.
Now let’s run:
$ htop
From the run screen, you can press q to return to the terminal.
Developer Guides
Development Setup
Currently, Backend.AI is developed and tested only under *NIX-compatible platforms (Linux or macOS).
The development setup uses a mono-repository for the backend stack and a side-by-side repository checkout of the frontend stack. In contrast, the production setup uses per-service independent virtual environments and relies on a separately provisioned app proxy pool.
There are three ways to run both the backend and frontend stacks for development, as demonstrated in
Fig. 4, Fig. 5, and Fig. 6.
The installation guide in this page using scripts/install-dev.sh covers all three cases, because the only difference is how you launch the Web UI from the mono-repo.
A standard development setup of Backend.AI open source components
A development setup of Backend.AI open source components for Electron-based desktop app
A development setup of Backend.AI open source components with pre-built web UI from the backend.ai-app
repository
Installation from Source
To ease the on-boarding developer experience, we provide an automated script that installs all server-side components in editable state with just one command.
Prerequisites
Install the following according to your host operating system.
Ensure that you have all of the Python versions specified in pants.toml with pyenv (both Python 3.9.x and Python 3.10.8 at the time of writing, but please consult your copy of pants.toml for the latest information). Check the prerequisites for the Python build environment setup for your system.
Docker Compose (v2 required)
(For Linux aarch64/arm64 setups only) Rust, to build Pants from its source. This applies to Pants version 2.18 and later, whose binaries are released via GitHub Releases instead of PyPI.
Warning
To avoid conflicts with your system Python such as macOS/XCode versions,
our default pants.toml
is configured to search only pyenv
-provided Python versions.
Note
In some cases, locale conflicts between the terminal client and the remote host may cause encoding errors when installing Backend.AI components due to Unicode characters in README files. Please keep correct locale configurations to prevent such errors.
Running the install-dev script
$ git clone https://github.com/lablup/backend.ai bai-dev
$ cd bai-dev
$ ./scripts/install-dev.sh
Note
The script requires sudo
to check and install several system packages
such as build-essential
.
This script will bootstrap Pants and create the halfstack containers using docker compose, with fixture population.
At the end of execution, the script will show several command examples about
launching the service daemons such as manager and agent.
You may execute this script multiple times when you encounter prerequisite errors and
resolve them.
Also check out additional options using -h
/ --help
option, such as installing
the CUDA mockup plugin together, etc.
Changed in version 22.09: We have migrated from per-package repositories to a semi-mono repository that contains all Python-based components except plugins. This has changed the installation instructions completely, with the introduction of Pants.
Note
To install multiple instances/versions of development environments using this script,
just clone the repository in another location and run scripts/install-dev.sh
inside that directory.
It is important to name these working-copy directories differently so as not to confuse docker compose, so that it can distinguish the containers for each setup.
Unless you customize all port numbers by the options of scripts/install-dev.sh
,
you should docker compose -f docker-compose.halfstack.current.yml down
and docker compose -f docker-compose.halfstack.current.yml up -d
when switching
between multiple working copies.
Note
By default, the script pulls the docker images for our standard Python kernel and TensorFlow CPU-only kernel. To try out other images, you have to pull them manually afterwards.
Note
Currently there are many limitations on running deep learning images on ARM64 platforms, because users need to rebuild the whole computation library stack, although more supported images will come in the future.
Note
To install the webui in an editable state, try the --editable-webui flag when running scripts/install-dev.sh.
Tip
Using the agent’s cgroup-based statistics without the root privilege (Linux-only)
To allow Backend.AI to collect sysfs/cgroup resource usage statistics, the Python executable must have the following Linux capabilities: CAP_SYS_ADMIN
, CAP_SYS_PTRACE
, and CAP_DAC_OVERRIDE
.
$ sudo setcap \
> cap_sys_ptrace,cap_sys_admin,cap_dac_override+eip \
> $(readlink -f $(pyenv which python))
Verifying Installation
Refer to the instructions displayed after running scripts/install-dev.sh.
We recommend using tmux to open multiple terminals in a single SSH session. Your terminal app may provide a tab interface, but when using remote servers, tmux is more convenient because you don't have to set up a new SSH connection whenever you add a new terminal.
Ensure the halfstack containers are running:
$ docker compose -f docker-compose.halfstack.current.yml up -d
Open a terminal for manager and run:
$ ./backend.ai mgr start-server --debug
Open another terminal for agent and run:
$ ./backend.ai ag start-server --debug
Open yet another terminal for client and run:
$ source ./env-local-admin-api.sh # Use the generated local endpoint and credential config.
$ # source ./env-local-user-api.sh # You may choose an alternative credential config.
$ ./backend.ai config
$ ./backend.ai run python --rm -c 'print("hello world")'
∙ Session token prefix: fb05c73953
✔ [0] Session fb05c73953 is ready.
hello world
✔ [0] Execution finished. (exit code = 0)
✔ [0] Cleaned up the session.
$ ./backend.ai ps
Resetting the environment
Shutdown all docker containers using docker compose -f docker-compose.halfstack.current.yml down
and delete the entire working copy directory. That’s all.
You may need sudo
to remove the directories mounted as halfstack container volumes
because Docker auto-creates them with the root privilege.
Daily Workflows
Check out Daily Development Workflows for your reference.
Daily Development Workflows
About Pants
Since 22.09, we have migrated to Pants as our primary build system and dependency manager for the mono-repository of Python components.
Pants is a graph-based async-parallel task executor written in Rust and Python. It is tailored to building programs with explicit and auto-inferred dependency checks and aggressive caching.
Key concepts
The command pattern:
$ pants [GLOBAL_OPTS] GOAL [GOAL_OPTS] [TARGET ...]
Goal: an action to execute
You may think of this as the root node of the task graph executed by Pants.
Target: objectives for the action, usually expressed as
path/to/dir:name
The targets are declared/defined by
path/to/dir/BUILD
files.
The global configuration is at pants.toml.
Recommended reading: https://www.pantsbuild.org/docs/concepts
Inspecting build configurations
Display all targets
$ pants list ::
This list includes the full enumeration of individual targets auto-generated by collective targets (e.g.,
python_sources()
generates multiplepython_source()
targets by globbing thesources
pattern)
Display all dependencies of a specific target (i.e., all targets required to build this target)
$ pants dependencies --transitive src/ai/backend/common:src
Display all dependees of a specific target (i.e., all targets affected when this target is changed)
$ pants dependees --transitive src/ai/backend/common:src
Note
Pants statically analyzes the source files to enumerate all their imports and determine the dependencies automatically. In most cases this works well, but sometimes you may need to manually declare explicit dependencies in BUILD files, as sketched below.
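As a sketch, an explicit dependency declaration in a BUILD file may look like the following; the target names here are illustrative, not taken from the actual repository:
python_sources(
    name="src",
    dependencies=[
        # Declared manually because Pants could not infer it.
        "src/ai/backend/common:src",
    ],
)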
Running lint and check
Run lint/check for all targets:
$ pants lint ::
$ pants check ::
To run lint/check for a specific target or a set of targets:
$ pants lint src/ai/backend/common:: tests/common::
$ pants check src/ai/backend/manager::
Currently, running mypy with Pants is slow because mypy cannot utilize its own cache, as Pants invokes mypy per file due to its own dependency management scheme. (e.g., checking all sources takes more than a minute!) This performance issue is being tracked at pantsbuild/pants#10864. For now, try using a smaller target of the files that you work on, and use the --changed-since option to select only the changed targets, as shown below.
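For example, to lint and type-check only the targets affected by your changes relative to the main branch (the base ref main is an assumption; use whichever ref you branch from):
$ pants --changed-since=main lint
$ pants --changed-since=main check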
Running formatters
If you encounter a failure from ruff, you may run the following to automatically fix the import ordering issues.
$ pants fix ::
If you encounter a failure from black, you may run the following to automatically fix the code style issues.
$ pants fmt ::
Running unit tests
Here are various methods to run tests:
$ pants test ::
$ pants test tests/manager/test_scheduler.py::
$ pants test tests/manager/test_scheduler.py:: -- -k test_scheduler_configs
$ pants test tests/common:: # Run common/**/test_*.py
$ pants test tests/common:tests # Run common/test_*.py
$ pants test tests/common/redis:: # Run common/redis/**/test_*.py
$ pants test tests/common/redis:tests # Run common/redis/test_*.py
You may also try the --changed-since option, as with lint and check.
To specify extra environment variables for tests, use the --test-extra-env-vars
option:
$ pants test \
> --test-extra-env-vars=MYVARIABLE=MYVALUE \
> tests/common:tests
Running integration tests
$ ./backend.ai test run-cli user,admin
Building wheel packages
To build a specific package:
$ pants \
> --tag="wheel" \
> package \
> src/ai/backend/common:dist
$ ls -l dist/*.whl
If the package content varies by the target platform, use:
$ pants \
> --tag="wheel" \
> --tag="+platform-specific" \
> --platform-specific-resources-target=linux_arm64 \
> package \
> src/ai/backend/runner:dist
$ ls -l dist/*.whl
Using IDEs and editors
Pants has an export
goal to auto-generate a virtualenv that contains all
external dependencies installed in a single place.
This is very useful when you use IDEs and editors.
To (re-)generate the virtualenv(s), run:
$ pants export --resolve=RESOLVE_NAME # you may add multiple --resolve options
You may display the available resolve names with the following one-liner (tomllib requires Python 3.11 or later):
$ python -c 'import tomllib,pathlib;print("\n".join(tomllib.loads(pathlib.Path("pants.toml").read_text())["python"]["resolves"].keys()))'
Similarly, you can export all virtualenvs at once:
$ python -c 'import tomllib,pathlib;print("\n".join(tomllib.loads(pathlib.Path("pants.toml").read_text())["python"]["resolves"].keys()))' | sed 's/^/--resolve=/' | xargs pants export
Then configure your IDEs/editors to use
dist/export/python/virtualenvs/python-default/PYTHON_VERSION/bin/python
as the
interpreter for your code, where PYTHON_VERSION
is the interpreter version
specified in pants.toml
.
As of Pants 2.16, you must export the virtualenvs by the individual lockfiles
using the --resolve
option, as all tools are unified to use the same custom resolve subsystem of Pants and the ::
target no longer works properly, like:
$ pants export --resolve=python-default --resolve=mypy
To make LSP (language server protocol) services like Pylance detect our source packages correctly, you should also configure PYTHONPATH to include the repository root’s src directory, and the plugins/*/ directories if you have added Backend.AI plugin checkouts.
For linters and formatters, configure the tool executable paths to indicate
dist/export/python/virtualenvs/RESOLVE_NAME/PYTHON_VERSION/bin/EXECUTABLE
.
For example, ruff’s executable path is
dist/export/python/virtualenvs/ruff/3.12.2/bin/ruff
.
Currently we have the following Python tools to configure in this way:
ruff: Provides fast linting (combining pylint, flake8, and isort), fixing (auto-fix for some linting rules and isort), and formatting (black).
mypy: Validates the type annotations and performs a static analysis.
pytest: The unit test runner framework.
coverage-py: Generates reports about which source lines were visited during execution of a pytest session.
towncrier: Generates the changelog from news fragments in the changes directory when making a new release.
Tip
For a long list of arguments or list/tuple items, you could explicitly add a trailing comma to force Ruff/Black to insert line-breaks after every item even when the line length does not exceed the limit (100 characters).
Tip
You may disable auto-formatting on a specific region of code using # fmt: off and # fmt: on comments, though this is strongly discouraged except when manual formatting gives better readability, such as numpy matrix declarations.
VSCode
Install the following extensions:
Python (ms-python.python)
Pylance (ms-python.vscode-pylance) (optional but recommended)
Mypy (ms-python.mypy-type-checker)
Ruff (charliermarsh.ruff)
For other standard Python extensions like Flake8, isort, and Black, disable them for the Backend.AI workspace only, to prevent interference with Ruff’s own linting, fixing, and formatting.
Set the workspace settings for the Python extension to enable code navigation and auto-completion; in particular, point the interpreter setting to dist/export/python/virtualenvs/python-default/PYTHON_VERSION/bin/python as described above.
Then set the workspace settings for each Python tool (Mypy, Ruff) to use the executables exported under dist/export/python/virtualenvs/RESOLVE_NAME/PYTHON_VERSION/bin/.
Note
Changed in July 2023: After applying the VSCode Python Tool migration, we no longer recommend configuring the python.linting.*Path and python.formatting.*Path keys.
Vim/NeoVim
There are a large variety of plugins and usually heavy Vimmers should know what to do.
We recommend using ALE or CoC plugins to have automatic lint highlights, auto-formatting on save, and auto-completion support with code navigation via LSP backends.
Warning
Note that it is recommended to enable only one linter/formatter at a time (either ALE or CoC) with proper configurations, to avoid duplicate suggestions and error reports.
When using ALE, it is recommended to have a directory-local vimrc as follows. First, add set exrc in your user-level vimrc. Then put the following in .vimrc (or .nvimrc for NeoVim) in the build root directory:
let s:cwd = getcwd()
let g:ale_python_mypy_executable = s:cwd . '/dist/export/python/virtualenvs/mypy/3.12.2/bin/mypy'
let g:ale_python_ruff_executable = s:cwd . '/dist/export/python/virtualenvs/ruff/3.12.2/bin/ruff'
let g:ale_linters = { "python": ['ruff', 'mypy'] }
let g:ale_fixers = {'python': ['ruff']}
let g:ale_fix_on_save = 1
When using CoC, run :CocInstall coc-pyright @yaegassy/coc-ruff
and :CocLocalConfig
after opening a file
in the local working copy to initialize Pyright functionalities.
In the local configuration file (.vim/coc-settings.json
), you may put the linter/formatter configurations
just like VSCode (see the official reference).
{
"coc.preferences.formatOnType": false,
"coc.preferences.willSaveHandlerTimeout": 5000,
"ruff.enabled": true,
"ruff.autoFixOnSave": true,
"ruff.useDetectRuffCommand": false,
"ruff.builtin.pythonPath": "dist/export/python/virtualenvs/ruff/3.12.2/bin/python",
"ruff.serverPath": "dist/export/python/virtualenvs/ruff/3.12.2/bin/ruff-lsp",
"python.pythonPath": "dist/export/python/virtualenvs/python-default/3.12.2/bin/python",
"python.linting.mypyEnabled": true,
"python.linting.mypyPath": "dist/export/python/virtualenvs/mypy/3.12.2/bin/mypy",
}
To activate Ruff (a Python linter and fixer), run :CocCommand ruff.builtin.installServer
after opening any Python source file to install the ruff-lsp
server.
Switching between branches
When each branch has different external package requirements, you should run pants export before running code after switching between such branches with git switch.
Sometimes you may experience a bogus “glob” warning from Pants because it sees a stale cache. In that case, run pgrep pantsd | xargs kill and it will be fine.
Running entrypoints
To run a Python program within the unified virtualenv, use the ./py
helper
script. It automatically passes additional arguments transparently to the
Python executable of the unified virtualenv.
./backend.ai
is an alias of ./py -m ai.backend.cli
.
Examples:
$ ./py -m ai.backend.storage.server
$ ./backend.ai mgr start-server
$ ./backend.ai ps
Working with plugins
To develop Backend.AI plugins together, the repository offers a special location
./plugins
where you can clone plugin repositories and a shortcut script
scripts/install-plugin.sh
that does this for you.
$ scripts/install-plugin.sh lablup/backend.ai-accelerator-cuda-mock
This is equivalent to:
$ git clone \
> https://github.com/lablup/backend.ai-accelerator-cuda-mock \
> plugins/backend.ai-accelerator-cuda-mock
These plugins are auto-detected by scanning setup.cfg
of plugin subdirectories
by the ai.backend.plugin.entrypoint
module, even without explicit editable installations.
Writing test cases
Mostly it is just the same as before: use the standard pytest practices. There are a few key differences, though:
Tests are executed in parallel in units of test modules. Therefore, session-level fixtures may be executed multiple times during a single run of pants test.
Warning
If you interrupt (Ctrl+C, SIGINT) a run of pants test
, it will
immediately kill all pytest processes without fixture cleanup. This may
accumulate unused Docker containers in your system, so it is a good practice
to run docker ps -a
periodically and clean up dangling containers.
To interactively run tests, see Debugging test cases (or interactively running test cases).
Here are considerations for writing Pants-friendly tests (a minimal sketch follows after this list):
Ensure that tests run in an isolated/mocked environment and minimize external dependencies.
If required, use the environment variable BACKEND_TEST_EXEC_SLOT (an integer value) to uniquely define TCP port numbers and other resource identifiers to allow parallel execution. Refer to the Pants docs.
Use ai.backend.testutils.bootstrap to populate single-node Redis/etcd/Postgres containers as fixtures for your test cases. Import the fixture and use it like a plain pytest fixture. These fixtures create those containers with OS-assigned public port numbers and give you a tuple of the container ID and an ai.backend.common.types.HostPortPair for use in test code. In manager and agent tests, you can just refer to local_config to get a pre-populated local configuration with those port numbers. In this case, you may encounter flake8 complaining about unused imports and redefinition; use # noqa: F401 and # noqa: F811 respectively for now.
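The following sketch illustrates the pattern; the fixture name etcd_container is hypothetical, so check ai.backend.testutils.bootstrap for the actual fixture names:
from ai.backend.testutils.bootstrap import etcd_container  # noqa: F401

def test_etcd_container_is_reachable(etcd_container):  # noqa: F811
    # The fixture yields a (container_id, HostPortPair) tuple
    # with an OS-assigned public port number.
    container_id, addr = etcd_container
    assert container_id
    assert addr.port > 0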
Warning
About using /tmp in tests
If your Docker service is installed using Snap (e.g., Ubuntu 20.04 or
later), it cannot access the system /tmp
directory because Snap applies a
private “virtualized” tmp directory to the Docker service.
You should use other locations under the user’s home directory (or
preferably .tmp
in the working copy directory) to avoid mount failures
for the developers/users in such platforms.
It is okay to use the system /tmp
directory if they are not mounted inside
any containers.
Writing documentation
Create a new pyenv virtualenv based on Python 3.10.
$ pyenv virtualenv 3.10.9 venv-bai-docs
Activate the virtualenv and run:
$ pyenv activate venv-bai-docs
$ pip install -U pip setuptools wheel
$ pip install -U -r docs/requirements.txt
You can build the docs as follows:
$ cd docs
$ pyenv activate venv-bai-docs
$ make html
To locally serve the docs:
$ cd docs
$ python -m http.server --directory=_build/html
(TODO: Use Pants’ own Sphinx support when pantsbuild/pants#15512 is released.)
Advanced Topics
Adding new external dependencies
Add the package version requirements to the unified requirements file (./requirements.txt).
Update the module_mapping field in the root build configuration (./BUILD) if the package name and its import name differ.
Update the type_stubs_module_mapping field in the root build configuration if the package provides a type stubs package separately.
Run:
$ pants generate-lockfiles
$ pants export
Merging lockfile conflicts
When you work on a branch that adds a new external dependency while the main branch also has another external dependency addition, merging the main branch into your branch is likely to cause a merge conflict on the python.lock file.
In this case, you can just do the following, since we can simply regenerate the lockfile after merging the requirements.txt and BUILD files.
$ git merge main
... it says a conflict on python.lock ...
$ git checkout --theirs python.lock
$ pants generate-lockfiles --resolve=python-default
$ git add python.lock
$ git commit
Resetting Pants
If Pants behaves strangely, you could simply reset all its runtime-generated files by:
$ pgrep pantsd | xargs -r kill
$ rm -r /tmp/*-pants/ .pants.d .pids ~/.cache/pants
After this, re-running any Pants command will automatically reinitialize itself and all cached data as necessary.
Note that you may find out the concrete path inside /tmp
from .pants.rc
’s
local_execution_root_dir
option set by install-dev.sh
.
Warning
If you have run pants
or the installation script with sudo
, some of the above directories
may be owned by root and running pants
as the user privilege would not work.
In such cases, remove the directories with sudo
and retry.
Resolving the error message ‘Pants is not available for your platform’ when installing Backend.AI with Pants
When installing Backend.AI, you may see the following error message saying ‘Pants is not available for your platform’ if you have installed Pants 2.17 or older with prior versions of Backend.AI.
[INFO] Bootstrapping the Pants build system...
Pants system command is already installed.
Failed to fetch https://binaries.pantsbuild.org/tags/pantsbuild.pants/release_2.19.0: [22] HTTP response code said error (The requested URL returned error: 404)
Bootstrapping Pants 2.19.0 using cpython 3.9.15
Installing pantsbuild.pants==2.19.0 into a virtual environment at /home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/venvs/2.19.0
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.4/65.4 KB 3.3 MB/s eta 0:00:00
ERROR: Could not find a version that satisfies the requirement pantsbuild.pants==2.19.0 (from versions: 0.0.17, 0.0.18, 0.0.20, 0.0.21, 0.0.22, ... (a long list of versions) ..., 2.17.0,
2.17.1rc0, 2.17.1rc1, 2.17.1rc2, 2.17.1rc3, 2.17.1, 2.18.0.dev0, 2.18.0.dev1, 2.18.0.dev3, 2.18.0.dev4, 2.18.0.dev5, 2.18.0.dev6, 2.18.0.dev7, 2.18.0a0)
ERROR: No matching distribution found for pantsbuild.pants==2.19.0
Install failed: Command '['/home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/venvs/2.19.0/bin/python', '-sE', '-m', 'pip', '--disable-pip-versi
on-check', '--no-python-version-warning', '--log', PosixPath('/home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/venvs/2.19.0/pants-install.log'
), 'install', '--quiet', '--find-links', 'file:///home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/find_links/2.19.0/e430175b/index.html', '--p
rogress-bar', 'off', 'pantsbuild.pants==2.19.0']' returned non-zero exit status 1.
More information can be found in the log at: /home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/logs/install.log
Error: Isolates your Pants from the elements.
Please select from the following boot commands:
<default>: Detects the current Pants installation and launches it.
bootstrap-tools: Introspection tools for the Pants bootstrap process.
pants: Runs a hermetic Pants installation.
pants-debug: Runs a hermetic Pants installation with a debug server for debugging Pants code.
update: Update scie-pants.
You can select a boot command by passing it as the 1st argument or else by setting the SCIE_BOOT environment variable.
ERROR: Failed to establish atomic directory /home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/locks/install-a4f15e2d2c97473883ec33b4ee0f9d11f99dcf5bee63
8b1cc7a0270d55d0ec8d. Population of work directory failed: Boot binding command failed: exit status: 1
[ERROR] Cannot proceed the installation because Pants is not available for your platform!
To resolve this error, reinstall or upgrade Pants. As of the Pants 2.18.0 release, the binary builds are distributed via GitHub Releases instead of the Python Package Index (PyPI).
Resolving missing directories error when running Pants
ValueError: Failed to create temporary directory for immutable inputs: No such file or directory (os error 2) at path "/tmp/bai-dev-PN4fpRLB2u2xL.j6-pants/immutable_inputsvIpaoN"
If you encounter errors like the above when running daily Pants commands such as lint, you may manually create the directory one level higher. For the above example, run:
$ mkdir -p /tmp/bai-dev-PN4fpRLB2u2xL.j6-pants/
If this workaround does not work, backup your current working files and
reinstall by running scripts/delete-dev.sh
and scripts/install-dev.sh
serially.
Changing or updating the Python runtime for Pants
When you run scripts/install-dev.sh
, it automatically creates .pants.bootstrap
to explicitly set a specific pyenv Python version to run Pants.
If you have removed/upgraded this specific Python version from pyenv, you also need to
update .pants.bootstrap
accordingly.
Debugging test cases (or interactively running test cases)
When your tests hang, you can try adding the --debug
flag to the pants test
command:
$ pants test --debug ...
so that Pants runs the designated test targets serially and interactively. This means that you can directly observe the console output and press Ctrl+C to gracefully shut down the tests with fixture cleanup. You can also apply additional pytest options such as --fulltrace, -s, etc. by passing them after the target arguments and -- when executing the pants test command.
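For example, to run a single test module interactively with extra pytest options (the target path is taken from the unit-test examples above):
$ pants test --debug tests/manager/test_scheduler.py:: -- -s --fulltrace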
Installing a subset of mono-repo packages in the editable mode for other projects
Sometimes, you need to editable-install a subset of packages into another project’s directory. For instance, you could mount the client SDK and its internal dependencies into a Docker container for development.
In this case, we recommend to do it as follows:
Run the following command to build a wheel from the current mono-repo source:
$ pants --tag=wheel package src/ai/backend/client:dist
This will generate dist/backend.ai_client-{VERSION}-py3-none-any.whl.
Run pip install -U {MONOREPO_PATH}/dist/{WHEEL_FILE} in the target environment. This will populate the package metadata and install its external dependencies. The target environment may be a separate virtualenv or a container being built. For container builds, you need to first COPY the wheel file and then install it.
Check the internal dependency directories to link by running the following command:
$ pants dependencies --transitive src/ai/backend/client:src \
>   | grep src/ai/backend | grep -v ':version' | cut -d/ -f4 | uniq
cli
client
plugin
Link these directories in the target environment.
For example, if it is a Docker container, you could add -v {MONOREPO_PATH}/src/ai/backend/{COMPONENT}:/usr/local/lib/python3.10/site-packages/ai/backend/{COMPONENT} to the docker create or docker run commands for all the component directories found in the previous step.
If it is a local checkout with a pyenv-based virtualenv, you could replace the $(pyenv prefix)/lib/python3.10/site-packages/ai/backend/{COMPONENT} directories with symbolic links to the mono-repo’s component source directories, as sketched below.
Boosting the performance of Pants commands
Since Pants uses temporary directories for aggressive caching, you could make
the .tmp
directory under the working copy root a tmpfs partition:
$ sudo mount -t tmpfs -o size=4G tmpfs .tmp
To make this persistent across reboots, add the following line to /etc/fstab:
tmpfs /path/to/dir/.tmp tmpfs defaults,size=4G 0 0
The size should be more than 3GB. (Running pants test :: consumes about 2GB.)
To change the size at runtime, you could simply remount it with a new size option:
$ sudo mount -t tmpfs -o remount,size=8G tmpfs .tmp
Making a new release
Update the ./VERSION file to set a new version number. (Remove the ending newline, e.g., using set noeol in Vim. This is also configured in ./editorconfig.)
Run LOCKSET=tools/towncrier ./py -m towncrier to auto-generate the changelog. You may append --draft to see a preview of the changelog update without actually modifying the filesystem. (WIP: lablup/backend.ai#427)
Make a new git commit with the commit message: “release: <version>”.
Make an annotated tag to the commit with the message: “Release v<version>” or “Pre-release v<version>” depending on the release version.
Push the commit and tag. The GitHub Actions workflow will build the packages and publish them to PyPI.
When making a new major release, a snapshot of the prior release’s final DB migration history should be dumped. This will later help to fill in missing gaps of DB revisions when upgrading an outdated cluster. The output should then be committed to the next major release.
$ ./backend.ai mgr schema dump-history > src/ai/backend/manager/models/alembic/revision_history/<version>.json
Suppose you are trying to create both a freshly baked 24.09.0 and a good old 24.03.10 release. In such cases, you should first make a release of version 24.03.10, move back to the latest branch, execute the code snippet above with <version> set to 24.03.10, and then release 24.09.0 including the dump.
To keep the workflow above effective, be aware that backporting DB revisions to older major releases is no longer permitted after the major release version is switched.
Backporting to legacy per-pkg repositories
Use git diff and git apply instead of git cherry-pick.
To perform a three-way merge for conflicts, add the -3 option to the git apply command.
You may need to rewrite some code as the package structure differs. (The new mono-repository has more fine-grained first-party packages divided from the backend.ai-common package.)
When referring to the PR/issue numbers in commits for per-pkg repositories, update them like lablup/backend.ai#NNN instead of #NNN.
Adding New Kernel Images
Overview
Backend.AI supports running Docker containers to execute user-requested computations in a resource-constrained and isolated environment. Most Docker container images can be imported as Backend.AI kernels with appropriate metadata annotations.
Prepare a Docker image based on Ubuntu 16.04/18.04, CentOS 7.6, or Alpine 3.8.
Create a Dockerfile that does:
Install the OpenSSL library in the image for the kernel runner (if not installed).
Add metadata labels.
Add service definition files.
Add a jail policy file.
Build a derivative image using the Dockerfile.
Upload the image to a Docker registry to use with Backend.AI.
Kernel Runner
Every Backend.AI kernel should run a small daemon called “kernel runner”. It communicates with the Backend.AI Agent running in the host via ZeroMQ, and manages user code execution and in-container service processes.
The kernel runner provides runtime-specific implementations for various code execution modes such as the query mode and the batch mode, compatible with a number of well-known programming languages. It also manages the process lifecycles of service-port processes.
To decouple the development and update cycles for Docker images and the Backend.AI Agent, we don’t install the kernel runner inside images.
Instead, Backend.AI Agent mounts a special “krunner” volume as /opt/backend.ai
inside containers.
This volume includes a customized static build of Python.
The kernel runner daemon package is mounted as one of the site packages of this Python distribution as well.
The agent also uses /opt/kernel
as the directory for mounting other self-contained single-binary utilities.
This way, image authors do not have to bother with installing Python and Backend.AI specific software.
All dirty jobs like volume deployment, its content updates, and mounting for new containers are automatically managed by Backend.AI Agent.
Since the customized Python build and binary utilities need to be built for specific Linux distributions, we only support Docker images built on top of Alpine 3.8+, CentOS 7+, and Ubuntu 16.04+ base images. Note that these three base distributions practically cover all commonly available Docker images.
Image Prerequisites
For glibc-based (most) Linux kernel images, you don’t have to add anything to the existing container image as we use a statically built Python distribution with precompiled wheels to run the kernel runner. The only requirement is that it should be compatible with manylinux2014 or later.
For musl-based Linux kernel images (e.g., Alpine), you have to install libffi and sqlite-libs at minimum. Please also refer to the Dockerfile to build a minimal compatible image.
Metadata Labels
Any Docker image based on Alpine 3.17+, CentOS 7+, and Ubuntu 16.04+ which satisfies the above prerequisites may become a Backend.AI kernel image if you add the following image labels:
Required Labels
ai.backend.kernelspec: 1 (this will be used for future versioning of the metadata specification)
ai.backend.features: A list of constant strings indicating which Backend.AI kernel features are available for the kernel.
  batch: Can execute user programs passed as files.
  query: Can execute user programs passed as code snippets while keeping the context across multiple executions.
  uid-match: As of 19.03, this must always be specified.
  user-input: The query/batch mode supports interactive user inputs.
ai.backend.resource.min.*: The minimum amount of resources to launch this kernel. At least, you must define the CPU core (cpu) and the main memory (mem). In the memory size values, you may use binary scale suffixes such as m for MiB, g for GiB, etc.
ai.backend.base-distro: Either “ubuntu16.04” or “alpine3.8”. Note that Ubuntu 18.04-based kernels also need to use “ubuntu16.04” here.
ai.backend.runtime-type: The type of kernel runner to use. (One of the directories in the ai.backend.kernels namespace.)
  python: This runtime is for Python-based kernels, making the given Python executable accessible via the query and batch modes, and also as a Jupyter kernel service.
  app: This runtime does not support code execution in the query/batch modes but just manages the service port processes. For custom kernel images with their own service ports for their main applications, this is the most frequently used runtime type for derivative images.
  For the full list of available runtime types, check out the lang_map variable in the ai.backend.kernels module code.
ai.backend.runtime-path: The path to the language runtime executable.
Optional Labels
ai.backend.role: COMPUTE (default if unspecified) or INFERENCE
ai.backend.service-ports: A list of port mapping declaration strings for services supported by the image. (See the next section for details.) Backend.AI manages the host-side port mapping and network tunneling via the API gateway automagically.
ai.backend.endpoint-ports: A comma-separated list of the service port name(s) to be bound to the service endpoint. (At least one is required in inference sessions.)
ai.backend.model-path: The path to mount the target model’s target version storage folder. (Required in inference sessions.)
ai.backend.envs.corecount: A comma-separated string list of environment variable names. They are set to the number of CPU cores available to the kernel container, which allows the CPU core restriction to be enforced on legacy parallel computation libraries. (e.g., JULIA_CPU_CORES, OPENBLAS_NUM_THREADS)
Service Ports
As of Backend.AI v19.03, service ports are our preferred way to run computation workloads inside Backend.AI kernels. It provides tunneled access to Jupyter Notebooks and other daemons running in containers.
As of Backend.AI v19.09, Backend.AI provides SSH (including SFTP and SCP) and ttyd (web-based xterm shell) as intrinsic services for all kernels. “Intrinsic” means that image authors do not have to do anything to support/enable the services.
As of Backend.AI v20.03, image authors may define their own service ports using service definition JSON files installed at /etc/backend.ai/service-defs
in their images.
Port Mapping Declaration
A custom service port should define two things.
First, the image label ai.backend.service-ports
contains the port mapping declarations.
Second, the service definition file which specifies how to start the service process.
A port mapping declaration is composed of three values: the service name, the protocol, and the container-side port number. The label may contain multiple port mapping declarations separated by commas, like the following example:
jupyter:http:8080,tensorboard:http:6006
The name may be an arbitrary non-empty ASCII alphanumeric string.
We use kebab-case for it.
The protocol may be one of tcp, http, and pty, but currently most services use http.
Note that a few port numbers are reserved for Backend.AI itself and the intrinsic service ports: TCP ports 2000 and 2001 are reserved for the query mode, 2002 and 2003 for the native pseudo-terminal mode (stdin and stdout combined with stderr), 2200 for the intrinsic SSH service, and 7681 for the intrinsic ttyd service.
Up to Backend.AI 19.09, this was the only method to define a service port for images, and the service-specific launch sequences were all hard-coded in the ai.backend.kernel module.
Service Definition DSL
Now the image author should define the service launch sequences using a DSL (domain-specific language).
The service definitions are written as JSON files in the container’s /etc/backend.ai/service-defs
directory.
The file names must be the same as the name parts of the port mapping declarations.
For example, a sample service definition file for “jupyter” service (hence its filename must be /etc/backend.ai/service-defs/jupyter.json
) looks like:
{
  "prestart": [
    {
      "action": "write_tempfile",
      "args": {
        "body": [
          "c.NotebookApp.allow_root = True\n",
          "c.NotebookApp.ip = \"0.0.0.0\"\n",
          "c.NotebookApp.port = {ports[0]}\n",
          "c.NotebookApp.token = \"\"\n",
          "c.FileContentsManager.delete_to_trash = False\n"
        ]
      },
      "ref": "jupyter_cfg"
    }
  ],
  "command": [
    "{runtime_path}",
    "-m", "jupyterlab",
    "--no-browser",
    "--config", "{jupyter_cfg}"
  ],
  "url_template": "http://{host}:{port}/"
}
A service definition is composed of three major fields: prestart, which contains a list of prestart actions; command, a list of template-enabled strings; and an optional url_template, a template-enabled string that defines the URL presented to the end-user on the CLI or used as the redirection target on the GUI with wsproxy.
The “template-enabled” strings may contain references to a contextual set of variables in curly braces. All variable substitutions follow Python’s brace-style formatting syntax and rules.
Available predefined variables
There are a few predefined variables as follows:
ports: A list of TCP ports used by the service. Most services have only one port. An item in the list may be referenced using bracket notation like {ports[0]}.
runtime_path: A string representing the full path to the runtime, as specified in the ai.backend.runtime-path image label.
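For illustration, here is a minimal sketch (not the actual kernel-runner code) of how such template-enabled strings could be interpolated with Python’s brace-style formatting; the variable values below are hypothetical:

import pprint

cmd_template = ["{runtime_path}", "-m", "jupyterlab", "--no-browser", "--config", "{jupyter_cfg}"]
variables = {
    "ports": [8080],                           # assigned by the agent at launch
    "runtime_path": "/usr/local/bin/python",   # from the ai.backend.runtime-path label
    "jupyter_cfg": "/tmp/jupyter_cfg_abc123",  # hypothetical result of the "ref"-ed prestart action
}
command = [part.format(**variables) for part in cmd_template]
pprint.pprint(command)
# ['/usr/local/bin/python', '-m', 'jupyterlab', '--no-browser', '--config', '/tmp/jupyter_cfg_abc123']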
Available prestart actions
A prestart action is composed of two mandatory fields, action and args (see the table below), and an optional field ref.
The ref field defines a variable that stores the result of the action and can be referenced in later parts of the service definition file where the arguments are marked as “template-enabled”.
Action Name | Arguments | Return
---|---|---
write_file | The content lines (body) and the target filename | None
write_tempfile | The content lines (body) | The generated file path
mkdir | The directory path to create | None
run_command | The command to run as a list of strings | A dictionary with two fields: out and err
log | The message to log | None
Warning
The run_command action should return quickly; otherwise, the session creation latency will increase.
If you need to run a background process, you must use its own options to daemonize it or wrap it as a background shell command (["/bin/sh", "-c", "... &"]).
Interpretation of URL template
The url_template field is used by the client SDK and wsproxy to build the actual URL presented to the end-user (or to the end-user’s web browser as the redirection target).
Its template variables are therefore not parsed when starting the service; they are parsed and interpolated by the clients.
There are only three fixed variables: {protocol}, {host}, and {port}.
Here is a sample service-definition that utilizes the URL template:
{
  "command": [
    "/opt/noVNC/utils/launch.sh",
    "--vnc", "localhost:5901",
    "--listen", "{ports[0]}"
  ],
  "url_template": "{protocol}://{host}:{port}/vnc.html?host={host}&port={port}&password=backendai&autoconnect=true"
}
Jail Policy
(TODO: jail policy syntax and interpretation)
Adding Custom Jail Policy
To write a new policy implementation, extend the jail policy interface in Go and embed it inside your jail build. Please take a look at the existing jail policies as references.
Example: An Ubuntu-based Kernel
FROM ubuntu:16.04
# Add commands for image customization
RUN apt-get update && apt-get install -y ...
# Backend.AI specifics
RUN apt-get install -y libssl
LABEL ai.backend.kernelspec=1 \
ai.backend.resource.min.cpu=1 \
ai.backend.resource.min.mem=256m \
ai.backend.envs.corecount="OPENBLAS_NUM_THREADS,OMP_NUM_THREADS,NPROC" \
ai.backend.features="batch query uid-match user-input" \
ai.backend.base-distro="ubuntu16.04" \
ai.backend.runtime-type="python" \
ai.backend.runtime-path="/usr/local/bin/python" \
ai.backend.service-ports="jupyter:http:8080"
COPY service-defs/*.json /etc/backend.ai/service-defs/
COPY policy.yml /etc/backend.ai/jail/policy.yml
Custom startup scripts (aka custom entrypoint)
When the image has preopen service ports and/or an endpoint port, Backend.AI automatically sets up the application proxy tunnels as if the listening applications were already started.
To initialize and start such applications, put a shell script at /opt/container/bootstrap.sh when building the image.
This per-image bootstrap script is executed as root by the agent-injected entrypoint.sh.
Warning
Since Backend.AI overrides the command and the entrypoint of container images to run the kernel runner regardless of the image content, setting CMD or ENTRYPOINT in the Dockerfile has no effect.
You should use /opt/container/bootstrap.sh to migrate existing entrypoint/command wrappers.
Warning
/opt/container/bootstrap.sh must return immediately to prevent the session from staying in the PREPARING status.
This means that it should run service applications in the background by daemonizing them.
To run a process with the user privilege, use su-exec, which is also injected by the agent, like:
/opt/kernel/su-exec "${LOCAL_GROUP_ID}:${LOCAL_USER_ID}" /path/to/your/service
Implementation details
The query mode I/O protocol
The input is a ZeroMQ multipart message with two payloads. The first payload should contain a unique identifier for the code snippet (usually a hash of it), but currently it is ignored (reserved for future caching implementations). The second payload should contain a UTF-8 encoded source code string.
The reply is a ZeroMQ multipart message with a single payload, containing a UTF-8 encoded string of the following JSON object:
{
"stdout": "hello world!",
"stderr": "oops!",
"exceptions": [
["exception-name", ["arg1", "arg2"], false, null]
],
"media": [
["image/png", "data:image/base64,...."]
],
"options": {
"upload_output_files": true
}
}
Each item in exceptions is an array composed of four items: the exception name, the exception arguments (optional), a boolean indicating whether the exception was raised outside the user code (mostly false), and a traceback string (optional).
Each item in media is an array of two items: the MIME type and the data string.
Specific formats are defined and handled by the Backend.AI Media module.
The options field is optional.
If upload_output_files is true (default), the agent uploads the files generated by the user code in the working directory (/home/work) to an AWS S3 bucket and makes their URLs available in the front-end.
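The following is a minimal client sketch of this protocol using pyzmq; the REQ socket type and the tcp://localhost:2000 address are illustrative assumptions (the actual socket types and addresses depend on the deployment and may change):

import hashlib
import json
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)  # assumption: a request-reply pattern for illustration
sock.connect("tcp://localhost:2000")  # assumption: the in-container query-mode port

code = "print('hello world!')"
code_id = hashlib.sha1(code.encode("utf8")).hexdigest()  # currently ignored by the server

# The input: a multipart message with the code ID and the UTF-8 source code.
sock.send_multipart([code_id.encode("utf8"), code.encode("utf8")])

# The reply: a multipart message with a single JSON payload as described above.
result = json.loads(sock.recv_multipart()[0].decode("utf8"))
print(result["stdout"], result["exceptions"])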
The pseudo-terminal mode protocol
If you want to allow users to have real-time interactions with your kernel using web-based terminals, you should implement the PTY mode as well. A good example is our “git” kernel runner.
The key concept is the separation of the “outer” daemon and the “inner” target program (e.g., a shell). The outer daemon should wrap the inner program inside a pseudo-tty. As the outer daemon is completely hidden in terminal interaction by the end-users, its programming language may differ from that of the inner program. The challenge is that you need to implement piping of ZeroMQ sockets from/to the pseudo-tty file descriptors. It is up to you how you implement the outer daemon, but if you choose Python for it, we recommend using asyncio or similar event loop libraries such as tornado and Twisted to multiplex sockets and file descriptors in both input/output directions. When piping the messages, the outer daemon should not apply any specific transformation; it should send and receive all raw data/control byte sequences transparently because the front-end (e.g., terminal.js) is responsible for interpreting them. Currently we use PUB/SUB ZeroMQ socket types but this may change later.
Optionally, you may run the query-mode loop side-by-side. For example, our git kernel supports terminal resizing and pinging commands as query-mode inputs. There is no fixed specification for such commands yet, but the current CodeOnWeb uses the following:
%resize <rows> <cols>: resize the pseudo-tty’s terminal to fit with the web terminal element in user browsers.
%ping: just a no-op command to prevent kernel idle timeouts while the web terminal is open in user browsers.
A best practice (not mandatory but recommended) for PTY mode kernels is to automatically respawn the inner program if it terminates (e.g., the user has exited the shell) so that the users are not locked in a “blank screen” terminal.
Using Mocked Accelerators
For developers who do not have access to physical accelerator devices such as CUDA GPUs, we provide a mock-up plugin to simulate the system configuration with such devices, allowing development and testing of accelerator-related features in various components including the web UI.
Configuring the mock-accelerator plugin
Check out the examples in the configs/accelerator
directory.
Here is a description of each field:
slot_name: The resource slot’s main key name. The plugin’s resource slot name has the form "<slot_name>.<subtype>", where the subtype may be something such as device (default) or shares (for the fractional allocation mode). For CUDA MIG devices, it becomes a string including the slice size and the device memory size, such as 10g-mig.
To configure the fractional allocation mode, you should also specify the etcd accelerator plugin configuration like the following JSON, where unit_mem and unit_proc are used as the divisors to calculate the 1.0 fraction:
{
  "config": {
    "plugins": {
      "accelerator": {
        "<slot_name>": {
          "allocation_mode": "fractional",
          "unit_mem": 1073741824,
          "unit_proc": 10
        }
      }
    }
  }
}
In the above example, 10 subprocessors and 1 GiB of device memory are regarded as a 1.0 fractional device. You may store it as a JSON file and put it in the etcd configuration tree like:
$ ./backend.ai mgr etcd put-json '' mydevice-fractional-mode.json
device_plugin_name: The class name to use as the actual implementation. Currently there are two: CUDADevice and MockDevice.
formats.<subtype>: The tables of per-subtype formatting details:
- display_icon: The device icon type displayed in the UI.
- display_unit: The resource slot unit displayed in the UI, alongside the amount numbers.
- human_readable_name: The device name displayed in the UI.
- description: The device description displayed in the UI.
- number_format: The number formatting string used for the UI.
- binary: A boolean flag indicating whether to use binary suffixes (divided by 2^(10n) instead of 10^(3n)).
- round_length: The number of fixed decimal places used when displaying the numeric value of this resource slot. If zero, the number is treated as an integer.
devices: The list of mocked device declarations:
- mother_uuid: The unique ID of the device, which may be randomly generated.
- model_name: The model name to report to the manager as metadata.
- numa_node: The NUMA node index where this device is placed.
- subproc_count: The number of sub-processing cores (e.g., the number of streaming multiprocessors of CUDA GPUs).
- memory_size: The size of the on-device memory, represented as a human-readable binary size.
- is_mig_devices: (CUDA-specific) Whether this device is a MIG slice or a full device.
Activating the mock-accelerator plugin
Add "ai.backend.accelerator.mock"
to the agent.toml
’s [agent].allowed-compute-plugins
field.
Then restart the agent.
Version Numbering
Version numbering uses the x.y.z format (where x, y, and z are integers). Mostly, we follow the calendar versioning scheme.
x.y is a release branch name (major releases per 6 months).
- When y is smaller than 10, we prepend a zero like 05 in the version numbers (e.g., 20.09.0).
- When referring to the version in other Python packages as requirements, you need to strip the leading zeros (e.g., 20.9.0 instead of 20.09.0) because Python setuptools normalizes the version integers.
x.y.z is a release tag name (patch releases).
When releasing x.y.0:
- Create a new x.y branch, do all bugfix/hotfix work there, and make the x.y.z releases there.
- All fixes must be first implemented on the main branch and then cherry-picked back to the x.y branches.
- When cherry-picking, use the -e option to edit the commit message. Append Backported-From: main and Backported-To: X.Y lines after one blank line at the end of the existing commit message.
- Change the version number of main to x.(y+1).0.dev0.
There are no strict rules about alpha/beta/rc builds yet. We will elaborate as we scale up. Once used, alpha versions will have aN suffixes, beta versions bN suffixes, and RC versions rcN suffixes, where N is an integer.
New development should go on the main branch.
- main: commit here directly if your change is self-complete as a single commit.
- Use both short-lived and long-running feature branches freely, but ensure their names differ from release branches and tags.
The major/minor (x.y) versions of Backend.AI subprojects go together to indicate compatibility. Currently the manager/agent/common versions progress this way, while the client SDKs have their own version numbers and the API specification has a different vN.yyyymmdd version format.
- Generally, backend.ai-manager 1.2.p is compatible with backend.ai-agent 1.2.q (where p and q are the same or different integers).
- As of 22.09, this is no longer guaranteed. All server-side core component versions should exactly match one another, as we release them at once from the mono-repo, even for components without any code changes.
- The client is guaranteed to be backward-compatible with servers sharing the same API specification version.
Upgrading
You can upgrade the installed Python packages using the pip install -U ... command along with their dependencies.
If you have cloned the stable version of the source code from git, pull and check out the next x.y release branch.
It is recommended to re-run pip install -U -r requirements.txt as dependencies might be updated.
For the manager, ensure that your database schema is up-to-date by running alembic upgrade head. If you set up your development environment with Pants and the install-dev.sh script, keep your database schema up-to-date via ./py -m alembic upgrade head instead of the plain alembic command above.
Also check if any manual etcd configuration scheme change is required, though we will try to keep it compatible and automatically upgrade when first executed.
Migration Guides
Upgrading from 20.03 to 20.09
(TODO)
Migrating from the Docker Hub to cr.backend.ai
As of November 2020, the Docker Hub has begun to limit the retention time and the pull rate of public images. Since Backend.AI uses a number of Docker images with a variety of access frequencies, we decided to migrate to our own container registry, https://cr.backend.ai.
It is strongly recommended to set a maintenance period if there are active users of the Backend.AI cluster, to prevent new session starts during the migration. This registry migration does not affect existing running sessions, though the Docker image removal in the agent nodes can only be done after terminating all existing containers started with the old images, and there will be a brief disconnection of service ports as the manager needs to be restarted.
Update your Backend.AI installation to the latest version (manager 20.03.11 or 20.09.0b2) to get support for Harbor v2 container registries.
Save the following JSON snippet as registry-config.json:
{
  "config": {
    "docker": {
      "registry": {
        "cr.backend.ai": {
          "": "https://cr.backend.ai",
          "type": "harbor2",
          "project": "stable,community"
        }
      }
    }
  }
}
Run the following using the manager CLI on one of the manager nodes:
$ sudo systemctl stop backendai-manager   # stop the manager daemon (may differ by setup)
$ backend.ai mgr etcd put-json '' registry-config.json
$ backend.ai mgr image rescan cr.backend.ai
$ sudo systemctl start backendai-manager  # start the manager daemon (may differ by setup)
The agents will automatically pull the images since the image references have changed, even when the new images are actually the same as the existing ones. To avoid long waiting times when starting sessions, it is recommended to pull the essential images by yourself in the agent nodes using the docker pull command.
The images are now categorized with an additional path prefix, such as stable and community. More prefixes may be introduced in the future, and some prefixes may be made available only to a specific set of users/user groups with dedicated credentials.
For example, lablup/python:3.6-ubuntu18.04 is now referred to as cr.backend.ai/stable/python:3.6-ubuntu18.04.
If you have configured image aliases, you need to update them manually as well, using the backend.ai mgr image alias command. This does not affect existing sessions running with old aliases.
Update the allowed docker registries policy for each domain using the backend.ai mgr dbshell command. Remove “index.docker.io” from the existing values and replace “…” below with your own domain names and additional registries.
SELECT name, allowed_docker_registries FROM domains;  -- check the current config
UPDATE domains SET allowed_docker_registries = '{cr.backend.ai,...}' WHERE name = '...';
Now you may start new sessions using the images from the new registry.
After terminating all existing sessions using the old images from the Docker Hub (i.e., images whose names start with the lablup/ prefix), remove the image metadata and registry configuration using the manager CLI:
$ backend.ai mgr etcd delete --prefix images/index.docker.io
$ backend.ai mgr etcd delete --prefix config/docker/registry/index.docker.io
Run docker rmi commands to clean up the pulled images in the agent nodes. (Automatic/managed removal of images will be implemented in a future version of Backend.AI.)
Backend.AI Manager Reference
Manager API Common Concepts
API and Document Conventions
HTTP Methods
We use the standard HTTP/1.1 methods (RFC 2616), such as GET, POST, PUT, PATCH, and DELETE, with some additions from WebDAV (RFC 3253) such as the REPORT method to send JSON objects in request bodies with GET semantics.
If your client runs under a restrictive environment that only allows a subset of the above methods, you may use the universal POST method with an extra HTTP header like X-Method-Override: REPORT, so that the Backend.AI gateway can recognize the intended HTTP method.
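As a sketch, a client could send a REPORT-semantics request through POST like the following, using the requests library; the endpoint URL and the query body are hypothetical, and the authentication headers described in the Authentication section are omitted for brevity:

import requests

resp = requests.post(
    "https://your.sorna.api.endpoint/some/query-api",  # hypothetical endpoint
    headers={
        "X-Method-Override": "REPORT",  # the intended HTTP method
        "Content-Type": "application/json",
    },
    json={"filters": {"status": "RUNNING"}},  # hypothetical REPORT-style query body
)
print(resp.status_code)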
Parameters in URI and JSON Request Body
The parameters with colon prefixes (e.g., :id) are part of the URI path and must be encoded using a proper URI-compatible encoding scheme, such as encodeURIComponent(value) in JavaScript or urllib.parse.quote(value, safe='~()*!.\'') in Python 3+.
Other parameters should be set as a key-value pair of the JSON object in the HTTP request body. The API server accepts both UTF-8 encoded bytes and standard-compliant Unicode-escaped strings in the body.
HTTP Status Codes and JSON Response Body
The API responses always contain a root JSON object, regardless of success or failures.
For successful responses (HTTP status 2xx), the root object has a varying set of key-value pairs depending on the API.
For failures (HTTP status 4xx/5xx), the root object contains at least two keys: type, which uniquely identifies the failure reason as a URI, and title, a human-readable error message.
Some failures may return extra structured information as additional key-value pairs.
We use RFC 7807-style problem detail descriptions returned as JSON in the response body.
JSON Field Notation
Dot-separated field names mean a nested object. If a field name is a pure integer, it means a list item.
Example |
Meaning |
---|---|
|
The attribute |
|
The attribute |
|
An item in the list |
|
The attribute |
JSON Value Types
This documentation uses a type annotation style similar to Python’s typing module, but with minor intuitive differences such as lower-cased generic type names and the wildcard asterisk * instead of Any.
The common types are array (JSON array), object (JSON object), int (integer-only subset of JSON number), str (JSON string), and bool (JSON true or false). tuple and list are aliases to array.
Optional values may be omitted or set to null.
We also define several custom types:
Type |
Description |
---|---|
|
Fractional numbers represented as |
|
Similar to |
|
ISO-8601 timestamps in |
|
Only allows a fixed/predefined set of possible values in the given parametrized type. |
API Versioning
A version string of the Backend.AI API uses two parts: a major revision (prefixed with v) and a minor release date after a dot following the major revision.
For example, v23.20250101 indicates the 23rd major revision with a minor release on January 1st, 2025.
We keep backward compatibility between minor releases within the same major version.
Therefore, all API query URLs are prefixed with the major revision, such as /v2/kernel/create
.
Minor releases may introduce new parameters and response fields but no URL changes.
Accessing unsupported major revision returns HTTP 404 Not Found.
Changed in version v3.20170615: The version prefix in API queries is deprecated (yet it is still supported currently).
For example, users should now call /kernel/create rather than /v2/kernel/create.
A client must specify the API version in the HTTP request header named X-BackendAI-Version.
To check the latest minor release date of a specific major revision, try a GET query to the URL with only the major revision part (e.g., /v2).
The API server will return a JSON string in the response body containing the full version.
When querying the API version, you do not have to specify the authorization header and the rate-limiting is enforced per the client IP address.
Check out more details about Authentication and Rate Limiting.
Example version check response body:
{
"version": "v2.20170315"
}
JSON Object References
Paging Query Object
It describes how many items to fetch for object listing APIs.
If index
exceeds the number of pages calculated by the server, an empty list is returned.
Key | Type | Description
---|---|---
size | int | The number of items per page. If set to zero or this object is entirely omitted, all items are returned and index is ignored.
index | int | The page number to show, zero-based.
Paging Info Object
It contains the paging information based on the paging query object in the request.
Key |
Type |
Description |
---|---|---|
|
|
The number of total pages. |
|
|
The number of all items. |
KeyPair Item Object
Key |
Type |
Description |
---|---|---|
|
|
The access key part. |
|
|
Indicates if the keypair is active or not. |
|
|
The number of queries done via this keypair. It may have a stale value. |
|
|
The timestamp when the keypair was created. |
KeyPair Properties Object
Key |
Type |
Description |
---|---|---|
|
|
Indicates if the keypair is activated or not.
If not activated, all authentication using the keypair returns 401 Unauthorized.
When changed from |
|
|
The maximum number of concurrent sessions allowed for this keypair.
(default: |
|
|
Sets the number of instances clustered together when launching new machine learning sessions. (default: |
|
|
Sets the memory limit of each instance in the cluster launched for new machine learning sessions. (default: |
The enterprise edition offers the following additional properties:
Key |
Type |
Description |
---|---|---|
|
|
If set |
|
|
The string representation of money amount as decimals.
The currency is fixed to USD. (default: |
Service Port Object
Key | Type | Description
---|---|---
name | str | The name of the service provided by the container. See also: Terminal Emulation
protocol | str | The type of network protocol used by the container service.
Batch Execution Query Object
Key | Type | Description
---|---|---
build | str | The bash command to build the main program from the given uploaded files. If this field is not present, an empty string, or null, the build step is skipped. If this field is the constant string "*", the default build command of the kernel image is used.
exec | str | The bash command to execute the main program. If this is not present, an empty string, or null, the execution step is skipped.
clean | str | The bash command to clean the intermediate files produced during the build phase. The clean step comes before the build step if specified so that the build step can (re)start fresh. If the field is not present, an empty string, or null, the clean step is skipped. Unlike the build and exec commands, there is no default clean command.
Note
A client can distinguish whether the current output is from the build phase or the execution phase by checking whether it has received the build-finished status or not.
Note
All shell commands are by default executed under /home/work.
The common environment is:
TERM=xterm
LANG=C.UTF-8
SHELL=/bin/bash
USER=work
HOME=/home/work
but individual kernels may have additional environment settings.
Warning
The shell does NOT have access to sudo or the root privilege. Though, some kernels may allow installation of language-specific packages in the user directory.
Also, your build script and the main program are executed inside Backend.AI Jail, meaning that some system calls are blocked by our policy. Since the ptrace syscall is blocked, you cannot use native debuggers such as gdb. This limitation, however, is subject to change in the future.
Example:
{
"build": "gcc -Wall main.c -o main -lrt -lz",
"exec": "./main"
}
Execution Result Object
Key | Type | Description
---|---|---
runId | str | The user-provided run identifier. If the user has NOT provided it, this will be set by the API server upon the first execute API call. In that case, the client should use it for the subsequent execute API calls during the same run.
status | enum[str] | One of "continued", "waiting-input", "finished", or "build-finished".
exitCode | int | The exit code of the last process. This field has a valid value only when the status is "finished" or "build-finished"; otherwise it is null. For batch-mode kernels and query-mode kernels without global context support, this is the exit code of the last executed process. For query-mode kernels with global context support, this value is always zero, regardless of whether the user code has caused an exception or not. A negative value (which cannot happen with normal process termination) indicates a Backend.AI-side error.
console | list[object] | A list of Console Item Objects.
options | object | An object containing extra display options. If there are no options indicated by the kernel, this field is null.
files | list[object] | A list of Execution Result File Objects that represent files generated in /home/work during the execution.
Console Item Object
Key |
Type |
Description |
---|---|---|
(root) |
|
A tuple of the item type and the item content.
The type may be See more details at Handling Console Output. |
Execution Result File Object
Key |
Type |
Description |
---|---|---|
|
|
The name of a created file after execution. |
|
|
The URL of a created file uploaded to AWS S3. |
Container Stats Object
Key |
Type |
Description |
---|---|---|
|
|
The total time the kernel was running. |
|
|
The maximum memory usage. |
|
|
The current memory usage. |
|
|
The total amount of received data through network. |
|
|
The total amount of transmitted data through network. |
|
|
The total amount of received data from IO. |
|
|
The total amount of transmitted data to IO. |
|
|
Currently not used field. |
|
|
Currently not used field. |
Creation Config Object
Key |
Type |
Description |
---|---|---|
|
|
A dictionary object specifying additional environment variables. The values must be strings. |
|
|
An optional list of the name of virtual folders that belongs to the current API key.
These virtual folders are mounted under If the name contains a colon in the middle, the second part of the string indicates
the alias location in the kernel’s file system which is relative to You may mount up to 5 folders for each session. |
|
|
The number of instances bundled for this session. |
|
The resource slot specification for each container in this session. Added in version v4.20190315. |
|
|
|
The maximum memory allowed per instance. The value is capped by the per-kernel image limit. Additional charges may apply on the public API service. Deprecated since version v4.20190315. |
|
|
The number of CPU cores. The value is capped by the per-kernel image limit. Additional charges may apply on the public API service. Deprecated since version v4.20190315. |
|
|
The fraction of GPU devices (1.0 means a whole device). The value is capped by the per-kernel image limit. Additional charges may apply on the public API service. Deprecated since version v4.20190315. |
Resource Slot Object
Key |
Type |
Description |
---|---|---|
|
|
The number of CPU cores. |
|
|
The amount of main memory in bytes. When the slot object is used as an input to an API, it may be represented as binary numbers using the binary scale suffixes such as k, m, g, t, p, e, z, and y, e.g., “512m”, “512M”, “512MiB”, “64g”, “64G”, “64GiB”, etc. When the slot object is used as an output of an API, this field is always represented in the unscaled number of bytes as strings. Warning When parsing this field as JSON, you must check whether your JSON
library or the programming language supports large integers.
For instance, most modern Javascript engines support up to
\(2^{53}-1\) (8 PiB – 1) which is often defined as the
|
|
|
The number of CUDA devices. Only available when the server is configured to use the CUDA agent plugin. |
|
|
The virtual share of CUDA devices represented as fractional decimals. Only available when the server is configured to use the CUDA agent plugin with the fractional allocation mode (enterprise edition only). |
|
|
The number of TPU devices. Only available when the server is configured to use the TPU agent plugin (cloud edition only). |
(others) |
|
More resource slot types may be available depending on the server configuration and agent plugins. There are two types for an arbitrary slot: “count” (the default) and “bytes”. For “count” slots, you may put arbitrary positive real number there, but fractions may be truncated depending on the plugin implementation. For “bytes” slots, its interpretation and representation follows that of
the |
Resource Preset Object
Key |
Type |
Description |
---|---|---|
|
|
The name of this preset. |
|
The pre-configured combination of resource slots.
If it contains slot types that are not currently used/activated in the cluster,
they will be removed when returned via |
|
|
|
The pre-configured shared memory size. Client can send humanized strings like ‘2g’, ‘128m’, ‘534773760’, etc, and they will be automatically converted into bytes. |
Virtual Folder Creation Result Object
Key |
Type |
Description |
---|---|---|
|
|
An internally used unique identifier of the created vfolder. Currently it has no use in the client-side. |
|
|
The name of created vfolder, as the client has given. |
|
|
The host name where the vfolder is created. |
|
|
The user who has the ownership of this vfolder. |
|
|
The group who is the owner of this vfolder. |
Added in version v4.20190615: user
and group
fields.
Virtual Folder List Item Object
Key |
Type |
Description |
---|---|---|
|
|
The human readable name set when created. |
|
|
The unique ID of the folder. |
|
|
The host name where this folder is located. |
|
|
True if the client user is the owner of this folder. False if the folder is shared from a group or another user. |
|
|
The requested user’s permission for this folder. (One of “ro”, “rw”, and “wd” which represents read-only, read-write, and write-delete respectively. Currently “rw” and “wd” has no difference.) |
|
|
The user ID if the owner of this item is a user vfolder. Otherwise, |
|
|
The group ID if the owner of this item is a group vfolder. Otherwise, |
|
|
The owner type of vfolder. One of “user” or “group”. |
Added in version v4.20190615: user
, group
, and type
fields.
Virtual Folder Item Object
Key |
Type |
Description |
---|---|---|
|
|
The human readable name set when created. |
|
|
The unique ID of the folder. |
|
|
The host name where this folder is located. |
|
|
True if the client user is the owner of this folder. False if the folder is shared from a group or another user. |
|
|
The number of files in this folder. |
|
|
The requested user’s permission for this folder. |
|
|
The date and time when the folder is created. |
|
|
The date and time when the folder is last used. |
|
|
The user ID if the owner of this item is a user. Otherwise, |
|
|
The group ID if the owner of this item is a group. Otherwise, |
|
|
The owner type of vfolder. One of “user” or “group”. |
Added in version v4.20190615: user
, group
, and type
fields.
Virtual Folder File Object
Key |
Type |
Description |
---|---|---|
|
|
The filename. |
|
|
The file’s mode (permission) bits as an integer. |
|
|
The file’s size. |
|
|
The timestamp when the file is created. |
|
|
The timestamp when the file is last modified. |
|
|
The timestamp when the file is last accessed. |
Virtual Folder Invitation Object
Key |
Type |
Description |
---|---|---|
|
|
The unique ID of the invitation. Use this when making API requests referring this invitation. |
|
|
The inviter’s user ID (email) of the invitation. |
|
|
The permission that the invited user will have. |
|
|
The current state of the invitation. |
|
|
The unique ID of the vfolder where the user is invited. |
|
|
The name of the vfolder where the user is invited. |
Key |
Type |
Description |
---|---|---|
|
|
The retrieved content (multi-line string) of fstab. |
|
|
The node type, either “agent” or “manager. |
|
|
The node’s unique ID. |
Added in version v4.20190615.
Authentication
Access Tokens and Secret Key
To make requests to the API server, a client needs a pair of an API access key and a secret key. You may get one from our cloud service or from the administrator of your Backend.AI cluster.
The server uses the API keys to identify each client, and the secret keys to verify the integrity of API requests as well as to authenticate clients.
Warning
For security reasons (to avoid exposure of your API access key and secret key to arbitrary Internet users), we highly recommend setting up a server-side proxy to our API service if you are building a public-facing front-end service using Backend.AI.
For local deployments, you may create a master dummy pair in the configuration (TODO).
Common Structure of API Requests
HTTP Headers |
Values |
---|---|
Method |
|
Query String |
If your access key has the administrator privilege, your client may
optionally specify other user’s access key as the Added in version v4.20190315. |
|
Always should be |
|
Signature information generated as the section Signing API Requests describes. |
|
The date/time of the request formatted in RFC 8022 or ISO 8601. If no timezone is specified, UTC is assumed. The deviation with the server-side clock must be within 15-minutes. |
|
Same as |
|
|
|
An optional, client-generated random string to allow the server to distinguish repeated duplicate requests. It is important to keep idempotent semantics with multiple retries for intermittent failures. (Not implemented yet) |
Body |
JSON-encoded request parameters |
Common Structure of API Responses
HTTP Headers |
Values |
---|---|
Status code |
API-specific HTTP-standard status codes. Responses commonly used throughout all APIs include 200, 201, 204, 400, 401, 403, 404, 429, and 500, but not limited to. |
|
|
|
Web link headers specified as in RFC 5988. Only optionally used when returning a collection of objects. |
|
The rate-limiting information (see Rate Limiting). |
Body |
JSON-encoded results |
Signing API Requests
Each API request must be signed with a signature. First, the client should generate a signing key derived from its API secret key and a string to sign by canonicalizing the HTTP request.
Generating a signing key
Here is a Python code snippet that derives the signing key from the secret key. The key is derived by nested HMAC signing with the current date (without time) and the API endpoint address.
import hashlib, hmac
from datetime import datetime
SECRET_KEY = b'abc...'
def sign(key, msg):
return hmac.new(key, msg, hashlib.sha256).digest()
def get_sign_key():
t = datetime.utcnow()
k1 = sign(SECRET_KEY, t.strftime('%Y%m%d').encode('utf8'))
k2 = sign(k1, b'your.sorna.api.endpoint')
return k2
Generating a string to sign
The string to sign is generated from the following request-related values:
HTTP Method (uppercase)
URI including query strings
The value of
Date
(orX-BackendAI-Date
ifDate
is not present) formatted in ISO 8601 (YYYYmmddTHHMMSSZ
) using the UTC timezone.The canonicalized header/value pair of
Host
The canonicalized header/value pair of
Content-Type
The canonicalized header/value pair of
X-BackendAI-Version
The hex-encoded hash value of body as-is. The hash function must be same to the one given in the
Authorization
header (e.g., SHA256).
To generate a string to sign, the client should join the above values using the newline ("\n"
, ASCII 10) character.
All non-ASCII strings must be encoded with UTF-8.
To canonicalize a pair of HTTP header/value, first trim all leading/trailing whitespace characters ("\n"
, "\r"
, " "
, "\t"
; or ASCII 10, 13, 32, 9) of its value, and join the lowercased header name and the value with a single colon (":"
, ASCII 58) character.
The success example in Example Requests and Responses makes a string to sign as follows (where the newlines are "\n"
):
GET
/v2
20160930T01:23:45Z
host:your.sorna.api.endpoint
content-type:application/json
x-sorna-version:v2.20170215
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
In this example, the hash value e3b0c4...
is generated from an empty string using the SHA256 hash function since there is no body for GET requests.
Then, the client should calculate the signature using the derived signing key and the generated string with the hash function, as follows:
import hashlib, hmac
str_to_sign = 'GET\n/v2...'
sign_key = get_sign_key() # see "Generating a signing key"
m = hmac.new(sign_key, str_to_sign.encode('utf8'), hashlib.sha256)
signature = m.hexdigest()
Attaching the signature
Finally, the client now should construct the following HTTP Authorization
header:
Authorization: BackendAI signMethod=HMAC-SHA256, credential=<access-key>:<signature>
Example Requests and Responses
For the examples here, we use a dummy access key and secret key:
Example access key:
AKIAIOSFODNN7EXAMPLE
Example secret key:
wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Success example for checking the latest API version
GET /v2 HTTP/1.1
Host: your.sorna.api.endpoint
Date: 20160930T01:23:45Z
Authorization: BackendAI signMethod=HMAC-SHA256, credential=AKIAIOSFODNN7EXAMPLE:022ae894b4ecce097bea6eca9a97c41cd17e8aff545800cd696112cc387059cf
Content-Type: application/json
X-BackendAI-Version: v2.20170215
HTTP/1.1 200 OK
Content-Type: application/json
Content-Language: en
Content-Length: 31
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1999
X-RateLimit-Reset: 897065
{
"version": "v2.20170215"
}
Rate Limiting
The API server imposes a rate limit to prevent clients from overloading the server. The limit is applied to the last N minutes at ANY moment (N is 15 minutes by default).
For public non-authorized APIs such as version checks, the server uses the client’s IP address seen by the server to impose rate limits. Due to this, please keep in mind that large-scale NAT-based deployments may encounter the rate limits sooner than expected. For authorized APIs, it uses the access key in the authorization header to impose rate limits. The rate limit includes both all successful and failed requests.
Upon a valid request, the HTTP response contains the following header fields to help the clients flow-control their requests.
HTTP Headers | Values
---|---
X-RateLimit-Limit | The maximum allowed number of requests during the rate-limit window.
X-RateLimit-Remaining | The number of further allowed requests left for the moment.
X-RateLimit-Window | The constant value representing the window size in seconds. (e.g., 900 means 15 minutes) Changed in version v3.20170615: Deprecated X-RateLimit-Reset in favor of this header.
When the limit is exceeded, further API calls will get HTTP 429 “Too Many Requests”. If the client seems to be DDoS-ing, the server may block the client forever without prior notice.
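Clients may use these headers for simple flow control. Here is a minimal sketch using the requests library; the 60-second back-off value is an arbitrary assumption that should be tuned to your deployment's window size:

import time
import requests

def call_with_rate_limit(session, method, url, **kwargs):
    # Retry once after a pause when the server answers 429 Too Many Requests.
    resp = session.request(method, url, **kwargs)
    if resp.status_code == 429:
        time.sleep(60)  # assumption: an arbitrary back-off period
        resp = session.request(method, url, **kwargs)
    print("requests left in this window:", resp.headers.get("X-RateLimit-Remaining"))
    return resp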
Manager REST API
Backend.AI REST API is for running instant compute sessions at scale in clouds or on-premise clusters.
Session Management
Here are the API calls to create and manage compute sessions.
Creating Session
URI:
/session
(/session/create
also works for legacy)Method:
POST
Creates a new session or returns an existing session, depending on the parameters.
Parameters
Parameter |
Type |
Description |
---|---|---|
|
|
The kernel runtime type in the form of the Docker image name and tag.
For legacy, the API also recognizes the Changed in version v4.20190315. |
|
|
A client-provided session token, which must be unique among the currently non-terminated sessions owned by the requesting access key. Clients may reuse the token if the previous session with the same token has been terminated. It may contain ASCII alphabets, numbers, and hyphens in the middle. The length must be between 4 to 64 characters inclusively. It is useful for aliasing the session with a human-friendly name. |
|
|
(optional) If set true, the API returns immediately after queueing the session creation request to the scheduler.
Otherwise, the manager will wait until the session gets started actually.
(default: Added in version v4.20190615. |
|
|
(optional) Set the maximum duration to wait until the session starts after queued, in seconds. If zero,
the manager will wait indefinitely.
(default: Added in version v4.20190615. |
|
|
(optional) If set true, the API returns without creating a new session if a session
with the same ID and the same image already exists and not terminated.
In this case Added in version v4.20190615. |
|
|
(optional) The name of a user group (aka “project”) to launch the session within. (default: Added in version v4.20190615. |
|
|
(optional) The name of a domain to launch the session within (default: Added in version v4.20190615. |
|
|
(optional) A Creation Config Object to specify kernel configuration including resource requirements. If not given, the kernel is created with the minimum required resource slots defined by the target image. |
|
|
(optional) A per-session, user-provided tag for administrators to keep track of additional information of each session, such as which sessions are from which users. |
Example:
{
"image": "python:3.6-ubuntu18.04",
"clientSessionToken": "mysession-01",
"enqueueOnly": false,
"maxWaitSeconds": 0,
"reuseIfExists": true,
"domain": "default",
"group": "default",
"config": {
"clusterSize": 1,
"environ": {
"MYCONFIG": "XXX",
},
"mounts": ["mydata", "mypkgs"],
"resources": {
"cpu": "2",
"mem": "4g",
"cuda.devices": "1",
}
},
"tag": "example-tag"
}
Response
HTTP Status Code |
Description |
---|---|
200 OK |
The session is already running and you are okay to reuse it. |
201 Created |
The session is successfully created. |
401 Invalid API parameters |
There are invalid or malformed values in the API parameters. |
406 Not acceptable |
The requested resource limits exceed the server’s own limits. |
Fields |
Type |
Values |
---|---|---|
|
|
The session ID used for later API calls, which is same to the value of |
|
|
The status of the created kernel. This is always Added in version v4.20190615. |
|
|
The list of Service Port Object.
This field becomes an empty list if Note In most cases the service ports are same to what specified in the image metadata, but the agent may add shared services for all sessions. Changed in version v4.20190615. |
|
|
True if the session is freshly created. |
Example:
{
"sessId": "mysession-01",
"status": "RUNNING",
"servicePorts": [
{"name": "jupyter", "protocol": "http"},
{"name": "tensorboard", "protocol": "http"}
],
"created": true
}
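As a sketch, the request could be made with plain HTTP like below; the endpoint URL is hypothetical and make_auth_headers() is a hypothetical helper standing in for the signing procedure described in Signing API Requests:

import requests

payload = {
    "image": "python:3.6-ubuntu18.04",
    "clientSessionToken": "mysession-01",
    "config": {"resources": {"cpu": "2", "mem": "4g"}},
}
resp = requests.post(
    "https://your.sorna.api.endpoint/session",  # hypothetical endpoint
    json=payload,
    headers=make_auth_headers("POST", "/session", payload),  # hypothetical helper
)
assert resp.status_code in (200, 201)
print(resp.json()["sessId"], resp.json()["created"])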
Getting Session Information
URI:
/session/:id
Method:
GET
Retrieves information about a session. For performance reasons, the returned information may not be real-time; usually it is updated every few seconds on the server side.
Parameters
Parameter |
Type |
Description |
---|---|---|
|
|
The session ID. |
Response
HTTP Status Code |
Description |
---|---|
200 OK |
The information is successfully returned. |
404 Not Found |
There is no such session. |
Key |
Type |
Description |
---|---|---|
|
|
The kernel’s programming language |
|
|
The time elapsed since the kernel has started. |
|
|
The memory limit of the kernel in KiB. |
|
|
The number of times the kernel has been accessed. |
|
|
The total time the kernel was running. |
Destroying Session
URI:
/session/:id
Method:
DELETE
Terminates a session.
Parameters
Parameter |
Type |
Description |
---|---|---|
|
|
The session ID. |
Response
HTTP Status Code |
Description |
---|---|
204 No Content |
The session is successfully destroyed. |
404 Not Found |
There is no such session. |
Key |
Type |
Description |
---|---|---|
|
|
The Container Stats Object of the kernel when deleted. |
Restarting Session
URI:
/session/:id
Method:
PATCH
Restarts a session. The idle time of the session will be reset, but other properties such as the age and CPU credit will continue to accumulate. All global states such as global variables and module imports are also reset.
Parameters
Parameter |
Type |
Description |
---|---|---|
|
|
The session ID. |
Response
HTTP Status Code |
Description |
---|---|
204 No Content |
The session is successfully restarted. |
404 Not Found |
There is no such session. |
Code Execution (Query Mode)
Executing Snippet
URI:
/session/:id
Method:
POST
Executes a snippet of user code using the specified session. Each execution request to the same session may have side effects on subsequent executions. For instance, setting a global variable in one request and reading the variable in another request is completely legal. It is the job of the user (or the front-end) to guarantee the correct execution order of multiple interdependent requests. When the session is terminated or restarted, all such volatile states vanish.
Parameters
Parameter |
Type |
Description |
---|---|---|
|
|
The session ID. |
|
|
A constant string |
|
|
A string of user-written code. All non-ASCII data must be encoded in UTF-8 or any format acceptable by the session. |
|
|
A string of client-side unique identifier for this particular run. For more details about the concept of a run, see Code Execution Model. If not given, the API server will assign a random one in the first response and the client must use it for the same run afterwards. |
Example:
{
"mode": "query",
"code": "print('Hello, world!')",
"runId": "5facbf2f2697c1b7"
}
Response
HTTP Status Code |
Description |
---|---|
200 OK |
The session has responded with the execution result. The response body contains a JSON object as described below. |
Fields |
Type |
Values |
---|---|---|
|
|
Note
Even when the user code raises exceptions, such queries are treated as successful executions, i.e., the failure of this API means that our API subsystem had errors, not the user code.
Warning
If the user code tries to breach the system, causes crashes (e.g., segmentation fault), or runs too long (timeout), the session is automatically terminated.
In such cases, you will get incomplete console logs with the "finished" status earlier than expected.
Depending on the situation, the result.stderr may also contain specific error information.
Here are a few example responses when various Python code snippets are executed.
Example: Simple return.
print("Hello, world!")
{
"result": {
"runId": "5facbf2f2697c1b7",
"status": "finished",
"console": [
["stdout", "Hello, world!\n"]
],
"options": null
}
}
Example: Runtime error.
a = 123
print('what happens now?')
a = a / 0
{
"result": {
"runId": "5facbf2f2697c1b7",
"status": "finished",
"console": [
["stdout", "what happens now?\n"],
["stderr", "Traceback (most recent call last):\n File \"<input>\", line 3, in <module>\nZeroDivisionError: division by zero"],
],
"options": null
}
}
Example: Multimedia output.
Media outputs are also mixed with other console outputs according to their execution order.
import matplotlib.pyplot as plt
a = [1,2]
b = [3,4]
print('plotting simple line graph')
plt.plot(a, b)
plt.show()
print('done')
{
"result": {
"runId": "5facbf2f2697c1b7",
"status": "finished",
"console": [
["stdout", "plotting simple line graph\n"],
["media", ["image/svg+xml", "<?xml version=\"1.0\" ..."]],
["stdout", "done\n"]
],
"options": null
}
}
Example: Continuation results.
import time
for i in range(5):
print(f"Tick {i+1}")
time.sleep(1)
print("done")
{
"result": {
"runId": "5facbf2f2697c1b7",
"status": "continued",
"console": [
["stdout", "Tick 1\nTick 2\n"]
],
"options": null
}
}
Here you should make another API query with an empty code field.
{
"result": {
"runId": "5facbf2f2697c1b7",
"status": "continued",
"console": [
["stdout", "Tick 3\nTick 4\n"]
],
"options": null
}
}
Again.
{
"result": {
"runId": "5facbf2f2697c1b7",
"status": "finished",
"console": [
["stdout", "Tick 5\ndone\n"],
],
"options": null
}
}
Example: User input.
print("What is your name?")
name = input(">> ")
print(f"Hello, {name}!")
{
"result": {
"runId": "5facbf2f2697c1b7",
"status": "waiting-input",
"console": [
["stdout", "What is your name?\n>> "]
],
"options": {
"is_password": false
}
}
}
You should make another API query with the code field filled with the user input.
{
"result": {
"runId": "5facbf2f2697c1b7",
"status": "finished",
"console": [
["stdout", "Hello, Lablup!\n"]
],
"options": null
}
}
Auto-completion
URI:
/session/:id/complete
Method:
POST
Parameters
Parameter |
Type |
Description |
---|---|---|
|
|
The session ID. |
|
|
A string containing the code until the current cursor position. |
|
|
A string containing the code after the current cursor position. |
|
|
A string containing the content of the current line. |
|
|
An integer indicating the line number (0-based) of the cursor. |
|
|
An integer indicating the column number (0-based) in the current line of the cursor. |
Example:
{
"code": "pri",
"options": {
"post": "\nprint(\"world\")\n",
"line": "pri",
"row": 0,
"col": 3
}
}
Response
HTTP Status Code |
Description |
---|---|
200 OK |
The session has responded with the execution result. The response body contains a JSON object as described below. |
Fields |
Type |
Values |
---|---|---|
|
|
An ordered list containing the possible auto-completion matches as strings. This may be empty if the current session does not implement auto-completion or no matches have been found. Selecting a match and merging it into the code text are up to the front-end implementation. |
Example:
{
"result": [
"print",
"printf"
]
}
Interrupt
URI:
/session/:id/interrupt
Method:
POST
Parameters
Parameter |
Type |
Description |
---|---|---|
|
|
The session ID. |
Response
HTTP Status Code |
Description |
---|---|
204 No Content |
Sent the interrupt signal to the session. Note that this does not guarantee the effectiveness of the interruption. |
Code Execution (Batch Mode)
Some sessions provide the batch mode, which offers an explicit build step required for multi-module programs or compiled programming languages. In this mode, you first upload files prior to execution.
Uploading files
URI:
/session/:id/upload
Method:
POST
Parameters
Upload files to the session.
You may upload multiple files at once using the multipart form-data encoding in the request body (RFC 1867/2388).
The uploaded files are placed under the /home/work directory (which is the home directory for all sessions by default), and existing files are always overwritten.
If the filename has a directory part, non-existing directories will be auto-created.
The path may be either absolute or relative, but only sub-directories under /home/work are allowed to be created.
Hint
This API is for uploading frequently-changing source files prior to batch-mode execution. All files uploaded via this API are deleted when the session terminates. Use virtual folders to store and access larger, persistent, static data and library files for your code.
Warning
You cannot upload files to mounted virtual folders using this API directly. However, you may copy/move the generated files to virtual folders in your build script or the main program for later uses.
There are several limits on this API:
The maximum size of each file |
1 MiB |
The number of files per upload request |
20 |
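A minimal upload sketch with the requests library follows; the form field name ("src") and the endpoint URL are assumptions, and the authentication headers are omitted:

import requests

files = [
    ("src", ("main.c", open("main.c", "rb"), "text/plain")),
    ("src", ("mylib.c", open("mylib.c", "rb"), "text/plain")),
]
resp = requests.post(
    "https://your.sorna.api.endpoint/session/mysession-01/upload",  # hypothetical
    files=files,  # encoded as multipart form-data per RFC 1867/2388
)
assert resp.status_code == 204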
Response
HTTP Status Code |
Description |
---|---|
204 No Content |
Success. |
400 Bad Request |
Returned when one of the uploaded files exceeds the size limit or there are too many files. |
Executing with Build Step
URI:
/session/:id
Method:
POST
Parameters
Parameter |
Type |
Description |
---|---|---|
|
|
The session ID. |
|
|
A constant string |
|
|
Must be an empty string |
|
|
A string of client-side unique identifier for this particular run. For more details about the concept of a run, see Code Execution Model. If not given, the API server will assign a random one in the first response and the client must use it for the same run afterwards. |
|
|
Example:
{
"mode": "batch",
"options": "{batch-execution-query-object}",
"runId": "af9185c5fb0eacb2"
}
Response
HTTP Status Code |
Description |
---|---|
200 OK |
The session has responded with the execution result. The response body contains a JSON object as described below. |
Fields |
Type |
Values |
---|---|---|
|
|
Listing Files
Once files are uploaded to the session or generated during code execution, you need to identify what files actually exist in the current session. In this case, use this API to get the list of files in your compute session.
URI:
/session/:id/files
Method:
GET
Parameters
Parameter |
Type |
Description |
---|---|---|
|
|
The session ID. |
|
|
Path inside the session (default: |
Response
HTTP Status Code |
Description |
---|---|
200 OK |
Success. |
404 Not Found |
There is no such path. |
Fields |
Type |
Values |
---|---|---|
|
|
Stringified json containing list of files. |
|
|
Absolute path inside session. |
|
|
Any errors occurred during scanning the specified path. |
Downloading Files
Download files from your compute session.
The response contents are multipart with tarfile binaries. Post-processing, such as unpacking and saving them, should be handled by the client.
URI:
/session/:id/download
Method:
GET
Parameters
Parameter |
Type |
Description |
---|---|---|
|
|
The session ID. |
|
|
File paths inside the session container to download. (maximum 5 files at once) |
Response
HTTP Status Code |
Description |
---|---|
200 OK |
Success. |
Code Execution (Streaming)
The streaming mode provides a lightweight and interactive method to connect with the session containers.
Code Execution
URI:
/stream/session/:id/execute
Method:
GET
upgraded to WebSockets
This is a real-time streaming version of Code Execution (Batch Mode) and Code Execution (Query Mode), which use long polling via HTTP.
(under construction)
Added in version v4.20181215.
Terminal Emulation
URI:
/stream/session/:id/pty?app=:service
Method:
GET
upgraded to WebSockets
This endpoint provides a duplex continuous stream of JSON objects via the native WebSocket. Although WebSocket supports binary streams, we currently rely on TEXT messages only conveying JSON payloads to avoid quirks in typed array support in Javascript across different browsers.
The service name should be taken from the list of service port objects returned by the session creation API.
Note
We do not provide any legacy WebSocket emulation interfaces such as socket.io or SockJS. You need to set up your own proxy if you want to support legacy browser users.
Changed in version v4.20181215: Added the service query parameter.
Parameters
Parameter | Type | Description
---|---|---
:id | slug | The session ID.
app | str | The service name to connect.
Client-to-Server Protocol
The endpoint accepts the following four types of input messages.
Standard input stream
All inputs (UTF-8 text, which may include control characters) must be encoded as base64 strings.
{
"type": "stdin",
"chars": "<base64-encoded-raw-characters>"
}
Terminal resize
Set the terminal size to the given number of rows and columns. You should calculate these values yourself.
For instance, in web browsers you can measure the width and height of a temporarily created, invisible HTML element that contains a single ASCII character and has the same (monospace) font styles as the terminal container element.
{
"type": "resize",
"rows": 25,
"cols": 80
}
Ping
Use this to keep the session alive (preventing it from being auto-terminated by idle timeouts) by sending pings periodically while the user-side browser is open.
{
"type": "ping",
}
Restart
Use this to restart the session without affecting the working directory and usage counts. Useful when your foreground terminal program does not respond for whatever reasons.
{
"type": "restart",
}
Server-to-Client Protocol
Standard output/error stream
Since the terminal is an output device, all stdout/stderr outputs are merged into a single stream as we see in real terminals. This means there is no way to distinguish stdout and stderr on the client side, unless your session applies some special formatting to distinguish them (e.g., making all stderr outputs red).
The terminal output is compatible with xterm (including 256-color support).
{
"type": "out",
"data": "<base64-encoded-raw-characters>"
}
Server-side errors
{
"type": "error",
"data": "<human-readable-message>"
}
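The following is a minimal Python sketch of a client for this protocol using the websockets package. The wss URL, the authorization headers, the header-passing keyword (extra_headers in older websockets releases, additional_headers in newer ones), and the "shell" service name are assumptions; the message shapes follow the protocol above.

import asyncio
import base64
import json
import websockets

auth_headers = {}  # placeholder for the required authorization headers

async def run_terminal(session_id, service):
    uri = f"wss://api.backend.ai/stream/session/{session_id}/pty?app={service}"
    async with websockets.connect(uri, extra_headers=auth_headers) as ws:
        # Announce the terminal size, then type a command as base64 stdin.
        await ws.send(json.dumps({"type": "resize", "rows": 25, "cols": 80}))
        chars = base64.b64encode(b"ls -al\n").decode()
        await ws.send(json.dumps({"type": "stdin", "chars": chars}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "out":
                print(base64.b64decode(msg["data"]).decode(), end="")
            elif msg["type"] == "error":
                print("server error:", msg["data"])
                break

# asyncio.run(run_terminal("mysession-01", "shell"))  # "shell" is hypothetical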
Event Monitoring
Session Lifecycle Events
URI:
/events/session
Method:
GET
Provides a continuous message-by-message JSON object stream of session lifecycles. It uses HTML5 Server-Sent Events (SSE). Browser-based clients may use the EventSource API for convenience.
Added in version v4.20190615: First properly implemented in this version, deprecating prior unimplemented interfaces.
Changed in version v5.20191215: The URI is changed from /stream/session/_/events to /events/session.
Parameters
Parameter | Type | Description
---|---|---
sessionId | str | The session ID to monitor the lifecycle events. If set to an asterisk ("*"), it monitors events of all sessions visible to the client.
ownerAccessKey | str | (optional) The access key of the owner of the specified session, since different access keys (users) may share a same session ID for different session instances. You can specify this only when the client is either a domain admin or a superadmin.
group | str | The group name to filter the lifecycle events. If set to an asterisk ("*"), events of all groups are monitored.
Responses
The response is a continuous stream of UTF-8 text lines following the text/event-stream format.
Each event is composed of the event type and data, where the data part is encoded as JSON.
Possible event names (more events may be added in the future):
Event Name | Description
---|---
session_scheduled | The session is just scheduled from the job queue and got an agent resource allocation.
session_pulling | The session begins pulling the session image (usually from a Docker registry) to the scheduled agent.
session_creating | The session is being created as containers (or other entities in different agent backends).
session_started | The session becomes ready to execute codes.
session_terminated | The session has terminated.
When using the EventSource API, you should add event listeners as follows:
const sse = new EventSource('/events/session', {
withCredentials: true,
});
sse.addEventListener('session_started', (e) => {
console.log('session_started', JSON.parse(e.data));
});
Note
The EventSource API must be used with the session-based authentication mode (when the endpoint is a console-server), which uses browser cookies. Otherwise, you need to manually implement the event stream parser using the standard fetch API against the manager server.
The event data contains a JSON string like this (more fields may be added in the future):
Field Name | Description
---|---
sessionId | The source session ID.
ownerAccessKey | The access key that owns the session.
reason | A short string that describes why the event happened. This may be, for example, "self-terminated" as in the example below.
result | Only present for session_terminated events.
{
"sessionId": "mysession-01",
"ownerAccessKey": "MYACCESSKEY",
"reason": "self-terminated",
"result": "SUCCESS"
}
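For non-browser clients, the SSE stream can also be parsed by hand. The following Python sketch uses requests streaming; the endpoint, headers, and the sessionId parameter name are assumptions as in the earlier sketches.

import json
import requests

ENDPOINT = "https://api.backend.ai"
auth_headers = {}  # placeholder for the required authorization headers

def watch_session(session_id):
    resp = requests.get(
        f"{ENDPOINT}/events/session",
        params={"sessionId": session_id},  # parameter name assumed
        headers=auth_headers,
        stream=True,
    )
    event, data = None, []
    for line in resp.iter_lines(decode_unicode=True):
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data.append(line.split(":", 1)[1].strip())
        elif not line and event:  # a blank line terminates one event
            print(event, json.loads("\n".join(data)))
            event, data = None, []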
Background Task Progress Events
URI:
/events/background-task
Method:
GET
for Server-Sent Events
Added in version v5.20191215.
Parameters
Parameter | Type | Description
---|---|---
taskId | slug | The background task ID to monitor the progress and completion.
Responses
The response is a continuous stream of UTF-8 text lines following the text/event-stream format.
Each event is composed of the event type and data, where the data part is encoded as JSON.
Possible event names (more events may be added in the future):
Event Name | Description
---|---
task_updated | Updates for the progress. This can be generated many times during the background task execution.
task_done | The background task is successfully completed.
task_fail | The background task has failed. Check the message field for details.
task_cancel | The background task is cancelled in the middle. Usually this means that the server is being shut down for maintenance.
server_close | This event indicates an explicit server-initiated close of the event monitoring connection, which is raised just after the background task is either done/failed/cancelled. The client should not reconnect because there is nothing more to monitor about the given task.
The event data (per-line JSON objects) include the following fields:
Field Name | Type | Description
---|---|---
task_id | str | The background task ID.
current_progress | int | The current progress value. Only meaningful for task_updated events.
total_progress | int | The total progress count. Only meaningful for task_updated events.
message | str | An optional human-readable message indicating what the task is doing. It may be null.
Check out the session lifecycle events API for an example client-side JavaScript implementation to handle text/event-stream responses.
If you make the request for tasks that have already finished, it may return either “404 Not Found” (the result has expired or the task ID is invalid) or a single event which is one of task_done, task_fail, or task_cancel, followed by immediate disconnection of the response.
Currently, the results for finished tasks may be kept for up to one day (24 hours).
Service Ports (aka Service Proxies)
The service ports API provides WebSocket-based authenticated and encrypted tunnels to network-facing services (“container services”) provided by the kernel container. The main advantage of this feature is that all application-specific network traffic is wrapped in a standard WebSocket API (no need to open extra ports of the manager). It also hides the container from the client and the client from the container, offering an extra level of security.
The diagram showing how tunneling of TCP connections via WebSockets works.
As Fig. 7 shows, all TCP traffic to a container service can be sent through a WebSocket connection to the following API endpoints. A single WebSocket connection corresponds to a single TCP connection to the service, and there may be multiple concurrent WebSocket connections representing multiple TCP connections to the service. It is the client’s responsibility to accept arbitrary TCP connections from users (e.g., web browsers) with proper authorization for multi-user setups and wrap those as WebSocket connections to the following APIs.
When the first connection is initiated, the Backend.AI Agent running the designated kernel container signals the kernel runner daemon in the container to start the designated service. It shortly waits for the in-container port opening and then delivers the first packet to the service. After initialization, all WebSocket payloads are delivered back and forth just like normal TCP packets. Note that the WebSocket message type must be BINARY.
The container service sees the packets from the manager and never knows the real origin of the packets, unless the service-level protocol itself conveys such client-side information. Likewise, the client never knows the container’s IP address (though the port numbers are included in service port objects returned by the session creation API).
Note
Currently non-TCP (e.g., UDP) services are not supported.
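To make the flow concrete, here is a hedged Python sketch of a client-side relay that accepts local TCP connections and wraps each of them into one WebSocket connection to the TCP service proxy described below. Authentication, error handling, and connection teardown are simplified; the wss URL and header-passing keyword are assumptions as in the earlier sketches.

import asyncio
import websockets

auth_headers = {}  # placeholder for the required authorization headers

async def serve_relay(local_port, kernel_id, service):
    uri = f"wss://api.backend.ai/stream/kernel/{kernel_id}/tcpproxy?app={service}"

    async def handle(reader, writer):
        async with websockets.connect(uri, extra_headers=auth_headers) as ws:
            async def tcp_to_ws():
                while chunk := await reader.read(8192):
                    await ws.send(chunk)  # bytes payloads become BINARY frames
            async def ws_to_tcp():
                async for payload in ws:
                    writer.write(payload)
                    await writer.drain()
            await asyncio.gather(tcp_to_ws(), ws_to_tcp())

    server = await asyncio.start_server(handle, "127.0.0.1", local_port)
    async with server:
        await server.serve_forever()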
Service Proxy (HTTP)
URI:
/stream/kernel/:id/httpproxy?app=:service
Method:
GET
upgraded to WebSockets
The service proxy API allows clients to directly connect to service daemons running inside compute sessions, such as Jupyter and TensorBoard.
The service name should be taken from the list of service port objects returned by the session creation API.
Added in version v4.20181215.
Parameters
Parameter | Type | Description
---|---|---
:id | slug | The kernel ID.
app | str | The service name to connect.
Service Proxy (TCP)
URI:
/stream/kernel/:id/tcpproxy?app=:service
Method:
GET
upgraded to WebSockets
This is the TCP version of service proxy, so that client users can connect to native services running inside compute sessions, such as SSH.
The service name should be taken from the list of service port objects returned by the session creation API.
Added in version v4.20181215.
Parameters
Parameter | Type | Description
---|---|---
:id | slug | The kernel ID.
app | str | The service name to connect.
Resource Presets
Resource presets provide simple storage for pre-configured resource slots and a dynamic checker for the allocatability of given presets before actually calling the kernel creation API.
To add/modify/delete resource presets, you need to use the admin GraphQL API.
Added in version v4.20190315.
Listing Resource Presets
Returns the list of admin-configured resource presets.
URI:
/resource/presets
Method:
GET
Parameters
None.
Response
HTTP Status Code | Description
---|---
200 OK | The preset list is returned.

Fields | Type | Values
---|---|---
presets | list | The list of Resource Preset Objects.
Checking Allocatability of Resource Presets
Returns the current keypair and scaling-group's resource limits in addition to the list of admin-configured resource presets. It also checks the allocatability of the resource presets and adds an allocatable boolean field to each preset item.
URI:
/resource/check-presets
Method:
POST
Parameters
None.
Response
HTTP Status Code | Description
---|---
200 OK | The preset list is returned.
401 Unauthorized | The client is not authorized.

Fields | Type | Values
---|---|---
keypair_limits | object | The maximum amount of total resource slots allowed for the current access key. It may contain infinity values as the string “Infinity”.
keypair_using | object | The amount of total resource slots used by the current access key.
keypair_remaining | object | The amount of total resource slots remaining for the current access key. It may contain infinity values as the string “Infinity”.
scaling_group_remaining | object | The amount of total resource slots remaining for the current scaling group. It may contain infinity values as the string “Infinity” if the server is configured for auto-scaling.
presets | list | The list of Resource Preset Objects, but with an extra boolean field allocatable.
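A hedged example of calling this API and filtering by the allocatable field; the response field names follow the table above, and the preset item fields (name, resource_slots, allocatable) are assumptions based on this section.

import requests

ENDPOINT = "https://api.backend.ai"
auth_headers = {}  # placeholder for the required authorization headers

resp = requests.post(f"{ENDPOINT}/resource/check-presets", headers=auth_headers)
resp.raise_for_status()
for preset in resp.json()["presets"]:
    mark = "o" if preset.get("allocatable") else "x"
    print(mark, preset["name"], preset["resource_slots"])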
Virtual Folders
Virtual folders provide access to shared, persistent, and reusable files across different sessions.
You can mount virtual folders when creating new sessions, and use them like a plain directory on the local filesystem. Of course, reads/writes to virtual folder contents may show degraded performance compared to the main scratch directory (usually /home/work in most kernels) because internally they use a networked file system.
You may also share your virtual folders with other users by inviting them and granting them proper permissions. Currently, there are three levels of permissions: read-only, read-write, and read-write-delete. They are represented by the short strings 'ro', 'rw', and 'wd', respectively. The owner of a virtual folder has read-write-delete permission for the folder.
Listing Virtual Folders
Returns the list of virtual folders created by the current keypair.
URI:
/folders
Method:
GET
Parameters
Parameter | Type | Description
---|---|---
 | | (optional) If this parameter is …
group_id | str | (optional) If this parameter is set, it returns the virtual folders that belong to the specified group. It has no effect on user-type virtual folders.
Response
HTTP Status Code | Description
---|---
200 OK | Success.

Fields | Type | Values
---|---|---
(root) | list | A list of Virtual Folder List Item Objects.
Example:
[
{
"name": "myfolder",
"id": "b4b1b16c-d07f-4f1f-b60e-da9449aa60a6",
"host": "local:volume1",
"usage_mode": "general",
"created_at": "2020-11-28 13:30:30.912056+00",
"is_owner": "true",
"permission": "rw",
"user": "dfa9da54-4b28-432f-be29-c0d680c7a412",
"group": null,
"creator": "admin@lablup.com",
"user_email": "admin@lablup.com",
"group_name": null,
"ownership_type": "user",
"unmanaged_path": null,
"cloneable": "false"
}
]
Listing Virtual Folder Hosts
Returns the list of available host names where the current keypair can create new virtual folders.
Added in version v4.20190315.
URI:
/folders/_/hosts
Method:
GET
Parameters
None
Response
HTTP Status Code | Description
---|---
200 OK | Success.

Fields | Type | Values
---|---|---
default | str | The default virtual folder host.
allowed | list[str] | The list of available virtual folder hosts.
Example:
{
"default": "seoul:nfs1",
"allowed": ["seoul:nfs1", "seoul:nfs2", "seoul:cephfs1"]
}
Creating a Virtual Folder
URI:
/folders
Method:
POST
Creates a virtual folder associated with the current API key.
Parameters
Parameter | Type | Description
---|---|---
name | str | The human-readable name of the virtual folder.
host | str | (optional) The name of the virtual folder host.
usage_mode | str | (optional) The purpose of the virtual folder (e.g., "general" as in the examples).
permission | str | (optional) The default share permission of the virtual folder. The owner of the virtual folder always has "wd" (read-write-delete) permission.
group | str | (optional) If this parameter is set, it creates a group-type virtual folder. If empty, it creates a user-type virtual folder.
quota | int | (optional) Set the quota of the virtual folder in bytes. Note, however, that the quota is only supported on XFS filesystems. Other filesystems that do not support per-directory quota will ignore this parameter.
Example:
{
"name": "My Data",
"host": "seoul:nfs1"
}
Response
HTTP Status Code | Description
---|---
201 Created | The folder is successfully created.
400 Bad Request | The name is malformed or duplicates one of your existing virtual folders.
406 Not Acceptable | You have exceeded internal limits of virtual folders (e.g., the maximum number of folders you can have).

Fields | Type | Values
---|---|---
id | slug | The unique folder ID used for later API calls.
name | str | The human-readable name of the created virtual folder.
host | str | The name of the virtual folder host where the new folder is created.
Example:
{
"id": "aef1691db3354020986d6498340df13c",
"name": "My Data",
"host": "nfs1",
"usage_mode": "general",
"permission": "rw",
"creator": "admin@lablup.com",
"ownership_type": "user",
"user": "dfa9da54-4b28-432f-be29-c0d680c7a412",
"group": ""
}
Getting Virtual Folder Information
URI:
/folders/:name
Method:
GET
Retrieves information about a virtual folder. For performance reasons, the returned information may not be real-time; it is usually updated every few seconds on the server side.
Parameters
Parameter | Type | Description
---|---|---
:name | str | The human-readable name of the virtual folder.

Response

HTTP Status Code | Description
---|---
200 OK | The information is successfully returned.
404 Not Found | There is no such folder, or you may not have proper permission to access the folder.

Fields | Type | Values
---|---|---
(root) | object | A Virtual Folder Item Object.
Deleting Virtual Folder
URI:
/folders/:name
Method:
DELETE
This immediately deletes all contents of the given virtual folder and makes the folder unavailable for future mounts.
Danger
If there are running kernels that have mounted the deleted virtual folder, those kernels are likely to break!
Warning
There is NO way to get back the contents once this API is invoked.
Parameters
Parameter | Description
---|---
:name | The human-readable name of the virtual folder.

Response

HTTP Status Code | Description
---|---
204 No Content | The folder is successfully destroyed.
404 Not Found | There is no such folder, or you may not have proper permission to delete the folder.
Rename a Virtual Folder
URI:
/folders/:name/rename
Method:
POST
Rename a virtual folder associated with the current API key.
Parameters
Parameter | Type | Description
---|---|---
:name | str | The human-readable name of the virtual folder.
new_name | str | The new virtual folder name.

Response

HTTP Status Code | Description
---|---
201 Created | The folder is successfully renamed.
404 Not Found | There is no such folder, or you may not have proper permission to rename the folder.
Listing Files in Virtual Folder
Returns the list of files in a virtual folder associated with current keypair.
URI:
/folders/:name/files
Method:
GET
Parameters
Parameter | Type | Description
---|---|---
:name | str | The human-readable name of the virtual folder.
path | str | Path inside the virtual folder (default: root).

Response

HTTP Status Code | Description
---|---
200 OK | Success.
404 Not Found | There is no such path, or you may not have proper permission to access the folder.

Fields | Type | Values
---|---|---
files | list | List of Virtual Folder File Objects.
Uploading a File to Virtual Folder
Upload a local file to a virtual folder associated with the current keypair. Internally, the Manager delegates the upload to a Backend.AI Storage-Proxy service. A JSON web token (JWT) is used for the authentication of the request.
URI:
/folders/:name/request-upload
Method:
POST
Warning
If a file with the same name already exists in the virtual folder, it will be overwritten without warning.
Parameters
Parameter | Type | Description
---|---|---
:name | str | The human-readable name of the virtual folder.
path | str | Path of the local file to upload.
size | int | The total size of the local file to upload.

Response

HTTP Status Code | Description
---|---
200 OK | Success.

Fields | Type | Values
---|---|---
token | str | JSON web token for the authentication of the upload session to the Storage-Proxy service.
url | str | Request URL for the Storage-Proxy. The client should use this URL to upload the file.
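A sketch of the first step of this flow in Python is shown below. The second step, the actual transfer to the Storage-Proxy, uses the Storage-Proxy's own resumable upload protocol and is wrapped by the official client SDK, so it is not sketched here; parameter and field names follow the tables above.

import os
import requests

ENDPOINT = "https://api.backend.ai"
auth_headers = {}  # placeholder for the required authorization headers

def request_upload(folder, local_path):
    resp = requests.post(
        f"{ENDPOINT}/folders/{folder}/request-upload",
        json={"path": local_path, "size": os.path.getsize(local_path)},
        headers=auth_headers,
    )
    resp.raise_for_status()
    grant = resp.json()
    # grant["token"] authenticates the upload session against the
    # Storage-Proxy reachable at grant["url"].
    return grant["token"], grant["url"]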
Creating New Directory in Virtual Folder
Create a new directory in the virtual folder associated with the current keypair. This API recursively creates parent directories if they do not exist.
URI:
/folders/:name/mkdir
Method:
POST
Warning
If a directory with the same name already exists in the virtual folder, it may be overwritten without warning.
Parameters
Parameter | Type | Description
---|---|---
:name | str | The human-readable name of the virtual folder.
path | str | The relative path of a new folder to create inside the virtual folder.
parents | bool | (optional) If true, parent directories are created as needed.
exist_ok | bool | (optional) If a directory with the same name already exists, overwrite it without an error.

Response

HTTP Status Code | Description
---|---
201 Created | Success.
400 Bad Request | A file (not a directory) with the same name already exists.
404 Not Found | There is no such folder, or you may not have proper permission to write into the folder.
Downloading a File or a Directory from a Virtual Folder
Download a file or a directory from a virtual folder associated with the current keypair. Internally, the Manager delegates the download to a Backend.AI Storage-Proxy service. A JSON web token (JWT) is used for the authentication of the request.
Added in version v4.20190315.
URI:
/folders/:name/request-download
Method:
POST
Parameters
Parameter | Type | Description
---|---|---
:name | str | The human-readable name of the virtual folder.
path | str | The path to a file or a directory inside the virtual folder to download.
archive | bool | (optional) If this parameter is true, a directory is downloaded as an archive.

Response

HTTP Status Code | Description
---|---
200 OK | Success.
404 Not Found | File not found, or you may not have proper permission to access the folder.

Fields | Type | Values
---|---|---
token | str | JSON web token for the authentication of the download session to the Storage-Proxy service.
url | str | Request URL for the Storage-Proxy. The client should use this URL to download the file.
Deleting Files in Virtual Folder
This deletes files inside a virtual folder.
Warning
There is NO way to get back the files once this API is invoked.
URI:
/folders/:name/delete-files
Method:
DELETE
Parameters
Parameter | Type | Description
---|---|---
:name | str | The human-readable name of the virtual folder.
files | list[str] | File paths inside the virtual folder to delete.
recursive | bool | If set to true, directories are deleted recursively. The default is false.

Response

HTTP Status Code | Description
---|---
200 OK | Success.
400 Bad Request | You tried to delete a directory without setting the recursive option to true.
404 Not Found | There is no such folder, or you may not have proper permission to delete files in the folder.
Rename a File in Virtual Folder
Rename a file inside a virtual folder.
URI:
/folders/:name/rename-file
Method:
POST
Parameters
Parameter | Type | Description
---|---|---
:name | str | The human-readable name of the virtual folder.
target_path | str | The relative path of the target file or directory.
new_name | str | The new name of the file or directory.
is_dir | bool | Flag that indicates whether the target is a directory.

Response

HTTP Status Code | Description
---|---
200 OK | Success.
400 Bad Request | You tried to rename a directory without setting the is_dir option to true.
404 Not Found | There is no such folder, or you may not have proper permission to rename files in the folder.
Listing Invitations for Virtual Folder
Returns the list of pending invitations that the requesting user has received, i.e., the invitations sent by other users.
URI:
/folders/invitations/list
Method:
GET
Parameters
This API does not need any parameter.
Response
HTTP Status Code | Description
---|---
200 OK | Success.

Fields | Type | Values
---|---|---
invitations | list | A list of Virtual Folder Invitation Objects.
Creating an Invitation
Invite other users to share a virtual folder with proper permissions. If a user is already invited, this API neither creates a new invitation nor updates the permission of the existing invitation.
URI:
/folders/:name/invite
Method:
POST
Parameters
Parameter | Type | Description
---|---|---
:name | str | The human-readable name of the virtual folder.
perm | str | The permission to grant to the invitees.
user_ids | list[str] | A list of user emails to invite.

Response

HTTP Status Code | Description
---|---
200 OK | Success.
400 Bad Request | No invitee is given.
404 Not Found | There is no such invitation.

Fields | Type | Values
---|---|---
invited_ids | list[str] | A list of invited user emails.
Accepting an Invitation
Accept an invitation and receive the permission to the virtual folder as specified in the invitation.
URI:
/folders/invitations/accept
Method:
POST
Parameters
Parameter | Type | Description
---|---|---
inv_id | slug | The unique invitation ID.

Response

HTTP Status Code | Description
---|---
200 OK | Success.
400 Bad Request | The name of the target virtual folder duplicates one of your existing virtual folders.
404 Not Found | There is no such invitation.
Rejecting an Invitation
Reject an invitation.
URI:
/folders/invitations/delete
Method:
DELETE
Parameters
Parameter | Type | Description
---|---|---
inv_id | slug | The unique invitation ID.

Response

HTTP Status Code | Description
---|---
200 OK | Success.
404 Not Found | There is no such invitation.

Fields | Type | Values
---|---|---
msg | str | Detail message for the invitation deletion.
Listing Sent Invitations
Returns the list of virtual folder invitations the requesting user has sent. This does not include invitations that are already accepted or rejected.
URI:
/folders/invitations/list-sent
Method:
GET
Parameters
This API does not need any parameter.
Response
HTTP Status Code | Description
---|---
200 OK | Success.

Fields | Type | Values
---|---|---
invitations | list | A list of Virtual Folder Invitation Objects.
Updating an Invitation
Update the permission of an already-sent, but not accepted or rejected, invitation.
URI:
/folders/invitations/update/:inv_id
Method:
POST
Parameters
Parameter | Type | Description
---|---|---
:inv_id | slug | The unique invitation ID.
perm | str | The permission to grant to the invitee.

Response

HTTP Status Code | Description
---|---
200 OK | Success.
400 Bad Request | No permission is given.
404 Not Found | There is no such invitation.

Fields | Type | Values
---|---|---
msg | str | An update message string.
Clone a Virtual Folder
Clone a virtual folder.
URI:
/folders/:name/clone
Method:
POST
Parameters
Parameter | Type | Description
---|---|---
:name | str | The human-readable name of the source virtual folder.
cloneable | bool | (optional) If true, the new virtual folder is also set as cloneable.
target_name | str | The name of the new virtual folder.
target_host | str | The target host volume of the new virtual folder.
usage_mode | str | (optional) The purpose of the new virtual folder (e.g., "general").
permission | str | (optional) The default share permission of the new virtual folder. The owner of the virtual folder always has "wd" (read-write-delete) permission.
Response
HTTP Status Code | Description
---|---
200 OK | Success.
400 Bad Request | No target name, no target host, or no permission.
403 Forbidden | The source virtual folder is not permitted to be cloned.
404 Not Found | There is no such virtual folder.
Fields | Type | Values
---|---|---
(root) | object | The Virtual Folder Item Object of the cloned folder.
Example:
{
"name": "my cloned folder",
"id": "b4b1b16c-d07f-4f1f-b60e-da9449aa60a6",
"host": "local:volume1",
"usage_mode": "general",
"created_at": "2020-11-28 13:30:30.912056+00",
"is_owner": "true",
"permission": "rw",
"user": "dfa9da54-4b28-432f-be29-c0d680c7a412",
"group": null,
"creator": "admin@lablup.com",
"user_email": "admin@lablup.com",
"group_name": null,
"ownership_type": "user",
"unmanaged_path": null,
"cloneable": "false"
}
Code Execution Model
The core of the user API is the execute call which allows clients to execute user-provided codes in isolated compute sessions (aka kernels). Each session is managed by a kernel runtime, whose implementation is language-specific. A runtime is often a containerized daemon that interacts with the Backend.AI agent via our internal ZeroMQ protocol. In some cases, kernel runtimes may be just proxies to other code execution services instead of actual executor daemons.
Inside each compute session, a client may perform multiple runs. Each run is for executing different code snippets (the query mode) or different sets of source files (the batch mode). The client often has to call the execute API multiple times to finish a single run. It is completely legal to mix query-mode runs and batch-mode runs inside the same session, given that the kernel runtime supports both modes.
To distinguish different runs which may be overlapped, the client must provide the same run ID to all execute calls during a single run. The run ID should be unique for each run and can be an arbitrary random string. If the run ID is not provided by the client at the first execute call of a run, the API server will assign a random one and inform the client of it via the first response. Normally, if two or more runs overlap, they are processed in FIFO order using an internal queue, but they may be processed in parallel if the kernel runtime supports parallel processing. Note that the API server may raise a timeout error and cancel the run if the waiting time exceeds a certain limit.
In the query mode, the runtime context (e.g., global variables) is usually preserved across subsequent runs, but this is not guaranteed by the API itself; it is up to the kernel runtime implementation.
The state diagram of a “run” with the execute API.
The execute API accepts 4 arguments: mode, runId, code, and options (opts).
It returns an Execution Result Object encoded as JSON.

Depending on the value of the status field in the returned Execution Result Object, the client must perform another subsequent execute call with appropriate arguments or stop. Fig. 8 shows all possible states and the transitions between them via the status field value.

If status is "finished", the client should stop.

If status is "continued", the client should make another execute API call with the code field set to an empty string and the mode field set to "continue". Continuation happens when the user code runs longer than a few seconds (to allow the client to show its progress), or when it requires an extra step to finish the run cycle.

If status is "clean-finished" or "build-finished" (this happens in the batch mode only), the client should make the same continuation call. Since cleanup is performed before every build, the client will always receive "build-finished" after the "clean-finished" status. All outputs prior to the "build-finished" status return are from the build program, and all subsequent outputs are from the executed program that was built. Note that even when the exitCode value is non-zero (failed), the client must continue to complete the run cycle.

If status is "waiting-input", the client should make another execute API call with the code field set to the user-input text and the mode field set to "input". This happens when the user code calls interactive input() functions. Until you send the user input, the current run is blocked. You may use modal dialogs or other input forms (e.g., HTML input) to retrieve user inputs. When the server receives the user input, the kernel's input() returns the given value. Note that each kernel runtime may provide different ways to trigger this interactive input cycle, or may not provide it at all.

When each call returns, the console field in the Execution Result Object has the console logs captured since the previous call. Check out the following section for details.
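Putting the state machine together, a minimal client-side run loop might look like the following Python sketch. It assumes the HTTP response body wraps the Execution Result Object in a result field, and it renders only stdout/stderr console items for brevity; the endpoint and headers are placeholders as before.

import uuid
import requests

ENDPOINT = "https://api.backend.ai"
auth_headers = {}  # placeholder for the required authorization headers

def execute(session_id, payload):
    # POST /session/:id and return the Execution Result Object.
    resp = requests.post(f"{ENDPOINT}/session/{session_id}",
                         json=payload, headers=auth_headers)
    resp.raise_for_status()
    return resp.json()["result"]  # assumption: wrapped in "result"

def run_query(session_id, code):
    run_id = uuid.uuid4().hex[:16]  # client-generated unique run ID
    payload = {"mode": "query", "runId": run_id, "code": code}
    while True:
        result = execute(session_id, payload)
        for item_type, item_data in result["console"]:
            if item_type in ("stdout", "stderr"):
                print(item_data, end="")
        status = result["status"]
        if status == "finished":
            break
        elif status in ("continued", "clean-finished", "build-finished"):
            payload = {"mode": "continue", "runId": run_id, "code": ""}
        elif status == "waiting-input":
            payload = {"mode": "input", "runId": run_id, "code": input()}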
Handling Console Output
The console output consists of a list of tuple pairs of item type and item data. The item type is one of "stdout", "stderr", "media", "html", or "log".

When the item type is "stdout" or "stderr", the item data is the standard I/O stream output as a (non-escaped) UTF-8 string. The total length of either stream is limited to 524,288 Unicode characters per execute API call; all excessive outputs are truncated. The stderr stream often includes language-specific tracebacks of (unhandled) exceptions or errors that occurred in the user code. If the user code generates a mixture of stdout and stderr, the print ordering is preserved and each contiguous block of stdout/stderr becomes a separate item in the console output list, so that the client can reconstruct the same console output by sequentially rendering the items.
Note
The text in the stdout/stderr items may contain arbitrary terminal control sequences such as ANSI color codes and cursor/line manipulations. It is the user's job to strip them out or implement some sort of terminal emulation.
Tip
Since the console texts are not escaped, the client should take care of rendering and escaping depending on the UI implementation. For example, use a <pre> element, replace newlines with <br>, or apply the white-space: pre CSS style when rendering as HTML. An easy way to escape the text safely is to use the insertAdjacentText() DOM API.
When the item type is "media"
, the item data is a pair of the MIME type and the content data.
If the MIME type is text-based (e.g., "text/plain"
) or XML-based (e.g., "image/svg+xml"
), the content is just a string that represent the content.
Otherwise, the data is encoded as a data URI format (RFC 2397).
You may use backend.ai-media library to handle this field in Javascript on web-browsers.
When the item type is "html"
, the item data is a partial HTML document string, such as a table to show tabular data.
If you are implementing a web-based front-end, you may use it directly to the standard DOM API, for instance, consoleElem.insertAdjacentHTML(value, "beforeend")
.
When the item type is "log"
, the item data is a 4-tuple of the log level, the timestamp in the ISO 8601 format, the logger name and the log message string.
The log level may be one of "debug"
, "info"
, "warning"
, "error"
, or "fatal"
.
You may use different colors/formatting by the log level when printing the log message.
Not every kernel runtime supports this rich logging facility.
Manager GraphQL API
Backend.AI GraphQL API is for developing in-house management consoles.
There are two modes of operation:
Full admin access: you can query all information of all users. It requires a privileged keypair.
Restricted owner access: you can query only your own information. The server processes your request in this mode if you use your own plain keypair.
Warning
The Admin API only accepts authenticated requests.
Tip
To test and debug with the Admin API easily, try the proxy mode of the official Python client. It provides an insecure (non-SSL, non-authenticated) local HTTP proxy where all the required authorization headers are attached from the client configuration. Using this you do not have to add any custom header configurations to your favorite API development tools such as GraphiQL.
Domain Management
Query Schema
type Domain {
name: String
description: String
is_active: Boolean
created_at: DateTime
modified_at: DateTime
total_resource_slots: JSONString # ResourceSlot
allowed_vfolder_hosts: [String]
allowed_docker_registries: [String]
integration_id: String
scaling_groups: [String]
}
type Query {
domain(name: String): Domain
domains(is_active: Boolean): [Domain]
}
Mutation Schema
input DomainInput {
description: String
is_active: Boolean
total_resource_slots: JSONString # ResourceSlot
allowed_vfolder_hosts: [String]
allowed_docker_registries: [String]
integration_id: String
}
input ModifyDomainInput {
name: String
description: String
is_active: Boolean
total_resource_slots: JSONString # ResourceSlot
allowed_vfolder_hosts: [String]
allowed_docker_registries: [String]
integration_id: String
}
type CreateDomain {
ok: Boolean
msg: String
domain: Domain
}
type ModifyDomain {
ok: Boolean
msg: String
}
type DeleteDomain {
ok: Boolean
msg: String
}
type Mutation {
create_domain(name: String!, props: DomainInput!): CreateDomain
modify_domain(name: String!, props: ModifyDomainInput!): ModifyDomain
delete_domain(name: String!): DeleteDomain
}
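As a hedged usage sketch, a domain can be created by POSTing the create_domain mutation to the single GraphQL endpoint (see Basics of GraphQL below for the request convention). The auth_headers placeholder stands for a privileged keypair's signed headers, and the domain name/props values are illustrative only.

import requests

auth_headers = {}  # placeholder for admin-signed authorization headers

mutation = """
mutation($name: String!, $props: DomainInput!) {
  create_domain(name: $name, props: $props) { ok msg }
}
"""
variables = {"name": "research", "props": {"description": "R&D teams", "is_active": True}}
resp = requests.post(
    "https://api.backend.ai/admin/graphql",
    json={"query": mutation, "variables": variables},
    headers=auth_headers,
)
print(resp.json())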
Scaling Group Management
Query Schema
type ScalingGroup {
name: String
description: String
is_active: Boolean
created_at: DateTime
driver: String
driver_opts: JSONString
scheduler: String
scheduler_opts: JSONString
}
type Query {
scaling_group(name: String): ScalingGroup
scaling_groups(name: String, is_active: Boolean): [ScalingGroup]
scaling_groups_for_domain(domain: String!, is_active: Boolean): [ScalingGroup]
scaling_groups_for_user_group(user_group: String!, is_active: Boolean): [ScalingGroup]
scaling_groups_for_keypair(access_key: String!, is_active: Boolean): [ScalingGroup]
}
Mutation Schema
input ScalingGroupInput {
description: String
is_active: Boolean
driver: String!
driver_opts: JSONString
scheduler: String!
scheduler_opts: JSONString
}
input ModifyScalingGroupInput {
description: String
is_active: Boolean
driver: String
driver_opts: JSONString
scheduler: String
scheduler_opts: JSONString
}
type CreateScalingGroup {
ok: Boolean
msg: String
scaling_group: ScalingGroup
}
type ModifyScalingGroup {
ok: Boolean
msg: String
}
type DeleteScalingGroup {
ok: Boolean
msg: String
}
type AssociateScalingGroupWithDomain {
ok: Boolean
msg: String
}
type AssociateScalingGroupWithKeyPair {
ok: Boolean
msg: String
}
type AssociateScalingGroupWithUserGroup {
ok: Boolean
msg: String
}
type DisassociateAllScalingGroupsWithDomain {
ok: Boolean
msg: String
}
type DisassociateAllScalingGroupsWithGroup {
ok: Boolean
msg: String
}
type DisassociateScalingGroupWithDomain {
ok: Boolean
msg: String
}
type DisassociateScalingGroupWithKeyPair {
ok: Boolean
msg: String
}
type DisassociateScalingGroupWithUserGroup {
ok: Boolean
msg: String
}
type Mutation {
create_scaling_group(name: String!, props: ScalingGroupInput!): CreateScalingGroup
modify_scaling_group(name: String!, props: ModifyScalingGroupInput!): ModifyScalingGroup
delete_scaling_group(name: String!): DeleteScalingGroup
associate_scaling_group_with_domain(domain: String!, scaling_group: String!): AssociateScalingGroupWithDomain
associate_scaling_group_with_user_group(scaling_group: String!, user_group: String!): AssociateScalingGroupWithUserGroup
associate_scaling_group_with_keypair(access_key: String!, scaling_group: String!): AssociateScalingGroupWithKeyPair
disassociate_scaling_group_with_domain(domain: String!, scaling_group: String!): DisassociateScalingGroupWithDomain
disassociate_scaling_group_with_user_group(scaling_group: String!, user_group: String!): DisassociateScalingGroupWithUserGroup
disassociate_scaling_group_with_keypair(access_key: String!, scaling_group: String!): DisassociateScalingGroupWithKeyPair
disassociate_all_scaling_groups_with_domain(domain: String!): DisassociateAllScalingGroupsWithDomain
disassociate_all_scaling_groups_with_group(user_group: String!): DisassociateAllScalingGroupsWithGroup
}
Resource Preset Management
Query Schema
type ResourcePreset {
name: String
resource_slots: JSONString
shared_memory: BigInt
}
type Query {
resource_preset(name: String!): ResourcePreset
resource_presets: [ResourcePreset]
}
Mutation Schema
input CreateResourcePresetInput {
resource_slots: JSONString
shared_memory: String
}
type CreateResourcePreset {
ok: Boolean
msg: String
resource_preset: ResourcePreset
}
input ModifyResourcePresetInput {
resource_slots: JSONString
shared_memory: String
}
type ModifyResourcePreset {
ok: Boolean
msg: String
}
type DeleteResourcePreset {
ok: Boolean
msg: String
}
type Mutation {
create_resource_preset(name: String!, props: CreateResourcePresetInput!): CreateResourcePreset
modify_resource_preset(name: String!, props: ModifyResourcePresetInput!): ModifyResourcePreset
delete_resource_preset(name: String!): DeleteResourcePreset
}
Agent Monitoring
Query Schema
type Agent {
id: ID
status: String
status_changed: DateTime
region: String
scaling_group: String
available_slots: JSONString # ResourceSlot
occupied_slots: JSONString # ResourceSlot
addr: String
first_contact: DateTime
lost_at: DateTime
live_stat: JSONString
version: String
compute_plugins: JSONString
compute_containers(status: String): [ComputeContainer]
# legacy fields
mem_slots: Int
cpu_slots: Float
gpu_slots: Float
tpu_slots: Float
used_mem_slots: Int
used_cpu_slots: Float
used_gpu_slots: Float
used_tpu_slots: Float
cpu_cur_pct: Float
mem_cur_bytes: Float
}
type Query {
agent_list(
limit: Int!,
offset: Int!,
order_key: String,
order_asc: Boolean,
scaling_group: String,
status: String,
): PaginatedList[Agent]
}
User Management
Query Schema
type User {
uuid: UUID
username: String
email: String
password: String
need_password_change: Boolean
full_name: String
description: String
is_active: Boolean
created_at: DateTime
domain_name: String
role: String
groups: [UserGroup]
}
type UserGroup { # shorthand reference to Group
id: UUID
name: String
}
type Query {
user(domain_name: String, email: String): User
user_from_uuid(domain_name: String, user_id: String): User
users(domain_name: String, group_id: String, is_active: Boolean): [User]
}
Mutation Schema
input UserInput {
username: String!
password: String!
need_password_change: Boolean!
full_name: String
description: String
is_active: Boolean
domain_name: String!
role: String
group_ids: [String]
}
input ModifyUserInput {
username: String
password: String
need_password_change: Boolean
full_name: String
description: String
is_active: Boolean
domain_name: String
role: String
group_ids: [String]
}
type CreateUser {
ok: Boolean
msg: String
user: User
}
type ModifyUser {
ok: Boolean
msg: String
user: User
}
type DeleteUser {
ok: Boolean
msg: String
}
type Mutation {
create_user(email: String!, props: UserInput!): CreateUser
modify_user(email: String!, props: ModifyUserInput!): ModifyUser
delete_user(email: String!): DeleteUser
}
Group Management
Query Schema
type Group {
id: UUID
name: String
description: String
is_active: Boolean
created_at: DateTime
modified_at: DateTime
domain_name: String
total_resource_slots: JSONString # ResourceSlot
allowed_vfolder_hosts: [String]
integration_id: String
scaling_groups: [String]
}
type Query {
group(id: String!): Group
groups(domain_name: String, is_active: Boolean): [Group]
}
Mutation Schema
input GroupInput {
description: String
is_active: Boolean
domain_name: String!
total_resource_slots: JSONString # ResourceSlot
allowed_vfolder_hosts: [String]
integration_id: String
}
input ModifyGroupInput {
name: String
description: String
is_active: Boolean
domain_name: String
total_resource_slots: JSONString # ResourceSlot
user_update_mode: String
user_uuids: [String]
allowed_vfolder_hosts: [String]
integration_id: String
}
type CreateGroup {
ok: Boolean
msg: String
group: Group
}
type ModifyGroup {
ok: Boolean
msg: String
}
type DeleteGroup {
ok: Boolean
msg: String
}
type Mutation {
create_group(name: String!, props: GroupInput!): CreateGroup
modify_group(name: String!, props: ModifyGroupInput!): ModifyGroup
delete_group(name: String!): DeleteGroup
}
KeyPair Management
Query Schema
type KeyPair {
user_id: String
access_key: String
secret_key: String
is_active: Boolean
is_admin: Boolean
resource_policy: String
created_at: DateTime
last_used: DateTime
concurrency_used: Int
rate_limit: Int
num_queries: Int
user: UUID
ssh_public_key: String
vfolders: [VirtualFolder]
compute_sessions(status: String): [ComputeSession]
}
type Query {
keypair(domain_name: String, access_key: String): KeyPair
keypairs(domain_name: String, email: String, is_active: Boolean): [KeyPair]
}
Mutation Schema
input KeyPairInput {
is_active: Boolean
resource_policy: String
concurrency_limit: Int
rate_limit: Int
}
input ModifyKeyPairInput {
is_active: Boolean
is_admin: Boolean
resource_policy: String
concurrency_limit: Int
rate_limit: Int
}
type CreateKeyPair {
ok: Boolean
msg: String
keypair: KeyPair
}
type ModifyKeyPair {
ok: Boolean
msg: String
}
type DeleteKeyPair {
ok: Boolean
msg: String
}
type Mutation {
create_keypair(props: KeyPairInput!, user_id: String!): CreateKeyPair
modify_keypair(access_key: String!, props: ModifyKeyPairInput!): ModifyKeyPair
delete_keypair(access_key: String!): DeleteKeyPair
}
KeyPair Resource Policy Management
Query Schema
type KeyPairResourcePolicy {
name: String
created_at: DateTime
default_for_unspecified: String
total_resource_slots: JSONString # ResourceSlot
max_concurrent_sessions: Int
max_containers_per_session: Int
idle_timeout: BigInt
max_vfolder_count: Int
max_vfolder_size: BigInt
allowed_vfolder_hosts: [String]
}
type Query {
keypair_resource_policy(name: String): KeyPairResourcePolicy
keypair_resource_policies: [KeyPairResourcePolicy]
}
Mutation Schema
input CreateKeyPairResourcePolicyInput {
default_for_unspecified: String!
total_resource_slots: JSONString!
max_concurrent_sessions: Int!
max_containers_per_session: Int!
idle_timeout: BigInt!
max_vfolder_count: Int!
max_vfolder_size: BigInt!
allowed_vfolder_hosts: [String]
}
input ModifyKeyPairResourcePolicyInput {
default_for_unspecified: String
total_resource_slots: JSONString
max_concurrent_sessions: Int
max_containers_per_session: Int
idle_timeout: BigInt
max_vfolder_count: Int
max_vfolder_size: BigInt
allowed_vfolder_hosts: [String]
}
type CreateKeyPairResourcePolicy {
ok: Boolean
msg: String
resource_policy: KeyPairResourcePolicy
}
type ModifyKeyPairResourcePolicy {
ok: Boolean
msg: String
}
type DeleteKeyPairResourcePolicy {
ok: Boolean
msg: String
}
type Mutation {
create_keypair_resource_policy(name: String!, props: CreateKeyPairResourcePolicyInput!): CreateKeyPairResourcePolicy
modify_keypair_resource_policy(name: String!, props: ModifyKeyPairResourcePolicyInput!): ModifyKeyPairResourcePolicy
delete_keypair_resource_policy(name: String!): DeleteKeyPairResourcePolicy
}
Compute Session Monitoring
As of Backend.AI v20.03, compute sessions are composed of one or more containers, while interactions with sessions only occur through the master container when using REST APIs. The GraphQL API allows users and admins to check details of sessions and the containers belonging to them.
Changed in version v5.20191215.
Query Schema
ComputeSession provides information about the whole session, including the user-requested parameters used when creating the session.
type ComputeSession {
# identity and type
id: UUID
name: String
type: String
tag: String
# image
image: String
registry: String
cluster_template: String # reserved for future release
# ownership
domain_name: String
group_name: String
group_id: UUID
user_email: String
user_id: UUID
access_key: String
created_user_email: String # reserved for future release
created_user_uuid: UUID # reserved for future release
# status
status: String
status_changed: DateTime
status_info: String
created_at: DateTime
terminated_at: DateTime
startup_command: String
result: String
# resources
resource_opts: JSONString
scaling_group: String
service_ports: JSONString # only available in master
mounts: [String] # shared by all kernels
occupied_slots: JSONString # ResourceSlot; sum of belonging containers
# statistics
num_queries: BigInt
# owned containers (aka kernels)
containers: [ComputeContainer] # full list of owned containers
# pipeline relations
dependencies: [ComputeSession] # full list of dependency sessions
}
The sessions may be queried one by one using the compute_session field on the root query schema, or as a paginated list using compute_session_list.
type Query {
compute_session(
id: UUID!,
): ComputeSession
compute_session_list(
limit: Int!,
offset: Int!,
order_key: String,
order_asc: Boolean,
domain_name: String, # super-admin can query sessions in any domain
group_id: String, # domain-admins can query sessions in any group
access_key: String, # admins can query sessions of other users
status: String,
): PaginatedList[ComputeSession]
}
ComputeContainer provides information about individual containers that belong to the given session. Note that the client must assume that id is different from container_id, because agents may be configured to use non-Docker backends.
Note
The container IDs in the GraphQL queries and REST APIs are different from the actual Docker container IDs. The Docker container IDs can be queried using the container_id field of ComputeContainer objects. If the agents are configured to use non-Docker-based backends, container_id may also be a completely arbitrary identifier.
type ComputeContainer {
# identity
id: UUID
role: String # "master" is reserved, other values are defined by cluster templates
hostname: String # used by sibling containers in the same session
session_id: UUID
# image
image: String
registry: String
# status
status: String
status_changed: DateTime
status_info: String
created_at: DateTime
terminated_at: DateTime
# resources
agent: String # super-admin only
container_id: String
resource_opts: JSONString
# NOTE: mounts are same in all containers of the same session.
occupied_slots: JSONString # ResourceSlot
# statistics
live_stat: JSONString
last_stat: JSONString
}
In the same way, the containers may be queried one by one using the compute_container field on the root query schema, or as a paginated list using compute_container_list for a single session.
Note
The container ID of the master container of each session is the same as the session ID.
type Query {
compute_container(
id: UUID!,
): ComputeContainer
compute_container_list(
limit: Int!,
offset: Int!,
session_id: UUID!,
role: String,
): PaginatedList[ComputeContainer]
}
Query Example
query(
$limit: Int!,
$offset: Int!,
$ak: String,
$status: String,
) {
compute_session_list(
limit: $limit,
offset: $offset,
access_key: $ak,
status: $status,
) {
total_count
items {
id
name
type
user_email
status
status_info
status_changed
containers {
id
role
agent
}
}
}
}
API Parameters
Using the above GraphQL query, clients may send the following JSON object as the request:
{
"query": "...",
"variables": {
"limit": 10,
"offset": 0,
"ak": "AKIA....",
"status": "RUNNING"
}
}
API Response
{
"compute_session_list": {
"total_count": 1,
"items": [
{
"id": "12c45b55-ce3c-418d-9c58-223bbba307f1",
"name": "mysession",
"type": "interactive",
"user_email": "user@lablup.com",
"status": "RUNNING",
"status_info": null,
"status_changed": "2020-02-16T15:47:28.997335+00:00",
"containers": [
{
"id": "12c45b55-ce3c-418d-9c58-223bbba307f1",
"role": "master",
"agent": "i-agent01"
},
{
"id": "12c45b55-ce3c-418d-9c58-223bbba307f2",
"role": "slave",
"agent": "i-agent02"
},
{
"id": "12c45b55-ce3c-418d-9c58-223bbba307f3",
"role": "slave",
"agent": "i-agent03"
}
]
}
]
}
}
Virtual Folder Management
Query Schema
type VirtualFolder {
id: UUID
host: String
name: String
user: UUID
group: UUID
unmanaged_path: String
max_files: Int
max_size: Int
created_at: DateTime
last_used: DateTime
num_files: Int
cur_size: BigInt
}
type Query {
vfolder_list(
limit: Int!,
offset: Int!,
order_key: String,
order_asc: Boolean,
domain_name: String,
group_id: String,
access_key: String,
): PaginatedList[VirtualFolder]
}
Image Management
Query Schema
type Image {
name: String
humanized_name: String
tag: String
registry: String
digest: String
labels: [KVPair]
aliases: [String]
size_bytes: BigInt
resource_limits: [ResourceLimit]
supported_accelerators: [String]
installed: Boolean
installed_agents: [String] # super-admin only
}
type Query {
image(reference: String!): Image
images(
is_installed: Boolean,
is_operation: Boolean,
domain: String, # only settable by super-admins
group: String,
scaling_group: String, # null to take union of all agents from allowed scaling groups
): [Image]
}
The image list is automatically filtered by: 1) the allowed docker registries of the current user's domain, and 2) whether at least one agent in the union of all agents from the allowed scaling groups for the current user's group has the image or not. The second condition applies only when the value of group is given explicitly. If scaling_group is not null, then only the agents in the given scaling group are checked for image availability, instead of taking the union of all agents from the allowed scaling groups. If the requesting user is a super-admin, clients may set the filter conditions as they want. If the filter conditions are not specified by the super-admin, the query behaves like in v19.09 and prior versions.
Added in version v5.20191215: domain, group, and scaling_group filters are added to the images root query field.
Changed in version v5.20191215: The images query returns the images currently usable by the requesting user as described above. Previously, it returned all etcd-registered images.
Mutation Schema
type RescanImages {
ok: Boolean
msg: String
task_id: String
}
type PreloadImage {
ok: Boolean
msg: String
task_id: String
}
type UnloadImage {
ok: Boolean
msg: String
task_id: String
}
type ForgetImage {
ok: Boolean
msg: String
}
type AliasImage {
ok: Boolean
msg: String
}
type DealiasImage {
ok: Boolean
msg: String
}
type Mutation {
rescan_images(registry: String!): RescanImages
preload_image(reference: String!, target_agents: String!): PreloadImage
unload_image(reference: String!, target_agents: String!): UnloadImage
forget_image(reference: String!): ForgetImage
alias_image(alias: String!, target: String!): AliasImage
dealias_image(alias: String!): DealiasImage
}
All these mutations are only allowed for super-admins. The query parameter target_agents takes a special expression to indicate a set of agents.
The mutations that return task_id may take an arbitrarily long time to complete. This means that getting the response does not necessarily mean that the requested task is complete. To monitor the progress and actual completion, clients should use the background task API with the task_id value.
Added in version v5.20191215: forget_image, preload_image, and unload_image are added to the root mutation.
Changed in version v5.20191215: rescan_images now returns immediately and its completion must be monitored using the new background task API.
Basics of GraphQL
The Admin API uses a single GraphQL endpoint for both queries and mutations.
https://api.backend.ai/admin/graphql
For more information about GraphQL concepts and syntax, please visit the official GraphQL website (https://graphql.org).
HTTP Request Convention
A client must use the POST HTTP method. The server accepts a JSON-encoded body with an object containing two fields: query and variables, pretty much like other GraphQL server implementations.
Warning
Currently the API gateway does not support schema discovery which is often used by API development tools such as Insomnia and GraphiQL.
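A minimal Python sketch of this convention is shown below; auth_headers is a placeholder for the keypair-signed headers described in the Authentication section, and the domain name is illustrative.

import requests

auth_headers = {}  # placeholder for the required authorization headers

query = """
query($name: String) {
  domain(name: $name) { name is_active total_resource_slots }
}
"""
resp = requests.post(
    "https://api.backend.ai/admin/graphql",
    json={"query": query, "variables": {"name": "default"}},
    headers=auth_headers,
)
print(resp.json())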
Field Naming Convention
We do NOT automatically camel-case our field names. All field names follow the underscore style, which is common in the Python world as our server-side framework uses Python.
Common Object Types
ResourceLimit represents a range (min, max) of a specific resource slot (key). The max value may be the string constant “Infinity” if not specified.
type ResourceLimit {
key: String
min: String
max: String
}
KVPair is used to represent a mapping data structure with arbitrary (runtime-determined) key-value pairs, in contrast to other data types in GraphQL which have a set of predefined static fields.
type KVPair {
key: String
value: String
}
Pagination Convention
GraphQL itself does not enforce how to pass pagination information when querying multiple objects of the same type.
We use a pagination convention as described below:
interface Item {
id: UUID
# other fields are defined by concrete types
}
interface PaginatedList(
offset: Integer!,
limit: Integer!,
# some concrete types define ordering customization fields:
# order_key: String,
# order_asc: Boolean,
# other optional filter condition may be added by concrete types
) {
total_count: Integer
items: [Item]
}
offset and limit are interpreted as SQL's OFFSET and LIMIT clauses. For the first page, set the offset to zero and the limit to the page size. The items field may contain from zero up to limit items. Use the total_count field to determine how many pages there are. Fields that support pagination are suffixed with _list in our schema.
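A hedged Python sketch of walking a paginated field page by page follows. Note that the compute session response example earlier in this document shows results without a top-level "data" wrapper, so this sketch follows that shape; adjust if your deployment wraps results in "data".

import requests

GRAPHQL_URL = "https://api.backend.ai/admin/graphql"
auth_headers = {}  # placeholder for the required authorization headers
PAGE = 20

QUERY = """
query($limit: Int!, $offset: Int!) {
  compute_session_list(limit: $limit, offset: $offset) {
    total_count
    items { id name status }
  }
}
"""

def fetch_page(offset):
    resp = requests.post(GRAPHQL_URL, json={
        "query": QUERY, "variables": {"limit": PAGE, "offset": offset},
    }, headers=auth_headers)
    resp.raise_for_status()
    return resp.json()["compute_session_list"]

offset = 0
while True:
    page = fetch_page(offset)
    for item in page["items"]:
        print(item["id"], item["name"], item["status"])
    offset += PAGE
    if offset >= page["total_count"]:
        break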
Custom Scalar Types
UUID: A hexadecimally formatted (8-4-4-4-12 alphanumeric characters connected via single hyphens) UUID value represented as String.
DateTime: An ISO 8601 formatted date-time value represented as String.
BigInt: GraphQL's integer is officially 32-bit only, so we define a “big integer” type which can represent values from -9007199254740991 (-(2^53)+1) to 9007199254740991 ((2^53)-1), or ±(8 PiB - 1 byte). This range is regarded as a “safe” (i.e., comparable without losing precision) integer range in most JavaScript implementations, which represent numbers in the IEEE 754 double (64-bit) format.
JSONString: Contains a stringified JSON value, whereas the whole query result is already a JSON object. A client must parse the value again to get an object representation.
Authentication
The admin API shares the same authentication method as the user API.
Versioning
As we use GraphQL, there is no explicit versioning. Check out the descriptions for each API for its own version history.
Backend.AI Client SDK for Python
Python 3.8 or higher is required.
You can download its official installer from python.org, or use a 3rd-party package/version manager such as homebrew, miniconda, or pyenv. It works on Linux, macOS, and Windows.
We recommend creating a virtual environment for an isolated, unobtrusive installation of the client SDK library and tools.
$ python3 -m venv venv-backend-ai
$ source venv-backend-ai/bin/activate
(venv-backend-ai) $
Then install the client library from PyPI.
(venv-backend-ai) $ pip install -U pip setuptools
(venv-backend-ai) $ pip install backend.ai-client
Note
We recommend installing the client library with the same version as the server. You can check the server version by visiting the server's web UI, clicking the profile icon on the top-right corner, and then clicking the “About Backend.AI” menu.
(venv-backend-ai) $ pip install backend.ai-client==<server_version>
Set your API keypair as environment variables:
(venv-backend-ai) $ export BACKEND_ACCESS_KEY=AKIA...
(venv-backend-ai) $ export BACKEND_SECRET_KEY=...
And then try the first commands:
(venv-backend-ai) $ backend.ai --help
...
(venv-backend-ai) $ backend.ai ps
...
Check out more details with the below table of contents.
Installation
Linux/macOS
We recommend using pyenv to manage your Python versions and virtual environments to avoid conflicts with other Python applications.
Create a new virtual environment (Python 3.8 or higher) and activate it in your shell. Then run the following commands:
pip install -U pip setuptools
pip install -U backend.ai-client-py
Create a shell script my-backendai-env.sh like:
export BACKEND_ACCESS_KEY=...
export BACKEND_SECRET_KEY=...
export BACKEND_ENDPOINT=https://my-precious-cluster
export BACKEND_ENDPOINT_TYPE=api
Run this shell script before using the backend.ai command.
Note
The console-server users should set BACKEND_ENDPOINT_TYPE to session.
For details, check out the client configuration document.
Windows
We recommend using the Anaconda Navigator to manage your Python environments with a slick GUI app.
Create a new environment (Python 3.8 or higher) and launch a terminal (command prompt). Then run the following commands:
python -m pip install -U pip setuptools
python -m pip install -U backend.ai-client
Create a batch file my-backendai-env.bat like:
chcp 65001
set PYTHONIOENCODING=UTF-8
set BACKEND_ACCESS_KEY=...
set BACKEND_SECRET_KEY=...
set BACKEND_ENDPOINT=https://my-precious-cluster
set BACKEND_ENDPOINT_TYPE=api
Run this batch file before using the backend.ai command.
Note that this batch file switches your command prompt to use the UTF-8 codepage for correct display of special characters in the console logs.
Verification
Run the backend.ai ps command and check if it says "there are no compute sessions running" or something similar.
If you encounter error messages about “ACCESS_KEY”, then check if your batch/shell scripts have the correct environment variable names.
If you encounter network connection error messages, check if the endpoint server is configured correctly and accessible.
Client Configuration
The configuration for Backend.AI API includes the endpoint URL prefix, API keypairs (access and secret keys), and a few others.
There are two ways to set the configuration:
Setting environment variables before running your program that uses this SDK. This applies to the command-line interface as well.
Manually creating an APIConfig instance and creating sessions with it.
The configurable environment variables are:
BACKEND_ENDPOINT
BACKEND_ENDPOINT_TYPE
BACKEND_ACCESS_KEY
BACKEND_SECRET_KEY
BACKEND_VFOLDER_MOUNTS
Please refer to the parameter descriptions of the APIConfig constructor for what each environment variable means and what value format should be used.
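As a quick sketch of the second approach, you may construct an APIConfig instance explicitly and pass it to a session (the endpoint and keys below are placeholders):
from ai.backend.client.config import APIConfig
from ai.backend.client.session import Session

# Explicit configuration instead of the BACKEND_* environment variables.
config = APIConfig(
    endpoint="https://my-precious-cluster",
    access_key="AKIA...",
    secret_key="...",
)
with Session(config=config) as api_session:
    print(api_session.System.get_versions())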
Command Line Interface
Configuration
Note
Please consult the detailed usage in the help of each command (use the -h or --help argument to display the manual).
Check out the client configuration for configurations via environment variables.
Session Mode
When the endpoint type is "session", you must explicitly log in to and log out from the console server.
$ backend.ai login
Username: myaccount@example.com
Password:
✔ Login succeeded.
$ backend.ai ... # any commands
$ backend.ai logout
✔ Logout done.
API Mode
After setting up the environment variables, just run any command:
$ backend.ai ...
Checking out the current configuration
Run the following command to list your current active configurations.
$ backend.ai config
Compute Sessions
Note
Please consult the detailed usage in the help of each command (use the -h or --help argument to display the manual).
Listing sessions
List the sessions owned by you with various status filters.
The most recently status-changed sessions are listed first.
To prevent overloading the server, the result is limited to the first 10 sessions, and a separate --all option is provided to paginate further sessions.
backend.ai ps
The ps command is an alias of the following admin session list command. If you have the administrator privilege, you can list sessions owned by other users by adding the --access-key option here.
backend.ai admin session list
Both commands offer options to set the status filter as follows.
For other options, please consult the output of --help.
Option | Included Session Status
---|---
(no option) | PENDING, PREPARING, RUNNING, RESTARTING, TERMINATING, RESIZING, SUSPENDED, and ERROR
--running | PREPARING, PULLING, and RUNNING
--dead | CANCELLED and TERMINATED
Both commands offer options to specify which fields of sessions should be printed as follows.
Option | Included Session Fields
---|---
(no option) | Session ID, Owner, Image, Type, Status, Status Info, Last Updated, and Result
--id-only | Session ID
--detail | Session ID, Owner, Image, Type, Status, Status Info, Last Updated, Result, Tag, Created At, Occupied Resource, Used Memory (MiB), Max Used Memory (MiB), and CPU Using (%)
--format | Specified fields by user
Note
The -f/--format option takes a comma-separated list of field names to display.
Available parameters for this option are: id, status, status_info, created_at, last_updated, result, image, type, task_id, tag, occupied_slots, used_memory, max_used_memory, cpu_using.
For example:
backend.ai admin session list --format id,status,cpu_using
Running simple sessions
The following command spawns a Python session and immediately executes the code passed via the -c argument. The --rm option makes the client automatically terminate the session after the execution finishes.
backend.ai run --rm -c 'print("hello world")' python:3.6-ubuntu18.04
Note
By default, you need to specify the language with a full version tag like python:3.6-ubuntu18.04. Depending on the Backend.AI admin's language alias settings, this can be shortened to just python. If you want to know the defined language aliases, contact the admin of the Backend.AI server.
The following command spawns a Python session and executes the code in the ./myscript.py file, using the shell command specified in the --exec option.
backend.ai run --rm --exec 'python myscript.py arg1 arg2' \
python:3.6-ubuntu18.04 ./myscript.py
Please note that your run command may hang for a very long time due to queueing when the cluster resources are not sufficiently available. To avoid indefinite waiting, you may add --enqueue-only to return immediately after posting the session creation request.
Note
When using --enqueue-only, the code is NOT executed and the relevant options are ignored. This makes the run command behave the same as the start command.
Alternatively, you may use the --max-wait option to limit the maximum waiting time. If the session starts within the given --max-wait seconds, it works normally; if not, it returns without code execution, as when --enqueue-only is used.
To watch what is happening behind the scenes until the session starts, try backend.ai events <sessionID> to receive the lifecycle events such as its scheduling and preparation steps.
Running sessions with accelerators
Use one or more -r options to specify resource requirements when using the backend.ai run and backend.ai start commands.
For instance, the following command spawns a Python TensorFlow session using 2 fractional GPU shares (cuda.shares), 4 CPU cores, and 8 GiB of main memory to execute the ./mygpucode.py file inside it.
backend.ai run --rm \
-r cpu=4 -r mem=8g -r cuda.shares=2 \
python-tensorflow:1.12-py36 ./mygpucode.py
Terminating or cancelling sessions
Without the --rm option, your session remains alive for a configured amount of idle timeout (default is 30 minutes). You can see such sessions using the backend.ai ps command.
Use the following command to manually terminate them via their session IDs. You may specify multiple session IDs to terminate them at once.
backend.ai rm <sessionID> [<sessionID>...]
If you terminate PENDING sessions which are not scheduled yet, they are cancelled.
Container Applications
Note
Please consult the detailed usage in the help of each command (use the -h or --help argument to display the manual).
Starting a session and connecting to its Jupyter Notebook
The following command first spawns a Python session named “mysession”
without running any code immediately, and then executes a local proxy which
connects to the “jupyter” service running inside the session via the local
TCP port 9900.
The start command shows the application services provided by the created compute session so that you can choose one in the subsequent app command. In the start command, you can specify detailed resource options using -r and storage mounts using the -m parameter.
backend.ai start -t mysession python
backend.ai app -b 9900 mysession jupyter
Once executed, the app command waits for the user to open the displayed address with an appropriate application. For the jupyter service, use your favorite web browser just like the way you use Jupyter Notebooks.
To stop the app command, press Ctrl+C or send the SIGINT signal.
Accessing sessions via a web terminal
All Backend.AI sessions expose an intrinsic application named "ttyd". It is a web application that embeds an xterm.js-based full-screen terminal that runs on web browsers.
backend.ai start -t mysession ...
backend.ai app -b 9900 mysession ttyd
Then open http://localhost:9900 to access the shell in a fully functional web terminal using browsers. The default shell is /bin/bash for Ubuntu/CentOS-based images and /bin/ash for Alpine-based images, with a fallback to /bin/sh.
Note
This shell access does NOT grant you root access. All compute session processes are executed with the user privilege.
Accessing sessions via native SSH/SFTP
Backend.AI offers direct access to compute sessions (containers) via SSH
and SFTP, by auto-generating host identity and user keypairs for all
sessions.
All Backend.AI sessions expose an intrinsic application named "sshd", like "ttyd".
To connect to your sessions with SSH, first prepare your session and download the auto-generated SSH keypair named id_container.
Then start the service port proxy (“app” command) to open a local TCP port
that proxies the SSH/SFTP traffic to the compute sessions:
$ backend.ai start -t mysess ...
$ backend.ai session download mysess id_container
$ mv id_container ~/.ssh
$ backend.ai app mysess sshd -b 9922
In another terminal on the same PC, run your ssh client like:
$ ssh -o StrictHostKeyChecking=no \
> -o UserKnownHostsFile=/dev/null \
> -i ~/.ssh/id_container \
> work@localhost -p 9922
Warning: Permanently added '[127.0.0.1]:9922' (RSA) to the list of known hosts.
f310e8dbce83:~$
This SSH port is also compatible with SFTP to browse the container’s filesystem and to upload/download large-sized files.
You could add the following to your ~/.ssh/config to avoid typing extra options every time.
Host localhost
User work
IdentityFile ~/.ssh/id_container
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
$ ssh localhost -p 9922
Warning
Since the SSH keypair is auto-generated every time you launch a new compute session, you need to download and keep it separately for each session.
To use your own SSH private key across all your sessions without downloading the auto-generated one every time, create a vfolder named .ssh and put an authorized_keys file that includes your public key. The keypair and the .ssh directory permissions will be automatically updated by Backend.AI when the session launches.
$ ssh-keygen -t rsa -b 2048 -f id_container
$ cat id_container.pub > authorized_keys
$ backend.ai vfolder create .ssh
$ backend.ai vfolder upload .ssh authorized_keys
Storage Management
Note
Please consult the detailed usage in the help of each command (use the -h or --help argument to display the manual).
Backend.AI abstracts shared network storages into per-user slices called “virtual folders” (aka “vfolders”), which can be shared between users and user group members.
Creating vfolders and managing them
The command-line interface provides a set of subcommands under backend.ai vfolder to manage vfolders and the files inside them.
To list accessible vfolders including your own ones and those shared by other users:
$ backend.ai vfolder list
To create a virtual folder named “mydata1”:
$ backend.ai vfolder create mydata1 mynas
The second argument mynas corresponds to the name of a storage host. To list the storage hosts that you are allowed to use:
$ backend.ai vfolder list-hosts
To delete the vfolder completely:
$ backend.ai vfolder delete mydata1
File transfers and management
To upload a file from the current working directory into the vfolder:
$ backend.ai vfolder upload mydata1 ./bigdata.csv
To download a file from the vfolder into the current working directory:
$ backend.ai vfolder download mydata1 ./bigresult.txt
To list files in the vfolder’s specific path:
$ backend.ai vfolder ls mydata1 .
To delete files in the vfolder:
$ backend.ai vfolder rm mydata1 ./bigdata.csv
Warning
All file uploads and downloads overwrite existing files and all file operations are irreversible.
Running sessions with storages
The following command spawns a Python session where the virtual folder
“mydata1” is mounted. The execution options are omitted in this example.
Then, it downloads the ./bigresult.txt file (generated by your code) from the "mydata1" virtual folder.
$ backend.ai vfolder upload mydata1 ./bigdata.csv
$ backend.ai run --rm -m mydata1 python:3.6-ubuntu18.04 ...
$ backend.ai vfolder download mydata1 ./bigresult.txt
In your code, you may access the virtual folder at /home/work/mydata1 (the default current working directory is /home/work) just like a normal directory. If you want to mount a vfolder at a different path, add a '/' prefix at the front of the mount path.
By reusing the same vfolder in subsequent sessions, you do not have to download the results and upload them again as inputs for the next sessions; just keep them in the storage.
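In the Python SDK, the equivalent is the mounts and mount_map parameters of ComputeSession.get_or_create(); a minimal sketch (the custom mount path is an arbitrary example):
from ai.backend.client.session import Session

with Session() as api_session:
    sess = api_session.ComputeSession.get_or_create(
        "python:3.6-ubuntu18.04",
        mounts=["mydata1"],  # mounted at /home/work/mydata1 by default
        # Mount at a custom absolute path instead of the default location.
        mount_map={"mydata1": "/home/work/input-data"},
    )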
Creating default files for kernels
Backend.AI has a feature called 'dotfiles', which are created in all the kernels a user spawns. As you can guess, a dotfile's path should start with '.'.
The following command creates a dotfile named .aws/config with permission 755. This file will be created under /home/work every time the user spawns a Backend.AI kernel.
$ backend.ai dotfile create .aws/config < ~/.aws/config
Advanced Code Execution
Note
Please consult the detailed usage in the help of each command (use the -h or --help argument to display the manual).
Running concurrent experiment sessions
In addition to the single-shot code execution described in Running simple sessions, the run command offers concurrent execution of multiple sessions with different parameters interpolated into the execution command specified by the --exec option and the environment variables specified by the -e/--env options.
To define variables interpolated in the --exec option, use --exec-range. To define variables interpolated in the --env options, use --env-range.
Here is an example with environment variable ranges that expands into 4 concurrent sessions.
backend.ai run -c 'import os; print("Hello world, {}".format(os.environ["CASENO"]))' \
-r cpu=1 -r mem=256m \
-e 'CASENO=$X' \
--env-range=X=case:1,2,3,4 \
lablup/python:3.6-ubuntu18.04
Both range options accept a special form of argument: "range expressions". The front part of a range option value consists of the variable name used for interpolation and an equals sign (=). The rest of the range expression has one of the following three types:
Expression | Interpretation
---|---
case:v1,v2,... | A list of discrete values. The values may be either strings or numbers.
linspace:start,stop,points | An inclusive numerical range with discrete points, in the same way as numpy.linspace().
range:start[,stop[,step]] | A numerical range with the same semantics as Python's range() function.
If you specify multiple occurrences of range options in the run command, the client spawns sessions for all possible combinations of the values specified by each range.
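Conceptually, the client expands the ranges into the Cartesian product of all range values, as in the following illustrative Python sketch (not the actual client code):
import itertools

# e.g., --env-range=X=case:1,2,3,4 and --exec-range=Y=case:a,b
ranges = {"X": ["1", "2", "3", "4"], "Y": ["a", "b"]}
cases = [
    dict(zip(ranges.keys(), combo))
    for combo in itertools.product(*ranges.values())
]
print(len(cases))  # -> 8 sessions, one per combination
print(cases[0])    # -> {'X': '1', 'Y': 'a'}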
Note
When your resource limit and the cluster's resource capacity cannot run all the spawned sessions at the same time, some of the sessions may be queued and the command may take a long time to finish.
Warning
Until all cases finish, the client must keep its network connections to the server alive, because this feature is implemented on the client side. Server-side batch job scheduling is under development!
Session Templates
Creating and starting session template
Users may define commonly used sets of session creation parameters as reusable templates.
A session template includes common session parameters such as resource slots, vfolder mounts, and the kernel image to use. It also supports an extra feature that automatically clones a Git repository upon startup as a bootstrap command.
The following sample shows what a session template looks like:
---
api_version: v1
kind: taskTemplate
metadata:
  name: template1234
  tag: example-tag
spec:
  kernel:
    environ:
      MYCONFIG: XXX
    git:
      branch: '19.09'
      commit: 10daee9e328876d75e6d0fa4998d4456711730db
      repository: https://github.com/lablup/backend.ai-agent
      destinationDir: /home/work/baiagent
    image: python:3.6-ubuntu18.04
  resources:
    cpu: '2'
    mem: 4g
  mounts:
    hostpath-test: /home/work/hostDesktop
    test-vfolder:
  sessionType: interactive
The backend.ai sesstpl command set provides the basic CRUD operations on user-specific session templates.
The create command accepts the YAML content either piped from the standard input or read from a file using the -f flag:
$ backend.ai sesstpl create < session-template.yaml
# -- or --
$ backend.ai sesstpl create -f session-template.yaml
Once the session template is uploaded, you may use it to start a new session:
$ backend.ai start-template <templateId>
replacing <templateId> with your template ID.
Other CRUD command examples are as follows:
$ backend.ai sesstpl update <templateId> < session-template.yaml
$ backend.ai sesstpl list
$ backend.ai sesstpl get <templateId>
$ backend.ai sesstpl delete <templateId>
Full syntax for task template
---
api_version or apiVersion: str, required
kind: Enum['taskTemplate', 'task_template'], required
metadata: required
  name: str, required
  tag: str (optional)
spec:
  type or sessionType: Enum['interactive', 'batch'] (optional), default=interactive
  kernel:
    image: str, required
    environ: map[str, str] (optional)
    run: (optional)
      bootstrap: str (optional)
      startup or startup_command or startupCommand: str (optional)
    git: (optional)
      repository: str, required
      commit: str (optional)
      branch: str (optional)
      credential: (optional)
        username: str
        password: str
      destination_dir or destinationDir: str (optional)
  mounts: map[str, str] (optional)
  resources: map[str, str] (optional)
Developer Guides
Client Session
Client Session Objects
This module is the first place to begin with your Python programs that use Backend.AI API functions.
The high-level API functions cannot be used alone – you must initiate a client session first because each session provides proxy attributes that represent API functions and run on the session itself.
To achieve this, session objects internally construct new types during initialization by combining the BaseFunction class with the attributes of each API function class, and bind the new types to themselves. Creating new types every time a new session instance is created may look weird, but it is the most convenient way to let the class methods of the API function classes work with specific session instances.
When designing your application, please note that session objects are intended to live long, following the process's lifecycle, rather than to be created and disposed of for each API request.
- class ai.backend.client.session.BaseSession(*, config=None, proxy_mode=False)
The base abstract class for sessions.
- abstractmethod open()
Initializes the session and performs version negotiation.
- abstractmethod close()
Terminates the session and releases underlying resources.
- class ai.backend.client.session.Session(*, config=None, proxy_mode=False)
A context manager for API client sessions that makes API requests synchronously. You may call simple request-response APIs like a plain Python function, but cannot use streaming APIs based on WebSocket and Server-Sent Events.
- close()
Terminates the session. It schedules the close() coroutine of the underlying aiohttp session, and then enqueues a sentinel object to indicate termination. Then it waits for the worker thread to self-terminate by joining it.
- Return type:
- property worker_thread
The thread that internally executes the asynchronous implementations of the given API functions.
- class ai.backend.client.session.AsyncSession(*, config=None, proxy_mode=False)
A context manager for API client sessions that makes API requests asynchronously. You may call all APIs as coroutines. WebSocket-based and SSE-based APIs return special response types.
Examples
Here are several examples to demonstrate the functional API usage.
Initialization of the API Client
Implicit configuration from environment variables
from ai.backend.client.session import Session
def main():
with Session() as api_session:
print(api_session.System.get_versions())
if __name__ == "__main__":
main()
Explicit configuration
from ai.backend.client.config import APIConfig
from ai.backend.client.session import Session
def main():
config = APIConfig(
endpoint="https://api.backend.ai.local",
endpoint_type="api",
domain="default",
group="default", # the default project name to use
)
with Session(config=config) as api_session:
print(api_session.System.get_versions())
if __name__ == "__main__":
main()
Asyncio-native API session
import asyncio
from ai.backend.client.session import AsyncSession
async def main():
async with AsyncSession() as api_session:
print(api_session.System.get_versions())
if __name__ == "__main__":
asyncio.run(main())
See also
The interface of API client session objects: ai.backend.client.session
Working with Compute Sessions
Note
From here, we omit the main()
function structure in the sample codes.
Listing currently running compute sessions
import functools
from ai.backend.client.session import Session
with Session() as api_session:
fetch_func = functools.partial(
api_session.ComputeSession.paginated_list,
status="RUNNING",
)
current_offset = 0
while True:
result = fetch_func(page_offset=current_offset, page_size=20)
if result.total_count == 0:
# no items found
break
current_offset += len(result.items)
for item in result.items:
print(item)
if current_offset >= result.total_count:
# end of list
break
Creating and destroying a compute session
from ai.backend.client.session import Session
with Session() as api_session:
my_session = api_session.ComputeSession.get_or_create(
"python:3.9-ubuntu20.04", # registered container image name
mounts=["mydata", "mymodel"], # vfolder names
resources={"cpu": 8, "mem": "32g", "cuda.device": 2},
)
print(my_session.id)
my_session.destroy()
Accessing Container Applications
Launchable apps may vary by session. Here we illustrate an example of launching a ttyd (web-based terminal) app, which is available in all Backend.AI sessions.
Note
This example is only applicable to Backend.AI clusters with AppProxy v2 enabled and configured. AppProxy v2 only ships with the enterprise version of Backend.AI.
The ComputeSession.start_service()
API
import requests
from ai.backend.client.session import Session
app_name = "ttyd"
with Session() as api_session:
sess = api_session.ComputeSession.get_or_create(...)
service_info = sess.start_service(app_name, login_session_token="dummy")
app_proxy_url = f"{service_info['wsproxy_addr']}/v2/proxy/{service_info['token']}/{sess.id}/add?app={app_name}"
resp = requests.get(app_proxy_url)
body = resp.json()
auth_url = body["url"]
print(auth_url) # opening this link from browser will navigate user to the terminal session
Added in version 23.09.8: ai.backend.client.func.session.ComputeSession.start_service()
Set the value of login_session_token to a dummy string like "dummy", as it is a trace of the legacy interface, which is no longer used.
Alternatively, in versions before 23.09.8, you may use the raw ai.backend.client.Request to call the server-side start_service API.
import asyncio
import aiohttp
from ai.backend.client.request import Request
from ai.backend.client.session import AsyncSession
app_name = "ttyd"
async def main():
async with AsyncSession() as api_session:
sess = api_session.ComputeSession.get_or_create(...)
rqst = Request(
"POST",
f"/session/{sess.id}/start-service",
)
rqst.set_json({"app": app_name, "login_session_token": "dummy"})
async with rqst.fetch() as resp:
body = await resp.json()
app_proxy_url = f"{body['wsproxy_addr']}/v2/proxy/{body['token']}/{sess.id}/add?app={app_name}"
async with aiohttp.ClientSession() as client:
async with client.get(app_proxy_url) as resp:
body = await resp.json()
auth_url = body["url"]
print(auth_url) # opening this link from browser will navigate user to the terminal session
if __name__ == "__main__":
asyncio.run(main())
Code Execution via API
Synchronous mode
Snippet execution (query mode)
This is the minimal code to execute a code snippet with this client SDK.
import sys
from ai.backend.client.session import Session
with Session() as api_session:
my_session = api_session.ComputeSession.get_or_create("python:3.9-ubuntu20.04")
code = 'print("hello world")'
mode = "query"
run_id = None
try:
while True:
result = my_session.execute(run_id, code, mode=mode)
run_id = result["runId"] # keeps track of this particular run loop
for rec in result.get("console", []):
if rec[0] == "stdout":
print(rec[1], end="", file=sys.stdout)
elif rec[0] == "stderr":
print(rec[1], end="", file=sys.stderr)
else:
handle_media(rec)
sys.stdout.flush()
if result["status"] == "finished":
break
else:
mode = "continued"
code = ""
finally:
my_session.destroy()
You need to take care of client_token because it determines whether to reuse kernel sessions or not. Backend.AI cloud has a timeout that terminates long-idle kernel sessions, but within the timeout, any kernel creation request with the same client_token lets Backend.AI cloud reuse the kernel.
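For example, passing the same client_token across runs reuses the kernel within the idle timeout; a minimal sketch (note that recent SDK versions call this parameter name):
from ai.backend.client.session import Session

with Session() as api_session:
    # Within the idle timeout, a creation request with the same
    # client_token returns the existing kernel instead of a new one.
    sess = api_session.ComputeSession.get_or_create(
        "python:3.9-ubuntu20.04",
        client_token="my-reusable-session",
    )
    print(sess.id)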
Script execution (batch mode)
You first need to upload the files after creating the session and construct an opts struct.
import sys
from ai.backend.client.session import Session
with Session() as session:
compute_sess = session.ComputeSession.get_or_create("python:3.6-ubuntu18.04")
compute_sess.upload(["mycode.py", "setup.py"])
code = ""
mode = "batch"
run_id = None
opts = {
"build": "*", # calls "python setup.py install"
"exec": "python mycode.py arg1 arg2",
}
try:
while True:
result = compute_sess.execute(run_id, code, mode=mode, opts=opts)
opts.clear()
run_id = result["runId"]
for rec in result.get("console", []):
if rec[0] == "stdout":
print(rec[1], end="", file=sys.stdout)
elif rec[0] == "stderr":
print(rec[1], end="", file=sys.stderr)
else:
handle_media(rec)
sys.stdout.flush()
if result["status"] == "finished":
break
else:
mode = "continued"
code = ""
finally:
compute_sess.destroy()
Handling user inputs
Inside the while-loop for compute_sess.execute() above, change the if-block for result['status'] as follows (this also uses the standard getpass module, so import it as well):
...
if result["status"] == "finished":
break
elif result["status"] == "waiting-input":
mode = "input"
if result["options"].get("is_password", False):
code = getpass.getpass()
else:
code = input()
else:
mode = "continued"
code = ""
...
A common gotcha is to miss setting mode = "input". Be careful!
Handling multi-media outputs
The handle_media() function used in the above examples would look like:
def handle_media(record):
media_type = record[0] # MIME-Type string
media_data = record[1] # content
...
The exact method to process media_data depends on the media_type. Currently, the following behaviors are well-defined:
For (binary-format) images, the content is a dataURI-encoded string.
For SVG (scalable vector graphics) images, the content is an XML string.
For application/x-sorna-drawing, the content is a JSON string that represents a set of vector drawing commands to be replayed on the client side (e.g., JavaScript on browsers)
Asynchronous mode
The async version has all the sync-version interfaces as coroutines, but comes with additional features such as stream_execute(), which streams the execution results via WebSockets, and stream_pty() for interactive terminal streaming.
import asyncio
import getpass
import json
import sys
import aiohttp
from ai.backend.client.session import AsyncSession
async def main():
async with AsyncSession() as api_session:
compute_sess = await api_session.ComputeSession.get_or_create(
"python:3.6-ubuntu18.04",
client_token="mysession",
)
code = 'print("hello world")'
mode = "query"
try:
async with compute_sess.stream_execute(code, mode=mode) as stream:
# no need for explicit run_id since WebSocket connection represents it!
async for result in stream:
if result.type != aiohttp.WSMsgType.TEXT:
continue
result = json.loads(result.data)
for rec in result.get("console", []):
if rec[0] == "stdout":
print(rec[1], end="", file=sys.stdout)
elif rec[0] == "stderr":
print(rec[1], end="", file=sys.stderr)
else:
handle_media(rec)
sys.stdout.flush()
if result["status"] == "finished":
break
elif result["status"] == "waiting-input":
mode = "input"
if result["options"].get("is_password", False):
code = getpass.getpass()
else:
code = input()
await stream.send_text(code)
else:
mode = "continued"
code = ""
finally:
await compute_sess.destroy()
if __name__ == "__main__":
asyncio.run(main())
Added in version 19.03.
Working with model service
In addition to a working AppProxy v2 deployment, the model service requires a resource group configured to accept inference workloads.
Starting model service
from ai.backend.client.session import Session
with Session() as api_session:
compute_sess = api_session.Service.create(
"python:3.6-ubuntu18.04",
"Llama2-70B",
1,
service_name="Llama2-service",
resources={"cuda.shares": 2, "cpu": 8, "mem": "64g"},
open_to_public=False,
)
If you set open_to_public=True, the endpoint accepts anonymous traffic without the authentication token (see below).
Making request to model service endpoint
import requests
from ai.backend.client.session import Session
with Session() as api_session:
compute_sess = api_session.Service.create(...)
service_info = compute_sess.info()
endpoint = service_info["url"] # this value can be None if no successful inference service deployment has been made
token_info = compute_sess.generate_api_token("3600s")
token = token_info["token"]
headers = {"Authorization": f"BackendAI {token}"} # providing token is not required for public model services
resp = requests.get(f"{endpoint}/v1/models", headers=headers)
The token returned by the generate_api_token() method is a JSON Web Token (JWT), which conveys all the information required to authenticate the inference request.
Once generated, it cannot be revoked. A token may have its own expiration date/time.
The lifetime of a token is configured by the user who deploys the inference model, and currently there are no intrinsic minimum/maximum limits on the lifetime.
Added in version 23.09.
Testing
Unit Tests
Unit tests perform function-by-function tests to ensure their individual functionality. This test suite runs without depending on the server side, and thus it is executed in Travis CI for every push.
How to run
$ python -m pytest -m 'not integration' tests
Integration Tests
Integration tests combine multiple invocations of high-level interfaces to make underlying API requests to a running gateway server to test the full functionality of the client as well as the manager.
They are marked as “integration” by attaching the @pytest.mark.integration decorator to each test case.
Warning
The integration tests actually make changes to the target gateway server and agents. If some tests fail, those changes may remain in an inconsistent state and require a manual recovery such as resetting the database and populating the fixtures again, though the test suite tries to clean them up properly.
So, DO NOT RUN it against your production server.
Prerequisite
Please refer to the README of the manager and agent repositories to set them up. To avoid an indefinite waiting time for pulling Docker images:
(manager)
python -m ai.backend.manager.cli image rescan
(agent)
docker pull lablup/python:3.6-ubuntu18.04
docker pull lablup/lua:5.3-alpine3.8
The manager must also have at least the following active super-admin account in the default domain and the default group.
Example super-admin account:
User ID:
admin@lablup.com
Password
wJalrXUt
Access key:
AKIAIOSFODNN7EXAMPLE
Secret key:
wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
One or more testing-XXXX domains, one or more testing-XXXX groups, and one or more dummy users are created and used during the tests and destroyed after running them. XXXX will be filled with random identifiers.
Tip
The halfstack configuration and the example-users.json, example-keypairs.json, and example-set-user-main-access-keys.json fixtures are compatible with this integration test suite.
How to run
Execute the gateway and at least one agent in their respective virtualenvs and hosts:
$ python -m ai.backend.gateway.server
$ python -m ai.backend.agent.server
$ python -m ai.backend.agent.watcher
Then run the tests:
$ export BACKEND_ENDPOINT=...
$ python -m pytest -m 'integration' tests
High-level Function Reference
Admin Functions
- class ai.backend.client.func.admin.Admin
Provides the function interface for making admin GraphQL queries.
Note
Depending on the privilege of your API access key, you may or may not have access to querying/mutating server-side resources of other users.
- classmethod await query(query, variables=None)
Sends the GraphQL query and returns the response.
Agent Functions
- class ai.backend.client.func.agent.Agent
Provides a shortcut of Admin.query() that fetches various agent information.
Note
All methods in this function class require your API access key to have the admin privilege.
- classmethod await paginated_list(status='ALIVE', scaling_group=None, *, fields=(FieldSpec(field_ref='id', humanized_name='ID', field_name='id', alt_name='id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status', humanized_name='Status', field_name='status', alt_name='status', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='scaling_group', humanized_name='Scaling Group', field_name='scaling_group', alt_name='scaling_group', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='available_slots', humanized_name='Available Slots', field_name='available_slots', alt_name='available_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={}), FieldSpec(field_ref='occupied_slots', humanized_name='Occupied Slots', field_name='occupied_slots', alt_name='occupied_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={})), page_offset=0, page_size=20, filter=None, order=None)
Lists the agents. You need an admin privilege for this operation.
- Return type:
PaginatedResult
- classmethod await detail(agent_id, fields=(FieldSpec(field_ref='id', humanized_name='ID', field_name='id', alt_name='id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status', humanized_name='Status', field_name='status', alt_name='status', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='scaling_group', humanized_name='Scaling Group', field_name='scaling_group', alt_name='scaling_group', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='addr', humanized_name='Addr', field_name='addr', alt_name='addr', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='region', humanized_name='Region', field_name='region', alt_name='region', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='first_contact', humanized_name='First Contact', field_name='first_contact', alt_name='first_contact', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='cpu_cur_pct', humanized_name='CPU Usage (%)', field_name='cpu_cur_pct', alt_name='cpu_cur_pct', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='mem_cur_bytes', humanized_name='Used Memory (MiB)', field_name='mem_cur_bytes', alt_name='mem_cur_bytes', formatter=<ai.backend.client.output.formatters.MiBytesOutputFormatter object>, subfields={}), FieldSpec(field_ref='available_slots', humanized_name='Available Slots', field_name='available_slots', alt_name='available_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={}), FieldSpec(field_ref='occupied_slots', humanized_name='Occupied Slots', field_name='occupied_slots', alt_name='occupied_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={}), FieldSpec(field_ref='local_config', humanized_name='Local Config', field_name='local_config', alt_name='local_config', formatter=<ai.backend.client.output.formatters.NestedDictOutputFormatter object>, subfields={})))
Auth Functions
- class ai.backend.client.func.auth.Auth
Provides the function interface for login session management and authorization.
- classmethod await login(user_id, password, otp=None)
Logs in to the endpoint with the given user ID and password. It creates a server-side web session and returns a dictionary with an "authenticated" boolean field and JSON-encoded raw cookie data.
- Return type:
- classmethod await logout()
Logs out from the endpoint. It clears the server-side web session.
- Return type:
- classmethod await update_password(old_password, new_password, new_password2)
Updates the user's password. This API works only for the account owner.
- Return type:
- classmethod await update_password_no_auth(domain, user_id, current_password, new_password)
Updates the user's password. This is used to update an EXPIRED password only. This function sends a request to the manager.
- Return type:
Configuration
- ai.backend.client.config.get_env(key, default=Undefined.token, *, clean=<function default_clean>)
Retrieves a configuration value from the environment variables. The given key is uppercased and prefixed by "BACKEND_", falling back to "SORNA_" if the former does not exist.
- Parameters:
key (str) – The key name.
default (Union[str, Mapping, Undefined]) – The default value returned when there is no corresponding environment variable.
clean (Callable[[Any], T]) – A single-argument function that is applied to the result of the lookup (both on success and to the default value on failure). The default is to return the value as-is.
- Return type:
T
- Returns:
The value processed by the clean function.
- ai.backend.client.config.get_config()
Returns the configuration for the current process. If there is no explicitly set APIConfig instance, it will generate a new one from the current environment variables and defaults.
- Return type:
APIConfig
- ai.backend.client.config.set_config(conf)
Sets the configuration used throughout the current process.
- Return type:
- class ai.backend.client.config.APIConfig(*, endpoint=None, endpoint_type=None, domain=None, group=None, storage_proxy_address_map=None, version=None, user_agent=None, access_key=None, secret_key=None, hash_type=None, vfolder_mounts=None, skip_sslcert_validation=None, connection_timeout=None, read_timeout=None, announcement_handler=None)
Represents a set of API client configurations. The access key and secret key are mandatory – they must be set either as environment variables or as explicit arguments.
- Parameters:
endpoint (Union[URL, str]) – The URL prefix to make API requests via HTTP/HTTPS. If this is given as a str containing multiple URLs separated by commas, the underlying HTTP request-response facility will perform client-side load balancing and automatic fail-over using them, assuming that all those URLs indicate a single, same cluster. The users of the API and CLI will get network connection errors only when all of the given endpoints fail – intermittent failures of a subset of endpoints will be hidden with a slightly increased latency.
endpoint_type (str) – Either "api" or "session". If the endpoint type is "api" (the default if unspecified), it uses the access key and secret key in the configuration to access the manager API server directly. If the endpoint type is "session", it assumes the endpoint is a Backend.AI console server which provides cookie-based authentication with username and password. In the latter case, users need to use backend.ai login and backend.ai logout to manage their sign-in status, or the API equivalents, the login() and logout() methods.
version (str) – The API protocol version.
user_agent (str) – A custom user-agent string which is sent to the API server as a User-Agent HTTP header.
access_key (str) – The API access key. If deliberately set to an empty string, the API requests will be made without signatures (anonymously).
secret_key (str) – The API secret key.
hash_type (str) – The hash type to generate per-request authentication signatures.
vfolder_mounts (Iterable[str]) – A list of vfolder names (that must belong to the given access key) to be automatically mounted upon any Kernel.get_or_create() calls.
- DEFAULTS: Mapping[str, Union[str, Mapping]] = {'connection_timeout': '10.0', 'domain': 'default', 'endpoint': 'https://api.cloud.backend.ai', 'endpoint_type': 'api', 'group': 'default', 'hash_type': 'sha256', 'read_timeout': '0', 'storage_proxy_address_map': {}, 'version': 'v8.20240315'}
The default values for config parameters settable via environment variables, except the access and secret keys.
- property endpoint: URL
The currently active endpoint URL. This may change if there are multiple configured endpoints and the current one is not accessible.
- property storage_proxy_address_map: Mapping[str, str]
The storage proxy address map for overriding.
- property skip_sslcert_validation: bool
Whether to skip SSL certificate validation for the API gateway.
- property connection_timeout: float
The maximum allowed duration for making TCP connections to the server.
KeyPair Functions
- class ai.backend.client.func.keypair.KeyPair(access_key)
Provides interactions with keypairs.
- classmethod await create(user_id, is_active=True, is_admin=False, resource_policy=Undefined.TOKEN, rate_limit=Undefined.TOKEN, fields=(FieldSpec(field_ref='access_key', humanized_name='Access Key', field_name='access_key', alt_name='access_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='secret_key', humanized_name='Secret Key', field_name='secret_key', alt_name='secret_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})))
Creates a new keypair with the given options. You need an admin privilege for this operation.
- Return type:
- classmethod await update(access_key, is_active=Undefined.TOKEN, is_admin=Undefined.TOKEN, resource_policy=Undefined.TOKEN, rate_limit=Undefined.TOKEN)
Updates an existing keypair with the given options. You need an admin privilege for this operation.
- Return type:
- classmethod await delete(access_key)
Deletes an existing keypair with the given access key.
- classmethod await list(user_id=None, is_active=None, fields=(FieldSpec(field_ref='user_id', humanized_name='Email', field_name='user_id', alt_name='user_id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='access_key', humanized_name='Access Key', field_name='access_key', alt_name='access_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='secret_key', humanized_name='Secret Key', field_name='secret_key', alt_name='secret_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_active', humanized_name='Active?', field_name='is_active', alt_name='is_active', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_admin', humanized_name='Admin?', field_name='is_admin', alt_name='is_admin', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='created_at', humanized_name='Created At', field_name='created_at', alt_name='created_at', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})))
Lists the keypairs. You need an admin privilege for this operation.
- classmethod await paginated_list(is_active=None, domain_name=None, *, user_id=None, fields=(FieldSpec(field_ref='user_id', humanized_name='Email', field_name='user_id', alt_name='user_id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='access_key', humanized_name='Access Key', field_name='access_key', alt_name='access_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='secret_key', humanized_name='Secret Key', field_name='secret_key', alt_name='secret_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_active', humanized_name='Active?', field_name='is_active', alt_name='is_active', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_admin', humanized_name='Admin?', field_name='is_admin', alt_name='is_admin', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='created_at', humanized_name='Created At', field_name='created_at', alt_name='created_at', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})), page_offset=0, page_size=20, filter=None, order=None)
Lists the keypairs. You need an admin privilege for this operation.
- Return type:
PaginatedResult[dict]
- await info(fields=(FieldSpec(field_ref='user_id', humanized_name='Email', field_name='user_id', alt_name='user_id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='access_key', humanized_name='Access Key', field_name='access_key', alt_name='access_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='secret_key', humanized_name='Secret Key', field_name='secret_key', alt_name='secret_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_active', humanized_name='Active?', field_name='is_active', alt_name='is_active', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_admin', humanized_name='Admin?', field_name='is_admin', alt_name='is_admin', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})))
Returns the keypair’s information such as resource limits.
- Parameters:
fields (Sequence[FieldSpec]) – Additional per-keypair query fields to fetch.
- Return type:
Added in version 18.12.
- classmethod await activate(access_key)
Activates this keypair. You need an admin privilege for this operation.
- Return type:
Manager Functions
- class ai.backend.client.func.manager.Manager
Provides controlling of the gateway/manager servers.
Added in version 18.12.
- classmethod await status()
Returns the current status of the configured API server.
- classmethod await freeze(force_kill=False)
Freezes the configured API server. Any API clients will no longer be able to create new compute sessions nor create and modify vfolders/keypairs/etc. This is used to enter the maintenance mode of the server for unobtrusive manager and/or agent upgrades.
- Parameters:
force_kill (bool) – If set True, immediately shuts down all running compute sessions forcibly. If not set, clients who have running compute sessions are still able to interact with them, though they cannot create new compute sessions.
- classmethod await unfreeze()
Unfreezes the configured API server so that it resumes normal operation.
- classmethod await get_announcement()
Gets the current announcement.
- classmethod await update_announcement(enabled=True, message=None)
Updates (creates or deletes) the announcement.
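For instance, a maintenance workflow using these calls might look like the following sketch (assuming an admin-privileged keypair):
from ai.backend.client.session import Session

with Session() as api_session:
    print(api_session.Manager.status())  # check the current status
    api_session.Manager.freeze()         # enter the maintenance mode
    # ... perform manager/agent upgrades here ...
    api_session.Manager.unfreeze()       # resume normal operation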
Scaling Group Functions
- class ai.backend.client.func.scaling_group.ScalingGroup(name)
Provides access to scaling-group information for the current user.
The scaling-group is an opaque server-side configuration which splits the whole cluster into several partitions, so that server administrators can apply different auto-scaling policies and operation standards to each partition of agent sets.
- classmethod await list_available(group)
List available scaling groups for the current user, considering the user, the user’s domain, and the designated user group.
- classmethod await list(fields=(FieldSpec(field_ref='name', humanized_name='Name', field_name='name', alt_name='name', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='description', humanized_name='Description', field_name='description', alt_name='description', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_active', humanized_name='Active?', field_name='is_active', alt_name='is_active', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_public', humanized_name='Public?', field_name='is_public', alt_name='is_public', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='created_at', humanized_name='Created At', field_name='created_at', alt_name='created_at', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='driver', humanized_name='Driver', field_name='driver', alt_name='driver', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='scheduler', humanized_name='Scheduler', field_name='scheduler', alt_name='scheduler', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='use_host_network', humanized_name='Use Host Network', field_name='use_host_network', alt_name='use_host_network', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='wsproxy_addr', humanized_name='Wsproxy Addr', field_name='wsproxy_addr', alt_name='wsproxy_addr', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='wsproxy_api_token', humanized_name='Wsproxy Api Token', field_name='wsproxy_api_token', alt_name='wsproxy_api_token', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})))
List available scaling groups for the current user, considering the user, the user’s domain, and the designated user group.
- classmethod await detail(name, fields=(FieldSpec(field_ref='name', humanized_name='Name', field_name='name', alt_name='name', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='description', humanized_name='Description', field_name='description', alt_name='description', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_active', humanized_name='Active?', field_name='is_active', alt_name='is_active', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_public', humanized_name='Public?', field_name='is_public', alt_name='is_public', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='created_at', humanized_name='Created At', field_name='created_at', alt_name='created_at', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='driver', humanized_name='Driver', field_name='driver', alt_name='driver', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='driver_opts', humanized_name='Driver Opts', field_name='driver_opts', alt_name='driver_opts', formatter=<ai.backend.client.output.formatters.NestedDictOutputFormatter object>, subfields={}), FieldSpec(field_ref='scheduler', humanized_name='Scheduler', field_name='scheduler', alt_name='scheduler', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='scheduler_opts', humanized_name='Scheduler Opts', field_name='scheduler_opts', alt_name='scheduler_opts', formatter=<ai.backend.client.output.formatters.NestedDictOutputFormatter object>, subfields={}), FieldSpec(field_ref='use_host_network', humanized_name='Use Host Network', field_name='use_host_network', alt_name='use_host_network', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='wsproxy_addr', humanized_name='Wsproxy Addr', field_name='wsproxy_addr', alt_name='wsproxy_addr', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='wsproxy_api_token', humanized_name='Wsproxy Api Token', field_name='wsproxy_api_token', alt_name='wsproxy_api_token', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})))
Fetch information of a scaling group by name.
- classmethod await create(name, *, description='', is_active=True, is_public=True, driver, driver_opts=Undefined.TOKEN, scheduler, scheduler_opts=Undefined.TOKEN, use_host_network=False, wsproxy_addr=None, wsproxy_api_token=None, fields=None)
Creates a new scaling group with the given options.
- Return type:
- classmethod await update(name, *, description=Undefined.TOKEN, is_active=Undefined.TOKEN, is_public=Undefined.TOKEN, driver=Undefined.TOKEN, driver_opts=Undefined.TOKEN, scheduler=Undefined.TOKEN, scheduler_opts=Undefined.TOKEN, use_host_network=Undefined.TOKEN, wsproxy_addr=Undefined.TOKEN, wsproxy_api_token=Undefined.TOKEN, fields=None)
Updates an existing scaling group.
- Return type:
- classmethod await delete(name)
Deletes an existing scaling group.
- classmethod await associate_domain(scaling_group, domain)
Associate scaling_group with domain.
- classmethod await dissociate_domain(scaling_group, domain)
Dissociate scaling_group from domain.
- classmethod await dissociate_all_domain(domain)
Dissociate all scaling_groups from domain.
- Parameters:
domain (str) – The name of a domain.
- classmethod await associate_group(scaling_group, group_id)
Associate scaling_group with group.
- classmethod await dissociate_group(scaling_group, group_id)
Dissociate scaling_group from group.
ComputeSession Functions
- class ai.backend.client.func.session.ComputeSession(name, owner_access_key=None)
Provides various interactions with compute sessions in Backend.AI.
The term ‘kernel’ is now deprecated and we prefer ‘compute sessions’. However, for historical reasons and to avoid confusion with client sessions, we keep the backward compatibility with the naming of this API function class.
For multi-container sessions, all methods take effect on the master container only, except the destroy() and restart()
methods. So it is the user’s responsibility to distribute uploaded files to multiple containers using explicit copies or virtual folders which are commonly mounted to all containers belonging to the same compute session.- classmethod await paginated_list(status=None, access_key=None, *, fields=(FieldSpec(field_ref='id', humanized_name='Session ID', field_name='id', alt_name='session_id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='image', humanized_name='Image', field_name='image', alt_name='image', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='type', humanized_name='Type', field_name='type', alt_name='type', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status', humanized_name='Status', field_name='status', alt_name='status', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status_info', humanized_name='Status Info', field_name='status_info', alt_name='status_info', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status_changed', humanized_name='Last Updated', field_name='status_changed', alt_name='status_changed', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='result', humanized_name='Result', field_name='result', alt_name='result', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='abusing_reports', humanized_name='Abusing Reports', field_name='abusing_reports', alt_name='abusing_reports', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})), page_offset=0, page_size=20, filter=None, order=None)
Fetches the list of sessions.
- classmethod await get_or_create(image, *, name=None, type_='interactive', starts_at=None, enqueue_only=False, max_wait=0, no_reuse=False, dependencies=None, callback_url=None, mounts=None, mount_map=None, mount_options=None, envs=None, startup_command=None, resources=None, resource_opts=None, cluster_size=1, cluster_mode=ClusterMode.SINGLE_NODE, domain_name=None, group_name=None, bootstrap_script=None, tag=None, architecture='x86_64', scaling_group=None, owner_access_key=None, preopen_ports=None, assign_agent=None)
Get-or-creates a compute session. If name is None, it creates a new compute session as long as the server has enough resources and your API key has remaining quota. If name is a valid string and there is an existing compute session with the same token and the same image, then it returns the ComputeSession instance representing the existing session.
- Parameters:
image (str) – The image name and tag for the compute session. Example: python:3.6-ubuntu. Check out the full list of available images in your server using (TODO: new API).
name (str) – A client-side (user-defined) identifier to distinguish the session among currently running sessions. It may be used to seamlessly reuse the session already created.
Changed in version 19.12.0: Renamed from clientSessionToken.
type – Either "interactive" (default) or "batch".
Added in version 19.09.0.
enqueue_only (bool) – Just enqueue the session creation request and return immediately, without waiting for its startup. (default: false to preserve the legacy behavior)
Added in version 19.09.0.
max_wait (int) – The time to wait for session startup. If the cluster resources are fully utilized, this waiting time can be arbitrarily long due to job queueing. If the timeout is reached, the returned status field becomes "TIMEOUT". Even then, the session may still start in the future.
Added in version 19.09.0.
no_reuse (bool) – Raises an explicit error if a session with the same image and the same name already exists, instead of returning its information.
Added in version 19.09.0.
mounts (List[str]) – The list of vfolder names that belong to the current API access key.
mount_map (Mapping[str, str]) – Mapping of vfolder names to custom mount paths; each key is a vfolder name and each value is its custom path. Default mounts and relative paths are placed under /home/work; to mount elsewhere, the paths must be absolute. The target mount paths of vfolders must not overlap with the Linux system directories. vfolders whose names have a dot (.) prefix are not affected.
mount_options (Optional[Mapping[str, Mapping[str, str]]]) – Mapping which contains extra options for each vfolder.
envs (Mapping[str, str]) – The environment variables, which always bypass the jail policy.
resources (Mapping[str, str | int]) – The resource specification. (TODO: details)
cluster_size (int) – The number of containers in this compute session. Must be at least 1.
Added in version 19.09.0. Changed in version 20.09.0.
cluster_mode (ClusterMode) – Sets the clustering mode: whether to use distributed nodes or a single node to spawn multiple containers for the new session.
Added in version 20.09.0.
tag (str) – An optional string to annotate extra information.
owner – An optional access key that owns the created session. (Only available to administrators)
- Return type:
ComputeSession
- Returns:
The ComputeSession instance.
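For instance, a minimal sketch of get-or-create under a synchronous client session; the image, session name, and resource values are illustrative assumptions, not prescribed values.

    from ai.backend.client.session import Session

    with Session() as api_session:
        compute = api_session.ComputeSession.get_or_create(
            'python:3.6-ubuntu',                # image name and tag
            name='my-experiment',               # reusing this name reuses the session
            resources={'cpu': 2, 'mem': '4g'},  # hypothetical resource slots
            max_wait=60,                        # stop waiting after 60 seconds
        )
        print(compute.name)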
- classmethod await create_from_template(template_id, *, name=Undefined.TOKEN, type_=Undefined.TOKEN, starts_at=None, enqueue_only=Undefined.TOKEN, max_wait=Undefined.TOKEN, dependencies=None, callback_url=Undefined.TOKEN, no_reuse=Undefined.TOKEN, image=Undefined.TOKEN, mounts=Undefined.TOKEN, mount_map=Undefined.TOKEN, envs=Undefined.TOKEN, startup_command=Undefined.TOKEN, resources=Undefined.TOKEN, resource_opts=Undefined.TOKEN, cluster_size=Undefined.TOKEN, cluster_mode=Undefined.TOKEN, domain_name=Undefined.TOKEN, group_name=Undefined.TOKEN, bootstrap_script=Undefined.TOKEN, tag=Undefined.TOKEN, scaling_group=Undefined.TOKEN, owner_access_key=Undefined.TOKEN)
Get-or-creates a compute session from a template. All other parameters provided override the template's values, including vfolder mounts (they replace, not append!). If name is None, it creates a new compute session as long as the server has enough resources and your API key has remaining quota. If name is a valid string and there is an existing compute session with the same token and the same image, then it returns the ComputeSession instance representing the existing session.
- Parameters:
template_id (str) – The task template to apply to the compute session.
image (str | Undefined) – The image name and tag for the compute session. Example: python:3.6-ubuntu. Check out the full list of available images in your server using (TODO: new API).
name – A client-side (user-defined) identifier to distinguish the session among currently running sessions. It may be used to seamlessly reuse the session already created.
Changed in version 19.12.0: Renamed from clientSessionToken.
type – Either "interactive" (default) or "batch".
Added in version 19.09.0.
enqueue_only (bool | Undefined) – Just enqueue the session creation request and return immediately, without waiting for its startup. (default: false to preserve the legacy behavior)
Added in version 19.09.0.
max_wait – The time to wait for session startup. If the cluster resources are fully utilized, this waiting time can be arbitrarily long due to job queueing. If the timeout is reached, the returned status field becomes "TIMEOUT". Even then, the session may still start in the future.
Added in version 19.09.0.
no_reuse – Raises an explicit error if a session with the same image and the same name already exists, instead of returning its information.
Added in version 19.09.0.
mounts (Union[List[str], Undefined]) – The list of vfolder names that belong to the current API access key.
mount_map (Union[Mapping[str, str], Undefined]) – Mapping of vfolder names to custom mount paths; each key is a vfolder name and each value is its custom path. Default mounts and relative paths are placed under /home/work; to mount elsewhere, the paths must be absolute. The target mount paths of vfolders must not overlap with the Linux system directories. vfolders whose names have a dot (.) prefix are not affected.
envs (Union[Mapping[str, str], Undefined]) – The environment variables, which always bypass the jail policy.
resources (Union[Mapping[str, str | int], Undefined]) – The resource specification. (TODO: details)
cluster_size (int | Undefined) – The number of containers in this compute session. Must be at least 1.
Added in version 19.09.0. Changed in version 20.09.0.
cluster_mode (ClusterMode | Undefined) – Sets the clustering mode: whether to use distributed nodes or a single node to spawn multiple containers for the new session.
Added in version 20.09.0.
tag (str | Undefined) – An optional string to annotate extra information.
owner – An optional access key that owns the created session. (Only available to administrators)
- Return type:
ComputeSession
- Returns:
The ComputeSession instance.
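A short hedged sketch of spawning from a template; 'my-template-id' is a placeholder, and the override values are illustrative. As noted above, overrides replace the template's values rather than extend them.

    from ai.backend.client.session import Session

    with Session() as api_session:
        compute = api_session.ComputeSession.create_from_template(
            'my-template-id',
            name='templated-run',
            cluster_size=2,   # overrides the template's cluster size
        )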
- await destroy(*, forced=False, recursive=False)
Destroys the compute session. Since the server literally kills the container(s), all ongoing executions are forcibly interrupted.
- await restart()
Restarts the compute session. The server force-destroys the current running container(s), but keeps their temporary scratch directories intact.
- await rename(new_id)
Renames the session ID of the running compute session.
- await commit()
Commits a running session to a tar file on the agent host.
- await export_to_image(new_image_name)
Commits the running session to a new image and uploads it to the designated container registry. Requires a Backend.AI server set up with the per-user image commit feature (24.03).
- await interrupt()
Tries to interrupt the current ongoing code execution. This may fail without any explicit errors depending on the code being executed.
- await complete(code, opts=None)
Gets the auto-completion candidates from the given code string, as if a user has pressed the tab key just after the code in IDEs.
Depending on the language of the compute session, this feature may not be supported. Unsupported sessions return an empty list.
- await get_info()
Retrieves brief information about the compute session.
- await get_logs()
Retrieves the console log of the compute session container.
- await get_dependency_graph()
Retrieves the root node of the dependency graph of the compute session.
- await get_status_history()
Retrieves the status transition history of the compute session.
- await execute(run_id=None, code=None, mode='query', opts=None)
Executes a code snippet directly in the compute session or sends a set of build/clean/execute commands to the compute session.
For more details about using this API, please refer to the official API documentation.
- Parameters:
run_id (str) – A unique identifier for a particular run loop. In the first call, it may be None so that the server auto-assigns one. Subsequent calls must use the returned runId value to request continuation or to send user inputs.
code (str) – A code snippet as a string. In continuation requests, it must be an empty string. When sending user inputs, this is where the user input string is stored.
mode (str) – A constant string which is one of "query", "batch", "continue", and "user-input".
opts (dict) – A dict for specifying additional options. Mainly used in the batch mode to specify build/clean/execution commands. See the API object reference for details.
- Returns:
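To make the run-loop protocol concrete, here is a hedged sketch of a query-mode loop under a synchronous session. It assumes compute is an existing ComputeSession instance; the status values and the result-dict layout (runId, status, console) follow the execution API described in the official API documentation, and the console record format is an assumption.

    run_id = None
    code = 'print(6 * 7)'
    while True:
        result = compute.execute(run_id, code,
                                 mode='query' if run_id is None else 'continue')
        run_id = result['runId']   # reuse the server-assigned run ID afterwards
        code = ''                  # continuation requests send an empty string
        for record in result.get('console', []):
            print(record)          # stdout/stderr records (layout assumed)
        if result['status'] != 'continued':
            break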
- await upload(files, basedir=None, show_progress=False)
Uploads the given list of files to the compute session. You may refer to them in batch-mode execution or from code executed in the server afterwards.
- Parameters:
files (Sequence[str | Path]) – The list of file paths on the client side. If the paths include directories, their location in the compute session is calculated from the relative path to basedir, and all intermediate parent directories are automatically created if they do not exist.
For example, if a file path is /home/user/test/data.txt (or test/data.txt) where basedir is /home/user (or the current working directory is /home/user), the uploaded file is located at /home/work/test/data.txt in the compute session container.
basedir (Union[str, Path, None]) – The directory prefix where the files reside. The default value is the current working directory.
show_progress (bool) – Displays a progress bar during uploads.
- await download(files, dest='.', show_progress=False)
Downloads the given list of files from the compute session.
- Parameters:
files (Sequence[str | Path]) – The list of file paths in the compute session. If they are relative paths, the path is calculated from /home/work in the compute session container.
dest (str | Path) – The destination directory on the client side.
show_progress (bool) – Displays a progress bar during downloads.
- await list_files(path='.')
Gets the list of files in the given path inside the compute session container.
- await get_abusing_report()
Retrieves the abusing reports of the session's sibling kernels.
- await start_service(app, *, port=Undefined.TOKEN, envs=Undefined.TOKEN, arguments=Undefined.TOKEN, login_session_token=Undefined.TOKEN)
Starts an application in the Backend.AI session and returns the credentials for accessing the AppProxy endpoint.
- listen_events(scope='*')
Opens the stream of the kernel lifecycle events. Only the master kernel of each session is monitored.
- Return type:
SSEContextManager
- Returns:
a StreamEvents object.
- stream_events(scope='*')
Opens the stream of the kernel lifecycle events. Only the master kernel of each session is monitored.
- Return type:
SSEContextManager
- Returns:
a StreamEvents object.
- stream_pty()
Opens a pseudo-terminal of the kernel (if supported) streamed via websockets.
- Return type:
- Returns:
a StreamPty object.
- stream_execute(code='', *, mode='query', opts=None)
Executes a code snippet in the streaming mode. Since the returned websocket represents a run loop, there is no need to specify run_id explicitly.
- Return type:
- class ai.backend.client.func.session.StreamPty(session, underlying_response, **kwargs)
A derivative class of WebSocketResponse which provides additional functions to control the terminal.
Session Template Functions
Virtual Folder Functions
- class ai.backend.client.func.vfolder.VFolder(name, id=None)
- classmethod await create(name, host=None, unmanaged_path=None, group=None, usage_mode='general', permission='rw', cloneable=False)
- classmethod await delete_by_id(oid)
- classmethod await list(list_all=False)
- classmethod await paginated_list(group=None, *, fields=(<default vfolder fields: host, name, status, created_at, creator, group, permission, ownership_type, status>), page_offset=0, page_size=20, filter=None, order=None)
Fetches the list of vfolders. Domain admins can only get domain vfolders.
- classmethod await paginated_own_list(*, fields=(<default vfolder fields: host, name, status, created_at, creator, group, permission, ownership_type, status>), page_offset=0, page_size=20, filter=None, order=None)
Fetches the list of own vfolders.
- classmethod await paginated_invited_list(*, fields=(<default vfolder fields: host, name, status, created_at, creator, group, permission, ownership_type, status>), page_offset=0, page_size=20, filter=None, order=None)
Fetches the list of invited vfolders.
- classmethod await paginated_project_list(*, fields=(<default vfolder fields: host, name, status, created_at, creator, group, permission, ownership_type, status>), page_offset=0, page_size=20, filter=None, order=None)
Fetches the list of project (group) vfolders.
- classmethod await list_hosts()
- classmethod await list_all_hosts()
- classmethod await list_allowed_types()
- await info()
- await delete()
- await recover()
- await restore()
- await rename(new_name)
- await download(relative_paths, *, basedir=None, dst_dir=None, chunk_size=16777216, show_progress=False, address_map=None, max_retries=20)
- Return type:
- await upload(sources, *, basedir=None, recursive=False, dst_dir=None, chunk_size=16777216, address_map=None, show_progress=False)
- Return type:
- await mkdir(path, parents=False, exist_ok=False)
- Return type:
ResultSet
- await rename_file(target_path, new_name)
- await move_file(src_path, dst_path)
- await delete_files(files, recursive=False)
- await list_files(path='.')
- await invite(perm, emails)
- classmethod await invitations()
- classmethod await accept_invitation(inv_id)
- classmethod await delete_invitation(inv_id)
- classmethod await get_fstab_contents(agent_id=None)
- classmethod await get_performance_metric(folder_host)
- classmethod await list_mounts()
- classmethod await mount_host(name, fs_location, options=None, edit_fstab=False)
- classmethod await umount_host(name, edit_fstab=False)
- await leave(shared_user_uuid=None)
- await clone(target_name, target_host=None, usage_mode='general', permission='rw')
- await update_options(name, permission=None, cloneable=None)
- classmethod await change_vfolder_ownership(vfolder, user_email)
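Most of these methods are self-describing; as a hedged end-to-end sketch, a typical workflow might look like the following. The folder name, file names, and the read-only permission code are placeholders, and the classmethod-then-instance split mirrors the method list above.

    from ai.backend.client.session import Session

    with Session() as api_session:
        # Create a new general-purpose, read-write vfolder on the default host.
        api_session.VFolder.create('mydata', usage_mode='general', permission='rw')
        vf = api_session.VFolder('mydata')
        vf.mkdir('raw')
        vf.upload(['dataset.csv'], dst_dir='raw', show_progress=True)
        print(vf.list_files('raw'))
        vf.invite('ro', ['colleague@example.com'])  # share read-only (code assumed)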
Low-level SDK Reference
Base Function
This module defines a few utilities that ease the complexity of supporting both synchronous and asynchronous API functions, using some tricks with Python metaclasses.
Unless you are contributing to the client SDK, you probably won't have to use this module directly.
- class ai.backend.client.func.base.APIFunctionMeta(name, bases, attrs, **kwargs)
Converts all methods marked with api_function() into session-aware methods that are either plain Python functions or coroutines.
- mro()
Return a type’s method resolution order.
- class ai.backend.client.func.base.BaseFunction
- @ai.backend.client.func.base.api_function(meth)
Mark the wrapped method as the API function method.
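For contributors only, a hedged sketch of how a function class might mark its methods; the decorator stacking order and the example class are assumptions based on this module's description, not a verbatim excerpt from the SDK.

    from ai.backend.client.func.base import BaseFunction, api_function

    class Hello(BaseFunction):
        @api_function
        @classmethod
        async def greet(cls, name: str) -> str:
            # Written as a coroutine; the metaclass makes it callable as a
            # plain function under Session and as a coroutine under AsyncSession.
            return f'Hello, {name}!'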
Request API
This module provides low-level API request/response interfaces based on aiohttp.
Depending on the session object where the request is made from, Request and Response differentiate their behavior: their methods work as plain Python functions or return awaitables.
- class ai.backend.client.request.Request(method='GET', path=None, content=None, *, content_type=None, params=None, reporthook=None, override_api_version=None)
The API request object.
- with async with fetch(**kwargs) as Response
Sends the request to the server and reads the response.
You may use this method with AsyncSession only, following the pattern below:
    from ai.backend.client.request import Request
    from ai.backend.client.session import AsyncSession

    async with AsyncSession() as sess:
        rqst = Request('GET', ...)
        async with rqst.fetch() as resp:
            print(await resp.text())
- Return type:
FetchContextManager
- async with connect_websocket(**kwargs) as WebSocketResponse or its derivatives
Creates a WebSocket connection.
- Return type:
WebSocketContextManager
Warning
This method only works with AsyncSession.
- property content: bytes | bytearray | str | StreamReader | IOBase | None
Retrieves the content in the original form. Internal code should NOT use this, as it incurs duplicate encoding/decoding.
- connect_events(**kwargs)
Creates a Server-Sent Events connection.
- Return type:
SSEContextManager
Warning
This method only works with AsyncSession.
- class ai.backend.client.request.Response(session, underlying_response, *, async_mode=False, **kwargs)
- class ai.backend.client.request.WebSocketResponse(session, underlying_response, **kwargs)
A high-level wrapper of aiohttp.ClientWebSocketResponse.
- class ai.backend.client.request.FetchContextManager(session, rqst_ctx_builder, *, response_cls=<class 'ai.backend.client.request.Response'>, check_status=True)
The context manager returned by Request.fetch().
It provides asynchronous context manager interfaces only.
- class ai.backend.client.request.WebSocketContextManager(session, ws_ctx_builder, *, on_enter=None, response_cls=<class 'ai.backend.client.request.WebSocketResponse'>)
The context manager returned by Request.connect_websocket().
- class ai.backend.client.request.AttachedFile(filename, stream, content_type)
A struct that represents an attached file to the API request.
- Parameters:
filename (str) – The name of file to store. It may include paths and the server will create parent directories if required.
stream (Any) – A file-like object that allows stream-reading bytes.
content_type (str) – The content type for the stream. For arbitrary binary data, use “application/octet-stream”.
- content_type
Alias for field number 2
- count(value, /)
Return number of occurrences of value.
- filename
Alias for field number 0
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- stream
Alias for field number 1
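Since AttachedFile is a plain named tuple of (filename, stream, content_type), constructing one is straightforward; a minimal sketch with an illustrative payload:

    import io
    from ai.backend.client.request import AttachedFile

    payload = AttachedFile(
        filename='inputs/data.bin',           # server creates parent dirs as needed
        stream=io.BytesIO(b'\x00\x01\x02'),   # any stream-readable file-like object
        content_type='application/octet-stream',
    )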
Exceptions
- class ai.backend.client.exceptions.BackendError
Exception type to catch all ai.backend-related errors.
- add_note()
Exception.add_note(note) – add a note to the exception
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- class ai.backend.client.exceptions.BackendAPIError(status, reason, data)
Exceptions returned by the API gateway.
- add_note()
Exception.add_note(note) – add a note to the exception
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- class ai.backend.client.exceptions.BackendClientError
Exceptions from the client library, such as argument validation errors and connection failures.
- add_note()
Exception.add_note(note) – add a note to the exception
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
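A hedged sketch of the intended catch order; treating BackendAPIError and BackendClientError as subclasses of the catch-all BackendError is an assumption based on the descriptions above.

    from ai.backend.client.exceptions import (
        BackendAPIError, BackendClientError, BackendError,
    )
    from ai.backend.client.session import Session

    with Session() as api_session:
        try:
            compute = api_session.ComputeSession.get_or_create('python:3.6-ubuntu')
        except BackendAPIError as e:
            # Server-side rejection; fields per the constructor signature above.
            print('API error:', e.status, e.reason)
        except BackendClientError as e:
            print('client-side failure:', e)
        except BackendError:
            print('other ai.backend-related error')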
Type Definitions
- class ai.backend.client.types.Undefined(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
A special type to represent an undefined value.
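The signatures above (e.g. name=Undefined.TOKEN in create_from_template()) use this sentinel to distinguish "not given" from an explicit None. A hedged sketch of the pattern with a hypothetical helper:

    from ai.backend.client.types import Undefined

    def resolve(template_value, override=Undefined.TOKEN):
        # Hypothetical helper: only replace the template's value when the
        # caller actually passed an override, even if that override is None.
        if override is Undefined.TOKEN:
            return template_value
        return override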