Backend.AI Reference Documentation

Latest API version: v6.20220615

Backend.AI is an enterprise-grade development and service backend for a wide range of AI-powered applications. Its core technology specializes in operating high-density computing clusters with GPUs and heterogeneous accelerators.

From the user's perspective, Backend.AI is a GPU-based high-performance computing / deep learning application host that works like a cloud (“Google Colab on your own machine”). Backend.AI runs the user's code securely in resource-constrained containers. It supports a variety of programming languages and runtimes such as Python 2/3, R, PHP, C/C++, Java, JavaScript, Julia, Octave, Haskell, Lua, and Node.js, as well as AI libraries such as TensorFlow, Keras, Caffe, and MXNet.

From the administrator's perspective, Backend.AI streamlines the process of allocating computing nodes, GPUs, and storage to individual research team members. Thanks to fine-grained, policy-based idleness checks and resource limits, there is no risk of exceeding the cluster's capacity even under heavy demand.

Backend.AI uses a plugin architecture to offer more advanced features, such as fractional GPU sharing and site-specific SSO integration, for enterprise customers of various scales.

Key Concepts of Backend.AI

Here we describe the key concepts that are required to understand and follow this documentation.

_images/server-architecture.svg

The diagram of a typical multi-node Backend.AI server architecture

Figure 1 gives a brief overview of the Backend.AI server-side architecture; the components shown are what you need to install and configure.

Each border-connected group of components is intended to be run on the same server, but you may split them into multiple servers or merge different groups into a single server as you need. For example, you can run separate servers for the nginx reverse-proxy and the Backend.AI manager or run both on a single server. In the development setup, all these components run on a single PC such as your laptop.

Service Components

Public-facing services

Manager and Webserver

Backend.AI manager is the central governor of the cluster. It accepts user requests, creates and destroys sessions, and routes code execution requests to appropriate agents and sessions. It also collects the output of sessions and returns it to the users.

Backend.AI agent is a small daemon installed onto individual worker servers to control them. It manages and monitors the lifecycle of kernel containers and mediates the input/output of sessions. Each agent also reports the resource capacity and status of its server, so that the manager can assign new sessions to idle servers for load balancing.

The primary networking requirements are:

  • The manager server (HTTPS port 443) should be exposed to the public Internet or to a network that your clients can access.

  • The manager, agents, and all other database/storage servers should reside on the same local private network, where all traffic between them is transparently allowed.

  • For high-volume big-data processing, you may want to separate the network for the storage using a secondary network interface on each server, such as InfiniBand or RoCE adapters.

App Proxy

Backend.AI App Proxy is a proxy to mediate the traffic between user applications and clients like browsers. It provides the central place to set the networking and firewall policy for the user application traffic.

It has two operation modes:

  • Port mapping: individual app instances are mapped to TCP ports taken from a pre-configured port range.

  • Wildcard subdomain: individual app instances are mapped to system-generated subdomains under a given top-level domain.

Depending on the session type and application launch configurations, it may require an authenticated HTTP session for HTTP-based applications. For instance, you may enforce authentication for interactive development apps like Jupyter while allowing anonymous access to AI model service APIs.

Storage Proxy

Backend.AI Storage Proxy is a proxy that offloads large file transfers from the manager. It also provides an abstraction over the underlying storage vendors' acceleration APIs, since many storage vendors offer vendor-specific APIs for filesystem operations such as scanning directories with millions of files. Using the storage proxy, we apply our abstraction models for such filesystem operations and quota management specialized to each vendor API.

FastTrack (Enterprise only)

Backend.AI FastTrack is an add-on service running on top of the manager that features a slick GUI to design and run pipelines of computation tasks. It makes it easier to monitor the progress of various MLOps pipelines running concurrently, and allows sharing such pipelines in portable ways.

Resource Management

Sokovan Orchestrator

Backend.AI Sokovan is the central cluster-level scheduler running inside the manager. It monitors the resource usage of agents and assigns new containers from the job queue to the agents.

Each resource group may have its own scheduling policy and options. The scheduling algorithm can be extended via a common abstract interface: a scheduler implementation accepts the list of currently running sessions, the list of pending sessions in the job queue, and the current resource usage of the target agents, and outputs the choice of a pending session to start along with the assignment of an agent to host it.
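The scheduler interface described above can be sketched in Python as follows. This is an illustrative simplification with made-up type and method names, not the actual plugin API in the Backend.AI source:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence

# Hypothetical, simplified records; the real plugin API uses richer types.
@dataclass
class SessionRequest:
    session_id: str
    requested_slots: Mapping[str, float]  # e.g., {"cpu": 4, "cuda.device": 1}

@dataclass
class AgentInfo:
    agent_id: str
    available_slots: Mapping[str, float]

class AbstractScheduler(ABC):
    @abstractmethod
    def pick_session(
        self,
        running: Sequence[SessionRequest],
        pending: Sequence[SessionRequest],
        agents: Sequence[AgentInfo],
    ) -> Optional[SessionRequest]:
        """Choose the next pending session to start, or None to wait."""

    @abstractmethod
    def assign_agent(
        self,
        agents: Sequence[AgentInfo],
        session: SessionRequest,
    ) -> Optional[str]:
        """Choose an agent to host the session, or None if none fits."""

class NaiveFIFOScheduler(AbstractScheduler):
    def pick_session(self, running, pending, agents):
        # Plain FIFO: always take the oldest pending session.
        return pending[0] if pending else None

    def assign_agent(self, agents, session):
        # First-fit: pick the first agent with enough free slots.
        for agent in agents:
            if all(agent.available_slots.get(slot, 0) >= amount
                   for slot, amount in session.requested_slots.items()):
                return agent.agent_id
        return None
```

A custom scheduler only needs to implement these two decisions; the surrounding job queue, retries, and resource bookkeeping are handled by the manager.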

Agent

Backend.AI Agent is a small daemon running at each compute node, such as a GPU server. Its main job is to control and monitor containers via Docker, but it also includes an abstraction of various “compute process” backends. It publishes various types of container-related events so that the manager can react to status updates of containers.

When the manager assigns a new container, the agent decides the device-level resource mappings for the container considering optimal hardware layouts such as NUMA and the PCIe bus locations of accelerator and network devices.

Internal services

Event bus

Backend.AI uses Redis to keep track of various real-time information and notify system events to other service components.

Control Panel (Enterprise only)

Backend.AI Control Panel is an add-on service to the manager for advanced management and monitoring. It provides a dedicated superadmin GUI, featuring batch creation and modification of users, detailed configuration of various resource policies, etc.

Forklift (Enterprise only)

Backend.AI Forklift is a standalone service that eases building new container images from scratch or importing existing ones that are compatible with Backend.AI.

Reservoir (Enterprise only)

Backend.AI Reservoir is an add-on service to provide open source package mirrors for air-gapped setups.

Container Registry

Backend.AI supports integration with several common container registry solutions, while open-source users may also rely on our official registry service with prebuilt images at https://cr.backend.ai:

  • Docker’s vanilla open-source registry

    • It is the simplest to set up, but does not provide advanced access controls or namespacing over container images.

  • Harbor v2 (recommended)

    • It provides a full-fledged container registry service including ACLs with project/user memberships, cloning from/to remote registries, on-premises and cloud deployments, security analysis, etc.

Computing

Sessions and kernels

Backend.AI spawns sessions to host various kinds of computation with associated computing resources. Each session may have one or more kernels. We call sessions with multiple kernels “cluster sessions”.

A kernel represents an isolated unit of computation such as a container, a virtual machine, a native process, or even a Kubernetes pod, depending on the Agent's backend implementation and configuration. The most common form of a kernel is a Docker container. Container- or VM-based kernels are also associated with base images; the most common form of a base image is an OCI container image.

Kernel roles in a cluster session

In a cluster session with multiple kernels, each kernel has a role. By default, the first container takes the “main” role while the others take the “sub” role. All kernels are given unique hostnames like “main1”, “sub1”, “sub2”, …, and “subN” (the cluster size is N+1 in this case). A non-cluster session has a single “main1” kernel only.

All interactions with a session are routed to its “main1” kernel, while the “main1” kernel is allowed to access all other kernels via a private network.
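The hostname scheme above can be illustrated with a small helper. This is a hypothetical function for illustration; Backend.AI generates these names internally:

```python
def cluster_hostnames(cluster_size: int) -> list:
    """Generate kernel hostnames for a cluster session of the given size.

    The first kernel takes the "main" role and the rest take the "sub" role,
    yielding "main1", "sub1", ..., "subN" where N = cluster_size - 1.
    """
    if cluster_size < 1:
        raise ValueError("cluster size must be at least 1")
    return ["main1"] + ["sub{}".format(i) for i in range(1, cluster_size)]
```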

See also

Cluster Networking

Session templates

A session template is a predefined set of parameters to create a session, which can be overridden by the caller. It may define additional kernel roles for a cluster session, with different base images and resource specifications.

Session types

There are several classes of sessions for different purposes, each with different features.

Features by the session type

  • Session types (columns): Compute (Interactive), Compute (Batch), Inference, System

  • Compared features (rows): Code execution, Service port, Dependencies, Session result, Clustering

A compute session is the most generic form of session to host computations. It has two operation modes: interactive and batch.

Interactive compute session

Interactive compute sessions are used to run various interactive applications and development tools, such as Jupyter Notebooks and web-based terminals. Users are expected to control their lifecycles (e.g., terminating them), while Backend.AI offers configuration knobs for administrators to set idle timeouts with various criteria.

There are two major ways to interact with an interactive compute session: service ports and the code execution API.

Service ports

TODO: port mapping diagram

Code execution

TODO: execution API state diagram

Batch compute session

Batch compute sessions are used to host a “run-to-completion” script with a finite execution time. They have two result states, SUCCESS or FAILED, determined by whether the main program's exit code is zero or not.
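The mapping from exit code to result state can be expressed as a trivial function (illustrative only, not actual Backend.AI code):

```python
def batch_session_result(exit_code: int) -> str:
    """Map the batch session's main program exit code to its result state."""
    return "SUCCESS" if exit_code == 0 else "FAILED"
```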

Dependencies between compute sessions

Pipelining

Inference session

Service endpoint and routing

Auto-scaling

System session

SFTP access

Scheduling

Backend.AI keeps track of sessions using a state machine to represent their various lifecycle stages.

TODO: session/kernel state diagram

TODO: two-level scheduler architecture diagram

See also

Resource groups

Session selection strategy
Heuristic FIFO

The default session selection strategy is the heuristic FIFO. It mostly works like a FIFO queue to select the oldest pending session, but offers an option to enable a head-of-line (HoL) blocking avoidance logic.

The HoL blocking problem happens when the oldest pending session requires too many resources to be scheduled, while subsequent pending sessions would fit within the available cluster resources. Those subsequent pending sessions never get a chance to start until the oldest pending session (the “blocker”) is either cancelled or more running sessions terminate and release cluster resources.

When enabled, the HoL blocking avoidance logic keeps track of the retry count of scheduling attempts of each pending session and pushes back the pending sessions whose retry counts exceed a certain threshold. This option should be explicitly enabled by the administrators or during installation.
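The behavior described above can be sketched as follows. This is a simplified model with made-up names and a made-up default threshold, not the actual implementation or option name:

```python
from collections import defaultdict
from typing import List, Optional

class HeuristicFIFOQueue:
    """Oldest-first selection with retry-count based pushback of blockers."""

    def __init__(self, max_retries: int = 5) -> None:
        self.max_retries = max_retries
        self.retries = defaultdict(int)  # session_id -> failed scheduling attempts

    def pick(self, pending: List[str]) -> Optional[str]:
        # Stable sort: sessions over the retry threshold sink to the back,
        # while FIFO order is preserved within each group.
        ordered = sorted(
            pending,
            key=lambda sid: self.retries[sid] > self.max_retries,
        )
        return ordered[0] if ordered else None

    def report_failure(self, session_id: str) -> None:
        # Called when the picked session could not be scheduled this round.
        self.retries[session_id] += 1
```

Without the pushback (i.e., with the avoidance logic disabled), the queue degenerates to a plain FIFO that keeps retrying the blocker forever.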

Dominant resource fairness (DRF)
Agent selection strategy
Concentrated
Dispersed
Custom

Resource Management

Resource slots

Backend.AI abstracts each different type of computing resource as a “resource slot”. Resource slots are distinguished by their names, which consist of two parts: the device name and the slot name.

Resource slot name    Device name    Slot name
cpu                   cpu            (implicitly defined as root)
mem                   mem            (implicitly defined as root)
cuda.device           cuda           device
cuda.shares           cuda           shares
cuda.mig-2c10g        cuda           mig-2c10g

Each resource slot has a slot type as follows:

  • COUNT: The value of the resource slot is an integer or decimal representing how many of the device(s) are available/allocated. It may also represent fractions of devices. Examples: cpu, cuda.device, cuda.shares

  • BYTES: The value of the resource slot is an integer representing how many bytes of the resource are available/allocated. Example: mem

  • UNIQUE: Each single unit of the device can only be allocated exclusively to one kernel. Example: cuda.mig-10g
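For illustration, here is how a resource slot mapping might look in a session creation request, together with a helper that splits a slot name into its two parts. The request shape is a simplified sketch, not the exact API payload, and the root-slot handling for intrinsic slots is an assumption:

```python
# A hypothetical resource request expressed as resource slots.
requested_slots = {
    "cpu": "4",            # COUNT: four CPU cores
    "mem": "17179869184",  # BYTES: 16 GiB
    "cuda.shares": "0.5",  # COUNT: half of a fractional GPU
}

def split_slot_name(slot_name: str):
    """Split a resource slot name into (device name, slot name).

    Intrinsic slots like "cpu" and "mem" have no dot; here we assume
    their device name equals the slot name (implicitly defined as root).
    """
    device, _, slot = slot_name.partition(".")
    return (device, slot or device)
```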

Compute plugins

Backend.AI administrators may install one or more compute plugins to each agent. Without any plugin, only the intrinsic cpu and mem resource slots are available.

Each compute plugin may declare one or more resource slots. The plugin is invoked upon startup of the agent to get the list of devices and the resource slots to report. Administrators can inspect the per-agent accelerator details provided by the compute plugins in the control panel.

The most well-known compute plugin is cuda_open, which is included in the open-source version. It declares the cuda.device resource slot, which represents each NVIDIA GPU as one unit.

There is a special compute plugin to simulate non-existent devices: mock. Developers may put a local configuration to declare an arbitrary set of devices and resource slots to test the schedulers and the frontend. It is useful for developing integrations with new hardware devices before you have the actual devices at hand.

Resource groups

A resource group is a logical group of Agents with an independent scheduler. Each agent belongs to a single resource group only. It self-reports which resource group to join when sending heartbeat messages, but the specified resource group must exist in advance.

See also

Scheduling

User Management

Users

Backend.AI user accounts have two authentication modes: session and keypair. The session mode uses the normal username and password based on browser sessions (e.g., when using the Web UI), while the keypair mode uses a pair of access and secret keys for programmatic access.

Projects

There may be multiple projects created by administrators, and users may belong to one or more projects. Administrators may configure project-level resource policies, such as storage quotas shared by all project vfolders, and project-level artifacts.

When a user creates a new session and belongs to multiple projects, they must choose which project to use, so that the corresponding resource policies apply.

Cluster Networking

Single-node cluster session

If a session is created with multiple containers with the single-node option, all containers are created on a single agent. The containers share a private bridge network in addition to the default network, so that they can interact with each other privately. There are no firewall restrictions in this private bridge network.

Multi-node cluster session

For even larger-scale computation, you may create a multi-node cluster session that spans multiple agents. In this case, the manager auto-configures a private overlay network so that the containers can interact with each other. There are no firewall restrictions in this private overlay network.

Detection of clustered setups

There is a concept called a cluster role. The current version of Backend.AI creates homogeneous cluster sessions by replicating the same resource configuration and the same container image, but we have plans to add heterogeneous cluster sessions that have different resource and image configurations for each cluster role. For instance, a Hadoop cluster may have two types of containers, name nodes and data nodes, which can be mapped to the main and sub cluster roles.

All interactive apps are executed only in the main1 container, which is always present in both cluster and non-cluster sessions. It is the user application's responsibility to connect with and utilize the other containers in a cluster session. To ease the process, Backend.AI injects the following environment variables into the containers and sets up randomly generated SSH keypairs between the containers so that each container can SSH into the others without additional prompts:

  • BACKENDAI_CLUSTER_SIZE: the number of containers in this cluster session (e.g., 4)

  • BACKENDAI_CLUSTER_HOSTS: a comma-separated list of container hostnames in this cluster session (e.g., main1,sub1,sub2,sub3)

  • BACKENDAI_CLUSTER_REPLICAS: comma-separated key:value pairs of cluster roles and the replica counts for each role (e.g., main:1,sub:3)

  • BACKENDAI_CLUSTER_HOST: the container hostname of the current container (e.g., main1)

  • BACKENDAI_CLUSTER_IDX: the one-based index of the current container among the containers sharing the same cluster role (e.g., 1)

  • BACKENDAI_CLUSTER_ROLE: the name of the current container's cluster role (e.g., main)

  • BACKENDAI_CLUSTER_LOCAL_RANK: the zero-based global index of the current container within the entire cluster session (e.g., 0)
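For example, a script running inside a cluster session can read these variables to discover its peers. This sketch only collects the values; wiring them into a specific distributed framework is up to the application:

```python
import os

def cluster_info() -> dict:
    """Collect Backend.AI cluster metadata from the environment.

    Falls back to single-node defaults when the variables are absent.
    """
    env = os.environ
    return {
        "size": int(env.get("BACKENDAI_CLUSTER_SIZE", "1")),
        "hosts": env.get("BACKENDAI_CLUSTER_HOSTS", "main1").split(","),
        "my_host": env.get("BACKENDAI_CLUSTER_HOST", "main1"),
        "role": env.get("BACKENDAI_CLUSTER_ROLE", "main"),
        "idx": int(env.get("BACKENDAI_CLUSTER_IDX", "1")),
    }

info = cluster_info()
# The "main1" kernel typically acts as the coordinator.
is_main = info["my_host"] == "main1"
peers = [h for h in info["hosts"] if h != info["my_host"]]
```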

Storage Management

Virtual folders

Backend.AI abstracts network storages as a set of “virtual folders” (aka “vfolders”), which provide persistent file storage to users and projects.

When creating a new session, users may connect vfolders to it with read-only or read-write permissions. If a shared vfolder is limited to read-only permission, the user may connect it with read-only permission only. Virtual folders are mounted into compute session containers at /home/work/{name} so that user programs can access the virtual folder contents like a local directory. The mount path inside containers may be customized (e.g., /workspace) for compatibility with existing scripts and code. Currently it is not possible to unmount or delete a vfolder while any running session is connected to it. For cluster sessions having multiple kernels (containers), the connected vfolders are mounted into all kernels at the same location with the same permission.

For a multi-node setup, the storage volume mounts must be synchronized across all Agent nodes and the Storage Proxy node(s) using the same mount path (e.g., /mnt or /vfroot). For a single-node setup, you may simply use an empty local directory, like our scripts/install-dev.sh script does.

From the perspective of the storage, all vfolders from different Backend.AI users and projects share a single UID and GID. This allows flexible permission sharing between users and projects, while keeping the Linux ownership of the files and directories consistent when they are accessed by multiple different Backend.AI users.

User-owned vfolders

Users may create one or more virtual folders of their own to store data files, libraries, and program code. Superadmins may limit the maximum number of vfolders owned by a user.

Project-owned vfolders

The project admins and superadmins may create a vfolder that is automatically shared to all members of the project, with a specific read-only or read-write permission.

Note

If allowed, users and projects may create and access vfolders in multiple different storage volumes, but vfolder names must be unique across all storage volumes for each user and project.

VFolder invitations and permissions

Users and project administrators may invite other users to collaborate on a vfolder. Once the invitee accepts the request, they get the designated read-only or read-write permission on the shared vfolder.

Volume-level permissions

The superadmin may set additional action privileges to each storage volume, such as whether to allow or block mounting the vfolders in compute sessions, cloning the vfolders, etc.

Auto-mount vfolders

If a user-owned vfolder's name starts with a dot, it is automatically mounted at /home/work for all sessions created by the user. A good use case is the .config and .local directories, which keep your local configurations and user-installed packages (e.g., pip install --user) persistent across all your sessions.

Quota scopes

Added in version 23.03.

Quota scopes implement per-user and per-project storage usage limits. Currently they support hard limits specified in bytes. There are two main schemes to set up this feature.

Storage with per-directory quota
_images/vfolder-dir-quota.svg

Quota scopes and vfolders with storage solutions supporting per-directory quota

For each storage volume, each user and project has its own dedicated quota scope directory as shown in Figure 2. The storage solution must support per-directory quota, at least for a single level (like NetApp's QTree). We recommend this configuration for filesystems like CephFS, Weka.io, or custom-built storage servers using ZFS or XFS, where the Backend.AI Storage Proxy can be installed directly onto the storage servers.

Storage with per-volume quota
_images/vfolder-volume-quota.svg

Quota scopes and vfolders with storage solutions supporting per-volume quota

Unfortunately, in many cases we cannot rely on per-directory quota support in storage solutions, due to limitations of the underlying filesystem implementation or a lack of direct access to the storage vendor APIs.

In this case, we may assign dedicated storage volumes to each user and project as in Figure 3, which naturally limits the space usage by the volume size. Another option is not to configure quota limits at all, but we do not recommend this in production setups.

The shortcoming is that we may need to frequently mount/unmount the network volumes when we create or remove users and projects, which may cause unexpected system failures due to stale file descriptors.

Note

For shared vfolders, the quota usage is accounted to the original owner of the vfolder, either a user or a project.

Warning

For both schemes, the administrator should take care of the storage solution's system limits, such as the maximum number of volumes and quota sets, because such limits may impose a hidden cap on the maximum number of users and projects in Backend.AI.

Configuration

Shared config

Most cluster-level configurations are stored in an Etcd service. The Etcd server is also used for service discovery: when new agents boot up, they register themselves to the cluster manager via etcd. For production deployments, we recommend using an Etcd cluster composed of an odd number of nodes (3, 5, or more) for high availability.

Local config

Each service component has a TOML-based local configuration. It defines node-specific settings such as the agent name, the resource group it belongs to, specific system limits, and the IP address and TCP port(s) to bind its service traffic to.

The configuration files are named after the service components, like manager.toml, agent.toml, and storage-proxy.toml. The search paths are: the current working directory, ~/.config/backend.ai, and /etc/backend.ai.
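The search order can be modeled with a small sketch (illustrative only, not the actual loader code):

```python
from pathlib import Path
from typing import Optional

def find_local_config(component: str) -> Optional[Path]:
    """Return the first existing local config file for a component
    (e.g., "manager" resolves to manager.toml), following the
    documented search order."""
    search_paths = [
        Path.cwd(),                                # current working directory
        Path.home() / ".config" / "backend.ai",    # per-user config
        Path("/etc/backend.ai"),                   # system-wide config
    ]
    for base in search_paths:
        candidate = base / "{}.toml".format(component)
        if candidate.is_file():
            return candidate
    return None
```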

See also

The sample configurations in our source repository. Inside each component directory, sample.toml contains the full configuration schema and descriptions.

Monitoring

Dashboard (Enterprise only)

Backend.AI Dashboard is an add-on service that displays various real-time and historical performance metrics. The metrics include the number of sessions, cluster power usage, GPU utilization, etc.

Alerts (Enterprise only)

Administrators may configure automatic alerts based on thresholds on the monitored metrics, delivered via external messaging services such as email and SMS.

Frequently Asked Questions

vs. Notebooks

  • Apache Zeppelin, Jupyter Notebook
    • Role: notebook-style document + code frontend
    • Familiar to data scientists and researchers, but hard to avoid insecure host resource sharing

  • Backend.AI
    • Role: a backend architecture that can connect to any frontend
    • Built for multi-tenancy: scalable and better isolation

vs. Orchestration Frameworks

  • Amazon ECS, Kubernetes
    • Target: long-running interactive services
    • Load balancing, fault tolerance, incremental deployment

  • Amazon Lambda, Azure Functions
    • Target: stateless, light-weight, short-lived functions
    • Serverless architecture that requires no management

  • Backend.AI
    • Target: stateful batch computations mixed with interactive applications
    • Low-cost high-density computation, maximization of hardware potential

vs. Big Data and AI Frameworks

  • TensorFlow, Apache Spark, Apache Hive
    • Role: computation runtime
    • Difficult to install, configure, and operate at scale

  • Amazon ML, Azure ML, GCP ML
    • Role: managed MLaaS
    • Highly scalable, but dependent on each platform and still requires a system engineering background

  • Backend.AI
    • Role: host of computation runtimes
    • Pre-configured, versioned, reproducible, customizable (open-source)

(All product names and trademarks are the property of their respective owners.)

Installation Guides

Install from Source

Note

For production deployments, we recommend creating separate virtualenvs for individual services and installing the pre-built wheel distributions, following Install from Packages.

Setting Up Manager and Agent (single node, all-in-one)

Check out Development Setup.

Setting Up Additional Agents (multi-node)

Updating manager configuration for multi-nodes

Since scripts/install-dev.sh assumes a single-node all-in-one setup, it configures the etcd and Redis addresses to be 127.0.0.1.

You need to update the etcd configuration of the Redis address so that additional agent nodes can connect to the Redis server using the address advertised via etcd:

$ ./backend.ai mgr etcd get config/redis/addr
127.0.0.1:xxxx
$ ./backend.ai mgr etcd put config/redis/addr MANAGER_IP:xxxx  # use the port number read above

where MANAGER_IP is an IP address of the manager node accessible from other agent nodes.

Installing additional agents in different nodes

First, you need to initialize a working copy of the core repository for each additional agent node. As our scripts/install-dev.sh does not yet provide an “agent-only” installation mode, you need to manually perform the same repository cloning along with the pyenv, Python, and Pants setup procedures as the script does.

Note

Since we use the mono-repo for the core packages, there is no way to separately clone the agent sources only. Just clone the entire repository and configure/execute the agent only. Ensure that you also pull the LFS files and submodules when you manually clone it.

Once Pants is up and working, run pants export to populate the virtualenvs and install the dependencies.

Then configure agent.toml, starting from a copy of configs/agent/halfstack.toml, as follows:

  • agent.toml

    • [etcd].addr.host: Replace with MANAGER_IP

    • [agent].rpc-listen-addr.host: Replace with AGENT_IP

    • [container].bind-host: Replace with AGENT_IP

    • [watcher].service-addr.host: Replace with AGENT_IP

where AGENT_IP is an IP address of this agent node accessible from the manager and MANAGER_IP is an IP address of the manager node accessible from this agent node.
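Put together, the edited agent.toml may contain entries like the following sketch. The addresses are examples standing in for MANAGER_IP and AGENT_IP, and the port numbers and exact key layout should be taken from your halfstack.toml copy rather than from here:

```toml
[etcd]
addr = { host = "10.0.1.10", port = 2379 }   # MANAGER_IP

[agent]
rpc-listen-addr = { host = "10.0.2.21", port = 6001 }  # AGENT_IP

[container]
bind-host = "10.0.2.21"  # AGENT_IP

[watcher]
service-addr = { host = "10.0.2.21", port = 6009 }  # AGENT_IP
```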

Now execute ./backend.ai ag start-server to connect this agent node to an existing manager.

We assume that the agent and manager nodes reside in the same local network, where all TCP ports are open to each other. If this is not the case, you should configure firewalls to open all the port numbers appearing in agent.toml.

There are more complicated setup scenarios such as splitting network planes for control and container-to-container communications, but we provide assistance with them for enterprise customers only.

Setting Up Accelerators

Ensure that your accelerator is properly set up using vendor-specific installation methods.

Clone the accelerator plugin package into the plugins directory if necessary, or just use one of the ones already existing in the mono-repo.

You also need to configure agent.toml’s [agent].allow-compute-plugins with the full package path (e.g., ai.backend.accelerator.cuda_open) to activate them.
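For example, activating the open-source CUDA plugin mentioned above would look like this in agent.toml (shown as a list value; check the component's sample.toml for the exact schema):

```toml
[agent]
allow-compute-plugins = ["ai.backend.accelerator.cuda_open"]
```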

Setting Up Shared Storage

To make vfolders work properly with multiple nodes, you must enable and configure Linux NFS to share the manager node's vfroot/local directory under the working copy and mount it at the same path on all agent nodes.

It is recommended to unify the UID and GID of the storage-proxy service, all of the agent services across nodes, container UID and GID (configurable in agent.toml), and the NFS volume.

Configuring Overlay Networks for Multi-node Training (Optional)

Note

All other features of Backend.AI except multi-node training work without this configuration. The Docker Swarm mode is used to configure overlay networks to ensure privacy between cluster sessions, while the container monitoring and configuration is done by Backend.AI itself.

Currently the cross-node inter-container overlay routing is controlled via Docker Swarm's overlay networks. In the manager, you need to create a Swarm. In the agent nodes, you need to join the Swarm. Then restart all manager and agent daemons to make it work.

Install from Packages

This guide covers how to install Backend.AI from the official release packages. You can build a fully-functional Backend.AI cluster with open-source packages.

Backend.AI consists of a variety of components, including open-source core components, pluggable extensions, and enterprise modules. Some of the major components are:

  • Backend.AI Manager : API gateway and resource management. Manager delegates workload requests to Agent and storage/file requests to Storage Proxy.

  • Backend.AI Agent : Installed on compute nodes (usually GPU nodes) to start and manage workload execution. It sends periodic heartbeat signals to the Manager in order to register itself as a worker node. Even if the connection to the Manager is temporarily lost, pre-initiated workloads continue to be executed.

  • Backend.AI Storage Proxy : Handles requests relating to storage and files. It offloads the Manager’s burden of handling long-running file I/O operations. It embeds a plugin backend structure that provides dedicated features for each storage type.

  • Backend.AI Webserver : A web server that provides persistent user web sessions. Users can use the Backend.AI features without subsequent authentication upon initial login. It also serves the statically built graphical user interface in an Enterprise environment.

  • Backend.AI Web UI : A web application with a graphical user interface. Users can enjoy the easy-to-use interface to launch their secure execution environments and use apps like Jupyter and Terminal. It can be served as statically built JavaScript via the Webserver, and is also offered as desktop applications for many operating systems and architectures.

Most components can be installed in a single management node except Agent, which is usually installed on dedicated computing nodes (often GPU servers). However, this is not a rule and Agent can also be installed on the management node.

It is also possible to configure a high-availability (HA) setup with three or more management nodes, although this is not the focus of this guide.

Setup OS Environment

Backend.AI and its associated components share common requirements and configurations for proper operation. This section explains how to configure the OS environment.

Note

This section assumes the installation on Ubuntu 20.04 LTS.

Create a user account for operation

We will create a user account bai to install and operate Backend.AI services. Set the UID and GID to 1100 to prevent conflicts with other users or groups. sudo privilege is required, so add bai to the sudo group.

$ username="bai"
$ password="secure-password"
$ sudo adduser --disabled-password --uid 1100 --gecos "" $username
$ echo "$username:$password" | sudo chpasswd
$ sudo usermod -aG sudo bai

If you do not want to expose your password in the shell history, remove the --disabled-password option and interactively enter your password.

Log in as the bai user and continue the installation.

Install Docker engine

Backend.AI requires Docker Engine to create compute sessions with the Docker container backend, and some service components are themselves deployed as containers, so installing Docker Engine is required. Ensure docker-compose-plugin is installed as well to use the docker compose command.

After the installation, add the bai user to the docker group so that you do not need to prefix every Docker command with sudo.

$ sudo usermod -aG docker bai

Log out and log in again to apply the group membership change.

Optimize sysctl/ulimit parameters

This is not essential, but a recommended step to optimize the performance and stability of operating Backend.AI. Refer to the guide in the Manager repository for the details of the kernel parameters and the ulimit settings. Depending on the Backend.AI services you install, the optimal values may vary. Each service installation section provides the recommended values, if needed.

Note

Modern systems may have already set the optimal parameters. In that case, you can skip this step.

To cleanly separate the configurations, you may follow the steps below.

  • Save the resource limit parameters in /etc/security/limits.d/99-backendai.conf.

    root hard nofile 512000
    root soft nofile 512000
    root hard nproc 65536
    root soft nproc 65536
    bai hard nofile 512000
    bai soft nofile 512000
    bai hard nproc 65536
    bai soft nproc 65536
    
  • Log out and log back in to apply the resource limit changes.

  • Save the kernel parameters in /etc/sysctl.d/99-backendai.conf.

    fs.file-max=2048000
    net.core.somaxconn=1024
    net.ipv4.tcp_max_syn_backlog=1024
    net.ipv4.tcp_slow_start_after_idle=0
    net.ipv4.tcp_fin_timeout=10
    net.ipv4.tcp_window_scaling=1
    net.ipv4.tcp_tw_reuse=1
    net.ipv4.tcp_early_retrans=1
    net.ipv4.ip_local_port_range="10000 65000"
    net.core.rmem_max=16777216
    net.core.wmem_max=16777216
    net.ipv4.tcp_rmem=4096 12582912 16777216
    net.ipv4.tcp_wmem=4096 12582912 16777216
    vm.overcommit_memory=1
    
  • Apply the kernel parameters with sudo sysctl -p /etc/sysctl.d/99-backendai.conf.
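
If you want to double-check that the parameters took effect, the following sketch (assuming the config path above) parses the file and compares each key against the live values under /proc/sys:

```python
from pathlib import Path

def parse_sysctl_conf(text: str) -> dict[str, str]:
    """Parse key=value lines from a sysctl.d-style file, skipping comments."""
    params: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip().strip('"')
    return params

def check_live_values(conf_path: str = "/etc/sysctl.d/99-backendai.conf") -> None:
    """Compare each configured key with its current value in /proc/sys."""
    for key, expected in parse_sysctl_conf(Path(conf_path).read_text()).items():
        live = Path("/proc/sys", key.replace(".", "/")).read_text().strip()
        match = " ".join(live.split()) == " ".join(expected.split())
        print(f"{key}: expected {expected}, live {live} -> {'OK' if match else 'DIFFERS'}")
```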

Prepare required Python versions and virtual environments

Prepare a Python distribution whose version meets the requirements of the target package. Backend.AI 22.09, for example, requires Python 3.10. The latest information on Python version compatibility can be found here.
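
A quick way to confirm that the interpreter you are about to use meets the requirement (a minimal sketch; the (3, 10) floor matches the Backend.AI 22.09 example above):

```python
import sys

def meets_requirement(version_info, required=(3, 10)) -> bool:
    """Return True if the interpreter version satisfies the minimum requirement."""
    return tuple(version_info[:2]) >= tuple(required)

if meets_requirement(sys.version_info):
    print("Python version OK:", sys.version.split()[0])
else:
    print("Python is too old for Backend.AI 22.09:", sys.version.split()[0])
```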

There can be several ways to prepare a specific Python version. Here, we will be using pyenv and pyenv-virtualenv.

Use pyenv to manually build and select a specific Python version

Install pyenv and pyenv-virtualenv. Then, install the Python version that is needed:

$ pyenv install "${YOUR_PYTHON_VERSION}"

Note

You may need to install the suggested build environment to build Python with pyenv.

Then, you can create a separate virtual environment for each service. For example, to create a virtual environment for Backend.AI Manager 22.09.x and activate it automatically, run:

$ mkdir "${HOME}/manager"
$ cd "${HOME}/manager"
$ pyenv virtualenv "${YOUR_PYTHON_VERSION}" bai-22.09-manager
$ pyenv local bai-22.09-manager
$ pip install -U pip setuptools wheel

The last command makes sure that pip, along with the latest wheel and setuptools packages, is available in the virtual environment, so that any non-binary extension packages can be compiled and installed on your system.

Use a standalone static built Python

Alternatively, you can use a standalone, statically built Python distribution.

Warning

Details will be added later.

Configure network aliases

Although not required, using network aliases instead of IP addresses can make setup and operation easier. Edit the /etc/hosts file on each node and append entries like the example below to access each server by its network alias.

##### BEGIN for Backend.AI services #####
10.20.30.10 bai-m1   # management node 01
10.20.30.20 bai-a01  # agent node 01 (GPU 01)
10.20.30.22 bai-a02  # agent node 02 (GPU 02)
##### END for Backend.AI services #####

Note that the IP addresses should be accessible from the other nodes if you are installing on multiple servers.

Mount a shared storage

Having a shared storage volume makes it easy to save and manage data inside a Backend.AI compute environment. If you have dedicated storage, mount it with a name of your choice under the /vfroot/ directory on each server. You must mount it at the same path on all management and compute nodes.

Detailed mount procedures may vary depending on the storage type or vendor. For a usual NFS, adding the configuration to /etc/fstab and executing sudo mount -a will do the job.
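
For example, assuming a hypothetical NFS server at 10.20.30.5 exporting /export/bai, the /etc/fstab entry could look like:

```
10.20.30.5:/export/bai  /vfroot/nfs  nfs  defaults,_netdev  0  0
```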

Note

It is recommended to unify the UID and GID of the Storage Proxy service, all of the Agent services across nodes, container UID and GID (configurable in agent.toml), and the NFS volume.

If you do not have dedicated storage or are installing on a single server, you can use a local directory. Just create the directory /vfroot/local.

$ sudo mkdir -p /vfroot/local
$ sudo chown -R $(id -u):$(id -g) /vfroot

Setup accelerators

If there are accelerators (e.g., GPU) on the server, you have to install the vendor-specific drivers and libraries to make sure the accelerators are properly set up and working. Please refer to the vendor documentation for the details.

  • To integrate NVIDIA GPUs,

    • Install the NVIDIA driver and CUDA toolkit.

    • Install the NVIDIA container toolkit (nvidia-docker2).

Pull container images

For compute nodes, you need to pull the container images required for creating compute sessions. Lablup provides a set of open container images, and you may pull the following starter images:

$ docker pull cr.backend.ai/stable/filebrowser:21.02-ubuntu20.04
$ docker pull cr.backend.ai/stable/python:3.9-ubuntu20.04
$ docker pull cr.backend.ai/stable/python-pytorch:1.11-py38-cuda11.3
$ docker pull cr.backend.ai/stable/python-tensorflow:2.7-py38-cuda11.3

Prepare Database

Backend.AI makes use of PostgreSQL as its main database. Launch the service using docker compose by generating the file $HOME/halfstack/docker-compose.hs.postgres.yaml and populating it with the following YAML. Feel free to adjust the volume paths and port settings. Please refer to the latest configuration (it is a symbolic link, so follow the filename in it) if needed.

version: "3"
x-base: &base
   logging:
      driver: "json-file"
      options:
         max-file: "5"
         max-size: "10m"

services:
   backendai-pg-active:
      <<: *base
      image: postgres:13.2-alpine
      restart: unless-stopped
      command: >
         postgres
         -c "max_connections=256"
         -c "max_worker_processes=4"
         -c "deadlock_timeout=10s"
         -c "lock_timeout=60000"
         -c "idle_in_transaction_session_timeout=60000"
      environment:
         - POSTGRES_USER=postgres
         - POSTGRES_PASSWORD=develove
         - POSTGRES_DB=backend
         - POSTGRES_INITDB_ARGS="--data-checksums"
      healthcheck:
         test: ["CMD", "pg_isready", "-U", "postgres"]
         interval: 10s
         timeout: 3s
         retries: 10
      volumes:
         - "${HOME}/.data/backend.ai/postgres-data/active:/var/lib/postgresql/data:rw"
      ports:
         - "8100:5432"
      networks:
         half_stack:
      cpu_count: 4
      mem_limit: "4g"

networks:
    half_stack:

Execute the following command to start the service container. The compose project name ${USER} is given for operational convenience.

$ cd ${HOME}/halfstack
$ docker compose -f docker-compose.hs.postgres.yaml -p ${USER} up -d
$ # -- To terminate the container:
$ # docker compose -f docker-compose.hs.postgres.yaml -p ${USER} down
$ # -- To see the container logs:
$ # docker compose -f docker-compose.hs.postgres.yaml -p ${USER} logs -f

Prepare Cache Service

Backend.AI makes use of Redis as its main cache service. Launch the service using docker compose by generating the file $HOME/halfstack/docker-compose.hs.redis.yaml and populating it with the following YAML. Feel free to adjust the volume paths and port settings. Please refer to the latest configuration (it is a symbolic link, so follow the filename in it) if needed.

version: "3"
x-base: &base
   logging:
      driver: "json-file"
      options:
         max-file: "5"
         max-size: "10m"

services:
   backendai-halfstack-redis:
      <<: *base
      image: redis:6.2-alpine
      restart: unless-stopped
      command: >
         redis-server
         --requirepass develove
         --appendonly yes
      volumes:
         - "${HOME}/.data/backend.ai/redis-data:/data:rw"
      healthcheck:
         test: ["CMD", "redis-cli", "--raw", "incr", "ping"]
         interval: 10s
         timeout: 3s
         retries: 10
      ports:
         - "8110:6379"
      networks:
         - half_stack
      cpu_count: 1
      mem_limit: "2g"

networks:
   half_stack:

Execute the following command to start the service container. The compose project name ${USER} is given for operational convenience.

$ cd ${HOME}/halfstack
$ docker compose -f docker-compose.hs.redis.yaml -p ${USER} up -d
$ # -- To terminate the container:
$ # docker compose -f docker-compose.hs.redis.yaml -p ${USER} down
$ # -- To see the container logs:
$ # docker compose -f docker-compose.hs.redis.yaml -p ${USER} logs -f

Prepare Config Service

Backend.AI makes use of Etcd as its main config service. Launch the service using docker compose by generating the file $HOME/halfstack/docker-compose.hs.etcd.yaml and populating it with the following YAML. Feel free to adjust the volume paths and port settings. Please refer to the latest configuration (it is a symbolic link, so follow the filename in it) if needed.

version: "3"
x-base: &base
   logging:
      driver: "json-file"
      options:
         max-file: "5"
         max-size: "10m"

services:
   backendai-halfstack-etcd:
      <<: *base
      image: quay.io/coreos/etcd:v3.4.15
      restart: unless-stopped
      command: >
         /usr/local/bin/etcd
         --name etcd-node01
         --data-dir /etcd-data
         --listen-client-urls http://0.0.0.0:2379
         --advertise-client-urls http://0.0.0.0:8120
         --listen-peer-urls http://0.0.0.0:2380
         --initial-advertise-peer-urls http://0.0.0.0:8320
         --initial-cluster etcd-node01=http://0.0.0.0:8320
         --initial-cluster-token backendai-etcd-token
         --initial-cluster-state new
         --auto-compaction-retention 1
      volumes:
         - "${HOME}/.data/backend.ai/etcd-data:/etcd-data:rw"
      healthcheck:
         test: ["CMD", "etcdctl", "endpoint", "health"]
         interval: 10s
         timeout: 3s
         retries: 10
      ports:
         - "8120:2379"
         # - "8320:2380"  # listen peer (only if required)
      networks:
         - half_stack
      cpu_count: 1
      mem_limit: "1g"

networks:
   half_stack:

Execute the following command to start the service container. The compose project name ${USER} is given for operational convenience.

$ cd ${HOME}/halfstack
$ docker compose -f docker-compose.hs.etcd.yaml -p ${USER} up -d
$ # -- To terminate the container:
$ # docker compose -f docker-compose.hs.etcd.yaml -p ${USER} down
$ # -- To see the container logs:
$ # docker compose -f docker-compose.hs.etcd.yaml -p ${USER} logs -f

Install Backend.AI Manager

Refer to Prepare required Python versions and virtual environments to set up Python and a virtual environment for the service.

Install the latest version of Backend.AI Manager for the current Python version:

$ cd "${HOME}/manager"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-manager

If you want to install a specific version:

$ pip install -U backend.ai-manager==${BACKEND_PKG_VERSION}
Local configuration

Backend.AI Manager uses a TOML file (manager.toml) to configure local service. Refer to the manager.toml sample file for a detailed description of each section and item. A configuration example would be:

[etcd]
namespace = "local"
addr = { host = "bai-m1", port = 8120 }
user = ""
password = ""

[db]
type = "postgresql"
addr = { host = "bai-m1", port = 8100 }
name = "backend"
user = "postgres"
password = "develove"

[manager]
num-proc = 2
service-addr = { host = "0.0.0.0", port = 8081 }
# user = "bai"
# group = "bai"
ssl-enabled = false

heartbeat-timeout = 30.0
pid-file = "/home/bai/manager/manager.pid"
disabled-plugins = []
hide-agents = true
# event-loop = "asyncio"
# importer-image = "lablup/importer:manylinux2010"
distributed-lock = "filelock"

[docker-registry]
ssl-verify = false

[logging]
level = "INFO"
drivers = ["console", "file"]

[logging.pkg-ns]
"" = "WARNING"
"aiotools" = "INFO"
"aiopg" = "WARNING"
"aiohttp" = "INFO"
"ai.backend" = "INFO"
"alembic" = "INFO"

[logging.console]
colored = true
format = "verbose"

[logging.file]
path = "./logs"
filename = "manager.log"
backup-count = 10
rotation-size = "10M"

[debug]
enabled = false
enhanced-aiomonitor-task-info = true

Save the contents to ${HOME}/.config/backend.ai/manager.toml. Backend.AI will automatically recognize the location. Adjust each field to conform to your system.

Global configuration

Etcd (cluster) stores globally shared configurations for all nodes. Some of them should be populated prior to starting the service.

Note

It might be a good idea to create a backup of the current Etcd configuration before modifying the values. You can do so by simply executing:

$ backend.ai mgr etcd get --prefix "" > ./etcd-config-backup.json

To restore the backup:

$ backend.ai mgr etcd delete --prefix ""
$ backend.ai mgr etcd put-json "" ./etcd-config-backup.json
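
Before deleting the live configuration for a restore, it may be worth verifying that the backup file actually parses as JSON (a minimal sketch, assuming the backup path used above):

```python
import json
from pathlib import Path

def validate_backup(path: str = "./etcd-config-backup.json") -> int:
    """Raise if the backup is not valid JSON; return the number of top-level keys."""
    data = json.loads(Path(path).read_text())
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return len(data)
```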

The commands below should be executed in the ${HOME}/manager directory.

To list the keys under a specific prefix in Etcd, for example, the config key:

$ backend.ai mgr etcd get --prefix config

Now, configure the Redis access information. The address should be accessible from all nodes.

$ backend.ai mgr etcd put config/redis/addr "bai-m1:8110"
$ backend.ai mgr etcd put config/redis/password "develove"

Set the container registry. The following is Lablup's open registry (cr.backend.ai). You can set your own registry with a username and password if needed. This can be configured via the GUI as well.

$ backend.ai mgr etcd put config/docker/image/auto_pull "tag"
$ backend.ai mgr etcd put config/docker/registry/cr.backend.ai "https://cr.backend.ai"
$ backend.ai mgr etcd put config/docker/registry/cr.backend.ai/type "harbor2"
$ backend.ai mgr etcd put config/docker/registry/cr.backend.ai/project "stable"
$ # backend.ai mgr etcd put config/docker/registry/cr.backend.ai/username "bai"
$ # backend.ai mgr etcd put config/docker/registry/cr.backend.ai/password "secure-password"

Also, populate the Storage Proxy configuration into Etcd:

$ # Allow project (group) folders.
$ backend.ai mgr etcd put volumes/_types/group ""
$ # Allow user folders.
$ backend.ai mgr etcd put volumes/_types/user ""
$ # Default volume host. The name of the volume proxy here is "bai-m1" and volume name is "local".
$ backend.ai mgr etcd put volumes/default_host "bai-m1:local"
$ # Set the "bai-m1" proxy information.
$ # User (browser) facing API endpoint of Storage Proxy.
$ # Cannot use host alias here. It should be user-accessible URL.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/client_api "http://10.20.30.10:6021"
$ # Manager facing internal API endpoint of Storage Proxy.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/manager_api "http://bai-m1:6022"
$ # Random secret string which is used by Manager to communicate with Storage Proxy.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/secret "secure-token-to-authenticate-manager-request"
$ # Option to disable SSL verification for the Storage Proxy.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/ssl_verify "false"

Check if the configuration is properly populated:

$ backend.ai mgr etcd get --prefix volumes

Note that you have to change the secret to a unique random string for secure communication between the Manager and Storage Proxy. The most recent set of parameters can be found in sample.etcd.volumes.json.
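
One way to produce such a random secret is Python's secrets module (a minimal sketch):

```python
import secrets

def generate_secret(nbytes: int = 32) -> str:
    """Return a URL-safe random token suitable for the Storage Proxy secret."""
    return secrets.token_urlsafe(nbytes)

print(generate_secret())
```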

To enable access to the volumes defined by the Storage Proxy for every user, you need to update the allowed_vfolder_hosts column of the domains table to hold the storage volume reference (e.g., bai-m1:local). You can do this by issuing an SQL statement directly inside the PostgreSQL container:

$ vfolder_host_val='{"bai-m1:local": ["create-vfolder", "modify-vfolder", "delete-vfolder", "mount-in-session", "upload-file", "download-file", "invite-others", "set-user-specific-permission"]}'
$ docker exec -it bai-backendai-pg-active-1 psql -U postgres -d backend \
      -c "UPDATE domains SET allowed_vfolder_hosts = '${vfolder_host_val}' WHERE name = 'default';"
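
Hand-writing the JSON value inside shell quotes is error-prone; you can also generate it programmatically (a sketch using the same volume name and permission list as above):

```python
import json

# Permissions granted on the "bai-m1:local" volume for the default domain.
permissions = [
    "create-vfolder", "modify-vfolder", "delete-vfolder",
    "mount-in-session", "upload-file", "download-file",
    "invite-others", "set-user-specific-permission",
]
vfolder_host_val = json.dumps({"bai-m1:local": permissions})
print(vfolder_host_val)
```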
Populate the database with initial fixtures

You need to prepare an alembic.ini file under ${HOME}/manager to manage the database schema. Copy the sample halfstack.alembic.ini and save it as ${HOME}/manager/alembic.ini. Adjust the sqlalchemy.url field if the database connection information differs from the default. You may need to change localhost to bai-m1.

Populate the database schema and initial fixtures. Copy the example JSON files (example-keypairs.json and example-resource-presets.json) as keypairs.json and resource-presets.json and save them under ${HOME}/manager/. Customize them with unique keypairs and passwords for your initial superadmin and sample user accounts for security.

$ backend.ai mgr schema oneshot
$ backend.ai mgr fixture populate ./keypairs.json
$ backend.ai mgr fixture populate ./resource-presets.json
Sync the information of container registry

You need to scan the image catalog and metadata from the container registry into the Manager. This is required to display the list of compute environments in the user web GUI (Web UI). Run the following command to sync the information with Lablup's public container registry:

$ backend.ai mgr image rescan cr.backend.ai
Run Backend.AI Manager service

You can run the service:

$ cd "${HOME}/manager"
$ python -m ai.backend.manager.server

Check if the service is running. The default Manager API port is 8081, which can be configured in manager.toml:

$ curl bai-m1:8081
{"version": "v6.20220615", "manager": "22.09.6"}
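
The hello response can also be checked programmatically, for example when scripting health checks. A minimal sketch (the function name is illustrative) that parses the JSON body shown above:

```python
import json

def parse_manager_hello(body: str) -> tuple[str, str]:
    """Extract (api_version, manager_version) from the Manager hello response."""
    data = json.loads(body)
    return data["version"], data["manager"]

api_ver, mgr_ver = parse_manager_hello('{"version": "v6.20220615", "manager": "22.09.6"}')
print(api_ver, mgr_ver)
```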

Press Ctrl-C to stop the service.

Register systemd service

The service can be registered as a systemd daemon. This is entirely optional, but it is recommended so that the service starts automatically after the host machine reboots.

First, create a runner script at ${HOME}/bin/run-manager.sh:

#! /bin/bash
set -e

if [ -z "$HOME" ]; then
   export HOME="/home/bai"
fi

# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
   export PYENV_ROOT="$HOME/.pyenv"
   export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

if [ "$#" -eq 0 ]; then
   exec python -m ai.backend.manager.server
else
   exec "$@"
fi

Make the script executable:

$ chmod +x "${HOME}/bin/run-manager.sh"

Then, create a systemd service file at /etc/systemd/system/backendai-manager.service:

[Unit]
Description= Backend.AI Manager
Requires=network.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/home/bai/bin/run-manager.sh
PIDFile=/home/bai/manager/manager.pid
User=1100
Group=1100
WorkingDirectory=/home/bai/manager
TimeoutStopSec=5
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072

[Install]
WantedBy=multi-user.target

Finally, enable and start the service:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-manager

$ # To check the service status
$ sudo systemctl status backendai-manager
$ # To restart the service
$ sudo systemctl restart backendai-manager
$ # To stop the service
$ sudo systemctl stop backendai-manager
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-manager -f

Install Backend.AI Agent

If there are dedicated compute nodes (often GPU nodes) in your cluster, the Backend.AI Agent service should be installed on the compute nodes, not on the management node.

Refer to Prepare required Python versions and virtual environments to set up Python and a virtual environment for the service.

Install the latest version of Backend.AI Agent for the current Python version:

$ cd "${HOME}/agent"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-agent

If you want to install a specific version:

$ pip install -U backend.ai-agent==${BACKEND_PKG_VERSION}
Setting Up Accelerators

Note

You can skip this section if your system does not have H/W accelerators.

Backend.AI supports various H/W accelerators. To integrate them with Backend.AI, you need to install the corresponding accelerator plugin package. Before installing the package, make sure that the accelerator is properly set up using vendor-specific installation methods.

The most popular accelerator today is the NVIDIA GPU. To install the open-source CUDA accelerator plugin, run:

$ pip install -U backend.ai-accelerator-cuda-open

Note

Backend.AI’s fractional GPU sharing is available only in the enterprise version; it is not supported in the open-source version.

Local configuration

Backend.AI Agent uses a TOML file (agent.toml) to configure local service. Refer to the agent.toml sample file for a detailed description of each section and item. A configuration example would be:

[etcd]
namespace = "local"
addr = { host = "bai-m1", port = 8120 }
user = ""
password = ""

[agent]
mode = "docker"
# NOTE: You cannot use network alias here. Write the actual IP address.
rpc-listen-addr = { host = "10.20.30.10", port = 6001 }
# id = "i-something-special"
scaling-group = "default"
pid-file = "/home/bai/agent/agent.pid"
event-loop = "uvloop"
# allow-compute-plugins = ["ai.backend.accelerator.cuda_open"]

[container]
port-range = [30000, 31000]
kernel-uid = 1100
kernel-gid = 1100
bind-host = "bai-m1"
advertised-host = "bai-m1"
stats-type = "docker"
sandbox-type = "docker"
jail-args = []
scratch-type = "hostdir"
scratch-root = "./scratches"
scratch-size = "1G"

[watcher]
service-addr = { host = "bai-m1", port = 6009 }
ssl-enabled = false
target-service = "backendai-agent.service"
soft-reset-available = false

[logging]
level = "INFO"
drivers = ["console", "file"]

[logging.console]
colored = true
format = "verbose"

[logging.file]
path = "./logs"
filename = "agent.log"
backup-count = 10
rotation-size = "10M"

[logging.pkg-ns]
"" = "WARNING"
"aiodocker" = "INFO"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"

[resource]
reserved-cpu = 1
reserved-mem = "1G"
reserved-disk = "8G"

[debug]
enabled = false
skip-container-deletion = false
asyncio = false
enhanced-aiomonitor-task-info = true
log-events = false
log-kernel-config = false
log-alloc-map = false
log-stats = false
log-heartbeats = false
log-docker-events = false

[debug.coredump]
enabled = false
path = "./coredumps"
backup-count = 10
size-limit = "64M"

You may need to set [agent].allow-compute-plugins to the full package path (e.g., ai.backend.accelerator.cuda_open) to activate the plugin.

Save the contents to ${HOME}/.config/backend.ai/agent.toml. Backend.AI will automatically recognize the location. Adjust each field to conform to your system.

Run Backend.AI Agent service

You can run the service:

$ cd "${HOME}/agent"
$ python -m ai.backend.agent.server

You should see a log message like "started handling RPC requests at ...".

There is an add-on service, Agent Watcher, that monitors and manages the Agent service. It is not required for running the Agent, but it is recommended in production environments.

$ cd "${HOME}/agent"
$ python -m ai.backend.agent.watcher

Press Ctrl-C to stop both services.

Register systemd service

The service can be registered as a systemd daemon. This is entirely optional, but it is recommended so that the service starts automatically after the host machine reboots.

It is better to set [container].stats-type = "cgroup" in agent.toml for better metric collection, which is only available with root privileges.

First, create a runner script at ${HOME}/bin/run-agent.sh:

#! /bin/bash
set -e

if [ -z "$HOME" ]; then
   export HOME="/home/bai"
fi

# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
   export PYENV_ROOT="$HOME/.pyenv"
   export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

if [ "$#" -eq 0 ]; then
   exec python -m ai.backend.agent.server
else
   exec "$@"
fi

Create a runner script for Watcher at ${HOME}/bin/run-watcher.sh:

#! /bin/bash
set -e

if [ -z "$HOME" ]; then
   export HOME="/home/bai"
fi

# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
   export PYENV_ROOT="$HOME/.pyenv"
   export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

if [ "$#" -eq 0 ]; then
   exec python -m ai.backend.agent.watcher
else
   exec "$@"
fi

Make the scripts executable:

$ chmod +x "${HOME}/bin/run-agent.sh"
$ chmod +x "${HOME}/bin/run-watcher.sh"

Then, create a systemd service file at /etc/systemd/system/backendai-agent.service:

[Unit]
Description= Backend.AI Agent
Requires=backendai-watcher.service
After=network.target remote-fs.target backendai-watcher.service

[Service]
Type=simple
ExecStart=/home/bai/bin/run-agent.sh
PIDFile=/home/bai/agent/agent.pid
WorkingDirectory=/home/bai/agent
TimeoutStopSec=5
KillMode=process
KillSignal=SIGINT
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072

[Install]
WantedBy=multi-user.target

And for Watcher at /etc/systemd/system/backendai-watcher.service:

[Unit]
Description= Backend.AI Agent Watcher
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/home/bai/bin/run-watcher.sh
WorkingDirectory=/home/bai/agent
TimeoutStopSec=3
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Finally, enable and start the service:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-watcher
$ sudo systemctl enable --now backendai-agent

$ # To check the service status
$ sudo systemctl status backendai-agent
$ # To restart the service
$ sudo systemctl restart backendai-agent
$ # To stop the service
$ sudo systemctl stop backendai-agent
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-agent -f

Install Backend.AI Storage Proxy

Refer to Prepare required Python versions and virtual environments to set up Python and a virtual environment for the service.

Install the latest version of Backend.AI Storage Proxy for the current Python version:

$ cd "${HOME}/storage-proxy"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-storage-proxy

If you want to install a specific version:

$ pip install -U backend.ai-storage-proxy==${BACKEND_PKG_VERSION}
Local configuration

Backend.AI Storage Proxy uses a TOML file (storage-proxy.toml) to configure local service. Refer to the storage-proxy.toml sample file for a detailed description of each section and item. A configuration example would be:

[etcd]
namespace = "local"
addr = { host = "bai-m1", port = 8120 }
user = ""
password = ""

[storage-proxy]
node-id = "i-bai-m1"
num-proc = 2
pid-file = "/home/bai/storage-proxy/storage_proxy.pid"
event-loop = "uvloop"
scandir-limit = 1000
max-upload-size = "100g"

# Used to generate JWT tokens for download/upload sessions
secret = "secure-token-for-users-download-upload-sessions"
# The download/upload session tokens are valid for:
session-expire = "1d"

user = 1100
group = 1100

[api.client]
# Client-facing API
service-addr = { host = "0.0.0.0", port = 6021 }
ssl-enabled = false

[api.manager]
# Manager-facing API
service-addr = { host = "0.0.0.0", port = 6022 }
ssl-enabled = false

# Used to authenticate managers
secret = "secure-token-to-authenticate-manager-request"

[debug]
enabled = false
asyncio = false
enhanced-aiomonitor-task-info = true

[logging]
# One of: "NOTSET", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"
# Set the global logging level.
level = "INFO"

# Multi-choice of: "console", "logstash", "file"
# For each choice, there must be a "logging.<driver>" section
# in this config file as exemplified below.
drivers = ["console", "file"]

[logging.pkg-ns]
"" = "WARNING"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"

[logging.console]
# If set true, use ANSI colors if the console is a terminal.
# If set false, always disable the colored output in console logs.
colored = true

# One of: "simple", "verbose"
format = "simple"

[logging.file]
path = "./logs"
filename = "storage-proxy.log"
backup-count = 10
rotation-size = "10M"

[volume]

[volume.local]
backend = "vfs"
path = "/vfroot/local"

# If there are NFS volumes
# [volume.nfs]
# backend = "vfs"
# path = "/vfroot/nfs"

Save the contents to ${HOME}/.config/backend.ai/storage-proxy.toml. Backend.AI will automatically recognize the location. Adjust each field to conform to your system.

Run Backend.AI Storage Proxy service

You can run the service:

$ cd "${HOME}/storage-proxy"
$ python -m ai.backend.storage.server

Press Ctrl-C to stop the service.

Register systemd service

The service can be registered as a systemd daemon. This is entirely optional, but it is recommended so that the service starts automatically after the host machine reboots.

First, create a runner script at ${HOME}/bin/run-storage-proxy.sh:

#! /bin/bash
set -e

if [ -z "$HOME" ]; then
   export HOME="/home/bai"
fi

# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
   export PYENV_ROOT="$HOME/.pyenv"
   export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

if [ "$#" -eq 0 ]; then
   exec python -m ai.backend.storage.server
else
   exec "$@"
fi

Make the scripts executable:

$ chmod +x "${HOME}/bin/run-storage-proxy.sh"

Then, create a systemd service file at /etc/systemd/system/backendai-storage-proxy.service:

[Unit]
Description= Backend.AI Storage Proxy
Requires=network.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/home/bai/bin/run-storage-proxy.sh
PIDFile=/home/bai/storage-proxy/storage_proxy.pid
WorkingDirectory=/home/bai/storage-proxy
User=1100
Group=1100
TimeoutStopSec=5
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072

[Install]
WantedBy=multi-user.target

Finally, enable and start the service:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-storage-proxy

$ # To check the service status
$ sudo systemctl status backendai-storage-proxy
$ # To restart the service
$ sudo systemctl restart backendai-storage-proxy
$ # To stop the service
$ sudo systemctl stop backendai-storage-proxy
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-storage-proxy -f

Install Backend.AI Webserver

Refer to Prepare required Python versions and virtual environments to set up Python and a virtual environment for the service.

Install the latest version of Backend.AI Webserver for the current Python version:

$ cd "${HOME}/webserver"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-webserver

If you want to install a specific version:

$ pip install -U backend.ai-webserver==${BACKEND_PKG_VERSION}
Local configuration

Backend.AI Webserver uses a config file (webserver.conf) to configure local service. Refer to the webserver.conf sample file for a detailed description of each section and item. A configuration example would be:

[service]
ip = "0.0.0.0"
port = 8080
# Not active in open-source edition.
wsproxy.url = "http://10.20.30.10:10200"

# Set or enable it when using reverse proxy for SSL-termination
# force_endpoint_protocol = "https"

mode = "webui"
enable_signup = false
allow_signup_without_confirmation = false
always_enqueue_compute_session = false
allow_project_resource_monitor = false
allow_change_signin_mode = false
mask_user_info = false
enable_container_commit = false
hide_agents = true
directory_based_usage = false

[resources]
open_port_to_public = false
allow_preferred_port = false
max_cpu_cores_per_container = 255
max_memory_per_container = 1000
max_cuda_devices_per_container = 8
max_cuda_shares_per_container = 8
max_shm_per_container = 256
# Maximum per-file upload size (bytes)
max_file_upload_size = 4294967296

[environments]
# allowlist = ""

[ui]
brand = "Backend.AI"
default_environment = "cr.backend.ai/stable/python"
default_import_environment = "cr.backend.ai/filebrowser:21.02-ubuntu20.04"

[api]
domain = "default"
endpoint = "http://bai-m1:8081"
text = "Backend.AI"
ssl-verify = false

[session]
redis.host = "bai-m1"
redis.port = 8110
redis.db = 5
redis.password = "develove"
max_age = 604800  # 1 week
flush_on_startup = false
login_block_time = 1200  # 20 min (in sec)
login_allowed_fail_count = 10
max_count_for_preopen_ports = 10

[license]

[plugin]

[pipeline]

Save the contents to ${HOME}/.config/backend.ai/webserver.conf.

Run Backend.AI Webserver service

You can run the service by specifying the config file path with -f option:

$ cd "${HOME}/webserver"
$ python -m ai.backend.web.server -f ${HOME}/.config/backend.ai/webserver.conf

Press Ctrl-C to stop the service.

Register systemd service

The service can be registered as a systemd daemon. This is entirely optional, but it is recommended so that the service runs automatically after the host machine reboots.

First, create a runner script at ${HOME}/bin/run-webserver.sh:

#! /bin/bash
set -e

if [ -z "$HOME" ]; then
   export HOME="/home/bai"
fi

# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
   export PYENV_ROOT="$HOME/.pyenv"
   export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

if [ "$#" -eq 0 ]; then
   exec python -m ai.backend.web.server -f ${HOME}/.config/backend.ai/webserver.conf
else
   exec "$@"
fi

Make the script executable:

$ chmod +x "${HOME}/bin/run-webserver.sh"

Then, create a systemd service file at /etc/systemd/system/backendai-webserver.service:

[Unit]
Description=Backend.AI Webserver
Requires=network.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/home/bai/bin/run-webserver.sh
PIDFile=/home/bai/webserver/webserver.pid
WorkingDirectory=/home/bai/webserver
User=1100
Group=1100
TimeoutStopSec=5
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072

[Install]
WantedBy=multi-user.target

Finally, enable and start the service:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-webserver

$ # To check the service status
$ sudo systemctl status backendai-webserver
$ # To restart the service
$ sudo systemctl restart backendai-webserver
$ # To stop the service
$ sudo systemctl stop backendai-webserver
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-webserver -f
Check user GUI access via web

You can check the access to the web GUI by opening the URL http://<host-ip-or-domain>:8080 in your web browser. If all goes well, you will see the login page.
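If you prefer a scripted check, a small reachability probe can be run from any machine that can reach the webserver. This is a generic sketch using only the Python standard library; the host and port are whatever you configured above:

```python
# Sketch: check whether the webserver responds on its HTTP port.
# Replace host/port with your own deployment values.
from urllib.request import urlopen

def check_webserver(host: str, port: int = 8080, timeout: float = 5.0) -> bool:
    try:
        with urlopen(f"http://{host}:{port}/", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, HTTP errors, etc.
        return False

print(check_webserver("127.0.0.1", port=1, timeout=2.0))  # False when nothing is listening
```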

_images/webserver-login.png

Enter the email and password you set in the previous step to check login.

_images/webserver-summary-page-after-login.png

You can use almost every feature from the web GUI, but launching compute session apps such as Terminal and/or Jupyter Notebook is not possible from the web in the open-source edition. You can instead use the GUI desktop client to fully use the GUI features.

You can download the GUI desktop client from the web GUI in the Summary page. Please use the “Download Backend.AI Web UI App” at the bottom of the page.

_images/webserver-dashboard-download-desktop-app.png

Or, you can download from the following release page: https://github.com/lablup/backend.ai-webui/releases

Web UI (user GUI) guide can be found at https://webui.docs.backend.ai/.

Install on Clouds

The minimal instance configuration:

  • 1x SSL certificate with a private key for your own domain (for production)

  • 1x manager instance (e.g., t3.xlarge on AWS)

    • For HA setup, you may replicate multiple manager instances running in different availability zones and put a load balancer in front of them.

  • Nx agent instances (e.g., t3.medium / p2.xlarge on AWS – for minimal testing)

    • If you spawn multiple agents, it is recommended to use a placement group to improve locality for each availability zone.

  • 1x PostgreSQL instance (e.g., AWS RDS)

  • 1x Redis instance (e.g., AWS ElastiCache)

  • 1x etcd cluster

    • For HA setup, it should consist of 5 separate instances distributed across availability zones.

  • 1x cloud file system (e.g., AWS EFS, Azure FileShare)

  • All should be in the same virtual private network (e.g., AWS VPC).

Install on Premise

The minimal server node configuration:

  • 1x SSL certificate with a private key for your own domain (for production)

  • 1x manager server

  • Nx agent servers

  • 1x PostgreSQL server

  • 1x Redis server

  • 1x etcd cluster

    • For HA setup, it should consist of 5 separate server nodes.

  • 1x network-accessible storage server (NAS) with NFS/SMB mounts

    • All should be in the same private network (LAN).

  • Depending on the cluster size, several service/database daemons may run on the same physical server.

Install monitoring and logging tools

Backend.AI can use several 3rd-party monitoring and logging services. Using them is completely optional.

Guide variables

⚠️ Prepare the values of the following variables before working with this page and replace their occurrences with the values when you follow the guide.

  • {DDAPIKEY}: The Datadog API key

  • {DDAPPKEY}: The Datadog application key

  • {SENTRYURL}: The private Sentry report URL

Install Datadog agent

Datadog is a 3rd-party service to monitor the server resource usage.

$ DD_API_KEY={DDAPIKEY} bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/dd-agent/master/packaging/datadog-agent/source/install_agent.sh)"

Install Raven (Sentry client)

Raven is the official client package name of Sentry, which reports detailed contextual information such as stack and package versions when an unhandled exception occurs.

$ pip install "raven>=6.1"

Environment specifics: WSL v2

Backend.AI supports running on WSL (Windows Subsystem for Linux) version 2. However, you need to configure some special options so that the WSL distribution can interact with the Docker Desktop service.

Configuration of Docker Desktop for Windows

Turn on WSL Integration on Settings → Resources → WSL INTEGRATION. In most cases, this is already configured when you install Docker Desktop for Windows.

Configuration of WSL

  1. Create or modify /etc/wsl.conf using sudo in the WSL shell.

  2. Write down this and save.

    [automount]
    root = /
    options = "metadata"
    
  3. Run wsl --shutdown in a PowerShell prompt to restart the WSL distribution and ensure that your wsl.conf updates are applied.

  4. Enter the WSL shell again. If the change is applied, your paths should appear like /c/some/path instead of /mnt/c/some/path.

  5. Run sudo mount --make-rshared / in the WSL shell. Otherwise, your container creation from Backend.AI will fail with an error message like aiodocker.exceptions.DockerError: DockerError(500, 'path is mounted on /d but it is not a shared mount.').
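The mount-propagation requirement from step 5 can also be verified programmatically. The following sketch parses the /proc/self/mountinfo format (Linux-only) to confirm that a mount point has "shared" propagation; the parsing is simplified for illustration:

```python
# Sketch: verify that a mount point has "shared" propagation, as required by
# step 5 above. Parses the /proc/self/mountinfo line format (Linux-only).
def is_shared_mount(mountinfo_text: str, mount_point: str = "/") -> bool:
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 7 or "-" not in fields or fields[4] != mount_point:
            continue
        # Optional fields sit between the mount options and the "-" separator.
        optional = fields[6:fields.index("-")]
        return any(f.startswith("shared:") for f in optional)
    return False

# On a real WSL host you would read the live mount table:
# with open("/proc/self/mountinfo") as f:
#     print(is_shared_mount(f.read()))

sample = "25 0 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw"
print(is_shared_mount(sample))  # True
```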

Installation of Backend.AI

Now you may run the installer in the WSL shell.

User Guides

Install User Programs in Session Containers

Sometimes you need new programs or libraries that are not installed in your environment. If so, you can install the new program into your environment.

NOTE: Newly installed programs are not environment-dependent; they are installed in the user directory.

Install packages with linuxbrew

If you are a macOS user, or a researcher or developer who occasionally installs Unix programs, you may be familiar with Homebrew (https://brew.sh). You can install new programs using linuxbrew in Backend.AI.

Creating a user linuxbrew directory

Directories that begin with a dot are automatically mounted when the session starts. Create a .linuxbrew directory that will be automatically mounted so that programs you install with linuxbrew can be used in all sessions.

Create .linuxbrew in the Storage section.

With CLI:

$ backend.ai vfolder create .linuxbrew

Let’s check if they are created correctly.

$ backend.ai vfolder list

Also, you can create a directory with the same name using the GUI console.

Installing linuxbrew

Start a new session for installation. Choose your environment and allocate the necessary resources. Generally, you don’t need to allocate a lot of resources, but if you need to compile or install a GPU-dependent library, you need to adjust the resource allocation to your needs.

In general, 1 CPU / 4GB RAM is enough.

$ sh -c "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install.sh)"
Testing linuxbrew

Enter the brew command to verify that linuxbrew is installed. In general, to use linuxbrew you need to add the path where linuxbrew is installed to the PATH variable.

Enter the following commands to temporarily add the path (typically ${HOME}/.linuxbrew/bin when installed into the mounted .linuxbrew directory) and verify that it is installed correctly.

$ export PATH="${HOME}/.linuxbrew/bin:${PATH}"
$ brew
Setting linuxbrew environment variables automatically

To correctly reference the binaries and libraries installed by linuxbrew, add the configuration to .bashrc. You can add settings from the settings tab.

Example: Installing and testing htop

To test the program installation, let’s install a program called htop. htop is a program that extends the top command, allowing you to monitor the running computing environment in a variety of ways.

Let’s install it with the following command:

$ brew install htop

If there are any libraries needed for the htop program, they will be installed automatically.

Now let’s run:

$ htop

From the run screen, you can press q to return to the terminal.

Deleting the linuxbrew environment

To reset all programs installed with linuxbrew, just delete everything in the .linuxbrew directory.

Note: If you want to remove a program by selecting it, use the brew uninstall [PROGRAM_NAME] command.

$ rm -rf ~/.linuxbrew/*

Install packages with miniconda

Some environments support miniconda. In this case, you can use miniconda (https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to install the packages you want.

Creating a user miniconda-required directory

Directories that begin with a dot are automatically mounted when the session starts. Create .conda and .continuum directories that will be automatically mounted so that packages you install with miniconda can be used in all sessions.

Create .conda and .continuum in the Storage section.

With CLI:

$ backend.ai vfolder create .conda
$ backend.ai vfolder create .continuum

Let’s check if they are created correctly.

$ backend.ai vfolder list

Also, you can create a directory with the same name using the GUI console.

miniconda test

Make sure you have miniconda installed in your environment. Package installation using miniconda is only available if miniconda is preinstalled in your environment.

$ conda
Example: Installing and testing htop

To test the program installation, let’s install a program called htop. htop is a program that extends the top command, allowing you to monitor the running computing environment in a variety of ways.

Let’s install it with the following command:

$ conda install -c conda-forge htop

If there are any libraries needed for the htop program, they will be installed automatically.

Now let’s run:

$ htop

From the run screen, you can press q to return to the terminal.

Developer Manual

Development Setup

Currently, Backend.AI is developed and tested only under *NIX-compatible platforms (Linux or macOS).

The development setup uses a mono-repository for the backend stack and a side-by-side repository checkout of the frontend stack. In contrast, the production setup uses per-service independent virtual environments and relies on a separately provisioned app proxy pool.

There are three ways to run both the backend and frontend stacks for development, as demonstrated in Figure 4, Figure 5, and Figure 6. The installation guide in this page using scripts/install-dev.sh covers all three cases, because the only difference is how you launch the Web UI from the mono-repo.

_images/dev-setup.svg

A standard development setup of Backend.AI open source components

_images/dev-setup-app.svg

A development setup of Backend.AI open source components for Electron-based desktop app

_images/dev-setup-staticwebui.svg

A development setup of Backend.AI open source components with pre-built web UI from the backend.ai-app repository

Installation from Source

To ease the on-boarding developer experience, we provide an automated script that installs all server-side components in editable states with just one command.

Prerequisites

Install the following according to your host operating system.

Warning

To avoid conflicts with your system Python (such as the versions shipped with macOS/XCode), our default pants.toml is configured to search only pyenv-provided Python versions.

Note

In some cases, locale conflicts between the terminal client and the remote host may cause encoding errors when installing Backend.AI components due to Unicode characters in README files. Please keep correct locale configurations to prevent such errors.

Running the install-dev script
$ git clone https://github.com/lablup/backend.ai bai-dev
$ cd bai-dev
$ ./scripts/install-dev.sh

Note

The script requires sudo to check and install several system packages such as build-essential.

This script will bootstrap Pants and create the halfstack containers using docker compose, with fixture population. At the end of execution, the script will show several command examples for launching the service daemons such as manager and agent. You may execute this script multiple times when you encounter prerequisite errors and resolve them. Also check out additional options using the -h / --help option, such as installing the CUDA mockup plugin together, etc.

Changed in version 22.09: We have migrated from per-package repositories to a semi-mono repository that contains all Python-based components except plugins. This has completely changed the installation instructions, with the introduction of Pants.

Note

To install multiple instances/versions of development environments using this script, just clone the repository in another location and run scripts/install-dev.sh inside that directory.

It is important to name these working-copy directories differently so that docker compose can distinguish the containers for each setup.

Unless you customize all port numbers with the options of scripts/install-dev.sh, you should run docker compose -f docker-compose.halfstack.current.yml down and docker compose -f docker-compose.halfstack.current.yml up -d when switching between multiple working copies.

Note

By default, the script pulls the docker images for our standard Python kernel and TensorFlow CPU-only kernel. To try out other images, you have to pull them manually afterwards.

Note

Currently there are many limitations on running deep learning images on ARM64 platforms, because users need to rebuild the whole computation library stack, although more supported images will come in the future.

Note

To install the webui in an editable state, try the --editable-webui flag when running scripts/install-dev.sh.

Using the agent’s cgroup-based statistics without the root privilege (Linux-only)

To allow Backend.AI to collect sysfs/cgroup resource usage statistics, the Python executable must have the following Linux capabilities: CAP_SYS_ADMIN, CAP_SYS_PTRACE, and CAP_DAC_OVERRIDE.

$ sudo setcap \
>   cap_sys_ptrace,cap_sys_admin,cap_dac_override+eip \
>   $(readlink -f $(pyenv which python))
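To confirm that the capabilities took effect, you can decode the CapEff bitmask reported in /proc/PID/status. This is a standalone sketch using the capability bit numbers from linux/capability.h; it is not part of Backend.AI itself:

```python
# Sketch: check whether a CapEff bitmask (as found in /proc/PID/status)
# includes the capabilities required above.
# Bit numbers come from linux/capability.h.
CAP_DAC_OVERRIDE = 1
CAP_SYS_PTRACE = 19
CAP_SYS_ADMIN = 21

def has_required_caps(cap_eff_hex: str) -> bool:
    caps = int(cap_eff_hex, 16)
    required = (
        (1 << CAP_DAC_OVERRIDE) | (1 << CAP_SYS_PTRACE) | (1 << CAP_SYS_ADMIN)
    )
    return caps & required == required

# To inspect a running Python process on Linux:
# with open("/proc/self/status") as f:
#     for line in f:
#         if line.startswith("CapEff:"):
#             print(has_required_caps(line.split()[1]))

print(has_required_caps("0000000000280002"))  # True
```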
Verifying Installation

Refer to the instructions displayed after running scripts/install-dev.sh. We recommend using tmux to open multiple terminals in a single SSH session. Your terminal app may provide a tab interface, but when using remote servers, tmux is more convenient because you don't have to set up a new SSH connection whenever adding a new terminal.

Ensure the halfstack containers are running:

$ docker compose -f docker-compose.halfstack.current.yml up -d

Open a terminal for manager and run:

$ ./backend.ai mgr start-server --debug

Open another terminal for agent and run:

$ ./backend.ai ag start-server --debug

Open yet another terminal for client and run:

$ source ./env-local-admin-api.sh  # Use the generated local endpoint and credential config.
$ # source ./env-local-user-api.sh  # You may choose an alternative credential config.
$ ./backend.ai config
$ ./backend.ai run python --rm -c 'print("hello world")'
∙ Session token prefix: fb05c73953
✔ [0] Session fb05c73953 is ready.
hello world
✔ [0] Execution finished. (exit code = 0)
✔ [0] Cleaned up the session.
$ ./backend.ai ps
Resetting the environment

Shutdown all docker containers using docker compose -f docker-compose.halfstack.current.yml down and delete the entire working copy directory. That’s all.

You may need sudo to remove the directories mounted as halfstack container volumes because Docker auto-creates them with the root privilege.

Daily Workflows

Check out Daily Development Workflows for your reference.

Daily Development Workflows

About Pants

Since 22.09, we have migrated to Pants as our primary build system and dependency manager for the mono-repository of Python components.

Pants is a graph-based async-parallel task executor written in Rust and Python. It is tailored to building programs with explicit and auto-inferred dependency checks and aggressive caching.

Key concepts
  • The command pattern:

    $ pants [GLOBAL_OPTS] GOAL [GOAL_OPTS] [TARGET ...]
    
  • Goal: an action to execute

    • You may think this as the root node of the task graph executed by Pants.

  • Target: objectives for the action, usually expressed as path/to/dir:name

    • The targets are declared/defined by path/to/dir/BUILD files.

  • The global configuration is at pants.toml.

  • Recommended reading: https://www.pantsbuild.org/docs/concepts

Inspecting build configurations
  • Display all targets

    $ pants list ::
    
    • This list includes the full enumeration of individual targets auto-generated by collective targets (e.g., python_sources() generates multiple python_source() targets by globbing the sources pattern)

  • Display all dependencies of a specific target (i.e., all targets required to build this target)

    $ pants dependencies --transitive src/ai/backend/common:src
    
  • Display all dependees of a specific target (i.e., all targets affected when this target is changed)

    $ pants dependees --transitive src/ai/backend/common:src
    

Note

Pants statically analyzes the source files to enumerate all its imports and determine the dependencies automatically. In most cases this works well, but sometimes you may need to manually declare explicit dependencies in BUILD files.

Running lint and check

Run lint/check for all targets:

$ pants lint ::
$ pants check ::

To run lint/check for a specific target or a set of targets:

$ pants lint src/ai/backend/common:: tests/common::
$ pants check src/ai/backend/manager::

Currently, running mypy with Pants is slow because mypy cannot utilize its own cache, as Pants invokes mypy per file due to its own dependency management scheme. (e.g., checking all sources takes more than a minute!) This performance issue is being tracked in pantsbuild/pants#10864. For now, try using a smaller target set of the files you work on, or use the --changed-since option to select only the changed targets.

Running formatters

If you encounter failure from ruff, you may run the following to automatically fix the import ordering issues.

$ pants fix ::

If you encounter failure from black, you may run the following to automatically fix the code style issues.

$ pants fmt ::

Running unit tests

Here are various methods to run tests:

$ pants test ::
$ pants test tests/manager/test_scheduler.py::
$ pants test tests/manager/test_scheduler.py:: -- -k test_scheduler_configs
$ pants test tests/common::            # Run common/**/test_*.py
$ pants test tests/common:tests        # Run common/test_*.py
$ pants test tests/common/redis::      # Run common/redis/**/test_*.py
$ pants test tests/common/redis:tests  # Run common/redis/test_*.py

You may also try --changed-since option like lint and check.

To specify extra environment variables for tests, use the --test-extra-env-vars option:

$ pants test \
>   --test-extra-env-vars=MYVARIABLE=MYVALUE \
>   tests/common:tests

Running integration tests

$ ./backend.ai test run-cli user,admin

Building wheel packages

To build a specific package:

$ pants \
>   --tag="wheel" \
>   package \
>   src/ai/backend/common:dist
$ ls -l dist/*.whl

If the package content varies by the target platform, use:

$ pants \
>   --tag="wheel" \
>   --tag="+platform-specific" \
>   --platform-specific-resources-target=linux_arm64 \
>   package \
>   src/ai/backend/runner:dist
$ ls -l dist/*.whl

Using IDEs and editors

Pants has an export goal to auto-generate a virtualenv that contains all external dependencies installed in a single place. This is very useful when you use IDEs and editors.

To (re-)generate the virtualenv(s), run:

$ pants export --resolve=RESOLVE_NAME  # you may add multiple --resolve options

You may display the available resolve names by (the command works with Python 3.11 or later):

$ python -c 'import tomllib,pathlib;print("\n".join(tomllib.loads(pathlib.Path("pants.toml").read_text())["python"]["resolves"].keys()))'

Similarly, you can export all virtualenvs at once:

$ python -c 'import tomllib,pathlib;print("\n".join(tomllib.loads(pathlib.Path("pants.toml").read_text())["python"]["resolves"].keys()))' | sed 's/^/--resolve=/' | xargs ./pants export

Then configure your IDEs/editors to use dist/export/python/virtualenvs/python-default/PYTHON_VERSION/bin/python as the interpreter for your code, where PYTHON_VERSION is the interpreter version specified in pants.toml.

As of Pants 2.16, you must export the virtualenvs by individual lockfiles using the --resolve option, as all tools are unified to use the same custom resolve subsystem of Pants and the :: target no longer works properly. For example:

$ pants export --resolve=python-default --resolve=mypy

To make LSP (language server protocol) services like Pylance detect our source packages correctly, you should also configure PYTHONPATH to include the repository root's src directory, as well as plugins/*/ directories if you have added Backend.AI plugin checkouts.

For linters and formatters, configure the tool executable paths to indicate dist/export/python/virtualenvs/RESOLVE_NAME/PYTHON_VERSION/bin/EXECUTABLE. For example, ruff’s executable path is dist/export/python/virtualenvs/ruff/3.11.8/bin/ruff.

Currently we have the following Python tools to configure in this way:

  • ruff: Provides fast linting (combining pylint, flake8, and isort), fixing (auto-fixes for some lint rules and import sorting), and formatting (black)

  • mypy: Validates the type annotations and performs a static analysis

    For a long list of arguments or list/tuple items, you can explicitly add a trailing comma to force Ruff/Black to insert line breaks after every item, even when the line length does not exceed the limit (100 characters).

    You may disable auto-formatting on a specific region of code using # fmt: off and # fmt: on comments, though this is strongly discouraged except when manual formatting gives better readability, such as numpy matrix declarations.

  • pytest: The unit test runner framework.

  • coverage-py: Generates reports about which source lines were visited during execution of a pytest session.

  • towncrier: Generates the changelog from news fragments in the changes directory when making a new release.

VSCode

Install the following extensions:

  • Python (ms-python.python)

  • Pylance (ms-python.vscode-pylance) (optional but recommended)

  • Mypy (ms-python.mypy-type-checker)

  • Ruff (charliermarsh.ruff)

  • For other standard Python extensions like Flake8, isort, and Black, disable them for the Backend.AI workspace only to prevent interference with Ruff’s own linting, fixing and formatting.

Set the workspace settings for the Python extension for code navigation and auto-completion:

  • python.analysis.autoSearchPaths: true

  • python.analysis.extraPaths: ["dist/export/python/virtualenvs/python-default/3.11.8/lib/python3.11/site-packages"]

  • python.analysis.importFormat: "relative"

  • editor.formatOnSave: true

  • editor.codeActionsOnSave: {"source.fixAll": true}

Set the following keys in the workspace settings to configure Python tools:

  • mypy-type-checker.interpreter: ["dist/export/python/virtualenvs/mypy/3.11.8/bin/python"]

  • mypy-type-checker.importStrategy: "fromEnvironment"

  • ruff.interpreter: ["dist/export/python/virtualenvs/ruff/3.11.8/bin/python"]

  • ruff.importStrategy: "fromEnvironment"

Note

Changed in July 2023

After applying the VSCode Python Tool migration, we no longer recommend configuring the python.linting.*Path and python.formatting.*Path keys.

Vim/NeoVim

There is a large variety of plugins, and heavy Vim users will usually know what to do.

We recommend using ALE or CoC plugins to have automatic lint highlights, auto-formatting on save, and auto-completion support with code navigation via LSP backends.

Warning

Note that it is recommended to enable only one linter/formatter at a time (either ALE or CoC) with proper configurations, to avoid duplicate suggestions and error reports.

When using ALE, it is recommended to have a directory-local vimrc as follows. First, add set exrc to your user-level vimrc. Then put the following in .vimrc (or .nvimrc for NeoVim) in the build root directory:

let s:cwd = getcwd()
let g:ale_python_mypy_executable = s:cwd . '/dist/export/python/virtualenvs/mypy/3.11.8/bin/mypy'
let g:ale_python_ruff_executable = s:cwd . '/dist/export/python/virtualenvs/ruff/3.11.8/bin/ruff'
let g:ale_linters = { "python": ['ruff', 'mypy'] }
let g:ale_fixers = {'python': ['ruff']}
let g:ale_fix_on_save = 1

When using CoC, run :CocInstall coc-pyright @yaegassy/coc-ruff and :CocLocalConfig after opening a file in the local working copy to initialize Pyright functionalities. In the local configuration file (.vim/coc-settings.json), you may put the linter/formatter configurations just like VSCode (see the official reference).

{
  "coc.preferences.formatOnType": false,
  "coc.preferences.willSaveHandlerTimeout": 5000,
  "ruff.enabled": true,
  "ruff.autoFixOnSave": true,
  "ruff.useDetectRuffCommand": false,
  "ruff.builtin.pythonPath": "dist/export/python/virtualenvs/ruff/3.11.8/bin/python",
  "ruff.serverPath": "dist/export/python/virtualenvs/ruff/3.11.8/bin/ruff-lsp",
  "python.pythonPath": "dist/export/python/virtualenvs/python-default/3.11.8/bin/python",
  "python.linting.mypyEnabled": true,
  "python.linting.mypyPath": "dist/export/python/virtualenvs/mypy/3.11.8/bin/mypy",
}

To activate Ruff (a Python linter and fixer), run :CocCommand ruff.builtin.installServer after opening any Python source file to install the ruff-lsp server.

Switching between branches

When each branch has different external package requirements, you should run pants export before running code after switching between such branches with git switch.

Sometimes, you may experience a bogus "glob" warning from Pants because it sees a stale cache. In that case, run pgrep pantsd | xargs kill and it will be fine.

Running entrypoints

To run a Python program within the unified virtualenv, use the ./py helper script. It automatically passes additional arguments transparently to the Python executable of the unified virtualenv.

./backend.ai is an alias of ./py -m ai.backend.cli.

Examples:

$ ./py -m ai.backend.storage.server
$ ./backend.ai mgr start-server
$ ./backend.ai ps

Working with plugins

To develop Backend.AI plugins together, the repository offers a special location ./plugins where you can clone plugin repositories and a shortcut script scripts/install-plugin.sh that does this for you.

$ scripts/install-plugin.sh lablup/backend.ai-accelerator-cuda-mock

This is equivalent to:

$ git clone \
>   https://github.com/lablup/backend.ai-accelerator-cuda-mock \
>   plugins/backend.ai-accelerator-cuda-mock

These plugins are auto-detected by scanning setup.cfg of plugin subdirectories by the ai.backend.plugin.entrypoint module, even without explicit editable installations.

Writing test cases

Mostly it is just the same as before: use the standard pytest practices. However, there are a few key differences:

  • Tests are executed in parallel in the unit of test modules.

  • Therefore, session-level fixtures may be executed multiple times during a single run of pants test.

Warning

If you interrupt (Ctrl+C, SIGINT) a run of pants test, it will immediately kill all pytest processes without fixture cleanup. This may accumulate unused Docker containers in your system, so it is a good practice to run docker ps -a periodically and clean up dangling containers.

To interactively run tests, see Debugging test cases (or interactively running test cases).

Here are considerations for writing Pants-friendly tests:

  • Ensure that it runs in an isolated/mocked environment and minimize external dependency.

  • If required, use the environment variable BACKEND_TEST_EXEC_SLOT (an integer value) to uniquely define TCP port numbers and other resource identifiers to allow parallel execution. Refer to the Pants docs.

  • Use ai.backend.testutils.bootstrap to populate a single-node Redis/etcd/Postgres container as fixtures of your test cases. Import the fixture and use it like a plain pytest fixture.

    • These fixtures create those containers with OS-assigned public port numbers and give you a tuple of the container ID and an ai.backend.common.types.HostPortPair for use in test code. In manager and agent tests, you can just refer to local_config to get a pre-populated local configuration with those port numbers.

    • In this case, you may encounter flake8 complaining about unused imports and redefinition. Use # noqa: F401 and # noqa: F811 respectively for now.
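As an illustration of deriving per-slot identifiers from BACKEND_TEST_EXEC_SLOT, consider the sketch below. The base port and naming scheme are hypothetical examples, not values mandated by Backend.AI:

```python
# Sketch: derive unique per-run resource identifiers from BACKEND_TEST_EXEC_SLOT
# so that parallel pytest processes do not collide on ports or names.
# The base port (15000) and the name suffix scheme are hypothetical examples.
import os

def exec_slot() -> int:
    # Pants sets this variable per parallel execution slot; default to 0.
    return int(os.environ.get("BACKEND_TEST_EXEC_SLOT", "0"))

def unique_port(base: int) -> int:
    # Each parallel slot gets its own port offset.
    return base + exec_slot()

def unique_name(prefix: str) -> str:
    # e.g., for container or database names in fixtures
    return f"{prefix}-{exec_slot()}"

os.environ.setdefault("BACKEND_TEST_EXEC_SLOT", "3")
print(unique_port(15000), unique_name("testdb"))
```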

Warning

About using /tmp in tests

If your Docker service is installed using Snap (e.g., Ubuntu 20.04 or later), it cannot access the system /tmp directory because Snap applies a private “virtualized” tmp directory to the Docker service.

You should use other locations under the user’s home directory (or preferably .tmp in the working copy directory) to avoid mount failures for the developers/users in such platforms.

It is okay to use the system /tmp directory if they are not mounted inside any containers.

Writing documentation

  • Create a new pyenv virtualenv based on Python 3.10.

    $ pyenv virtualenv 3.10.9 venv-bai-docs
    
  • Activate the virtualenv and run:

    $ pyenv activate venv-bai-docs
    $ pip install -U pip setuptools wheel
    $ pip install -U -r docs/requirements.txt
    
  • You can build the docs as follows:

    $ cd docs
    $ pyenv activate venv-bai-docs
    $ make html
    
  • To locally serve the docs:

    $ cd docs
    $ python -m http.server --directory=_build/html
    

(TODO: Use Pants’ own Sphinx support when pantsbuild/pants#15512 is released.)

Advanced Topics

Adding new external dependencies
  • Add the package version requirements to the unified requirements file (./requirements.txt).

  • Update the module_mapping field in the root build configuration (./BUILD) if the package name and its import name differ.

  • Update the type_stubs_module_mapping field in the root build configuration if the package provides a type stubs package separately.

  • Run:

    $ pants generate-lockfiles
    $ pants export
    
Merging lockfile conflicts

When you work on a branch that adds a new external dependency and the main branch also has another external dependency addition, merging the main branch into your branch is likely to cause a merge conflict on the python.lock file.

In this case, you can just do the following, since we can simply regenerate the lockfile after merging the requirements.txt and BUILD files.

$ git merge main
... it says a conflict on python.lock ...
$ git checkout --theirs python.lock
$ pants generate-lockfiles --resolve=python-default
$ git add python.lock
$ git commit
Resetting Pants

If Pants behaves strangely, you could simply reset all its runtime-generated files by:

$ pgrep pantsd | xargs -r kill
$ rm -r /tmp/*-pants/ .pants.d .pids ~/.cache/pants

After this, re-running any Pants command will automatically reinitialize itself and all cached data as necessary.

Note that you can find the concrete path inside /tmp from the local_execution_root_dir option in .pants.rc, which is set by install-dev.sh.

Warning

If you have run pants or the installation script with sudo, some of the above directories may be owned by root, and running pants with user privileges will not work. In such cases, remove the directories with sudo and retry.

Resolving the error message ‘Pants is not available for your platform’ when installing Backend.AI with Pants

When installing Backend.AI, you may find the following error message saying ‘Pants is not available for your platform’ if you have installed Pants 2.17 or older with prior versions of Backend.AI.

[INFO] Bootstrapping the Pants build system...
Pants system command is already installed.
Failed to fetch https://binaries.pantsbuild.org/tags/pantsbuild.pants/release_2.19.0: [22] HTTP response code said error (The requested URL returned error: 404)
Bootstrapping Pants 2.19.0 using cpython 3.9.15
Installing pantsbuild.pants==2.19.0 into a virtual environment at /home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/venvs/2.19.0
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.4/65.4 KB 3.3 MB/s eta 0:00:00
ERROR: Could not find a version that satisfies the requirement pantsbuild.pants==2.19.0 (from versions: 0.0.17, 0.0.18, 0.0.20, 0.0.21, 0.0.22, ... (a long list of versions) ..., 2.17.0,
2.17.1rc0, 2.17.1rc1, 2.17.1rc2, 2.17.1rc3, 2.17.1, 2.18.0.dev0, 2.18.0.dev1, 2.18.0.dev3, 2.18.0.dev4, 2.18.0.dev5, 2.18.0.dev6, 2.18.0.dev7, 2.18.0a0)
ERROR: No matching distribution found for pantsbuild.pants==2.19.0
Install failed: Command '['/home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/venvs/2.19.0/bin/python', '-sE', '-m', 'pip', '--disable-pip-versi
on-check', '--no-python-version-warning', '--log', PosixPath('/home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/venvs/2.19.0/pants-install.log'
), 'install', '--quiet', '--find-links', 'file:///home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/find_links/2.19.0/e430175b/index.html', '--p
rogress-bar', 'off', 'pantsbuild.pants==2.19.0']' returned non-zero exit status 1.
More information can be found in the log at: /home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/logs/install.log

Error: Isolates your Pants from the elements.

Please select from the following boot commands:

<default>: Detects the current Pants installation and launches it.
bootstrap-tools: Introspection tools for the Pants bootstrap process.
pants: Runs a hermetic Pants installation.
pants-debug: Runs a hermetic Pants installation with a debug server for debugging Pants code.
update: Update scie-pants.

You can select a boot command by passing it as the 1st argument or else by setting the SCIE_BOOT environment variable.

ERROR: Failed to establish atomic directory /home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/locks/install-a4f15e2d2c97473883ec33b4ee0f9d11f99dcf5bee63
8b1cc7a0270d55d0ec8d. Population of work directory failed: Boot binding command failed: exit status: 1

[ERROR] Cannot proceed the installation because Pants is not available for your platform!

To resolve this error, reinstall or upgrade Pants. As of the Pants 2.18.0 release, the binary builds are no longer distributed via the Python Package Index but via GitHub releases.

Resolving missing directories error when running Pants
ValueError: Failed to create temporary directory for immutable inputs: No such file or directory (os error 2) at path "/tmp/bai-dev-PN4fpRLB2u2xL.j6-pants/immutable_inputsvIpaoN"

If you encounter errors like the above when running daily Pants commands such as lint, you may manually create the directory one level up. For the above example, run:

mkdir -p /tmp/bai-dev-PN4fpRLB2u2xL.j6-pants/

If this workaround does not work, back up your current working files and reinstall by running scripts/delete-dev.sh and then scripts/install-dev.sh.

Changing or updating the Python runtime for Pants

When you run scripts/install-dev.sh, it automatically creates .pants.bootstrap to explicitly set a specific pyenv Python version to run Pants.

If you have removed/upgraded this specific Python version from pyenv, you also need to update .pants.bootstrap accordingly.

Debugging test cases (or interactively running test cases)

When your tests hang, you can try adding the --debug flag to the pants test command:

$ pants test --debug ...

so that Pants runs the designated test targets serially and interactively. This means that you can directly observe the console output and press Ctrl+C to gracefully shut down the tests with fixture cleanup. You can also apply additional pytest options such as --fulltrace and -s by passing them after the target arguments and a -- separator when executing the pants test command.

Installing a subset of mono-repo packages in editable mode for other projects

Sometimes you need to editable-install a subset of packages into another project’s environment. For instance, you could mount the client SDK and its internal dependencies into a Docker container for development.

In this case, we recommend the following approach:

  1. Run the following command to build a wheel from the current mono-repo source:

    $ pants --tag=wheel package src/ai/backend/client:dist
    

    This will generate dist/backend.ai_client-{VERSION}-py3-none-any.whl.

  2. Run pip install -U {MONOREPO_PATH}/dist/{WHEEL_FILE} in the target environment.

    This will populate the package metadata and install its external dependencies. The target environment may be a separate virtualenv or a container being built. For container builds, you need to first COPY the wheel file and then install it.

  3. Check the internal dependency directories to link by running the following command:

    $ pants dependencies --transitive src/ai/backend/client:src \
    >   | grep src/ai/backend | grep -v ':version' | cut -d/ -f4 | uniq
    cli
    client
    plugin
    
  4. Link these directories in the target environment.

    For example, if it is a Docker container, you could add -v {MONOREPO_PATH}/src/ai/backend/{COMPONENT}:/usr/local/lib/python3.10/site-packages/ai/backend/{COMPONENT} to the docker create or docker run commands for all the component directories found in the previous step.

    If it is a local checkout with a pyenv-based virtualenv, you could replace $(pyenv prefix)/lib/python3.10/site-packages/ai/backend/{COMPONENT} directories with symbolic links to the mono-repo’s component source directories.

Boosting the performance of Pants commands

Since Pants uses temporary directories for aggressive caching, you could make the .tmp directory under the working copy root a tmpfs partition:

$ sudo mount -t tmpfs -o size=4G tmpfs .tmp
  • To make this persistent across reboots, add the following line to /etc/fstab:

    tmpfs /path/to/dir/.tmp tmpfs defaults,size=4G 0 0
    
  • The size should be more than 3GB. (Running pants test :: consumes about 2GB.)

  • To change the size at runtime, you could simply remount it with a new size option:

    $ sudo mount -t tmpfs -o remount,size=8G tmpfs .tmp
    
Making a new release
  • Update the ./VERSION file to set a new version number. (Remove the trailing newline, e.g., using set noeol in Vim. This is also configured in ./editorconfig.)

  • Run LOCKSET=tools/towncrier ./py -m towncrier to auto-generate the changelog.

    • You may append --draft to see a preview of the changelog update without actually modifying the filesystem.

    • (WIP: lablup/backend.ai#427).

  • Make a new git commit with the commit message: “release: <version>”.

  • Make an annotated tag to the commit with the message: “Release v<version>” or “Pre-release v<version>” depending on the release version.

  • Push the commit and tag. The GitHub Actions workflow will build the packages and publish them to PyPI.

Backporting to legacy per-pkg repositories
  • Use git diff and git apply instead of git cherry-pick.

    • To perform a three-way merge for conflicts, add -3 option to the git apply command.

    • You may need to rewrite some code as the package structure differs. (The new mono-repository has more fine-grained first-party packages divided from the backend.ai-common package.)

  • When referring to PR/issue numbers in commits for per-pkg repositories, update them to the form lablup/backend.ai#NNN instead of #NNN.

How to Add a New Kernel Image

Overview

Backend.AI supports running Docker containers to execute user-requested computations in a resource-constrained and isolated environment. Most Docker container images can be imported as Backend.AI kernels with appropriate metadata annotations.

  1. Prepare a Docker image based on Ubuntu 16.04/18.04, CentOS 7.6, or Alpine 3.8.

  2. Write a Dockerfile that does the following:

  • Install the OpenSSL library in the image for the kernel runner (if not installed).

  • Add metadata labels.

  • Add service definition files.

  • Add a jail policy file.

  3. Build a derivative image using the Dockerfile.

  4. Upload the image to a Docker registry to use with Backend.AI.

Kernel Runner

Every Backend.AI kernel should run a small daemon called “kernel runner”. It communicates with the Backend.AI Agent running in the host via ZeroMQ, and manages user code execution and in-container service processes.

The kernel runner provides runtime-specific implementations for various code execution modes such as the query mode and the batch mode, compatible with a number of well-known programming languages. It also manages the process lifecycles of service-port processes.

To decouple the development and update cycles of Docker images and the Backend.AI Agent, we don’t install the kernel runner inside images. Instead, the Backend.AI Agent mounts a special “krunner” volume as /opt/backend.ai inside containers. This volume includes a customized static build of Python, and the kernel runner daemon package is mounted as one of the site packages of this Python distribution. The agent also uses /opt/kernel as the directory for mounting other self-contained single-binary utilities. This way, image authors do not have to bother with installing Python and Backend.AI-specific software. All the tedious jobs, such as volume deployment, content updates, and mounting for new containers, are automatically managed by the Backend.AI Agent.

Since the customized Python build and binary utilities need to be built for specific Linux distributions, we only support Docker images built on top of Alpine 3.8+, CentOS 7+, and Ubuntu 16.04+ base images. Note that these three base distributions practically cover all commonly available Docker images.

Image Prerequisites

For glibc-based kernel images (the majority), you don’t have to add anything to the existing container image, as we use a statically built Python distribution with precompiled wheels to run the kernel runner. The only requirement is that the image should be compatible with manylinux2014 or later.

For musl-based Linux kernel images (e.g., Alpine), you have to install libffi and sqlite-libs at the minimum. Please also refer to the Dockerfile to build a minimal compatible image.

Metadata Labels

Any Docker image based on Alpine 3.17+, CentOS 7+, or Ubuntu 16.04+ that satisfies the above prerequisites may become a Backend.AI kernel image if you add the following image labels:

  • Required Labels

    • ai.backend.kernelspec: 1 (this will be used for future versioning of the metadata specification)

    • ai.backend.features: A list of constant strings indicating which Backend.AI kernel features are available for the kernel.

      • batch: Can execute user programs passed as files.

      • query: Can execute user programs passed as code snippets while keeping the context across multiple executions.

      • uid-match: As of 19.03, this must always be specified.

      • user-input: The query/batch mode supports interactive user inputs.

    • ai.backend.resource.min.*: The minimum amount of resources to launch this kernel. At minimum, you must define the CPU core count (cpu) and the main memory size (mem). In memory size values, you may use binary scale-suffixes such as m for MiB, g for GiB, etc.

    • ai.backend.base-distro: Either “ubuntu16.04” or “alpine3.8”. Note that Ubuntu 18.04-based kernels also need to use “ubuntu16.04” here.

    • ai.backend.runtime-type: The type of kernel runner to use. (One of the directories in the ai.backend.kernels namespace.)

      • python: This runtime is for Python-based kernels, making the given Python executable accessible via the query and batch modes, as well as a Jupyter kernel service.

      • app: This runtime does not support code execution in the query/batch modes but just manages the service port processes. For custom kernel images with their own service ports for their main applications, this is the most frequently used runtime type for derivative images.

      • For the full list of available runtime types, check out the lang_map variable in the ai.backend.kernels module code.

    • ai.backend.runtime-path: The path to the language runtime executable.

  • Optional Labels

    • ai.backend.role: COMPUTE (default if unspecified) or INFERENCE

    • ai.backend.service-ports: A list of port mapping declaration strings for services supported by the image. (See the next section for details.) Backend.AI manages the host-side port mapping and network tunneling via the API gateway automagically.

    • ai.backend.endpoint-ports: A comma-separated list of service port name(s) to be bound to the service endpoint. (At least one is required in inference sessions.)

    • ai.backend.model-path: The path to mount the target model’s target version storage folder. (Required in inference sessions)

    • ai.backend.envs.corecount: A comma-separated string list of environment variable names. They are set to the number of CPU cores available to the kernel container, which allows the CPU core restriction to be enforced on legacy parallel computation libraries. (e.g., JULIA_CPU_CORES, OPENBLAS_NUM_THREADS)
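As an illustration of the binary scale-suffixes accepted in memory size values such as ai.backend.resource.min.mem, a small parser might look like this (a sketch only; the actual parsing logic lives inside the Backend.AI codebase):

```python
# Binary scale-suffixes: m = MiB, g = GiB, etc. (illustrative sketch)
BINARY_SUFFIXES = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}

def parse_mem_size(value: str) -> int:
    # "256m" -> 256 MiB in bytes; a bare number is taken as bytes
    value = value.strip().lower()
    if value[-1] in BINARY_SUFFIXES:
        return int(float(value[:-1]) * BINARY_SUFFIXES[value[-1]])
    return int(value)

print(parse_mem_size("256m"))  # 268435456
```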

Service Ports

As of Backend.AI v19.03, service ports are our preferred way to run computation workloads inside Backend.AI kernels. It provides tunneled access to Jupyter Notebooks and other daemons running in containers.

As of Backend.AI v19.09, Backend.AI provides SSH (including SFTP and SCP) and ttyd (web-based xterm shell) as intrinsic services for all kernels. “Intrinsic” means that image authors do not have to do anything to support/enable the services.

As of Backend.AI v20.03, image authors may define their own service ports using service definition JSON files installed at /etc/backend.ai/service-defs in their images.

Port Mapping Declaration

A custom service port requires two things: a port mapping declaration in the ai.backend.service-ports image label, and a service definition file that specifies how to start the service process.

A port mapping declaration is composed of three values: the service name, the protocol, and the container-side port number. The label may contain multiple port mapping declarations separated by commas, like the following example:

jupyter:http:8080,tensorboard:http:6006

The name may be an arbitrary non-empty ASCII alphanumeric string. We use kebab-case for it. The protocol may be one of tcp, http, and pty, but currently most services use http.
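The declaration format can be sketched as a small parser (illustrative only; the authoritative parsing and validation live in the Backend.AI agent):

```python
# Ports reserved by Backend.AI for the query/PTY modes and intrinsic services
RESERVED_PORTS = {2000, 2001, 2002, 2003, 2200, 7681}

def parse_service_ports(label: str) -> list[tuple[str, str, int]]:
    # Parse "name:protocol:port" declarations separated by commas.
    entries = []
    for decl in label.split(","):
        name, protocol, port_str = decl.strip().split(":")
        if protocol not in ("tcp", "http", "pty"):
            raise ValueError(f"unsupported protocol: {protocol}")
        port = int(port_str)
        if port in RESERVED_PORTS:
            raise ValueError(f"port {port} is reserved by Backend.AI")
        entries.append((name, protocol, port))
    return entries

print(parse_service_ports("jupyter:http:8080,tensorboard:http:6006"))
```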

Note that there are a few port numbers reserved for Backend.AI itself and the intrinsic service ports. TCP ports 2000 and 2001 are reserved for the query mode, 2002 and 2003 for the native pseudo-terminal mode (stdin and stdout combined with stderr), 2200 for the intrinsic SSH service, and 7681 for the intrinsic ttyd service.

Up to Backend.AI 19.09, this was the only method to define a service port for images, and the service-specific launch sequences were all hard-coded in the ai.backend.kernel module.

Service Definition DSL

Now the image author should define the service launch sequences using a DSL (domain-specific language). The service definitions are written as JSON files in the container’s /etc/backend.ai/service-defs directory. The file names must match the name parts of the port mapping declarations.

For example, a sample service definition file for “jupyter” service (hence its filename must be /etc/backend.ai/service-defs/jupyter.json) looks like:

{
    "prestart": [
      {
        "action": "write_tempfile",
        "args": {
          "body": [
            "c.NotebookApp.allow_root = True\n",
            "c.NotebookApp.ip = \"0.0.0.0\"\n",
            "c.NotebookApp.port = {ports[0]}\n",
            "c.NotebookApp.token = \"\"\n",
            "c.FileContentsManager.delete_to_trash = False\n"
          ]
        },
        "ref": "jupyter_cfg"
      }
    ],
    "command": [
        "{runtime_path}",
        "-m", "jupyterlab",
        "--no-browser",
        "--config", "{jupyter_cfg}"
    ],
    "url_template": "http://{host}:{port}/"
}

A service definition is composed of three major fields: prestart that contains a list of prestart actions, command as a list of template-enabled strings, and an optional url_template as a template-enabled string that defines the URL presented to the end-user on CLI or used as the redirection target on GUI with wsproxy.

The “template-enabled” strings may contain references to a contextual set of variables in curly braces. All variable substitutions follow Python’s brace-style formatting syntax and rules.

Available predefined variables

There are a few predefined variables as follows:

  • ports: A list of TCP ports used by the service. Most services have only one port. An item in the list may be referenced using bracket notation like {ports[0]}.

  • runtime_path: A string representing the full path to the runtime, as specified in the ai.backend.runtime-path image label.
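Since the substitution follows Python’s brace-style formatting, it can be reproduced with str.format_map. A minimal sketch (the jupyter_cfg value is a hypothetical prestart ref result; ports and runtime_path are the predefined variables):

```python
# Contextual variables: ports and runtime_path are predefined;
# "jupyter_cfg" is a hypothetical value stored by a prestart action's "ref".
variables = {
    "ports": [8080],
    "runtime_path": "/usr/local/bin/python",
    "jupyter_cfg": "/tmp/jupyter-config.py",
}
command = ["{runtime_path}", "-m", "jupyterlab", "--config", "{jupyter_cfg}"]
resolved = [arg.format_map(variables) for arg in command]
# Bracket notation indexes into a list, just like str.format:
line = "c.NotebookApp.port = {ports[0]}\n".format_map(variables)
print(resolved, line)
```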

Available prestart actions

A prestart action is composed of two mandatory fields action and args (see the table below), and an optional field ref. The ref field defines a variable that stores the result of the action and can be referenced in later parts of the service definition file where the arguments are marked as “template-enabled”.

write_file

  • Arguments:

    • body: a list of string lines (template-enabled)

    • filename: a string representing the file name (template-enabled)

    • mode: an optional octal number as a string, representing the UNIX file permission (default: “755”)

    • append: an optional boolean. If true, the file is opened in append mode.

  • Return: None

write_tempfile

  • Arguments:

    • body: a list of string lines (template-enabled)

    • mode: an optional octal number as a string, representing the UNIX file permission (default: “755”)

  • Return: the generated file path

mkdir

  • Arguments:

    • path: the directory path (template-enabled); parent directories are auto-created

  • Return: None

run_command

  • Arguments:

    • command: the command-line argument list as passed to the exec syscall (template-enabled)

  • Return: a dictionary with two fields, out and err, containing the console output decoded as UTF-8

log

  • Arguments:

    • body: a string to send as a kernel log (template-enabled)

    • debug: a boolean to lower the logging level to DEBUG (default is INFO)

  • Return: None

Warning

The run_command action should return quickly; otherwise, the session creation latency will increase. If you need to run a background process, you must use the program’s own options to let it daemonize, or wrap it as a background shell command (["/bin/sh", "-c", "... &"]).
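The prestart mechanism can be sketched as a tiny interpreter covering only the write_tempfile and mkdir actions (an illustration; the real implementation lives in the kernel runner, i.e., the ai.backend.kernel package):

```python
import os
import tempfile
from pathlib import Path

def run_prestart(actions: list[dict], variables: dict) -> dict:
    # Minimal illustrative interpreter for service-def prestart actions.
    for item in actions:
        action, args = item["action"], item["args"]
        if action == "write_tempfile":
            # Render the template-enabled lines and write them to a temp file.
            body = "".join(line.format_map(variables) for line in args["body"])
            fd, path = tempfile.mkstemp()
            with os.fdopen(fd, "w") as f:
                f.write(body)
            os.chmod(path, int(args.get("mode", "755"), 8))
            if "ref" in item:
                # Expose the result to later template-enabled strings.
                variables[item["ref"]] = path
        elif action == "mkdir":
            Path(args["path"].format_map(variables)).mkdir(parents=True, exist_ok=True)
        else:
            raise NotImplementedError(action)
    return variables

result = run_prestart(
    [{
        "action": "write_tempfile",
        "args": {"body": ["c.NotebookApp.port = {ports[0]}\n"]},
        "ref": "jupyter_cfg",
    }],
    {"ports": [8080]},
)
print(open(result["jupyter_cfg"]).read())
```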

Interpretation of URL template

The url_template field is used by the client SDK and wsproxy to construct the actual URL presented to the end-user (or to the end-user’s web browser as the redirection target). Its template variables are therefore not interpolated when starting the service; instead, they are parsed and interpolated by the clients. There are only three fixed variables: {protocol}, {host}, and {port}.

Here is a sample service-definition that utilizes the URL template:

{
  "command": [
    "/opt/noVNC/utils/launch.sh",
    "--vnc", "localhost:5901",
    "--listen", "{ports[0]}"
  ],
  "url_template": "{protocol}://{host}:{port}/vnc.html?host={host}&port={port}&password=backendai&autoconnect=true"
}
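The client-side interpolation of url_template can be illustrated as follows (the host and port values here are hypothetical; actual values are filled in by the client SDK or wsproxy):

```python
# The url_template from the sample above; only protocol, host, and port
# are available as substitution variables.
url_template = (
    "{protocol}://{host}:{port}/vnc.html"
    "?host={host}&port={port}&password=backendai&autoconnect=true"
)
url = url_template.format(protocol="https", host="example.com", port=30100)
print(url)
```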

Jail Policy

(TODO: jail policy syntax and interpretation)

Adding Custom Jail Policy

To write a new policy, implement the jail policy interface in Go and embed it inside your jail build. Please have a look at the existing jail policies as references.

Example: An Ubuntu-based Kernel

FROM ubuntu:16.04

# Add commands for image customization
RUN apt-get install ...

# Backend.AI specifics
RUN apt-get install libssl
LABEL ai.backend.kernelspec=1 \
      ai.backend.resource.min.cpu=1 \
      ai.backend.resource.min.mem=256m \
      ai.backend.envs.corecount="OPENBLAS_NUM_THREADS,OMP_NUM_THREADS,NPROC" \
      ai.backend.features="batch query uid-match user-input" \
      ai.backend.base-distro="ubuntu16.04" \
      ai.backend.runtime-type="python" \
      ai.backend.runtime-path="/usr/local/bin/python" \
      ai.backend.service-ports="jupyter:http:8080"
COPY service-defs/*.json /etc/backend.ai/service-defs/
COPY policy.yml /etc/backend.ai/jail/policy.yml

Custom startup scripts (aka custom entrypoint)

When the image has preopen service ports and/or an endpoint port, Backend.AI automatically sets up application proxy tunnels as if the listening applications are already started.

To initialize and start such applications, put a shell script as /opt/container/bootstrap.sh when building the image. This per-image bootstrap script is executed as root by the agent-injected entrypoint.sh.

Warning

Since Backend.AI overrides the command and the entrypoint of container images to run the kernel runner regardless of the image content, setting CMD or ENTRYPOINT in the Dockerfile has no effect. You should use /opt/container/bootstrap.sh to migrate existing entrypoint/command wrappers.

Warning

/opt/container/bootstrap.sh must return immediately to prevent the session from staying in the PREPARING status. This means that it should run service applications in the background via daemonization.

To run a process with the user privilege, you should use su-exec, which is also injected by the agent, like:

/opt/kernel/su-exec "${LOCAL_GROUP_ID}:${LOCAL_USER_ID}" /path/to/your/service

Implementation details

The query mode I/O protocol

The input is a ZeroMQ multipart message with two payloads. The first payload should contain a unique identifier for the code snippet (usually a hash of it), but currently it is ignored (reserved for future caching implementations). The second payload should contain a UTF-8 encoded source code string.

The reply is a ZeroMQ multipart message with a single payload, containing a UTF-8 encoded string of the following JSON object:

{
    "stdout": "hello world!",
    "stderr": "oops!",
    "exceptions": [
        ["exception-name", ["arg1", "arg2"], false, null]
    ],
    "media": [
        ["image/png", "data:image/base64,...."]
    ],
    "options": {
        "upload_output_files": true
    }
}

Each item in exceptions is an array composed of four items: exception name, exception arguments (optional), a boolean indicating if the exception is raised outside the user code (mostly false), and a traceback string (optional).

Each item in media is an array of two items: MIME-type and the data string. Specific formats are defined and handled by the Backend.AI Media module.

The options field is optional. If upload_output_files is true (the default), the agent uploads the files generated by user code in the working directory (/home/work) to an AWS S3 bucket and makes their URLs available in the front-end.
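The framing described above can be sketched with plain Python, independent of the actual ZeroMQ transport (the SHA-256 choice for the snippet identifier is an assumption for illustration, since the first payload is currently ignored):

```python
import hashlib
import json

def build_query_request(code: str) -> list[bytes]:
    # Two payload frames: an identifier of the snippet (currently ignored
    # by the kernel runner; reserved for caching) and the UTF-8 code.
    code_bytes = code.encode("utf-8")
    ident = hashlib.sha256(code_bytes).hexdigest().encode("ascii")
    return [ident, code_bytes]

def parse_query_reply(frames: list[bytes]) -> dict:
    # The reply is a single-frame multipart message carrying a UTF-8 JSON object.
    (payload,) = frames
    return json.loads(payload.decode("utf-8"))

request = build_query_request('print("hello world!")')
reply = parse_query_reply([b'{"stdout": "hello world!", "stderr": ""}'])
print(reply["stdout"])
```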

The pseudo-terminal mode protocol

If you want to allow users to have real-time interactions with your kernel using web-based terminals, you should implement the PTY mode as well. A good example is our “git” kernel runner.

The key concept is the separation of the “outer” daemon and the “inner” target program (e.g., a shell). The outer daemon should wrap the inner program inside a pseudo-tty. As the outer daemon is completely hidden from end-users during terminal interaction, its programming language may differ from that of the inner program. The challenge is that you need to implement piping of ZeroMQ sockets from/to pseudo-tty file descriptors. It is up to you how you implement the outer daemon, but if you choose Python for it, we recommend using asyncio or similar event loop libraries such as Tornado and Twisted to multiplex sockets and file descriptors for both input/output directions. When piping the messages, the outer daemon should not apply any specific transformation; it should send and receive all raw data/control byte sequences transparently because the front-end (e.g., terminal.js) is responsible for interpreting them. Currently we use PUB/SUB ZeroMQ socket types, but this may change later.

Optionally, you may run the query-mode loop side-by-side. For example, our git kernel supports terminal resizing and pinging commands as query-mode inputs. There is no fixed specification for such commands yet, but the current CodeOnWeb uses the following:

  • %resize <rows> <cols>: resize the pseudo-tty’s terminal to fit with the web terminal element in user browsers.

  • %ping: just a no-op command to prevent kernel idle timeouts while the web terminal is open in user browsers.

A best practice (not mandatory but recommended) for PTY mode kernels is to automatically respawn the inner program if it terminates (e.g., the user has exited the shell) so that the users are not locked in a “blank screen” terminal.

Using Mocked Accelerators

For developers who do not have access to physical accelerator devices such as CUDA GPUs, we provide a mock-up plugin to simulate the system configuration with such devices, allowing development and testing of accelerator-related features in various components including the web UI.

Configuring the mock-accelerator plugin

Check out the examples in the configs/accelerator directory.

Here are the descriptions of each field:

  • slot_name: The resource slot’s main key name. The plugin’s resource slot name has the form of "<slot_name>.<subtype>", where the subtype may be something such as device (default), shares (for the fractional allocation mode). For CUDA MIG devices, it becomes a string including the slice size from the device memory size such as 10g-mig.

    To configure the fractional allocation mode, you should also specify the etcd accelerator plugin configuration like the following JSON, where unit_mem and unit_proc are used as the divisors to calculate the 1.0 fraction:

    {
       "config": {
         "plugins": {
           "accelerator": {
             "<slot_name>": {
               "allocation_mode": "fractional",
               "unit_mem": 1073741824,
               "unit_proc": 10
             }
           }
         }
       }
    }
    

    In the above example, 10 subprocessors and 1 GiB of device memory are regarded as a 1.0 fractional device. You may store it as a JSON file and put it into the etcd configuration tree like:

    $ ./backend.ai mgr etcd put-json '' mydevice-fractional-mode.json
    
  • device_plugin_name: The class name to use as the actual implementation. Currently there are two: CUDADevice and MockDevice.

  • formats.<subtype>: The tables for per-subtype formatting details

    • display_icon: The device icon type displayed in the UI.

    • display_unit: The resource slot unit displayed in the UI, alongside the amount numbers.

    • human_readable_name: The device name displayed in the UI.

    • description: The device description displayed in the UI.

    • number_format: The number formatting string used for the UI.

      • binary: A boolean flag to indicate whether to use the binary suffixes (divided by 2^(10n) instead of 10^(3n))

      • round_length: The length of fixed points to wrap the numeric value of this resource slot. If zero, the number is treated as an integer.

  • devices: The list of mocked device declarations

    • mother_uuid: The unique ID of the device, which may be random-generated

    • model_name: The model name to report to the manager as metadata

    • numa_node: The NUMA node index to place this device.

    • subproc_count: The number of sub-processing cores (e.g., the number of streaming multi-processors of CUDA GPUs)

    • memory_size: The size of on-device memory represented as human-readable binary sizes

    • is_mig_devices: (CUDA-specific) whether this device is a MIG slice or a full device

Activating the mock-accelerator plugin

Add "ai.backend.accelerator.mock" to the agent.toml’s [agent].allowed-compute-plugins field. Then restart the agent.
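For example, the relevant part of agent.toml would look like the following (other fields omitted; keep any plugins you already allow in the list):

```toml
[agent]
allowed-compute-plugins = ["ai.backend.accelerator.mock"]
```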

Version Numbering

  • Version numbering uses x.y.z format (where x, y, z are integers).

  • Mostly, we follow the calendar versioning scheme.

  • x.y is a release branch name (major releases per 6 months).

    • When y is smaller than 10, we prepend a leading zero in the version numbers (e.g., 20.09.0).

    • When referring to the version in other Python packages’ requirements, you need to strip the leading zeros (e.g., 20.9.0 instead of 20.09.0) because Python setuptools normalizes the version integers.

  • x.y.z is a release tag name (patch releases).

  • When releasing x.y.0:

    • Create a new x.y branch, do all bugfix/hotfix there, and make x.y.z releases there.

    • All fixes must be first implemented on the main branch and then cherry-picked back to x.y branches.

      • When cherry-picking, use the -e option to edit the commit message.
        Append Backported-From: main and Backported-To: X.Y lines after one blank line at the end of the existing commit message.

    • Change the version number of main to x.(y+1).0.dev0

    • There are no strict rules about alpha/beta/rc builds yet. We will elaborate as we scale up.
      Once used, alpha versions will have aN suffixes, beta versions bN suffixes, and RC versions rcN suffixes, where N is an integer.

  • New development should go on the main branch.

    • main: commit here directly if your changes are a self-complete one as a single commit.

    • Use both short-lived and long-running feature branches freely, but ensure their names differ from release branches and tags.

  • The major/minor (x.y) version of Backend.AI subprojects will go together to indicate compatibility. Currently manager/agent/common versions progress this way, while client SDKs have their own version numbers and the API specification has a different vN.yyyymmdd version format.

    • Generally backend.ai-manager 1.2.p is compatible with backend.ai-agent 1.2.q (where p and q are same or different integers)

      • As of 22.09, this is no longer guaranteed. All server-side core component versions must exactly match each other, as we release them at once from the mono-repo, even for components without any code changes.

    • The client is guaranteed to be backward-compatible with servers that share the same API specification version.
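The setuptools/PEP 440 leading-zero normalization mentioned above can be illustrated with a minimal sketch:

```python
# Numeric release segments are normalized by stripping leading zeros,
# which is why requirements must say 20.9.0 rather than 20.09.0.
def normalize_release(version: str) -> str:
    return ".".join(str(int(part)) for part in version.split("."))

print(normalize_release("20.09.0"))  # -> 20.9.0
```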

Upgrading

You can upgrade the installed Python packages using the pip install -U ... command along with their dependencies.

If you have cloned the stable version of source code from git, then pull and check out the next x.y release branch. It is recommended to re-run pip install -U -r requirements.txt as dependencies might be updated.

For the manager, ensure that your database schema is up-to-date by running alembic upgrade head. If you set up your development environment with Pants and the install-dev.sh script, keep your database schema up-to-date via ./py -m alembic upgrade head instead of the plain alembic command above.

Also check if any manual etcd configuration scheme change is required, though we will try to keep it compatible and automatically upgrade when first executed.

Version Upgrade Guides

Upgrading from 20.03 to 20.09

(TODO)

Migrating from the Docker Hub to cr.backend.ai

As of November 2020, the Docker Hub has begun to limit the retention time and the rate of pulls of public images. Since Backend.AI uses a number of Docker images with a variety of access frequencies, we decided to migrate to our own container registry, https://cr.backend.ai.

It is strongly recommended to set a maintenance period if there are active users of the Backend.AI cluster, to prevent new sessions from starting during the migration. This registry migration does not affect existing running sessions, though removing the Docker images on the agent nodes can only be done after terminating all existing containers started from the old images, and there will be a brief disconnection of service ports because the manager needs to be restarted.

  1. Update your Backend.AI installation to the latest version (manager 20.03.11 or 20.09.0b2) to get support for Harbor v2 container registries.

  2. Save the following JSON snippet as registry-config.json.

    {
      "config": {
        "docker": {
          "registry": {
            "cr.backend.ai": {
              "": "https://cr.backend.ai",
              "type": "harbor2",
              "project": "stable,community"
            }
          }
        }
      }
    }
    
  3. Run the following using the manager CLI on one of the manager nodes:

    $ sudo systemctl stop backendai-manager  # stop the manager daemon (may differ by setup)
    $ backend.ai mgr etcd put-json '' registry-config.json
    $ backend.ai mgr image rescan cr.backend.ai
    $ sudo systemctl start backendai-manager  # start the manager daemon (may differ by setup)
    
    • The agents will automatically pull the images because the image references have changed, even when the new images are identical to the existing ones. It is recommended to pull the essential images yourself by running the docker pull command on the agent nodes, to avoid long waiting times when starting sessions.

    • Now the images are categorized with an additional path prefix, such as stable and community. More prefixes may be introduced in the future, and some prefixes may be made available only to specific sets of users/user groups, with dedicated credentials.

      For example, lablup/python:3.6-ubuntu18.04 is now referred to as cr.backend.ai/stable/python:3.6-ubuntu18.04.

    • If you have configured image aliases, you need to update them manually as well, using the backend.ai mgr image alias command. This does not affect existing sessions running with old aliases.

  4. Update the allowed docker registries policy for each domain using the backend.ai mgr dbshell command. Remove “index.docker.io” from the existing values and replace “…” below with your own domain names and additional registries.

    SELECT name, allowed_docker_registries FROM domains;  -- check the current config
    UPDATE domains SET allowed_docker_registries = '{cr.backend.ai,...}' WHERE name = '...';
    
  5. Now you may start new sessions using the images from the new registry.

  6. After terminating all existing sessions using the old images from the Docker Hub (i.e., images whose names start with lablup/ prefix), remove the image metadata and registry configuration using the manager CLI:

    $ backend.ai mgr etcd delete --prefix images/index.docker.io
    $ backend.ai mgr etcd delete --prefix config/docker/registry/index.docker.io
    
  7. Run docker rmi commands to clean up the pulled images on the agent nodes. (Automatic/managed removal of images will be implemented in future versions of Backend.AI.)

Backend.AI Manager Reference

Manager API Common Concepts

API and Document Conventions

HTTP Methods

We use the standard HTTP/1.1 methods (RFC-2616), such as GET, POST, PUT, PATCH, and DELETE, with some additions from WebDAV (RFC-3253), such as the REPORT method to send JSON objects in request bodies with GET semantics.

If your client runs under a restrictive environment that only allows a subset of the above methods, you may use the universal POST method with an extra HTTP header like X-Method-Override: REPORT, so that the Backend.AI gateway can recognize the intended HTTP method.
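As a minimal sketch, a client in a POST-only environment could assemble such a tunneled request like this (the helper name and the REPORT body are our own illustration, not a documented API shape):

```python
def make_override_request(intended_method: str, body: dict) -> dict:
    """Build the pieces of a POST request that tunnels another HTTP
    method via the X-Method-Override header described above."""
    return {
        "method": "POST",  # the only method actually sent on the wire
        "headers": {
            "Content-Type": "application/json",
            # The gateway reads this header to recover the intended method.
            "X-Method-Override": intended_method,
        },
        "json": body,
    }

# Tunnel a REPORT request through POST.
req = make_override_request("REPORT", {"sessionId": "mysession-01"})
```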

Parameters in URI and JSON Request Body

The parameters with colon prefixes (e.g., :id) are part of the URI path and must be encoded using a proper URI-compatible encoding scheme, such as encodeURIComponent(value) in JavaScript and urllib.parse.quote(value, safe='~()*!.\'') in Python 3+.

Other parameters should be set as a key-value pair of the JSON object in the HTTP request body. The API server accepts both UTF-8 encoded bytes and standard-compliant Unicode-escaped strings in the body.
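For instance, a Python 3 client could encode a path parameter exactly as suggested above before substituting it into the URI (the session ID value here is made up):

```python
from urllib.parse import quote

session_id = "mysession/01 test"   # a hypothetical raw :id value
# The safe-character set matches the recommendation in the text above.
encoded = quote(session_id, safe='~()*!.\'')
uri = f"/session/{encoded}"        # → "/session/mysession%2F01%20test"
```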

HTTP Status Codes and JSON Response Body

The API responses always contain a root JSON object, regardless of success or failures.

For successful responses (HTTP status 2xx), the root object has a varying set of key-value pairs depending on the API.

For failures (HTTP status 4xx/5xx), the root object contains at least two keys: type, which uniquely identifies the failure reason as a URI, and title, which holds a human-readable error message. Some failures may return extra structured information as additional key-value pairs. We use RFC 7807-style problem details returned as JSON in the response body.

JSON Field Notation

A dot-separated field name means a nested object. If a field-name component is a pure integer, it denotes a list item.

Example

Meaning

a

The attribute a of the root object. (e.g., 123 at {"a": 123})

a.b

The attribute b of the object a on the root. (e.g., 456 at {"a": {"b": 456}})

a.0

An item in the list a on the root. 0 means an arbitrary array index, not the specific item at index zero. (e.g., any of 13, 57, 24, and 68 at {"a": [13, 57, 24, 68]})

a.0.b

The attribute b of an item in the list a on the root. (e.g., any of 1, 2, and 3 at {"a": [{"b": 1}, {"b": 2}, {"b": 3}]})
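The notation above can be mechanized; the following helper (our own, not part of any Backend.AI SDK) resolves a dotted field path against a decoded JSON document, using a concrete index where the documentation writes 0 for an arbitrary one:

```python
def resolve_field(doc, path: str):
    """Resolve a dot-separated field path as defined in this section."""
    node = doc
    for part in path.split("."):
        if part.isdigit():
            node = node[int(part)]   # integer component: a list item
        else:
            node = node[part]        # otherwise: a nested object attribute
    return node

doc = {"a": [{"b": 1}, {"b": 2}, {"b": 3}]}
```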

JSON Value Types

This documentation uses a type annotation style similar to Python’s typing module, but with minor intuitive differences such as lower-cased generic type names and the wildcard written as an asterisk * instead of Any.

The common types are array (JSON array), object (JSON object), int (integer-only subset of JSON number), str (JSON string), and bool (JSON true or false). tuple and list are aliases to array. Optional values may be omitted or set to null.

We also define several custom types:

Type

Description

decimal

Fractional numbers represented as str to avoid losing precision. (e.g., to express money amounts)

slug

Similar to str, but the values should contain only alphanumeric characters, hyphens, and underscores. Also, hyphens and underscores must have at least one alphanumeric neighbor and cannot appear as the first or last character.

datetime

ISO-8601 timestamps in str, e.g., "YYYY-mm-ddTHH:MM:SS.ffffff+HH:MM". It may include optional timezone information; if the timezone is not included, the value is assumed to be UTC. The sub-second part has at most 6 digits (microseconds).

enum[*]

Only allows a fixed/predefined set of possible values in the given parametrized type.
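Two of the custom types can be illustrated in Python. The is_slug helper is our reading of the slug rules above (not Backend.AI's own validator), and the datetime lines show one way to produce a conforming timestamp:

```python
import re
from datetime import datetime, timezone

def is_slug(value: str) -> bool:
    """Check a string against the slug rules described above."""
    # Alphanumerics plus '-' and '_', which may not lead or trail.
    if not re.fullmatch(r"[A-Za-z0-9]([A-Za-z0-9_-]*[A-Za-z0-9])?", value):
        return False
    # Every '-' or '_' needs at least one alphanumeric neighbor.
    return all(value[i - 1].isalnum() or value[i + 1].isalnum()
               for i, ch in enumerate(value) if ch in "-_")

# A datetime value in the documented format, with an explicit UTC offset.
ts = datetime(2025, 1, 1, 12, 30, 45, 123456, tzinfo=timezone.utc)
rendered = ts.isoformat()  # "2025-01-01T12:30:45.123456+00:00"
```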

API Versioning

A version string of the Backend.AI API has two parts: a major revision (prefixed with v) and a minor release date after a dot following the major revision. For example, v23.20250101 indicates the 23rd major revision with a minor release on January 1st, 2025.

We keep backward compatibility between minor releases within the same major version. Therefore, all API query URLs are prefixed with the major revision, such as /v2/kernel/create. Minor releases may introduce new parameters and response fields but no URL changes. Accessing an unsupported major revision returns HTTP 404 Not Found.

Changed in version v3.20170615: The version prefix in API queries is deprecated (but still supported for now). For example, users should now call /kernel/create rather than /v2/kernel/create.

A client must specify the API version in the HTTP request header named X-BackendAI-Version. To check the latest minor release date of a specific major revision, try a GET query to the URL with only the major revision part (e.g., /v2). The API server will return a JSON string in the response body containing the full version. When querying the API version, you do not have to specify the Authorization header, and the rate limiting is enforced per client IP address. Check out more details about Authentication and Rate Limiting.

Example version check response body:

{
   "version": "v2.20170315"
}

JSON Object References

Paging Query Object

It describes how many items to fetch for object listing APIs. If index exceeds the number of pages calculated by the server, an empty list is returned.

Key

Type

Description

size

int

The number of items per page. If set to zero or if this object is omitted entirely, all items are returned and index is ignored.

index

int

The page number to show, zero-based.

Paging Info Object

It contains the paging information based on the paging query object in the request.

Key

Type

Description

pages

int

The number of total pages.

count

int

The number of all items.
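The relationship between the two objects can be sketched as follows; this is our reading of how pages presumably derives from count and the requested page size, not a documented formula:

```python
import math

def paging_info(count: int, size: int) -> dict:
    """Build a paging-info-style object from the total item count and
    the page size given in the paging query."""
    # A size of zero means "return everything", i.e., a single page.
    pages = 1 if size == 0 else math.ceil(count / size)
    return {"pages": pages, "count": count}
```

For example, 45 items with a page size of 10 yield 5 pages, and requesting index 5 or beyond returns an empty list.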

KeyPair Item Object

Key

Type

Description

accessKey

slug

The access key part.

isActive

bool

Indicates if the keypair is active or not.

totalQueries

int

The number of queries done via this keypair. It may have a stale value.

created

datetime

The timestamp when the keypair was created.

KeyPair Properties Object

Key

Type

Description

isActive

bool

Indicates if the keypair is activated or not. If not activated, all authentication using the keypair returns 401 Unauthorized. When changed from true to false, existing running sessions continue to run but any requests to create new sessions are refused. (default: true)

concurrency

int

The maximum number of concurrent sessions allowed for this keypair. (default: 5)

ML.clusterSize

int

Sets the number of instances clustered together when launching new machine learning sessions. (default: 1)

ML.instanceMemory

int (MiB)

Sets the memory limit of each instance in the cluster launched for new machine learning sessions. (default: 8)

The enterprise edition offers the following additional properties:

Key

Type

Description

cost.automatic

bool

If set true, enables automatic cost optimization (BETA). With supported kernel types, it automatically suspends or resizes sessions so that they do not exceed the configured daily cost limit. (default: false)

cost.dailyLimit

str

The string representation of a money amount as a decimal. The currency is fixed to USD. (default: "50.00")

Service Port Object

Key

Type

Description

name

slug

The name of the service provided by the container. See also: Terminal Emulation

protocol

str

The type of network protocol used by the container service.

Batch Execution Query Object

Key

Type

Description

build

str

The bash command to build the main program from the given uploaded files.

If this field is not present, an empty string, or null, the build step is skipped.

If this field is the constant string "*", it will use a default build script provided by the kernel. For example, the C kernel’s default Makefile adds all C source files under the working directory and compiles them into the ./main executable, with commonly used compiler/linker flags: "-pthread -lm -lrt -ldl".

exec

str

The bash command to execute the main program.

If this is not present, an empty string, or null, the server only performs the build step and options.buildLog is assumed to be true (the given value is ignored).

clean

str

The bash command to clean the intermediate files produced during the build phase. The clean step comes before the build step if specified so that the build step can (re)start fresh.

If this field is not present, an empty string, or null, the clean step is skipped.

Unlike the build and exec commands, the default for "*" is to do nothing, to prevent accidental deletion of files unrelated to the build due to bugs or mistakes.

Note

A client can distinguish whether the current output is from the build phase or the execution phase by whether it has received build-finished status or not.

Note

All shell commands are by default executed under /home/work. The common environment is:

TERM=xterm
LANG=C.UTF-8
SHELL=/bin/bash
USER=work
HOME=/home/work

but individual kernels may have additional environment settings.

Warning

The shell does NOT have access to sudo or the root privilege. However, some kernels may allow installing language-specific packages in the user directory.

Also, your build script and the main program are executed inside Backend.AI Jail, meaning that some system calls are blocked by our policy. Since the ptrace syscall is blocked, you cannot use native debuggers such as gdb.

This limitation, however, is subject to change in the future.

Example:

{
  "build": "gcc -Wall main.c -o main -lrt -lz",
  "exec": "./main"
}
Execution Result Object

Key

Type

Description

runId

str

The user-provided run identifier. If the user has NOT provided it, this will be set by the API server upon the first execute API call. In that case, the client should use it for the subsequent execute API calls during the same run.

status

enum[str]

One of "continued", "waiting-input", "finished", "clean-finished", "build-finished", or "exec-timeout". See more details at Code Execution Model.

exitCode

int | null

The exit code of the last process. This field has a valid value only when the status is "finished", "clean-finished" or "build-finished". Otherwise it is set to null.

For batch-mode kernels and query-mode kernels without global context support, exitCode is the return code of the last executed child process in the kernel. In the execution step of a batch-mode run, this is always 127 (the common UNIX shell convention for “command not found”) when the build step has failed.

For query-mode kernels with global context support, this value is always zero, regardless of whether the user code has caused an exception or not.

A negative value (which cannot happen with normal process termination) indicates a Backend.AI-side error.

console

list[object]

A list of Console Item Object.

options

object

An object containing extra display options. If there are no options indicated by the kernel, this field is null. When result.status is "waiting-input", it has a boolean field is_password so that you can use different types of text boxes for user inputs.

files

list[object]

A list of Execution Result File Object that represents the files generated in the /home/work/.output directory of the container during the code execution.

Console Item Object

Key

Type

Description

(root)

[enum, *]

A tuple of the item type and the item content. The type may be "stdout", "stderr", or others.

See more details at Handling Console Output.

Execution Result File Object

Key

Type

Description

name

str

The name of a created file after execution.

url

str

The URL of a created file uploaded to AWS S3.

Container Stats Object

Key

Type

Description

cpu_used

int (msec)

The total time the kernel was running.

mem_max_bytes

int (Byte)

The maximum memory usage.

mem_cur_bytes

int (Byte)

The current memory usage.

net_rx_bytes

int (Byte)

The total amount of data received through the network.

net_tx_bytes

int (Byte)

The total amount of data transmitted through the network.

io_read_bytes

int (Byte)

The total amount of data read from I/O.

io_write_bytes

int (Byte)

The total amount of data written to I/O.

io_max_scratch_size

int (Byte)

This field is currently unused.

io_cur_scratch_size

int (Byte)

This field is currently unused.

Creation Config Object

Key

Type

Description

environ

object

A dictionary object specifying additional environment variables. The values must be strings.

mounts

list[str]

An optional list of the names of virtual folders that belong to the current API key. These virtual folders are mounted under /home/work. For example, if the virtual folder name is abc, you can access it at /home/work/abc.

If the name contains a colon in the middle, the second part of the string indicates the alias location in the kernel’s file system which is relative to /home/work.

You may mount up to 5 folders for each session.

clusterSize

int

The number of instances bundled for this session.

resources

Resource Slot Object

The resource slot specification for each container in this session.

Added in version v4.20190315.

instanceMemory

int (MiB)

The maximum memory allowed per instance. The value is capped by the per-kernel image limit. Additional charges may apply on the public API service.

Deprecated since version v4.20190315.

instanceCores

int

The number of CPU cores. The value is capped by the per-kernel image limit. Additional charges may apply on the public API service.

Deprecated since version v4.20190315.

instanceGPUs

float

The fraction of GPU devices (1.0 means a whole device). The value is capped by the per-kernel image limit. Additional charges may apply on the public API service.

Deprecated since version v4.20190315.

Resource Slot Object

Key

Type

Description

cpu

str | int

The number of CPU cores.

mem

str | int

The amount of main memory in bytes. When the slot object is used as an input to an API, it may be represented using the binary scale suffixes such as k, m, g, t, p, e, z, and y, e.g., “512m”, “512M”, “512MiB”, “64g”, “64G”, “64GiB”, etc. When the slot object is used as an output of an API, this field is always represented as a string containing the unscaled number of bytes.

Warning

When parsing this field as JSON, you must check whether your JSON library or the programming language supports large integers. For instance, most modern JavaScript engines support integers only up to 2^53 - 1 (8 PiB - 1), which is often defined as the Number.MAX_SAFE_INTEGER constant. Otherwise, you need to use a third-party big number library. To prevent unexpected side effects, Backend.AI always returns this field as a string.

cuda.device

str | int

The number of CUDA devices. Only available when the server is configured to use the CUDA agent plugin.

cuda.shares

str

The virtual share of CUDA devices represented as fractional decimals. Only available when the server is configured to use the CUDA agent plugin with the fractional allocation mode (enterprise edition only).

tpu.device

str | int

The number of TPU devices. Only available when the server is configured to use the TPU agent plugin (cloud edition only).

(others)

str

More resource slot types may be available depending on the server configuration and agent plugins. There are two types for an arbitrary slot: “count” (the default) and “bytes”.

For “count” slots, you may put an arbitrary positive real number, but fractions may be truncated depending on the plugin implementation.

For “bytes” slots, its interpretation and representation follows that of the mem field.
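Since clients often need to construct these values, here is a sketch of parsing a mem-style slot value into an exact byte count in Python, whose integers are arbitrary-precision so the MAX_SAFE_INTEGER concern above does not apply. This is our reading of the suffix rules, not Backend.AI's own parser:

```python
import re

# Binary (1024-based) scale factors for the suffixes listed above.
_SCALES = {s: 1024 ** (i + 1) for i, s in enumerate("kmgtpezy")}

def parse_mem(value) -> int:
    """Parse a mem-style slot value ("512m", "64GiB", "1073741824",
    or a plain int) into an exact number of bytes."""
    if isinstance(value, int):
        return value
    m = re.fullmatch(r"(\d+)\s*([kmgtpezy])?(i?b)?",
                     value.strip(), re.IGNORECASE)
    if not m:
        raise ValueError(f"unparseable size: {value!r}")
    number, suffix, _unit = m.groups()
    scale = _SCALES[suffix.lower()] if suffix else 1
    return int(number) * scale
```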

Resource Preset Object

Key

Type

Description

name

str

The name of this preset.

resource_slots

Resource Slot Object

The pre-configured combination of resource slots. If it contains slot types that are not currently used/activated in the cluster, they will be removed when returned via /resource/* REST APIs.

shared_memory

int (Byte)

The pre-configured shared memory size. Clients can send humanized strings like ‘2g’, ‘128m’, or ‘534773760’, which are automatically converted into bytes.

Virtual Folder Creation Result Object

Key

Type

Description

id

UUID

An internally used unique identifier of the created vfolder. Currently it has no use on the client side.

name

str

The name of the created vfolder, as given by the client.

host

str

The host name where the vfolder is created.

user

UUID

The user who has the ownership of this vfolder.

group

UUID

The group that owns this vfolder.

Added in version v4.20190615: user and group fields.

Virtual Folder List Item Object

Key

Type

Description

name

str

The human-readable name set when created.

id

slug

The unique ID of the folder.

host

str

The host name where this folder is located.

is_owner

bool

True if the client user is the owner of this folder. False if the folder is shared from a group or another user.

permission

enum

The requested user’s permission for this folder. (One of “ro”, “rw”, and “wd”, which represent read-only, read-write, and write-delete, respectively. Currently “rw” and “wd” have no difference.)

user

UUID

The user ID if this item is a user vfolder. Otherwise, null.

group

UUID

The group ID if this item is a group vfolder. Otherwise, null.

type

enum

The owner type of vfolder. One of “user” or “group”.

Added in version v4.20190615: user, group, and type fields.

Virtual Folder Item Object

Key

Type

Description

name

str

The human-readable name set when created.

id

UUID

The unique ID of the folder.

host

str

The host name where this folder is located.

is_owner

bool

True if the client user is the owner of this folder. False if the folder is shared from a group or another user.

num_files

int

The number of files in this folder.

permission

enum

The requested user’s permission for this folder.

created_at

datetime

The date and time when the folder is created.

last_used

datetime

The date and time when the folder is last used.

user

UUID

The user ID if the owner of this item is a user. Otherwise, null.

group

UUID

The group ID if the owner of this item is a group. Otherwise, null.

type

enum

The owner type of vfolder. One of “user” or “group”.

Added in version v4.20190615: user, group, and type fields.

Virtual Folder File Object

Key

Type

Description

filename

str

The filename.

mode

int

The file’s mode (permission) bits as an integer.

size

int

The file’s size.

ctime

int

The timestamp when the file is created.

mtime

int

The timestamp when the file is last modified.

atime

int

The timestamp when the file is last accessed.

Virtual Folder Invitation Object

Key

Type

Description

id

UUID

The unique ID of the invitation. Use this when making API requests referring to this invitation.

inviter

str

The inviter’s user ID (email) of the invitation.

permission

str

The permission that the invited user will have.

state

str

The current state of the invitation.

vfolder_id

UUID

The unique ID of the vfolder where the user is invited.

vfolder_name

str

The name of the vfolder where the user is invited.

Key

Type

Description

content

str

The retrieved content (multi-line string) of fstab.

node

str

The node type, either “agent” or “manager”.

node_id

str

The node’s unique ID.

Added in version v4.20190615.

Authentication

Access Tokens and Secret Keys

To make requests to the API server, the client needs a pair of an API access key and a secret key. You can get them from the cloud service or from the administrator of your Backend.AI cluster.

The server uses the access key to identify each client, and the secret key to authenticate the client and to verify the integrity of API requests.

Warning

As a security measure to prevent the API access key and secret key from being leaked, we recommend using a server-side proxy for the API service when designing a public-facing front-end service with Backend.AI.

For local deployments, you can create a master dummy pair in the configuration. (TODO)

Common Structure of API Requests

HTTP Headers

Method

GET / REPORT / POST / PUT / PATCH / DELETE

Query String

If your access key has the administrator privilege, your client may optionally specify another user’s access key as the owner_access_key parameter in the URL query string (in addition to other API-specific parameters, if any) to change the scope of the access key applied to access and manipulation of keypair-specific resources such as kernels and vfolders.

Added in version v4.20190315.

Content-Type

Must always be application/json.

Authorization

Signature information generated as the section Signing API Requests describes.

Date

The date and time of the request, formatted in RFC 2822 or ISO 8601. If no timezone is specified, UTC is assumed. The deviation between the request time and the server time must be within 15 minutes.

X-BackendAI-Date

Same as Date. It may be omitted if the Date header is present.

X-BackendAI-Version

vX.yyyymmdd, where X is the major version and yyyymmdd is the minor release date of the given API version. (e.g., 20160914)

X-BackendAI-Client-Token

Optional: a client-generated random string that lets the server distinguish repeated requests. This is important for keeping retries against transient failures idempotent (idempotency: the property that applying an operation multiple times does not change the result). (Under implementation)

Body

JSON-encoded request parameters

Common Structure of API Responses

HTTP Headers

Status Code

The standard HTTP status code for the API. Responses commonly used across all APIs include, but are not limited to, 200, 201, 204, 400, 401, 403, 404, 429, and 500.

Content-Type

application/json and variants (e.g., application/problem+json for errors)

Link

Web link headers specified as in RFC 5988. Only optionally used when returning a collection of objects.

X-RateLimit-*

The rate-limiting information (see Rate Limiting).

Body

JSON-encoded results

Signing API Requests

Each API request must be signed with a signature. First, the client should generate a signing key derived from its API secret key, and a string to sign obtained by canonicalizing the HTTP request.

Generating a signing key

The following Python code generates a signing key derived from the secret key. The key is generated by nesting HMAC over the current date (without the time) and the API endpoint address.

import hashlib, hmac
from datetime import datetime

SECRET_KEY = b'abc...'

def sign(key, msg):
  return hmac.new(key, msg, hashlib.sha256).digest()

def get_sign_key():
  t = datetime.utcnow()
  k1 = sign(SECRET_KEY, t.strftime('%Y%m%d').encode('utf8'))
  k2 = sign(k1, b'your.sorna.api.endpoint')
  return k2
Generating a string to sign

The string to sign is generated from the following request information:

  • The HTTP method (uppercase)

  • The URI including the query string

  • The value of the Date header, in ISO 8601 format (YYYYmmddTHHMMSSZ) using the UTC timezone (if Date is not given in the request, X-BackendAI-Date is used instead)

  • The canonicalized header/value pair of Host

  • The canonicalized header/value pair of Content-Type

  • The canonicalized header/value pair of X-BackendAI-Version

  • The hex-encoded hash of the entire request body. The hash function used must be identical to the one specified in the Authorization header (e.g., SHA256).

To generate the string to sign, the client should join the above values with the newline character ("\n", ASCII 10). All non-ASCII strings must be encoded in UTF-8. To canonicalize an HTTP header/value pair, first strip all leading and trailing whitespace characters ("\n", "\r", " ", "\t"; or ASCII 10, 13, 32, 9) from the header name and the value, then join the lower-cased header name and the value with a colon (":", ASCII 58).

The example in Example Requests and Responses generates its string to sign as follows:

GET
/v2
20160930T01:23:45Z
host:your.sorna.api.endpoint
content-type:application/json
x-sorna-version:v2.20170215
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

In this example, the hash value e3b0c4… is generated from an empty string using the SHA256 hash function, since GET requests have no body.

Then, the client should sign the generated string using the signing key and the hash function as follows:

import hashlib, hmac

str_to_sign = 'GET\n/v2...'
sign_key = get_sign_key()  # see "Generating a signing key"
m = hmac.new(sign_key, str_to_sign.encode('utf8'), hashlib.sha256)
signature = m.hexdigest()
Attaching the signature

Finally, the client should compose the Authorization HTTP header:

Authorization: BackendAI signMethod=HMAC-SHA256, credential=<access-key>:<signature>
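Putting the whole procedure together, a minimal end-to-end sketch might look as follows. The endpoint and the dummy keys are taken from the surrounding examples; the helper name and the choice of v6.20220615 as the version header value are our own assumptions:

```python
import hashlib
import hmac

# Dummy credentials from the example below; never hardcode real keys.
ACCESS_KEY = 'AKIAIOSFODNN7EXAMPLE'
SECRET_KEY = b'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
ENDPOINT = 'your.sorna.api.endpoint'

def _sign(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

def make_authorization(method: str, uri: str, date: str,
                       body: bytes = b'') -> str:
    """Derive the signing key, build the string to sign, and compose
    the Authorization header value, following the steps above."""
    # Signing key: HMAC nested over the request date (YYYYmmdd) and endpoint.
    sign_key = _sign(_sign(SECRET_KEY, date[:8].encode('utf8')),
                     ENDPOINT.encode('utf8'))
    str_to_sign = '\n'.join([
        method.upper(),
        uri,
        date,
        f'host:{ENDPOINT}',
        'content-type:application/json',
        'x-backendai-version:v6.20220615',  # assumed API version
        hashlib.sha256(body).hexdigest(),   # hex hash of the (empty) body
    ])
    signature = hmac.new(sign_key, str_to_sign.encode('utf8'),
                         hashlib.sha256).hexdigest()
    return ('BackendAI signMethod=HMAC-SHA256, '
            f'credential={ACCESS_KEY}:{signature}')

header = make_authorization('GET', '/v2', '20160930T01:23:45Z')
```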
Example Requests and Responses

In these examples, we use the following dummy access key and secret key:

  • Example access key: AKIAIOSFODNN7EXAMPLE

  • Example secret key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

An example of checking the latest API version
GET /v2 HTTP/1.1
Host: your.sorna.api.endpoint
Date: 20160930T01:23:45Z
Authorization: BackendAI signMethod=HMAC-SHA256, credential=AKIAIOSFODNN7EXAMPLE:022ae894b4ecce097bea6eca9a97c41cd17e8aff545800cd696112cc387059cf
Content-Type: application/json
X-BackendAI-Version: v2.20170215
HTTP/1.1 200 OK
Content-Type: application/json
Content-Language: en
Content-Length: 31
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1999
X-RateLimit-Reset: 897065

{
   "version": "v2.20170215"
}
An example of a failure due to a missing Authorization header
GET /v2/kernel/create HTTP/1.1
Host: your.sorna.api.endpoint
Content-Type: application/json
X-BackendAI-Date: 20160930T01:23:45Z
X-BackendAI-Version: v2.20170215
HTTP/1.1 401 Unauthorized
Content-Type: application/problem+json
Content-Language: en
Content-Length: 139
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1998
X-RateLimit-Reset: 834821

{
   "type": "https://sorna.io/problems/unauthorized",
   "title": "Unauthorized access",
   "detail": "Authorization header is missing."
}

Rate Limiting

The API server imposes a rate limit to prevent clients from overloading the server. The limit is applied to the last N minutes at ANY moment (N is 15 minutes by default).

For public non-authorized APIs such as version checks, the server uses the client’s IP address as seen by the server to impose rate limits. Due to this, please keep in mind that large-scale NAT-based deployments may encounter the rate limits sooner than expected. For authorized APIs, it uses the access key in the Authorization header to impose rate limits. The rate limit counts both successful and failed requests.

Upon a valid request, the HTTP response contains the following header fields to help the clients flow-control their requests.

HTTP Headers

X-RateLimit-Limit

The maximum allowed number of requests during the rate-limit window.

X-RateLimit-Remaining

The number of further allowed requests left for the moment.

X-RateLimit-Window

The constant value representing the window size in seconds. (e.g., 900 means 15 minutes)

Changed in version v3.20170615: Deprecated X-RateLimit-Reset and the transitional X-Retry-After, as we have implemented a rolling counter that measures the API call counts of the last 15 minutes at any moment.

When the limit is exceeded, further API calls will get HTTP 429 “Too Many Requests”. If the client seems to be DDoS-ing, the server may block the client forever without prior notice.
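Clients can use these headers for self-throttling. A minimal sketch (the helper names and the reserve threshold are our own choices):

```python
def requests_left(headers: dict) -> int:
    """Read the remaining request budget from the response headers."""
    return int(headers.get("X-RateLimit-Remaining", "0"))

def should_throttle(headers: dict, reserve: int = 10) -> bool:
    """Back off while a small reserve remains, before HTTP 429 hits."""
    return requests_left(headers) <= reserve

# Headers as in the earlier example responses.
hdrs = {"X-RateLimit-Limit": "2000",
        "X-RateLimit-Remaining": "1999",
        "X-RateLimit-Window": "900"}
```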

Manager REST API

Backend.AI REST API is for running instant compute sessions at scale in clouds or on-premise clusters.

Session Management

Here are the API calls to create and manage compute sessions.

Creating Session
  • URI: /session (/session/create also works for legacy)

  • Method: POST

Creates a new session or returns an existing session, depending on the parameters.

Parameters

Parameter

Type

Description

image

str

The kernel runtime type in the form of the Docker image name and tag. For legacy, the API also recognizes the lang field when image is not present.

Changed in version v4.20190315.

clientSessionToken

slug

A client-provided session token, which must be unique among the currently non-terminated sessions owned by the requesting access key. Clients may reuse the token if the previous session with the same token has been terminated.

It may contain ASCII letters, digits, and hyphens (hyphens only in the middle). The length must be between 4 and 64 characters inclusive. It is useful for aliasing the session with a human-friendly name.

enqueueOnly

bool

(optional) If set true, the API returns immediately after queueing the session creation request to the scheduler. Otherwise, the manager will wait until the session actually starts. (default: false)

Added in version v4.20190615.

maxWaitSeconds

int

(optional) Set the maximum duration to wait until the session starts after queued, in seconds. If zero, the manager will wait indefinitely. (default: 0)

Added in version v4.20190615.

reuseIfExists

bool

(optional) If set true, the API returns without creating a new session if a session with the same ID and the same image already exists and is not terminated. In this case, the config options are ignored. If set false but a session with the same ID and image exists, the manager returns the error “session already exists”. (default: true)

Added in version v4.20190615.

group

str

(optional) The name of a user group (aka “project”) to launch the session within. (default: "default")

Added in version v4.20190615.

domain

str

(optional) The name of a domain to launch the session within. (default: "default")

Added in version v4.20190615.

config

object

(optional) A Creation Config Object to specify kernel configuration including resource requirements. If not given, the kernel is created with the minimum required resource slots defined by the target image.

tag

str

(optional) A per-session, user-provided tag for administrators to keep track of additional information of each session, such as which sessions are from which users.

Example:

{
  "image": "python:3.6-ubuntu18.04",
  "clientSessionToken": "mysession-01",
  "enqueueOnly": false,
  "maxWaitSeconds": 0,
  "reuseIfExists": true,
  "domain": "default",
  "group": "default",
  "config": {
    "clusterSize": 1,
    "environ": {
      "MYCONFIG": "XXX",
    },
    "mounts": ["mydata", "mypkgs"],
    "resources": {
      "cpu": "2",
      "mem": "4g",
      "cuda.devices": "1",
    }
  },
  "tag": "example-tag"
}
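A client can check the clientSessionToken constraints described above before calling the API. This validator is our reading of the stated rules (4 to 64 characters, ASCII letters and digits, hyphens only in the middle), not Backend.AI's own code:

```python
import re

def valid_session_token(token: str) -> bool:
    """Validate a client-provided session token: 4 to 64 characters,
    starting and ending with an ASCII letter or digit, with hyphens
    allowed only in the middle."""
    return re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9-]{2,62}[A-Za-z0-9]",
                        token) is not None
```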
Response

HTTP Status Code

Description

200 OK

The session is already running and you are okay to reuse it.

201 Created

The session is successfully created.

400 Bad Request

There are invalid or malformed values in the API parameters.

406 Not acceptable

The requested resource limits exceed the server’s own limits.

Field

Type

sessId

slug

The session ID used for later API calls, which is the same as the value of clientSessionToken. This will be randomly generated by the server if clientSessionToken is not provided.

status

str

The status of the created kernel. This is always "PENDING" if enqueueOnly is set true. In other cases, it may be either "RUNNING" (normal case), "ERROR", or even "TERMINATED" depending on what happens during session startup.

버전 v4.20190615에 추가.

servicePorts

list[object]

The list of Service Port Object. This field becomes an empty list if enqueueOnly is set true, because the final service ports are determined when the session becomes ready after scheduling.

참고

In most cases the service ports are the same as those specified in the image metadata, but the agent may add services shared by all sessions.

버전 v4.20190615에서 변경.

created

bool

True if the session is freshly created.

예시:

{
  "sessId": "mysession-01",
  "status": "RUNNING",
  "servicePorts": [
    {"name": "jupyter", "protocol": "http"},
    {"name": "tensorboard", "protocol": "http"}
  ],
  "created": true
}
Getting Session Information
  • URI: /session/:id

  • 메소드 : GET

Retrieves information about a session. For performance reasons, the returned information may not be real-time; it is usually updated every few seconds on the server side.

매개변수들

매개변수

타입

설명

:id

slug

The session ID.

응답

HTTP Status Code

설명

200 OK

The information is successfully returned.

404 Not Found

There is no such session.

Key

타입

설명

lang

str

The kernel’s programming language

age

int (msec)

The time elapsed since the kernel has started.

memoryLimit

int (KiB)

The memory limit of the kernel in KiB.

numQueriesExecuted

int

The number of times the kernel has been accessed.

cpuCreditUsed

int (msec)

The total time the kernel was running.

Destroying Session
  • URI: /session/:id

  • Method: DELETE

Terminates a session.

매개변수들

매개변수

타입

설명

:id

slug

The session ID.

응답

HTTP Status Code

설명

204 No Content

The session is successfully destroyed.

404 Not Found

There is no such session.

Key

타입

설명

stats

object

The Container Stats Object of the kernel when deleted.

Restarting Session
  • URI: /session/:id

  • Method: PATCH

Restarts a session. The idle time of the session will be reset, but other properties such as the age and CPU credit will continue to accumulate. All global states such as global variables and module imports are also reset.

매개변수들

매개변수

타입

설명

:id

slug

The session ID.

응답

HTTP Status Code

설명

204 No Content

The session is successfully restarted.

404 Not Found

There is no such session.

코드 실행 (쿼리 모드)

스니펫 실행
  • URI: /session/:id

  • Method: POST

Executes a snippet of user code in the specified session. Each execution request to the same session may have side-effects on subsequent executions. For instance, setting a global variable in one request and reading it in another request is completely legal. It is the job of the user (or the front-end) to guarantee the correct execution order of multiple interdependent requests. When the session is terminated or restarted, all such volatile states vanish.

매개변수들

매개변수

타입

설명

:id

slug

The session ID.

mode

str

상수 문자열 "query".

code

str

A string of user-written code. All non-ASCII data must be encoded in UTF-8 or any format acceptable by the session.

runId

str

A client-side unique identifier string for this particular run. For details about the concept of a run, see :ref:`code-execution-model`. If not given, the API server will assign a random one in its first response, and the client must use it for all subsequent requests of the same run.

예시:

{
  "mode": "query",
  "code": "print('Hello, world!')",
  "runId": "5facbf2f2697c1b7"
}
응답

HTTP Status Code

설명

200 OK

The session has responded with the execution result. The response body contains a JSON object as described below.

필드

타입

result

object

Execution Result Object.

참고

Even when the user code raises exceptions, such queries are treated as successful executions. In other words, a failure of this API means that our API subsystem has an error, not the user code.

경고

If the user code tries to breach the system, crashes (e.g., a segmentation fault), or runs too long (timeout), the session is automatically terminated. In such cases, you will get an incomplete console log with the "finished" status earlier than expected. Depending on the situation, result.stderr may also contain detailed error information.

Here we show several example responses when various Python codes are executed.

예시: 간단한 반환.

print("Hello, world!")
{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "finished",
    "console": [
      ["stdout", "Hello, world!\n"]
    ],
    "options": null
  }
}

예시: 런타임 에러.

a = 123
print('what happens now?')
a = a / 0
{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "finished",
    "console": [
      ["stdout", "what happens now?\n"],
      ["stderr", "Traceback (most recent call last):\n  File \"<input>\", line 3, in <module>\nZeroDivisionError: division by zero"],
    ],
    "options": null
  }
}

예시: 멀티미디어 결과

미디어 출력물은 실행 순서에 따라 다른 콘솔 출력물과도 혼합됩니다.

import matplotlib.pyplot as plt
a = [1,2]
b = [3,4]
print('plotting simple line graph')
plt.plot(a, b)
plt.show()
print('done')
{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "finished",
    "console": [
      ["stdout", "plotting simple line graph\n"],
      ["media", ["image/svg+xml", "<?xml version=\"1.0\" ..."]],
      ["stdout", "done\n"]
    ],
    "options": null
  }
}

예시: 지속적 결과

import time
for i in range(5):
    print(f"Tick {i+1}")
    time.sleep(1)
print("done")
{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "continued",
    "console": [
      ["stdout", "Tick 1\nTick 2\n"]
    ],
    "options": null
  }
}

You should make another API query with an empty code field to continue receiving the output.

{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "continued",
    "console": [
      ["stdout", "Tick 3\nTick 4\n"]
    ],
    "options": null
  }
}

And again:

{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "finished",
    "console": [
      ["stdout", "Tick 5\ndone\n"],
    ],
    "options": null
  }
}

예시: 사용자 입력

print("What is your name?")
name = input(">> ")
print(f"Hello, {name}!")
{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "waiting-input",
    "console": [
      ["stdout", "What is your name?\n>> "]
    ],
    "options": {
      "is_password": false
    }
  }
}

You should make another API query with the code field filled with the user's input.

{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "finished",
    "console": [
      ["stdout", "Hello, Lablup!\n"]
    ],
    "options": null
  }
}
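The continuation and user-input examples above follow a simple client-side loop: keep POSTing with the same runId, sending an empty code field for "continued" and the user's input for "waiting-input", until the status becomes "finished". A minimal sketch of that loop follows; `post_execute` is an assumed stand-in for an authenticated POST to /session/:id (not an SDK function).

```python
# Sketch of the client-side query-mode loop described above.
# post_execute(payload) -> the "result" object of the JSON response.
# get_input() -> a user-provided input string for "waiting-input".

def run_to_completion(post_execute, run_id, code, get_input=None):
    """Drive one query-mode run until it reaches the "finished" status."""
    console = []
    payload = {"mode": "query", "code": code, "runId": run_id}
    while True:
        result = post_execute(payload)
        console.extend(result.get("console") or [])
        status = result["status"]
        if status == "finished":
            return console
        if status == "waiting-input":
            # Send the user's input back in the `code` field.
            payload = {"mode": "query", "code": get_input(), "runId": run_id}
        else:
            # "continued": poll again with an empty code field.
            payload = {"mode": "query", "code": "", "runId": run_id}
```

With a real transport, the same runId must be reused across all iterations, matching the examples above.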
자동-완성
  • URI: /session/:id/complete

  • Method: POST

매개변수들

매개변수

타입

설명

:id

slug

The session ID.

code

str

A string containing the code until the current cursor position.

options.post

str

A string containing the code after the current cursor position.

options.line

str

현재 라인의 내용을 포함하는 문자열

options.row

int

커서의 라인 번호(0-based)를 나타내는 정수

options.col

int

커서의 현재 라인에 있는 열 번호(0-based)를 나타내는 정수

예시:

{
  "code": "pri",
  "options": {
    "post": "\nprint(\"world\")\n",
    "line": "pri",
    "row": 0,
    "col": 3
  }
}
응답

HTTP Status Code

설명

200 OK

The session has responded with the execution result. The response body contains a JSON object as described below.

필드

타입

result

list[str]

An ordered list containing the possible auto-completion matches as strings. This may be empty if the current session does not implement auto-completion or no matches have been found.

Selecting a match and merging it into the code text is up to the front-end implementation.

예시:

{
  "result": [
    "print",
    "printf"
  ]
}
인터럽트
  • URI: /session/:id/interrupt

  • Method: POST

매개변수들

매개변수

타입

설명

:id

slug

The session ID.

응답

HTTP Status Code

설명

204 No Content

Sent the interrupt signal to the session. Note that this does not guarantee the effectiveness of the interruption.

코드 실행 (배치 모드)

Some sessions provide the batch mode, which offers an explicit build step required for multi-module programs or compiled programming languages. In this mode, you first upload files prior to execution.

파일 업로드
  • URI: /session/:id/upload

  • Method: POST

매개변수들

Upload files to the session. You may upload multiple files at once using multi-part form-data encoding in the request body (RFC 1867/2388). The uploaded files are placed under the /home/work directory (which is the home directory for all sessions by default), and existing files are always overwritten. If the filename has a directory part, non-existing directories will be auto-created. The path may be either absolute or relative, but only sub-directories under /home/work are allowed to be created.

힌트

This API is for uploading frequently-changing source files prior to batch-mode execution. All files uploaded via this API are deleted when the session terminates. Use virtual folders to store and access larger, persistent, static data and library files for your codes.

경고

You cannot upload files directly into the mounted virtual folders using this API. However, your build script or main program may copy/move the uploaded files into the virtual folders for later uses.

이 API에는 몇 가지 제한이 있습니다.

각 파일의 최대 사이즈

1 MiB

업로드 요청당 파일 수

20

응답

HTTP Status Code

설명

204 No Content

성공.

400 Bad Request

업로드된 파일 중 하나가 크기 제한을 초과하거나 파일이 너무 많을 때 반환됩니다.

빌드 단계를 사용하여 실행한다.
  • URI: /session/:id

  • Method: POST

매개변수들

매개변수

타입

설명

:id

slug

The session ID.

mode

enum[str]

상수 문자열 "batch".

code

str

Must be an empty string ``""``.

runId

str

A client-side unique identifier string for this particular run. For details about the concept of a run, see :ref:`code-execution-model`. If not given, the API server will assign a random one in its first response, and the client must use it for all subsequent requests of the same run.

options

object

Batch Execution Query Object.

예시:

{
  "mode": "batch",
  "options": "{batch-execution-query-object}",
  "runId": "af9185c5fb0eacb2"
}
응답

HTTP Status Code

설명

200 OK

The session has responded with the execution result. The response body contains a JSON object as described below.

필드

타입

result

object

Execution Result Object.

파일 목록

Once files are uploaded to the session or generated during the execution of the code, you may need to identify what files actually exist in the current session. In this case, use this API to get the list of files in your compute session.

  • URI: /session/:id/files

  • 메소드 : GET

매개변수들

매개변수

타입

설명

:id

slug

The session ID.

path

str

세션 내부 경로 (default: /home/work).

응답

HTTP Status Code

설명

200 OK

성공.

404 Not Found

해당 경로가 존재하지 않습니다.

필드

타입

files

str

파일 목록을 포함하는 문자열화된 json

folder_path

str

세션 내부의 절대 경로

errors

str

지정된 경로를 스캔하는 동안 생성된 오류

파일 다운로드

Download files from the compute session.

The response content is a multipart stream of tarfile binaries. Post-processing, such as unpacking and saving, must be handled by the client.

  • URI: /session/:id/download

  • 메소드 : GET

매개변수들

매개변수

타입

설명

:id

slug

The session ID.

files

list[str]

File paths inside the session container to download. (maximum 5 files at once)

응답

HTTP Status Code

설명

200 OK

성공.

코드 실행(스트리밍)

The streaming mode provides a lightweight and interactive method to connect with the session containers.

코드 실행
  • URI: /stream/session/:id/execute

  • Method: GET Websocket으로 업그레이드

This is the real-time streaming version of :doc:`exec-batch` and :doc:`exec-query`, which otherwise use long polling over HTTP.

(Under construction)

버전 v4.20181215에 추가.

터미널 에뮬레이션
  • URI: /stream/session/:id/pty?app=:service

  • Method: GET Websocket으로 업그레이드

This endpoint provides a duplex continuous stream of JSON objects via native WebSockets. Although WebSockets support binary streams, we currently rely on TEXT messages carrying only JSON payloads to avoid quirks in the typed array support of JavaScript across different browsers.

서비스 이름은 :ref:`the session creation API <create-session-api>`가 반환한 :ref:`service port objects <service-port-object>`의 리스트에서 가져와야 합니다.

참고

We do *not* provide a legacy WebSocket emulation interface such as socket.io or SockJS. You should set up your own proxy if you want to support legacy browser users.

버전 v4.20181215에서 변경: service 쿼리 매개변수가 추가되었습니다.

매개변수들

매개변수

타입

설명

:id

slug

The session ID.

:service

slug

연결할 서비스 명

클라이언트-대-서버 프로토콜

엔드포인트에서 다음 네 가지 유형의 입력 메시지를 수락합니다.

표준 입력 스트림

모든 ASCII (및 UTF-8) 입력은 반드시 base64 문자열로 인코딩되어야 합니다.문자에는 제어 문자도 포함될 수 있습니다.

{
  "type": "stdin",
  "chars": "<base64-encoded-raw-characters>"
}
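For instance, a client could build such a message from raw keystrokes as follows. This is an illustrative sketch (not an SDK helper); the message layout matches the example above.

```python
import base64
import json

# Sketch: wrap raw terminal input (including control characters) as a
# pty "stdin" message for the WebSocket endpoint described above.

def make_stdin_message(chars: str) -> str:
    encoded = base64.b64encode(chars.encode("utf-8")).decode("ascii")
    return json.dumps({"type": "stdin", "chars": encoded})
```

For example, `make_stdin_message("ls\n")` produces a TEXT message whose `chars` field decodes back to the original bytes.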
터미널 크기 조정

Sets the terminal size to the given number of rows and columns. You should calculate them by yourself.

For instance, in web browsers you may compute them by measuring the width and height of a temporarily created invisible HTML element that contains a single ASCII character with the same (monospace) font style as the terminal container element.

{
  "type": "resize",
  "rows": 25,
  "cols": 80
}

핑

Use this to keep the session alive (preventing it from being auto-terminated by idle timeouts) by sending pings periodically while the user-side browser is open.

{
  "type": "ping",
}
재시작

Use this to restart the session without affecting the working directory and usage counts. Useful when your foreground terminal program does not respond for whatever reasons.

{
  "type": "restart",
}
서버-대-클라이언트 프로토콜
표준 결과/에러 스트림

Since the terminal is an output device, all stdout/stderr outputs are merged into a single stream as we see in real terminals. This means there is no way to distinguish stdout and stderr on the client side, unless your session applies some special formatting to distinguish them (e.g., make all stderr outputs red).

The terminal output is compatible with xterm (including 256-color support).

{
  "type": "out",
  "data": "<base64-encoded-raw-characters>"
}
Server-side errors
{
  "type": "error",
  "data": "<human-readable-message>"
}

이벤트 모니터링

Session Lifecycle Events
  • URI: /events/session

  • 메소드 : GET

Provides a continuous message-by-message JSON object stream of session lifecycles. It uses HTML5 Server-Sent Events (SSE). Browser-based clients may use the EventSource API for convenience.

버전 v4.20190615에 추가: 이 버전에서 처음으로 제대로 구현되었으며, 실제로 구현되지 않았던 이전 인터페이스는 폐지되었습니다.

버전 v5.20191215에서 변경: The URI is changed from /stream/session/_/events to /events/session.

매개변수들

매개변수

타입

설명

sessionId

slug

The session ID to monitor the lifecycle events. If set "*", the API will stream events from all sessions visible to the client depending on the client’s role and permissions.

ownerAccessKey

str

(선택사항) 다른 액세스 키(사용자)가 다른 세션 인스턴스에 대해 동일한 세션 아이디를 공유할 수 있기 때문에 지정된 세션 소유자의 접근 키 입니다. 도메인 소유자나 슈퍼관리자만이 이를 지정할 수 있습니다.

group

str

The group name to filter the lifecycle events. If set "*", the API will stream events from all sessions visible to the client depending on the client’s role and permissions.

응답

응답은 text/event-stream 형식을 따르는 연속적인 여러 줄의 UTF-8 텍스트 스트림입니다. 각 이벤트는 이벤트 타입과 데이터로 구성되어 있으며, 데이터는 JSON 으로 인코딩 됩니다.

가능한 이벤트 명(추후 더 많은 이벤트가 추가될 수 있습니다)

이벤트 명

설명

session_preparing

세션이 방금 작업 큐에 스케쥴링 되었고, 에이전트로부터 자원 할당을 받았습니다.

session_pulling

The session begins pulling the session image (usually from a Docker registry) to the scheduled agent.

session_creating

세션이 컨테이너 (또는 다른 에이전트 백엔드의 다른 객체로) 생성되는 중입니다.

session_started

세션에서 코드를 실행할 준비를 마쳤습니다.

session_terminated

세션이 종료되었습니다.

이벤트소스 API를 사용할 경우, 다음을 따르는 이벤트 리스너를 추가해주십시오 :

const sse = new EventSource('/events/session', {
  withCredentials: true,
});
sse.addEventListener('session_started', (e) => {
  console.log('session_started', JSON.parse(e.data));
});

참고

The EventSource API requires the session-based authentication mode using browser cookies (when the console server is the endpoint). Otherwise, you need to manually implement an event-stream parser using the standard fetch API against the manager server.

이벤트 데이터는 다음과 같은 JSON 문자열로 구성됩니다 (추후 더 많은 필드가 추가될 수 있음) :

필드 명

설명

sessionId

소스 세션 아이디

ownerAccessKey

세션을 소유하고 있는 액세스 키

reason

이 이벤트가 왜 발생했는지에 대한 짧은 문자열입니다. 이것은 ``null``이거나 빈 문자열일 수 있습니다.

result

Only present for session-terminated events. Only meaningful for batch-type sessions. Either one of: "UNDEFINED", "SUCCESS", "FAILURE"

{
  "sessionId": "mysession-01",
  "ownerAccessKey": "MYACCESSKEY",
  "reason": "self-terminated",
  "result": "SUCCESS"
}
Background Task Progress Events
  • URI: /events/background-task

  • Method: GET for server-side events

버전 v5.20191215에 추가.

매개변수들

매개변수

타입

설명

taskId

UUID

The background task ID to monitor the progress and completion.

응답

The response is a continuous stream of UTF-8 text lines following text/event-stream format. Each event is composed of the event type and data, where the data part is encoded as JSON. Possible event names (more events may be added in the future):

이벤트 명

설명

task_updated

Updates for the progress. This can be generated many times during the background task execution.

task_done

The background task is successfully completed.

task_failed

The background task has failed. Check the message field and/or query the error logs API for error details.

task_cancelled

The background task is cancelled in the middle. Usually this means that the server is being shutdown for maintenance.

server_close

This event indicates explicit server-initiated close of the event monitoring connection, which is raised just after the background task is either done/failed/cancelled. The client should not reconnect because there is nothing more to monitor about the given task.

The event data (per-line JSON objects) include the following fields:

필드 명

타입

설명

task_id

str

The background task ID.

current_progress

int

The current progress value. Only meaningful for task_updated events. If total_progress is zero, this value should be ignored.

total_progress

int

The total progress count. Only meaningful for task_updated events. The scale may be an arbitrary positive integer. If the total count is not defined, this may be zero.

message

str

An optional human-readable message indicating what the task is doing. It may be null. For example, it may contain the name of agent or scaling group being worked on for image preload/unload APIs.

Check out the session lifecycle events API for example client-side Javascript implementations to handle text/event-stream responses.

If you make the request for tasks that have already finished, it may return either “404 Not Found” (the result is expired or the task ID is invalid) or a single event which is one of task_done, task_failed, or task_cancelled, followed by immediate disconnection of the response. Currently, the results for finished tasks may be archived up to one day (24 hours).

Service Ports (aka Service Proxies)

The service ports API provides WebSocket-based authenticated and encrypted tunnels to network-facing services (“container services”) provided by the kernel container. The main advantage of this feature is that all application-specific network traffic is wrapped as a standard WebSocket API (no need to open extra ports on the manager). It also hides the container from the client and the client from the container, offering an extra level of security.

_images/service-ports.svg

The diagram showing how tunneling of TCP connections via WebSockets works.

As Figure 7 shows, all TCP traffic to a container service can be sent over a WebSocket connection to the following API endpoints. A single WebSocket connection corresponds to a single TCP connection to the service, and there may be multiple concurrent WebSocket connections representing multiple TCP connections to the service. It is the client’s responsibility to accept arbitrary TCP connections from users (e.g., web browsers) with proper authorization for multi-user setups and to wrap those as WebSocket connections to the following APIs.

When the first connection is initiated, the Backend.AI Agent running the designated kernel container signals the kernel runner daemon in the container to start the designated service. It shortly waits for the in-container port opening and then delivers the first packet to the service. After initialization, all WebSocket payloads are delivered back and forth just like normal TCP packets. Note that the WebSocket message type must be BINARY.

The container service will see the packets from the manager and never knows the real origin of the packets unless the service-level protocol requires such client-side information to be stated explicitly. Likewise, the client never knows the container’s IP address (though the port numbers are included in service port objects returned by the session creation API).
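The per-connection relay performed by the client can be sketched as a pair of copy loops; below is the TCP-to-WebSocket direction with the I/O injected as plain callables. `read_tcp` and `send_ws` are hypothetical stand-ins for real socket/WebSocket handles, not library APIs.

```python
# Sketch of one direction of the client-side tunnel described above:
# payload chunks read from the local TCP connection are forwarded
# verbatim into the matching WebSocket connection.

def pump_tcp_to_ws(read_tcp, send_ws) -> None:
    """Copy TCP payload chunks into the WebSocket tunnel.

    Each chunk must be sent as a BINARY WebSocket message; an empty
    read means the TCP peer closed the connection, ending this tunnel.
    """
    while True:
        chunk = read_tcp()
        if not chunk:
            break            # one TCP connection <-> one WebSocket connection
        send_ws(chunk)
```

A symmetric loop copies WebSocket payloads back to the TCP socket; a real client would run both concurrently per connection.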

참고

Currently non-TCP (e.g., UDP) services are not supported.

Service Proxy (HTTP)
  • URI: /stream/kernel/:id/httpproxy?app=:service

  • Method: GET Websocket으로 업그레이드

The service proxy API allows clients to directly connect to service daemons running inside compute sessions, such as Jupyter and TensorBoard.

서비스 이름은 :ref:`the session creation API <create-session-api>`가 반환한 :ref:`service port objects <service-port-object>`의 리스트에서 가져와야 합니다.

버전 v4.20181215에 추가.

매개변수들

매개변수

타입

설명

:id

slug

커널 ID

:service

slug

연결할 서비스 명

Service Proxy (TCP)
  • URI: /stream/kernel/:id/tcpproxy?app=:service

  • Method: GET Websocket으로 업그레이드

This is the TCP version of service proxy, so that client users can connect to native services running inside compute sessions, such as SSH.

서비스 이름은 :ref:`the session creation API <create-session-api>`가 반환한 :ref:`service port objects <service-port-object>`의 리스트에서 가져와야 합니다.

버전 v4.20181215에 추가.

매개변수들

매개변수

타입

설명

:id

slug

커널 ID

:service

slug

연결할 서비스 명

리소스 프리셋

리소스 프리셋은 사전에 구성된 리소스 슬롯을 저장하고, 커널 생성 API를 호출하기 전에 주어진 프리셋의 할당 가능 여부를 동적으로 확인하는 간단한 저장소를 제공합니다.

리소스 프리셋을 추가/수정/삭제하려면 관리자 GraphQL API를 사용해야 합니다.

버전 v4.20190315에 추가.

리소스 프리셋 목록

관리자가 구성한 리소스 프리셋 목록을 반환합니다.

  • URI: /resource/presets

  • 메소드 : GET

매개변수

None.

응답

HTTP 상태 코드

설명

200 OK

프리셋 목록이 반환됩니다.

필드

타입

presets

list[object]

Resource Preset Object 목록

리소스 프리셋의 할당 가능 여부 확인

현재 keypair와 스케일링 그룹의 리소스 제한을 추가로 반환하고, 리소스 프리셋의 할당 가능 여부를 확인하여 각 프리셋 항목에 allocatable boolean 필드를 추가합니다.

  • URI: /resource/check-presets

  • Method: POST

매개변수

None.

응답

HTTP 상태 코드

설명

200 OK

프리셋 목록이 반환됩니다.

401 Unauthorized

인증되지 않은 클라이언트입니다.

필드

타입

keypair_limits

Resource Slot Object

현재 액세스 키에 대해 허용되는 총 리소스 슬롯의 최대량. “Infinity” 문자열로 무한대를 표현할 수 있습니다.

keypair_using

Resource Slot Object

The amount of total resource slots used by the current access key.

keypair_remaining

Resource Slot Object

현재 액세스 키에 대해 남아있는 총 리소스 슬롯 양. “Infinity” 문자열로 무한대를 표현할 수 있습니다.

scaling_group_remaining

Resource Slot Object

현재 스케일링 그룹에 대해 남아있는 총 리소스 슬롯 양. 서버가 자동 스케일링을 위해 구성되어 있으면 “Infinity” 문자열로 무한대를 표현할 수 있습니다.

presets

list[object]

A list of :ref:`resource-preset-object`, but with an additional boolean field allocatable, which indicates whether the given resource slots are actually allocatable considering the keypair's resource limits and the scaling group's current usage.
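As an illustration of how the allocatable flag may be interpreted (a simplified sketch, not the server's actual scheduling logic): a preset fits when each of its slots is within both the keypair's and the scaling group's remaining slots, where the string "Infinity" denotes an unlimited slot.

```python
# Illustrative sketch of the allocatable check. Slot maps use string
# amounts as in Resource Slot Objects; "Infinity" parses to float("inf"),
# so unlimited slots always fit.

def fits(preset_slots: dict, remaining: dict) -> bool:
    """True when every requested slot amount is within the remaining amount."""
    return all(
        float(amount) <= float(remaining.get(slot, "0"))
        for slot, amount in preset_slots.items()
    )

def is_allocatable(preset_slots, keypair_remaining, sgroup_remaining) -> bool:
    return fits(preset_slots, keypair_remaining) and fits(preset_slots, sgroup_remaining)
```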

가상 폴더

가상 폴더는 여러 다른 세션에서 공유되며, 지속적이고, 재사용 가능한 파일을 제공합니다.

새로운 세션을 생성할 때 가상 폴더를 마운트할 수 있으며, 로컬 파일 시스템의 일반 디렉토리처럼 사용할 수 있습니다. 물론, 가상 폴더의 내용에 대한 읽기/쓰기는 내부적으로 네트워크 파일 시스템을 사용하기 때문에, 주요 스크래치 디렉토리(대부분의 커널에서는 /home/work)에 비해 성능이 저하될 수 있습니다.

또한, 가상 폴더를 다른 사용자와 공유할 수 있으며, 적절한 권한을 부여하여 초대할 수 있습니다. 현재, 권한은 각각 세 가지 레벨로 구분됩니다: 읽기 전용, 읽기-쓰기, 읽기-쓰기-삭제. 이들은 각각 'ro' , 'rw', 'wd' 라는 짧은 문자열로 표현됩니다. 가상 폴더의 소유자는 폴더에 대해 읽기-쓰기-삭제 권한을 가집니다.

가상 폴더 목록

현재 키페어로 생성된 가상 폴더의 목록을 반환합니다.

  • URI: /folders

  • 메소드 : GET

매개변수

매개변수

타입

설명

all

bool

(선택 사항) 매개 변수가 True 인 경우, 현재 사용자가 속하지 않는 폴더를 포함한 모든 가상 폴더를 반환합니다. 최고 관리자만이 해당 매개변수(기본값: False)를 사용할 수 있습니다.

group_id

UUID | str

(선택 사항) 매개 변수가 설정되면, 지정된 그룹에 속한 가상 폴더를 반환합니다. 사용자 유형 가상 폴더에는 영향을 미치지 않습니다.

응답

HTTP 상태 코드

설명

200 OK

Success

필드

타입

(root)

list[object]

A list of Virtual Folder List Item Object

예시:

[
   {
      "name": "myfolder",
      "id": "b4b1b16c-d07f-4f1f-b60e-da9449aa60a6",
      "host": "local:volume1",
      "usage_mode": "general",
      "created_at": "2020-11-28 13:30:30.912056+00",
      "is_owner": "true",
      "permission": "rw",
      "user": "dfa9da54-4b28-432f-be29-c0d680c7a412",
      "group": null,
      "creator": "admin@lablup.com",
      "user_email": "admin@lablup.com",
      "group_name": null,
      "ownership_type": "user",
      "unmanaged_path": null,
      "cloneable": "false",
   }
]
가상 폴더 호스트 목록

새로운 가상 폴더를 만들 수 있는 현재 키페어의 호스트 이름 목록을 반환합니다.

버전 v4.20190315에 추가.

  • URI: /folders/_/hosts

  • 메소드 : GET

매개변수

None

응답

HTTP 상태 코드

설명

200 OK

Success

필드

타입

default

str

The default virtual folder host

allowed

list[str]

The list of available virtual folder hosts

예시:

{
  "default": "seoul:nfs1",
  "allowed": ["seoul:nfs1", "seoul:nfs2", "seoul:cephfs1"]
}
가상 폴더 생성
  • URI: /folders

  • Method: POST

현재 API 키와 연결된 가상 폴더를 생성합니다.

매개변수

매개변수

타입

설명

name

str

The human-readable name of the virtual folder

host

str

(optional) The name of the virtual folder host

usage_mode

str

(선택 사항) 가상 폴더의 목적. 가능한 값은 general, model, data (기본값: general) 입니다.

permission

str

(선택 사항) 가상 폴더의 기본 공유 권한. 가상 폴더의 소유자는 해당 매개 변수와 관계없이 항상 wd 권한을 가집니다. 가능한 값은 ro, rw, wd (기본값: rw) 입니다.

group_id

UUID | str

(선택 사항) 해당 매개 변수가 설정되면 그룹 유형의 가상 폴더가 생성됩니다. 만약 비어 있다면, 사용자 유형의 가상 폴더가 생성됩니다.

quota

int

(선택 사항) 가상 폴더의 쿼터를 바이트 단위로 설정합니다. 하지만 쿼터는 xfs 파일 시스템에서만 지원됩니다. 디렉토리 단위 쿼터를 지원하지 않는 다른 파일 시스템에서는 해당 매개 변수가 무시됩니다.

예시:

{
  "name": "My Data",
  "host": "seoul:nfs1"
}
응답

HTTP 상태 코드

설명

201 Created

커널이 성공적으로 생성되었습니다.

400 Bad Request

잘못되었거나 이미 존재하는 가상 폴더와 중복되는 이름입니다.

406 Not acceptable

가상 폴더의 개수 제한을 초과했습니다. (예: 가질 수 있는 최대 폴더 개수)

필드

타입

id

slug

The unique folder ID used for later API calls

name

str

The human-readable name of the created virtual folder

host

str

The name of the virtual folder host where the new folder is created

예시:

{
  "id": "aef1691db3354020986d6498340df13c",
  "name": "My Data",
  "host": "nfs1",
  "usage_mode": "general",
  "permission": "rw",
  "creator": "admin@lablup.com",
  "ownership_type": "user",
  "user": "dfa9da54-4b28-432f-be29-c0d680c7a412",
  "group": "",
}
가상 폴더 정보 얻기
  • URI: /folders/:name

  • 메소드 : GET

가상 폴더의 정보를 검색합니다. 성능상의 이유로, 반환된 정보는 실시간이 아닐 수 있습니다; 보통 서버측에서 몇 초마다 업데이트됩니다.

매개변수

매개변수

타입

설명

name

str

The human-readable name of the virtual folder

응답

HTTP 상태 코드

설명

200 OK

정보가 성공적으로 반환되었습니다.

404 Not Found

폴더가 없거나 폴더에 접근할 적절한 권한을 가지고 있지 않습니다.

필드

타입

(root)

object

Virtual Folder Item Object

가상 폴더 지우기
  • URI: /folders/:name

  • Method: DELETE

주어진 가상 폴더의 모든 내용을 즉시 삭제하고 미래의 마운트에 폴더를 사용할 수 없게 합니다.

위험

실행 중인 커널 중 삭제된 가상 폴더를 마운트한 커널이 있다면, 해당 커널은 오작동할 수 있습니다!

경고

이 API가 호출되면 내용을 되돌릴 수 있는 방법이 없습니다.

매개변수

매개변수

설명

name

The human-readable name of the virtual folder

응답

HTTP 상태 코드

설명

204 No Content

폴더가 성공적으로 삭제되었습니다.

404 Not Found

폴더가 없거나 폴더를 삭제할 적절한 권한을 가지고 있지 않습니다.

가상 폴더 이름 바꾸기
  • URI: /folders/:name/rename

  • Method: POST

현재 API 키와 연결된 가상 폴더의 이름을 바꿉니다.

매개변수

매개변수

타입

설명

:name

str

The human-readable name of the virtual folder

new_name

str

New virtual folder name

응답

HTTP 상태 코드

설명

201 Created

The folder is successfully renamed.

404 Not Found

폴더를 찾을 수 없거나 폴더 이름을 바꿀 적절한 권한을 가지고 있지 않습니다.

가상 폴더 내 파일 목록

현재 키페어와 연결된 가상 폴더 내 파일 목록을 반환합니다.

  • URI: /folders/:name/files

  • 메소드 : GET

매개변수

매개변수

타입

설명

:name

str

The human-readable name of the virtual folder

path

str

Path inside the virtual folder (default: root)

응답

HTTP 상태 코드

설명

200 OK

Success.

404 Not Found

폴더에 접근할 수 없거나 경로를 찾을 수 없습니다.

필드

타입

files

list[object]

Virtual Folder File Object 목록

가상 폴더에 파일 업로드

현재 키페어와 연결된 가상 폴더에 로컬 파일을 업로드합니다. Manager는 내부적으로 Backend.AI Storage-Proxy 서비스에 업로드를 위임합니다. JSON 웹 토큰은 요청 인증에 사용됩니다.

  • URI: /folders/:name/request-upload

  • Method: POST

경고

가상 폴더 내 동일한 이름의 파일이 이미 존재하면 경고 없이 덮어써집니다.

매개변수

매개변수

타입

설명

:name

str

The human-readable name of the virtual folder

path

str

Path of the local file to upload

size

int

The total size of the local file to upload

응답

HTTP 상태 코드

설명

200 OK

Success.

필드

타입

token

str

Storage-Proxy 서비스에 업로드 세션 인증을 위한 JSON 웹 토큰

url

str

Storage-Proxy에 대한 요청 URL. 클라이언트는 이 URL을 사용하여 파일을 업로드해야 합니다.

가상 폴더 내에 새 디렉토리 생성하기

현재 키페어에 연결된 가상 폴더에 새 디렉토리를 생성합니다. 이 API는 존재하지 않는 상위 디렉토리를 재귀적으로 생성합니다.

  • URI: /folders/:name/mkdir

  • Method: POST

경고

가상 폴더 내 동일한 이름의 디렉토리가 이미 존재하면 경고 없이 덮어써질 수 있습니다.

매개변수

매개변수

타입

설명

:name

str

The human-readable name of the virtual folder.

path

str

The relative path of a new folder to create inside the virtual folder

parents

bool

True 일 경우, 지정한 상위 디렉토리가 존재하지 않으면 같이 생성합니다.

exist_ok

bool

True일 경우, 동일한 이름의 디렉토리가 이미 존재해도 오류가 발생하지 않습니다.

응답

HTTP 상태 코드

설명

201 Created

Success.

400 Bad Request

디렉토리가 아닌 동일한 이름의 파일이 이미 존재합니다.

404 Not Found

파일이 없거나 해당 폴더에 쓰기 권한이 없습니다.

가상 폴더에 있는 파일 또는 디렉토리 다운로드하기

현재 keypair에서 연결된 가상 폴더에 있는 파일 또는 디렉토리를 다운로드합니다. 내부적으로, Manager는 Backend.AI Storage-Proxy 서비스에 다운로드를 위임합니다. JSON 웹 토큰은 요청의 인증에 사용됩니다.

버전 v4.20190315에 추가.

  • URI: /folders/:name/request-download

  • Method: POST

매개변수

매개변수

타입

설명

:name

str

The human-readable name of the virtual folder

path

str

다운로드를 위해 가상 폴더 내 파일 또는 디렉토리의 경로

archive

bool

해당 매개변수가 True이고 path가 디렉토리일 경우, 디렉토리는 zip 파일로 압축됩니다 (기본값: False).

응답

HTTP 상태 코드

설명

200 OK

Success.

404 Not Found

파일을 찾을 수 없거나 해당 폴더에 접근할 수 있는 권한이 없습니다.

필드

타입

token

str

JSON 웹 토큰은 Storage-Proxy 서비스의 다운로드 세션 인증에 사용됩니다.

url

str

Storage-Proxy에 대한 요청 URL입니다. 클라이언트는 이 URL을 사용하여 파일을 다운로드해야 합니다.

가상 폴더 내 파일 삭제하기

가상 폴더 내 파일을 삭제합니다.

경고

한 번 API가 호출되면 파일을 되돌릴 수 없습니다.

  • URI: /folders/:name/delete-files

  • Method: DELETE

매개변수

매개변수

타입

설명

:name

str

The human-readable name of the virtual folder

files

list[str]

File paths inside the virtual folder to delete

recursive

bool

설정이 True로 되어 있으면 폴더를 재귀적으로 삭제합니다. 기본값은 False입니다.

응답

HTTP 상태 코드

설명

200 OK

Success.

400 Bad Request

recursive 옵션을 True로 설정하지 않고 폴더를 삭제하려고 했습니다.

404 Not Found

해당 폴더가 없거나 폴더 내 파일을 삭제할 수 있는 권한이 없습니다.

가상 폴더 내 파일 이름을 변경하기

Rename a file inside a virtual folder.

  • URI: /folders/:name/rename-file

  • Method: POST

매개변수

매개변수

타입

설명

:name

str

The human-readable name of the virtual folder

target_path

str

The relative path of target file or directory

new_name

str

The new name of the file or directory

is_dir

bool

Flag that indicates the target_path is a directory or not

응답

HTTP 상태 코드

설명

200 OK

Success.

400 Bad Request

is_dir 옵션을 True로 설정하지 않고 디렉토리 이름을 변경하려고 했습니다.

404 Not Found

해당 폴더가 없거나 폴더 내 파일을 이름을 변경할 수 있는 권한이 없습니다.

가상 폴더 내 초대 목록 보기

요청한 사용자가 받은 보류 중인 초대 목록을 반환합니다. 이 목록에는 다른 사용자가 보낸 초대가 표시됩니다.

  • URI : /folders/invitations/list

  • 메소드 : GET

매개변수

이 API는 파라미터가 필요하지 않습니다.

응답

HTTP 상태 코드

설명

200 OK

Success.

필드

타입

invitations

list[object]

A list of Virtual Folder Invitation Object

초대 생성하기

적절한 권한으로 가상 폴더를 공유할 다른 사용자를 초대합니다. 해당 사용자가 이미 초대되어 있다면, 이 API는 새로운 초대를 생성하거나 기존 초대의 권한을 업데이트하지 않습니다.

  • URI : /folders/:name/invite

  • Method: POST

매개변수

매개변수

타입

설명

:name

str

The human-readable name of the virtual folder

perm

str

The permission to grant to invitee

emails

list[slug]

A list of user emails to invite

응답

HTTP 상태 코드

설명

200 OK

Success.

400 Bad Request

초대받을 사용자가 주어지지 않았습니다.

404 Not Found

초대가 없습니다.

필드

타입

invited_ids

list[slug]

A list of invited user emails

초대 수락하기

초대를 수락하고 초대에 따른 가상 폴더의 권한을 부여받습니다.

  • URI : /folders/invitations/accept

  • Method: POST

매개변수

매개변수

타입

설명

inv_id

slug

The unique invitation ID

응답

HTTP 상태 코드

설명

200 OK

Success.

400 Bad Request

대상 가상 폴더 이름이 이미 존재하는 가상 폴더 이름과 중복됩니다.

404 Not Found

존재하지 않는 초대입니다.

초대 거절하기

초대를 거절합니다.

  • URI : /folders/invitations/delete

  • Method: DELETE

매개변수

매개변수

타입

설명

inv_id

slug

The unique invitation ID

응답

HTTP 상태 코드

설명

200 OK

Success.

404 Not Found

존재하지 않는 초대입니다.

필드

타입

msg

str

Detail message for the invitation deletion

보낸 초대 목록

요청한 사용자가 보낸 가상 폴더 초대 목록을 반환합니다. 이미 수락하거나 거절한 초대는 포함되지 않습니다.

  • URI : /folders/invitations/list-sent

  • 메소드 : GET

매개변수

이 API는 파라미터가 필요하지 않습니다.

응답

HTTP 상태 코드

설명

200 OK

Success.

필드

타입

invitations

list[object]

A list of Virtual Folder Invitation Object

초대 업데이트

이미 보냈지만 수락하거나 거절하지 않은 초대 권한을 업데이트합니다.

  • URI : /folders/invitations/update/:inv_id

  • Method: POST

매개변수

매개변수

타입

설명

:inv_id

str

The unique invitation ID

perm

str

The permission to grant to invitee

응답

HTTP 상태 코드

설명

200 OK

Success.

400 Bad Request

권한을 지정하지 않았습니다.

404 Not Found

초대가 없습니다.

필드

타입

msg

str

An update message string

공유받은 가상 폴더 나가기

공유받은 가상 폴더에서 나갑니다.

그룹 vfolder 또는 요청한 사용자가 소유한 vfolder는 나갈 수 없습니다.

  • URI : /folders/:name/leave

  • Method: POST

매개변수

매개변수

타입

설명

:name

str

The human-readable name of the virtual folder

응답

HTTP 상태 코드

설명

200 OK

Success.

404 Not Found

가상 폴더가 없습니다.

필드

타입

msg

str

A result message string

가상 폴더 공유 사용자 목록

요청자의 가상 폴더를 공유하는 사용자 목록을 반환합니다.

  • URI : /folders/_/shared

  • 메소드 : GET

매개변수

매개변수

타입

설명

vfolder_id

str

(선택 사항) 공유된 사용자를 나열할 가상 폴더의 고유 ID입니다. 지정하지 않으면 요청자가 생성한 모든 가상 폴더를 공유하는 사용자를 나열합니다.

응답

HTTP 상태 코드

설명

200 OK

Success.

필드

타입

shared

list[object]

A list of information about shared users.

예시:

[
   {
      "vfolder_id": "aef1691db3354020986d6498340df13c",
      "vfolder_name": "My Data",
      "shared_by": "admin@lablup.com",
      "shared-to": {
         "uuid": "dfa9da54-4b28-432f-be29-c0d680c7a412",
         "email": "user@lablup.com"
      },
      "perm": "ro"
   }
]
공유된 가상 폴더의 권한 업데이트

공유된 가상 폴더의 사용자 권한을 업데이트합니다.

  • URI : /folders/_/shared

  • Method: POST

매개변수

  • vfolder (UUID) – The unique virtual folder ID

  • user (UUID) – The unique user ID

  • perm (str) – The permission to update for the user on the vfolder

응답

  • 200 OK – Success.

  • 400 Bad Request – 권한 또는 사용자가 지정되지 않았습니다.

  • 404 Not Found – 가상 폴더가 없습니다.

필드

  • msg (str) – An update message string

개별 사용자에게 그룹 가상 폴더 공유

재정의된 권한으로 그룹 가상 폴더를 사용자에게 직접 공유합니다.

초대를 생성하지 않고 vfolder_permission 관계를 직접 생성합니다. 그룹 가상 폴더만 직접 공유할 수 있습니다.

이 API는 그룹 가상 폴더를 읽기 전용 권한으로 모든 그룹 멤버에게 공유하고, 일부 사용자에게는 읽기/쓰기 권한을 부여하고 싶을 때 유용합니다.

안내: 이 API는 그룹 가상 폴더에만 사용할 수 있습니다.

  • URI: /folders/:name/share

  • Method: POST

매개변수

  • :name (str) – The human-readable name of the virtual folder

  • permission (str) – Overriding permission to share the group virtual folder

  • emails (list[str]) – A list of user emails to share

응답

  • 201 Created – Success.

  • 400 Bad Request – 권한 혹은 이메일이 주어지지 않았습니다.

  • 404 Not Found – 가상 폴더가 없습니다.

필드

  • shared_emails (list[str]) – A list of user emails with which the virtual folder is successfully shared

사용자로부터 그룹 가상 폴더 공유 해제

Unshare a group virtual folder from users

안내: 이 API는 그룹 가상 폴더에만 사용할 수 있습니다.

  • URI: /folders/:name/unshare

  • Method: DELETE

매개변수

  • :name (str) – The human-readable name of the virtual folder

  • emails (list[str]) – A list of user emails to unshare

응답

  • 200 OK – Success.

  • 400 Bad Request – 이메일이 주어지지 않았습니다.

  • 404 Not Found – 가상 폴더가 없습니다.

필드

  • unshared_emails (list[str]) – A list of user emails from which the virtual folder is successfully unshared

가상 폴더 복제

Clone a virtual folder

  • URI: /folders/:name/clone

  • Method: POST

매개변수

  • :name (str) – The human-readable name of the virtual folder

  • cloneable (bool) – ``True``일 경우, 복제된 가상 폴더를 다시 복제할 수 있습니다.

  • target_name (str) – The name of the new virtual folder

  • target_host (str) – The target host volume of the new virtual folder

  • usage_mode (str) – (선택 사항) 새 가상 폴더의 목적. 허용되는 값은 general, model, data (기본값: general)입니다.

  • permission (str) – (선택 사항) 새 가상 폴더의 기본 공유 권한. 가상 폴더의 소유자는 해당 매개변수와 무관하게 항상 wd 권한을 가집니다. 허용되는 값은 ro, rw, wd 입니다.

응답

  • 200 OK – Success.

  • 400 Bad Request – 대상 이름, 대상 호스트 또는 권한이 주어지지 않았습니다.

  • 403 Forbidden – 소스 가상 폴더는 복제할 수 없습니다.

  • 404 Not Found – 가상 폴더가 없습니다.

필드

  • (root) (list[object]) – Virtual Folder List Item Object

예시:

{
   "name": "my cloned folder",
   "id": "b4b1b16c-d07f-4f1f-b60e-da9449aa60a6",
   "host": "local:volume1",
   "usage_mode": "general",
   "created_at": "2020-11-28 13:30:30.912056+00",
   "is_owner": "true",
   "permission": "rw",
   "user": "dfa9da54-4b28-432f-be29-c0d680c7a412",
   "group": null,
   "creator": "admin@lablup.com",
   "user_email": "admin@lablup.com",
   "group_name": null,
   "ownership_type": "user",
   "unmanaged_path": null,
   "cloneable": "false"
}

Code Execution Model

The core of the user API is the execute call which allows clients to execute user-provided codes in isolated compute sessions (aka kernels). Each session is managed by a kernel runtime, whose implementation is language-specific. A runtime is often a containerized daemon that interacts with the Backend.AI agent via our internal ZeroMQ protocol. In some cases, kernel runtimes may be just proxies to other code execution services instead of actual executor daemons.

Inside each compute session, a client may perform multiple runs. Each run is for executing different code snippets (the query mode) or different sets of source files (the batch mode). The client often has to call the execute API multiple times to finish a single run. It is completely legal to mix query-mode runs and batch-mode runs inside the same session, given that the kernel runtime supports both modes.

To distinguish different runs which may overlap, the client must provide the same run ID to all execute calls during a single run. The run ID should be unique for each run and can be an arbitrary random string. If the client does not provide a run ID in the first execute call of a run, the API server will assign a random one and return it to the client via the first response. Normally, if two or more runs overlap, they are processed in FIFO order using an internal queue, but they may be processed in parallel if the kernel runtime supports parallel processing. Note that the API server may raise a timeout error and cancel the run if the waiting time exceeds a certain limit.
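The run ID handling above can be sketched client-side. The helper below is illustrative (the name `new_run_id` is not part of the API); it simply produces the kind of arbitrary random string the server accepts, to be reused for every execute call of the same run:

```python
import secrets

def new_run_id() -> str:
    """Generate a client-side run ID: any unique random string works.

    The API server assigns one automatically if the first ``execute``
    call omits it, but generating it client-side lets the client tag
    all calls of the same run consistently from the start.
    """
    return secrets.token_hex(16)  # 32 hexadecimal characters

run_id = new_run_id()
# The same run_id must accompany every execute call of this run, e.g.:
# payload = {"mode": "query", "runId": run_id, "code": "print('hi')"}
```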

In the query mode, the runtime context (e.g., global variables) is usually preserved across subsequent runs, but this is not guaranteed by the API itself—it's up to the kernel runtime implementation.

_images/run-state-diagram.svg

The state diagram of a “run” with the execute API.

The execute API accepts 4 arguments: mode, runId, code, and options (opts). It returns an Execution Result Object encoded as JSON.

Depending on the value of status field in the returned Execution Result Object, the client must perform another subsequent execute call with appropriate arguments or stop. 그림 8 shows all possible states and transitions between them via the status field value.

If status is "finished", the client should stop.

If status is "continued", the client should make another execute API call with the code field set to an empty string and the mode field set to "continue". Continuation happens when the user code runs longer than a few seconds, to allow the client to show its progress, or when it requires an extra step to finish the run cycle.

If status is "clean-finished" or "build-finished" (this happens in the batch mode only), the client should make the same continuation call. Since cleanup is performed before every build, the client will always receive "build-finished" after the "clean-finished" status. All outputs received before the "build-finished" status are from the build program, and all subsequent outputs are from the executed program that was built. Note that even when the exitCode value is non-zero (failed), the client must continue to complete the run cycle.

If status is "waiting-input", the client should make another execute API call with the code field set to the user-input text and the mode field set to "input". This happens when the user code calls interactive input() functions. Until the client sends the user input, the current run is blocked. The client may use modal dialogs or other input forms (e.g., HTML input) to retrieve user inputs. When the server receives the user input, the kernel's input() returns the given value. Note that each kernel runtime may provide different ways to trigger this interactive input cycle or may not provide it at all.
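The continuation rules above can be condensed into a small client-side dispatch helper. This is a sketch: the function name and the returned argument dict are illustrative, not part of the official client API.

```python
def next_execute_args(status, user_input=""):
    """Given the status of the last Execution Result Object, return the
    mode/code arguments for the next execute call, or None to stop."""
    if status == "finished":
        return None  # the run cycle is complete
    if status in ("continued", "clean-finished", "build-finished"):
        # continuation call: empty code, mode "continue"
        return {"mode": "continue", "code": ""}
    if status == "waiting-input":
        # the user code called input(); send the collected text back
        return {"mode": "input", "code": user_input}
    raise ValueError(f"unexpected status: {status}")
```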

When each call returns, the console field in the Execution Result Object holds the console logs captured since the previous call. Check out the following section for details.

Handling Console Output

The console output consists of a list of tuple pairs of item type and item data. The item type is one of "stdout", "stderr", "media", "html", or "log".

When the item type is "stdout" or "stderr", the item data is the standard I/O stream output as a (non-escaped) UTF-8 string. The total length of either stream is limited to 524,288 Unicode characters per execute API call; all excessive outputs are truncated. The stderr often includes language-specific tracebacks of (unhandled) exceptions or errors that occurred in the user code. If the user code generates a mixture of stdout and stderr, the print ordering is preserved and each contiguous block of stdout/stderr becomes a separate item in the console output list, so that the client user can reconstruct the same console output by sequentially rendering the items.
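The ordering guarantee above means a client can rebuild the console verbatim by concatenating the text items in order. A minimal sketch (the helper name is illustrative; non-text items are skipped here):

```python
def render_console_text(console):
    """Reconstruct plain-text console output from the console item list
    of (item_type, item_data) pairs, preserving the original ordering
    of stdout/stderr blocks.  Media, html, and log items are skipped."""
    chunks = []
    for item_type, item_data in console:
        if item_type in ("stdout", "stderr"):
            chunks.append(item_data)
    return "".join(chunks)

console = [
    ("stdout", "step 1 done\n"),
    ("stderr", "warning: deprecated\n"),
    ("stdout", "step 2 done\n"),
]
text = render_console_text(console)
```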

참고

The text in the stdout/stderr items may contain arbitrary terminal control sequences such as ANSI color codes and cursor/line manipulations. It is the user's job to strip them out or to implement some sort of terminal emulation.

Since the console texts are not escaped, the client user should take care of rendering and escaping depending on the UI implementation. For example, use a <pre> element, replace newlines with <br>, or apply the white-space: pre CSS style when rendering as HTML. An easy way to escape the text safely is to use the insertAdjacentText() DOM API.
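For a Python-based front-end or a server-side renderer, the same escaping strategy can be sketched with the standard library (the helper name is illustrative):

```python
import html

def console_item_to_html(text: str) -> str:
    """Escape a raw stdout/stderr chunk for HTML rendering, replacing
    newlines with <br> (an alternative to the white-space: pre style)."""
    return html.escape(text).replace("\n", "<br>")
```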

When the item type is "media", the item data is a pair of the MIME type and the content data. If the MIME type is text-based (e.g., "text/plain") or XML-based (e.g., "image/svg+xml"), the content is just a string that represents the content. Otherwise, the data is encoded in the data URI format (RFC 2397). You may use the backend.ai-media library to handle this field in JavaScript on web browsers.

When the item type is "html", the item data is a partial HTML document string, such as a table to show tabular data. If you are implementing a web-based front-end, you may render it directly using the standard DOM API, for instance, consoleElem.insertAdjacentHTML("beforeend", value).

When the item type is "log", the item data is a 4-tuple of the log level, the timestamp in the ISO 8601 format, the logger name and the log message string. The log level may be one of "debug", "info", "warning", "error", or "fatal". You may use different colors/formatting by the log level when printing the log message. Not every kernel runtime supports this rich logging facility.
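A client may render a "log" item along these lines; the line format is only one possible choice and the helper name is illustrative:

```python
def format_log_item(item_data):
    """Format a "log" console item (a 4-tuple of level, ISO 8601
    timestamp, logger name, and message) as a single text line."""
    level, timestamp, logger, message = item_data
    return f"[{timestamp}] {level.upper()} {logger}: {message}"

line = format_log_item(
    ("info", "2020-11-28T13:30:30+00:00", "user", "training started"))
```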

Manager GraphQL API

Backend.AI GraphQL API is for developing in-house management consoles.

두 가지 모드의 작업이 있습니다.

  1. 전체 관리자 접근 : 모든 사용자에 대한 모든 정보를 쿼리할 수 있습니다. 이 방법은 권한 있는 키 쌍을 필요로 합니다.

  2. 제한된 사용자 계정 : 본인의 정보에 대해서만 조회할 수 있습니다. 본인의 고유한 키 쌍을 사용한다면, 서버는 이 모드에서 요청을 처리합니다.

경고

관리자 API는 오직 인증된 요청들만 받아들입니다.

To test and debug with the Admin API easily, try the proxy mode of the official Python client. It provides an insecure (non-SSL, non-authenticated) local HTTP proxy where all the required authorization headers are attached from the client configuration. Using this you do not have to add any custom header configurations to your favorite API development tools such as GraphiQL.

Domain Management

쿼리 스키마
type Domain {
  name: String
  description: String
  is_active: Boolean
  created_at: DateTime
  modified_at: DateTime
  total_resource_slots: JSONString  # ResourceSlot
  allowed_vfolder_hosts: [String]
  allowed_docker_registries: [String]
  integration_id: String
  scaling_groups: [String]
}

type Query {
  domain(name: String): Domain
  domains(is_active: Boolean): [Domain]
}
뮤테이션 스키마
input DomainInput {
  description: String
  is_active: Boolean
  total_resource_slots: JSONString  # ResourceSlot
  allowed_vfolder_hosts: [String]
  allowed_docker_registries: [String]
  integration_id: String
}

input ModifyDomainInput {
  name: String
  description: String
  is_active: Boolean
  total_resource_slots: JSONString  # ResourceSlot
  allowed_vfolder_hosts: [String]
  allowed_docker_registries: [String]
  integration_id: String
}

type CreateDomain {
  ok: Boolean
  msg: String
  domain: Domain
}

type ModifyDomain {
  ok: Boolean
  msg: String
}

type DeleteDomain {
  ok: Boolean
  msg: String
}

type Mutation {
  create_domain(name: String!, props: DomainInput!): CreateDomain
  modify_domain(name: String!, props: ModifyDomainInput!): ModifyDomain
  delete_domain(name: String!): DeleteDomain
}

Scaling Group Management

쿼리 스키마
type ScalingGroup {
  name: String
  description: String
  is_active: Boolean
  created_at: DateTime
  driver: String
  driver_opts: JSONString
  scheduler: String
  scheduler_opts: JSONString
}

type Query {
  scaling_group(name: String): ScalingGroup
  scaling_groups(name: String, is_active: Boolean): [ScalingGroup]
  scaling_groups_for_domain(domain: String!, is_active: Boolean): [ScalingGroup]
  scaling_groups_for_user_group(user_group: String!, is_active: Boolean): [ScalingGroup]
  scaling_groups_for_keypair(access_key: String!, is_active: Boolean): [ScalingGroup]
}
뮤테이션 스키마
input ScalingGroupInput {
  description: String
  is_active: Boolean
  driver: String!
  driver_opts: JSONString
  scheduler: String!
  scheduler_opts: JSONString
}

input ModifyScalingGroupInput {
  description: String
  is_active: Boolean
  driver: String
  driver_opts: JSONString
  scheduler: String
  scheduler_opts: JSONString
}

type CreateScalingGroup {
  ok: Boolean
  msg: String
  scaling_group: ScalingGroup
}

type ModifyScalingGroup {
  ok: Boolean
  msg: String
}

type DeleteScalingGroup {
  ok: Boolean
  msg: String
}

type AssociateScalingGroupWithDomain {
  ok: Boolean
  msg: String
}

type AssociateScalingGroupWithKeyPair {
  ok: Boolean
  msg: String
}

type AssociateScalingGroupWithUserGroup {
  ok: Boolean
  msg: String
}

type DisassociateAllScalingGroupsWithDomain {
  ok: Boolean
  msg: String
}

type DisassociateAllScalingGroupsWithGroup {
  ok: Boolean
  msg: String
}

type DisassociateScalingGroupWithDomain {
  ok: Boolean
  msg: String
}

type DisassociateScalingGroupWithKeyPair {
  ok: Boolean
  msg: String
}

type DisassociateScalingGroupWithUserGroup {
  ok: Boolean
  msg: String
}

type Mutation {
  create_scaling_group(name: String!, props: ScalingGroupInput!): CreateScalingGroup
  modify_scaling_group(name: String!, props: ModifyScalingGroupInput!): ModifyScalingGroup
  delete_scaling_group(name: String!): DeleteScalingGroup
  associate_scaling_group_with_domain(domain: String!, scaling_group: String!): AssociateScalingGroupWithDomain
  associate_scaling_group_with_user_group(scaling_group: String!, user_group: String!): AssociateScalingGroupWithUserGroup
  associate_scaling_group_with_keypair(access_key: String!, scaling_group: String!): AssociateScalingGroupWithKeyPair
  disassociate_scaling_group_with_domain(domain: String!, scaling_group: String!): DisassociateScalingGroupWithDomain
  disassociate_scaling_group_with_user_group(scaling_group: String!, user_group: String!): DisassociateScalingGroupWithUserGroup
  disassociate_scaling_group_with_keypair(access_key: String!, scaling_group: String!): DisassociateScalingGroupWithKeyPair
  disassociate_all_scaling_groups_with_domain(domain: String!): DisassociateAllScalingGroupsWithDomain
  disassociate_all_scaling_groups_with_group(user_group: String!): DisassociateAllScalingGroupsWithGroup
}

Resource Preset Management

쿼리 스키마
type ResourcePreset {
  name: String
  resource_slots: JSONString
  shared_memory: BigInt
}

type Query {
  resource_preset(name: String!): ResourcePreset
  resource_presets: [ResourcePreset]
}
뮤테이션 스키마
input CreateResourcePresetInput {
  resource_slots: JSONString
  shared_memory: String
}

type CreateResourcePreset {
  ok: Boolean
  msg: String
  resource_preset: ResourcePreset
}

input ModifyResourcePresetInput {
  resource_slots: JSONString
  shared_memory: String
}

type ModifyResourcePreset {
  ok: Boolean
  msg: String
}

type DeleteResourcePreset {
  ok: Boolean
  msg: String
}

type Mutation {
  create_resource_preset(name: String!, props: CreateResourcePresetInput!): CreateResourcePreset
  modify_resource_preset(name: String!, props: ModifyResourcePresetInput!): ModifyResourcePreset
  delete_resource_preset(name: String!): DeleteResourcePreset
}

Agent Monitoring

쿼리 스키마
type Agent {
  id: ID
  status: String
  status_changed: DateTime
  region: String
  scaling_group: String
  available_slots: JSONString  # ResourceSlot
  occupied_slots: JSONString   # ResourceSlot
  addr: String
  first_contact: DateTime
  lost_at: DateTime
  live_stat: JSONString
  version: String
  compute_plugins: JSONString
  compute_containers(status: String): [ComputeContainer]

  # legacy fields
  mem_slots: Int
  cpu_slots: Float
  gpu_slots: Float
  tpu_slots: Float
  used_mem_slots: Int
  used_cpu_slots: Float
  used_gpu_slots: Float
  used_tpu_slots: Float
  cpu_cur_pct: Float
  mem_cur_bytes: Float
}

type Query {
  agent_list(
    limit: Int!,
    offset: Int!,
    order_key: String,
    order_asc: Boolean,
    scaling_group: String,
    status: String,
  ): PaginatedList[Agent]
}

User Management

쿼리 스키마
type User {
  uuid: UUID
  username: String
  email: String
  password: String
  need_password_change: Boolean
  full_name: String
  description: String
  is_active: Boolean
  created_at: DateTime
  domain_name: String
  role: String
  groups: [UserGroup]
}

type UserGroup {  # shorthand reference to Group
  id: UUID
  name: String
}

type Query {
  user(domain_name: String, email: String): User
  user_from_uuid(domain_name: String, user_id: String): User
  users(domain_name: String, group_id: String, is_active: Boolean): [User]
}
뮤테이션 스키마
input UserInput {
  username: String!
  password: String!
  need_password_change: Boolean!
  full_name: String
  description: String
  is_active: Boolean
  domain_name: String!
  role: String
  group_ids: [String]
}

input ModifyUserInput {
  username: String
  password: String
  need_password_change: Boolean
  full_name: String
  description: String
  is_active: Boolean
  domain_name: String
  role: String
  group_ids: [String]
}

type CreateUser {
  ok: Boolean
  msg: String
  user: User
}

type ModifyUser {
  ok: Boolean
  msg: String
  user: User
}

type DeleteUser {
  ok: Boolean
  msg: String
}

type Mutation {
  create_user(email: String!, props: UserInput!): CreateUser
  modify_user(email: String!, props: ModifyUserInput!): ModifyUser
  delete_user(email: String!): DeleteUser
}

Group Management

쿼리 스키마
type Group {
  id: UUID
  name: String
  description: String
  is_active: Boolean
  created_at: DateTime
  modified_at: DateTime
  domain_name: String
  total_resource_slots: JSONString  # ResourceSlot
  allowed_vfolder_hosts: [String]
  integration_id: String
  scaling_groups: [String]
}

type Query {
  group(id: String!): Group
  groups(domain_name: String, is_active: Boolean): [Group]
}
뮤테이션 스키마
input GroupInput {
  description: String
  is_active: Boolean
  domain_name: String!
  total_resource_slots: JSONString  # ResourceSlot
  allowed_vfolder_hosts: [String]
  integration_id: String
}

input ModifyGroupInput {
  name: String
  description: String
  is_active: Boolean
  domain_name: String
  total_resource_slots: JSONString  # ResourceSlot
  user_update_mode: String
  user_uuids: [String]
  allowed_vfolder_hosts: [String]
  integration_id: String
}

type CreateGroup {
  ok: Boolean
  msg: String
  group: Group
}

type ModifyGroup {
  ok: Boolean
  msg: String
}

type DeleteGroup {
  ok: Boolean
  msg: String
}

type Mutation {
  create_group(name: String!, props: GroupInput!): CreateGroup
  modify_group(name: String!, props: ModifyGroupInput!): ModifyGroup
  delete_group(name: String!): DeleteGroup
}

키 쌍 관리

쿼리 스키마
type KeyPair {
  user_id: String
  access_key: String
  secret_key: String
  is_active: Boolean
  is_admin: Boolean
  resource_policy: String
  created_at: DateTime
  last_used: DateTime
  concurrency_used: Int
  rate_limit: Int
  num_queries: Int
  user: UUID
  ssh_public_key: String
  vfolders: [VirtualFolder]
  compute_sessions(status: String): [ComputeSession]
}

type Query {
  keypair(domain_name: String, access_key: String): KeyPair
  keypairs(domain_name: String, email: String, is_active: Boolean): [KeyPair]
}
뮤테이션 스키마
input KeyPairInput {
  is_active: Boolean
  resource_policy: String
  concurrency_limit: Int
  rate_limit: Int
}

input ModifyKeyPairInput {
  is_active: Boolean
  is_admin: Boolean
  resource_policy: String
  concurrency_limit: Int
  rate_limit: Int
}

type CreateKeyPair {
  ok: Boolean
  msg: String
  keypair: KeyPair
}

type ModifyKeyPair {
  ok: Boolean
  msg: String
}

type DeleteKeyPair {
  ok: Boolean
  msg: String
}

type Mutation {
  create_keypair(props: KeyPairInput!, user_id: String!): CreateKeyPair
  modify_keypair(access_key: String!, props: ModifyKeyPairInput!): ModifyKeyPair
  delete_keypair(access_key: String!): DeleteKeyPair
}

KeyPair Resource Policy Management

쿼리 스키마
type KeyPairResourcePolicy {
  name: String
  created_at: DateTime
  default_for_unspecified: String
  total_resource_slots: JSONString  # ResourceSlot
  max_concurrent_sessions: Int
  max_containers_per_session: Int
  idle_timeout: BigInt
  max_vfolder_count: Int
  max_vfolder_size: BigInt
  allowed_vfolder_hosts: [String]
}

type Query {
  keypair_resource_policy(name: String): KeyPairResourcePolicy
  keypair_resource_policies: [KeyPairResourcePolicy]
}
뮤테이션 스키마
input CreateKeyPairResourcePolicyInput {
  default_for_unspecified: String!
  total_resource_slots: JSONString!
  max_concurrent_sessions: Int!
  max_containers_per_session: Int!
  idle_timeout: BigInt!
  max_vfolder_count: Int!
  max_vfolder_size: BigInt!
  allowed_vfolder_hosts: [String]
}

input ModifyKeyPairResourcePolicyInput {
  default_for_unspecified: String
  total_resource_slots: JSONString
  max_concurrent_sessions: Int
  max_containers_per_session: Int
  idle_timeout: BigInt
  max_vfolder_count: Int
  max_vfolder_size: BigInt
  allowed_vfolder_hosts: [String]
}

type CreateKeyPairResourcePolicy {
  ok: Boolean
  msg: String
  resource_policy: KeyPairResourcePolicy
}

type ModifyKeyPairResourcePolicy {
  ok: Boolean
  msg: String
}

type DeleteKeyPairResourcePolicy {
  ok: Boolean
  msg: String
}

type Mutation {
  create_keypair_resource_policy(name: String!, props: CreateKeyPairResourcePolicyInput!): CreateKeyPairResourcePolicy
  modify_keypair_resource_policy(name: String!, props: ModifyKeyPairResourcePolicyInput!): ModifyKeyPairResourcePolicy
  delete_keypair_resource_policy(name: String!): DeleteKeyPairResourcePolicy
}

컴퓨팅 세션 모니터링

As of Backend.AI v20.03, compute sessions are composed of one or more containers, while interactions with sessions occur only through the master container when using REST APIs. The GraphQL API allows users and admins to check the details of sessions and the containers that belong to them.

버전 v5.20191215에서 변경.

쿼리 스키마

ComputeSession provides information about the whole session, including user-requested parameters when creating sessions.

type ComputeSession {
  # identity and type
  id: UUID
  name: String
  type: String
  tag: String

  # image
  image: String
  registry: String
  cluster_template: String  # reserved for future release

  # ownership
  domain_name: String
  group_name: String
  group_id: UUID
  user_email: String
  user_id: UUID
  access_key: String
  created_user_email: String  # reserved for future release
  created_user_uuid: UUID     # reserved for future release

  # status
  status: String
  status_changed: DateTime
  status_info: String
  created_at: DateTime
  terminated_at: DateTime
  startup_command: String
  result: String

  # resources
  resource_opts: JSONString
  scaling_group: String
  service_ports: JSONString   # only available in master
  mounts: List[String]            # shared by all kernels
  occupied_slots: JSONString  # ResourceSlot; sum of belonging containers

  # statistics
  num_queries: BigInt

  # owned containers (aka kernels)
  containers: List[ComputeContainer]  # full list of owned containers

  # pipeline relations
  dependencies: List[ComputeSession]  # full list of dependency sessions
}

The sessions may be queried one by one using compute_session field on the root query schema, or as a paginated list using compute_session_list.

type Query {
  compute_session(
    id: UUID!,
  ): ComputeSession

  compute_session_list(
    limit: Int!,
    offset: Int!,
    order_key: String,
    order_asc: Boolean,
    domain_name: String,  # super-admin can query sessions in any domain
    group_id: String,     # domain-admins can query sessions in any group
    access_key: String,   # admins can query sessions of other users
    status: String,
  ): PaginatedList[ComputeSession]
}

ComputeContainer provides information about the individual containers that belong to the given session. Note that the client must assume that id is different from container_id, because agents may be configured to use non-Docker backends.

참고

The container IDs in the GraphQL queries and REST APIs are different from the actual Docker container IDs. The Docker container IDs can be queried using the container_id field of ComputeContainer objects. If the agents are configured to use non-Docker-based backends, container_id may also be a completely arbitrary identifier.

type ComputeContainer {
  # identity
  id: UUID
  role: String      # "master" is reserved, other values are defined by cluster templates
  hostname: String  # used by sibling containers in the same session
  session_id: UUID

  # image
  image: String
  registry: String

  # status
  status: String
  status_changed: DateTime
  status_info: String
  created_at: DateTime
  terminated_at: DateTime

  # resources
  agent: String               # super-admin only
  container_id: String
  resource_opts: JSONString
  # NOTE: mounts are same in all containers of the same session.
  occupied_slots: JSONString  # ResourceSlot

  # statistics
  live_stat: JSONString
  last_stat: JSONString
}

In the same way, the containers may be queried one by one using compute_container field on the root query schema, or as a paginated list using compute_container_list for a single session.

참고

The container ID of the master container of each session is the same as the session ID.

type Query {
  compute_container(
    id: UUID!,
  ): ComputeContainer

  compute_container_list(
    limit: Int!,
    offset: Int!,
    session_id: UUID!,
    role: String,
  ): PaginatedList[ComputeContainer]
}
Query Example
query(
  $limit: Int!,
  $offset: Int!,
  $ak: String,
  $status: String,
) {
  compute_session_list(
    limit: $limit,
    offset: $offset,
    access_key: $ak,
    status: $status,
  ) {
    total_count
    items {
      id
      name
      type
      user_email
      status
      status_info
      status_changed
      containers {
        id
        role
        agent
      }
    }
  }
}
API Parameters

Using the above GraphQL query, clients may send the following JSON object as the request:

{
  "query": "...",
  "variables": {
    "limit": 10,
    "offset": 0,
    "ak": "AKIA....",
    "status": "RUNNING"
  }
}
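Assembling such a request body can be sketched in Python. The query text and helper name below are illustrative, and the endpoint and authentication handling are omitted:

```python
import json

QUERY = """
query($limit: Int!, $offset: Int!, $ak: String, $status: String) {
  compute_session_list(limit: $limit, offset: $offset,
                       access_key: $ak, status: $status) {
    total_count
    items { id name status }
  }
}
"""

def build_graphql_body(limit, offset, access_key=None, status=None):
    """Build the JSON-encoded request body with the two standard
    fields, ``query`` and ``variables``."""
    return json.dumps({
        "query": QUERY,
        "variables": {"limit": limit, "offset": offset,
                      "ak": access_key, "status": status},
    })

# This body would be POSTed to the GraphQL endpoint with the usual
# authorization headers attached.
body = build_graphql_body(10, 0, access_key="AKIA....", status="RUNNING")
```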
API Response
{
  "compute_session_list": {
    "total_count": 1,
    "items": [
      {
        "id": "12c45b55-ce3c-418d-9c58-223bbba307f1",
        "name": "mysession",
        "type": "interactive",
        "user_email": "user@lablup.com",
        "status": "RUNNING",
        "status_info": null,
        "status_changed": "2020-02-16T15:47:28.997335+00:00",
        "containers": [
          {
            "id": "12c45b55-ce3c-418d-9c58-223bbba307f1",
            "role": "master",
            "agent": "i-agent01"
          },
          {
            "id": "12c45b55-ce3c-418d-9c58-223bbba307f2",
            "role": "slave",
            "agent": "i-agent02"
          },
          {
            "id": "12c45b55-ce3c-418d-9c58-223bbba307f3",
            "role": "slave",
            "agent": "i-agent03"
          }
        ]
      }
    ]
  }
}

가상 폴더 관리

쿼리 스키마
type VirtualFolder {
  id: UUID
  host: String
  name: String
  user: UUID
  group: UUID
  unmanaged_path: String
  max_files: Int
  max_size: Int
  created_at: DateTime
  last_used: DateTime
  num_files: Int
  cur_size: BigInt
}

type Query {
  vfolder_list(
    limit: Int!,
    offset: Int!,
    order_key: String,
    order_asc: Boolean,
    domain_name: String,
    group_id: String,
    access_key: String,
  ): PaginatedList[VirtualFolder]
}

Image Management

쿼리 스키마
type Image {
  name: String
  humanized_name: String
  tag: String
  registry: String
  digest: String
  labels: [KVPair]
  aliases: [String]
  size_bytes: BigInt
  resource_limits: [ResourceLimit]
  supported_accelerators: [String]
  installed: Boolean
  installed_agents: [String]  # super-admin only
}
type Query {
  image(reference: String!): Image

  images(
    is_installed: Boolean,
    is_operation: Boolean,
    domain: String,         # only settable by super-admins
    group: String,
    scaling_group: String,  # null to take union of all agents from allowed scaling groups
  ): [Image]
}

The image list is automatically filtered by: 1) the allowed docker registries of the current user’s domain, 2) whether at least one agent in the union of all agents from the allowed scaling groups for the current user’s group has the image or not. The second condition applies only when the value of group is given explicitly. If scaling_group is not null, then only the agents in the given scaling group are checked for image availability instead of taking the union of all agents from the allowed scaling groups.

If the requesting user is a super-admin, clients may set the filter conditions as they want. If the super-admin does not specify the filter conditions, the query behaves like in v19.09 and prior versions.

버전 v5.20191215에 추가: domain, group, and scaling_group filters are added to the images root query field.

버전 v5.20191215에서 변경: images query returns the images currently usable by the requesting user as described above. Previously, it returned all etcd-registered images.

뮤테이션 스키마
type RescanImages {
  ok: Boolean
  msg: String
  task_id: String
}

type PreloadImage {
  ok: Boolean
  msg: String
  task_id: String
}

type UnloadImage {
  ok: Boolean
  msg: String
  task_id: String
}

type ForgetImage {
  ok: Boolean
  msg: String
}

type AliasImage {
  ok: Boolean
  msg: String
}

type DealiasImage {
  ok: Boolean
  msg: String
}

type Mutation {
  rescan_images(registry: String!): RescanImages
  preload_image(reference: String!, target_agents: String!): PreloadImage
  unload_image(reference: String!, target_agents: String!): UnloadImage
  forget_image(reference: String!): ForgetImage
  alias_image(alias: String!, target: String!): AliasImage
  dealias_image(alias: String!): DealiasImage
}

All these mutations are only allowed for super-admins.

The query parameter target_agents takes a special expression to indicate a set of agents.

The mutations that return task_id may take an arbitrarily long time to complete. This means that getting the response does not necessarily mean that the requested task is complete. To monitor the progress and actual completion, clients should use the background task API with the task_id value.

버전 v5.20191215에 추가: forget_image, preload_image and unload_image are added to the root mutation.

버전 v5.20191215에서 변경: rescan_images now returns immediately and its completion must be monitored using the new background task API.

GraphQL의 기본

관리자 API는 단일 GraphQL 엔드포인트를 쿼리와 뮤테이션에 사용합니다.

https://api.backend.ai/admin/graphql

GraphQL에 대한 개념과 구문에 대한 더 많은 정보를 위해서는, 다음 사이트를 방문하십시오.

HTTP 요청 규칙

클라이언트는 HTTP 메소드인 POST를 사용해야 합니다. 서버는 다른 GraphQL 서버 구현과 매우 유사하게, ``query``와 ``variables`` 두 가지 필드를 포함하는 JSON 인코딩 본문을 허용합니다.

경고

현재 API 게이트웨이는 Insomnia와 GraphiQL과 같은 API 개발 도구에서 자주 사용되는 스키마 검색을 지원하지 않습니다.

필드 명명 규칙

필드 이름을 자동으로 카멜케이스(camelCase)로 변환하지 않습니다. 서버 측 프레임워크가 파이썬을 사용하기 때문에 모든 필드 이름은 파이썬에서 일반적인 밑줄(snake_case) 스타일을 따릅니다.

Common Object Types

ResourceLimit represents a range (min, max) of a specific resource slot (key). The max value may be the string constant "Infinity" if not specified.

type ResourceLimit {
  key: String
  min: String
  max: String
}
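Handling the "Infinity" sentinel can be sketched as follows (the helper name is illustrative):

```python
def parse_resource_limit(limit):
    """Convert a ResourceLimit object's min/max strings into numbers;
    the string constant "Infinity" maps to float("inf")."""
    def to_number(value):
        return float("inf") if value == "Infinity" else float(value)
    return to_number(limit["min"]), to_number(limit["max"])

lo, hi = parse_resource_limit(
    {"key": "cuda.device", "min": "0", "max": "Infinity"})
```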

KVPair is used to represent a mapping data structure with arbitrary (runtime-determined) key-value pairs, in contrast to other data types in GraphQL which have a set of predefined static fields.

type KVPair {
  key: String
  value: String
}
페이지 번호 규칙

GraphQL 자체는 동일 유형의 다수 객체들을 질의할 때 페이지 번호 정보를 전달하는 방법을 강제하지 않습니다.

We use a pagination convention as described below:

interface Item {
  id: UUID
  # other fields are defined by concrete types
}

interface PaginatedList(
  offset: Integer!,
  limit: Integer!,
  # some concrete types define ordering customization fields:
  #   order_key: String,
  #   order_asc: Boolean,
  # other optional filter condition may be added by concrete types
) {
  total_count: Integer
  items: [Item]
}

offset and limit are interpreted as SQL's offset and limit clauses. For the first page, set the offset to zero and the limit to the page size. The items field may contain from zero up to limit items. Use the total_count field to determine how many pages there are. Fields that support pagination are suffixed with _list in our schema.
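The offset/limit convention translates to page navigation like this (a sketch; the helper names are illustrative):

```python
import math

def page_params(page_index, page_size):
    """Translate a zero-based page index into the offset/limit
    arguments of a paginated *_list query."""
    return {"offset": page_index * page_size, "limit": page_size}

def num_pages(total_count, page_size):
    """How many pages a paginated list spans, given total_count."""
    return math.ceil(total_count / page_size)
```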

Custom Scalar Types
  • UUID: A universally unique identifier value in hexadecimal format (8-4-4-4-12 alphanumeric groups joined by single hyphens), represented as a `String`.

  • DateTime: An ISO 8601 formatted date-time value, represented as a `String`.

  • BigInt: GraphQL’s integer is officially 32-bit only, so we define a “big integer” type which can represent values from -9007199254740991 (-2^53+1) to 9007199254740991 (2^53-1) (or, ±(8 PiB - 1 byte)). This range is regarded as the “safe” integer range (i.e., one that can be compared without losing precision) in most JavaScript implementations, which represent numbers in the IEEE 754 double (64-bit) format.

  • JSONString: It contains a stringified JSON value, whereas the whole query result is already a JSON object. A client must parse the value again to get an object representation.
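For example, a JSONString field must be decoded twice: once as part of the response envelope, and once by itself. The field path below is illustrative only, not taken from the actual schema.

```python
import json

# The outer response is JSON; the JSONString field inside it is a string
# holding another JSON document, so it needs a second json.loads() pass.
raw_response = (
    '{"data": {"keypair": {"resource_policy": '
    '"{\\"max_vfolder_count\\": 10}"}}}'
)
envelope = json.loads(raw_response)
policy = json.loads(envelope["data"]["keypair"]["resource_policy"])
```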

Authentication

The admin API shares the same authentication method as the user API.

Versioning

As we use GraphQL, there is no explicit versioning. Check out the descriptions for each API for its own version history.

Backend.AI REST API Reference

GET /acl

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

GET /config/resource-slots
Status Codes:
  • 200 OK – Successful response

GET /config/resource-slots/details
Query Parameters:
  • sgroup (string) –

Status Codes:
  • 200 OK – Successful response

GET /config/vfolder-types
Status Codes:
  • 200 OK – Successful response

GET /config/docker-registries

Returns the list of all registered docker registries.

Preconditions:

  • Superadmin privilege required.

Status Codes:
  • 200 OK – Successful response

POST /config/get

A raw access API to read key-value pairs from the etcd.

Warning

When reading the keys with prefix=True, it uses a simple string-prefix matching over the flattened keys (with the delimiter “/”). Thus, it may return additional keys that you may not want.

For example, reading “some/key1” will fetch all of the following keys:

some/key1
some/key1/field1
some/key1/field2
some/key12
some/key12/field1
some/key12/field2

To avoid this issue, developers must use dedicated CRUD APIs instead of relying on the etcd raw access APIs whenever possible.
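The pitfall above can be reproduced with plain string matching, mirroring how the flattened keys are compared. This is a standalone simulation, not the server's code; the "safer" variant shows the delimiter-aware filtering the raw API does not do.

```python
def naive_prefix_read(flattened_keys, key):
    # Simple string-prefix matching, as the raw etcd access API does.
    return sorted(k for k in flattened_keys if k.startswith(key))

def safer_prefix_read(flattened_keys, key):
    # Match only the exact key or its "/"-delimited children.
    return sorted(k for k in flattened_keys if k == key or k.startswith(key + "/"))

store = ["some/key1", "some/key1/field1", "some/key12", "some/key12/field1"]
```

Reading "some/key1" naively also picks up the sibling "some/key12" subtree, exactly as the warning describes.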

Preconditions:

  • Superadmin privilege required.

Request JSON Object:
  • key (string, required) –

  • prefix (boolean, required) –

Status Codes:
  • 200 OK – Successful response

POST /config/set

A raw access API to write key-value pairs into the etcd.

Preconditions:

  • Superadmin privilege required.

Request JSON Object:
  • key (string, required) –

  • value (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /config/delete

A raw access API to delete key-value pairs from the etcd.

Warning

When deleting the keys with prefix=True, it uses a simple string-prefix matching over the flattened keys (with the delimiter “/”). This may result in unexpected deletion of sibling keys.

For example, deleting “some/key1” will DELETE all of the following keys:

some/key1
some/key1/field1
some/key1/field2
some/key12
some/key12/field1
some/key12/field2

To avoid this issue, developers must use dedicated CRUD APIs instead of relying on the etcd raw access APIs whenever possible.

Preconditions:

  • Superadmin privilege required.

Request JSON Object:
  • key (string, required) –

  • prefix (boolean, required) –

Status Codes:
  • 200 OK – Successful response

GET /events/background-task

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • task_id (string:uuid, required) –

Status Codes:
  • 200 OK – Successful response

GET /events/session

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • name (string, required) –

  • ownerAccessKey (string) –

  • sessionId (string:uuid) –

  • group (string, required) –

  • scope (string:enum, required) –

Status Codes:
  • 200 OK – Successful response

GET /auth

Preconditions:

  • User privilege required.

Query Parameters:
  • echo (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /auth

Preconditions:

  • User privilege required.

Request JSON Object:
  • echo (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /auth/test

Preconditions:

  • User privilege required.

Query Parameters:
  • echo (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /auth/test

Preconditions:

  • User privilege required.

Request JSON Object:
  • echo (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /auth/authorize
Request JSON Object:
  • type (string:enum, required) –

  • domain (string, required) –

  • username (string, required) –

  • password (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /auth/role

Preconditions:

  • User privilege required.

Query Parameters:
  • group (string:uuid) –

Status Codes:
  • 200 OK – Successful response

POST /auth/signup
Request JSON Object:
  • domain (string, required) –

  • email (string, required) –

  • password (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /auth/signout

Preconditions:

  • User privilege required.

Request JSON Object:
  • email (string, required) –

  • password (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /auth/update-password-no-auth

Update a user’s password without authentication. This allows users to update passwords that have expired because too much time has passed since the last password change.

Request JSON Object:
  • domain (string, required) –

  • username (string, required) –

  • current_password (string, required) –

  • new_password (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /auth/update-password

Preconditions:

  • User privilege required.

Request JSON Object:
  • old_password (string, required) –

  • new_password (string, required) –

  • new_password2 (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /auth/update-full-name

Preconditions:

  • User privilege required.

Request JSON Object:
  • email (string, required) –

  • full_name (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /auth/ssh-keypair

Preconditions:

  • User privilege required.

Status Codes:
  • 200 OK – Successful response

PATCH /auth/ssh-keypair

Preconditions:

  • User privilege required.

Status Codes:
  • 200 OK – Successful response

POST /auth/ssh-keypair

Preconditions:

  • User privilege required.

Request JSON Object:
  • pubkey (string, required) –

  • privkey (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • name (string, required) –

  • host (string) –

  • usage_mode (string:enum) –

  • permission (string:enum) –

  • unmanaged_path (string) –

  • group (string:uuid) –

  • quota (string) – Size in binary format (e.g. 2KB, 3M, 4GiB)

  • cloneable (boolean, required) –

Status Codes:
  • 200 OK – Successful response
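The quota field above takes sizes in a binary format such as 2KB, 3M, or 4GiB. The sketch below shows one way such strings could be parsed; the accepted suffix set and the choice to treat every multiplier as binary (1K = 1024) are our assumptions, not the server's exact grammar.

```python
import re

_BINARY_UNITS = {"": 1, "k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}

def parse_size(text):
    """Parse strings like "2KB", "3M", or "4GiB" into a byte count."""
    m = re.fullmatch(r"(\d+)\s*([kmgt]?)(?:i?b?)", text.strip().lower())
    if m is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = m.group(1), m.group(2)
    return int(number) * _BINARY_UNITS[unit]
```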

GET /folders

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • all (boolean, required) –

  • group_id

  • owner_user_email (string:email) –

Status Codes:
  • 200 OK – Successful response

DELETE /folders

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • id (string:uuid, required) –

Status Codes:
  • 200 OK – Successful response

GET /folders/{name}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Status Codes:
  • 200 OK – Successful response

DELETE /folders/{name}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /folders/_/hosts

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • group_id

Status Codes:
  • 200 OK – Successful response

GET /folders/_/all-hosts

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

GET /folders/_/allowed-types

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

GET /folders/_/all_hosts

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

GET /folders/_/allowed_types

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

GET /folders/_/perf-metric

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • folder_host (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/rename

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • new_name (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/update-options

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • cloneable (boolean) –

  • permission (string:enum) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/mkdir

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • path (string, required) –

  • parents (boolean, required) –

  • exist_ok (boolean, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/request-upload

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • path (string, required) –

  • size (integer, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/request-download

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • path (string, required) –

  • archive (boolean, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/move-file

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • src (string, required) –

  • dst (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/rename-file

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • target_path (string, required) –

  • new_name (string, required) –

  • is_dir (boolean, required) –

Status Codes:
  • 200 OK – Successful response

DELETE /folders/{name}/delete-files

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Query Parameters:
  • files (array, required) –

  • recursive (boolean, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/rename_file

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • target_path (string, required) –

  • new_name (string, required) –

  • is_dir (boolean, required) –

Status Codes:
  • 200 OK – Successful response

DELETE /folders/{name}/delete_files

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Query Parameters:
  • files (array, required) –

  • recursive (boolean, required) –

Status Codes:
  • 200 OK – Successful response

GET /folders/{name}/files

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Query Parameters:
  • path (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/invite

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • perm (string:enum, required) –

  • emails[] (string) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/leave

Leave a shared vfolder.

Cannot leave a group vfolder or a vfolder that the requesting user owns.

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • shared_user_uuid (string) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/share

Preconditions:

  • Admin privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • permission (string:enum, required) –

  • emails[] (string) –

Status Codes:
  • 200 OK – Successful response

DELETE /folders/{name}/unshare

Unshare a group folder from users.

Preconditions:

  • Admin privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Query Parameters:
  • emails (array, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/{name}/clone

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • name (string, required) –

Request JSON Object:
  • cloneable (boolean, required) –

  • target_name (string, required) –

  • target_host (string) –

  • usage_mode (string:enum) –

  • permission (string:enum) –

Status Codes:
  • 200 OK – Successful response

GET /folders/invitations/list-sent

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

GET /folders/invitations/list_sent

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

POST /folders/invitations/update/{inv_id}

Update sent invitation’s permission. Other fields are not allowed to be updated.

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • inv_id (string, required) –

Request JSON Object:
  • perm (string:enum, required) –

Status Codes:
  • 200 OK – Successful response

GET /folders/invitations/list

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

POST /folders/invitations/accept

Accept an invitation (called by the invitee).

The `inv_ak` parameter was removed in 19.06 since a virtual folder’s ownership moved from a keypair to a user or a group. The inv_id field is the ID of the corresponding vfolder_invitations row.

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • inv_id (string, required) –

Status Codes:
  • 200 OK – Successful response

DELETE /folders/invitations/delete

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • inv_id (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /folders/_/shared

List shared vfolders.

Not available for group vfolders.

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • vfolder_id (string:uuid) –

Status Codes:
  • 200 OK – Successful response

POST /folders/_/shared

Update permission for shared vfolders.

If params['perm'] is None, the user’s permission for the vfolder is removed.

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • vfolder (string:uuid, required) –

  • user (string:uuid, required) –

  • perm (string:enum) –

Status Codes:
  • 200 OK – Successful response

GET /folders/_/fstab

Return the contents of the /etc/fstab file.

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • fstab_path (string) –

  • agent_id (string) –

Status Codes:
  • 200 OK – Successful response

GET /folders/_/mounts

List all mounted vfolder hosts in vfroot.

All mounted hosts from connected (ALIVE) agents are also gathered. Generally, agents should be configured to have the same host structure, but a newly introduced agent may not.

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

POST /folders/_/mounts

Mount device into vfolder host.

Mounts a device (e.g., NFS) located at fs_location into <vfroot>/name on the host machines (the manager and all agents). fs_type can be specified by the requester, falling back to ‘nfs’.

If scaling_group is specified, try to mount for agents in the scaling group.

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • fs_location (string, required) –

  • name (string, required) –

  • fs_type (string, required) –

  • options (string) –

  • scaling_group (string) –

  • fstab_path (string) –

  • edit_fstab (boolean, required) –

Status Codes:
  • 200 OK – Successful response

DELETE /folders/_/mounts

Unmount device from vfolder host.

Unmounts a device (e.g., NFS) located at <vfroot>/name from the host machines (the manager and all agents).

If scaling_group is specified, try to unmount for agents in the scaling group.

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • name (string, required) –

  • scaling_group (string) –

  • fstab_path (string) –

  • edit_fstab (boolean, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/_/change-ownership

Change the ownership of a vfolder. For now, only changing the ownership of user-owned folders is supported.

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • vfolder (string:uuid, required) –

  • user_email (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /folders/_/quota

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • folder_host (string, required) –

  • id (string:uuid, required) –

Status Codes:
  • 200 OK – Successful response

POST /folders/_/quota

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • folder_host (string, required) –

  • id (string:uuid, required) –

  • input (object, required) – Mapping(String => Any)

Status Codes:
  • 200 OK – Successful response

GET /folders/_/usage

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • folder_host (string, required) –

  • id (string:uuid, required) –

Status Codes:
  • 200 OK – Successful response

GET /folders/_/used-bytes

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • folder_host (string, required) –

  • id (string:uuid, required) –

Status Codes:
  • 200 OK – Successful response

POST //graphql

Preconditions:

  • User privilege required.

Request JSON Object:
  • query (string, required) –

  • variables (object) – Mapping(String => Any)

  • operation_name (string) –

Status Codes:
  • 200 OK – Successful response

POST //gql

Preconditions:

  • User privilege required.

Request JSON Object:
  • query (string, required) –

  • variables (object) – Mapping(String => Any)

  • operation_name (string) –

Status Codes:
  • 200 OK – Successful response

GET /services

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • name (string) –

Status Codes:
  • 200 OK – Successful response

POST /services

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • name (string, required) –

  • desired_session_count (integer, required) –

  • image (string, required) –

  • arch (string, required) –

  • group (string, required) –

  • domain (string, required) –

  • cluster_size (integer, required) –

  • cluster_mode (string:enum, required) –

  • tag (string) –

  • startup_command (string) –

  • bootstrap_script (string) –

  • callback_url (string:uri) –

  • owner_access_key (string) –

  • open_to_public (boolean, required) –

  • config (object, required) –

  • config.model (string, required) –

  • config.model_version (string) –

  • config.model_mount_destination (string, required) –

  • config.environ (object) – Mapping(String => String)

  • config.scaling_group (string) –

  • config.resources (object) – Mapping(String => Any)

  • config.resource_opts (object) – Mapping(String => Any)

Status Codes:
  • 200 OK – Successful response

GET /services/{service_id}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • service_id (string, required) –

Status Codes:
  • 200 OK – Successful response

DELETE /services/{service_id}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • service_id (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /services/{service_id}/errors

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • service_id (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /services/{service_id}/errors/clear

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • service_id (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /services/{service_id}/scale

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • service_id (string, required) –

Request JSON Object:
  • to (integer, required) –

Status Codes:
  • 200 OK – Successful response

POST /services/{service_id}/sync

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • service_id (string, required) –

Status Codes:
  • 200 OK – Successful response

PUT /services/{service_id}/routings/{route_id}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • service_id (string, required) –

  • route_id (string, required) –

Request JSON Object:
  • traffic_ratio (integer, required) –

Status Codes:
  • 200 OK – Successful response

DELETE /services/{service_id}/routings/{route_id}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • service_id (string, required) –

  • route_id (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /services/{service_id}/token

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • service_id (string, required) –

Request JSON Object:
  • duration (string) – Human-readable time duration representation (e.g. 2y, 3d, 4m, 5h, 6s, …)

  • valid_until (integer) –

Status Codes:
  • 200 OK – Successful response
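The duration field above uses human-readable strings such as 2y, 3d, 5h, or 6s. A hypothetical single-unit parser is sketched below; the suffix meanings (in particular whether "m" denotes minutes or months) are our assumptions, not taken from the API specification.

```python
import re

# Assumed suffix-to-seconds mapping; "m" is treated as minutes here.
_DURATION_SECONDS = {"y": 365 * 86400, "d": 86400, "h": 3600, "m": 60, "s": 1}

def parse_duration(text):
    """Parse a single-unit duration string such as "2y", "3d", "5h", or "6s"."""
    m = re.fullmatch(r"(\d+)\s*([ydhms])", text.strip().lower())
    if m is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = m.groups()
    return int(value) * _DURATION_SECONDS[unit]
```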

POST /session

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • name (string, required) –

  • image (string, required) –

  • arch (string, required) –

  • type (string:enum, required) –

  • group (string, required) –

  • domain (string, required) –

  • cluster_size (integer, required) –

  • cluster_mode (string:enum, required) –

  • config (object, required) – Mapping(String => Any)

  • tag (string) –

  • enqueueOnly (boolean, required) –

  • maxWaitSeconds (integer, required) –

  • starts_at (string) –

  • reuseIfExists (boolean, required) –

  • startupCommand (string) –

  • bootstrap_script (string) –

  • dependencies[] (string:uuid) –

  • callback_url (string:uri) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

POST /session/_/create

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • name (string, required) –

  • image (string, required) –

  • arch (string, required) –

  • type (string:enum, required) –

  • group (string, required) –

  • domain (string, required) –

  • cluster_size (integer, required) –

  • cluster_mode (string:enum, required) –

  • config (object, required) – Mapping(String => Any)

  • tag (string) –

  • enqueueOnly (boolean, required) –

  • maxWaitSeconds (integer, required) –

  • starts_at (string) –

  • reuseIfExists (boolean, required) –

  • startupCommand (string) –

  • bootstrap_script (string) –

  • dependencies[] (string:uuid) –

  • callback_url (string:uri) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

POST /session/_/create-from-template

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • template_id (string:uuid) –

  • name (string) –

  • image (string, required) –

  • arch (string) –

  • type (string:enum) –

  • group (string, required) –

  • domain (string, required) –

  • cluster_size (integer, required) –

  • cluster_mode (string:enum, required) –

  • config (object, required) – Mapping(String => Any)

  • tag (string, required) –

  • enqueueOnly (boolean, required) –

  • maxWaitSeconds (integer, required) –

  • starts_at (string) –

  • reuseIfExists (boolean, required) –

  • startupCommand (string, required) –

  • bootstrap_script (string, required) –

  • dependencies[] (string:uuid) –

  • callback_url (string:uri, required) –

  • owner_access_key (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /session/_/create-cluster

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • clientSessionToken (string, required) –

  • template_id (string:uuid) –

  • type (string:enum, required) –

  • group (string, required) –

  • domain (string, required) –

  • scaling_group (string) –

  • tag (string) –

  • enqueueOnly (boolean, required) –

  • maxWaitSeconds (integer, required) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

GET /session/_/match

A quick session-ID matcher API for use with auto-completion in CLI.

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • id (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /session/_/sync-agent-registry

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • agent (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /session/{session_name}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

PATCH /session/{session_name}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Request JSON Object:
  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

DELETE /session/{session_name}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Query Parameters:
  • forced (boolean, required) –

  • recursive (boolean, required) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

POST /session/{session_name}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

HEAD /session/_/logs

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • session_name (string:uuid, required) –

Status Codes:
  • 200 OK – Successful response

GET /session/_/logs

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • session_name (string:uuid, required) –

Status Codes:
  • 200 OK – Successful response

GET /session/{session_name}/direct-access-info

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /session/{session_name}/logs

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Query Parameters:
  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

POST /session/{session_name}/rename

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Request JSON Object:
  • name (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /session/{session_name}/interrupt

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /session/{session_name}/complete

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /session/{session_name}/shutdown-service

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Request JSON Object:
  • service_name (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /session/{session_name}/upload

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /session/{session_name}/download

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Query Parameters:
  • files (array, required) –

Status Codes:
  • 200 OK – Successful response

GET /session/{session_name}/download_single

Download a single file from the scratch root. Only for small files.

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Query Parameters:
  • file (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /session/{session_name}/files

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /session/{session_name}/start-service

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Request JSON Object:
  • login_session_token (string) –

  • app (string, required) –

  • port (integer) –

  • envs (string) –

  • arguments (string) –

Status Codes:
  • 200 OK – Successful response

POST /session/{session_name}/commit

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Request JSON Object:
  • login_session_token (string) –

  • filename (string) –

Status Codes:
  • 200 OK – Successful response

GET /session/{session_name}/commit

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Query Parameters:
  • login_session_token (string) –

Status Codes:
  • 200 OK – Successful response

GET /session/{session_name}/abusing-report

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Query Parameters:
  • login_session_token (string) –

Status Codes:
  • 200 OK – Successful response

GET /session/{session_name}/dependency-graph

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /stream/session/{session_name}/pty

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /stream/session/{session_name}/execute

WebSocket-version of gateway.kernel.execute().

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /stream/session/{session_name}/apps

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /stream/session/{session_name}/httpproxy

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Query Parameters:
  • app (string, required) –

  • port (integer) –

  • envs (string) –

  • arguments (string) –

Status Codes:
  • 200 OK – Successful response

GET /stream/session/{session_name}/tcpproxy

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • session_name (string, required) –

Query Parameters:
  • app (string, required) –

  • port (integer) –

  • envs (string) –

  • arguments (string) –

Status Codes:
  • 200 OK – Successful response

GET /manager/status
Status Codes:
  • 200 OK – Successful response

PUT /manager/status

Preconditions:

  • Superadmin privilege required.

Request JSON Object:
  • status (string:enum, required) –

  • force_kill (boolean, required) –

Status Codes:
  • 200 OK – Successful response

GET /manager/announcement
Status Codes:
  • 200 OK – Successful response

POST /manager/announcement

Preconditions:

  • Superadmin privilege required.

Request JSON Object:
  • enabled (boolean, required) –

  • message (string) –

Status Codes:
  • 200 OK – Successful response

POST /manager/scheduler/operation

Preconditions:

  • Superadmin privilege required.

Request JSON Object:
  • op (string:enum, required) –

  • args (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /manager/scheduler/trigger

Preconditions:

  • Superadmin privilege required.

Request JSON Object:
  • event (string:enum, required) –

Status Codes:
  • 200 OK – Successful response

GET /manager/scheduler/status

Preconditions:

  • Superadmin privilege required.

Status Codes:
  • 200 OK – Successful response

GET /resource/presets

Returns the list of all resource presets.

Preconditions:

  • User privilege required.

Status Codes:
  • 200 OK – Successful response

POST /resource/check-presets

Returns the list of all resource presets in the current scaling group, with additional information including allocatability of each preset, amount of total remaining resources, and the current keypair resource limits.

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • scaling_group (string) –

  • group (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /resource/recalculate-usage

Update keypair_resource_usages in redis and agents.c.occupied_slots.

Those two values are sometimes out of sync. In that case, calling this API re-calculates the values for running containers and updates them in DB.

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

GET /resource/usage/month

Return usage statistics of terminated containers for a specified month. The date/time comparison is done using the configured timezone.

Parameters:
  • group_ids – If not None, query containers only in those groups.

  • month – The year-month to query usage statistics, in “yyyymm” format (e.g., “202006” for June 2020).

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • group_ids (array) –

  • month (string, required) –

Status Codes:
  • 200 OK – Successful response
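As a sketch of how a client could assemble the query string for this endpoint: the “yyyymm” month format comes from the parameter description above, while encoding group_ids as a JSON array is an assumption — check your client library for the exact array serialization it uses.

```python
import json
from urllib.parse import urlencode

def usage_month_query(year: int, month: int, group_ids=None) -> str:
    """Build the query string for GET /resource/usage/month.

    The month parameter uses the "yyyymm" format, e.g. "202006" for
    June 2020. Encoding group_ids as a JSON array is an assumption.
    """
    params = {"month": f"{year:04d}{month:02d}"}
    if group_ids is not None:
        params["group_ids"] = json.dumps(group_ids)
    return urlencode(params)

print(usage_month_query(2020, 6))  # month=202006
```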

GET /resource/usage/period

Return usage statistics of terminated containers belonging to the given group for a specified period in dates. The date/time comparison is done using the configured timezone.

Parameters:
  • group_id – If not None, query containers only in the group.

  • start_date (str) – “yyyymmdd” format.

  • end_date (str) – “yyyymmdd” format.

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • group_id (string) –

  • start_date (string, required) –

  • end_date (string, required) –

Status Codes:
  • 200 OK – Successful response
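A client can validate the date parameters before issuing the request. The sketch below enforces the “yyyymmdd” format documented above; the client-side ordering check is an extra safeguard, not behavior stated by this reference.

```python
from datetime import datetime

def period_params(start_date: str, end_date: str, group_id=None) -> dict:
    """Validate and assemble query parameters for GET /resource/usage/period.

    Both dates must use the "yyyymmdd" format; malformed dates raise
    ValueError via strptime, and a reversed range is rejected here
    before any request is made.
    """
    start = datetime.strptime(start_date, "%Y%m%d").date()
    end = datetime.strptime(end_date, "%Y%m%d").date()
    if start > end:
        raise ValueError("start_date must not be later than end_date")
    params = {"start_date": start_date, "end_date": end_date}
    if group_id is not None:
        params["group_id"] = group_id
    return params
```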

GET /resource/stats/user/month

Return time-binned (15 min) stats for terminated user sessions over last 30 days.

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

GET /resource/stats/admin/month

Return time-binned (15 min) stats for all terminated sessions over last 30 days.

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

GET /resource/watcher

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • agent_id (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /resource/watcher/agent/start

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • agent_id (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /resource/watcher/agent/stop

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • agent_id (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /resource/watcher/agent/restart

Preconditions:

  • Superadmin privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • agent_id (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /scaling-groups

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • group (required) –

Status Codes:
  • 200 OK – Successful response

GET /scaling-groups/{scaling_group}/wsproxy-version

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • scaling_group (string, required) –

Query Parameters:
  • group (required) –

Status Codes:
  • 200 OK – Successful response

POST /template/cluster

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • group (string, required) –

  • domain (string, required) –

  • owner_access_key (string) –

  • payload (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /template/cluster

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • all (boolean, required) –

  • group_id

Status Codes:
  • 200 OK – Successful response

GET /template/cluster/{template_id}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • template_id (string, required) –

Query Parameters:
  • format (string:enum) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

PUT /template/cluster/{template_id}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • template_id (string, required) –

Request JSON Object:
  • payload (string, required) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

DELETE /template/cluster/{template_id}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • template_id (string, required) –

Query Parameters:
  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

POST /template/session

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • group (string, required) –

  • domain (string, required) –

  • owner_access_key (string) –

  • payload (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /template/session

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • all (boolean, required) –

  • group_id

Status Codes:
  • 200 OK – Successful response

GET /template/session/{template_id}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • template_id (string, required) –

Query Parameters:
  • format (string:enum) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

PUT /template/session/{template_id}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • template_id (string, required) –

Request JSON Object:
  • group (string, required) –

  • domain (string, required) –

  • payload (string, required) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

DELETE /template/session/{template_id}

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • template_id (string, required) –

Query Parameters:
  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

GET /image/import

Preconditions:

  • Admin privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

POST /image/import

Import a docker image and convert it to a Backend.AI-compatible one, by automatically installing a few packages and adding image labels.

Currently we only support auto-conversion of Python-based kernels (e.g., NGC images) which have their own Python versions installed.

Internally, it launches a temporary kernel in an arbitrary agent within the client’s domain, the “default” group, and the “default” scaling group. (The client may change the group and scaling group using launchOptions. If the client is a super-admin, it uses the “default” domain.)

This temporary kernel occupies only 1 CPU core and 1 GiB of memory. The kernel concurrency limit is not applied here, but we choose an agent based on its resource availability. The owner of this kernel is always the client that makes the API request.

This API returns immediately after launching the temporary kernel. The client may check the progress of the import task using session logs.

Preconditions:

  • Admin privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • src (string, required) –

  • target (string, required) –

  • architecture (string, required) –

  • launchOptions (object, required) –

  • launchOptions.scalingGroup (string, required) –

  • launchOptions.group (string, required) –

  • brand (string, required) –

  • baseDistro (string:enum, required) –

  • minCPU (integer, required) –

  • minMemory (string, required) – Size in binary format (e.g. 2KB, 3M, 4GiB)

  • preferredSharedMemory (string, required) – Size in binary format (e.g. 2KB, 3M, 4GiB)

  • supportedAccelerators[] (string) –

  • runtimeType (string:enum, required) –

  • runtimePath (string, required) – POSIX path

  • CPUCountEnvs[] (string) –

  • servicePorts[] (object) –

  • servicePorts[].name (string, required) –

  • servicePorts[].protocol (string:enum, required) –

  • servicePorts[].ports[] (integer) –

Status Codes:
  • 200 OK – Successful response
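Putting the request schema above together, a request body might be assembled like the following sketch. Only the field names come from the schema listed above; all concrete values (image names, and the enum values for baseDistro, runtimeType, and protocol) are illustrative assumptions, not values confirmed by this reference.

```python
import json

# All values below are illustrative; only the field names come from the
# documented request schema of POST /image/import.
payload = {
    "src": "example.registry.io/ngc/pytorch:22.03-py3",       # hypothetical source image
    "target": "example.registry.io/pytorch:22.03-backendai",  # hypothetical target ref
    "architecture": "x86_64",
    "launchOptions": {"scalingGroup": "default", "group": "default"},
    "brand": "PyTorch",
    "baseDistro": "ubuntu",            # assumed enum value
    "minCPU": 1,
    "minMemory": "1g",                 # binary size format, e.g. 2KB, 3M, 4GiB
    "preferredSharedMemory": "64m",
    "supportedAccelerators": ["cuda"],
    "runtimeType": "python",           # assumed enum value
    "runtimePath": "/usr/bin/python",  # POSIX path
    "CPUCountEnvs": ["OMP_NUM_THREADS"],
    "servicePorts": [
        {"name": "jupyter", "protocol": "http", "ports": [8081]},
    ],
}
body = json.dumps(payload)  # serialized request body for POST /image/import
```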

POST /user-config/dotfiles

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • data (string, required) –

  • path (string, required) –

  • permission (string, required) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

GET /user-config/dotfiles

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • path (string) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

PATCH /user-config/dotfiles

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • data (string, required) –

  • path (string, required) –

  • permission (string, required) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

DELETE /user-config/dotfiles

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • path (string, required) –

  • owner_access_key (string) –

Status Codes:
  • 200 OK – Successful response

POST /user-config/bootstrap-script

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • script (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /user-config/bootstrap-script

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Status Codes:
  • 200 OK – Successful response

POST /domain-config/dotfiles

Preconditions:

  • Admin privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • domain (string, required) –

  • data (string, required) –

  • path (string, required) –

  • permission (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /domain-config/dotfiles

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • domain (string, required) –

  • path (string) –

Status Codes:
  • 200 OK – Successful response

PATCH /domain-config/dotfiles

Preconditions:

  • Admin privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • domain (string, required) –

  • data (string, required) –

  • path (string, required) –

  • permission (string, required) –

Status Codes:
  • 200 OK – Successful response

DELETE /domain-config/dotfiles

Preconditions:

  • Admin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • domain (string, required) –

  • path (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /group-config/dotfiles

Preconditions:

  • Admin privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • group (string:uuid) –

  • domain (string) –

  • data (string, required) –

  • path (string, required) –

  • permission (string, required) –

Status Codes:
  • 200 OK – Successful response

GET /group-config/dotfiles

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • group (required) –

  • domain (string) –

  • path (string) –

Status Codes:
  • 200 OK – Successful response

PATCH /group-config/dotfiles

Preconditions:

  • Admin privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • group (string:uuid) –

  • domain (string) –

  • data (string, required) –

  • path (string, required) –

  • permission (string, required) –

Status Codes:
  • 200 OK – Successful response

DELETE /group-config/dotfiles

Preconditions:

  • Admin privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • group (required) –

  • domain (string) –

  • path (string, required) –

Status Codes:
  • 200 OK – Successful response

POST /logs/error

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Request JSON Object:
  • severity (string:enum, required) –

  • source (string, required) –

  • message (string, required) –

  • context_lang (string, required) –

  • context_env (string, required) – JSON string

  • request_url (string) –

  • request_status (integer) –

  • traceback (string) –

Status Codes:
  • 200 OK – Successful response

GET /logs/error

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Query Parameters:
  • mark_read (boolean, required) –

  • page_size (integer, required) –

  • page_no (integer, required) –

Status Codes:
  • 200 OK – Successful response

POST /logs/error/{log_id}/clear

Preconditions:

  • User privilege required.

  • Manager status required: RUNNING

Parameters:
  • log_id (string, required) –

Status Codes:
  • 200 OK – Successful response

Backend.AI Agent Reference

RPC interface for kernel management

Docker backend

Kubernetes backend

Accelerators, also known as Compute Plugins

Backend.AI Storage Proxy Reference

Manager-facing API of the Storage Proxy

Client-facing API of the Storage Proxy

Backend.AI Client SDK for Python

Python 3.8 or higher is required.

You can download the official installer from python.org, or use a separate package manager suitable for your operating environment (e.g., homebrew, miniconda, or pyenv). This client SDK has been tested on Linux, macOS, and Windows.

We recommend creating a dedicated Python virtual environment so that the SDK libraries and tools are installed without conflicts with other software.

$ python3 -m venv venv-backend-ai
$ source venv-backend-ai/bin/activate
(venv-backend-ai) $

Install the client library from PyPI:

(venv-backend-ai) $ pip install -U pip setuptools
(venv-backend-ai) $ pip install backend.ai-client

Set your API keypair as environment variables:

(venv-backend-ai) $ export BACKEND_ACCESS_KEY=AKIA...
(venv-backend-ai) $ export BACKEND_SECRET_KEY=...

Then run your first commands:

(venv-backend-ai) $ backend.ai --help
...
(venv-backend-ai) $ backend.ai ps
...

Check out more details in the table of contents below.

Installation

Linux/macOS

We recommend using pyenv to manage your Python versions and virtual environments to avoid conflicts with other Python applications.

Create a new virtual environment (Python 3.8 or higher) and activate it in your shell. Then run the following commands:

pip install -U pip setuptools
pip install -U backend.ai-client-py

Create a shell script my-backendai-env.sh like:

export BACKEND_ACCESS_KEY=...
export BACKEND_SECRET_KEY=...
export BACKEND_ENDPOINT=https://my-precious-cluster
export BACKEND_ENDPOINT_TYPE=api

Source this shell script before using the backend.ai command so that the exported variables take effect in your current shell.

Note

Console-server users should set BACKEND_ENDPOINT_TYPE to session. For details, check out the client configuration document.

Windows

We recommend using the Anaconda Navigator to manage your Python environments with a slick GUI app.

Create a new environment (Python 3.8 or higher) and launch a terminal (command prompt). Then run the following commands:

python -m pip install -U pip setuptools
python -m pip install -U backend.ai-client-py

Create a batch file my-backendai-env.bat like:

chcp 65001
set PYTHONIOENCODING=UTF-8
set BACKEND_ACCESS_KEY=...
set BACKEND_SECRET_KEY=...
set BACKEND_ENDPOINT=https://my-precious-cluster
set BACKEND_ENDPOINT_TYPE=api

Run this batch file before using the backend.ai command.

Note that this batch file switches your command prompt to use the UTF-8 codepage for correct display of special characters in the console logs.

Verification

Run the backend.ai ps command and check if it says “there is no compute sessions running” or something similar.

If you encounter error messages about “ACCESS_KEY”, then check if your batch/shell scripts have the correct environment variable names.

If you encounter network connection error messages, check if the endpoint server is configured correctly and accessible.

Client Configuration

The Backend.AI API configuration includes the endpoint URL and the API keypair (access and secret keys).

There are two ways to configure it:

  1. Setting environment variables before running your program that uses this SDK. This applies to the command-line interface as well.

  2. Manually creating an APIConfig instance and creating sessions with it.

The configurable environment variables are:

  • BACKEND_ENDPOINT

  • BACKEND_ENDPOINT_TYPE

  • BACKEND_ACCESS_KEY

  • BACKEND_SECRET_KEY

  • BACKEND_VFOLDER_MOUNTS

Please refer to the parameter descriptions of APIConfig’s constructor for what each environment variable means and which value format to use.
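For option 1, a program can gather these variables before building its configuration. The sketch below only collects the values; whether APIConfig applies the same defaults (and its exact constructor signature) must be checked against the APIConfig documentation.

```python
import os

def read_client_config(environ=os.environ) -> dict:
    """Collect Backend.AI client settings from the documented
    environment variables.

    The fallback values shown here are assumptions for illustration,
    not the SDK's authoritative defaults.
    """
    return {
        "endpoint": environ.get("BACKEND_ENDPOINT", "https://api.backend.ai"),
        "endpoint_type": environ.get("BACKEND_ENDPOINT_TYPE", "api"),
        "access_key": environ.get("BACKEND_ACCESS_KEY"),
        "secret_key": environ.get("BACKEND_SECRET_KEY"),
        "vfolder_mounts": environ.get("BACKEND_VFOLDER_MOUNTS"),
    }

cfg = read_client_config()
```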

Command Line Interface

Configuration

Note

Please consult the detailed usage in the help of each command (use -h or --help argument to display the manual).

Check out the client configuration for configurations via environment variables.

Session Mode

When the endpoint type is "session", you must explicitly log in to and log out of the console server.

$ backend.ai login
Username: myaccount@example.com
Password:
✔ Login succeeded.

$ backend.ai ...  # any commands

$ backend.ai logout
✔ Logout done.

API Mode

After setting up the environment variables, just run any command:

$ backend.ai ...

Checking out the current configuration

Run the following command to list your current active configurations.

$ backend.ai config

Compute Sessions

Note

Please consult the detailed usage in the help of each command (use -h or --help argument to display the manual).

Listing sessions

List the sessions owned by you with various status filters. The most recently status-changed sessions are listed first. To prevent overloading the server, the result is limited to the first 10 sessions; a separate --all option is provided to paginate through further sessions.

backend.ai ps

The ps command is an alias of the following admin session list command. If you have administrator privileges, you can list sessions owned by other users by adding the --access-key option.

backend.ai admin session list

Both commands offer options to set the status filter as follows. For other options, please consult the output of --help.

  • (no option) – PENDING, PREPARING, RUNNING, RESTARTING, TERMINATING, RESIZING, SUSPENDED, and ERROR.

  • --running – PREPARING, PULLING, and RUNNING.

  • --dead – CANCELLED and TERMINATED.

Both commands offer options to specify which fields of sessions should be printed as follows.

  • (no option) – Session ID, Owner, Image, Type, Status, Status Info, Last updated, and Result.

  • --id-only – Session ID.

  • --detail – Session ID, Owner, Image, Type, Status, Status Info, Last updated, Result, Tag, Created At, Occupied Resource, Used Memory (MiB), Max Used Memory (MiB), and CPU Using (%).

  • -f, --format – Fields specified by the user.

Note

Fields for the -f/--format option are specified as comma-separated parameters.

Available parameters for this option are: id, status, status_info, created_at, last_updated, result, image, type, task_id, tag, occupied_slots, used_memory, max_used_memory, cpu_using.

For example:

backend.ai admin session --format id,status,cpu_using

Running simple sessions

The following command spawns a Python session and immediately executes the code passed via the -c argument. The --rm option makes the client automatically terminate the session after execution finishes.

backend.ai run --rm -c 'print("hello world")' python:3.6-ubuntu18.04

Note

By default, you need to specify the language with its full version tag, like python:3.6-ubuntu18.04. Depending on the Backend.AI admin’s language alias settings, this can be shortened to just python. If you want to know the defined language aliases, contact the admin of your Backend.AI server.

The following command spawns a Python session and executes the uploaded ./myscript.py file, using the shell command specified in the --exec option.

backend.ai run --rm --exec 'python myscript.py arg1 arg2' \
           python:3.6-ubuntu18.04 ./myscript.py

Please note that your run command may appear to hang for a very long time due to queueing when cluster resources are not sufficiently available.

To avoid indefinite waiting, you may add --enqueue-only to return immediately after posting the session creation request.

Note

When using --enqueue-only, the code is NOT executed and the relevant options are ignored. This makes the run command behave the same as the start command.

Alternatively, you may use the --max-wait option to limit the maximum waiting time. If the session starts within the given --max-wait seconds, it works normally; if not, it returns without executing the code, as if --enqueue-only were used.

To watch what is happening behind the scenes until the session starts, try backend.ai events <sessionID> to receive lifecycle events such as its scheduling and preparation steps.

Running sessions with accelerators

Use one or more -r options to specify resource requirements when using backend.ai run and backend.ai start commands.

For instance, the following command spawns a Python TensorFlow session using half of a virtual GPU device, 4 CPU cores, and 8 GiB of main memory to execute the ./mygpucode.py file inside it.

backend.ai run --rm \
           -r cpu=4 -r mem=8g -r cuda.shares=0.5 \
           python-tensorflow:1.12-py36 ./mygpucode.py

Terminating or cancelling sessions

Without the --rm option, your session remains alive for a configured idle timeout (default is 30 minutes). You can see such sessions using the backend.ai ps command. Use the following command to manually terminate them via their session IDs. You may specify multiple session IDs to terminate them at once.

backend.ai rm <sessionID> [<sessionID>...]

If you terminate PENDING sessions which are not scheduled yet, they are cancelled.

Container Applications

Note

Please consult the detailed usage in the help of each command (use -h or --help argument to display the manual).

Starting a session and connecting to its Jupyter Notebook

The following command first spawns a Python session named “mysession” without running any code immediately, and then executes a local proxy which connects to the “jupyter” service running inside the session via the local TCP port 9900. The start command shows the application services provided by the created compute session so that you can choose one in the subsequent app command. In the start command, you can specify detailed resource options using -r and storage mounts using the -m option.

backend.ai start -t mysession python
backend.ai app -b 9900 mysession jupyter

Once executed, the app command waits for the user to open the displayed address with an appropriate application. For the jupyter service, use your favorite web browser just like the way you use Jupyter Notebooks. To stop the app command, press Ctrl+C or send the SIGINT signal.

Accessing sessions via a web terminal

All Backend.AI sessions expose an intrinsic application named "ttyd". It is a web application that embeds an xterm.js-based full-screen terminal running in web browsers.

backend.ai start -t mysession ...
backend.ai app -b 9900 mysession ttyd

Then open http://localhost:9900 in your browser to access the shell in a fully functional web terminal. The default shell is /bin/bash for Ubuntu/CentOS-based images and /bin/ash for Alpine-based images, with a fallback to /bin/sh.

Note

This shell access does NOT grant root access. All compute session processes run with user privileges.

Accessing sessions via native SSH/SFTP

Backend.AI offers direct access to compute sessions (containers) via SSH and SFTP by auto-generating a host identity and user keypair for every session. All Backend.AI sessions expose an intrinsic application named "sshd", like "ttyd".

To connect to your session with SSH, first prepare your session and download the auto-generated SSH keypair named id_container. Then start the service port proxy (the “app” command) to open a local TCP port that proxies SSH/SFTP traffic to the compute session:

$ backend.ai start -t mysess ...
$ backend.ai session download mysess id_container
$ mv id_container ~/.ssh
$ backend.ai app mysess sshd -b 9922

In another terminal on the same PC, run your ssh client like:

$ ssh -o StrictHostKeyChecking=no \
>     -o UserKnownHostsFile=/dev/null \
>     -i ~/.ssh/id_container \
>     work@localhost -p 9922
Warning: Permanently added '[127.0.0.1]:9922' (RSA) to the list of known hosts.
f310e8dbce83:~$

This SSH port is also compatible with SFTP to browse the container’s filesystem and to upload/download large-sized files.

You could add the following to your ~/.ssh/config to avoid typing the extra options every time.

Host localhost
  User work
  IdentityFile ~/.ssh/id_container
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null

$ ssh localhost -p 9922

Warning

Since the SSH keypair is auto-generated every time you launch a new compute session, you need to download and keep it separately for each session.

To use your own SSH private key across all your sessions without downloading the auto-generated one every time, create a vfolder named .ssh and put an authorized_keys file containing your public key in it. Backend.AI automatically updates the keypair and .ssh directory permissions when the session launches.

$ ssh-keygen -t rsa -b 2048 -f id_container
$ cat id_container.pub > authorized_keys
$ backend.ai vfolder create .ssh
$ backend.ai vfolder upload .ssh authorized_keys

Storage Management

Note

Please consult the detailed usage in the help of each command (use -h or --help argument to display the manual).

Backend.AI abstracts shared network storages into per-user slices called “virtual folders” (aka “vfolders”), which can be shared between users and user group members.

Creating vfolders and managing them

The command-line interface provides a set of subcommands under backend.ai vfolder to manage vfolders and files inside them.

To list accessible vfolders including your own ones and those shared by other users:

$ backend.ai vfolder list

To create a virtual folder named “mydata1”:

$ backend.ai vfolder create mydata1 mynas

The second argument mynas corresponds to the name of a storage host. To list the storage hosts that you are allowed to use:

$ backend.ai vfolder list-hosts

To delete the vfolder completely:

$ backend.ai vfolder delete mydata1

File transfers and management

To upload a file from the current working directory into the vfolder:

$ backend.ai vfolder upload mydata1 ./bigdata.csv

To download a file from the vfolder into the current working directory:

$ backend.ai vfolder download mydata1 ./bigresult.txt

To list files in the vfolder’s specific path:

$ backend.ai vfolder ls mydata1 .

To delete files in the vfolder:

$ backend.ai vfolder rm mydata1 ./bigdata.csv

Warning

All file uploads and downloads overwrite existing files and all file operations are irreversible.

Running sessions with storages

The following command spawns a Python session where the virtual folder “mydata1” is mounted. The execution options are omitted in this example. Then, it downloads ./bigresult.txt file (generated by your code) from the “mydata1” virtual folder.

$ backend.ai vfolder upload mydata1 ./bigdata.csv
$ backend.ai run --rm -m mydata1 python:3.6-ubuntu18.04 ...
$ backend.ai vfolder download mydata1 ./bigresult.txt

In your code, you may access the virtual folder via /home/work/mydata1 (the default current working directory is /home/work) just like a normal directory. If you want to mount a vfolder at a different path, add ‘/’ as a prefix at the front of the vfolder path.

By reusing the same vfolder in subsequent sessions, you do not have to download the result and re-upload it as the input for the next session; just keep it in the storage.

Creating default files for kernels

Backend.AI has a feature called ‘dotfiles’: files that are created in every kernel the user spawns. As the name suggests, a dotfile’s path should start with .. The following command creates a dotfile named .aws/config with permission 755. This file will be created under /home/work every time the user spawns a Backend.AI kernel.

$ backend.ai dotfile create .aws/config < ~/.aws/config

코드 실행하기 (고급)

참고

각 명령어의 help에서 자세한 사용법을 확인할 것을 권장합니다. help는 -h 혹은 --help 를 명령어의 인자로 입력하여 불러올 수 있습니다.

동시 세션 실행하기

run 명령어는 Running simple sessions 에 설명되어 있는 단일 세션 실행뿐만 아니라 여러 세션의 동시 실행 기능을 제공하고, 이 때 --exec 옵션으로 입력되는 인수와 -e / --env 옵션으로 입력되는 환경 변수가 사용됩니다.

--exec 옵션에 해당하는 변수를 설정할 때에는 --exec-range 이 사용되고, --env 옵션에 해당하는 변수를 설정할 때에는 --env-range 이 사용됩니다.

The following example sets a range of environment variables so that four sessions are created.

backend.ai run -c 'import os; print("Hello world, {}".format(os.environ["CASENO"]))' \
    -r cpu=1 -r mem=256m \
    -e 'CASENO=$X' \
    --env-range=X=case:1,2,3,4 \
    lablup/python:3.6-ubuntu18.04

A range option takes an argument in the form of a “range expression”. The front part of the argument consists of the target variable and an equals sign (=). The rest is an expression in one of the forms listed below.

Expression

Description

case:CASE1,CASE2,...,CASEN

An array composed of strings or numbers.

linspace:START,STOP,POINTS

Values over an interval defined with the same syntax as numpy.linspace(). For example, linspace:1,2,3 generates the array [1, 1.5, 2] of length 3.

range:START,STOP,STEP

A range of numbers defined with the same syntax as Python's range(). For example, range:1,6,2 generates the array [1, 3, 5].

If you pass multiple range options to the run command, the client creates as many sessions as there are combinations of the values defined by each range.
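The expansion of range expressions into per-session combinations can be sketched in Python. This is a hypothetical re-implementation for illustration only; the helper names parse_range and expand are made up and are not part of the client code.

```python
import itertools

def parse_range(expr: str) -> list:
    """Parse a range expression ('case:', 'linspace:', or 'range:') into values."""
    kind, _, args = expr.partition(":")
    if kind == "case":
        return args.split(",")  # an array of strings (numbers are kept as strings)
    if kind == "linspace":
        start, stop, points = (float(x) for x in args.split(","))
        n = int(points)
        step = (stop - start) / (n - 1) if n > 1 else 0.0
        return [start + step * i for i in range(n)]
    if kind == "range":
        start, stop, step = (int(x) for x in args.split(","))
        return list(range(start, stop, step))
    raise ValueError(f"unknown range expression: {expr}")

def expand(env_ranges: dict):
    """Yield one environment-variable mapping per session (cartesian product)."""
    keys = list(env_ranges)
    value_lists = [parse_range(env_ranges[k]) for k in keys]
    for combo in itertools.product(*value_lists):
        yield dict(zip(keys, combo))

# Two range options: 4 cases x 3 linspace points = 12 sessions.
cases = list(expand({"X": "case:1,2,3,4", "ALPHA": "linspace:1,2,3"}))
```

Each yielded mapping corresponds to the environment variables of one spawned session.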

Note

If the user or the cluster lacks the resources to run the created sessions, some sessions may be queued and the command may take a long time to complete.

Warning

Because this feature is implemented on the client side, the client must keep its connection to the server until all cases are finished. Server-side batch job scheduling is currently under development!

Session Templates

Creating and starting session template

Users may define a commonly used set of session creation parameters as reusable templates.

A session template includes common session parameters such as resource slots, vfolder mounts, and the kernel image to use. It also supports an extra feature that automatically clones a Git repository upon startup as a bootstrap command.

The following sample shows what a session template looks like:

---
api_version: v1
kind: taskTemplate
metadata:
  name: template1234
  tag: example-tag
spec:
  kernel:
    environ:
      MYCONFIG: XXX
    git:
      branch: '19.09'
      commit: 10daee9e328876d75e6d0fa4998d4456711730db
      repository: https://github.com/lablup/backend.ai-agent
      destinationDir: /home/work/baiagent
    image: python:3.6-ubuntu18.04
  resources:
    cpu: '2'
    mem: 4g
  mounts:
    hostpath-test: /home/work/hostDesktop
    test-vfolder:
  sessionType: interactive

The backend.ai sesstpl command set provides the basic CRUD operations for user-specific session templates.

The create command accepts the YAML content either piped from the standard input or read from a file using the -f flag:

$ backend.ai sesstpl create < session-template.yaml
# -- or --
$ backend.ai sesstpl create -f session-template.yaml

Once the session template is uploaded, you may use it to start a new session:

$ backend.ai start-template <templateId>

substituting <templateId> with your template ID.

Other CRUD command examples are as follows:

$ backend.ai sesstpl update <templateId> < session-template.yaml
$ backend.ai sesstpl list
$ backend.ai sesstpl get <templateId>
$ backend.ai sesstpl delete <templateId>
Full syntax for task template
---
api_version or apiVersion: str, required
kind: Enum['taskTemplate', 'task_template'], required
metadata: required
  name: str, required
  tag: str (optional)
spec:
  type or sessionType: Enum['interactive', 'batch'] (optional), default=interactive
  kernel:
    image: str, required
    environ: map[str, str] (optional)
    run: (optional)
      bootstrap: str (optional)
      startup or startup_command or startupCommand: str (optional)
    git: (optional)
      repository: str, required
      commit: str (optional)
      branch: str (optional)
      credential: (optional)
        username: str
        password: str
      destination_dir or destinationDir: str (optional)
  mounts: map[str, str] (optional)
  resources: map[str, str] (optional)
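As a rough illustration of the required fields in the schema above, the following sketch validates a template dictionary. This is a hypothetical helper for illustration only; the server performs the real validation.

```python
def validate_template(tpl: dict) -> bool:
    """Minimal check of the required fields in the task template schema."""
    if "api_version" not in tpl and "apiVersion" not in tpl:
        raise ValueError("api_version is required")
    if tpl.get("kind") not in ("taskTemplate", "task_template"):
        raise ValueError("kind must be 'taskTemplate' or 'task_template'")
    if "name" not in tpl.get("metadata", {}):
        raise ValueError("metadata.name is required")
    if "image" not in tpl.get("spec", {}).get("kernel", {}):
        raise ValueError("spec.kernel.image is required")
    return True

template = {
    "api_version": "v1",
    "kind": "taskTemplate",
    "metadata": {"name": "template1234"},
    "spec": {"kernel": {"image": "python:3.6-ubuntu18.04"}},
}
```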

Developer Guides

Client Session

Client Session Objects

This module is the starting point for Python programs that use the Backend.AI API functions.

The high-level API functions cannot be used alone; you must initiate a client session first because each session provides proxy attributes that represent the API functions and run in the context of that session.

To achieve this, session objects internally construct new types during initialization by combining the BaseFunction class with the attributes of each API function class, and bind the new types to themselves. Creating new types every time a session instance is created may look weird, but it is the most convenient way to make the class methods of the API function classes work with specific session instances.

When designing your application, please note that session objects are intended to be long-lived, following the process's lifecycle, rather than created and disposed of for each API request.
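The type-binding pattern described above can be illustrated with a minimal sketch. All names here (MiniSession, the toy System class) are made up for illustration and are not the SDK's actual internals:

```python
class BaseFunction:
    """Mix-in base; bound subclasses receive a 'session' class attribute."""
    session = None

class System:
    """A toy API function class; the real ones wrap HTTP API calls."""
    @classmethod
    def get_versions(cls):
        # In the real SDK this would issue a request through cls.session.
        return f"negotiated via {cls.session.name}"

class MiniSession:
    """Toy session that binds API function classes to itself at init time."""
    def __init__(self, name):
        self.name = name
        # Construct a new type per session so that classmethods resolve
        # cls.session to *this* session instance.
        self.System = type("System", (BaseFunction, System), {"session": self})

sess_a = MiniSession("A")
sess_b = MiniSession("B")
```

Each session gets its own bound copy of the function class, so the same classmethod call routes through different sessions.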

class ai.backend.client.session.BaseSession(*, config=None, proxy_mode=False)

The base abstract class for sessions.

property proxy_mode: bool

If set to True, the session skips API version negotiation when it is opened.

abstractmethod open()

Initializes the session and performs version negotiation.

Return type:

Optional[Awaitable[None]]

abstractmethod close()

Terminates the session and releases underlying resources.

Return type:

Optional[Awaitable[None]]

property closed: bool

Checks if the session is closed.

property config: APIConfig

The configuration used by this session object.

class ai.backend.client.session.Session(*, config=None, proxy_mode=False)

A context manager for API client sessions that makes API requests synchronously. You may call simple request-response APIs like plain Python functions, but you cannot use streaming APIs based on WebSockets and Server-Sent Events.

property closed: bool

Checks if the session is closed.

property config: APIConfig

The configuration used by this session object.

property proxy_mode: bool

If set to True, the session skips API version negotiation when it is opened.

open()

Initializes the session and performs version negotiation.

Return type:

None

close()

Terminates the session. It schedules the close() coroutine of the underlying aiohttp session, enqueues a sentinel object to signal termination, and then joins the worker thread, waiting for it to self-terminate.

Return type:

None

property worker_thread

The thread that internally executes the asynchronous implementations of the given API functions.
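The close() behavior described above follows a common sentinel-and-join shutdown pattern. The following is a plain-stdlib sketch of that pattern for illustration; it is not the SDK's actual implementation:

```python
import queue
import threading

_SENTINEL = object()

class WorkerLoop:
    """Sketch of the sentinel-and-join shutdown pattern."""

    def __init__(self):
        self.jobs = queue.Queue()
        self.done = []
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        # Drain jobs until the sentinel arrives, then exit the loop.
        while True:
            job = self.jobs.get()
            if job is _SENTINEL:
                break
            self.done.append(job())

    def close(self):
        self.jobs.put(_SENTINEL)  # enqueue a sentinel to signal termination
        self.thread.join()        # wait for the worker thread to self-terminate

loop = WorkerLoop()
loop.jobs.put(lambda: 42)
loop.close()
```

Because the queue is FIFO, all pending jobs are processed before the sentinel is consumed, so close() never drops work.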

class ai.backend.client.session.AsyncSession(*, config=None, proxy_mode=False)

A context manager for API client sessions that makes API requests asynchronously. You may call all APIs as coroutines. WebSocket-based and SSE-based APIs return special response types.

property closed: bool

Checks if the session is closed.

property config: APIConfig

The configuration used by this session object.

property proxy_mode: bool

If set to True, the session skips API version negotiation when it is opened.

open()

Initializes the session and performs version negotiation.

Return type:

Awaitable[None]

close()

Terminates the session and releases underlying resources.

Return type:

Awaitable[None]

Examples

Here are several examples to demonstrate the functional API usage.

Initialization of the API Client
Implicit configuration from environment variables
from ai.backend.client.session import Session

def main():
    with Session() as api_session:
        print(api_session.System.get_versions())

if __name__ == "__main__":
    main()
Explicit configuration
from ai.backend.client.config import APIConfig
from ai.backend.client.session import Session

def main():
    config = APIConfig(
        endpoint="https://api.backend.ai.local",
        endpoint_type="api",
        domain="default",
        group="default",  # the default project name to use
    )
    with Session(config=config) as api_session:
        print(api_session.System.get_versions())

if __name__ == "__main__":
    main()
Asyncio-native API session
import asyncio
from ai.backend.client.session import AsyncSession

async def main():
    async with AsyncSession() as api_session:
        print(api_session.System.get_versions())

if __name__ == "__main__":
    asyncio.run(main())

See also

The interface of API client session objects: ai.backend.client.session

Working with Compute Sessions

Note

From here on, we omit the main() function scaffolding in the sample code.

Listing currently running compute sessions
import functools
from ai.backend.client.session import Session

with Session() as api_session:
    fetch_func = functools.partial(
        api_session.ComputeSession.paginated_list,
        status="RUNNING",
    )
    current_offset = 0
    while True:
        result = fetch_func(page_offset=current_offset, page_size=20)
        if result.total_count == 0:
            # no items found
            break
        current_offset += len(result.items)
        for item in result.items:
            print(item)
        if current_offset >= result.total_count:
            # end of list
            break
Creating and destroying a compute session
from ai.backend.client.session import Session

with Session() as api_session:
    my_session = api_session.ComputeSession.get_or_create(
        "python:3.9-ubuntu20.04",      # registered container image name
        mounts=["mydata", "mymodel"],  # vfolder names
        resources={"cpu": 8, "mem": "32g", "cuda.device": 2},
    )
    print(my_session.id)
    my_session.destroy()
Accessing Container Applications

The set of launchable apps varies by session. Here we illustrate how to open a ttyd (web-based terminal) app, which is available in all Backend.AI sessions.

Note

This example is only applicable to Backend.AI clusters with AppProxy v2 enabled and configured. AppProxy v2 ships only with the enterprise version of Backend.AI.

The ComputeSession.start_service() API
import requests

from ai.backend.client.session import Session

app_name = "ttyd"

with Session() as api_session:
    sess = api_session.ComputeSession.get_or_create(...)
    service_info = sess.start_service(app_name, login_session_token="dummy")
    app_proxy_url = f"{service_info['wsproxy_addr']}/v2/proxy/{service_info['token']}/{sess.id}/add?app={app_name}"
    resp = requests.get(app_proxy_url)
    body = resp.json()
    auth_url = body["url"]
    print(auth_url)  # opening this link in a browser navigates the user to the terminal session

Set login_session_token to a dummy string such as "dummy"; it is a remnant of a legacy interface and is no longer used.

Alternatively, in versions before 23.09.8, you may use a raw ai.backend.client.Request to call the server-side start_service API.

import asyncio

import aiohttp

from ai.backend.client.request import Request
from ai.backend.client.session import AsyncSession

app_name = "ttyd"

async def main():
    async with AsyncSession() as api_session:
        sess = api_session.ComputeSession.get_or_create(...)
        rqst = Request(
            "POST",
            f"/session/{sess.id}/start-service",
        )
        rqst.set_json({"app": app_name, "login_session_token": "dummy"})
        async with rqst.fetch() as resp:
            body = await resp.json()
            app_proxy_url = f"{body['wsproxy_addr']}/v2/proxy/{body['token']}/{sess.id}/add?app={app_name}"

        async with aiohttp.ClientSession() as client:
            async with client.get(app_proxy_url) as resp:
                body = await resp.json()
                auth_url = body["url"]
                print(auth_url)  # opening this link in a browser navigates the user to the terminal session

if __name__ == "__main__":
    asyncio.run(main())
Code Execution via API
Synchronous mode
Snippet execution (query mode)

This is the minimal code to execute a code snippet with this client SDK.

import sys
from ai.backend.client.session import Session

with Session() as api_session:
    my_session = api_session.ComputeSession.get_or_create("python:3.9-ubuntu20.04")
    code = 'print("hello world")'
    mode = "query"
    run_id = None
    try:
        while True:
            result = my_session.execute(run_id, code, mode=mode)
            run_id = result["runId"]  # keeps track of this particular run loop
            for rec in result.get("console", []):
                if rec[0] == "stdout":
                    print(rec[1], end="", file=sys.stdout)
                elif rec[0] == "stderr":
                    print(rec[1], end="", file=sys.stderr)
                else:
                    handle_media(rec)
            sys.stdout.flush()
            if result["status"] == "finished":
                break
            else:
                mode = "continued"
                code = ""
    finally:
        my_session.destroy()

You need to pay attention to client_token because it determines whether kernel sessions are reused. Backend.AI cloud has a timeout that terminates long-idle kernel sessions, but within the timeout, any kernel creation request with the same client_token lets Backend.AI cloud reuse the kernel.

Script execution (batch mode)

You first need to upload the files after creating the session, and then construct an opts struct.

import sys
from ai.backend.client.session import Session

with Session() as session:
    compute_sess = session.ComputeSession.get_or_create("python:3.6-ubuntu18.04")
    compute_sess.upload(["mycode.py", "setup.py"])
    code = ""
    mode = "batch"
    run_id = None
    opts = {
        "build": "*",  # calls "python setup.py install"
        "exec": "python mycode.py arg1 arg2",
    }
    try:
        while True:
            result = compute_sess.execute(run_id, code, mode=mode, opts=opts)
            opts.clear()
            run_id = result["runId"]
            for rec in result.get("console", []):
                if rec[0] == "stdout":
                    print(rec[1], end="", file=sys.stdout)
                elif rec[0] == "stderr":
                    print(rec[1], end="", file=sys.stderr)
                else:
                    handle_media(rec)
            sys.stdout.flush()
            if result["status"] == "finished":
                break
            else:
                mode = "continued"
                code = ""
    finally:
        compute_sess.destroy()
Handling user inputs

Inside the while-loop for execute() above, change the if-block for result['status'] as follows:

...
if result["status"] == "finished":
    break
elif result["status"] == "waiting-input":
    mode = "input"
    if result["options"].get("is_password", False):
        code = getpass.getpass()
    else:
        code = input()
else:
    mode = "continued"
    code = ""
...

A common gotcha is forgetting to set mode = "input". Be careful!

Handling multi-media outputs

The handle_media() function used in the examples above would look like:

def handle_media(record):
    media_type = record[0]  # MIME-Type string
    media_data = record[1]  # content
    ...

The exact method to process media_data depends on the media_type. Currently the following behaviors are well-defined:

  • For (binary-format) images, the content is a dataURI-encoded string.

  • For SVG (scalable vector graphics) images, the content is an XML string.

  • For application/x-sorna-drawing, the content is a JSON string that represents a set of vector drawing commands to be replayed on the client side (e.g., JavaScript in browsers)
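Based on the behaviors listed above, a handle_media() implementation could look like the following sketch. The decoding details are illustrative assumptions, not the SDK's prescribed handling:

```python
import base64
import json

def handle_media(record):
    """Dispatch a multimedia console record by its MIME type."""
    media_type, media_data = record[0], record[1]
    if media_type.startswith("image/svg"):
        return media_data                 # an XML string; render as-is
    if media_type.startswith("image/"):
        # Binary images arrive as a dataURI-encoded string:
        # "data:<mime>;base64,<payload>"
        _, _, payload = media_data.partition(",")
        return base64.b64decode(payload)  # raw image bytes
    if media_type == "application/x-sorna-drawing":
        return json.loads(media_data)     # vector drawing commands to replay
    raise ValueError(f"unsupported media type: {media_type}")
```

A real application would forward the decoded content to its rendering layer instead of returning it.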

Asynchronous mode

The async version provides all the sync-version interfaces as coroutines, and comes with additional features such as stream_execute(), which streams execution results via WebSockets, and stream_pty() for interactive terminal streaming.

import asyncio
import getpass
import json
import sys
import aiohttp
from ai.backend.client.session import AsyncSession

async def main():
    async with AsyncSession() as api_session:
        compute_sess = await api_session.ComputeSession.get_or_create(
            "python:3.6-ubuntu18.04",
            client_token="mysession",
        )
        code = 'print("hello world")'
        mode = "query"
        try:
            async with compute_sess.stream_execute(code, mode=mode) as stream:
                # no need for explicit run_id since WebSocket connection represents it!
                async for result in stream:
                    if result.type != aiohttp.WSMsgType.TEXT:
                        continue
                    result = json.loads(result.data)
                    for rec in result.get("console", []):
                        if rec[0] == "stdout":
                            print(rec[1], end="", file=sys.stdout)
                        elif rec[0] == "stderr":
                            print(rec[1], end="", file=sys.stderr)
                        else:
                            handle_media(rec)
                    sys.stdout.flush()
                    if result["status"] == "finished":
                        break
                    elif result["status"] == "waiting-input":
                        mode = "input"
                        if result["options"].get("is_password", False):
                            code = getpass.getpass()
                        else:
                            code = input()
                        await stream.send_text(code)
                    else:
                        mode = "continued"
                        code = ""
        finally:
            await compute_sess.destroy()

if __name__ == "__main__":
    asyncio.run(main())

Added in version 19.03.

Working with model service

In addition to a working AppProxy v2 deployment, model service requires a resource group configured to accept inference workloads.

Starting model service
from ai.backend.client.session import Session

with Session() as api_session:
    compute_sess = api_session.Service.create(
        "python:3.6-ubuntu18.04",
        "Llama2-70B",
        1,
        service_name="Llama2-service",
        resources={"cuda.shares": 2, "cpu": 8, "mem": "64g"},
        open_to_public=False,
    )

If you set open_to_public=True, the endpoint accepts anonymous traffic without the authentication token (see below).

Making request to model service endpoint
import requests

from ai.backend.client.session import Session

with Session() as api_session:
    compute_sess = api_session.Service.create(...)
    service_info = compute_sess.info()
    endpoint = service_info["url"]  # this value can be None if no successful inference service deployment has been made

    token_info = compute_sess.generate_api_token("3600s")
    token = token_info["token"]
    headers = {"Authorization": f"BackendAI {token}"}  # providing token is not required for public model services
    resp = requests.get(f"{endpoint}/v1/models", headers=headers)

The token returned by the generate_api_token() method is a JSON Web Token (JWT), which conveys all the information required to authenticate the inference request. Once generated, it cannot be revoked. A token may have its own expiration date/time; the lifetime is configured by the user who deploys the inference model, and there are currently no intrinsic minimum/maximum limits on it.
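Since the token is a standard JWT, you can inspect its payload locally, for example to check the expiration claim. This sketch only base64-decodes the payload; it does not verify the signature and is not part of the client SDK:

```python
import base64
import json

def jwt_expiry(token: str):
    """Return the 'exp' claim of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")
```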

Added in version 23.09.

Testing

Unit Tests

Unit tests verify functions one by one to ensure their individual functionality. This test suite does not depend on the server side, so it is executed in Travis CI for every push.

How to run
$ python -m pytest -m 'not integration' tests
Integration Tests

Integration tests combine multiple invocations of high-level interfaces to make underlying API requests to a running gateway server to test the full functionality of the client as well as the manager.

They are marked as “integration” by applying the @pytest.mark.integration decorator to each test case.

Warning

The integration tests actually make changes to the target gateway server and agents. If some tests fail, those changes may remain in an inconsistent state and require manual recovery, such as resetting the database and populating the fixtures again, even though the test suite tries to clean them up properly.

So, DO NOT RUN them against your production server.

Prerequisite

Please refer to the README files of the manager and agent repositories to set them up. To avoid an indefinite waiting time while pulling Docker images:

  • (manager) python -m ai.backend.manager.cli image rescan

  • (agent) docker pull

    • lablup/python:3.6-ubuntu18.04

    • lablup/lua:5.3-alpine3.8

The manager must also have at least the following active super-admin account in the default domain and the default group.

  • Example super-admin account:

    • User ID: admin@lablup.com

    • Password: wJalrXUt

    • Access key: AKIAIOSFODNN7EXAMPLE

    • Secret key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

One or more testing-XXXX domains, one or more testing-XXXX groups, and one or more dummy users are created during the tests and destroyed after running them. XXXX is filled with random identifiers.

The halfstack configuration and the example-users.json, example-keypairs.json, and example-set-user-main-access-keys.json fixtures are compatible with this integration test suite.

How to run

Execute the gateway and at least one agent in their respective virtualenvs and hosts:

$ python -m ai.backend.gateway.server
$ python -m ai.backend.agent.server
$ python -m ai.backend.agent.watcher

Then run the tests:

$ export BACKEND_ENDPOINT=...
$ python -m pytest -m 'integration' tests

High-level Function Interface

Admin Functions

class ai.backend.client.func.admin.Admin

Provides the function interface for making admin GraphQL queries.

Note

Depending on the privilege of your API access key, you may or may not have access to querying/mutating server-side resources of other users.

classmethod await query(query, variables=None)

Sends the GraphQL query and returns the response.

Parameters:
  • query (str) – The GraphQL query string.

  • variables (Optional[Mapping[str, Any]]) – An optional key-value dictionary to fill the interpolated template variables in the query.

Return type:

Any

Returns:

The object parsed from the response JSON string.

Agent Functions

class ai.backend.client.func.agent.Agent

Provides shortcuts of Admin.query() that fetch various agent information.

Note

All methods in this function class require your API access key to have the admin privilege.

classmethod await paginated_list(status='ALIVE', scaling_group=None, *, fields=(FieldSpec(field_ref='id', humanized_name='ID', field_name='id', alt_name='id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status', humanized_name='Status', field_name='status', alt_name='status', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='scaling_group', humanized_name='Scaling Group', field_name='scaling_group', alt_name='scaling_group', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='available_slots', humanized_name='Available Slots', field_name='available_slots', alt_name='available_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={}), FieldSpec(field_ref='occupied_slots', humanized_name='Occupied Slots', field_name='occupied_slots', alt_name='occupied_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={})), page_offset=0, page_size=20, filter=None, order=None)

Lists the agents. You need an admin privilege for this operation.

Return type:

PaginatedResult

classmethod await detail(agent_id, fields=(FieldSpec(field_ref='id', humanized_name='ID', field_name='id', alt_name='id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status', humanized_name='Status', field_name='status', alt_name='status', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='scaling_group', humanized_name='Scaling Group', field_name='scaling_group', alt_name='scaling_group', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='addr', humanized_name='Addr', field_name='addr', alt_name='addr', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='region', humanized_name='Region', field_name='region', alt_name='region', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='first_contact', humanized_name='First Contact', field_name='first_contact', alt_name='first_contact', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='cpu_cur_pct', humanized_name='CPU Usage (%)', field_name='cpu_cur_pct', alt_name='cpu_cur_pct', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='mem_cur_bytes', humanized_name='Used Memory (MiB)', field_name='mem_cur_bytes', alt_name='mem_cur_bytes', formatter=<ai.backend.client.output.formatters.MiBytesOutputFormatter object>, subfields={}), FieldSpec(field_ref='available_slots', humanized_name='Available Slots', field_name='available_slots', alt_name='available_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={}), FieldSpec(field_ref='occupied_slots', humanized_name='Occupied Slots', field_name='occupied_slots', alt_name='occupied_slots', 
formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={}), FieldSpec(field_ref='local_config', humanized_name='Local Config', field_name='local_config', alt_name='local_config', formatter=<ai.backend.client.output.formatters.NestedDictOutputFormatter object>, subfields={})))
Return type:

Sequence[dict]

Auth Functions

class ai.backend.client.func.auth.Auth

Provides the function interface for login session management and authorization.

classmethod await login(user_id, password, otp=None)

Logs in to the endpoint with the given user ID and password. It creates a server-side web session and returns a dictionary with an "authenticated" boolean field and JSON-encoded raw cookie data.

Return type:

dict

classmethod await logout()

Logs out from the endpoint. It clears the server-side web session.

Return type:

None

classmethod await update_password(old_password, new_password, new_password2)

Updates the user's password. This API works only for the account owner.

Return type:

dict

classmethod await update_password_no_auth(domain, user_id, current_password, new_password)

Updates the user's password. This is used only to update an EXPIRED password. This function sends a request to the manager.

Return type:

dict

classmethod await update_password_no_auth_in_session(user_id, current_password, new_password)

Updates the user's password. This is used only to update an EXPIRED password. This function sends a request to the webserver.

Return type:

dict

Configuration

ai.backend.client.config.get_env(key, default=Undefined.token, *, clean=<function default_clean>)

Retrieves a configuration value from the environment variables. The given key is uppercased and prefixed with "BACKEND_"; if that variable does not exist, the "SORNA_" prefix is tried.

Parameters:
  • key (str) – The key name.

  • default (Union[str, Mapping, Undefined]) – The default value returned when there is no corresponding environment variable.

  • clean (Callable[[Any], TypeVar(T)]) – A single-argument function applied to the result of the lookup (both to successful lookups and to the default value on failure). The default is to return the value as-is.

Return type:

TypeVar(T)

Returns:

The value processed by the clean function.
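The documented lookup order can be sketched as follows. This is a hypothetical, simplified re-implementation of get_env(); it raises KeyError instead of using the Undefined sentinel that the real function uses:

```python
import os

def get_env_sketch(key, default=None, *, clean=lambda v: v):
    """Look up BACKEND_<KEY> first, then the legacy SORNA_<KEY>, then fall back."""
    key = key.upper()
    for prefix in ("BACKEND_", "SORNA_"):
        try:
            return clean(os.environ[prefix + key])
        except KeyError:
            continue
    if default is None:
        raise KeyError(key)
    return clean(default)
```

For example, setting BACKEND_DEMO_TIMEOUT makes get_env_sketch("demo_timeout") return it even if SORNA_DEMO_TIMEOUT is also set, mirroring the precedence described above.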

ai.backend.client.config.get_config()

Returns the configuration for the current process. If there is no explicitly set APIConfig instance, it will generate a new one from the current environment variables and defaults.

반환 형식:

APIConfig

ai.backend.client.config.set_config(conf)

Sets the configuration used throughout the current process.

반환 형식:

None

class ai.backend.client.config.APIConfig(*, endpoint=None, endpoint_type=None, domain=None, group=None, storage_proxy_address_map=None, version=None, user_agent=None, access_key=None, secret_key=None, hash_type=None, vfolder_mounts=None, skip_sslcert_validation=None, connection_timeout=None, read_timeout=None, announcement_handler=None)

Represents a set of API client configurations. The access key and secret key are mandatory – they must be set in either environment variables or as the explicit arguments.

매개변수:
  • endpoint (Union[URL, str]) – The URL prefix to make API requests via HTTP/HTTPS. If this is given as str and contains multiple URLs separated by commas, the underlying HTTP request-response facility performs client-side load balancing and automatic fail-over using them, assuming that all those URLs indicate a single, same cluster. The users of the API and CLI will see network connection errors only when all of the given endpoints fail; intermittent failures of a subset of endpoints are hidden at the cost of slightly increased latency.

  • endpoint_type (str) – Either "api" or "session". If the endpoint type is "api" (the default if unspecified), it uses the access key and secret key in the configuration to access the manager API server directly. If the endpoint type is "session", it assumes the endpoint is a Backend.AI console server which provides cookie-based authentication with username and password. In the latter, users need to use backend.ai login and backend.ai logout to manage their sign-in status, or the API equivalent in login() and logout() methods.

  • version (str) – The API protocol version.

  • user_agent (str) – A custom user-agent string which is sent to the API server as a User-Agent HTTP header.

  • access_key (str) – The API access key. If deliberately set to an empty string, the API requests will be made without signatures (anonymously).

  • secret_key (str) – The API secret key.

  • hash_type (str) – The hash type to generate per-request authentication signatures.

  • vfolder_mounts (Iterable[str]) – A list of vfolder names (that must belong to the given access key) to be automatically mounted upon any Kernel.get_or_create() calls.

DEFAULTS: Mapping[str, Union[str, Mapping]] = {'connection_timeout': '10.0', 'domain': 'default', 'endpoint': 'https://api.cloud.backend.ai', 'endpoint_type': 'api', 'group': 'default', 'hash_type': 'sha256', 'read_timeout': '0', 'storage_proxy_address_map': {}, 'version': 'v7.20230615'}

The default values for config parameters settable via environment variables except the access and secret keys.

property endpoint: URL

The currently active endpoint URL. This may change if there are multiple configured endpoints and the current one is not accessible.

property endpoints: Sequence[URL]

All configured endpoint URLs.

property endpoint_type: str

The configured endpoint type.

property domain: str

The configured domain.

property group: str

The configured group.

property storage_proxy_address_map: Mapping[str, str]

The storage proxy address map for overriding.

property user_agent: str

The configured user agent string.

property access_key: str

The configured API access key.

property secret_key: str

The configured API secret key.

property version: str

The configured API protocol version.

property hash_type: str

The configured hash algorithm for API authentication signatures.

property vfolder_mounts: Sequence[str]

The configured auto-mounted vfolder list.

property skip_sslcert_validation: bool

Whether to skip SSL certificate validation for the API gateway.

property connection_timeout: float

The maximum allowed duration for making TCP connections to the server.

property read_timeout: float

The maximum allowed waiting time for the first byte of the response from the server.

property announcement_handler: Callable[[str], None] | None

The announcement handler to display server-set announcements.

KeyPair Functions

class ai.backend.client.func.keypair.KeyPair(access_key)

Provides interactions with keypairs.

classmethod await create(user_id, is_active=True, is_admin=False, resource_policy=Undefined.TOKEN, rate_limit=Undefined.TOKEN, fields=(FieldSpec(field_ref='access_key', humanized_name='Access Key', field_name='access_key', alt_name='access_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='secret_key', humanized_name='Secret Key', field_name='secret_key', alt_name='secret_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})))

Creates a new keypair with the given options. You need an admin privilege for this operation.

반환 형식:

dict

classmethod await update(access_key, is_active=Undefined.TOKEN, is_admin=Undefined.TOKEN, resource_policy=Undefined.TOKEN, rate_limit=Undefined.TOKEN)

Updates an existing keypair with the given options. You need an admin privilege for this operation.

반환 형식:

dict

classmethod await delete(access_key)

Deletes an existing keypair with the given access_key.

classmethod await list(user_id=None, is_active=None, fields=(FieldSpec(field_ref='user_id', humanized_name='Email', field_name='user_id', alt_name='user_id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='access_key', humanized_name='Access Key', field_name='access_key', alt_name='access_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='secret_key', humanized_name='Secret Key', field_name='secret_key', alt_name='secret_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_active', humanized_name='Active?', field_name='is_active', alt_name='is_active', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_admin', humanized_name='Admin?', field_name='is_admin', alt_name='is_admin', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='created_at', humanized_name='Created At', field_name='created_at', alt_name='created_at', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})))

Lists the keypairs. You need an admin privilege for this operation.

반환 형식:

Sequence[dict]

classmethod await paginated_list(is_active=None, domain_name=None, *, user_id=None, fields=(FieldSpec(field_ref='user_id', humanized_name='Email', field_name='user_id', alt_name='user_id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='access_key', humanized_name='Access Key', field_name='access_key', alt_name='access_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='secret_key', humanized_name='Secret Key', field_name='secret_key', alt_name='secret_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_active', humanized_name='Active?', field_name='is_active', alt_name='is_active', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_admin', humanized_name='Admin?', field_name='is_admin', alt_name='is_admin', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='created_at', humanized_name='Created At', field_name='created_at', alt_name='created_at', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})), page_offset=0, page_size=20, filter=None, order=None)

Lists the keypairs. You need an admin privilege for this operation.

반환 형식:

PaginatedResult[dict]

await info(fields=(FieldSpec(field_ref='user_id', humanized_name='Email', field_name='user_id', alt_name='user_id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='access_key', humanized_name='Access Key', field_name='access_key', alt_name='access_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='secret_key', humanized_name='Secret Key', field_name='secret_key', alt_name='secret_key', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_active', humanized_name='Active?', field_name='is_active', alt_name='is_active', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_admin', humanized_name='Admin?', field_name='is_admin', alt_name='is_admin', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})))

Returns the keypair’s information such as resource limits.

매개변수:

fields (Sequence[FieldSpec]) – Additional per-agent query fields to fetch.

반환 형식:

dict

버전 18.12에 추가.

classmethod await activate(access_key)

Activates this keypair. You need an admin privilege for this operation.

반환 형식:

dict

classmethod await deactivate(access_key)

Deactivates this keypair. Deactivated keypairs cannot make any API requests unless activated again by an administrator. You need an admin privilege for this operation.

반환 형식:

dict

Manager Functions

class ai.backend.client.func.manager.Manager

Provides control of the gateway/manager servers.

버전 18.12에 추가.

classmethod await status()

Returns the current status of the configured API server.

classmethod await freeze(force_kill=False)

Freezes the configured API server. Any API clients will no longer be able to create new compute sessions nor create and modify vfolders/keypairs/etc. This is used to enter the maintenance mode of the server for unobtrusive manager and/or agent upgrades.

매개변수:

force_kill (bool) – If set to True, immediately and forcibly shuts down all running compute sessions. If not set, clients with running compute sessions can still interact with them, though they cannot create new compute sessions.

classmethod await unfreeze()

Unfreezes the configured API server so that it resumes normal operation.
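The freeze contract described above (existing sessions stay interactive, new session creation is rejected until unfreeze) can be sketched with a small stand-in object. This stub is purely illustrative and is not part of the client SDK:

```python
class ManagerStub:
    """Illustrative stand-in showing the freeze/unfreeze contract."""

    def __init__(self):
        self.frozen = False
        self.sessions = []

    def freeze(self, force_kill=False):
        self.frozen = True
        if force_kill:
            self.sessions.clear()  # forcibly shut down running sessions

    def unfreeze(self):
        self.frozen = False

    def create_session(self, name):
        if self.frozen:
            raise RuntimeError("server is in maintenance mode")
        self.sessions.append(name)

    def execute(self, name, code):
        # Interacting with an existing session still works while frozen.
        if name not in self.sessions:
            raise KeyError(name)
        return f"ran {code!r} in {name}"

mgr = ManagerStub()
mgr.create_session("train-1")
mgr.freeze()                               # maintenance mode: no new sessions
print(mgr.execute("train-1", "step()"))    # existing session still usable
mgr.unfreeze()
mgr.create_session("train-2")              # allowed again
```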

classmethod await get_announcement()

Gets the current announcement.

classmethod await update_announcement(enabled=True, message=None)

Updates (creates or deletes) the announcement.

매개변수:
  • enabled (bool) – If set to False, deletes the announcement.

  • message (str) – Announcement message. Required if enabled is True.

classmethod await scheduler_op(op, args)

Performs a scheduler operation.

매개변수:
  • op (str) – The name of the scheduler operation.

  • args (Any) – Arguments specific to the given operation.

Scaling Group Functions

class ai.backend.client.func.scaling_group.ScalingGroup(name)

Provides scaling-group information relevant to the current user.

The scaling-group is an opaque server-side configuration which splits the whole cluster into several partitions, so that server administrators can apply different auto-scaling policies and operation standards to each partition of agent sets.

classmethod await list_available(group)

Lists the scaling groups available to the current user, considering the user, the user’s domain, and the designated user group.

classmethod await list(fields=(FieldSpec(field_ref='name', humanized_name='Name', field_name='name', alt_name='name', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='description', humanized_name='Description', field_name='description', alt_name='description', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_active', humanized_name='Active?', field_name='is_active', alt_name='is_active', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_public', humanized_name='Public?', field_name='is_public', alt_name='is_public', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='created_at', humanized_name='Created At', field_name='created_at', alt_name='created_at', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='driver', humanized_name='Driver', field_name='driver', alt_name='driver', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='scheduler', humanized_name='Scheduler', field_name='scheduler', alt_name='scheduler', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='use_host_network', humanized_name='Use Host Network', field_name='use_host_network', alt_name='use_host_network', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='wsproxy_addr', humanized_name='Wsproxy Addr', field_name='wsproxy_addr', alt_name='wsproxy_addr', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='wsproxy_api_token', humanized_name='Wsproxy Api Token', field_name='wsproxy_api_token', alt_name='wsproxy_api_token', formatter=<ai.backend.client.output.formatters.OutputFormatter 
object>, subfields={})))

Lists the scaling groups available to the current user, considering the user, the user’s domain, and the designated user group.

반환 형식:

Sequence[dict]

classmethod await detail(name, fields=(FieldSpec(field_ref='name', humanized_name='Name', field_name='name', alt_name='name', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='description', humanized_name='Description', field_name='description', alt_name='description', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_active', humanized_name='Active?', field_name='is_active', alt_name='is_active', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='is_public', humanized_name='Public?', field_name='is_public', alt_name='is_public', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='created_at', humanized_name='Created At', field_name='created_at', alt_name='created_at', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='driver', humanized_name='Driver', field_name='driver', alt_name='driver', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='driver_opts', humanized_name='Driver Opts', field_name='driver_opts', alt_name='driver_opts', formatter=<ai.backend.client.output.formatters.NestedDictOutputFormatter object>, subfields={}), FieldSpec(field_ref='scheduler', humanized_name='Scheduler', field_name='scheduler', alt_name='scheduler', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='scheduler_opts', humanized_name='Scheduler Opts', field_name='scheduler_opts', alt_name='scheduler_opts', formatter=<ai.backend.client.output.formatters.NestedDictOutputFormatter object>, subfields={}), FieldSpec(field_ref='use_host_network', humanized_name='Use Host Network', field_name='use_host_network', alt_name='use_host_network', 
formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='wsproxy_addr', humanized_name='Wsproxy Addr', field_name='wsproxy_addr', alt_name='wsproxy_addr', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='wsproxy_api_token', humanized_name='Wsproxy Api Token', field_name='wsproxy_api_token', alt_name='wsproxy_api_token', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})))

Fetches the information of a scaling group by name.

매개변수:
  • name (str) – Name of the scaling group.

  • fields (Sequence[FieldSpec]) – Additional per-scaling-group query fields.

반환 형식:

dict

classmethod await create(name, *, description='', is_active=True, is_public=True, driver, driver_opts=Undefined.TOKEN, scheduler, scheduler_opts=Undefined.TOKEN, use_host_network=False, wsproxy_addr=None, wsproxy_api_token=None, fields=None)

Creates a new scaling group with the given options.

반환 형식:

dict

classmethod await update(name, *, description=Undefined.TOKEN, is_active=Undefined.TOKEN, is_public=Undefined.TOKEN, driver=Undefined.TOKEN, driver_opts=Undefined.TOKEN, scheduler=Undefined.TOKEN, scheduler_opts=Undefined.TOKEN, use_host_network=Undefined.TOKEN, wsproxy_addr=Undefined.TOKEN, wsproxy_api_token=Undefined.TOKEN, fields=None)

Updates an existing scaling group.

반환 형식:

dict

classmethod await delete(name)

Deletes an existing scaling group.

classmethod await associate_domain(scaling_group, domain)

Associates scaling_group with domain.

매개변수:
  • scaling_group (str) – The name of a scaling group.

  • domain (str) – The name of a domain.

classmethod await dissociate_domain(scaling_group, domain)

Dissociates scaling_group from domain.

매개변수:
  • scaling_group (str) – The name of a scaling group.

  • domain (str) – The name of a domain.

classmethod await dissociate_all_domain(domain)

Dissociates all scaling_groups from domain.

매개변수:

domain (str) – The name of a domain.

classmethod await associate_group(scaling_group, group_id)

Associates scaling_group with group.

매개변수:
  • scaling_group (str) – The name of a scaling group.

  • group_id (str) – The ID of a group.

classmethod await dissociate_group(scaling_group, group_id)

Dissociates scaling_group from group.

매개변수:
  • scaling_group (str) – The name of a scaling group.

  • group_id (str) – The ID of a group.

classmethod await dissociate_all_group(group_id)

Dissociates all scaling_groups from group.

매개변수:

group_id (str) – The ID of a group.

ComputeSession Functions

class ai.backend.client.func.session.ComputeSession(name, owner_access_key=None)

Provides various interactions with compute sessions in Backend.AI.

The term ‘kernel’ is now deprecated and we prefer ‘compute session’. However, for historical reasons and to avoid confusion with client sessions, we keep backward compatibility in the naming of this API function class.

For multi-container sessions, all methods take effect on the master container only, except the destroy() and restart() methods. It is therefore the user’s responsibility to distribute uploaded files to multiple containers, using explicit copies or virtual folders that are commonly mounted to all containers belonging to the same compute session.

classmethod await paginated_list(status=None, access_key=None, *, fields=(FieldSpec(field_ref='id', humanized_name='Session ID', field_name='id', alt_name='session_id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='image', humanized_name='Image', field_name='image', alt_name='image', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='type', humanized_name='Type', field_name='type', alt_name='type', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status', humanized_name='Status', field_name='status', alt_name='status', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status_info', humanized_name='Status Info', field_name='status_info', alt_name='status_info', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status_changed', humanized_name='Last Updated', field_name='status_changed', alt_name='status_changed', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='result', humanized_name='Result', field_name='result', alt_name='result', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='abusing_reports', humanized_name='Abusing Reports', field_name='abusing_reports', alt_name='abusing_reports', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})), page_offset=0, page_size=20, filter=None, order=None)

Fetches the list of sessions.

매개변수:
  • status (str) – Fetches sessions in a specific status (PENDING, SCHEDULED, PULLING, PREPARING, RUNNING, RESTARTING, RUNNING_DEGRADED, TERMINATING, TERMINATED, ERROR, CANCELLED).

  • fields (Sequence[FieldSpec]) – Additional per-session query fields to fetch.

반환 형식:

PaginatedResult[dict]
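The page_offset and page_size parameters of the paginated_list-style APIs behave like a plain offset-based window over the full result set. A local sketch of that behavior (not the actual server-side implementation):

```python
def paginate(items, page_offset=0, page_size=20):
    """Offset-based pagination as exposed by the paginated_list APIs:
    returns the total count plus one window of results."""
    return {
        "total_count": len(items),
        "items": items[page_offset:page_offset + page_size],
    }

# Second page of a 45-item result set with the default page size.
sessions = [{"id": i, "status": "RUNNING"} for i in range(45)]
page = paginate(sessions, page_offset=20, page_size=20)
print(page["total_count"])        # 45
print(len(page["items"]))         # 20
print(page["items"][0]["id"])     # 20
```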

classmethod await get_or_create(image, *, name=None, type_='interactive', starts_at=None, enqueue_only=False, max_wait=0, no_reuse=False, dependencies=None, callback_url=None, mounts=None, mount_map=None, envs=None, startup_command=None, resources=None, resource_opts=None, cluster_size=1, cluster_mode='single-node', domain_name=None, group_name=None, bootstrap_script=None, tag=None, architecture='x86_64', scaling_group=None, owner_access_key=None, preopen_ports=None, assign_agent=None)

Get-or-creates a compute session. If name is None, it creates a new compute session as long as the server has enough resources and your API key has remaining quota. If name is a valid string and there is an existing compute session with the same name and the same image, then it returns the ComputeSession instance representing the existing session.

매개변수:
  • image (str) – The image name and tag for the compute session. Example: python:3.6-ubuntu. Check out the full list of available images in your server using (TODO: new API).

  • name (str) –

    A client-side (user-defined) identifier to distinguish the session among currently running sessions. It may be used to seamlessly reuse the session already created.

    버전 19.12.0에서 변경: Renamed from clientSessionToken.

  • type

    Either "interactive" (default) or "batch".

    버전 19.09.0에 추가.

  • enqueue_only (bool) –

    Just enqueue the session creation request and return immediately, without waiting for its startup. (default: false to preserve the legacy behavior)

    버전 19.09.0에 추가.

  • max_wait (int) –

    The time to wait for session startup. If the cluster resources are fully utilized, this waiting time can be arbitrarily long due to job queueing. If the timeout is reached, the returned status field becomes "TIMEOUT". Even in this case, the session may still start in the future.

    버전 19.09.0에 추가.

  • no_reuse (bool) –

    Raises an explicit error if a session with the same image and the same name already exists, instead of returning its information.

    버전 19.09.0에 추가.

  • mounts (List[str]) – The list of vfolder names that belong to the current API access key.

  • mount_map (Mapping[str, str]) – A mapping of custom mount paths for vfolders, where each key is a vfolder name and each value is its custom mount path. Default mounts and relative paths are placed under /home/work; to mount at a different location, the value must be an absolute path. The target mount paths must not overlap with the Linux system directories. vfolders whose names have a dot (.) prefix are not affected.

  • envs (Mapping[str, str]) – The environment variables, which always bypass the jail policy.

  • resources (Mapping[str, str | int]) – The resource specification. (TODO: details)

  • cluster_size (int) –

    The number of containers in this compute session. Must be at least 1.

    버전 19.09.0에 추가.

    버전 20.09.0에서 변경.

  • cluster_mode (Literal['single-node', 'multi-node']) –

    Sets the clustering mode: whether to spawn the new session’s multiple containers on distributed nodes or on a single node.

    버전 20.09.0에 추가.

  • tag (str) – An optional string to annotate extra information.

  • owner – An optional access key that owns the created session. (Only available to administrators)

반환 형식:

ComputeSession

반환:

The ComputeSession instance.
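The reuse rule of get_or_create() (a matching name with the same image returns the existing session, a different image is an error, and no_reuse turns reuse into an error) can be sketched with a local registry. This class is a stand-in for illustration only, not the SDK:

```python
class SessionRegistry:
    """Illustrative stand-in for the get-or-create semantics."""

    def __init__(self):
        self._sessions = {}  # name -> image

    def get_or_create(self, image, *, name=None, no_reuse=False):
        if name is not None and name in self._sessions:
            if self._sessions[name] == image:
                if no_reuse:
                    raise RuntimeError(f"session {name!r} already exists")
                return ("reused", name)
            raise RuntimeError(f"session {name!r} exists with another image")
        # No name given, or no matching session: create a new one.
        key = name or f"anon-{len(self._sessions)}"
        self._sessions[key] = image
        return ("created", key)

reg = SessionRegistry()
print(reg.get_or_create("python:3.6-ubuntu", name="work"))  # ('created', 'work')
print(reg.get_or_create("python:3.6-ubuntu", name="work"))  # ('reused', 'work')
```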

classmethod await create_from_template(template_id, *, name=Undefined.TOKEN, type_=Undefined.TOKEN, starts_at=None, enqueue_only=Undefined.TOKEN, max_wait=Undefined.TOKEN, dependencies=None, callback_url=Undefined.TOKEN, no_reuse=Undefined.TOKEN, image=Undefined.TOKEN, mounts=Undefined.TOKEN, mount_map=Undefined.TOKEN, envs=Undefined.TOKEN, startup_command=Undefined.TOKEN, resources=Undefined.TOKEN, resource_opts=Undefined.TOKEN, cluster_size=Undefined.TOKEN, cluster_mode=Undefined.TOKEN, domain_name=Undefined.TOKEN, group_name=Undefined.TOKEN, bootstrap_script=Undefined.TOKEN, tag=Undefined.TOKEN, scaling_group=Undefined.TOKEN, owner_access_key=Undefined.TOKEN)

Get-or-creates a compute session from a template. All other parameters provided override the template’s values, including vfolder mounts (they are replaced, not appended!). If name is None, it creates a new compute session as long as the server has enough resources and your API key has remaining quota. If name is a valid string and there is an existing compute session with the same name and the same image, then it returns the ComputeSession instance representing the existing session.

매개변수:
  • template_id (str) – Task template to apply to compute session.

  • image (str | Undefined) – The image name and tag for the compute session. Example: python:3.6-ubuntu. Check out the full list of available images in your server using (TODO: new API).

  • name (str | Undefined) –

    A client-side (user-defined) identifier to distinguish the session among currently running sessions. It may be used to seamlessly reuse the session already created.

    버전 19.12.0에서 변경: Renamed from clientSessionToken.

  • type

    Either "interactive" (default) or "batch".

    버전 19.09.0에 추가.

  • enqueue_only (bool | Undefined) –

    Just enqueue the session creation request and return immediately, without waiting for its startup. (default: false to preserve the legacy behavior)

    버전 19.09.0에 추가.

  • max_wait (int | Undefined) –

    The time to wait for session startup. If the cluster resources are fully utilized, this waiting time can be arbitrarily long due to job queueing. If the timeout is reached, the returned status field becomes "TIMEOUT". Even in this case, the session may still start in the future.

    버전 19.09.0에 추가.

  • no_reuse (bool | Undefined) –

    Raises an explicit error if a session with the same image and the same name already exists, instead of returning its information.

    버전 19.09.0에 추가.

  • mounts (Union[List[str], Undefined]) – The list of vfolder names that belong to the current API access key.

  • mount_map (Union[Mapping[str, str], Undefined]) – A mapping of custom mount paths for vfolders, where each key is a vfolder name and each value is its custom mount path. Default mounts and relative paths are placed under /home/work; to mount at a different location, the value must be an absolute path. The target mount paths must not overlap with the Linux system directories. vfolders whose names have a dot (.) prefix are not affected.

  • envs (Union[Mapping[str, str], Undefined]) – The environment variables, which always bypass the jail policy.

  • resources (Union[Mapping[str, str | int], Undefined]) – The resource specification. (TODO: details)

  • cluster_size (int | Undefined) –

    The number of containers in this compute session. Must be at least 1.

    버전 19.09.0에 추가.

    버전 20.09.0에서 변경.

  • cluster_mode (Union[Literal['single-node', 'multi-node'], Undefined]) –

    Sets the clustering mode: whether to spawn the new session’s multiple containers on distributed nodes or on a single node.

    버전 20.09.0에 추가.

  • tag (str | Undefined) – An optional string to annotate extra information.

  • owner – An optional access key that owns the created session. (Only available to administrators)

반환 형식:

ComputeSession

반환:

The ComputeSession instance.

await destroy(*, forced=False, recursive=False)

Destroys the compute session. Since the server literally kills the container(s), all ongoing executions are forcibly interrupted.

await restart()

Restarts the compute session. The server force-destroys the current running container(s), but keeps their temporary scratch directories intact.

await rename(new_id)

Renames the session ID of a running compute session.

await commit()

Commits a running session to a tar file on the agent host.

await interrupt()

Tries to interrupt the current ongoing code execution. This may fail without any explicit errors depending on the code being executed.

await complete(code, opts=None)

Gets the auto-completion candidates from the given code string, as if a user has pressed the tab key just after the code in IDEs.

Depending on the language of the compute session, this feature may not be supported. Unsupported sessions return an empty list.

매개변수:
  • code (str) – An (incomplete) code text.

  • opts (dict) – Additional information about the current cursor position, such as row, col, line and the remainder text.

반환 형식:

Iterable[str]

반환:

An ordered list of strings.

await get_info()

Retrieves brief information about the compute session.

await get_logs()

Retrieves the console log of the compute session container.

await get_dependency_graph()

Retrieves the root node of the dependency graph of the compute session.

await get_status_history()

Retrieves the status transition history of the compute session.

await execute(run_id=None, code=None, mode='query', opts=None)

Executes a code snippet directly in the compute session or sends a set of build/clean/execute commands to the compute session.

For more details about using this API, please refer to the official API documentation.

매개변수:
  • run_id (str) – A unique identifier for a particular run loop. In the first call, it may be None so that the server auto-assigns one. Subsequent calls must use the returned runId value to request continuation or to send user inputs.

  • code (str) – A code snippet as a string. In continuation requests, it must be an empty string. When sending user inputs, this is where the user input string is stored.

  • mode (str) – A constant string which is one of "query", "batch", "continue", and "user-input".

  • opts (dict) – A dict for specifying additional options. Mainly used in the batch mode to specify build/clean/execution commands. See the API object reference for details.

반환:

An execution result object
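The run loop implied by the run_id and mode parameters (the first call carries the code, continuation calls reuse the returned runId with an empty code string until the status settles) can be sketched with a stub session. The stub and its result fields mirror the description above but are illustrative, not the real wire protocol:

```python
class ExecStub:
    """Stand-in session that finishes after two continuation calls."""

    def __init__(self):
        self._steps = 0

    def execute(self, run_id=None, code=None, mode="query", opts=None):
        if run_id is None:
            run_id = "run-1"          # server auto-assigns on the first call
        self._steps += 1
        status = "finished" if self._steps >= 3 else "continued"
        return {"runId": run_id, "status": status,
                "console": [f"chunk {self._steps}"]}

def run_to_completion(session, code):
    """Client-side run loop: the first call carries the code, continuation
    calls reuse the runId with an empty code string."""
    result = session.execute(code=code, mode="query")
    output = list(result["console"])
    while result["status"] == "continued":
        result = session.execute(run_id=result["runId"], code="", mode="continue")
        output.extend(result["console"])
    return output

print(run_to_completion(ExecStub(), "print('hello')"))
# ['chunk 1', 'chunk 2', 'chunk 3']
```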

await upload(files, basedir=None, show_progress=False)

Uploads the given list of files to the compute session. You may refer to them in batch-mode execution or from code executed on the server afterwards.

매개변수:
  • files (Sequence[str | Path]) –

    The list of file paths on the client side. If the paths include directories, their location in the compute session is calculated from their paths relative to basedir, and all intermediate parent directories are automatically created if they do not exist.

    For example, if a file path is /home/user/test/data.txt (or test/data.txt) where basedir is /home/user (or the current working directory is /home/user), the uploaded file is located at /home/work/test/data.txt in the compute session container.

  • basedir (Union[str, Path, None]) – The directory prefix where the files reside. The default value is the current working directory.

  • show_progress (bool) – Displays a progress bar during uploads.
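The basedir-relative path mapping described for upload() can be reproduced locally with pathlib; this sketch shows only the path arithmetic, not an actual upload:

```python
from pathlib import PurePath, PurePosixPath

def target_path(client_path, basedir=None):
    """Where an uploaded file lands inside the session container:
    the path relative to basedir, re-rooted under /home/work."""
    p = PurePath(client_path)
    if basedir is not None:
        p = p.relative_to(basedir)
    return PurePosixPath("/home/work") / PurePosixPath(*p.parts)

print(target_path("/home/user/test/data.txt", basedir="/home/user"))
# /home/work/test/data.txt
print(target_path("test/data.txt"))
# /home/work/test/data.txt
```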

await download(files, dest='.', show_progress=False)

Downloads the given list of files from the compute session.

매개변수:
  • files (Sequence[str | Path]) – The list of file paths in the compute session. If they are relative paths, the path is calculated from /home/work in the compute session container.

  • dest (str | Path) – The destination directory in the client-side.

  • show_progress (bool) – Displays a progress bar during downloads.

await list_files(path='.')

Gets the list of files in the given path inside the compute session container.

매개변수:

path (str | Path) – The directory path in the compute session.

await get_abusing_report()

Retrieves the abusing reports of the session’s sibling kernels.

await start_service(app, *, port=Undefined.TOKEN, envs=Undefined.TOKEN, arguments=Undefined.TOKEN, login_session_token=Undefined.TOKEN)

Starts an application in the Backend.AI session and returns the access credentials for the AppProxy endpoint.

반환 형식:

Mapping[str, Any]

listen_events(scope='*')

Opens the stream of the kernel lifecycle events. Only the master kernel of each session is monitored.

반환 형식:

SSEContextManager

반환:

a StreamEvents object.

stream_events(scope='*')

Opens the stream of the kernel lifecycle events. Only the master kernel of each session is monitored.

반환 형식:

SSEContextManager

반환:

a StreamEvents object.

stream_pty()

Opens a pseudo-terminal of the kernel (if supported) streamed via websockets.

반환 형식:

WebSocketContextManager

반환:

a StreamPty object.

stream_execute(code='', *, mode='query', opts=None)

Executes a code snippet in the streaming mode. Since the returned websocket represents a run loop, there is no need to specify run_id explicitly.

반환 형식:

WebSocketContextManager

class ai.backend.client.func.session.StreamPty(session, underlying_response, **kwargs)

A derivative class of WebSocketResponse which provides additional functions to control the terminal.

Session Template Functions

class ai.backend.client.func.session_template.SessionTemplate(template_id, owner_access_key=None)
classmethod await create(template, domain_name=None, group_name=None, owner_access_key=None)
반환 형식:

SessionTemplate

classmethod await list_templates(list_all=False)
반환 형식:

List[Mapping[str, str]]

await get(body_format='yaml')
반환 형식:

str

await put(template)
반환 형식:

Any

await delete()
반환 형식:

Any

Virtual Folder Functions

class ai.backend.client.func.vfolder.VFolder(name)
classmethod await create(name, host=None, unmanaged_path=None, group=None, usage_mode='general', permission='rw', quota='0', cloneable=False)
classmethod await delete_by_id(oid)
classmethod await list(list_all=False)
classmethod await paginated_list(group=None, *, fields=(FieldSpec(field_ref='host', humanized_name='Host', field_name='host', alt_name='host', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='name', humanized_name='Name', field_name='name', alt_name='name', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='created_at', humanized_name='Created At', field_name='created_at', alt_name='created_at', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='creator', humanized_name='Creator', field_name='creator', alt_name='creator', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='group', humanized_name='Group', field_name='group', alt_name='group_id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='permission', humanized_name='Permission', field_name='permission', alt_name='permission', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='ownership_type', humanized_name='Ownership Type', field_name='ownership_type', alt_name='ownership_type', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status', humanized_name='Status', field_name='status', alt_name='status', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})), page_offset=0, page_size=20, filter=None, order=None)

Fetches the list of vfolders. Domain admins can only fetch the vfolders of their own domain.

매개변수:
  • group (str) – Fetch vfolders in a specific group.

  • fields (Sequence[FieldSpec]) – Additional per-vfolder query fields to fetch.

반환 형식:

PaginatedResult[dict]

classmethod await paginated_own_list(*, fields=(FieldSpec(field_ref='host', humanized_name='Host', field_name='host', alt_name='host', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='name', humanized_name='Name', field_name='name', alt_name='name', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='created_at', humanized_name='Created At', field_name='created_at', alt_name='created_at', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='creator', humanized_name='Creator', field_name='creator', alt_name='creator', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='group', humanized_name='Group', field_name='group', alt_name='group_id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='permission', humanized_name='Permission', field_name='permission', alt_name='permission', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='ownership_type', humanized_name='Ownership Type', field_name='ownership_type', alt_name='ownership_type', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status', humanized_name='Status', field_name='status', alt_name='status', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={})), page_offset=0, page_size=20, filter=None, order=None)

Fetches the list of own vfolders.

Parameters:

fields (Sequence[FieldSpec]) – Additional per-vfolder query fields to fetch.

Return type:

PaginatedResult[dict]

classmethod await paginated_invited_list(*, fields=<default vfolder fields: host, name, created_at, creator, group, permission, ownership_type, status>, page_offset=0, page_size=20, filter=None, order=None)

Fetches the list of invited vfolders.

Parameters:

fields (Sequence[FieldSpec]) – Additional per-vfolder query fields to fetch.

Return type:

PaginatedResult[dict]

classmethod await paginated_project_list(*, fields=<default vfolder fields: host, name, created_at, creator, group, permission, ownership_type, status>, page_offset=0, page_size=20, filter=None, order=None)

Fetches the list of project (group) vfolders.

Parameters:

fields (Sequence[FieldSpec]) – Additional per-vfolder query fields to fetch.

Return type:

PaginatedResult[dict]
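All of the paginated listing methods above share the same page_offset/page_size parameters, so a caller can walk every page with a simple loop. The sketch below shows that client-side pattern under a stated assumption: fetch_page is a hypothetical stand-in for one call such as paginated_list(), not part of the SDK.

```python
# A minimal sketch of consuming a paginated listing API, using only the
# page_offset/page_size parameters shown above; fetch_page is a hypothetical
# stand-in for a server call such as VFolder.paginated_list().
from typing import Callable, Dict, List


def iter_all_items(fetch_page: Callable[[int, int], List[Dict]],
                   page_size: int = 20) -> List[Dict]:
    """Collect every item by advancing page_offset until a short page arrives."""
    items: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        items.extend(page)
        if len(page) < page_size:  # a short page means the last page
            break
        offset += page_size
    return items


# A simulated backend with 45 items, served in slices.
_DATA = [{"name": f"vfolder-{i}"} for i in range(45)]

def fake_fetch(offset: int, size: int) -> List[Dict]:
    return _DATA[offset:offset + size]

all_items = iter_all_items(fake_fetch, page_size=20)
print(len(all_items))  # → 45
```

In a real deployment you would replace fake_fetch with a closure over the session-bound listing call and read the items out of the returned PaginatedResult.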

classmethod await list_hosts()
classmethod await list_all_hosts()
classmethod await list_allowed_types()
await info()
await delete()
await rename(new_name)
await download(relative_paths, *, basedir=None, dst_dir=None, chunk_size=16777216, show_progress=False, address_map=None, max_retries=20)
Return type:

None

await upload(sources, *, basedir=None, recursive=False, dst_dir=None, chunk_size=16777216, address_map=None, show_progress=False)
Return type:

None

await mkdir(path, parents=False, exist_ok=False)
Return type:

str

await rename_file(target_path, new_name)
await move_file(src_path, dst_path)
await delete_files(files, recursive=False)
await list_files(path='.')
await invite(perm, emails)
classmethod await invitations()
classmethod await accept_invitation(inv_id)
classmethod await delete_invitation(inv_id)
classmethod await get_fstab_contents(agent_id=None)
classmethod await get_performance_metric(folder_host)
classmethod await list_mounts()
classmethod await mount_host(name, fs_location, options=None, edit_fstab=False)
classmethod await umount_host(name, edit_fstab=False)
await share(perm, emails)
await unshare(emails)
await leave(shared_user_uuid=None)
await clone(target_name, target_host=None, usage_mode='general', permission='rw')
await update_options(name, permission=None, cloneable=None)
classmethod await list_shared_vfolders()
classmethod await shared_vfolder_info(vfolder_id)
classmethod await update_shared_vfolder(vfolder, user, perm=None)
classmethod await change_vfolder_ownership(vfolder, user_email)

Low-level Reference

Base Functions

This module defines a few utilities that ease complexities to support both synchronous and asynchronous API functions, using some tricks with Python metaclasses.

Unless you are contributing to the client SDK, you probably won't have to use this module directly.

class ai.backend.client.func.base.APIFunctionMeta(name, bases, attrs, **kwargs)

Converts all methods marked with api_function() into session-aware methods that are either plain Python functions or coroutines.

class ai.backend.client.func.base.BaseFunction
@ai.backend.client.func.base.api_function(meth)

Mark the wrapped method as the API function method.
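The dual sync/async behavior can be sketched with a plain decorator: a coroutine method is returned as-is when the surrounding session is asynchronous, and driven to completion on the spot when it is synchronous. The names below (api_function, _in_async_session) are illustrative and do not reproduce the SDK's actual metaclass internals.

```python
# A self-contained sketch of the sync/async dual-mode trick that
# api_function() and APIFunctionMeta implement; the names here are
# illustrative stand-ins, not the SDK's internals.
import asyncio
import functools

_in_async_session = False  # toggled by the active session type in the real SDK


def api_function(coro_fn):
    """Wrap a coroutine so it is awaited by the caller in async sessions,
    but run to completion transparently in synchronous sessions."""
    @functools.wraps(coro_fn)
    def wrapper(*args, **kwargs):
        coro = coro_fn(*args, **kwargs)
        if _in_async_session:
            return coro            # caller must await it
        return asyncio.run(coro)   # drive the coroutine right here
    return wrapper


class Echo:
    @api_function
    async def hello(self, name):
        await asyncio.sleep(0)     # stands in for real network I/O
        return f"hello, {name}"


print(Echo().hello("world"))  # → hello, world  (synchronous mode)
```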

Request API

This module provides low-level API request/response interfaces based on aiohttp.

Depending on the session object from which the request is made, Request and Response adapt their behavior: they work as plain Python functions or return awaitables.

class ai.backend.client.request.Request(method='GET', path=None, content=None, *, content_type=None, params=None, reporthook=None, override_api_version=None)

The API request object.

with fetch(**kwargs) as Response / async with fetch(**kwargs) as Response

Sends the request to the server and reads the response.

You may use this method with either a plain synchronous Session or an AsyncSession. Both of the following patterns are valid:

from ai.backend.client.request import Request
from ai.backend.client.session import Session

with Session() as sess:
  rqst = Request('GET', ...)
  with rqst.fetch() as resp:
    print(resp.text())

from ai.backend.client.request import Request
from ai.backend.client.session import AsyncSession

async with AsyncSession() as sess:
  rqst = Request('GET', ...)
  async with rqst.fetch() as resp:
    print(await resp.text())
Return type:

FetchContextManager

async with connect_websocket(**kwargs) as WebSocketResponse or its derivatives

Creates a WebSocket connection.

Return type:

WebSocketContextManager

Warning

This method only works with AsyncSession.

property content: bytes | bytearray | str | StreamReader | IOBase | None

Retrieves the content in its original form. Internal code should NOT use this, as it incurs duplicate encoding/decoding.

set_content(value, *, content_type=None)

Sets the content of the request.

Return type:

None

set_json(value)

A shortcut for set_content() with JSON objects.

Return type:

None

attach_files(files)

Attaches a list of files, each represented as an AttachedFile.

Return type:

None

connect_events(**kwargs)

Creates a Server-Sent Events connection.

Return type:

SSEContextManager

Warning

This method only works with AsyncSession.

class ai.backend.client.request.Response(session, underlying_response, *, async_mode=False, **kwargs)
class ai.backend.client.request.WebSocketResponse(session, underlying_response, **kwargs)

A high-level wrapper of aiohttp.ClientWebSocketResponse.

class ai.backend.client.request.FetchContextManager(session, rqst_ctx_builder, *, response_cls=<class 'ai.backend.client.request.Response'>, check_status=True)

The context manager returned by Request.fetch().

It provides both synchronous and asynchronous context manager interfaces.
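Supporting both `with` and `async with` on one object boils down to implementing both context-manager protocols side by side. The toy class below illustrates that pattern under the stated assumption that it is a simplified stand-in, not FetchContextManager's actual implementation.

```python
# A minimal class usable with both `with` and `async with` — the pattern
# FetchContextManager follows. Illustrative only, not the SDK's code.
import asyncio


class DualContext:
    def __init__(self, value):
        self.value = value

    # synchronous protocol
    def __enter__(self):
        return self.value

    def __exit__(self, exc_type, exc, tb):
        return False  # do not suppress exceptions

    # asynchronous protocol
    async def __aenter__(self):
        return self.value

    async def __aexit__(self, exc_type, exc, tb):
        return False


with DualContext("sync") as v:
    print(v)   # → sync

async def main():
    async with DualContext("async") as v:
        print(v)   # → async

asyncio.run(main())
```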

class ai.backend.client.request.WebSocketContextManager(session, ws_ctx_builder, *, on_enter=None, response_cls=<class 'ai.backend.client.request.WebSocketResponse'>)

The context manager returned by Request.connect_websocket().

class ai.backend.client.request.AttachedFile(filename, stream, content_type)

A struct that represents an attached file to the API request.

Parameters:
  • filename (str) – The name of the file to store. It may include paths, and the server will create parent directories if required.

  • stream (Any) – A file-like object that allows stream-reading bytes.

  • content_type (str) – The content type for the stream. For arbitrary binary data, use “application/octet-stream”.

content_type

Alias for field number 2

filename

Alias for field number 0

stream

Alias for field number 1
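Since AttachedFile is a named tuple of (filename, stream, content_type), constructing an attachment is just building that triple around a file-like object. The sketch below re-declares an equivalent NamedTuple so it runs without the SDK installed; it mirrors the fields documented above but is not the SDK's own class.

```python
# A self-contained stand-in mirroring the documented AttachedFile shape:
# a named tuple of (filename, stream, content_type).
import io
from typing import Any, NamedTuple


class AttachedFile(NamedTuple):
    filename: str      # field 0: may include paths
    stream: Any        # field 1: a file-like object readable as bytes
    content_type: str  # field 2: MIME type; "application/octet-stream" for raw bytes


f = AttachedFile("data/report.csv", io.BytesIO(b"a,b\n1,2\n"), "text/csv")
print(f.filename, f.content_type)  # → data/report.csv text/csv
print(f[1].read())                 # fields are also tuple-indexable
```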

Exception Classes

class ai.backend.client.exceptions.BackendError

Exception type to catch all ai.backend-related errors.

add_note()

Exception.add_note(note) – add a note to the exception

with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class ai.backend.client.exceptions.BackendAPIError(status, reason, data)

Exceptions returned by the API gateway.

class ai.backend.client.exceptions.BackendClientError

Exceptions from the client library, such as argument validation errors and connection failures.
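Because BackendError is the common base of both API-side and client-side errors, callers can catch specific failures first and fall back to the catch-all base. The sketch below re-declares stand-in classes mirroring the documented hierarchy so it runs without the SDK; the BackendAPIError constructor arguments follow the (status, reason, data) signature shown above.

```python
# Stand-in definitions mirroring the documented hierarchy:
# BackendError is the catch-all base; API and client errors subclass it.
class BackendError(Exception):
    """Catch-all base for ai.backend-related errors."""

class BackendAPIError(BackendError):
    def __init__(self, status, reason, data):
        super().__init__(reason)
        self.status, self.reason, self.data = status, reason, data

class BackendClientError(BackendError):
    """Client-side failures such as validation or connection errors."""


def call_gateway():
    # Simulates an API gateway rejecting a request.
    raise BackendAPIError(404, "Not Found", {"title": "No such vfolder"})

try:
    call_gateway()
except BackendAPIError as e:
    print(e.status, e.reason)     # → 404 Not Found
except BackendError:
    print("other backend error")  # anything else still hits the common base
```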

Type Definitions

class ai.backend.client.types.Undefined

A special type to represent an undefined value.

ai.backend.client.types.undefined

A placeholder to signify an undefined value as a singleton object of Undefined and distinguish it from a null (None) value.
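The point of a dedicated sentinel is that an update API can tell "leave this field unchanged" apart from "set this field to null". The sketch below re-declares an equivalent enum-based singleton so it runs standalone; the names mirror the documented ones, and update_options here is an illustrative function, not the SDK's method.

```python
# The undefined-sentinel pattern: a singleton distinct from None, so that
# omitted keyword arguments can be distinguished from explicit None values.
# These definitions are stand-ins mirroring the documented names.
import enum


class Undefined(enum.Enum):
    TOKEN = 0

undefined = Undefined.TOKEN  # the singleton placeholder


def update_options(*, permission=undefined, cloneable=undefined):
    changes = {}
    if permission is not undefined:
        changes["permission"] = permission  # may legitimately be None
    if cloneable is not undefined:
        changes["cloneable"] = cloneable
    return changes


print(update_options(permission=None))  # → {'permission': None}
print(update_options())                 # → {}
```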

Indices and tables