Setup OS Environment

Backend.AI and its associated components share common requirements and configurations for proper operation. This section explains how to configure the OS environment.

Note

This section assumes the installation on Ubuntu 20.04 LTS.

Create a user account for operation

Create a user account named bai to install and operate Backend.AI services. Set its UID and GID to 1100 to avoid conflicts with other users or groups. The account needs sudo privileges, so add bai to the sudo group.

$ username="bai"
$ password="secure-password"
$ sudo adduser --disabled-password --uid 1100 --gecos "" "$username"
$ echo "$username:$password" | sudo chpasswd
$ sudo usermod -aG sudo "$username"

If you do not want to expose the password in the shell history, remove the --disabled-password option, skip the chpasswd step, and enter the password interactively when adduser prompts for it.

Log in as the bai user and continue the installation.
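Before continuing, it may be worth confirming that the account has the intended IDs. The following is a minimal sketch; check_account is a hypothetical helper name, not part of Backend.AI, and the values 1100 come from the step above.

```shell
# Hypothetical helper: succeed if user $1 exists with UID $2 and primary GID $3.
check_account() {
    [ "$(id -u "$1" 2>/dev/null)" = "$2" ] && [ "$(id -g "$1" 2>/dev/null)" = "$3" ]
}

if check_account bai 1100 1100; then
    echo "bai has UID/GID 1100"
else
    echo "bai is missing or has a different UID/GID"
fi
```

If the check reports a mismatch, the UID or GID was probably already taken when the account was created.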

Install Docker engine

Backend.AI requires Docker Engine to create compute sessions with the Docker container backend, and some service components are also deployed as containers, so Docker Engine must be installed. Install docker-compose-plugin as well so that the docker compose command is available.

After the installation, add the bai user to the docker group so that you do not have to prefix every Docker command with sudo.

$ sudo usermod -aG docker bai

Log out and log in again to apply the group membership change.
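After logging in again, a quick check along the following lines can confirm that the new session carries the docker group and that Docker works without sudo. This is only a sketch; session_in_group is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the current login session belongs to group $1.
session_in_group() {
    id -nG | tr ' ' '\n' | grep -qxF "$1"
}

if session_in_group docker; then
    echo "this session is in the docker group"
else
    echo "this session is not in the docker group (log out and log in again?)"
fi

# The commands below only succeed once Docker Engine and the compose plugin are installed.
if command -v docker >/dev/null 2>&1; then
    docker version --format '{{.Server.Version}}' || echo "cannot reach the Docker daemon without sudo"
    docker compose version || echo "docker compose plugin not found"
else
    echo "docker is not installed on this host"
fi
```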

Optimize sysctl/ulimit parameters

This step is not essential, but it is recommended to improve the performance and stability of Backend.AI. Refer to the guide in the Manager repository for details on the kernel parameters and the ulimit settings. The optimal values may vary depending on the Backend.AI services you install. Each service installation section provides guidance on the values where needed.

Note

Modern systems may have already set the optimal parameters. In that case, you can skip this step.

To cleanly separate the configurations, you may follow the steps below.

Save the resource limit parameters in /etc/security/limits.d/99-backendai.conf.

root hard nofile 512000
root soft nofile 512000
root hard nproc 65536
root soft nproc 65536
bai hard nofile 512000
bai soft nofile 512000
bai hard nproc 65536
bai soft nproc 65536

Log out and log in again to apply the resource limit changes.
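To see whether the new session picked up the open-file limit, something like the following sketch can be used. The helper names limit_ok and report are hypothetical, and the threshold 512000 is taken from the file above.

```shell
# Hypothetical helper: succeed if limit value $1 is "unlimited" or at least $2.
limit_ok() {
    [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

# Print one line per limit: $1 = label, $2 = current value, $3 = expected minimum.
report() {
    if limit_ok "$2" "$3"; then
        echo "ok   $1=$2"
    else
        echo "low  $1=$2 (expected >= $3)"
    fi
}

report "nofile soft" "$(ulimit -Sn)" 512000
report "nofile hard" "$(ulimit -Hn)" 512000
```

The process limit can be inspected in the same way with the corresponding ulimit option of your shell.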

Save the kernel parameters in /etc/sysctl.d/99-backendai.conf.

fs.file-max=2048000
net.core.somaxconn=1024
net.ipv4.tcp_max_syn_backlog=1024
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_window_scaling=1
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_early_retrans=1
net.ipv4.ip_local_port_range=10000 65000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem=4096 12582912 16777216
net.ipv4.tcp_wmem=4096 12582912 16777216
vm.overcommit_memory=1

Apply the kernel parameters with sudo sysctl -p /etc/sysctl.d/99-backendai.conf.
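To confirm that the running kernel matches the file, the values can be read back from /proc/sys. The sketch below assumes plain key=value lines without spaces around the equals sign, as in the file above; sysctl_path and check_sysctl_file are hypothetical helper names.

```shell
# Convert a sysctl key such as net.core.somaxconn into its /proc/sys path.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Hypothetical helper: compare each key=value line of file $1 with the running value.
check_sysctl_file() {
    while IFS='=' read -r key want; do
        case "$key" in ''|'#'*) continue ;; esac
        path="$(sysctl_path "$key")"
        [ -r "$path" ] || { echo "skip  $key (not available on this kernel)"; continue; }
        # Normalize whitespace so that tab-separated kernel values compare equal.
        have="$(tr -s '[:space:]' ' ' < "$path" | sed 's/ $//')"
        want="$(printf '%s' "$want" | tr -d '"' | tr -s '[:space:]' ' ')"
        if [ "$have" = "$want" ]; then
            echo "ok    $key"
        else
            echo "DIFF  $key: running=$have file=$want"
        fi
    done < "$1"
}

conf=/etc/sysctl.d/99-backendai.conf
if [ -r "$conf" ]; then
    check_sysctl_file "$conf"
fi
```

Keys reported as skipped are not exposed by the running kernel and can usually be ignored.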

Prepare required Python versions and virtual environments

Prepare a Python distribution whose version meets the requirements of the target package. Backend.AI 22.09, for example, requires Python 3.10. The latest information on Python version compatibility can be found here.

There are several ways to prepare a specific Python version. Here, we use pyenv and pyenv-virtualenv.

Use pyenv to manually build and select a specific Python version

Install pyenv and pyenv-virtualenv. Then, install the Python version you need:

$ pyenv install "${YOUR_PYTHON_VERSION}"

Note

You may need to install the suggested build environment to build Python with pyenv.

You can then create a separate virtual environment for each service. For example, to create a virtual environment for Backend.AI Manager 22.09.x and activate it automatically, run:

$ mkdir "${HOME}/manager"
$ cd "${HOME}/manager"
$ pyenv virtualenv "${YOUR_PYTHON_VERSION}" bai-22.09-manager
$ pyenv local bai-22.09-manager
$ pip install -U pip setuptools wheel

The last command makes the latest pip, wheel, and setuptools packages available in the new environment, so that any non-binary extension packages can be compiled and installed on your system.
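To confirm that the directory picks up the intended interpreter, you could compare the active Python version with the required series. This is a sketch only; version_in_series is a hypothetical helper name, and 3.10 is the series mentioned above for Backend.AI 22.09.

```shell
# Hypothetical helper: succeed if full version $1 belongs to the major.minor series $2.
version_in_series() {
    case "$1" in
        "$2"|"$2".*) return 0 ;;
        *) return 1 ;;
    esac
}

required="3.10"   # assumption: adjust to the release you are installing
if command -v python >/dev/null 2>&1; then
    current="$(python -c 'import platform; print(platform.python_version())')"
    if version_in_series "$current" "$required"; then
        echo "ok: python $current matches $required"
    else
        echo "mismatch: python $current, expected $required.x"
    fi
else
    echo "python is not on PATH in this directory"
fi
```

Run it from inside the service directory (for example, the manager directory created above) so that the pyenv local setting applies.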

Use a standalone statically built Python

Alternatively, you can use a standalone, statically built Python.

Warning

Details will be added later.

Configure network aliases

Although not required, using network aliases instead of IP addresses can make setup and operation easier. Edit the /etc/hosts file on each node and append entries like the example below so that each server can be reached by its alias.

##### BEGIN for Backend.AI services #####
10.20.30.10 bai-m1   # management node 01
10.20.30.20 bai-a01  # agent node 01 (GPU 01)
10.20.30.22 bai-a02  # agent node 02 (GPU 02)
##### END for Backend.AI services #####

Note that the IP addresses must be reachable from the other nodes if you are installing on multiple servers.
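Once the entries are in place, a check like the following sketch can confirm on each node that the aliases resolve to the addresses written in the file. The helper name list_aliases is hypothetical, and the bai- prefix is assumed from the example entries above.

```shell
# Hypothetical helper: print "IP alias" pairs from a hosts-format file, ignoring comments.
list_aliases() {
    sed 's/#.*//' "$1" | awk 'NF >= 2 { for (i = 2; i <= NF; i++) print $1, $i }'
}

hosts_file=/etc/hosts
list_aliases "$hosts_file" | while read -r ip name; do
    # Only check the Backend.AI aliases (assumed to start with "bai-").
    case "$name" in bai-*) ;; *) continue ;; esac
    resolved="$(getent hosts "$name" 2>/dev/null | awk '{ print $1; exit }')"
    if [ "$resolved" = "$ip" ]; then
        echo "ok: $name -> $ip"
    else
        echo "check: $name resolves to '${resolved}', file says $ip"
    fi
done
```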

Mount a shared storage

Having a shared storage volume makes it easy to save and manage data inside a Backend.AI compute environment. If you have dedicated storage, mount it under the /vfroot/ directory on each server with a name of your choice. You must mount it at the same path on all management and compute nodes.

Detailed mount procedures may vary depending on the storage type or vendor. For a typical NFS volume, adding the configuration to /etc/fstab and running sudo mount -a is enough.

Note

It is recommended to unify the UID and GID of the Storage Proxy service, all of the Agent services across nodes, container UID and GID (configurable in agent.toml), and the NFS volume.

If you do not have dedicated storage or are installing on a single server, you can use a local directory instead. Just create the directory /vfroot/local.

$ sudo mkdir -p /vfroot/local
$ sudo chown -R "$(id -u):$(id -g)" /vfroot
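Because the UID and GID are meant to be the same across nodes, it can help to confirm the ownership of the directory on each server. The following is a sketch that assumes GNU stat (as shipped with Ubuntu); owner_of is a hypothetical helper name.

```shell
# Print the numeric "UID:GID" owner of a path (relies on GNU stat).
owner_of() {
    stat -c '%u:%g' "$1"
}

dir=/vfroot/local
expected="$(id -u):$(id -g)"
if [ -d "$dir" ]; then
    actual="$(owner_of "$dir")"
    if [ "$actual" = "$expected" ]; then
        echo "ok: $dir is owned by $actual"
    else
        echo "mismatch: $dir is owned by $actual, expected $expected"
    fi
else
    echo "$dir does not exist on this host"
fi
```

Run it as the bai user so that the expected value reflects the operating account.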

Setup accelerators

If there are accelerators (e.g., GPU) on the server, you have to install the vendor-specific drivers and libraries to make sure the accelerators are properly set up and working. Please refer to the vendor documentation for the details.

To integrate NVIDIA GPUs:
- Install the NVIDIA driver and CUDA toolkit.
- Install the NVIDIA container toolkit (nvidia-docker2).

Pull container images

On compute nodes, pull the container images required to create compute sessions. Lablup provides a set of open container images, and you may start with the following:

$ docker pull cr.backend.ai/stable/filebrowser:21.02-ubuntu20.04
$ docker pull cr.backend.ai/stable/python:3.9-ubuntu20.04
$ docker pull cr.backend.ai/stable/python-pytorch:1.11-py38-cuda11.3
$ docker pull cr.backend.ai/stable/python-tensorflow:2.7-py38-cuda11.3