Install from Source
Note
For production deployments, we recommend creating separate virtualenvs for individual services and installing the pre-built wheel distributions, following Install from Packages.
Setting Up Manager and Agent (single node, all-in-one)
Check out Development Setup.
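For reference, a typical all-in-one installation boils down to cloning the mono-repo and running the installer script; the clone URL and directory name below are illustrative, and the Development Setup page covers the prerequisites and options in detail:
$ git clone https://github.com/lablup/backend.ai bai-dev
$ cd bai-dev
$ ./scripts/install-dev.sh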
Setting Up Additional Agents (multi-node)
Updating the manager configuration for multi-node setups
Since scripts/install-dev.sh assumes a single-node all-in-one setup, it configures the etcd and Redis addresses to be 127.0.0.1.
You need to update the Redis address stored in etcd so that additional agent nodes can connect to the Redis server using the address advertised via etcd:
$ ./backend.ai mgr etcd get config/redis/addr
127.0.0.1:xxxx
$ ./backend.ai mgr etcd put config/redis/addr MANAGER_IP:xxxx # use the port number read above
where MANAGER_IP is an IP address of the manager node accessible from other agent nodes.
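For example, assuming the manager node is reachable at 10.0.1.10 (a hypothetical address) and the port printed by the get command above was 8111, the update and a quick verification would look like:
$ ./backend.ai mgr etcd put config/redis/addr 10.0.1.10:8111
$ ./backend.ai mgr etcd get config/redis/addr
10.0.1.10:8111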
Installing additional agents on different nodes
First, you need to initialize a working copy of the core repository for each additional agent node.
As our scripts/install-dev.sh does not yet provide an “agent-only” installation mode, you need to manually clone the repository and perform the same pyenv, Python, and Pants setup procedures as the script does.
Note
Since the core packages live in a mono-repo, there is no way to clone only the agent sources. Just clone the entire repository and configure/run the agent alone. Ensure that you also pull the LFS files and submodules when you clone it manually.
Once your pants is up and working, run pants export to populate virtualenvs and install dependencies.
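As a rough sketch, the manual bootstrap on a new agent node might look like the following, assuming pyenv, a compatible Python, and Pants are already set up as in Development Setup; the clone URL and directory name here are illustrative:
$ git clone --recurse-submodules https://github.com/lablup/backend.ai bai-dev
$ cd bai-dev
$ git lfs pull        # fetch the LFS-tracked files
$ pants export        # populate virtualenvs and install dependencies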
Then configure agent.toml by copying it from configs/agent/halfstack.toml and editing the following keys (see the example after this list):
[etcd].addr.host: Replace with MANAGER_IP
[agent].rpc-listen-addr.host: Replace with AGENT_IP
[container].bind-host: Replace with AGENT_IP
[watcher].service-addr.host: Replace with AGENT_IP
where AGENT_IP is an IP address of this agent node accessible from the manager and MANAGER_IP is an IP address of the manager node accessible from this agent node.
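For illustration, assuming MANAGER_IP is 10.0.1.10 and AGENT_IP is 10.0.1.21 (both hypothetical addresses), the copy-and-edit step could look like:
$ cp configs/agent/halfstack.toml ./agent.toml
$ vim agent.toml   # or your preferred editor; then set:
  # [etcd].addr.host             -> "10.0.1.10"   (MANAGER_IP)
  # [agent].rpc-listen-addr.host -> "10.0.1.21"   (AGENT_IP)
  # [container].bind-host        -> "10.0.1.21"
  # [watcher].service-addr.host  -> "10.0.1.21"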
Now execute ./backend.ai ag start-server to connect this agent node to an existing manager.
We assume that the agent and manager nodes reside in the same local network, where all TCP ports are open to each other.
If this is not the case, you should configure firewalls to open all the port numbers appearing in agent.toml.
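As a hedged example, with ufw the rules might look like the following; the port numbers shown here (agent RPC 6001, watcher 6009, container port range 30000-31000) are the usual halfstack defaults and are assumptions rather than guaranteed values, so check your own agent.toml for the actual ports:
$ sudo ufw allow 6001/tcp          # [agent].rpc-listen-addr (assumed default)
$ sudo ufw allow 6009/tcp          # [watcher].service-addr (assumed default)
$ sudo ufw allow 30000:31000/tcp   # [container].port-range (assumed default)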
There are more complicated setup scenarios such as splitting network planes for control and container-to-container communications, but we provide assistance with them for enterprise customers only.
Setting Up Accelerators
Ensure that your accelerator is properly set up using vendor-specific installation methods.
Clone the accelerator plugin package into the plugins directory if necessary, or just use one of the existing ones in the mono-repo.
You also need to configure agent.toml's [agent].allow-compute-plugins with the full package path (e.g., ai.backend.accelerator.cuda_open) to activate them.
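A minimal sketch of this step, assuming the open-source CUDA plugin bundled in the mono-repo and that the option takes a list of package paths (verify the exact value syntax against the sample configs in your working copy):
$ vim agent.toml   # or your preferred editor; under [agent], set:
  # allow-compute-plugins = ["ai.backend.accelerator.cuda_open"]
$ ./backend.ai ag start-server   # (re)start the agent so it loads the plugin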
Configuring Overlay Networks for Multi-node Training (Optional)
Note
All other features of Backend.AI except multi-node training work without this configuration. The Docker Swarm mode is used to configure overlay networks that ensure privacy between cluster sessions, while container monitoring and configuration are done by Backend.AI itself.
Currently, the cross-node inter-container overlay routing is controlled via Docker Swarm's overlay networks. On the manager node, you need to create a Swarm; on the agent nodes, you need to join that Swarm. Then restart all manager and agent daemons to make it work.
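A minimal sketch using the standard Docker CLI, with 10.0.1.10 standing in for the manager node's address (hypothetical):
On the manager node:
$ docker swarm init --advertise-addr 10.0.1.10
$ docker swarm join-token worker        # prints the join command for the agent nodes
On each agent node, run the printed command, e.g.:
$ docker swarm join --token <token> 10.0.1.10:2377
Finally, restart the manager and agent daemons on all nodes.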