Hpc Lab Setup

The HPC Lab is a persistent deployment in Vultr to allow the SA team to prototype solutions in a shared environment. Goals

A testing site for our Ansible automation, Warewulf, and IQube deployments.

Environment to troubleshoot Warewulf issues. 

Easy button to deploy Warewulf Slurm demo environemnt. 

Prerequisites

Access to Vultr

Mountain access key that has access to Classic HPC Infrastructure 8 subscription

Access to Bitbucket repos

Ansible installed on your laptop

Terraform

API Access to Vultr

Authorise your IP address to use the Vultr API. (This can change frequently, depending on your ISP.) Deploy a Slurm HPC Lab Environment on Vultr Deploying the Environment

The following section walks though how to deploy a Slurm HPC cluster of CPU instances on Vultr where:

Slurm components (controller, and database) are installed using packages from OpenHPC

Warewulf is installed through Mountain using subscription warewulf-rocky-8

Slurm compute node Warewulf image is imported using Mountain subscription warewulf-node-images

Warewulf overlays are configured using Ansible templates/files. 

Warewulf nodes and node profiles are managed using an Ansible fact and rendered using a Jinja template

Pull down classic-hpc repository

git clone git@bitbucket.org:ciqinc/classic-hpc.git && cd classic-hpc

Set up Ansible dynamic inventory. 

    Populate inventory.vultr.yml with your Vultr API key in field api_keyand update the filter with your user name for the owner Vultr tag. For example owner:bphan. When provisioning the HPC lab environment, your instances will be tagged with owner:<username>. 

    Inventory groups are set based on instance tags: warewulf_servers, scheduler, and database.

Create a copy of configs/warewulf-slurm-vultr-demo.yml called cluster.config.yml

cp config/warewulf-slurm-vultr-demo.yml configs/cluster.config.yml

Run the playbook to stand up the HPC lab

ansible-playbook -u root -i inventory.vultr.yml
–extra-vars “bitbucket_ssh_private_key=/path/to/private/key”
–extra-vars “vultr_api_key=[redacted]”
–extra-vars “vultr_ssh_public_key=/path/to/public/key”
–extra-vars “ciq_mountain_access_key=[redacted]”
playbooks/setup-hpc-lab.yml

Run example hello world job.

Log in via SSH to hpc-lab-control0. 

    You can easily find the public IP of this host using command ansible-inventory -i inventory.vultr.yml --list

Switch to your test user - su - test

Allocate resources for your job. The following command will allocate a single node and 4 cores: 

    salloc -N 1 -n 4 

Compile an example MPI hello world job

    mpicc -O3 /opt/ohpc/pub/examples/mpi/hello.c

Run the compiled program

    prun ./a.out

You should see this following output: [test@hpc-lab-control0 ~]$ prun ./a.out [prun] Master compute host = hpc-lab-control0 [prun] Resource manager = slurm [prun] Launch cmd = mpirun ./a.out (family=openmpi4)

Hello, world (4 procs total) –> Process # 0 of 4 is alive. -> bphan-hpc-lab-compute0 –> Process # 2 of 4 is alive. -> bphan-hpc-lab-compute0 –> Process # 3 of 4 is alive. -> bphan-hpc-lab-compute0 –> Process # 1 of 4 is alive. -> bphan-hpc-lab-compute0

If you have successfully run the test job, your environment should be set up to run the following demo: Tearing down the HPC lab

When you have finished testing or doing a demo you can tear down the provisioned environment by running the playbook destroy-hpc-lab.yml. ansible-playbook -u root -i inventory.vultr.yml
–extra-vars “vultr_api_key=[redacted]”
–extra-vars “vultr_ssh_public_key=/path/to/public/key”
playbooks/destroy-hpc-lab.yml Deployment FAQ

Q: I have submitted a job or requested compute resources in Slurm, but my job is stuck in a pending state or the scheduler is taking a long time to assign my user compute resources. How do I debug this?

A: First check the status of your nodes using command sinfo. For example, the output below shows node bphan-hpc-lab-compute0 is in a down state, which prevent cluster users from using the compute resource. [root@hpc-lab-control0 ~]# sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST normal* up 1-00:00:00 1 down bphan-hpc-lab-compute0

If there is an asterisk beside the state (ex. down*), this mean that the Slurm controller cannot reach the Slurm client (slurmd) on the compute node. This typically indicates the service had issues starting up during the boot process.

To put the compute node back into state idle, run the following command as root on the node the Slurm controller is running on: scontrol update nodename=bphan-hpc-lab-compute0 state=idle

After running the command above, we can run sinfo again to confirm the node is back in and idle state. [root@hpc-lab-control0 ~]# sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST normal* up 1-00:00:00 1 idle bphan-hpc-lab-compute0

If your job was previously queued, it should being running once the compute node is put back into an idle state. If there is a job running on the node, you should see the node in state alloc. Warewulf Demo FAQ

Q: How is a Warewulf node image container differ from a Docker container I would create which runs, for example, an Apache server?

A: Warewulf node image container have a kernel installed in the container. When a Warewulf managed node is PXE booting, it will use the kernel within the node image to boot the bare metal node.

Q: Can I manage a VM running in VMWare with Warewulf?

A: Yes.

Q: Can Warewulf manage a login node in addition to my compute nodes?

A: Yes, Warewulf can manage a login node. We would recommend setting up a new node profile for a login node. This new profile will have an additional network interface configured which allows users to SSH into the login node.

Q: How do you tell Warewulf which interface to provision over?

A: That’s meant to be the “primary” interface. But, ultimately, it depends on the BIOS / firmware on the compute node and which interface it attempts to PXE boot over first. Deploy a IQube Lab Environment

TODO Configuration

https://bitbucket.org/ciqinc/hpc-lab/

https://bitbucket.org/ciqinc/classic-hpc/

https://bitbucket.org/ciqinc/iqube-ansible/