December 18, 2016

Day 18 - GPU-enabled cloud farm

Written by: Sergey Karatkevich (@kevit)
Edited by: Eric Sigler (@esigler)

Why did we start this project?

Our company, Servers.com, exists for a purpose: to provide you with quality hosting services, along with all the additional tools you may need. One great example is Prisma, a mobile app.

We have been Prisma’s hosting partner since the day the app launched. Despite the app’s explosive growth in popularity and hefty download numbers, we were able to keep up with their needs, provisioning new servers and balancing the load. Later, when the app’s code was optimized and part of the hardware could be reused, we decided to create a new product: Prisma Cloud, a dedicated GPU hosting infrastructure.

Prisma processed its pictures on Dell servers with NVIDIA Titan X and NVIDIA GTX 1080 GPUs, so that was our starting point.

What were the major problems?

Each video card exposes two devices in lspci:

42:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
42:00.1 Audio device: NVIDIA Corporation Device 10f0 (rev a1)

You can easily remove the audio device through /sys:

echo -n "1" > /sys/bus/pci/devices/0000\:42\:00.1/remove
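
If you want to drop the audio function of every NVIDIA card on the host in one go, a small loop does the job (a sketch; 10de is the NVIDIA vendor ID, and the audio function is the .1 device shown above):

# remove the HDMI audio function of every NVIDIA card
for dev in $(lspci -d 10de: -nn | awk '/Audio/ {print $1}'); do
    echo -n "1" > "/sys/bus/pci/devices/0000:${dev}/remove"
done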

Officially, the NVIDIA GeForce GTX 1080 is supported on Linux by the proprietary NVIDIA driver starting with version 367.18 Beta. At that time the driver was quite new and not yet packaged in Debian, even in experimental; the packaged versions were older:

364.19-1 1
          1 http://mirror.yandex.ru/debian experimental/non-free amd64 Packages
     361.45.18-2 500
        500 http://mirror.yandex.ru/debian sid/non-free amd64 Packages

So, we used the new driver straight from the NVIDIA website:

chmod +x NVIDIA-Linux-x86_64-367.35.run
# -a accepts the license, --dkms registers the module with DKMS,
# -Z blacklists nouveau, -s runs the installer silently
./NVIDIA-Linux-x86_64-367.35.run -a --dkms -Z -s
update-initramfs -u
modprobe nvidia-uvm
./cuda_8.0.27_linux.run --override --silent --toolkit --samples --verbose

and applied the patch on top:

./cuda_8.0.27.1_linux.run --silent --accept-eula
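
A quick sanity check that the driver is loaded and the cards are visible (nvidia-smi ships with the driver; -L simply lists the detected GPUs):

nvidia-smi -L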

NVIDIA tries to detect and block virtualization of consumer cards inside KVM, so kvm=off is your friend; it hides the hypervisor signature from the guest and requires QEMU 2.1+. (Later we also hit another limitation, this time in ffmpeg: only two concurrent streams per GTX 1080 card.) In libvirt, the same is achieved with this fragment inside the domain's <features> element:

<kvm>
   <hidden state='on'/>
</kvm>
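
For reference, the same thing on a raw QEMU command line looks roughly like this (a sketch; the machine type and CPU model are assumptions, and the PCI address is the one from the lspci output above):

qemu-system-x86_64 \
    -enable-kvm \
    -machine q35 \
    -cpu host,kvm=off \
    -device vfio-pci,host=42:00.0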

What does it look like from the host?

Your host should provide SR-IOV and DMAR (DMA remapping); both can be switched on via BIOS/EFI. You can check that they are active with:

dmesg | grep -e DMAR -e IOMMU

IOMMU (input/output memory management unit) should be turned on in kernel options:

intel_iommu=on

The snd_hda_intel and nouveau drivers should be blacklisted:

modprobe.blacklist=snd_hda_intel,nouveau
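
Both kernel options usually end up in the boot loader configuration. On a GRUB-based Debian host that could look like this (a sketch; adjust the file and the update command to your boot loader):

# in /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on modprobe.blacklist=snd_hda_intel,nouveau"

# then regenerate the GRUB config and reboot
update-grub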

And the VFIO driver should be loaded:

modprobe vfio
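
Optionally, you can pin the cards to vfio-pci by vendor/device ID so nothing else claims them at boot (a sketch; 10de:1b80 is the GTX 1080 ID used in the OpenStack configuration below):

# bind all 10de:1b80 devices to vfio-pci at module load time
echo "options vfio-pci ids=10de:1b80" > /etc/modprobe.d/vfio.conf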

What does it look like from the OpenStack side?

You should define the PCI device in nova.conf:

[DEFAULT]
pci_passthrough_whitelist = { "vendor_id": "10de", "product_id": "1b80" }
pci_alias = { "vendor_id":"10de", "product_id":"1b80", "name":"nvidia" }

apply proper filters:

scheduler_default_filters=AggregateInstanceExtraSpecsFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,AggregateImagePropertiesStrictIsolation,AggregateCoreFilter,DiskFilter,PciPassthroughFilter

and set the proper flavor settings. The flavor gets the metadata pci_passthrough:alias = nvidia:1, where "nvidia" comes from the pci_alias directive in nova.conf and "1" is the number of cards to pass through:

nova flavor-key GPU.SSD.30 set "pci_passthrough:alias"="nvidia:1"
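
With the flavor in place, a GPU instance can be booted the usual way (a sketch; the image name, key name and instance name are placeholders):

nova boot --flavor GPU.SSD.30 --image debian-8-amd64 --key-name mykey gpu-test

Inside the guest, lspci should show the passed-through card.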

Migrating your instance in OpenStack

Automated migration is still in development, but you can migrate an instance manually for now.

Symptoms:

libvirtError: Requested operation is not valid: PCI device 0000:84:00.0 is in use by driver QEMU

And a simple migration process:

nova migrate uuid
nova reset-state uuid --active
nova stop uuid
nova start uuid
# remove the resize flag left behind on the source node
rm -r /var/lib/instances/uuid_resize
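
To confirm which compute node the instance ended up on (a sketch; the OS-EXT-SRV-ATTR:host field is visible to admin users):

nova show uuid | grep OS-EXT-SRV-ATTR:host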

