Infiniband

From Proxmox VE

Introduction

Infiniband can be used with DRBD to speed up replication. This article covers setting up IP over Infiniband (IPoIB).

Subnet Manager

Infiniband requires a subnet manager to function. Many Infiniband switches have a built-in subnet manager that can be enabled. When using multiple switches, you can enable a subnet manager on all of them for redundancy.

If your switch does not have a subnet manager, or if you are not using a switch, then you need to run a subnet manager on your node(s). The opensm package in Debian Squeeze and later should be sufficient if you need a subnet manager.
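One way to tell whether a port has found a subnet manager is the "SM lid" field that ibstat (from the infiniband-diags package) prints for each port. In the sample ibstat output at the end of this page, the port without a link reports SM lid 0. The sketch below assumes that output format, and the function name ports_without_sm is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads ibstat output on stdin and prints every port
# whose "SM lid" is 0, i.e. a port that has not registered with a subnet
# manager (no subnet manager on the fabric, or the link is down).
ports_without_sm() {
    awk '
        /^CA /                { ca = $2; gsub("\047", "", ca) }   # strip quotes around the CA name
        /^[ \t]*Port [0-9]+:/ { port = $2; sub(":", "", port) }   # "1:" becomes "1"
        /^[ \t]*SM lid:/      { if ($3 == 0) print ca, "port", port }
    '
}

# Usage on a node with infiniband-diags installed:
#   ibstat | ports_without_sm
```

If every cabled port shows up in the output, there is probably no subnet manager running and opensm is worth installing. The sminfo tool from the same package should also be able to query the subnet manager directly.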

Sockets Direct Protocol (SDP)

SDP can be used with a preload library to speed up TCP/IP communications over Infiniband. DRBD supports SDP, which offers some performance gains.

The Linux kernel does not include the SDP module. If you want to use SDP you need to install OFED. Thus far I have been unable to get OFED to compile for Proxmox 2.0.

IPoIB

IP over Infiniband allows sending IP packets over the Infiniband fabric.

Proxmox 1.X Prerequisites

Debian Lenny network scripts do not work well with Infiniband interfaces. This can be corrected by installing the following packages from Debian squeeze:

ifenslave-2.6_1.1.0-17_amd64.deb
net-tools_1.60-23_amd64.deb
ifupdown_0.6.10_amd64.deb

Proxmox 2.0

Nothing special is needed with Proxmox 2.0; everything seems to work out of the box.

Note (rob f, 2013-07-13): as far as I know, the opensm package is needed. Update 2013-08-02: we now have a subnet manager running on the IB switch, so we uninstalled opensm. TBD: is it still needed under some circumstances?

aptitude install opensm

Proxmox 3.x

See directions for 2.0. Nothing has changed in 3 that warrants noting here.

Create IPoIB Interface

Bonding

It is not possible to bond Infiniband interfaces to increase throughput. If you want to use bonding for redundancy, create a bonding interface.

/etc/modprobe.d/aliases-bond.conf

alias bond0 bonding
options bond0 mode=1 miimon=100 downdelay=200 updelay=200 max_bonds=2


Infiniband interfaces are named ib0, ib1, etc. Edit /etc/network/interfaces:

auto bond0
iface bond0 inet static
        address  192.168.1.1
        netmask  255.255.255.0
        slaves ib0 ib1
        bond_miimon 100
        bond_mode active-backup
        pre-up modprobe ib_ipoib
        pre-up echo connected > /sys/class/net/ib0/mode
        pre-up echo connected > /sys/class/net/ib1/mode
        pre-up modprobe bond0
        mtu 65520 

To bring up the interface:

ifup bond0
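To see which interface the bond is actually using, the bonding driver exposes a status file under /proc/net/bonding/. The sketch below only reads the "Currently Active Slave" line from that file. The function name active_slave and its optional second argument are assumptions added for illustration, so the function can also be pointed at a saved copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: print the currently active slave of a bond.
#   $1  bond name (defaults to bond0)
#   $2  status file to read (defaults to /proc/net/bonding/<bond>)
active_slave() {
    bond=${1:-bond0}
    status_file=${2:-/proc/net/bonding/$bond}
    # Print only what follows the "Currently Active Slave: " label.
    sed -n 's/^Currently Active Slave: //p' "$status_file"
}

# Usage on a node with the bond configured as above:
#   active_slave bond0
```

With mode active-backup, pulling the cable of the reported interface should make the function report the other interface shortly afterwards.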

Without Bonding

Edit /etc/network/interfaces

auto ib0
iface ib0 inet static
        address  192.168.1.1
        netmask  255.255.255.0
        pre-up modprobe ib_ipoib
        pre-up echo connected > /sys/class/net/ib0/mode
        mtu 65520 

To bring up the interface:

ifup ib0
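After the interface is up, it is worth checking that the pre-up lines took effect. IPoIB interfaces normally start in datagram mode, and as far as I know the 65520 MTU is only accepted in connected mode. The sketch below reads the mode and mtu files from sysfs. The function name ipoib_status and its optional second argument (an alternative sysfs directory) are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical helper: show the IPoIB mode and MTU of one interface.
#   $1  interface name (defaults to ib0)
#   $2  directory holding the per-interface files (defaults to /sys/class/net)
ipoib_status() {
    iface=${1:-ib0}
    sysfs=${2:-/sys/class/net}
    printf '%s mode=%s mtu=%s\n' "$iface" \
        "$(cat "$sysfs/$iface/mode")" \
        "$(cat "$sysfs/$iface/mtu")"
}

# Usage:
#   ipoib_status ib0
#   ipoib_status ib1
```

With the configuration above, the expected result is mode=connected and mtu=65520 for each interface.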

TCP/IP Tuning

These settings performed best on my servers; your mileage may vary.

Edit /etc/sysctl.conf:

# Infiniband tuning
net.ipv4.tcp_mem = 1280000 1280000 1280000
net.ipv4.tcp_wmem = 32768 131072 1280000
net.ipv4.tcp_rmem = 32768 131072 1280000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 1524288
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0

To apply the changes now:

sysctl -p
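To confirm that the kernel is really using the new values, each key in the file can be compared with its live value under /proc/sys (the dots in a key correspond to directory separators there). This is a rough sketch: the function name check_sysctl and its two optional arguments are assumptions added for illustration, and it only understands plain "key = value" lines and "#" comments.

```shell
#!/bin/sh
# Hypothetical helper: compare the values in a sysctl file with the live ones.
#   $1  sysctl file (defaults to /etc/sysctl.conf)
#   $2  root of the live values (defaults to /proc/sys)
check_sysctl() {
    conf=${1:-/etc/sysctl.conf}
    root=${2:-/proc/sys}
    # Drop comment lines and blank lines, then split each line at the first "=".
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$conf" |
    while IFS='=' read -r key want; do
        # Unquoted on purpose: word splitting trims and collapses whitespace
        # (the kernel separates multi-value settings with tabs).
        key=$(echo $key)
        want=$(echo $want)
        have=$(echo $(cat "$root/$(printf '%s' "$key" | tr . /)"))
        if [ "$want" = "$have" ]; then
            echo "ok       $key = $have"
        else
            echo "MISMATCH $key: configured '$want', live '$have'"
        fi
    done
}

# Usage after running sysctl -p:
#   check_sysctl
```

For a single key, sysctl -n followed by the key name prints the same live value.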


iperf speed tests

This is the first time I have used iperf, so there are probably better options to use with the iperf command.

Install iperf on the systems to test:

aptitude install iperf

Run iperf as a server on one system. In this example the server uses IP 10.0.99.8:

iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------

On a client:

# iperf -c 10.0.99.8
------------------------------------------------------------
Client connecting to 10.0.99.8, TCP port 5001
TCP window size:  646 KByte (default)
------------------------------------------------------------
[  3] local 10.0.99.30 port 38629 connected with 10.0.99.8 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  8.98 GBytes  7.71 Gbits/sec
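
A single TCP stream may not be enough to fill the link, so it can be worth repeating the test with several parallel streams (-P) and a longer run (-t, in seconds). Both are standard iperf version 2 client options. The sketch below pulls the rate out of the client output so that runs are easier to compare. The function name iperf_bandwidth is made up for illustration, and the stream count and duration in the usage line are arbitrary choices, not tuned values.

```shell
#!/bin/sh
# Hypothetical helper: read iperf (version 2) client output on stdin and
# print the rate from the last line that reports one. With -P that is
# normally the [SUM] line covering all streams.
iperf_bandwidth() {
    awk '/bits\/sec/ { bw = $(NF-1) " " $NF } END { if (bw != "") print bw }'
}

# Usage, assuming the server from the example is still running on 10.0.99.8:
#   iperf -c 10.0.99.8 -P 4 -t 30 | iperf_bandwidth
```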

I want to see the infiniband interface exposed in my VMs - can I do that?

The short answer is no. The long answer is that you can use manual routes to pass traffic through the virtio interface to your IB card. This will inflict a potentially enormous penalty on the transfer rates but it does work.

In the long run, what is needed is a special KVM driver, similar to the Virtio driver, that would allow the IB card to be abstracted and presented to all your VMs as a separate device.

Using IB for cluster networking

IB can be used for cluster communications.

Edit /etc/hosts and change the host names/IPs to your IB network.  
Reboot each host, and make sure ssh can connect to all hosts from each host over IB.    
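
The ssh check can be scripted so that it is easy to repeat on every host. The sketch below tries a non-interactive login to each name given on the command line. The function name check_ssh and the host names in the usage line are hypothetical. BatchMode and ConnectTimeout are standard OpenSSH client options, used here so the loop fails quickly instead of prompting for a password.

```shell
#!/bin/sh
# Hypothetical helper: try a non-interactive ssh login to each host given
# as an argument and report which ones fail. Returns 1 if any host failed.
check_ssh() {
    rc=0
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return $rc
}

# Usage, run on every host (node1..node3 are placeholder names):
#   check_ssh node1 node2 node3
```

This only shows that ssh works for the names in /etc/hosts. Running getent hosts with a host name shows which address the name resolves to, which should be on the IB network.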

Administration

Install the infiniband-diags package, then check the documentation included in the package and at http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-4.html

apt-get install infiniband-diags
aptitude show infiniband-diags

Package: infiniband-diags                
New: yes
State: installed
Automatically installed: no
Version: 1.4.4-20090314-1.2
Priority: extra
Section: net
Maintainer: OFED and Debian Developement and Discussion <pkg-ofed-devel@lists.alioth.debian.org>
Architecture: amd64
Uncompressed Size: 472 k
Depends: libc6 (>= 2.3), libibcommon1, libibmad1, libibumad1, libopensm2, perl
Description: InfiniBand diagnostic programs
 InfiniBand is a switched fabric communications link used in high-performance computing and enterprise data centers. Its
 features include high throughput, low latency, quality of service and failover, and it is designed to be scalable. 
 
 This package provides diagnostic programs and scripts needed to diagnose an InfiniBand subnet.
Homepage: http://www.openfabrics.org
  • ibstat: in the following output, a cable was not fully connected to the IB card (Port 2 shows State: Down)
# ibstat
CA 'mthca0'
        CA type: MT25208
        Number of ports: 2
        Firmware version: 5.3.0
        Hardware version: a0
        Node GUID: 0x0002c90200277c9c
        System image GUID: 0x0002c90200277c9f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 18
                LMC: 0
                SM lid: 3
                Capability mask: 0x02510a68
                Port GUID: 0x0002c90200277c9d
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510a68
                Port GUID: 0x0002c90200277c9e