Network
Enabling SRIOV on Intel Ethernet Server Adapter
First things first
Check that your Intel Ethernet Server Adapter supports SR-IOV. For more information, take a look at Using SR-IOV with Intel® Ethernet Server Adapters
In a nutshell, you blacklist the VF driver on the host and enable the VFs as part of the KVM guests.
Step 1: Add a line to /etc/modprobe.conf
options ixgbe max_vfs=8
The above configuration will create 8 virtual NICs (VFs) per port. The Intel card supports up to 64 VFs.
Step 2: Blacklist the ixgbevf driver by creating a file called /etc/modprobe.d/blacklist-ixgbevf.conf
blacklist ixgbevf
Step 3: Reboot the machine
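After the reboot, one way to confirm that the VFs exist is to count them in the lspci output. The helper below is a minimal sketch, not part of the Intel guide: the name count_vfs is hypothetical, and it assumes the VFs show up with "Virtual Function" in their lspci description, as they do for 82599-based adapters.

```shell
# count_vfs: hypothetical helper that counts SR-IOV virtual functions.
# Reads lspci-style output on stdin and prints the number of lines that
# mention "Virtual Function" (assumption: the VF description contains it).
count_vfs() {
    # grep -c exits non-zero when the count is 0, so mask the status.
    grep -c "Virtual Function" || true
}

# Usage on the host:
#   lspci | count_vfs
```

With max_vfs=8 you would expect 8 VFs for each port driven by ixgbe.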
Compiling Chelsio iWARP Drivers (2.8.0.0) on CentOS 5
The following is a subset of the Chelsio 2.8.0.0 README. The OS I used was CentOS 5.8.
The Chelsio Unified Wire software has been developed to run on 64-bit Linux
based platforms. Following is the list of Drivers/Software and supported Linux
distributions.
|########################|#####################################################|
|   Linux Distribution   |                   Driver/Software                   |
|########################|#####################################################|
|RHEL5.8,2.6.18-308.el5  |NIC/TOE,vNIC,iWARP,WD-UDP*,WD-TOE*,iSCSI Target*,    |
|                        |Bonding,IPv6,Bypass*,Sniffer & Tracer                |
|                        |UM(Agent,Client),UDP-SO,Filtering,TM                 |
|------------------------|-----------------------------------------------------|
|RHEL5.9,2.6.18-348.el5  |NIC/TOE*,vNIC*,iWARP*,WD-UDP*,WD-TOE*,iSCSI Target*, |
|                        |Bonding*,IPv6*,Bypass*,Sniffer & Tracer*,UDP-SO*,    |
|                        |Filtering*,TM*                                       |
|------------------------|-----------------------------------------------------|
|RHEL6.3,                |NIC/TOE,vNIC,iWARP,WD-UDP,WD-TOE*,iSCSI Target*,     |
|2.6.32-279.el6          |iSCSI Initiator*,FCoE Initiator*,                    |
|                        |Bonding,IPv6,Bypass*,Sniffer & Tracer,UDP-SO,        |
|                        |UM(Agent,Client,WebGUI),Filtering,TM                 |
|------------------------|-----------------------------------------------------|
|RHEL6.4,                |NIC/TOE,vNIC,iWARP,WD-UDP,WD-TOE,iSCSI Target,       |
|2.6.32-358.el6          |iSCSI Initiator,FCoE Initiator,Bonding,IPv6,Bypass,  |
|                        |Sniffer & Tracer,UDP-SO,UM(Agent,Client,WebGUI),     |
|                        |Filtering,TM,uBoot(DUD)                              |
|------------------------|-----------------------------------------------------|
Strangely, I was not able to compile against OFED 3.5.1. It seems that compat-rdma in OFED 3.5.1 has issues with CentOS 5.8. See Failed to build compat-rdma RPM when compiling OFED 3.5.1 on CentOS 5.8
I tried OFED 1.5.4.1, but errors occurred as well. OFED 1.5.3.2 compiled cleanly, and the drivers for the Chelsio T420-BCH built without problems against it. To download OFED 1.5.3.2, visit the OFED Downloads Site
Part 1
To compile from source
i. Download the tarball ChelsioUwire-x.x.x.x.tar.gz
ii. Untar the tarball
[root@host]# tar zxvfm ChelsioUwire-x.x.x.x.tar.gz
iii. Change your current working directory to Chelsio Unified Wire package
directory. Build the source:
[root@host]# make
iv. Install the drivers, tools and libraries:
[root@host]# make install
v. The default configuration tuning option is Unified Wire.
The configuration tuning can be selected using the following commands:
[root@host]# make CONF=<T5/T4 Configuration>
[root@host]# make CONF=<T5/T4 Configuration> install
(where T5/T4 Configuration is one of
UNIFIED_WIRE, HIGH_CAPACITY_TOE, HIGH_CAPACITY_RDMA, LOW_LATENCY, UDP_OFFLOAD, T5_WIRE_DIRECT_LATENCY)
Part 2 – Installing Individual Drivers
i. To build and install iWARP driver against outbox OFED:
[root@host]# make iwarp
[root@host]# make iwarp_install
Part 3a – Loading IWARP Drivers
Manually Load Drivers
To load the iWARP driver we need to load the NIC driver & core RDMA drivers first:
[root@host]# modprobe cxgb4
[root@host]# modprobe iw_cxgb4
[root@host]# modprobe rdma_ucm
Part 3b – Loading iWARP Drivers Automatically
To load the Chelsio iWARP drivers automatically, add the following lines to /etc/modprobe.conf
options iw_cxgb4 peer2peer=1
install cxgb4 /sbin/modprobe -i cxgb4; /sbin/modprobe -f iw_cxgb4; /sbin/modprobe rdma_ucm
alias eth1 cxgb4 # assuming eth1 is used by the Chelsio interface
Finally, reboot the system to load the new modules.
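Once the system is back up, it is worth checking that all three modules were loaded. The following is a minimal sketch rather than anything from the Chelsio README: missing_modules is a hypothetical helper that reads lsmod output on stdin and prints every module named on its command line that is not loaded.

```shell
# missing_modules: hypothetical helper, not part of the Chelsio package.
# Reads lsmod output on stdin and prints each requested module that is
# NOT present. Prints nothing when every module is loaded.
missing_modules() {
    local loaded m
    # Skip the lsmod header line and keep only the module-name column.
    loaded=$(awk 'NR > 1 { print $1 }')
    for m in "$@"; do
        # -x: whole-line match, -F: treat the name as a fixed string.
        echo "$loaded" | grep -qxF "$m" || echo "$m"
    done
}

# Usage:
#   lsmod | missing_modules cxgb4 iw_cxgb4 rdma_ucm
```

An empty result means cxgb4, iw_cxgb4 and rdma_ucm are all present.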
Registering sufficient memory for OpenIB when using Mellanox HCA
If you encounter errors such as “error registering openib memory”, similar to the warning below, take a look at the Open MPI FAQ entry I’m getting errors about “error registering openib memory”; what do I do?
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.
See this Open MPI FAQ item for more information on these Linux kernel module
parameters:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: node02
Registerable memory: 32768 MiB
Total memory: 65476 MiB
Your MPI job will continue, but may be behave poorly and/or hang.
The explanation and solution can be found at How to increase MTT Size in Mellanox HCA.
In summary, the error occurs when an application consumes a large amount of memory and not enough of it can be registered with RDMA, so the application may fail. The fix is to increase the MTT size. However, increasing the MTT size has the downside of increasing the number of “cache misses”, which increases latency.
1. To check your value of log_num_mtt
# cat /sys/module/mlx4_core/parameters/log_num_mtt
2. To check your value of log_mtts_per_seg
# cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
Two parameters affect registered memory. The following is taken from http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
With Mellanox hardware, two parameters are provided to control the size of this table:
- log_num_mtt (on some older Mellanox hardware, the parameter may be num_mtt, not log_num_mtt): number of memory translation tables
- log_mtts_per_seg:
The amount of memory that can be registered is calculated using this formula:
In newer hardware:
max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE
In older hardware:
max_reg_mem = num_mtt * (2^log_mtts_per_seg) * PAGE_SIZE
For example, suppose your server has 64GB of physical RAM. You will need to register twice that amount (2 x 64GB = 128GB) as max_reg_mem. You will also need to know the PAGE_SIZE (see Virtual Memory PAGESIZE on CentOS). With log_mtts_per_seg = 3 and a 4 kB page:
max_reg_mem = (2^log_num_mtt) * (2^3) * (4 kB)
128GB = (2^log_num_mtt) * (2^3) * (4 kB)
2^37 = (2^log_num_mtt) * (2^3) * (2^12)
2^22 = (2^log_num_mtt)
22 = log_num_mtt
The setting is stored in /etc/modprobe.d/mlx4_mtt.conf on every node.
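The same arithmetic can be scripted. The function below is a sketch under the assumptions used in the example: the target is twice the physical RAM, and calc_log_num_mtt is a hypothetical name. It finds the smallest log_num_mtt that satisfies the newer-hardware formula.

```shell
# calc_log_num_mtt: hypothetical helper.
# Prints the smallest log_num_mtt such that
#   (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE >= 2 * RAM
# Arguments: RAM in GiB, log_mtts_per_seg, page size in bytes.
calc_log_num_mtt() {
    local ram_gib=$1 log_mtts_per_seg=$2 page_size=$3
    # Target: register twice the physical memory, in bytes.
    local target=$(( ram_gib * 2 * 1024 * 1024 * 1024 ))
    # Bytes covered by one MTT segment.
    local per_seg=$(( (1 << log_mtts_per_seg) * page_size ))
    local n=0
    while [ $(( (1 << n) * per_seg )) -lt "$target" ]; do
        n=$(( n + 1 ))
    done
    echo "$n"
}

# Usage, taking the page size from getconf:
#   calc_log_num_mtt 64 3 "$(getconf PAGESIZE)"
```

For the 64GB example with log_mtts_per_seg=3 and a 4 kB page this gives 22, so the line in /etc/modprobe.d/mlx4_mtt.conf would read: options mlx4_core log_num_mtt=22 log_mtts_per_seg=3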
Adding and Removing the 2nd Mellanox Ethernet Port as an uplink to an Existing Vswitch using the CLI
In the vSphere 5.1 Client, I was able to see the dual-port network adapter (vmnic22.p1, vmnic22.p2) after installing the driver as described in Installing Mellanox ConnectX® EN 10GbE Drivers for VMware® ESX 5.x Server.
However, I was not able to use the 2nd port of the Mellanox ConnectX 10G. It is not visible under vSphere Client > Configuration > Networking, and it is not offered under vSphere Client > Configuration > Networking > Add Networking either.
I found a document from Mellanox (MellanoxMLX4_ENDriverforVMwareESXi-5.xREADME) which is useful for resolving the issue. See page 10:
Adding the Device as an uplink to an Existing Vswitch using the CLI
Step 1: Log into the ESXi server with root permission
Step 2: Add an uplink to a vswitch, run:
# esxcli network vswitch standard uplink add -u <uplink_name> -v <vswitch_name>
* uplink_name refers to the name ESX uses for the network adapter. For example, vmnic22.p2 is an uplink name.
Step 3: Check that uplink was added successfully. Run:
# esxcli network vswitch standard list -v <vswitch_name>
Removing the Device as an uplink from an Existing Vswitch using the CLI
Step 1: Log into the ESXi server with root permissions
Step 2: Remove an uplink from a vswitch, run:
# esxcli network vswitch standard uplink remove -u <uplink_name> -v <vswitch_name>
Upgrading Mellanox ConnectX® EN 10GbE Drivers for VMware® ESX 5.x Server
Do read the Blog Entry Installing Mellanox ConnectX® EN 10GbE Drivers for VMware® ESX 5.x Server
Step 1: At VMware ESX 5.x Hypervisor,
- Click the F2 button
- Select “Troubleshoot Options”
- “Enable ESXi Shell” and “Enable SSH”
Step 2: Download the VMware ESXi 5.0 Driver for Mellanox ConnectX Ethernet Adapters
Step 3: Unzip the mlx4_en-mlnx-1.6.1.2-offline_bundle-471530.zip
Step 4: The upgrade process is similar to a new install, except the command that should be issued is the following:
# esxcli software vib update -v {VIBFILE}
In the example above, this would be:
# esxcli software vib update -v /tmp/net-mlx4-en-1.6.1.2-1OEM.500.0.0.406165.x86_64.vib
Installing Mellanox ConnectX® EN 10GbE Drivers for VMware® ESX 5.x Server
If you have a Mellanox Technologies MT27500 Family [ConnectX-3] 10G Ethernet card, it may not be detected automatically by the VMware ESX 5.x Hypervisor. You have to install the driver manually on VMware ESX 5.x.
Step 1: At VMware ESX 5.x Hypervisor,
- Click the F2 button
- Select “Troubleshoot Options”
- “Enable ESXi Shell” and “Enable SSH”
Step 2: Download the VMware ESXi 5.0 Driver for Mellanox ConnectX Ethernet Adapters
Step 3: Unzip the mlx4_en-mlnx-1.6.1.2-offline_bundle-471530.zip
Step 4: Read the README file
VMware uses a file package called a VIB (VMware Installation Bundle) as the mechanism for installing or upgrading software packages on an ESX server.
The file may be installed directly on an ESX server from the command line, or through the VMware Update Manager (VUM).
Step 5: For New Installation (From README and modified)
For new installs, you should perform the following steps:
Step 5a: Copy the VIB to the ESX server. Technically, you can place the file anywhere that is accessible to the ESX console shell, but for these instructions, we’ll assume the location is in ‘/tmp’.
Here’s an example of using the Linux ‘scp’ utility to copy the file from a local system to an ESX server located at 10.10.10.10:
# scp net-mlx4-en-1.6.1.2-1OEM.500.0.0.406165.x86_64.vib root@10.10.10.10:/tmp
Step 5b: Issue the following command (full path to the VIB must be specified):
# esxcli software vib install -v {VIBFILE}
In the example above, this would be:
# esxcli software vib install -v /tmp/net-mlx4-en-1.6.1.2-1OEM.500.0.0.406165.x86_64.vib
PBS scripts for mpirun parameters for Chelsio / Infiniband Cards
If you are running Chelsio cards, you may want to specify the mpirun parameters explicitly:
/usr/mpi/intel/openmpi-1.4.3/bin/mpirun -mca btl openib,sm,self --bind-to-core --report-bindings -np $NCPUS -machinefile $PBS_NODEFILE $PBS_O_WORKDIR/$file
--bind-to-core: Bind each MPI process to a core
-mca btl openib,sm,self: (InfiniBand, shared memory, the loopback)
For information on Interprocess communication with shared memory,
Diagnostic Tools to diagnose Infiniband Fabric Information
There are a few diagnostic tools to diagnose InfiniBand fabric information. Use man for the parameters of each tool.
- ibnodes (Show InfiniBand nodes in topology)
- ibhosts (Show InfiniBand host nodes in topology)
- ibswitches (Show InfiniBand switch nodes in topology)
- ibnetdiscover (Discover InfiniBand topology)
- ibchecknet (Validate IB subnet and report errors)
- ibdiag (Scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices)
- perfquery (Find errors on one or more HCA and switch ports)
ibnodes (Show Infiniband nodes in topology)
ibnodes is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the IB nodes (CAs and switches)
# ibnodes
.....
Ca : 0x0000000000009b02 ports 2 "c00 HCA-1"
Ca : 0x0000000000005af0 ports 1 "h00 HCA-1"
Switch : 0x00000000000000fa ports 36 "IBM HSSM" enhanced port 0 lid 19 lmc 0
.....
ibhosts (Show InfiniBand host nodes in topology)
ibhosts is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the CA nodes.
# ibhosts
Ca : 0x0000000000009b02 ports 2 "c00 HCA-1"
Ca : 0x0000000000005af0 ports 1 "h00 HCA-1"
ibswitches (Show InfiniBand switch nodes in topology)
ibswitches is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the switch nodes.
# ibswitches
Switch : 0x00000000000003fa ports 36 "IBM HSSM" enhanced port 0 lid 19 lmc 0
Switch : 0x00000000000003cc ports 36 "IBM HSSM" enhanced port 0 lid 16 lmc 0
ibnetdiscover (Discover InfiniBand topology)
ibnetdiscover performs IB subnet discovery and outputs a human readable topology file. GUIDs, node types, and port numbers are displayed as well as port LIDs and NodeDescriptions. All nodes (and links) are displayed (full topology). Optionally, this utility can be used to list the current connected nodes by nodetype. The output is printed to standard output unless a topology file is specified.
# ibnetdiscover
#
# Topology file: generated on Mon Jan 28 14:19:57 2013
#
# Initiated from node 0000000000000080 port 0000090300451281
vendid=0x2c9
devid=0xc738
sysimgguid=0x2c90000000000
switchguid=0x2c90000000080(0000000000080)
Switch 36 "S-0002c9030071ba80" # "MF0;switch-6260a0:SX90Y3245/U1" enhanced port 0 lid 2 lmc 0
[2] "H-00000000000011e0"[1](00000000000e1) # "node-c01 HCA-1" lid 3 4xQDR
[3] "H-00000000000012d0"[1](00000000000d1) # "node-c02 HCA-1" lid 4 4xQDR
....
....
ibchecknet (Validate IB subnet and report errors)
# ibchecknet
......
......
## Summary: 31 nodes checked, 0 bad nodes found
## 88 ports checked, 59 bad ports found
## 12 ports have errors beyond threshold
perfquery command
The perfquery command is useful for finding errors on one or more HCA and switch ports. You can also use perfquery to reset HCA and switch port counters.
# Port counters: Lid 1 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............0
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................13
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................199578830
PortRcvData:.....................504398997
PortXmitPkts:....................15649860
PortRcvPkts:.....................15645526
PortXmitWait:....................0
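With this many counters it is easy to miss the one that is non-zero. The filter below is a rough sketch, not a perfquery feature: nonzero_counters is a hypothetical name. It assumes perfquery prints one "Name:....value" pair per line, and that the interesting counters are those whose name contains Error, Discard, Dropped or Downed.

```shell
# nonzero_counters: hypothetical filter for perfquery output.
# Reads perfquery output on stdin and prints "Name=value" for every
# error-type counter whose value is greater than zero.
nonzero_counters() {
    # The field separator is the colon plus the run of dots after the name.
    awk -F'[:.]+' '
        $1 ~ /Error|Discard|Dropped|Downed/ && $2 + 0 > 0 {
            print $1 "=" $2
        }'
}

# Usage:
#   perfquery | nonzero_counters
```

Traffic counters such as PortXmitData are deliberately ignored, since large values there are normal.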
Diagnostic Tools to diagnose Infiniband Device
There are a few Diagnostic Tools to diagnose Infiniband Devices.
- ibv_devinfo (Query RDMA devices)
- ibstat (Query basic status of InfiniBand device(s))
- ibstatus (Query basic status of InfiniBand device(s))
ibv_devinfo (Query RDMA devices)
Print information about RDMA devices available for use from userspace.
# ibv_devinfo
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.10.2322
    node_guid: 0002:c903:0045:1280
    sys_image_guid: 0002:c903:0045:1283
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: IBM0FD0140019
    phys_port_cnt: 2
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 2048 (4)
            active_mtu: 2048 (4)
            sm_lid: 1
            port_lid: 1
            port_lmc: 0x00
            link_layer: IB
        port: 2
            state: PORT_DOWN (1)
            max_mtu: 2048 (4)
            active_mtu: 2048 (4)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: IB
ibstat (Query basic status of InfiniBand device(s))
ibstat is a binary which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state.
It is similar to the ibstatus utility but implemented as a binary rather than a script. It has options to list CAs and/or ports and displays more information than ibstatus.
# ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.10.2322
Hardware version: 0
Node GUID: 0x0002c90300451280
System image GUID: 0x0002c90300451283
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251486a
Port GUID: 0x0002c90300451281
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x0002c90300451282
Link layer: InfiniBand
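When checking many nodes, the full ibstat output is more than you need. Here is a minimal sketch that reduces it to one line per port. The name port_states is hypothetical, and it assumes the "Port N:" and "State:" line layout shown above.

```shell
# port_states: hypothetical summariser for ibstat output.
# Reads ibstat output on stdin and prints "port N: <State>" per port.
port_states() {
    awk '
        # "Port 1:" starts a new port section; remember its number.
        /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:/, "", port); next }
        # "State: Active" (not "Physical state:") gives the logical state.
        /^[ \t]*State:/ { print "port " port ": " $2 }
    '
}

# Usage:
#   ibstat | port_states
```

For the output above this reports port 1 as Active and port 2 as Down.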
ibstatus – (Query basic status of InfiniBand device(s))
ibstatus is a script which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state.
# ibstatus
Infiniband device 'mlx4_0' port 1 status:
    default gid: fe80:0000:0000:0000:0002:c903:0045:1281
    base lid: 0x1
    sm lid: 0x1
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 40 Gb/sec (4X QDR)
    link_layer: InfiniBand
Infiniband device 'mlx4_0' port 2 status:
    default gid: fe80:0000:0000:0000:0002:c903:0045:1282
    base lid: 0x0
    sm lid: 0x0
    state: 1: DOWN
    phys state: 2: Polling
    rate: 40 Gb/sec (4X QDR)
    link_layer: InfiniBand