Performance Report: Product Numbers A5483A, A5513A, and A5515A
Introduction
A series of tests was run to evaluate the performance of HP ATM 622 and 155 Mbps adapters on various HP 9000 server and workstation platforms. The tests also measured the effect of Classical IP and LAN Emulation interfaces, extended packet sizes, and TCP configuration parameters on the adapters' performance.
A second set of tests evaluated the scalability with respect to number of adapters and number of CPUs on various HP 9000 server platforms.
This document presents the performance results and provides tuning guidelines to achieve maximum possible performance from the ATM adapters.
Appendix A provides the Test Methodology and Environment used for the performance tests.
Test highlights
- The 622 Mbps adapter delivers bidirectional TCP throughput in excess of 1 Gbit/sec on the N4000 platform. This is 94% of the theoretical maximum TCP throughput for a 622 Mbps link.
- The unidirectional TCP throughput varies linearly with the number of 622 Mbps adapters. A TCP throughput of 4.2 Gbits/sec is achieved with eight 622 Mbps adapters on the N4000 platform.
- The unidirectional TCP throughput for the 155 Mbps adapter is 134.5 Mbps, which is equal to the theoretical maximum throughput for a 155 Mbps link.
- The TCP throughput varies linearly with the number of 155 Mbps adapters. An aggregate bidirectional throughput of 3 Gbits/sec is achieved with twelve 155 Mbps adapters on the V2500 and N4000 platforms.
All the results above use the Classical IP interface with an MTU size of 9180 bytes.
Single adapter configuration test results
The theoretical maximum TCP throughput is approximately 135 Mbps for the 155 adapter and 542 Mbps for the 622 adapter. See Appendix A for calculations of the theoretical maximum TCP throughput using protocol overhead at various layers.
Table 1 below shows the TCP throughput numbers for the N4000 platform. See Tables 4 and 5 for results on other platforms.
Table 1: maximum TCP throughput on N4000*
| | N4000 (155 Mbps Adapter) | N4000 (622 Mbps Adapter) |
| Outbound throughput Mbps | 134.45 | 538.14 |
| CPU utilization % | 6.5 | 26.85 |
| Inbound throughput Mbps | 132.03 | 535.42 |
| CPU utilization % | 5.65 | 23.61 |
| Bidirectional throughput Mbps | 260.12 | 1005 |
| CPU utilization % | 11.17 | 57.1 |
*One CPU, Classical IP Interface, MTU 9180 bytes
The packet size (that is, the Maximum Transmission Unit [MTU] size) and the interface type influence the performance of the adapters. Table 2 below shows the maximum TCP throughput and CPU utilization figures for three different interface types for 155 Mbps adapters. Table 3 below shows the maximum throughput and CPU utilization figures for the same interface types for 622 Mbps adapters.
Table 2: TCP throughput on various interface types*
| Test/interface | Classical IP MTU 9180 bytes | LANE MTU 9218 bytes | LANE MTU 1516 bytes |
| Outbound Mbps | 134.45 | 134.13 | 128.4 |
| CPU utilization % | 6.5 | 8.4 | 10.4 |
| Inbound Mbps | 132.03 | 132.38 | 128.22 |
| CPU utilization % | 5.65 | 6.42 | 8.49 |
| Bidirectional Mbps | 260.12 | 258.14 | 240.25 |
| CPU utilization % | 11.17 | 12.24 | 17.8 |
*N4000 system, One CPU, 155 Mbps Adapter
Table 3: TCP throughput on various interface types*
| Test/interface | Classical IP MTU 9180 bytes | LANE MTU 9218 bytes | LANE MTU 1516 bytes |
| Outbound Mbps | 538.14 | 533 | 448 |
| CPU utilization % | 26.85 | 28.58 | 88.79 |
| Inbound Mbps | 535.42 | 537 | 454.48 |
| CPU utilization % | 23.61 | 52.52 | 20.05 |
| Bidirectional Mbps | 1005 | 965 | 453.15 |
| CPU utilization % | 57.1 | 77.14 | 71.27 |
*N4000 system, 1 CPU, 622 Mbps Adapter
The TCP window size is an important parameter that affects throughput. For 622 Mbps adapters, the TCP window size on the receiving end must be set to at least 144 Kbytes (instead of the default window size of 64 Kbytes) to achieve the maximum throughput. For both 155 and 622 Mbps adapters, a window size of 144 Kbytes also minimizes the sensitivity to application message size. See Figure 1 below for an example of outbound throughput.
Figure 1: TCP Window Size and Outbound TCP throughput on N4000 (Classical IP, 622 Adapter)
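One way to see why the window size matters is the usual sliding-window bound: a sender can have at most one window of data in flight per round trip, so throughput cannot exceed window size divided by round-trip time. The sketch below applies that bound, which is an assumption of this example and not a measurement from the tests, to estimate the largest round-trip time at which a given window could still sustain the 542 Mbps theoretical maximum of the 622 adapter. The function name is illustrative, and 1 Kbyte is taken as 1024 bytes.

```python
def max_rtt_ms(window_bytes, target_mbps):
    """Largest round-trip time (in ms) at which one window per round trip
    still delivers target_mbps, using throughput <= window / RTT."""
    window_bits = window_bytes * 8
    return window_bits / (target_mbps * 1e6) * 1e3


# Compare the default 64 Kbyte window with the recommended 144 Kbytes.
for kbytes in (64, 144):
    rtt = max_rtt_ms(kbytes * 1024, 542)
    print(f"{kbytes} Kbyte window: RTT must stay below about {rtt:.2f} ms")
```

Under this simple model the default window leaves a budget of roughly one millisecond per round trip, while the larger window more than doubles it.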
The second set of tests consists of an exchange of a single request packet and a single response packet of 1 byte each, measured at the TCP level. The performance metric is the aggregate number of request/response packet pairs (transactions) per second. This metric indicates the number of user-level packets the adapter can process in a second.
Test results show that the CPU and the platform type influence the single byte request/response performance.
A transaction consisting of a 1024-byte request and a 2048-byte response over UDP models an environment of NFS services running over ATM.
Tables 4 and 5 below show the throughput and request/response results on different platforms. All the results are for MTU size of 9180 bytes.
Table 4¹: results on different platforms with one 155 adapter*
| Test/platform | V2250 | V2500 | N4000 |
| Memory | 1 GB | 1 GB | 1 GB |
| Outbound TCP throughput Mbps | 134.73 | 134.85 | 134.45 |
| CPU utilization % | 8.3 | 5.99 | 6.5 |
| Inbound TCP throughput Mbps | 132.78 | 132.36 | 132.03 |
| CPU utilization % | 13.41 | 4.81 | 5.65 |
| Bidirectional TCP throughput Mbps | 261.6 | 255.58 | 260.12 |
| CPU utilization % | 24.675 | 10.25 | 11.17 |
| TCP Single byte R/R transactions/sec | 8364 | 14654 | 20421 |
| CPU utilization % | 100 | 89.5 | 92.37 |
| UDP (1K/2K) R/R transactions/sec | 7264 | 8694 | 8769 |
| CPU utilization % | 100.00 | 65.63 | 51.11 |
*Classical IP, One CPU
¹ The V2500 platform tests for 155 and 622 adapters were run with 2 CPUs because the minimum configuration for V2500 is 2 CPUs. The CPU utilization reported for V2500 is per CPU.
Table 5: results on different platforms with one 622 adapter*
| Test/platform | V2250 | V2500 | N4000 |
| Memory | 1 GB | 1 GB | 1 GB |
| Outbound TCP throughput Mbps | 529.67 | 537.75 | 538.14 |
| CPU utilization % | 65.78 | 15.52 | 26.85 |
| Inbound TCP throughput Mbps | 512.41 | 538.29 | 535.42 |
| CPU utilization % | 81.15 | 18.61 | 23.61 |
| Bidirectional TCP throughput Mbps | 680.3 | 962.95 | 1005 |
| CPU utilization % | 91.81 | 50.2 | 57.1 |
| TCP Single byte R/R transactions/sec | 8367 | 20944 | 28536 |
| CPU utilization % | 100 | 100 | 100 |
| UDP (1K/2K) R/R transactions/sec | 7413 | 14658 | 18901 |
| CPU utilization % | 100.00 | 100.00 | 100.00 |
*Classical IP, One CPU
The bidirectional TCP throughput on the V2250 is limited to 680 Mbps because the V2250 has a 1X PCI I/O backplane (32 bit, 33 MHz). The V2500 has a 2X PCI I/O backplane (64 bit, 33 MHz).
Multiple adapter configuration test results
For a Classical IP interface with an MTU size of 9180 bytes, the TCP throughput in the outbound direction varies almost linearly with the number of adapters, for up to eight 622 adapters and twelve 155 adapters on the V2500 and N4000 platforms. See Figures 2 and 3 below.
Figure 2: TCP Outbound throughput Scalability (N4000: 8 CPUs, 622 Adapters, Classical IP)
Figure 3: TCP throughput Scalability (V2500: 8 CPUs, 155 Adapters, Classical IP)
Each netperf process was bound to the CPU that its adapter interrupts. The Classical IP interface on each adapter was configured with a different IP subnet; that is, there were eight IP subnets configured in an eight-adapter test.
On V-series platforms, one 622 adapter was configured per PCI Controller, and a maximum of two 155 adapters were installed in a PCI Controller. On N-series platforms, the 622 Adapters were installed in the Twin Turbo slots.
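The "almost linear" claim can be checked with simple arithmetic on the N4000 outbound figures in Tables 4 through 7. The sketch below divides the eight-adapter throughput by eight times the single-adapter throughput. The dictionary and function names are illustrative, and the comparison is only approximate because the single-adapter runs used one CPU while the eight-adapter runs used eight.

```python
# N4000 outbound TCP throughput in Mbps, copied from the report's tables.
SINGLE_ADAPTER_MBPS = {"155": 134.45, "622": 538.14}  # Tables 4 and 5
EIGHT_ADAPTER_MBPS = {"155": 1076.0, "622": 4209.0}   # Tables 6 and 7


def scaling_efficiency(aggregate_mbps, single_mbps, adapters):
    """Aggregate throughput as a fraction of perfect linear scaling."""
    return aggregate_mbps / (single_mbps * adapters)


for adapter in ("155", "622"):
    eff = scaling_efficiency(EIGHT_ADAPTER_MBPS[adapter],
                             SINGLE_ADAPTER_MBPS[adapter], 8)
    print(f"{adapter} adapter, 8 cards: {eff:.1%} of linear")
```

A value close to 100% means that adding adapters costs almost nothing in per-adapter throughput.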
Tables 6 and 7 below show scalability results on different server platforms for up to eight 155 and 622 adapters, respectively.
Table 6: results on different platforms with 8 155 adapters*
| Test/platform | V2250 | V2500 | N4000 |
| Memory | 1 GB | 1 GB | 1 GB |
| Outbound TCP throughput Mbps | 1034 | 1041 | 1076 |
| CPU utilization % | 23.96 | 20.94 | 6.04 |
| Inbound TCP throughput Mbps | 1038 | 1054 | 1059 |
| CPU utilization % | 21.36 | 10.94 | 7.05 |
| Bidirectional TCP throughput Mbps | 1698 | 1902 | 2062 |
| CPU utilization % | 43.97 | 50.12 | 26.86 |
| TCP Single byte R/R transactions/sec | 38097 | 56918 | 62466 |
| CPU utilization % | 95.16 | 100 | 100 |
| UDP (1K/2K) R/R transactions/sec | 34450 | 36752 | 40943 |
| CPU utilization % | 100.00 | 100.00 | 100.00 |
*8 CPUs, Classical IP
Table 7²: results on different platforms with 8 622 adapters*
| Test/platform | V2250 | V2500 | N4000 |
| Memory | 1 GB | 1 GB | 1 GB |
| Outbound TCP throughput Mbps | 3338 | 3799 | 4209 |
| CPU utilization % | 95.67 | 77.61 | 95.78 |
| Inbound TCP throughput Mbps | 1965 | 3605 | 4000 |
| CPU utilization % | 70.76 | 54.47 | 83.76 |
| Bidirectional TCP throughput Mbps | 2360 | 3262 | 4020 |
| CPU utilization % | 96.42 | 85.24 | 90.97 |
| TCP Single byte R/R transactions/sec | 44819 | 67935 | 75763 |
| CPU utilization % | 100 | 100 | 100 |
| UDP (1K/2K) R/R transactions/sec | 33669 | 52632 | 49787 |
| CPU utilization % | 100 | 100 | 100 |
*8 CPUs, Classical IP
² The inbound and bidirectional throughput data for the N4000 platform are estimated.
Tests on the LAN Emulation interface show that the maximum throughput using multiple adapters is less than the corresponding results for the Classical IP interface. Tests with an MTU size of 9 Kbytes give better results than tests with an MTU size of 1500 bytes. For example, see Tables 8 and 9 below for results on 155 and 622 adapters, respectively.
Table 8: TCP throughput results on LAN emulation interface*
| Test/interface | LANE MTU 9218 bytes | LANE MTU 1516 bytes |
| Outbound TCP throughput Mbps | 538 | 475 |
| CPU utilization % | 2.75 | 6.3 |
| Inbound TCP throughput Mbps | 529 | 268 |
| CPU utilization % | 4.04 | 3.28 |
| Bidirectional TCP throughput Mbps | 772 | 338 |
| CPU utilization % | 6.8 | 4.90 |
*N4000: 8 CPUs, 4 155 Adapters
Table 9: TCP throughput results on LAN emulation interface*
| Test/interface | LANE MTU 9218 bytes | LANE MTU 1516 bytes |
| Outbound TCP throughput Mbps | 1567 | 598 |
| CPU utilization % | 14.36 | 11.83 |
| Inbound TCP throughput Mbps | 1221 | 444 |
| CPU utilization % | 11.91 | 6.91 |
| Bidirectional TCP throughput Mbps | 1529 | 680 |
| CPU utilization % | 23.54 | 20.88 |
*N4000: 8 CPUs, 4 622 Adapters
Tuning guidelines
• System-related Guidelines
When installing multiple adapters, make sure the adapters are evenly distributed in the card cages on V-series platforms. In particular, avoid installing more than three 155 adapters or more than one 622 adapter per PCI Controller on the V-series platforms. On N-class servers, install the 622 adapters in the Twin Turbo slots. Also, distribute the adapters evenly across the slots of both I/O controllers on N-class systems.
For best performance, we recommend that all four memory carriers be installed on the N-class systems. Also, distribute the memory modules evenly on the memory carriers. When using multiple adapters, 4 Gbytes of memory is recommended. For single adapter installations, the memory size of 1 Gbyte is recommended.
• Application-related Guidelines
For standard applications (for example, ftp, NFS, and others), if the socket size is configurable, choose socket size (on the receiving end) and message length values of at least 144 Kbytes. This sets the TCP window size parameter. Alternatively, the TCP window size on the receiving end can be set using the ndd command. Set the tcp_recv_hiwater_def parameter to 144 Kbytes. If the receiving system is a non-HP platform, use a command equivalent to the ndd command on that platform.
For a customized application, use the setsockopt call to define a socket buffer size of at least 128 Kbytes. When possible, design the application to use the largest message size possible.
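As a minimal sketch of the socket-level setting, the Python fragment below requests a 144 Kbyte receive buffer through the standard SO_RCVBUF socket option. It is an illustration only: the tests in this report did not use Python, the helper name make_receiver_socket is hypothetical, and the buffer size actually granted depends on the operating system and its configured limits.

```python
import socket

RECV_BUFFER_BYTES = 144 * 1024  # 144 Kbytes, taking 1 Kbyte as 1024 bytes


def make_receiver_socket(bufsize=RECV_BUFFER_BYTES):
    """Create a TCP socket and request a receive buffer of bufsize bytes.

    Set the option before listen() or connect() so that the window
    offered on the connection can reflect it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    return sock


sock = make_receiver_socket()
# The operating system may round, double, or cap the requested size,
# so read the value back instead of assuming it was honored.
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"requested {RECV_BUFFER_BYTES} bytes, system reports {granted}")
sock.close()
```

Reading the option back is a cheap way to confirm that a system-wide limit has not silently reduced the request.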
Throughput-intensive applications can see a throughput increase of about 20% when the application is bound to a specific CPU. This is done as follows:
Use the sar command to find out which CPU the adapter is interrupting. With enough traffic on the adapter, the sar output should show interrupts on only one CPU. Then bind the application to this CPU by using the mpctl system call.
• Network design-related Guidelines
When using LAN Emulation, HP recommends that all of the ATM systems be put on a separate subnet and that the subnet use an MTU size of 9 Kbytes. Subnets that include legacy systems must use the MTU size used by the legacy systems; for example, Ethernet uses a maximum MTU size of 1500 bytes.
Appendix A: test methodology and test environment
Test methodology
HP's Netperf version 2.1 was used as the test tool in all the measurements. Netperf measures the throughput at the TCP level. This measurement models many real life applications since these applications use the TCP protocol.
The target systems were configured with a minimum of 2 Gbytes of memory. The ATM software version used was I.11.00.10 with Build Revision 7.39; this corresponds to the DART45 release of ATM software. HP-UX 11.0 was used as the underlying operating system. All configuration parameters were left at their defaults, except for the LANE MTU size. See Figures 4 and 5 below for the test topology.
Figure 4: single adapter test configuration
The FORE Systems ASX-1000 ATM switch ran ForeThought version 4.0 switch software. The Cisco 1010 ATM switch ran version 11.1 of the Cisco IOS software. The N4000 platform had all four memory carriers installed for the multiple adapter tests. Table 10 below provides the server and workstation platform configurations on which the performance tests were run.
Figure 5: multiple adapter test configuration
Table 10: server and workstation platform characteristics
| Platform | CPU | ICACHE/DCACHE | Memory | SPECint95 | OS |
| N4000 | PA8500 440 MHz | 0.5 MB/1 MB | 1 GB³ | 34 | 11.00 May 99 |
| V2500 | PA8500 440 MHz | 0.5 MB/1 MB | 1 GB | N/A | 11.00 Dec 98 |
| V2250 | PA8200 240 MHz | 2 MB/2 MB | 1 GB | 16.4 | 11.00 Dec 98 |
³ 8 GB for the multiple adapter tests.
Protocol overheads and theoretical maximum TCP throughput
The throughput as seen by a TCP application is less than the OC-3 physical layer throughput of 155 Mbps or OC-12 (622 Mbps). This is because of the various protocol headers that are inserted by the protocol stack. For example, the protocol headers for Classical IP interface are as follows:
- TCP/IP Overhead
This consists of 20 bytes of TCP header + 20 bytes of IP header + 8 bytes of LLC header, for a total of 48 bytes.
- ATM AAL5 Overhead
The Convergence Sublayer appends an 8-byte trailer and a pad. The pad size varies from 0 to 47 bytes, depending on the application message size.
- ATM Layer Overhead
The ATM Layer adds a 5-byte header to each 48-byte Segmentation and Reassembly PDU to form a 53-byte ATM cell.
- Physical Layer Overhead
The Transmission Convergence sublayer creates SONET frames of 27 ATM cells by appending one overhead ATM cell to 26 data ATM cells.
This leads to the following equation for the theoretical TCP throughput (T):
T = (D / (D + 56 + P)) * (48/53) * (26/27) * 155.52 Mbps (or 622 Mbps)
where D = application message size in bytes and
P = 0-47 bytes of padding added at the AAL5 layer.
The constant 56 is the 48 bytes of TCP/IP and LLC headers plus the 8-byte AAL5 trailer.
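The equation can be evaluated directly. The sketch below is one possible reading of it: it assumes the pad is whatever brings the message plus 56 bytes of overhead up to a whole number of 48-byte cell payloads, and it uses 622 Mbps for the faster link exactly as the equation is written. The function and constant names are illustrative. For message sizes of 128, 256, and 1024 bytes and 64 Kbytes, it gives the same rounded values as the table that follows.

```python
import math

OVERHEAD_BYTES = 56  # 48 bytes of TCP/IP/LLC headers + 8-byte AAL5 trailer
CELL_PAYLOAD = 48    # payload bytes carried by one ATM cell
CELL_SIZE = 53       # payload plus the 5-byte cell header


def theoretical_tcp_mbps(message_bytes, line_rate_mbps=155.52):
    """Theoretical Classical IP TCP throughput for one message size."""
    framed = message_bytes + OVERHEAD_BYTES
    # AAL5 pads the frame to a whole number of 48-byte cell payloads,
    # so the padded length equals D + 56 + P in the equation.
    padded = math.ceil(framed / CELL_PAYLOAD) * CELL_PAYLOAD
    payload_fraction = message_bytes / padded
    cell_fraction = CELL_PAYLOAD / CELL_SIZE
    sonet_fraction = 26 / 27
    return payload_fraction * cell_fraction * sonet_fraction * line_rate_mbps


for size in (128, 256, 1024, 65536):
    print(size,
          round(theoretical_tcp_mbps(size)),
          round(theoretical_tcp_mbps(size, 622)))
```

Working from the padded length avoids having to look up P separately for each message size.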
For message sizes smaller than 4 Kbytes, the TCP/IP and AAL5 protocol overhead forms a major part of each transfer, and the theoretical throughput drops sharply. For typical message sizes, the theoretical TCP throughput is as follows:
| Message size | Theoretical TCP throughput (155 adapter) | Theoretical TCP throughput (622 adapter) |
| 128 Bytes | 90 Mbps | 362 Mbps |
| 256 Bytes | 103 Mbps | 413 Mbps |
| 1024 Bytes | 126 Mbps | 503 Mbps |
| 4096 Bytes | 133 Mbps | 535 Mbps |
| 64 Kbytes | 135 Mbps | 542 Mbps |
Note that the above calculation assumes the Classical IP interface. LAN Emulation adds an additional 8 bytes of header, so the theoretical throughput decreases accordingly.