Details
- Bug
- Resolution: Fixed
- Major
- None
- Lustre 2.4.1
- None
- RHEL6.4/distro OFED/1.8.9 clients/2.4.1 servers
- 3
- 11939
Description
We have 2 separate 1.8.9 clients with processes hung in the D state, and the clients loop endlessly from FULL to DISCONN and then reestablish connectivity. One appears to be looping on a bulk IO write to a stale file handle (10.36.202.142@o2ib), and the other on a bulk IO read following a kiblnd failure (10.36.202.138@o2ib). The timeouts affect filesystem availability, but other activities proceed in between these disconnections.
Just yesterday we identified 2 bad IB cables with high symbol error rates in our fabric; they have since been disconnected. They were likely the cause of at least one of the issues.
server logs relevant to 10.36.202.138@o2ib issue:
Dec 8 12:11:52 atlas-oss3b4 kernel: [1032387.863173] Lustre: atlas2-OST009b: Client 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib) refused reconnection, still busy with 1 active RPCs
Dec 8 12:11:52 atlas-oss3b4 kernel: [1032387.919690] LustreError: 11590:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff880761cd2800 x1453425689933409/t0(0) o3->3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2@10.36.202.138@o2ib:0/0 lens 448/432 e 0 to 0 dl 1386523292 ref 1 fl Interpret:/2/0 rc 0/0
Dec 8 12:11:52 atlas-oss3b4 kernel: [1032388.030434] Lustre: atlas2-OST009b: Bulk IO read error with 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib), client will retry: rc -110
client log on 10.36.202.138@o2ib:
Dec 8 12:11:51 dtn04.ccs.ornl.gov kernel: LustreError: 24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689933409 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 19s ago has failed due to network error (567s prior to deadline).
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: atlas2-OST009b-osc-ffff880c392f9400: Connection to service atlas2-OST009b via nid 10.36.225.185@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: Skipped 7 previous similar messages
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: LustreError: 11-0: an error occurred while communicating with 10.36.225.185@o2ib. The ost_connect operation failed with -16
lctl dk output:
00000100:00080000:9:1386522391.428708:0:24622:0:(client.c:1392:ptlrpc_check_set()) resend bulk old x1453425689698153 new x1453425689814678
00000100:02000400:4:1386522391.428708:0:24623:0:(import.c:1016:ptlrpc_connect_interpret()) Server atlas2-OST009b_UUID version (2.4.1.0) is much newer than client version (1.8.9)
00000800:00000100:8:1386522411.140346:0:1644:0:(o2iblnd_cb.c:1813:kiblnd_close_conn_locked()) Closing conn to 10.36.225.185@o2ib: error 0(waiting)
00000100:00020000:6:1386522411.140744:0:24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000
00000100:00000400:3:1386522411.151303:0:24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689814678 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 20s ago has failed due to network error (567s prior to deadline).
server logs relevant to 10.36.202.142@o2ib issue:
Dec 8 09:27:15 atlas-oss4h1 kernel: [1022525.123761] LustreError: 113676:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880ff253d000 x1453430541439222/t0(0) o4->10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386513410 ref 1 fl Interpret:/2/0 rc 0/0
Dec 8 09:27:15 atlas-oss4h1 kernel: [1022525.123900] Lustre: atlas2-OST03e0: Bulk IO write error with 10c57078-ce10-72c6-b97d-e7c9a32a240c (at 10.36.202.142@o2ib), client will retry: rc -110
client logs from 10.36.202.142@o2ib:
Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection to service atlas2-OST03e0 via nid 10.36.226.48@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 39 previous similar messages
Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection restored to service atlas2-OST03e0 using nid 10.36.226.48@o2ib.
Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 41 previous similar messages
The lctl dk output is less revealing for this case, but I will attach it. What flags are desirable for getting more relevant information?
From the OS, if I try to kill the process or stat the inode, I see:
00000080:00020000:3:1386519720.291103:0:16284:0:(file.c:3348:ll_inode_revalidate_fini()) failure -116 inode 144117425485470173
[root@dtn-sch01 ~]# fuser -k -m /lustre/atlas2
Cannot stat file /proc/975/fd/25: Stale file handle
[root@dtn-sch01 ~]# ls -l /proc/975/fd/25
l-wx------ 1 cfuson ccsstaff 64 Dec 7 22:46 /proc/975/fd/25 -> /lustre/atlas1/stf007/scratch/cfuson/TestDir/SubDir2/13G-3.tar
[root@dtn-sch01 ~]#
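For reference, the negative return codes in the logs above are standard Linux errno values: -16 on the refused ost_connect, -103 in client_bulk_callback(), -110 on the bulk IO errors, and -116 from ll_inode_revalidate_fini(). A minimal lookup sketch, covering only the four codes that appear in this ticket (the function name errno_name is illustrative, not part of any Lustre tool):

```shell
#!/bin/sh
# Map the negative return codes seen in the logs to Linux errno names.
# Only the four codes that appear in this ticket are listed.
errno_name() {
    code=${1#-}            # strip the leading minus sign, if any
    case "$code" in
        16)  echo "EBUSY (device or resource busy)" ;;
        103) echo "ECONNABORTED (software caused connection abort)" ;;
        110) echo "ETIMEDOUT (connection timed out)" ;;
        116) echo "ESTALE (stale file handle)" ;;
        *)   echo "unknown (not in this table)" ;;
    esac
}

# Print a small table for the codes in the logs.
for rc in -16 -103 -110 -116; do
    printf '%5s  %s\n' "$rc" "$(errno_name "$rc")"
done
```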
Issue Links
- is related to LU-793 Reconnections should not be refused when there is a request in progress from this client. (Resolved)
Could you run lnet-selftest to check the network between the problematic OSTs and the 1.8 clients?
I just posted the instructions on how to use lnet-selftest here, and the wrapper script is attached:
= Preparation =
The LNET Selftest kernel module must be installed and loaded on all targets in the test before the application is started. Identify the set of all systems that will participate in a session and ensure that the kernel module has been loaded. To load the kernel module:
modprobe lnet_selftest
Dependencies are automatically resolved and loaded by modprobe. This will make sure all the necessary modules are loaded: libcfs, lnet, lnet_selftest and any one of the klnds (kernel Lustre network drivers, e.g. ksocklnd, ko2iblnd, etc.).
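To confirm that the modules named above are actually loaded on every participating node, something like the following could be used. This is a sketch only: the helper name missing_modules is made up for illustration, and ko2iblnd is assumed as the LND because the NIDs in this ticket are on o2ib.

```shell
#!/bin/sh
# Report which of the expected modules are missing from an lsmod-style listing.
# The listing is read from stdin so that it can be fed directly from `lsmod`.
missing_modules() {
    listing=$(cat)
    for mod in "$@"; do
        # lsmod prints the module name in the first column; match it exactly
        # so that "lnet" is not satisfied by "lnet_selftest".
        if ! printf '%s\n' "$listing" |
            awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            echo "$mod"
        fi
    done
}

# Usage on each node (prints nothing when all modules are present):
#   lsmod | missing_modules libcfs lnet lnet_selftest ko2iblnd
```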
Identify a "console" node from which to conduct the tests. This is the single system from which all LNET selftest commands will be executed. The console node owns the LNET selftest session and there should be only one active session on the network at any given time (strictly speaking one can run several LNET selftest sessions in parallel across a network but this is generally discouraged unless the sessions are carefully isolated).
It is strongly recommended that a survey and analysis of raw network performance between the target systems is carried out prior to running the LNET selftest benchmark. This will help to identify and measure any performance overhead introduced by LNET. The HPDD SE team has recently been evaluating Netperf for this purpose on TCP/IP-based networks with good results. Refer to the HPDD SE Netperf page for details on how to manage this exercise.
= Using the Wrapper Script =
Use the LNET Selftest wrapper to execute the test cases referenced in this document. The header of the script has some variables that need to be set in accordance with the target environment. Without changes, the script is very unlikely to operate correctly, if at all. Here is a listing of the header:
Single Client Throughput – LNET Selftest Read (2 Nodes, 1:1)
Used to establish point to point unidirectional read performance between two nodes.
Set the wrapper up as follows:
CN: the concurrency setting simulates the number of threads performing communication. The LNET Selftest default is 1, which is not enough to properly exercise the connection. Set to at least 16, but experiment with higher values (32 or 64 being reasonable choices).
SZ: the size setting determines the size of the IO transaction. For bandwidth (throughput) measurements, use 1M.
TM: the test time in seconds, i.e. how long to run the benchmark. Set to a reasonable number in order to ensure collection of sufficient data to extrapolate a meaningful average (at least 60 seconds).
BRW: the Bulk Read/Write test to use. There are only two choices: "read" or "write".
CKSUM: The checksum checking method. Choose either "simple" or "full".
LFROM: a space-separated list of NIDs that represent the "from" list (or source) in LNET Selftest. This is often a set of clients.
LTO: a space-separated list of NIDs that represent the "to" list (or destination) in LNET Selftest. This is often a set of servers.
Change the LFROM and LTO lists as required.
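The attached wrapper is not reproduced here. As a rough guide to what these settings control, the sketch below shows how they could map onto the standard lst commands. It is not the attached script: the defaults, the group names (lfrom, lto), and the session and batch names are assumptions, and the NIDs are simply the client and OSS NIDs from the first case in this ticket. The function only prints the commands, so they can be reviewed before being piped to sh on the console node.

```shell
#!/bin/sh
# Hypothetical defaults; variable names follow the wrapper settings described above.
CN=${CN:-16}                          # concurrency
SZ=${SZ:-1M}                          # I/O size
TM=${TM:-60}                          # run time in seconds
BRW=${BRW:-read}                      # "read" or "write"
CKSUM=${CKSUM:-simple}                # "simple" or "full"
LFROM=${LFROM:-10.36.202.138@o2ib}    # "from" NIDs (client NID from this ticket)
LTO=${LTO:-10.36.225.185@o2ib}        # "to" NIDs (OSS NID from this ticket)

# Print the lst command sequence for one run.
# To execute: print_lst_commands | sh   (on the console node)
print_lst_commands() {
    echo 'export LST_SESSION=$$'
    echo "lst new_session lst_${BRW}"
    echo "lst add_group lfrom ${LFROM}"
    echo "lst add_group lto ${LTO}"
    echo "lst add_batch bulk_${BRW}"
    echo "lst add_test --batch bulk_${BRW} --concurrency ${CN} --from lfrom --to lto brw ${BRW} check=${CKSUM} size=${SZ}"
    echo "lst run bulk_${BRW}"
    # Collect statistics for TM seconds, then stop the stat process.
    echo "lst stat lfrom lto & sleep ${TM}; kill \$!"
    echo "lst end_session"
}

print_lst_commands
```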
Run the script several times, changing the concurrency setting at the start of every new run. Use the sequence 1, 2, 4, 8, 16, 32, 64, 128. Modify the output filename for each run so that it is clear what results have been captured into each file.
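The concurrency sweep can be automated along the following lines. This is a sketch under one assumption: the wrapper has been changed to take CN from the environment (for example with CN=${CN:-16} in its header) rather than from a hard-coded value. The function name and the wrapper path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Run a command once per concurrency value, writing each run to its own file.
# $1 is an output filename prefix; the remaining arguments are the command to run.
sweep_concurrency() {
    prefix=$1
    shift
    for cn in 1 2 4 8 16 32 64 128; do
        # CN is passed through the environment; the output file name records
        # the concurrency used for that run.
        CN=$cn "$@" > "${prefix}_cn${cn}.out" 2>&1
    done
}

# Usage (wrapper path is hypothetical):
#   sweep_concurrency lst_read ./lst-wrapper.sh
```

Each run then leaves a file such as lst_read_cn16.out, so the results for the different settings stay separate.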