Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.4.1
- Labels: None
- Environment: RHEL6.4/distro OFED/1.8.9 clients/2.4.1 servers
- Severity: 3
- Rank: 11939
Description
We have two separate 1.8.9 clients with processes hung in the D state; the clients loop endlessly from FULL to DISCONN and then re-establish connectivity. One appears to be looping on a bulk IO write to a stale file handle (10.36.202.142@o2ib), and the other on a bulk IO read that fails with a kiblnd error (10.36.202.138@o2ib). The timeouts affect filesystem availability, but other activity proceeds between these disconnections.
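For anyone reproducing the observation, a minimal sketch of how the looping can be watched from an affected client (the osc "state" parameter assumes the import state history proc file is present on this 1.8 build):
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'   # list tasks stuck in uninterruptible sleep
lctl get_param osc.atlas2-OST009b-osc-*.state    # import state history (FULL/DISCONN cycling)
lctl dk /tmp/lustre-dk.log                       # dump the kernel debug buffer for that window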
Just yesterday we identified two bad IB cables in our fabric with high symbol error rates; they have since been disconnected and were likely the cause of at least one of the issues.
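A sweep of the fabric error counters is the kind of check that surfaces these; a sketch, assuming the standard infiniband-diags tools from the OFED install (<lid> and <port> are placeholders for a suspect port):
ibqueryerrors            # report ports whose error counters exceed thresholds
perfquery <lid> <port>   # inspect SymbolErrorCounter on one suspect port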
Server logs relevant to the 10.36.202.138@o2ib issue:
Dec 8 12:11:52 atlas-oss3b4 kernel: [1032387.863173] Lustre: atlas2-OST009b: Client 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib) refused reconnection, still busy with 1 active RPCs
Dec 8 12:11:52 atlas-oss3b4 kernel: [1032387.919690] LustreError: 11590:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff880761cd2800 x1453425689933409/t0(0) o3->3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2@10.36.202.138@o2ib:0/0 lens 448/432 e 0 to 0 dl 1386523292 ref 1 fl Interpret:/2/0 rc 0/0
Dec 8 12:11:52 atlas-oss3b4 kernel: [1032388.030434] Lustre: atlas2-OST009b: Bulk IO read error with 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib), client will retry: rc -110
Client log from 10.36.202.138@o2ib:
Dec 8 12:11:51 dtn04.ccs.ornl.gov kernel: LustreError: 24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689933409 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 19s ago has failed due to network error (567s prior to deadline).
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: atlas2-OST009b-osc-ffff880c392f9400: Connection to service atlas2-OST009b via nid 10.36.225.185@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: Skipped 7 previous similar messages
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: LustreError: 11-0: an error occurred while communicating with 10.36.225.185@o2ib. The ost_connect operation failed with -16
lctl dk output:
00000100:00080000:9:1386522391.428708:0:24622:0:(client.c:1392:ptlrpc_check_set()) resend bulk old x1453425689698153 new x1453425689814678
00000100:02000400:4:1386522391.428708:0:24623:0:(import.c:1016:ptlrpc_connect_interpret()) Server atlas2-OST009b_UUID version (2.4.1.0) is much newer than client version (1.8.9)
00000800:00000100:8:1386522411.140346:0:1644:0:(o2iblnd_cb.c:1813:kiblnd_close_conn_locked()) Closing conn to 10.36.225.185@o2ib: error 0(waiting)
00000100:00020000:6:1386522411.140744:0:24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000
00000100:00000400:3:1386522411.151303:0:24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689814678 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 20s ago has failed due to network error (567s prior to deadline).
Server logs relevant to the 10.36.202.142@o2ib issue:
Dec 8 09:27:15 atlas-oss4h1 kernel: [1022525.123761] LustreError: 113676:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880ff253d000 x1453430541439222/t0(0) o4->10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386513410 ref 1 fl Interpret:/2/0 rc 0/0
Dec 8 09:27:15 atlas-oss4h1 kernel: [1022525.123900] Lustre: atlas2-OST03e0: Bulk IO write error with 10c57078-ce10-72c6-b97d-e7c9a32a240c (at 10.36.202.142@o2ib), client will retry: rc -110
Client logs from 10.36.202.142@o2ib:
Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection to service atlas2-OST03e0 via nid 10.36.226.48@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 39 previous similar messages
Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection restored to service atlas2-OST03e0 using nid 10.36.226.48@o2ib.
Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 41 previous similar messages
The lctl dk output is less revealing for this case, but I will attach it. What flags are desirable for getting more relevant information?
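A sketch of the kind of capture I can redo once the right flags are confirmed (the mask below is only a guess on my part):
lctl set_param debug="+net +rpctrace +dlmtrace"   # add network/RPC/LDLM tracing to the debug mask
lctl set_param debug_mb=256                       # enlarge the debug buffer so the window is not lost
lctl clear                                        # drop the old buffer contents
# ... reproduce the hang ...
lctl dk /tmp/lustre-dk.$(hostname).log            # dump the fuller trace for attachment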
From the OS, if I try to kill the process or stat the inode, I see:
00000080:00020000:3:1386519720.291103:0:16284:0:(file.c:3348:ll_inode_revalidate_fini()) failure -116 inode 144117425485470173
[root@dtn-sch01 ~]# fuser -k -m /lustre/atlas2
Cannot stat file /proc/975/fd/25: Stale file handle
[root@dtn-sch01 ~]# ls -l /proc/975/fd/25
l-wx------ 1 cfuson ccsstaff 64 Dec 7 22:46 /proc/975/fd/25 -> /lustre/atlas1/stf007/scratch/cfuson/TestDir/SubDir2/13G-3.tar
[root@dtn-sch01 ~]#
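For reference, the negative codes in these logs are ordinary Linux errnos (-116 = ESTALE, -110 = ETIMEDOUT, -103 = ECONNABORTED, -16 = EBUSY); they can be decoded on the client, assuming the stock Python 2 on these RHEL6 nodes:
python -c 'import errno, os; print errno.errorcode[116], os.strerror(116)'   # -116 = ESTALE, matching the stale handle above
python -c 'import errno, os; print errno.errorcode[110], os.strerror(110)'   # -110 = ETIMEDOUT from the bulk IO retries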
Attachments
Issue Links
- is related to LU-793: Reconnections should not be refused when there is a request in progress from this client. (Resolved)
The failure was present with 1.8.8 as well. Below are the logs from starting the Lustre 1.8.8 client, mounting the filesystems, and then running file * in the directory that has been causing the issue. The symptoms were the same: the application hung. Recognizing the problem, I went to unmount so I could try the next case (this has never worked in the past; before LU-3067, we had to reboot the client). As usual the unmount failed, but the logs contain some different messages with 1.8.8. Does that tell us anything new?
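(For reference, the reproduction amounts to roughly the following sequence; the mount source and directory path are placeholders rather than the exact ones used.)
mount -t lustre <mgsnid>@o2ib:/<fsname> /lustre/<fsname>     # mount the filesystem(s)
cd /lustre/<fsname>/<problem-directory> && file *            # file * hangs as described
umount /lustre/<fsname>                                      # the unmount attempt that fails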
Jan 8 08:33:17 client kernel: Lustre: Listener bound to ib0:10.36.202.142:987:mlx4_0
Jan 8 08:33:17 client kernel: Lustre: Added LNI 10.36.202.142@o2ib [63/2560/0/180]
Jan 8 08:33:17 client kernel: Lustre: Build Version: lustre/scripts--PRISTINE-2.6.32-358.23.2.el6.x86_64
Jan 8 08:33:18 client kernel: Lustre: Lustre Client File System; http://www.lustre.org/
Jan 8 08:33:18 client kernel: Lustre: MGC10.36.226.77@o2ib: Reactivating import
Jan 8 08:33:18 client kernel: Lustre: Server MGS version (2.4.1.0) is much newer than client version (1.8.8)
Jan 8 08:33:18 client kernel: Lustre: 8437:0:(obd_config.c:1127:class_config_llog_handler()) skipping 'lmv' config: cmd=cf001,clilmv:lmv
Jan 8 08:33:18 client kernel: Lustre: 8437:0:(obd_config.c:1127:class_config_llog_handler()) skipping 'lmv' config: cmd=cf014,clilmv:linkfarm-MDT0000_UUID
Jan 8 08:33:18 client kernel: Lustre: Server linkfarm-MDT0000_UUID version (2.4.1.0) is much newer than client version (1.8.8)
Jan 8 08:33:18 client kernel: Lustre: client supports 64-bits dir hash/offset!
Jan 8 08:33:18 client kernel: Lustre: Client linkfarm-client(ffff8803f4fd1000) mount complete
Jan 8 08:33:19 client kernel: Lustre: Server atlas1-MDT0000_UUID version (2.4.1.0) is much newer than client version (1.8.8)
Jan 8 08:33:19 client kernel: Lustre: client supports 64-bits dir hash/offset!
Jan 8 08:33:19 client kernel: Lustre: Client atlas1-client(ffff8803f4fd7400) mount complete
Jan 8 08:33:19 client kernel: Lustre: 8445:0:(obd_config.c:1127:class_config_llog_handler()) skipping 'lmv' config: cmd=cf001,clilmv:lmv
Jan 8 08:33:21 client kernel: Lustre: Server atlas2-MDT0000_UUID version (2.4.1.0) is much newer than client version (1.8.8)
Jan 8 08:33:21 client kernel: Lustre: client supports 64-bits dir hash/offset!
Jan 8 08:33:21 client kernel: Lustre: Client atlas2-client(ffff880c397d6400) mount complete
Jan 8 08:33:22 client kernel: Lustre: MGC10.36.227.200@o2ib: Reactivating import
Jan 8 08:33:22 client kernel: Lustre: Client widow1-client(ffff880fd01dec00) mount complete
Jan 8 08:35:14 client kernel: LustreError: 8410:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff88080e1c4000
Jan 8 08:35:14 client kernel: Lustre: setting import linkfarm-MDT0000_UUID INACTIVE by administrator request
Jan 8 08:35:14 client kernel: LustreError: 9877:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff88039adf6cd0) ns: linkfarm-MDT0000-mdc-ffff8803f4fd1000 lock: ffff88082165bc00/0xf06137b3b58e545f lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 2 type: IBT flags: 0x2002c90 remote: 0xdca99b6a04cb5ba3 expref: -99 pid: 8466 timeout: 0
Jan 8 08:35:14 client kernel: Lustre: setting import linkfarm-OST0000_UUID INACTIVE by administrator request
Jan 8 08:35:14 client kernel: Lustre: client linkfarm-client(ffff8803f4fd1000) umount complete
Jan 8 08:35:14 client kernel: LustreError: 9879:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff8804249bad10) ns: atlas1-MDT0000-mdc-ffff8803f4fd7400 lock: ffff88082165ba00/0xf06137b3b58e5466 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 2 type: IBT flags: 0x2002c90 remote: 0xdff80ba35f064331 expref: -99 pid: 8466 timeout: 0
Jan 8 08:35:16 client kernel: Lustre: client atlas1-client(ffff8803f4fd7400) umount complete
Jan 8 08:35:16 client kernel: Lustre: setting import atlas2-MDT0000_UUID INACTIVE by administrator request
Jan 8 08:35:16 client kernel: LustreError: 9881:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff880c39246cd0) ns: atlas2-MDT0000-mdc-ffff880c397d6400 lock: ffff88082165b800/0xf06137b3b58e546d lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 2 type: IBT flags: 0x2002c90 remote: 0x7f4ba8a4cefab31f expref: -99 pid: 8466 timeout: 0
Jan 8 08:35:16 client kernel: LustreError: 2866:0:(mdc_locks.c:653:mdc_enqueue()) ldlm_cli_enqueue error: -108
Jan 8 08:35:16 client kernel: LustreError: 2866:0:(file.c:3331:ll_inode_revalidate_fini()) failure -108 inode 144115188193296385
Jan 8 08:35:17 client kernel: Lustre: client widow-client(ffff88081da07c00) umount complete
Jan 8 08:35:17 client kernel: Lustre: setting import widow2-MDT0000_UUID INACTIVE by administrator request
Jan 8 08:35:18 client kernel: Lustre: client widow2-client(ffff88038e15b400) umount complete
Jan 8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Jan 8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Jan 8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Jan 8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Jan 8 08:35:18 client kernel: LustreError: 10318:0:(mdc_locks.c:653:mdc_enqueue()) ldlm_cli_enqueue error: -108
Jan 8 08:35:18 client kernel: LustreError: 10318:0:(file.c:3331:ll_inode_revalidate_fini()) failure -108 inode 144115188193296385
Jan 8 08:35:32 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
Jan 8 08:35:48 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
Jan 8 08:36:04 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
Jan 8 08:36:19 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
Jan 8 08:36:35 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
Jan 8 08:36:50 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
Jan 8 08:37:06 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
Jan 8 08:37:37 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
Jan 8 08:38:24 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
Jan 8 08:39:42 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
Jan 8 08:42:02 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108