Details
-
Bug
-
Resolution: Incomplete
-
Minor
-
None
-
Lustre 2.1.3
-
None
-
Client: kernel: sles11sp1 2.6.32.54-0.3.1.20120223-nasa
lustre-client-2.1.3-1nasC_2.6.32.54_0.3.1.20120223_nasa
Server kernel: centos 6.2 2.6.32-220.4.1.el6.20120607.x86_64.lustre212
lustre-2.1.2-2nasS_ofed154_2.6.32_220.4.1.el6.20120607.x86_64.lustre212.x86_64
https://github.com/jlan/lustre-nas/tree/nas-2.1.3/
-
3
-
5134
Description
A front-end node, pfe3, hung on df. Nagios reported nbp1 as unmounted.
The Lustre errors logged in /var/log/messages before the hang are shown below. We had to reboot pfe3 to get nbp1 mounted again.
...
Oct 7 15:08:25 pfe3 kernel: [490266.072028] Lustre: 7366:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1349646649/real 1349646649] req@ffff880024adb800 x1414695477195603/t0(0) o6->nbp1-OST003c-osc-ffff880073e9d400@10.151.26.31@o2ib:6/4 lens 512/400 e 2 to 1 dl 1349647705 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Oct 7 15:08:25 pfe3 kernel: [490266.158936] Lustre: 7366:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
Oct 7 15:08:25 pfe3 kernel: [490266.188374] Lustre: nbp1-OST003c-osc-ffff880073e9d400: Connection to nbp1-OST003c (at 10.151.26.31@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 7 15:08:25 pfe3 kernel: [490266.252863] LustreError: 11-0: an error occurred while communicating with 10.151.26.31@o2ib. The ost_connect operation failed with -16
Oct 7 15:08:25 pfe3 kernel: [490266.289310] LustreError: Skipped 20 previous similar messages
Oct 7 15:10:05 pfe3 kernel: [490366.252622] LustreError: 11-0: an error occurred while communicating with 10.151.26.31@o2ib. The ost_connect operation failed with -16
Oct 7 15:10:05 pfe3 kernel: [490366.289087] LustreError: Skipped 3 previous similar messages
Oct 7 15:10:39 pfe3 ntpd[5013]: kernel time sync status change 2001
Oct 7 15:11:36 pfe3 envmodule: bkup load nas
Oct 7 15:11:37 pfe3 envmodule: bkup load nas
Oct 7 15:12:15 pfe3 kernel: [490496.137877] Lustre: nbp1-OST005f-osc-ffff880073e9d400: Connection to nbp1-OST005f (at 10.151.26.34@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 7 15:12:15 pfe3 kernel: [490496.236273] LustreError: 167-0: This client was evicted by nbp1-OST005f; in progress operations using this service will fail.
Oct 7 15:12:15 pfe3 kernel: [490496.270510] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@ffff88005932a358[2 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x1005f0000:0x2fa3e62:0x0]
lock@ffff88005932a358
Oct 7 15:12:16 pfe3 kernel: [490496.389231] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
Oct 7 15:12:16 pfe3 kernel: [490496.389268] LustreError: 9848:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace nbp1-OST005f-osc-ffff880073e9d400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 7 15:12:16 pfe3 kernel: [490496.389271] LustreError: 9848:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff8800641f5c00 (49954402/0/0/0) (rc: 1)
Oct 7 15:12:16 pfe3 kernel: [490496.389299] Lustre: nbp1-OST005f-osc-ffff880073e9d400: Connection restored to nbp1-OST005f (at 10.151.26.34@o2ib)
Oct 7 15:13:00 pfe3 kernel: [490541.188284] LustreError: 11-0: an error occurred while communicating with 10.151.26.31@o2ib. The ost_connect operation failed with -16
Oct 7 15:13:00 pfe3 kernel: [490541.224748] LustreError: Skipped 7 previous similar messages
...
The ldlm_resource_complain() messages seem to carry the same signature as ORI-735, but here they occur on 2.1.3 on our production systems.
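For anyone triaging a similar log, the return codes in the excerpt can be tallied with a short script. This is only a rough sketch, not part of the original report: the function name `summarize_return_codes` and the regular expression are illustrative assumptions that match just the two phrasings seen above ("operation failed with" and "dlmlock returned"). On Linux, -16 corresponds to EBUSY and -5 to EIO. Because the kernel also prints "Skipped N previous similar messages", the counts are a lower bound.

```python
import errno
import re
from collections import Counter

# Hypothetical pattern: covers only the two return-code phrasings
# that appear in the log excerpt in this ticket.
RC_PATTERN = re.compile(r"(?:operation failed with|dlmlock returned) (-\d+)")


def summarize_return_codes(lines):
    """Count negative return codes in log lines, keyed by errno name."""
    counts = Counter()
    for line in lines:
        match = RC_PATTERN.search(line)
        if match:
            code = -int(match.group(1))
            # Fall back to the raw number if the platform has no name for it.
            counts[errno.errorcode.get(code, str(code))] += 1
    return counts


# Message texts copied from the excerpt (timestamps trimmed for brevity).
sample = [
    "LustreError: 11-0: an error occurred while communicating with "
    "10.151.26.31@o2ib. The ost_connect operation failed with -16",
    "LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) "
    "dlmlock returned -5",
    "LustreError: Skipped 20 previous similar messages",
]
print(dict(summarize_return_codes(sample)))
```

To run it against a real log, pass an open file object instead of `sample`, for example `summarize_return_codes(open("/var/log/messages"))`.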