[LU-2132] 2.1.3 client hangs on 'df' Created: 09/Oct/12  Updated: 15/Mar/14  Resolved: 15/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jay Lan (Inactive) Assignee: Zhenyu Xu
Resolution: Incomplete Votes: 0
Labels: None
Environment:

Client: kernel: sles11sp1 2.6.32.54-0.3.1.20120223-nasa
lustre-client-2.1.3-1nasC_2.6.32.54_0.3.1.20120223_nasa
Server kernel: centos 6.2 2.6.32-220.4.1.el6.20120607.x86_64.lustre212
lustre-2.1.2-2nasS_ofed154_2.6.32_220.4.1.el6.20120607.x86_64.lustre212.x86_64

https://github.com/jlan/lustre-nas/tree/nas-2.1.3/


Attachments: File LU-2132.OST005f.tgz     File console.pfe3    
Severity: 3
Rank (Obsolete): 5134

 Description   

A front-end node, pfe3, hangs on df. Nagios reports nbp1 as unmounted.

The Lustre errors logged in /var/log/messages before the hang are shown below. We had to reboot pfe3 to get nbp1 mounted again.

...Oct 7 15:08:25 pfe3 kernel: [490266.072028] Lustre: 7366:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1349646649/real 1349646649] req@ffff880024adb800 x1414695477195603/t0(0) o6->nbp1-OST003c-osc-ffff880073e9d400@10.151.26.31@o2ib:6/4 lens 512/400 e 2 to 1 dl 1349647705 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Oct 7 15:08:25 pfe3 kernel: [490266.158936] Lustre: 7366:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
Oct 7 15:08:25 pfe3 kernel: [490266.188374] Lustre: nbp1-OST003c-osc-ffff880073e9d400: Connection to nbp1-OST003c (at 10.151.26.31@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 7 15:08:25 pfe3 kernel: [490266.252863] LustreError: 11-0: an error occurred while communicating with 10.151.26.31@o2ib. The ost_connect operation failed with -16
Oct 7 15:08:25 pfe3 kernel: [490266.289310] LustreError: Skipped 20 previous similar messages
Oct 7 15:10:05 pfe3 kernel: [490366.252622] LustreError: 11-0: an error occurred while communicating with 10.151.26.31@o2ib. The ost_connect operation failed with -16
Oct 7 15:10:05 pfe3 kernel: [490366.289087] LustreError: Skipped 3 previous similar messages
Oct 7 15:10:39 pfe3 ntpd[5013]: kernel time sync status change 2001
Oct 7 15:11:36 pfe3 envmodule: bkup load nas
Oct 7 15:11:37 pfe3 envmodule: bkup load nas
Oct 7 15:12:15 pfe3 kernel: [490496.137877] Lustre: nbp1-OST005f-osc-ffff880073e9d400: Connection to nbp1-OST005f (at 10.151.26.34@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 7 15:12:15 pfe3 kernel: [490496.236273] LustreError: 167-0: This client was evicted by nbp1-OST005f; in progress operations using this service will fail.
Oct 7 15:12:15 pfe3 kernel: [490496.270510] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@ffff88005932a358[2 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x1005f0000:0x2fa3e62:0x0] {
Oct 7 15:12:15 pfe3 kernel: [490496.318170] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lovsub@ffff88001513c820: [0 ffff880070c99c28 R(1):[2304, 18446744073709551615]@[0x44855761059:0x128f3:0x0]] [9 ffff880069152e98 P(0):[0, 18446744073709551615]@[0x44855761059:0x128f3:0x0]]
Oct 7 15:12:16 pfe3 kernel: [490496.389224] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff880070f789e0: ffff880052ef3480 40160002 0x3bf5f5ecface519f 3 ffff880029da22b8 size: 859986 mtime: 1349647704 atime: 1349647702 ctime: 1349647704 blocks: 1688
Oct 7 15:12:16 pfe3 kernel: [490496.389229] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff88005932a358
Oct 7 15:12:16 pfe3 kernel: [490496.389231] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
Oct 7 15:12:16 pfe3 kernel: [490496.389268] LustreError: 9848:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace nbp1-OST005f-osc-ffff880073e9d400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 7 15:12:16 pfe3 kernel: [490496.389271] LustreError: 9848:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff8800641f5c00 (49954402/0/0/0) (rc: 1)
Oct 7 15:12:16 pfe3 kernel: [490496.389299] Lustre: nbp1-OST005f-osc-ffff880073e9d400: Connection restored to nbp1-OST005f (at 10.151.26.34@o2ib)
Oct 7 15:13:00 pfe3 kernel: [490541.188284] LustreError: 11-0: an error occurred while communicating with 10.151.26.31@o2ib. The ost_connect operation failed with -16
Oct 7 15:13:00 pfe3 kernel: [490541.224748] LustreError: Skipped 7 previous similar messages
...
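
Annotation (not part of the original report): the first message in the excerpt carries epoch timestamps for when the request was sent and, presumably, its deadline (the "dl" field; reading it as the deadline is an assumption). A minimal sketch for turning those into a wait time follows. The helper name request_wait and the regular expression are hypothetical, written only against the line format shown above.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern for the "[sent N/real N] ... dl N" fields
# as they appear in the ptlrpc_expire_one_request() line above.
REQ_PATTERN = re.compile(r"\[sent (\d+)/real (\d+)\].* dl (\d+) ")

def request_wait(line):
    """Return (sent_utc, deadline_utc, seconds_between) or None if no match."""
    match = REQ_PATTERN.search(line)
    if match is None:
        return None
    sent, _real, deadline = (int(g) for g in match.groups())
    to_utc = lambda ts: datetime.fromtimestamp(ts, tz=timezone.utc)
    return to_utc(sent), to_utc(deadline), deadline - sent

line = (
    "Lustre: 7366:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent "
    "has timed out for slow reply: [sent 1349646649/real 1349646649] "
    "req@ffff880024adb800 x1414695477195603/t0(0) "
    "o6->nbp1-OST003c-osc-ffff880073e9d400@10.151.26.31@o2ib:6/4 "
    "lens 512/400 e 2 to 1 dl 1349647705 ref 1 fl Rpc:X/0/ffffffff rc 0/-1"
)
sent_utc, deadline_utc, waited = request_wait(line)
print(sent_utc.isoformat(), deadline_utc.isoformat(), waited)
```

For this line the gap between the two timestamps is 1056 seconds, i.e. the request had been outstanding for roughly 17 minutes when it was expired.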

The ldlm_resource_complain() messages seem to carry the same signature as ORI-735, but here they occur with 2.1.3 on our production systems.
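
Annotation (not part of the original report): the negative return codes in the excerpt are Linux errno values, so -16 is EBUSY (the repeated ost_connect failures) and -5 is EIO (the dlmlock result after the eviction). The following is a rough sketch for tallying such codes from /var/log/messages-style lines. The function name summarize_lustre_errors and the regular expression are assumptions made for illustration and cover only the two phrasings seen above.

```python
import errno
import re
from collections import Counter

# Matches the two phrasings seen in the excerpt:
#   "... operation failed with -16" and "dlmlock returned -5"
RC_PATTERN = re.compile(r"(?:failed with|returned) (-\d+)")

def summarize_lustre_errors(lines):
    """Count negative return codes on LustreError lines, keyed by errno name."""
    counts = Counter()
    for line in lines:
        if "LustreError" not in line:
            continue
        match = RC_PATTERN.search(line)
        if match:
            code = -int(match.group(1))  # "-16" -> 16
            # Fall back to the bare number if the platform has no name for it.
            counts[errno.errorcode.get(code, str(code))] += 1
    return counts

sample = [
    "Oct 7 15:08:25 pfe3 kernel: [490266.252863] LustreError: 11-0: an error "
    "occurred while communicating with 10.151.26.31@o2ib. The ost_connect "
    "operation failed with -16",
    "Oct 7 15:12:16 pfe3 kernel: [490496.389231] LustreError: "
    "9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5",
]
print(summarize_lustre_errors(sample))
```

Note that syslog's "Skipped N previous similar messages" lines mean a raw count like this understates how often each error actually occurred.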



 Comments   
Comment by Peter Jones [ 09/Oct/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 10/Oct/12 ]

Can you please upload the logs from the affected client node (pfe3, I guess) and from the OSS node that hosts OST005f? Thank you.

Comment by Jay Lan (Inactive) [ 10/Oct/12 ]

Attached are the console log of pfe3 and LU-2132.OST005f.tgz, as requested.

Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ]

Hi Bobijam,
Is this ticket going anywhere, or did we reach a dead end?
Should I mark it as resolved?
Thanks,
~ jfc.

Comment by John Fuchs-Chesney (Inactive) [ 15/Mar/14 ]

Looks like we will not make any further progress on this issue.
