[LU-8255] LustreError: 38237:0:(file.c:3165:ll_inode_revalidate_fini()) nbp6: revalidate FID [0x20007200e:0x90d8:0x0] error: rc = -71 Created: 09/Jun/16  Updated: 29/Jun/17  Resolved: 29/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Mahmoud Hanafi
Assignee: nasf (Inactive)
Resolution: Incomplete
Votes: 0
Labels: None
Environment:

Client is running lustre 2.7.1
Server is running lustre 2.5.3


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

For one particular user's jobs, we are getting lots of these errors on the clients.

Jun  9 15:23:12 r221i4s0 kernel: [1465510992.386829] LustreError: 11-0: nbp6-MDT0000-mdc-ffff8803058bb000: operation ldlm_enqueue to node 10.151.26.79@o2ib failed: rc = -71
Jun  9 15:23:12 r221i4s0 kernel: [1465510992.398829] LustreError: Skipped 5 previous similar messages
Jun  9 15:23:12 r221i4s0 kernel: [1465510992.406829] LustreError: 74346:0:(file.c:3165:ll_inode_revalidate_fini()) nbp6: revalidate FID [0x200071fef:0x1fd19:0x0] error: rc = -71
Jun  9 15:23:12 r221i4s0 kernel: [1465510992.406829] LustreError: 74346:0:(file.c:3165:ll_inode_revalidate_fini()) Skipped 5 previous similar messages
Jun  9 15:23:13 r154i0n0 kernel: [1465510993.479567] LustreError: 11-0: nbp6-MDT0000-mdc-ffff880302239800: operation ldlm_enqueue to node 10.151.26.79@o2ib failed: rc = -71
Jun  9 15:23:13 r154i0n0 kernel: [1465510993.491567] LustreError: Skipped 2 previous similar messages
Jun  9 15:23:13 r154i0n0 kernel: [1465510993.495567] LustreError: 68877:0:(file.c:3165:ll_inode_revalidate_fini()) nbp6: revalidate FID [0x200072005:0x11ea5:0x0] error: rc = -71
Jun  9 15:23:13 r154i0n0 kernel: [1465510993.495567] LustreError: 68877:0:(file.c:3165:ll_inode_revalidate_fini()) Skipped 2 previous similar messages
Jun  9 15:23:16 r221i3n1 kernel: [1465510996.818948] LustreError: 11-0: nbp6-MDT0000-mdc-ffff880302157000: operation ldlm_enqueue to node 10.151.26.79@o2ib failed: rc = -71
Jun  9 15:23:16 r221i3n1 kernel: [1465510996.830948] LustreError: 68219:0:(file.c:3165:ll_inode_revalidate_fini()) nbp6: revalidate FID [0x200071ef9:0x1dfce:0x0] er

I will upload MDS side debug to ftp site.
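
To gauge how widespread the errors are, the client messages above can be tallied per host and return code. The following is a minimal sketch, assuming syslog lines in exactly the format quoted in this ticket. The function names and the small errno table are illustrative and not part of Lustre; the only errno mapping used is the Linux value 71 = EPROTO ("Protocol error").

```python
import re
from collections import Counter

# Assumption: Linux errno numbering, where 71 is EPROTO ("Protocol error").
LINUX_ERRNO_NAMES = {71: "EPROTO"}

# Matches the ll_inode_revalidate_fini() message format quoted in this ticket.
REVALIDATE_RE = re.compile(
    r"^\w+\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:.*"
    r"ll_inode_revalidate_fini\(\)\)\s+(?P<fs>\S+):\s+revalidate FID\s+"
    r"(?P<fid>\[[^\]]+\])\s+error:\s+rc\s*=\s*(?P<rc>-?\d+)"
)


def parse_revalidate_errors(lines):
    """Return (host, fid, rc) tuples for every matching log line.

    "Skipped N previous similar messages" lines and truncated lines
    do not match and are ignored.
    """
    records = []
    for line in lines:
        m = REVALIDATE_RE.match(line)
        if m:
            records.append((m.group("host"), m.group("fid"), int(m.group("rc"))))
    return records


def summarize(records):
    """Count errors per (host, errno name); unknown codes stay numeric."""
    counts = Counter((host, rc) for host, _fid, rc in records)
    return {
        (host, LINUX_ERRNO_NAMES.get(-rc, str(rc))): n
        for (host, rc), n in counts.items()
    }
```

Feeding the function the lines of a client's kernel log and passing the result to `summarize()` gives a per-host count. Because the kernel rate-limits these messages ("Skipped N previous similar messages"), the count is a lower bound.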



 Comments   
Comment by Mahmoud Hanafi [ 09/Jun/16 ]

uploaded logs to /uploads/LU-8255/mds.debugout.gz

Comment by Mahmoud Hanafi [ 09/Jun/16 ]

Looks like the user's job is creating and deleting lots of files and directories.

Comment by Peter Jones [ 10/Jun/16 ]

Fan Yong

Could you please advise?

Thanks

Peter

Comment by nasf (Inactive) [ 12/Jun/16 ]

The log is huge (1.2 GB), but it contains only Lustre-level debug messages like the following:

...
00010000:00010000:1.0:1465510946.072747:0:39779:0:(ldlm_lockd.c:1181:ldlm_handle_enqueue0()) ### server-side enqueue handler START
00010000:00010000:1.0:1465510946.072750:0:39779:0:(ldlm_lockd.c:1269:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-nbp6-MDT0000_UUID lock: ffff88031d104740/0xbbcac3f0ff4e9f58 lrc: 2/0,0 mode: --/CR res: [0x20007200f:0xbad7:0x0].0 bits 0x0 rrc: 1 type: IBT flags: 0x40000000000000 nid: local remote: 0x3afc5dc70a36e37f expref: -99 pid: 39779 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1465510946.072766:0:39779:0:(ldlm_lockd.c:1407:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=-71) ns: mdt-nbp6-MDT0000_UUID lock: ffff88031d104740/0xbbcac3f0ff4e9f58 lrc: 1/0,0 mode: --/CR res: [0x20007200f:0xbad7:0x0].0 bits 0x2 rrc: 1 type: IBT flags: 0x44000000000000 nid: 10.151.40.7@o2ib remote: 0x3afc5dc70a36e37f expref: 10810 pid: 39779 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1465510946.072771:0:39779:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-nbp6-MDT0000_UUID lock: ffff88031d104740/0xbbcac3f0ff4e9f58 lrc: 0/0,0 mode: --/CR res: [0x20007200f:0xbad7:0x0].0 bits 0x2 rrc: 1 type: IBT flags: 0x44000000000000 nid: 10.151.40.7@o2ib remote: 0x3afc5dc70a36e37f expref: 10810 pid: 39779 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1465510946.072775:0:39779:0:(ldlm_lockd.c:1450:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff88031d104740, rc -71)
00000100:00100000:1.0:1465510946.072787:0:39779:0:(service.c:2074:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_091:f83705d3-c797-c136-06e3-a15eff96d960+10809:46841:x1535100545951340:12345-10.151.40.7@o2ib:101 Request procesed in 45us (78us total) trans 0 rc -71/-71
...

That means the server returned a protocol error (rc = -71, EPROTO) when handling the ldlm enqueue RPC from the client. Without more detailed logs, however, we cannot pinpoint exactly what went wrong. I tried to simulate the interoperability problem (client b2_7, server b2_5) locally but could not reproduce it. Please enable -1 level debug logging on both the client and the MDS for a short time, retry the failing operation, then collect the Lustre debug logs from both the client and the MDS and attach them to this Jira ticket directly.

Thanks!

(Note: to keep the debug logs small, please run "lctl clear" on both the client and the MDS before the new attempt.)
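
The capture procedure requested above can be sketched as a short command plan to run on both the client and the MDS. This is only an illustration under stated assumptions: the function names, the dump path, and the optional `restore_mask` parameter are hypothetical and not from this ticket. The `lctl` subcommands used (`set_param debug=-1`, `clear`, `dk`) are the standard ones for setting the debug mask, clearing the kernel debug buffer, and dumping it to a file. The sketch only builds the command lines and does not execute anything.

```python
import shlex


def debug_capture_commands(dump_path, restore_mask=None):
    """Build lctl command lists for one short full-debug capture.

    Returns (before, after): run `before`, reproduce the failing
    operation, then run `after`. Both parameters are illustrative:
    dump_path is where the debug buffer is written, and restore_mask
    is an optional previous debug mask to put back afterwards.
    """
    before = [
        ["lctl", "set_param", "debug=-1"],  # enable all debug flags
        ["lctl", "clear"],                  # empty the kernel debug buffer
    ]
    after = [
        ["lctl", "dk", dump_path],          # dump the debug buffer to a file
    ]
    if restore_mask is not None:
        after.append(["lctl", "set_param", f"debug={restore_mask}"])
    return before, after


def render(commands):
    """Render argument lists as shell-quoted command strings."""
    return [shlex.join(cmd) for cmd in commands]
```

Keeping the window between `lctl clear` and `lctl dk` as short as possible limits the dump to the failing operation, which is the point of the note above.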

Comment by nasf (Inactive) [ 14/Jul/16 ]

Any feedback? Thanks!

Comment by Mahmoud Hanafi [ 29/Jun/17 ]

We can close this case

Comment by Peter Jones [ 29/Jun/17 ]

ok Mahmoud
