[LU-7290] lock callback not getting to client Created: 13/Oct/15  Updated: 06/Jan/17  Resolved: 14/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Jinshan Xiong (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7054 ib_cm scalling issue when lustre clie... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We believe that lock callbacks are not reaching the clients. This is happening from one server to several clients.

Oct 13 09:07:13 nbp9-oss8 kernel: LustreError: 92805:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.27.22@o2ib  ns: filter-nbp9-OST00a7_UUID lock: ffff88065cc93980/0xb2e0788712229c2d lrc: 4/0,0 mode: PR/PR res: [0x838044:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000000020 nid: 10.151.27.22@o2ib remote: 0x589c09f2406c35d5 expref: 17 pid: 11839 timeout: 4695872869 lvb_type: 1
Oct 13 09:07:14 nbp9-oss8 kernel: LustreError: 0:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.27.22@o2ib  ns: filter-nbp9-OST0047_UUID lock: ffff8807552a2500/0xb2e0788712233bdd lrc: 4/0,0 mode: PR/PR res: [0x8342ac:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000000020 nid: 10.151.27.22@o2ib remote: 0x589c09f2406c9876 expref: 14 pid: 11960 timeout: 4695873146 lvb_type: 1
Oct 13 09:07:16 nbp9-oss8 kernel: LustreError: 0:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.27.22@o2ib  ns: filter-nbp9-OST0027_UUID lock: ffff880bebb08300/0xb2e0788712231e55 lrc: 4/0,0 mode: PR/PR res: [0x863302:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000000020 nid: 10.151.27.22@o2ib remote: 0x589c09f2406c7d95 expref: 22 pid: 11990 timeout: 4695875446 lvb_type: 1
Oct 13 09:15:24 nbp9-oss3 kernel: LustreError: 14326:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.26.25@o2ib  ns: filter-nbp9-OST00b2_UUID lock: ffff880345a5f9c0/0xb193eb7bec015e25 lrc: 5/0,0 mode: PR/PR res: [0x8b2dc6:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 10.151.26.25@o2ib remote: 0x73b551da2111846b expref: 50081 pid: 14044 timeout: 4696527377 lvb_type: 1


 Comments   
Comment by Peter Jones [ 14/Oct/15 ]

Jinshan

Could you please advise?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 14/Oct/15 ]

Hi Mahmoud,

Are you saying that the server had sent BLOCKING AST but for some reason the clients didn't receive it?

I ask because I have recently seen a suspicious LNET problem where RPCs are discarded for an unknown reason.

Comment by Mahmoud Hanafi [ 14/Oct/15 ]

We are only guessing that the RPC was sent and didn't arrive, so it may be the case that the RPC was discarded. Are the discards happening on the receiving side?

How can we identify whether RPCs are getting discarded?

Comment by Jinshan Xiong (Inactive) [ 14/Oct/15 ]

From what I have seen, the client had already received the RPC and handled it, but for an unknown reason the reply was discarded.

The only way to identify this problem is to collect and analyze debug logs. From the description of this ticket, we need the logs from at least 3 minutes before the lock timeout message is seen. For a busy server, that will require a lot of memory for the debug log buffer, but we can mitigate this by reducing the lock timeout value.
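For reference, a minimal sketch of how the debug buffer could be enlarged and dumped once an eviction message appears (the debug_mb value is illustrative, and the exact mechanism for lowering ldlm_timeout varies by release):

# enlarge the in-memory debug log buffer (size in MB; value is illustrative)
lctl set_param debug_mb=1024
# clear any existing debug log before trying to reproduce
lctl clear
# ... wait for a "lock callback timer expired" message, then dump the buffer to a file
lctl dk /tmp/lustre-debug.$(hostname).log

Lowering the ldlm_timeout module parameter, as suggested above, would shorten the window of history that has to fit in the buffer.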

Comment by Mahmoud Hanafi [ 15/Oct/15 ]

What debug mask would you like? dlmtrace, rpctrace?

Comment by Jinshan Xiong (Inactive) [ 15/Oct/15 ]

Yes, and with extra vfstrace on client side.
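A hedged sketch of the corresponding lctl settings (flag names taken from the comments above; adjust as needed):

# on the OSS nodes
lctl set_param debug=+dlmtrace
lctl set_param debug=+rpctrace
# on the client nodes, additionally
lctl set_param debug=+vfstrace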

Comment by Mahmoud Hanafi [ 16/Oct/15 ]

When does the timer start for the lock callback? When it is put on the send queue, or when it is actually sent?

Comment by Mahmoud Hanafi [ 18/Oct/15 ]

This may be related to https://jira.hpdd.intel.com/browse/LU-7054

Comment by Mahmoud Hanafi [ 14/Jul/16 ]

We haven't seen this issue again and have no reproducer. We can close it.

Comment by Peter Jones [ 14/Jul/16 ]

ok - thanks Mahmoud

Comment by Jessica Otey [ 04/Jan/17 ]

We are using 2.5.3 and I saw this exact issue today... I am curious as to whether it is related to load on the MDT... I was running a file system scan using robinhood, which considerably drives up the load on the MDT, and at one point during the scan, the client I was doing it from got evicted with this exact error.

Jan 4 10:56:46 asimov kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 10.7.17.127@o2ib ns: mdt naaschpc-MDT0000_UUID lock: ffff8805f53493c0/0x3a248a284f4caf4c lrc: 3/0,0 mode: PR/PR res: [0x2000027f7:0x3083:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x60200000000020 nid: 10.7.17.127@o2ib remote: 0x617c3509692de90b expref: 28235252 pid: 2622 timeout: 6086108175 lvb_type: 0
Jan 4 10:56:46 asimov kernel: LustreError: 10631:0:(ldlm_lib.c:2726:target_bulk_io()) @@@ bulk PUT failed: rc - 107 req@ffff8801d4d08400 x1550187285037668/t0(0) o37->68e8ddc6-ddeb-35a9-9896-59612db672b8@10.7.17.127@o2ib:0/0 lens 448/440 e 0 to 0 dl 1483545416 ref 1 fl Interpret:/0/0 rc 0/0
Jan 4 10:56:54 asimov kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 10.7.17.127@o2ib ns: mdt-naaschpc-MDT0000_UUID lock: ffff88045e367d80/0x3a248a281a847e5e lrc: 3/0,0 mode: PR/PR res: [0x700917:0xf73cee45:0x0].0 bits 0x13 rrc: 4 type: IBT flags: 0x60200000000020 nid: 10.7.17.127@o2ib remote: 0x617c350958b8fb9a expref: 25206201 pid: 2620 timeout: 6086116329 lvb_type: 0
Jan 4 10:57:09 asimov kernel: LustreError: 2164:0:(ldlm_lockd.c:662:ldlm_handle_ast_error()) ### client (nid 10.7.17.127@o2ib) returned 0 from blocking AST ns: mdt-naaschpc-MDT0000_UUID lock: ffff880ad4359980/0x3a248a284f4cb208 lrc: 4/0,0 mode: PR/PR res: [0x2000027f7:0x3084:0x0].0 bits 0x1b rrc: 2 type: IBT flags: 0x60200000000020 nid: 10.7.17.127@o2ib remote: 0x617c3509692de9c1 expref: 19443642 pid: 2622 timeout: 6086232216 lvb_type: 0

Comment by Jinshan Xiong (Inactive) [ 05/Jan/17 ]

There was a failure in target_bulk_io, so I suspect that the reader was stuck and the lock timeout was just a fallout of that problem. Do you have an LNET router? Can you also show me the log messages on client 10.7.17.127?
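For reference, a minimal sketch of how the routing configuration could be checked on a node with lctl (output format varies by Lustre version):

# list the LNET routes configured on this node
lctl show_route
# list the local NIDs
lctl list_nids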

Comment by Jessica Otey [ 06/Jan/17 ]

Yes, we do have 2 LNET routers....

Here's the log message from the client (10.7.17.127). You can see that it lost connection to the MDT... maybe related to the mds_getattr_lock failure? 

Jan 4 10:56:46 osiris kernel: LustreError: 11-0: naaschpc-MDT0000-mdc-ffff88bfb5504c00: Communicating with 10.7.17.8@o2ib, operation mds_getattr_lock failed with -107.
Jan 4 10:56:46 osiris kernel: Lustre: naaschpc-MDT0000-mdc-ffff88bfb5504c00: Connection to naaschpc-MDT0000 (at 10.7.17.8@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jan 4 10:56:46 osiris kernel: LustreError: 167-0: naaschpc-MDT0000-mdc-ffff88bfb5504c00: This client was evicted by naaschpc-MDT0000; in progress operations using this service will fail.
Jan 4 10:56:46 osiris kernel: LustreError: 180049:0:(dir.c:422:ll_get_dir_page()) read cache page: [0x200000d0e:0x1d3d2:0x0] at 0: rc -5
Jan 4 10:56:46 osiris kernel: LustreError: 180049:0:(dir.c:422:ll_get_dir_page()) Skipped 2 previous similar messages
Jan 4 10:56:46 osiris kernel: LustreError: 180049:0:(dir.c:584:ll_dir_read()) error reading dir [0x200000d0e:0x1d3d2:0x0] at 0: rc -5
Jan 4 10:56:46 osiris kernel: LustreError: 180049:0:(dir.c:584:ll_dir_read()) Skipped 6 previous similar messages
Jan 4 10:56:46 osiris kernel: LustreError: 180040:0:(file.c:171:ll_close_inode_openhandle()) inode 144115244145431261 mdc close failed: rc = -5
Jan 4 10:56:46 osiris kernel: LustreError: 180040:0:(file.c:171:ll_close_inode_openhandle()) Skipped 36 previous similar messages
Jan 4 10:56:46 osiris kernel: LustreError: 76799:0:(mdc_locks.c:1257:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -108
Jan 4 10:56:46 osiris kernel: LustreError: 180047:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -5
Jan 4 10:56:46 osiris kernel: LustreError: Skipped 2 previous similar messages
Jan 4 10:56:46 osiris kernel: LustreError: 93923:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -108
Jan 4 10:56:46 osiris kernel: LustreError: 93923:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 743 previous similar messages
Jan 4 10:56:47 osiris kernel: LustreError: 34138:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -108
Jan 4 10:56:47 osiris kernel: LustreError: 34138:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 2 previous similar messages
Jan 4 10:56:49 osiris kernel: LustreError: 34145:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -108
Jan 4 10:56:49 osiris kernel: LustreError: 34145:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 37 previous similar messages
Jan 4 10:56:54 osiris kernel: LustreError: 93923:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -108
Jan 4 10:56:54 osiris kernel: LustreError: 93923:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 41 previous similar messages
Jan 4 10:57:02 osiris kernel: LustreError: 93923:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -108
Jan 4 10:57:02 osiris kernel: LustreError: 93923:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 16 previous similar messages
Jan 4 10:57:19 osiris kernel: LustreError: 93923:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -108
Jan 4 10:57:19 osiris kernel: LustreError: 93923:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 32 previous similar messages
Jan 4 10:57:31 osiris kernel: LustreError: 93893:0:(vvp_io.c:1229:vvp_io_init()) naaschpc: refresh file layout [0x2000027f1:0x15bf0:0x0] error -108.
Jan 4 10:57:31 osiris kernel: LustreError: 93893:0:(lmv_obd.c:1377:lmv_fid_alloc()) Can't alloc new fid, rc -19
Jan 4 10:57:31 osiris kernel: LustreError: 93893:0:(file.c:171:ll_close_inode_openhandle()) inode 144115359622978251 mdc close failed: rc = -108
Jan 4 10:57:31 osiris kernel: LustreError: 93893:0:(file.c:171:ll_close_inode_openhandle()) Skipped 12 previous similar messages
Jan 4 10:57:34 osiris kernel: LustreError: 93917:0:(vvp_io.c:1229:vvp_io_init()) naaschpc: refresh file layout [0x2000027f1:0x15bf0:0x0] error -108.
Jan 4 10:57:34 osiris kernel: LustreError: 93917:0:(lmv_obd.c:1377:lmv_fid_alloc()) Can't alloc new fid, rc -19
Jan 4 10:57:51 osiris kernel: LustreError: 93923:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -108
Jan 4 10:57:51 osiris kernel: LustreError: 93923:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 66 previous similar messages
Jan 4 10:58:11 osiris kernel: LustreError: 93923:0:(vvp_io.c:1229:vvp_io_init()) naaschpc: refresh file layout [0x2000027f1:0x15bd6:0x0] error -108.
Jan 4 10:58:11 osiris kernel: LustreError: 93923:0:(lmv_obd.c:1377:lmv_fid_alloc()) Can't alloc new fid, rc -19
Jan 4 10:58:24 osiris kernel: LustreError: 34155:0:(file.c:3076:ll_inode_revalidate_fini()) naaschpc: revalidate FID [0x6f8001:0xf73c880d:0x0] error: rc = -108
Jan 4 10:58:24 osiris kernel: LustreError: 34155:0:(file.c:3076:ll_inode_revalidate_fini()) Skipped 748 previous similar messages
Jan 4 10:58:32 osiris kernel: LustreError: 34155:0:(file.c:3076:ll_inode_revalidate_fini()) naaschpc: revalidate FID [0x6f8001:0xf73c880d:0x0] error: rc = -108
Jan 4 10:58:32 osiris kernel: LustreError: 34155:0:(file.c:3076:ll_inode_revalidate_fini()) Skipped 3 previous similar messages
Jan 4 10:58:41 osiris kernel: LustreError: 64243:0:(file.c:171:ll_close_inode_openhandle()) inode 144115359622963333 mdc close failed: rc = -108
Jan 4 10:58:41 osiris kernel: LustreError: 64243:0:(file.c:171:ll_close_inode_openhandle()) Skipped 200 previous similar messages
Jan 4 10:58:48 osiris kernel: LustreError: 34155:0:(file.c:3076:ll_inode_revalidate_fini()) naaschpc: revalidate FID [0x6f8001:0xf73c880d:0x0] error: rc = -108
Jan 4 10:58:48 osiris kernel: LustreError: 34155:0:(file.c:3076:ll_inode_revalidate_fini()) Skipped 7 previous similar messages
Jan 4 10:58:51 osiris kernel: LustreError: 64243:0:(ldlm_resource.c:809:ldlm_resource_complain()) naaschpc-MDT0000-mdc-ffff88bfb5504c00: namespace resource [0xc42689:0x6a37c331:0x0].0 (ffff88aa6a930c00) refcount nonzero (2) after lock cleanup; forcing cleanup.
Jan 4 10:58:51 osiris kernel: LustreError: 64243:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0xc42689:0x6a37c331:0x0].0 (ffff88aa6a930c00) refcount = 3
Jan 4 10:58:51 osiris kernel: LustreError: 64243:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
Jan 4 10:58:51 osiris kernel: LustreError: 64243:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: naaschpc-MDT0000-mdc-ffff88bfb5504c00 lock: ffff88a9167e2d40/0x617c35096d0ccf87 lrc: 2/0,0 mode: CR/CR res: [0xc42689:0x6a37c331:0x0].0 bits 0x9 rrc: 3 type: IBT flags: 0x12e400000000 nid: local remote: 0x3a248a285bfbf38e expref: -99 pid: 180051 timeout: 0 lvb_type: 3

********************[ THEN MANY LINES LIKE THIS, where each 'Granted locks' line is followed by many --- Resource lines ]***********************

 Jan 4 10:58:52 osiris kernel: LustreError: 64243:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):

Jan 4 10:58:52 osiris kernel: LustreError: 64243:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x200000d5e:0xe041:0x0].0 (ffff884eebed1d00) refcount = 3
Jan 4 10:58:52 osiris kernel: LustreError: 64243:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x200000c04:0x17b7b:0x0].0 (ffff884ca9c5ed40) refcount = 3

********************[ THEN FINALLY ]***********************

 Jan 4 10:58:52 osiris kernel: Lustre: naaschpc-MDT0000-mdc-ffff88bfb5504c00: Connection restored to naaschpc-MDT0000 (at 10.7.17.8@o2ib)
