[LU-7290] lock callback not getting to client Created: 13/Oct/15 Updated: 06/Jan/17 Resolved: 14/Jul/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We believe that lock callbacks are not getting to clients. This happens from one server to several clients.

Oct 13 09:07:13 nbp9-oss8 kernel: LustreError: 92805:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.27.22@o2ib ns: filter-nbp9-OST00a7_UUID lock: ffff88065cc93980/0xb2e0788712229c2d lrc: 4/0,0 mode: PR/PR res: [0x838044:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000000020 nid: 10.151.27.22@o2ib remote: 0x589c09f2406c35d5 expref: 17 pid: 11839 timeout: 4695872869 lvb_type: 1
Oct 13 09:07:14 nbp9-oss8 kernel: LustreError: 0:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.27.22@o2ib ns: filter-nbp9-OST0047_UUID lock: ffff8807552a2500/0xb2e0788712233bdd lrc: 4/0,0 mode: PR/PR res: [0x8342ac:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000000020 nid: 10.151.27.22@o2ib remote: 0x589c09f2406c9876 expref: 14 pid: 11960 timeout: 4695873146 lvb_type: 1
Oct 13 09:07:16 nbp9-oss8 kernel: LustreError: 0:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.27.22@o2ib ns: filter-nbp9-OST0027_UUID lock: ffff880bebb08300/0xb2e0788712231e55 lrc: 4/0,0 mode: PR/PR res: [0x863302:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000000020 nid: 10.151.27.22@o2ib remote: 0x589c09f2406c7d95 expref: 22 pid: 11990 timeout: 4695875446 lvb_type: 1
Oct 13 09:15:24 nbp9-oss3 kernel: LustreError: 14326:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.26.25@o2ib ns: filter-nbp9-OST00b2_UUID lock: ffff880345a5f9c0/0xb193eb7bec015e25 lrc: 5/0,0 mode: PR/PR res: [0x8b2dc6:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 10.151.26.25@o2ib remote: 0x73b551da2111846b expref: 50081 pid: 14044 timeout: 4696527377 lvb_type: 1 |
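When triaging evictions like the ones above across many OSS nodes, it can help to tally how many "lock callback timer expired" evictions each client NID has accumulated. Below is a minimal sketch of such a helper; the function name and log path are illustrative, and it assumes the syslog format shown in this ticket.

```shell
# Hypothetical helper: count "lock callback timer expired" evictions
# per client NID in a syslog-style file (format as in this ticket).
# "$1" is the path to the log file, e.g. /var/log/messages.
evictions_by_client() {
  grep 'lock callback timer expired' "$1" \
    | sed -n 's/.*evicting client at \([^ ]*@[^ ]*\).*/\1/p' \
    | sort | uniq -c | sort -rn
}
```

In the logs above this would show 10.151.27.22@o2ib evicted three times and 10.151.26.25@o2ib once, which quickly narrows attention to a few clients (or the routers in front of them).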
| Comments |
| Comment by Peter Jones [ 14/Oct/15 ] |
|
Jinshan Could you please advise? Thanks Peter |
| Comment by Jinshan Xiong (Inactive) [ 14/Oct/15 ] |
|
Hi Mahmoud, are you saying that the server sent the blocking AST but for some reason the clients didn't receive it? I ask because I have recently seen a suspicious LNET problem where RPCs are discarded for unknown reasons. |
| Comment by Mahmoud Hanafi [ 14/Oct/15 ] |
|
We are only guessing that the RPC was sent and didn't arrive, so this may be a case of the RPC being discarded. Do the discards happen on the receiving side? How can we identify whether RPCs are getting discarded? |
| Comment by Jinshan Xiong (Inactive) [ 14/Oct/15 ] |
|
From what I have seen, the client already received the RPC and handled it, but for an unknown reason the reply was discarded. The only way to identify this problem is to collect and analyze debug logs. Based on the description of this ticket, we need logs from at least 3 minutes before the lock timeout message is seen. For a busy server, that requires a lot of memory for the debug log buffer, but we can mitigate this by reducing the lock timeout value. |
| Comment by Mahmoud Hanafi [ 15/Oct/15 ] |
|
Which debug masks would you like: dlmtrace, rpctrace? |
| Comment by Jinshan Xiong (Inactive) [ 15/Oct/15 ] |
|
Yes, and with extra vfstrace on client side. |
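The debug collection described above can be set up with `lctl`. A sketch follows, assuming root access on the servers and clients; the buffer size and dump path are illustrative, not values from this ticket.

```shell
# On the OSS nodes: enable DLM and RPC tracing.
lctl set_param debug="+dlmtrace +rpctrace"

# On the clients: the same masks plus VFS tracing.
lctl set_param debug="+dlmtrace +rpctrace +vfstrace"

# Enlarge the kernel debug ring buffer so several minutes of history
# survive on a busy node (size in MB; 1024 is an example value).
lctl set_param debug_mb=1024

# After an eviction fires, dump the buffer for analysis.
lctl dk /tmp/lustre-debug.log
```

The key trade-off Jinshan mentions applies here: the buffer must be large enough to still hold the window before the "lock callback timer expired" message, so either grow `debug_mb` or shorten the lock timeout.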
| Comment by Mahmoud Hanafi [ 16/Oct/15 ] |
|
When does the timer start for the lock callback? When it is put on the send queue, or when it is actually sent? |
| Comment by Mahmoud Hanafi [ 18/Oct/15 ] |
|
This may be related to https://jira.hpdd.intel.com/browse/LU-7054 |
| Comment by Mahmoud Hanafi [ 14/Jul/16 ] |
|
We haven't seen this issue again and have no reproducer. We can close it. |
| Comment by Peter Jones [ 14/Jul/16 ] |
|
ok - thanks Mahmoud |
| Comment by Jessica Otey [ 04/Jan/17 ] |
|
We are using 2.5.3 and I saw this exact issue today. I am curious whether it is related to load on the MDT. I was running a file system scan using robinhood, which considerably drives up the load on the MDT, and at one point during the scan the client I was running it from was evicted with this exact error.

Jan 4 10:56:46 asimov kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 10.7.17.127@o2ib ns: mdt naaschpc-MDT0000_UUID lock: ffff8805f53493c0/0x3a248a284f4caf4c lrc: 3/0,0 mode: PR/PR res: [0x2000027f7:0x3083:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x60200000000020 nid: 10.7.17.127@o2ib remote: 0x617c3509692de90b expref: 28235252 pid: 2622 timeout: 6086108175 lvb_type: 0 |
| Comment by Jinshan Xiong (Inactive) [ 05/Jan/17 ] |
|
There was a failure in target_bulk_io, so I suspect the reader was stuck and the lock timeout was just fallout from that problem. Do you have an LNET router? Can you also show me the log messages on client 10.7.17.127? |
| Comment by Jessica Otey [ 06/Jan/17 ] |
|
Yes, we do have 2 LNET routers. Here are the log messages from the client (10.7.17.127). You can see that it lost its connection to the MDT; maybe that is related to the mds_getattr_lock failure?

Jan 4 10:56:46 osiris kernel: LustreError: 11-0: naaschpc-MDT0000-mdc-ffff88bfb5504c00: Communicating with 10.7.17.8@o2ib, operation mds_getattr_lock failed with -107.

[ THEN MANY LINES LIKE THIS, where each 'Granted locks' line is followed by many '--- Resource' lines ]

Jan 4 10:58:52 osiris kernel: LustreError: 64243:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
Jan 4 10:58:52 osiris kernel: LustreError: 64243:0:(ldlm_resource.c:1448:ldlm_resource_dump()) — Resource: [0x200000d5e:0xe041:0x0].0 (ffff884eebed1d00) refcount = 3

[ THEN FINALLY ]

Jan 4 10:58:52 osiris kernel: Lustre: naaschpc-MDT0000-mdc-ffff88bfb5504c00: Connection restored to naaschpc-MDT0000 (at 10.7.17.8@o2ib) |