[LU-12570] sanity test 134a crash with SSK in use Created: 22/Jul/19 Updated: 04/Oct/19 Resolved: 28/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
When running sanity with shared key enabled, it always crashes in test 134a like this:

LustreError: 32759:0:(ofd_internal.h:412:ofd_info()) ASSERTION( info ) failed:
LustreError: 32759:0:(ofd_internal.h:412:ofd_info()) LBUG
Pid: 32759, comm: mdt00_008 3.10.0-7.6-debug #1 SMP Fri Jul 12 02:40:17 EDT 2019
Call Trace:
 [<ffffffffa01928cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
 [<ffffffffa019297c>] lbug_with_loc+0x4c/0xa0 [libcfs]
 [<ffffffffa0e94dca>] ofd_exit+0x0/0x236 [ofd]
 [<ffffffffa0e9397b>] ofd_lvbo_update+0xd2b/0xe30 [ofd]
 [<ffffffffa05ec99c>] ldlm_handle_ast_error+0x45c/0x820 [ptlrpc]
 [<ffffffffa05ee6ea>] ldlm_cb_interpret+0x19a/0x700 [ptlrpc]
 [<ffffffffa0608071>] ptlrpc_check_set.part.23+0x491/0x1e00 [ptlrpc]
 [<ffffffffa0609a3b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
 [<ffffffffa0609ddc>] ptlrpc_set_wait+0x31c/0x790 [ptlrpc]
 [<ffffffffa05c7e35>] ldlm_run_ast_work+0xd5/0x380 [ptlrpc]
 [<ffffffffa05fe8c5>] ldlm_reclaim_full+0x425/0x7a0 [ptlrpc]
 [<ffffffffa05f0338>] ldlm_handle_enqueue0+0x138/0x15d0 [ptlrpc]
 [<ffffffffa0676b42>] tgt_enqueue+0x62/0x210 [ptlrpc]
 [<ffffffffa067ef85>] tgt_request_handle+0x985/0x1630 [ptlrpc]
 [<ffffffffa0622568>] ptlrpc_server_handle_request+0x258/0xb00 [ptlrpc]
 [<ffffffffa062670a>] ptlrpc_main+0xcba/0x2500 [ptlrpc]
 [<ffffffff810b4ed4>] kthread+0xe4/0xf0
 [<ffffffff817c8c5d>] ret_from_fork_nospec_begin+0x7/0x21
 [<ffffffffffffffff>] 0xffffffffffffffff |
| Comments |
| Comment by Alex Zhuravlev [ 22/Jul/19 ] |
|
It's an MDT thread, which seems to be missing LCT_DT_THREAD? |
| Comment by Alex Zhuravlev [ 22/Jul/19 ] |
static int mds_start_ptlrpc_service(struct mds_device *m)
...
	.tc_ctx_tags = LCT_MD_THREAD,
}

i.e. LCT_DT_THREAD is missing? |
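
To illustrate why a missing context tag could end in ASSERTION( info ) failed, here is a minimal toy model in C. It is not Lustre's lu_context code: the names toy_ctx, toy_key, toy_ctx_fill and toy_key_get and the tag bit values are all hypothetical. The only assumption taken from the discussion is that per-thread data is set up only for keys whose tags overlap the tags the service thread was started with.

```c
#include <stddef.h>

/* Illustrative tag bits only; stand-ins for LCT_MD_THREAD / LCT_DT_THREAD. */
enum toy_tag {
	TOY_MD_THREAD = 1 << 0,
	TOY_DT_THREAD = 1 << 1,
};

#define TOY_MAX_KEYS 4

/* A key names one slot of per-thread data and the contexts that get it. */
struct toy_key {
	unsigned int tags;	/* context types this key applies to */
	int index;		/* slot in toy_ctx.values */
};

/* A per-thread context, created with the tags of its service. */
struct toy_ctx {
	unsigned int tags;
	void *values[TOY_MAX_KEYS];
};

/* Attach a value for the key, but only if the tags overlap. */
static void toy_ctx_fill(struct toy_ctx *ctx, const struct toy_key *key,
			 void *value)
{
	if (ctx->tags & key->tags)
		ctx->values[key->index] = value;
}

/* Look up the value; NULL means the key was never set up for this thread. */
static void *toy_key_get(const struct toy_ctx *ctx, const struct toy_key *key)
{
	return ctx->values[key->index];
}
```

In this model, a context created with only TOY_MD_THREAD never receives a value for a key tagged TOY_DT_THREAD, so a later lookup returns NULL, which is the condition an assertion like the one in ofd_info() would trip on.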
| Comment by Alex Zhuravlev [ 22/Jul/19 ] |
|
it would be great to have a log/dump for the case |
| Comment by Oleg Drokin [ 22/Jul/19 ] |
|
sure thing. |
| Comment by Andreas Dilger [ 30/Jul/19 ] |
|
Alex, is this just a matter of adding LCT_DT_THREAD to mds_start_ptlrpc_service() setting up the threads? What is the impact/overhead of doing this (if any)? Would it be better to limit ldlm_reclaim_ns() to only clean up locks in the same namespace type as mentioned in LU-12592? |
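
The alternative raised here, restricting reclaim to locks of the caller's own namespace type, can be sketched as a simple filter. This is a hypothetical illustration only: toy_lock, ns_kind and toy_reclaim are invented names and do not reflect how ldlm_reclaim_ns() is actually structured.

```c
#include <stddef.h>

/* Hypothetical namespace kinds; not Lustre's real enum. */
enum ns_kind { NS_KIND_MDT, NS_KIND_OST };

struct toy_lock {
	enum ns_kind kind;	/* namespace type the lock belongs to */
	int cancelled;		/* set to 1 once the lock is reclaimed */
};

/*
 * Reclaim (mark cancelled) up to `want` locks, touching only locks whose
 * namespace kind matches the caller's. Returns how many were reclaimed.
 */
static size_t toy_reclaim(struct toy_lock *locks, size_t n,
			  enum ns_kind caller_kind, size_t want)
{
	size_t done = 0;

	for (size_t i = 0; i < n && done < want; i++) {
		if (locks[i].kind != caller_kind || locks[i].cancelled)
			continue;	/* skip foreign or already-cancelled locks */
		locks[i].cancelled = 1;
		done++;
	}
	return done;
}
```

With such a filter, a thread serving an MDT namespace would never run callbacks for OST-type locks, so it would not need the data-thread context in the first place.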
| Comment by Sebastien Buisson [ 19/Aug/19 ] |
|
Hi, I never hit this crash myself, and I have many examples of sanity test 134a passing with SSK enabled, for instance all custom-103 sessions triggered from https://review.whamcloud.com/34380 (the latest one was run with the patch rebased on August 1st). I am not sure it is an issue with SSK. I am wondering whether the crash you experience still occurs with the modification suggested by Alex (LCT_DT_THREAD). |
| Comment by Alex Zhuravlev [ 19/Aug/19 ] |
|
Sorry for the late response. Adding LCT_DT_THREAD is not quite enough: the problem is that the client is trying to cancel extent locks, sending them to the MDT. |
| Comment by Gerrit Updater [ 13/Sep/19 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36179 |
| Comment by Gerrit Updater [ 27/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36179/ |
| Comment by Peter Jones [ 28/Sep/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 28/Sep/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36312 |
| Comment by Gerrit Updater [ 04/Oct/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36312/ |