[LU-2959] ASSERTION( cli->cl_mgc_configs_dir ) for 200 osts x 2 oss Created: 13/Mar/13 Updated: 20/Jun/14 Resolved: 18/Oct/13 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Minh Diep | Assignee: | Nathaniel Clark |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | LB | ||
| Environment: |
While testing wide striping with 1 MDS and 2 OSSes. Each OSS creates 200 OSTs and mounts them all at the same time using a script. |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 7216 | ||||||||||||
| Description |
|
Hit LBUG on one of the OSSes.

LustreError: 16207:0:(mgc_request.c:1682:mgc_copy_llog()) ASSERTION( cli->cl_mgc_configs_dir ) failed:
LustreError: 16207:0:(mgc_request.c:1682:mgc_copy_llog()) LBUG
Pid: 16207, comm: ll_cfg_requeue

Call Trace:
[<ffffffffa04e2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa04e2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0dbfe2b>] mgc_process_cfg_log+0x134b/0x15c0 [mgc]
[<ffffffffa0dc2093>] mgc_process_log+0x463/0x1390 [mgc]
[<ffffffff814ead1a>] ? schedule_timeout+0x19a/0x2e0
[<ffffffffa0dbca30>] ? mgc_blocking_ast+0x0/0x770 [mgc]
[<ffffffffa07aed40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
[<ffffffff81090d4c>] ? remove_wait_queue+0x3c/0x50
[<ffffffffa0dc3973>] mgc_requeue_thread+0x1a3/0x750 [mgc]
[<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
[<ffffffffa0dc37d0>] ? mgc_requeue_thread+0x0/0x750 [mgc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa0dc37d0>] ? mgc_requeue_thread+0x0/0x750 [mgc]
[<ffffffffa0dc37d0>] ? mgc_requeue_thread+0x0/0x750 [mgc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1363211399.16207 |
| Comments |
| Comment by Minh Diep [ 25/Mar/13 ] |
|
I haven't been able to reproduce this on the latest master. We need to check with Alex to see whether a fix in this area landed recently. |
| Comment by Minh Diep [ 25/Mar/13 ] |
| Comment by Alex Zhuravlev [ 26/Mar/13 ] |
|
Sorry, I'm not aware of any fix in this area. |
| Comment by Nathaniel Clark [ 26/Mar/13 ] |
|
I'm inclined to believe this is a race between lustre_fill_super() and LDLM_CB_CANCELING. I think it's a real bug, but one that is very hard to hit. Having many OSTs on a single OSS makes the race more likely. I haven't nailed down the exact chain yet. |
| Comment by Peter Jones [ 27/Mar/13 ] |
|
Dropping priority based on rarity |
| Comment by Nathaniel Clark [ 27/Mar/13 ] |
| Comment by Li Wei (Inactive) [ 01/Apr/13 ] |
|
It seems the MGC "fs" state, such as cl_mgc_configs_dir, is time-shared among different OSTs. Each OST sets "fs" up and cleans it up in server_start_targets(). mgc_requeue_thread() should set up the "fs" itself, but does not seem to be doing so. |
| Comment by Mikhail Pershin [ 01/Apr/13 ] |
|
It doesn't, but it checks that the fs is the proper one and doesn't process the local log otherwise. From mgc_process_cfg_log():
if (lctxt && lsi && IS_SERVER(lsi) &&
(lsi->lsi_srv_mnt == cli->cl_mgc_vfsmnt) &&
!IS_MGS(lsi) && lsi->lsi_srv_mnt) {
|
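The check quoted above only compares mount pointers, while the assertion in mgc_copy_llog() is on cl_mgc_configs_dir. A minimal sketch of that gap follows, assuming the two fields can be observed out of step with each other. The `model_*` types and functions are hypothetical stand-ins rather than Lustre structures, the `lctxt` term of the real condition is left out, and the guarded variant is one defensive option, not necessarily what the actual fix does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fields named in this ticket; not Lustre code. */
struct model_lsi {
    const void *srv_mnt;      /* models lsi->lsi_srv_mnt */
    bool is_server;           /* models IS_SERVER(lsi) */
    bool is_mgs;              /* models IS_MGS(lsi) */
};

struct model_mgc {
    const void *vfsmnt;       /* models cli->cl_mgc_vfsmnt */
    const void *configs_dir;  /* models cli->cl_mgc_configs_dir */
};

/* Mirrors the quoted condition, minus lctxt: "is this target's fs the one set up?" */
static bool mount_matches(const struct model_mgc *cli, const struct model_lsi *lsi)
{
    return lsi != NULL && lsi->is_server && lsi->srv_mnt != NULL &&
           lsi->srv_mnt == cli->vfsmnt && !lsi->is_mgs;
}

/* Models the assertion that fired: the caller must guarantee configs_dir. */
static void model_copy_llog(const struct model_mgc *cli)
{
    assert(cli->configs_dir != NULL);
}

/*
 * Guarded variant (an assumption, for illustration): skip the local copy
 * when configs_dir is missing instead of tripping the assertion.
 * Returns true if the local copy ran.
 */
static bool process_local_log(const struct model_mgc *cli, const struct model_lsi *lsi)
{
    if (!mount_matches(cli, lsi) || cli->configs_dir == NULL)
        return false;
    model_copy_llog(cli);
    return true;
}
```

In this model, a state where the mount pointer matches but `configs_dir` is NULL passes `mount_matches()` and would trip the assertion if `model_copy_llog()` were called directly; `process_local_log()` skips the copy instead.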
| Comment by Alex Zhuravlev [ 01/Apr/13 ] |
|
I think we do not update backups on a subsequent requeue (only on the initial one during service startup), so we don't need to set up or clear the fs in requeue. The interesting thing is that mgc_fs_setup() takes cl_mgc_sem to serialize the setup procedures and llog processing for different services, and mgc_fs_cleanup() then releases it, but requeue doesn't do this. |
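The serialization described above can be modelled in a few lines. This is a sketch under stated assumptions, not the Lustre implementation: a C11 `atomic_flag` stands in for cl_mgc_sem, the real code would sleep on the semaphore rather than fail, and all `model_*` and `REQUEUE_*` names are hypothetical. It shows a requeue path that only looks at the shared directory while holding the same lock as the mount path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one MGC shared by every target mounted on the node. */
struct model_mgc {
    atomic_flag busy;         /* stands in for cl_mgc_sem */
    const char *configs_dir;  /* stands in for cl_mgc_configs_dir */
};

enum requeue_result {
    REQUEUE_BACKED_OFF,   /* a mount currently owns the fs context */
    REQUEUE_REMOTE_ONLY   /* no fs context: process the log without a local copy */
};

/* Mount path: take ownership, then publish the directory (models mgc_fs_setup()). */
static bool model_fs_setup(struct model_mgc *mgc, const char *dir)
{
    if (atomic_flag_test_and_set(&mgc->busy))
        return false;     /* already owned; the real code would wait here */
    mgc->configs_dir = dir;
    return true;
}

/* Mount path: retract the directory, then release (models mgc_fs_cleanup()). */
static void model_fs_cleanup(struct model_mgc *mgc)
{
    assert(mgc->configs_dir != NULL);  /* cleanup is only valid after setup */
    mgc->configs_dir = NULL;
    atomic_flag_clear(&mgc->busy);
}

/* Requeue path: inspect the shared state only while holding the lock. */
static enum requeue_result model_requeue(struct model_mgc *mgc)
{
    if (atomic_flag_test_and_set(&mgc->busy))
        return REQUEUE_BACKED_OFF;
    /* Lock held, so no mount is between setup and cleanup right now. */
    assert(mgc->configs_dir == NULL);
    atomic_flag_clear(&mgc->busy);
    return REQUEUE_REMOTE_ONLY;
}
```

Because `configs_dir` is non-NULL only between `model_fs_setup()` and `model_fs_cleanup()`, a requeue that respects the lock either backs off or sees no fs context at all. It never sees the half-configured state that an unlocked requeue can run into.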
| Comment by Cliff White (Inactive) [ 16/Apr/13 ] |
|
I have hit this assertion on Hyperion, testing the 2.3.64 tag. The error occurred when attempting to mount a freshly formatted disk:

Apr 15 16:55:57 hyperion-dit31 kernel: Lustre: lustre-OST0032: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Apr 15 16:55:57 hyperion-dit31 kernel: LustreError: 5369:0:(mgc_request.c:1686:mgc_copy_llog()) ASSERTION( cli->cl_mgc_configs_dir ) failed:
Apr 15 16:55:57 hyperion-dit31 kernel: LustreError: 5369:0:(mgc_request.c:1686:mgc_copy_llog()) ASSERTION( cli->cl_mgc_configs_dir ) failed:
Apr 15 16:55:57 hyperion-dit31 kernel: LustreError: 5369:0:(mgc_request.c:1686:mgc_copy_llog()) LBUG
Apr 15 16:55:57 hyperion-dit31 kernel: LustreError: 5369:0:(mgc_request.c:1686:mgc_copy_llog()) LBUG
Apr 15 16:55:57 hyperion-dit31 kernel: Pid: 5369, comm: ll_cfg_requeue
Apr 15 16:55:57 hyperion-dit31 kernel:
Apr 15 16:55:57 hyperion-dit31 kernel: Call Trace:
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa054c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa054ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1075ecb>] mgc_process_cfg_log+0x134b/0x15c0 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1078133>] mgc_process_log+0x463/0x1390 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffff8150eb5a>] ? schedule_timeout+0x19a/0x2e0
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1072a60>] ? mgc_blocking_ast+0x0/0x7e0 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa0acce00>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffff8109705c>] ? remove_wait_queue+0x3c/0x50
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1079a13>] mgc_requeue_thread+0x1a3/0x750 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1079870>] ? mgc_requeue_thread+0x0/0x750 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1079870>] ? mgc_requeue_thread+0x0/0x750 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1079870>] ? mgc_requeue_thread+0x0/0x750 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

The rest of the stack was lost. |
| Comment by Nathaniel Clark [ 20/Aug/13 ] |
|
Fixed by http://review.whamcloud.com/5049 |
| Comment by Niu Yawei (Inactive) [ 20/Jun/14 ] |
|
This bug has been hit at several 2.4 sites. Could someone backport the fix to 2.4? |