[LU-2959] ASSERTION( cli->cl_mgc_configs_dir ) for 200 osts x 2 oss Created: 13/Mar/13  Updated: 20/Jun/14  Resolved: 18/Oct/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Major
Reporter: Minh Diep Assignee: Nathaniel Clark
Resolution: Duplicate Votes: 0
Labels: LB
Environment:

While testing wide striping with 1 MDS and 2 OSSs. Each OSS creates 200 OSTs and mounts them at the same time using a script.


Issue Links:
Duplicate
duplicates LU-2059 mgc to backup configuration on osd-ba... Resolved
is duplicated by LU-2059 mgc to backup configuration on osd-ba... Resolved
Severity: 3
Rank (Obsolete): 7216

 Description   

Hit LBUG on one of the oss.

LustreError: 16207:0:(mgc_request.c:1682:mgc_copy_llog()) ASSERTION( cli->cl_mgc_configs_dir ) failed:
LustreError: 16207:0:(mgc_request.c:1682:mgc_copy_llog()) LBUG
Pid: 16207, comm: ll_cfg_requeue

Call Trace:
 [<ffffffffa04e2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa04e2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0dbfe2b>] mgc_process_cfg_log+0x134b/0x15c0 [mgc]
 [<ffffffffa0dc2093>] mgc_process_log+0x463/0x1390 [mgc]
 [<ffffffff814ead1a>] ? schedule_timeout+0x19a/0x2e0
 [<ffffffffa0dbca30>] ? mgc_blocking_ast+0x0/0x770 [mgc]
 [<ffffffffa07aed40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
 [<ffffffff81090d4c>] ? remove_wait_queue+0x3c/0x50
 [<ffffffffa0dc3973>] mgc_requeue_thread+0x1a3/0x750 [mgc]
 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
 [<ffffffffa0dc37d0>] ? mgc_requeue_thread+0x0/0x750 [mgc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa0dc37d0>] ? mgc_requeue_thread+0x0/0x750 [mgc]
 [<ffffffffa0dc37d0>] ? mgc_requeue_thread+0x0/0x750 [mgc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1363211399.16207


 Comments   
Comment by Minh Diep [ 25/Mar/13 ]

I haven't been able to reproduce this on the latest master. We need to check with Alex to see whether there has been a recent fix in this area.

Comment by Minh Diep [ 25/Mar/13 ]

I used https://build.whamcloud.com/job/lustre-master/1337/

Comment by Alex Zhuravlev [ 26/Mar/13 ]

Sorry, I'm not aware of any fix in this area.

Comment by Nathaniel Clark [ 26/Mar/13 ]

I'm inclined to believe this is a race between lustre_fill_super() and LDLM_CB_CANCELING. I think it's a real bug, but it's very hard to hit; many OSTs on a single OSS exacerbate the race. I haven't nailed down the exact chain yet.

Comment by Peter Jones [ 27/Mar/13 ]

Dropping priority based on rarity

Comment by Nathaniel Clark [ 27/Mar/13 ]

http://review.whamcloud.com/5860

Comment by Li Wei (Inactive) [ 01/Apr/13 ]

It seems the MGC "fs" state, like cl_mgc_configs_dir, is time-shared among the different OSTs: each OST sets up "fs" and cleans it up in server_start_targets(). mgc_requeue_thread() should set up the "fs" itself, but it does not seem to be doing so.
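
Below is a minimal userspace model of that description. All names (configs_dir, copy_llog, mount_target, requeue_worker) are invented stand-ins, not the real Lustre symbols: several "mount" threads take turns owning a shared configs_dir pointer, while a requeue worker uses it without doing its own setup, so once the mounts have torn the pointer down the requeue-side assertion trips, much like LASSERT(cli->cl_mgc_configs_dir) in mgc_copy_llog().

/*
 * Hypothetical model only; none of these helpers exist in Lustre.
 * It mirrors the description above: a shared configs_dir pointer that each
 * "mount" sets up and tears down, and a requeue worker that uses it blindly.
 */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;
static const char *configs_dir;         /* stands in for cl_mgc_configs_dir */

/* Stand-in for mgc_copy_llog(): trusts that setup has already been done. */
static void copy_llog(const char *who)
{
        assert(configs_dir != NULL);    /* the assertion that fires here */
        printf("%s: copied config log into %s\n", who, configs_dir);
}

/* Stand-in for one target mounting: set up "fs", process the log, clean up. */
static void *mount_target(void *arg)
{
        pthread_mutex_lock(&fs_lock);   /* models mgc_fs_setup() */
        configs_dir = "CONFIGS";
        copy_llog(arg);
        configs_dir = NULL;             /* models mgc_fs_cleanup() */
        pthread_mutex_unlock(&fs_lock);
        return NULL;
}

/* Stand-in for ll_cfg_requeue: reprocesses the log but never sets up "fs". */
static void *requeue_worker(void *arg)
{
        (void)arg;
        usleep(1000);                   /* let the mounts finish first */
        copy_llog("requeue");           /* configs_dir is likely NULL by now */
        return NULL;
}

int main(void)
{
        pthread_t ost[4], requeue;

        for (int i = 0; i < 4; i++)
                pthread_create(&ost[i], NULL, mount_target, (void *)"OST");
        pthread_create(&requeue, NULL, requeue_worker, NULL);

        for (int i = 0; i < 4; i++)
                pthread_join(ost[i], NULL);
        pthread_join(requeue, NULL);
        return 0;
}

Compiled with -pthread, this usually aborts on the assert once the mounts have finished, which is the same window the LBUG points at: a config-log requeue arriving after every target's setup/cleanup cycle has cleared the shared pointer.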

Comment by Mikhail Pershin [ 01/Apr/13 ]

It doesn't, but it checks that "fs" is the proper one and does not process the local log otherwise; from mgc_process_cfg_log():

	if (lctxt && lsi && IS_SERVER(lsi) &&
	    (lsi->lsi_srv_mnt == cli->cl_mgc_vfsmnt) &&
	    !IS_MGS(lsi) && lsi->lsi_srv_mnt) {

Comment by Alex Zhuravlev [ 01/Apr/13 ]

I think we do not update backups on a subsequent requeue (only on the initial one during service startup), so we don't need to set up/clear "fs" in requeue.
But given that lsi_srv_mnt is unique, we shouldn't be calling mgc_copy_llog() on requeue at all.

The interesting thing is that mgc_fs_setup() takes cl_mgc_sem to serialize the setup procedure and llog processing for different services, and mgc_fs_cleanup() then releases it; but requeue doesn't do this.
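
To make that asymmetry concrete, here is a small standalone sketch of the discipline, again with invented stand-in names rather than the real mgc_request.c symbols: fs_setup()/fs_cleanup() hold a semaphore across the window in which the shared pointer is valid (as mgc_fs_setup()/mgc_fs_cleanup() do with cl_mgc_sem), and the requeue-side copy either takes the same semaphore or skips the local copy when no "fs" context is owned, the two options discussed above.

/*
 * Illustrative sketch only; fs_setup(), fs_cleanup() and requeue_copy_llog()
 * are invented stand-ins for the behaviour discussed above, not Lustre code.
 */
#include <semaphore.h>
#include <stdio.h>

static sem_t mgc_sem;                /* models cli->cl_mgc_sem         */
static const char *configs_dir;      /* models cli->cl_mgc_configs_dir */

static void fs_setup(void)           /* like mgc_fs_setup(): down + set  */
{
        sem_wait(&mgc_sem);
        configs_dir = "CONFIGS";
}

static void fs_cleanup(void)         /* like mgc_fs_cleanup(): clear + up */
{
        configs_dir = NULL;
        sem_post(&mgc_sem);
}

/* Requeue side: only touch the pointer while the semaphore is held,
 * and skip the local copy entirely if no "fs" context is set up. */
static void requeue_copy_llog(void)
{
        sem_wait(&mgc_sem);
        if (configs_dir != NULL)
                printf("requeue: backing up config log into %s\n", configs_dir);
        else
                printf("requeue: no fs context, skipping local copy\n");
        sem_post(&mgc_sem);
}

int main(void)
{
        sem_init(&mgc_sem, 0, 1);

        fs_setup();
        printf("mount: processed config log in %s\n", configs_dir);
        fs_cleanup();

        requeue_copy_llog();         /* cannot observe a half-torn-down context */

        sem_destroy(&mgc_sem);
        return 0;
}

This only illustrates the locking gap; as noted below, the issue was ultimately closed as a duplicate of LU-2059, and the fix cited is http://review.whamcloud.com/5049.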

Comment by Cliff White (Inactive) [ 16/Apr/13 ]

I have hit this assertion on Hyperion, testing the 2.3.64 tag. The error occurred when attempting to mount a freshly formatted disk:

Apr 15 16:55:57 hyperion-dit31 kernel: Lustre: lustre-OST0032: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Apr 15 16:55:57 hyperion-dit31 kernel: LustreError: 5369:0:(mgc_request.c:1686:mgc_copy_llog()) ASSERTION( cli->cl_mgc_configs_dir ) failed:
Apr 15 16:55:57 hyperion-dit31 kernel: LustreError: 5369:0:(mgc_request.c:1686:mgc_copy_llog()) ASSERTION( cli->cl_mgc_configs_dir ) failed:
Apr 15 16:55:57 hyperion-dit31 kernel: LustreError: 5369:0:(mgc_request.c:1686:mgc_copy_llog()) LBUG
Apr 15 16:55:57 hyperion-dit31 kernel: LustreError: 5369:0:(mgc_request.c:1686:mgc_copy_llog()) LBUG
Apr 15 16:55:57 hyperion-dit31 kernel: Pid: 5369, comm: ll_cfg_requeue
Apr 15 16:55:57 hyperion-dit31 kernel:
Apr 15 16:55:57 hyperion-dit31 kernel: Call Trace:
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa054c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa054ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1075ecb>] mgc_process_cfg_log+0x134b/0x15c0 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1078133>] mgc_process_log+0x463/0x1390 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffff8150eb5a>] ? schedule_timeout+0x19a/0x2e0
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1072a60>] ? mgc_blocking_ast+0x0/0x7e0 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa0acce00>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffff8109705c>] ? remove_wait_queue+0x3c/0x50
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1079a13>] mgc_requeue_thread+0x1a3/0x750 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1079870>] ? mgc_requeue_thread+0x0/0x750 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1079870>] ? mgc_requeue_thread+0x0/0x750 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffffa1079870>] ? mgc_requeue_thread+0x0/0x750 [mgc]
Apr 15 16:55:57 hyperion-dit31 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

The rest of the stack was lost.

Comment by Nathaniel Clark [ 20/Aug/13 ]

Fixed by http://review.whamcloud.com/5049

Comment by Niu Yawei (Inactive) [ 20/Jun/14 ]

This bug has been hit at several 2.4 sites; could someone backport the fix to 2.4?
