Details
Bug
Resolution: Fixed
Blocker
Lustre 2.6.0
Large DNE system on CentOS, upgrading from 2.5 with remote directories to 2.6/master. Occurred on some MDSes when trying to start the MDTs.
3
14218
Description
When trying to start our large 2.5 DNE test bed with 2.6, we hit the following assertion:
Jun 6 09:12:35 galaxy-esf-mds004 kernel: LustreError: 14404:0:(fld_index.c:176:fld_index_create()) ASSERTION( mutex_is_locked(&fld->lsf_lock) ) failed:
Jun 6 09:12:35 galaxy-esf-mds004 kernel: LustreError: 14404:0:(fld_index.c:176:fld_index_create()) LBUG
Jun 6 09:12:35 galaxy-esf-mds004 kernel: Pid: 14404, comm: llog_process_th
Jun 6 09:12:35 galaxy-esf-mds004 kernel:
Jun 6 09:12:35 galaxy-esf-mds004 kernel: Call Trace:
Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa0a8b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa0a8be97>] lbug_with_loc+0x47/0xb0 [libcfs]
Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa104adf3>] fld_index_create+0x5a3/0x750 [fld]
Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa1370f2e>] ? osd_trans_start+0x21e/0x660 [osd_ldiskfs]
Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa104b7f1>] fld_insert_entry+0x291/0x380 [fld]
Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa104976a>] fld_update_from_controller+0x27a/0x540 [fld]
Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa1485496>] mdt_register_lwp_callback+0x76/0x2d0 [mdt]
Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa0c4943f>] lustre_lwp_connect+0x83f/0xc90 [obdclass]
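Looking at the assertion and the other call chains into this function, I see that the mutex in question (fld->lsf_lock) is usually taken around calls to fld_insert_entry. In the trace above, fld_update_from_controller reaches fld_insert_entry without the mutex held, which trips the assertion in fld_index_create.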
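To illustrate the locking contract described above, here is a minimal userspace sketch, not the Lustre code and not the patch. All names (demo_fld, demo_index_create, demo_update_locked, demo_update_unlocked) are hypothetical, a pthread mutex stands in for lsf_lock, and a plain flag stands in for the kernel's mutex_is_locked() check. It assumes a single-threaded demo, so reading the flag without the lock is safe here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the server FLD structure; not the Lustre type. */
struct demo_fld {
	pthread_mutex_t lock;    /* plays the role of lsf_lock */
	bool            locked;  /* userspace substitute for mutex_is_locked() */
	int             entries; /* number of entries "inserted" so far */
};

static void demo_fld_init(struct demo_fld *fld)
{
	pthread_mutex_init(&fld->lock, NULL);
	fld->locked = false;
	fld->entries = 0;
}

static void demo_fld_lock(struct demo_fld *fld)
{
	pthread_mutex_lock(&fld->lock);
	fld->locked = true;
}

static void demo_fld_unlock(struct demo_fld *fld)
{
	fld->locked = false;
	pthread_mutex_unlock(&fld->lock);
}

/* Callee: requires the caller to hold the lock. Where the kernel code
 * asserts and LBUGs, this sketch returns -1 so the failure is observable. */
static int demo_index_create(struct demo_fld *fld)
{
	if (!fld->locked)
		return -1;
	fld->entries++;
	return 0;
}

/* Caller that does not take the lock: mirrors the failing call chain. */
static int demo_update_unlocked(struct demo_fld *fld)
{
	return demo_index_create(fld);
}

/* Caller that takes the lock around the insert: mirrors the usual chains. */
static int demo_update_locked(struct demo_fld *fld)
{
	int rc;

	demo_fld_lock(fld);
	rc = demo_index_create(fld);
	demo_fld_unlock(fld);
	return rc;
}
```

In this sketch, demo_update_unlocked fails the held-lock check while demo_update_locked succeeds, which is the difference between the new upgrade path and the existing callers. Whether the actual fix takes the mutex in the caller or restructures the call is not something this example tries to show.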
The problematic call chain was introduced by this commit:
commit 519a65ddc04673022124f421e4809f8a87f790d7
Author: wang di <di.wang@intel.com>
Date: Tue Oct 8 02:13:27 2013 -0700
LU-4076 fld: add local fldb to each target
1. Add local FLDB to each MDT, so OSD/OUT can check whether
FID is remote by looking up local FLDB, i.e. no need send RPC
to MDT0.
2. OSD will only do local lookup when checking remote FID.
3. During upgrade, MDTn(n != 0) needs to retrieve its fldb
entries from controller(MDT0) and insert into the local
FLDB.
4. MDT should also use LWP(instead of OSP) to communicate
with sequence controller (MDT0).
Signed-off-by: wang di <di.wang@intel.com>
Change-Id: I788a543aeb7305dfbad3cc41b586f9337f227119
Reviewed-on: http://review.whamcloud.com/7884
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Tested-by: Jenkins
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
I will generate a patch.