Lustre / LU-5153

LustreError: 14404:0:(fld_index.c:176:fld_index_create()) ASSERTION( mutex_is_locked(&fld->lsf_lock) ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.6.0
    • Environment: Large DNE system on CentOS, upgrading from 2.5 with remote directories to 2.6/master. Occurred on some MDSes when trying to start the MDTs.
    • Severity: 3
    • Rank (Obsolete): 14218

    Description

      When trying to start our large 2.5 DNE test bed with 2.6, we hit the following assertion:
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: LustreError: 14404:0:(fld_index.c:176:fld_index_create()) ASSERTION( mutex_is_locked(&fld->lsf_lock) ) failed:
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: LustreError: 14404:0:(fld_index.c:176:fld_index_create()) LBUG
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: Pid: 14404, comm: llog_process_th
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel:
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: Call Trace:
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa0a8b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa0a8be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa104adf3>] fld_index_create+0x5a3/0x750 [fld]
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa1370f2e>] ? osd_trans_start+0x21e/0x660 [osd_ldiskfs]
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa104b7f1>] fld_insert_entry+0x291/0x380 [fld]
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa104976a>] fld_update_from_controller+0x27a/0x540 [fld]
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa1485496>] mdt_register_lwp_callback+0x76/0x2d0 [mdt]
      (09:16:38 AM) dmb: Jun 6 09:12:35 galaxy-esf-mds004 kernel: [<ffffffffa0c4943f>] lustre_lwp_connect+0x83f/0xc90 [obdclass]

      Looking at the assertion and other call chains to this function, I see that the mutex in question is usually taken around calls to:
      fld_insert_entry
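
      To illustrate the locking contract the assertion enforces, here is a minimal userspace sketch. It is not Lustre code and not the eventual patch: it uses pthreads rather than kernel mutexes, every sketch_* name is hypothetical, and only the field name lsf_lock is taken from the trace. The callee refuses to run unless the lock is held, so a caller that skips the lock fails in the same way the new call chain did (here with an error return instead of an LBUG).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the FLD server state; only lsf_lock mirrors the trace. */
struct fld_sketch {
	pthread_mutex_t lsf_lock;
	int entries;
};

/*
 * Rough userspace approximation of the kernel's mutex_is_locked():
 * trylock fails with EBUSY when the mutex is already held. Like the
 * kernel helper, this only says "held by someone", not "held by me".
 */
bool sketch_is_locked(pthread_mutex_t *m)
{
	if (pthread_mutex_trylock(m) == 0) {
		pthread_mutex_unlock(m);
		return false;
	}
	return true;
}

/* Callee: expects the caller to hold lsf_lock. Returns -1 where the kernel would LBUG. */
int sketch_index_create(struct fld_sketch *fld)
{
	assert(fld != NULL);
	if (!sketch_is_locked(&fld->lsf_lock))
		return -1;
	fld->entries++;
	return 0;
}

/* Caller that follows the contract: take the lock around the insert. */
int sketch_insert_locked(struct fld_sketch *fld)
{
	int rc;

	assert(fld != NULL);
	pthread_mutex_lock(&fld->lsf_lock);
	rc = sketch_index_create(fld);
	pthread_mutex_unlock(&fld->lsf_lock);
	return rc;
}

/* Caller that forgets the lock, analogous to the failing call chain. */
int sketch_insert_unlocked(struct fld_sketch *fld)
{
	assert(fld != NULL);
	return sketch_index_create(fld);
}
```

      With a struct fld_sketch whose lsf_lock is initialised with PTHREAD_MUTEX_INITIALIZER, sketch_insert_unlocked() is rejected and leaves entries untouched, while sketch_insert_locked() succeeds. Build with -pthread.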

      The problematic call chain was introduced by this commit:
      commit 519a65ddc04673022124f421e4809f8a87f790d7
      Author: wang di <di.wang@intel.com>
      Date: Tue Oct 8 02:13:27 2013 -0700

      LU-4076 fld: add local fldb to each target

      1. Add local FLDB to each MDT, so OSD/OUT can check whether
      FID is remote by looking up local FLDB, i.e. no need send RPC
      to MDT0.

      2. OSD will only do local lookup when checking remote FID.

      3. During upgrade, MDTn(n != 0) needs to retrieve its fldb
      entries from controller(MDT0) and insert into the local
      FLDB.

      4. MDT should also use LWP(instead of OSP) to communicate
      with sequence controller (MDT0).

      Signed-off-by: wang di <di.wang@intel.com>
      Change-Id: I788a543aeb7305dfbad3cc41b586f9337f227119
      Reviewed-on: http://review.whamcloud.com/7884
      Reviewed-by: John L. Hammond <john.hammond@intel.com>
      Tested-by: Jenkins
      Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
      Tested-by: Maloo <hpdd-maloo@intel.com>
      Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

      I will generate a patch.

          People

            di.wang Di Wang
            paf Patrick Farrell