
fld_server_lookup() ASSERTION( fld->lsf_control_exp ) failed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.3
    • 3
    • 13490

    Description

      The following LBUG appeared at a customer site during the mount process on all OSSs running Lustre 2.4.3.

      LustreError: 12838:0:(fld_handler.c:172:fld_server_lookup()) ASSERTION(fld->lsf_control_exp ) failed:
      LustreError: 12838:0:(fld_handler.c:172:fld_server_lookup()) LBUG
      
      Pid: 12838, comm: mount.lustre
      
      Call Trace:
       libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       lbug_with_loc+0x47/0xb0 [libcfs]
       fld_server_lookup+0x2f7/0x3d0 [fld]
       osd_fld_lookup+0x71/0x1d0 [osd_ldiskfs]
       osd_remote_fid+0x9a/0x280 [osd_ldiskfs]
       osd_index_ea_lookup+0x521/0x850 [osd_ldiskfs]
       dt_lookup_dir+0x6f/0x130 [obdclass]
       llog_osd_open+0x485/0xc00 [obdclass]
       llog_open+0xba/0x2c0 [obdclass]
       mgc_process_log [mgc]
       mgc_process_config [mgc]
       lustre_process_log [obdclass]
       server_start_targets [obdclass]
       server_fill_super [obdclass]
       lustre_fill_super [obdclass]
       get_sb_nodev
       lustre_get_sb
       vfs_kern_mount
       do_kern_mount
       do_mount
       sys_mount
       system_call_fastpath
      

      This issue appears to be the same as LU-3126, for which a patch has landed in Lustre 2.5. Unfortunately, no patch has been provided for the Lustre 2.4 release.
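When the same LBUG shows up on every OSS, it can help to reduce each console log to the set of assertion sites before comparing nodes. The following is a minimal sketch, not part of the original report: the regular expression is inferred only from the two `LustreError` lines quoted above, the helper name `parse_lustre_error` is hypothetical, and the second numeric field is kept as an opaque value.

```python
import re

# Prefix seen on the console messages quoted above:
#   LustreError: <pid>:<n>:(<file>:<line>:<function>()) <message>
# The meaning of <n> is not stated in this report, so it is kept as-is.
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):(?P<field2>\d+):"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)"
)


def parse_lustre_error(text):
    """Return a dict of fields for one console line, or None if it does not match."""
    m = LUSTRE_ERROR.search(text)
    if m is None:
        return None
    fields = m.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    fields["msg"] = fields["msg"].strip()
    return fields


sample = [
    "LustreError: 12838:0:(fld_handler.c:172:fld_server_lookup()) "
    "ASSERTION(fld->lsf_control_exp ) failed:",
    "LustreError: 12838:0:(fld_handler.c:172:fld_server_lookup()) LBUG",
]
records = [parse_lustre_error(s) for s in sample]

# Collect the distinct (file, line, function) locations that raised an LBUG.
lbug_sites = {
    (r["file"], r["line"], r["func"])
    for r in records
    if r is not None and r["msg"] == "LBUG"
}
print(lbug_sites)
```

Running the same reduction over the logs of each OSS would show whether all nodes fail at the same source location.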

          Activity

            [LU-4878] fld_server_lookup() ASSERTION( fld->lsf_control_exp ) failed

            The LBUG I hit when testing patch #9929 has the same stack trace as LU-3915, which reports an OSS crash when mounting OSTs after an upgrade from 2.4.0 to 2.5.

            In my case, I was upgrading from 2.4.2 to 2.4.3 with a few additional patches, including #5049 "LU-2059 llog: MGC to use OSD API for backup logs".

            Patch #7673 "LU-3915 osd-ldiskfs: don't assert on possible upgrade" seems to fix the upgrade issue caused by #5049, so I probably need to add #7673 to the list of patches on top of 2.4.3. Do you agree?

            pichong Gregoire Pichon added a comment

            Hello Gregoire,
            Thanks for the update.
            The new stack trace you reported differs from the original one; in my opinion it is a separate problem that has already been reported in LU-3915.
            You can now try my b2_4 back-port of master patch #7673 from LU-3915, which I just pushed at http://review.whamcloud.com/9958.

            bfaccini Bruno Faccini (Inactive) added a comment
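Whether two LBUGs are "the same" comes down to which function sits directly above the LBUG reporting frames. A rough sketch for comparing traces follows; it is not taken from any Lustre tool. The helper names `frames` and `asserting_frame` are hypothetical, the trace excerpts are copied from this ticket, and frames printed without a `+0x…/0x…` offset are simply skipped.

```python
import re

# One call-trace entry looks like "func+0xOFF/0xSIZE [module]"; any
# "<4>" log-level tag or "[<address>]" prefix before it is ignored.
FRAME = re.compile(r"\b(?P<func>[A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Frames that only report the LBUG and say nothing about its cause.
REPORTING = {"libcfs_debug_dumpstack", "lbug_with_loc"}


def frames(trace):
    """Return function names in trace order, ignoring addresses and offsets."""
    names = []
    for line in trace.splitlines():
        m = FRAME.search(line)
        if m is not None:
            names.append(m.group("func"))
    return names


def asserting_frame(trace):
    """Return the first function below the LBUG reporting frames, or None."""
    for name in frames(trace):
        if name not in REPORTING:
            return name
    return None


original = """
 libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 lbug_with_loc+0x47/0xb0 [libcfs]
 fld_server_lookup+0x2f7/0x3d0 [fld]
 osd_fld_lookup+0x71/0x1d0 [osd_ldiskfs]
"""

after_patch = """
<4> [<ffffffffa0d57895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0d57e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa04197e7>] osd_object_ref_del+0x1e7/0x220 [osd_ldiskfs]
<4> [<ffffffffa0ec1fee>] llog_osd_destroy+0x48e/0xb20 [obdclass]
"""

print(asserting_frame(original), asserting_frame(after_patch))
```

The two excerpts assert in different functions, which is consistent with treating the second crash as a separate problem.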

            I have tested patch #9929 posted in Gerrit. Unfortunately, the OSS still crashes when mounting an OST.

            Here is the stack trace:

            <3>LustreError: 6577:0:(fld_handler.c:174:fld_server_lookup()) srv-fs_pv-OST0000: lookup 0x7d, but not connects to MDT0yet: rc = -5.
            <3>LustreError: 6577:0:(osd_handler.c:2135:osd_fld_lookup()) fs_pv-OST0000-osd: cannot find FLD range for 0x7d: rc = -5
            <3>LustreError: 6577:0:(osd_handler.c:3344:osd_mdt_seq_exists()) fs_pv-OST0000-osd: Can not lookup fld for 0x7d
            <0>LustreError: 6577:0:(osd_handler.c:2651:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed: 
            <0>LustreError: 6577:0:(osd_handler.c:2651:osd_object_ref_del()) LBUG
            <4>Pid: 6577, comm: mount.lustre
            <4>
            <4>Call Trace:
            <4> [<ffffffffa0d57895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            <4> [<ffffffffa0d57e97>] lbug_with_loc+0x47/0xb0 [libcfs]
            <4> [<ffffffffa04197e7>] osd_object_ref_del+0x1e7/0x220 [osd_ldiskfs]
            <4> [<ffffffffa0ec1fee>] llog_osd_destroy+0x48e/0xb20 [obdclass]
            <4> [<ffffffffa0e91d61>] llog_destroy+0x51/0x170 [obdclass]
            <4> [<ffffffffa0e96b34>] llog_erase+0x1c4/0x1e0 [obdclass]
            <4> [<ffffffffa0e97401>] llog_backup+0x231/0x500 [obdclass]
            <4> [<ffffffffa049ad66>] mgc_process_log+0x1636/0x18f0 [mgc]
            <4> [<ffffffffa049c514>] mgc_process_config+0x594/0xed0 [mgc]
            <4> [<ffffffffa0ede64c>] lustre_process_log+0x25c/0xaa0 [obdclass]
            <4> [<ffffffffa0f126d3>] server_start_targets+0x1833/0x19c0 [obdclass]
            <4> [<ffffffffa0f1340c>] server_fill_super+0xbac/0x1660 [obdclass]
            <4> [<ffffffffa0ee3d68>] lustre_fill_super+0x1d8/0x530 [obdclass]
            <4> [<ffffffff8118c7df>] get_sb_nodev+0x5f/0xa0
            <4> [<ffffffffa0edb3b5>] lustre_get_sb+0x25/0x30 [obdclass]
            <4> [<ffffffff8118be3b>] vfs_kern_mount+0x7b/0x1b0
            <4> [<ffffffff8118bfe2>] do_kern_mount+0x52/0x130
            <4> [<ffffffff811acfeb>] do_mount+0x2fb/0x930
            <4> [<ffffffff811ad6b0>] sys_mount+0x90/0xe0
            <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            <4>
            
            pichong Gregoire Pichon added a comment
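The three error lines that precede the assertion in this log all end in `rc = -5`. The sketch below decodes such suffixes, assuming the usual kernel convention that return codes are negative errno values; the helper name `decode_rc` is hypothetical and the sample line is an excerpt from the log above.

```python
import errno
import re

# Matches the "rc = N" suffix used in the error lines above.
RC = re.compile(r"rc = (-?\d+)")


def decode_rc(line):
    """Return (rc, errno name) for a line with an 'rc = N' suffix, or None if absent."""
    m = RC.search(line)
    if m is None:
        return None
    rc = int(m.group(1))
    # Assumption: negative values are negated errno numbers.
    return rc, errno.errorcode.get(-rc, "unknown")


line = (
    "(osd_handler.c:2135:osd_fld_lookup()) fs_pv-OST0000-osd: "
    "cannot find FLD range for 0x7d: rc = -5"
)
print(decode_rc(line))
```

Under that assumption, -5 corresponds to EIO: with patch #9929 applied, the FLD lookup returns an I/O error to its caller instead of asserting.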

            Yes, we hit the same LBUG on all OSSs, with or without abort-recov,
            both when mounting all OSTs at the same time and when mounting
            just one OST manually.
            It happens only with 2.4.3 plus some additional patches; 2.4.2 with
            some other additional patches works fine.

            apercher Antoine Percher added a comment
            bogl Bob Glossman (Inactive) added a comment - LU-3126 backport to b2_4: http://review.whamcloud.com/9929

            Hello Gregoire,
            Did you mean that this same LBUG occurred on all OSSs of the same file system at OST mount time?
            And yes, you are right that the LU-3126 patch has not been back-ported to b2_4, mainly because the issue was considered unlikely to happen.

            bfaccini Bruno Faccini (Inactive) added a comment

            People

              bfaccini Bruno Faccini (Inactive)
              pichong Gregoire Pichon