Lustre / LU-4878

fld_server_lookup() ASSERTION( fld->lsf_control_exp ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version: Lustre 2.4.3
    • 3
    • 13490

    Description

The following LBUG appeared at a customer site during the mount process on all OSSs running Lustre 2.4.3.

      LustreError: 12838:0:(fld_handler.c:172:fld_server_lookup()) ASSERTION(fld->lsf_control_exp ) failed:
      LustreError: 12838:0:(fld_handler.c:172:fld_server_lookup()) LBUG
      
      Pid: 12838, comm: mount.lustre
      
      Call Trace:
       libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       lbug_with_loc+0x47/0xb0 [libcfs]
       fld_server_lookup+0x2f7/0x3d0 [fld]
       osd_fld_lookup+0x71/0x1d0 [osd_ldiskfs]
       osd_remote_fid+0x9a/0x280 [osd_ldiskfs]
 osd_index_ea_lookup+0x521/0x850 [osd_ldiskfs]
       dt_lookup_dir+0x6f/0x130 [obdclass]
       llog_osd_open+0x485/0xc00 [obdclass]
       llog_open+0xba/0x2c0 [obdclass]
       mgc_process_log [mgc]
       mgc_process_config [mgc]
       lustre_process_log [obdclass]
       server_start_targets [obdclass]
       server_fill_super [obdclass]
 lustre_fill_super [obdclass]
       get_sb_nodev
       lustre_get_sb
       vfs_kern_mount
       do_kern_mount
       do_mount
       sys_mount
       system_call_fastpath
      

This issue seems to be the same as LU-3126, for which a patch has landed in Lustre 2.5. Unfortunately, no patch has been provided for the Lustre 2.4 release.

Attachments

Issue Links

Activity


Gregoire, don't misunderstand me. I did not mean that you added patches without good reasons to do so, only that by doing so you fall outside our regression/interop testing process.
Concerning the fact that you added #5049 due to LU-2959, that may help #5049 and #9958 to finally land ...

bfaccini Bruno Faccini (Inactive) added a comment

            Hello Bruno,

Actually, the patch #5049 "LU-2059 llog: MGC to use OSD API for backup logs" has been integrated by Bull on top of release 2.4.x because the customer hit the LBUG ASSERTION(cli->cl_mgc_configs_dir) described in LU-2959. As mentioned in that ticket, the LBUG is fixed by patch #5049.

These problems occurred in the Lustre 2.4.x release and need to be addressed.

pichong Gregoire Pichon added a comment

            Hello Gregoire,
I am not sure that my patch #9958 will finally be fully accepted and landed on b2_4 ... The main reason is that #5049 is itself still not in b2_4, and maybe never will be, so #9958 would then be unnecessary, as Mike commented in the patch.
This points to a limitation in the process, where people decide to add more patches on top of releases we have already tested for regressions and interoperability...

bfaccini Bruno Faccini (Inactive) added a comment

Thanks for the backport, Bruno. Our comments crossed!

            I have tested a lustre version 2.4.3 with both additional patches

            • #9929 LU-3126 osd: remove fld lookup during configuration
            • #9958 LU-4878 osd-ldiskfs: don't assert on possible upgrade (backport of LU-3915)

The OSS starts without any problem, and the filesystem is operational.

            I am now waiting for these patches to be fully approved and Maloo tested so they can be delivered to the customer.

pichong Gregoire Pichon added a comment

You may have missed my previous update, which already confirmed what you finally found!
So yes, I agree that you can add #7673, or its back-port, on top of your 2.4.3 version that also includes #5049 ...

bfaccini Bruno Faccini (Inactive) added a comment

The LBUG I hit when testing patch #9929 has the same stack trace as LU-3915, which reports an OSS crash when mounting OSTs after an upgrade from 2.4.0 to 2.5.

            In my case, I was upgrading from 2.4.2 to 2.4.3 with a few additional patches including #5049 "LU-2059 llog: MGC to use OSD API for backup logs".

Patch #7673 "LU-3915 osd-ldiskfs: don't assert on possible upgrade" seems to fix the upgrade issue caused by #5049. So I probably need to add #7673 to the list of patches on top of 2.4.3, do you agree?

pichong Gregoire Pichon added a comment

            Hello Gregoire,
            Thanks for the update.
But concerning the new and different stack/traceback you reported, in my opinion it looks more like a different problem, which has already been reported in LU-3915.
So you may now try my b2_4 back-port of master patch #7673 from LU-3915, which I just pushed at http://review.whamcloud.com/9958.

bfaccini Bruno Faccini (Inactive) added a comment

I have tested the patch #9929 posted in Gerrit. Unfortunately, the OSS still crashes when mounting an OST.

            Here is the stack

            <3>LustreError: 6577:0:(fld_handler.c:174:fld_server_lookup()) srv-fs_pv-OST0000: lookup 0x7d, but not connects to MDT0yet: rc = -5.
            <3>LustreError: 6577:0:(osd_handler.c:2135:osd_fld_lookup()) fs_pv-OST0000-osd: cannot find FLD range for 0x7d: rc = -5
            <3>LustreError: 6577:0:(osd_handler.c:3344:osd_mdt_seq_exists()) fs_pv-OST0000-osd: Can not lookup fld for 0x7d
            <0>LustreError: 6577:0:(osd_handler.c:2651:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed: 
            <0>LustreError: 6577:0:(osd_handler.c:2651:osd_object_ref_del()) LBUG
            <4>Pid: 6577, comm: mount.lustre
            <4>
            <4>Call Trace:
            <4> [<ffffffffa0d57895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            <4> [<ffffffffa0d57e97>] lbug_with_loc+0x47/0xb0 [libcfs]
            <4> [<ffffffffa04197e7>] osd_object_ref_del+0x1e7/0x220 [osd_ldiskfs]
            <4> [<ffffffffa0ec1fee>] llog_osd_destroy+0x48e/0xb20 [obdclass]
            <4> [<ffffffffa0e91d61>] llog_destroy+0x51/0x170 [obdclass]
            <4> [<ffffffffa0e96b34>] llog_erase+0x1c4/0x1e0 [obdclass]
            <4> [<ffffffffa0e97401>] llog_backup+0x231/0x500 [obdclass]
            <4> [<ffffffffa049ad66>] mgc_process_log+0x1636/0x18f0 [mgc]
            <4> [<ffffffffa049c514>] mgc_process_config+0x594/0xed0 [mgc]
            <4> [<ffffffffa0ede64c>] lustre_process_log+0x25c/0xaa0 [obdclass]
            <4> [<ffffffffa0f126d3>] server_start_targets+0x1833/0x19c0 [obdclass]
            <4> [<ffffffffa0f1340c>] server_fill_super+0xbac/0x1660 [obdclass]
            <4> [<ffffffffa0ee3d68>] lustre_fill_super+0x1d8/0x530 [obdclass]
            <4> [<ffffffff8118c7df>] get_sb_nodev+0x5f/0xa0
            <4> [<ffffffffa0edb3b5>] lustre_get_sb+0x25/0x30 [obdclass]
            <4> [<ffffffff8118be3b>] vfs_kern_mount+0x7b/0x1b0
            <4> [<ffffffff8118bfe2>] do_kern_mount+0x52/0x130
            <4> [<ffffffff811acfeb>] do_mount+0x2fb/0x930
            <4> [<ffffffff811ad6b0>] sys_mount+0x90/0xe0
            <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            <4>
            
pichong Gregoire Pichon added a comment

Yes, we hit the same LBUG on all OSSs, with or without abort-recov,
when mounting all OSTs at the same time or when trying to mount
just one OST manually.
It only happens with 2.4.3 plus some additional patches; 2.4.2 with
some other additional patches works fine.

apercher Antoine Percher added a comment
            bogl Bob Glossman (Inactive) added a comment - LU-3126 backport to b2_4: http://review.whamcloud.com/9929

            Hello Gregoire,
Did you mean that this same LBUG occurred on all OSSs of the same filesystem at OST mount time?
And yes, you are right, the LU-3126 patch has not been back-ported to b2_4, but that is mainly because the issue was considered unlikely to happen ...

bfaccini Bruno Faccini (Inactive) added a comment

People

    bfaccini Bruno Faccini (Inactive)
    pichong Gregoire Pichon
    Votes: 0
    Watchers: 8
