LU-4878

fld_server_lookup() ASSERTION( fld->lsf_control_exp ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.3
    • Severity: 3
    • Rank: 13490

    Description

      The following LBUG appeared at a customer site during the mount process on all OSSs running Lustre 2.4.3.

      LustreError: 12838:0:(fld_handler.c:172:fld_server_lookup()) ASSERTION(fld->lsf_control_exp ) failed:
      LustreError: 12838:0:(fld_handler.c:172:fld_server_lookup()) LBUG
      
      Pid: 12838, comm: mount.lustre
      
      Call Trace:
       libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       lbug_with_loc+0x47/0xb0 [libcfs]
       fld_server_lookup+0x2f7/0x3d0 [fld]
       osd_fld_lookup+0x71/0x1d0 [osd_ldiskfs]
       osd_remote_fid+0x9a/0x280 [osd_ldiskfs]
       osd_index_ea_lookup+0x521/0x850 [osd_ldiskfs]
       dt_lookup_dir+0x6f/0x130 [obdclass]
       llog_osd_open+0x485/0xc00 [obdclass]
       llog_open+0xba/0x2c0 [obdclass]
       mgc_process_log [mgc]
       mgc_process_config [mgc]
       lustre_process_log [obdclass]
       server_start_targets [obdclass]
       server_fill_super [obdclass]
       lustre_fill_super [obdclass]
       get_sb_nodev
       lustre_get_sb
       vfs_kern_mount
       do_kern_mount
       do_mount
       sys_mount
       system_call_fastpath
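
Triage in this ticket relies on comparing call paths between LBUG reports (this one is compared against LU-3126 and LU-3915). The following is a minimal sketch of how such a comparison could be automated. It is not part of Lustre or any Lustre tooling; the function names `parse_frames` and `same_call_path` and the `depth` parameter are hypothetical, and the regular expression only assumes the frame layout seen in the trace above (`function+offset/size [module]`, with offset and module optional).

```python
import re

# Matches frames such as "fld_server_lookup+0x2f7/0x3d0 [fld]",
# "mgc_process_log [mgc]", or a bare "do_mount".
FRAME_RE = re.compile(
    r"^\s*(?P<func>[A-Za-z_]\w*)(?:\+\S+)?\s*(?:\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(trace_text):
    """Return (function, module) pairs; module is None when no [module] is shown."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((match.group("func"), match.group("module")))
    return frames


def same_call_path(trace_a, trace_b, depth=5):
    """Compare the first `depth` function names, ignoring offsets and modules."""
    funcs_a = [func for func, _ in parse_frames(trace_a)][:depth]
    funcs_b = [func for func, _ in parse_frames(trace_b)][:depth]
    return funcs_a == funcs_b


# A few frames copied from the trace in this ticket.
SAMPLE = """\
 libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 lbug_with_loc+0x47/0xb0 [libcfs]
 fld_server_lookup+0x2f7/0x3d0 [fld]
 osd_fld_lookup+0x71/0x1d0 [osd_ldiskfs]
 get_sb_nodev
"""

if __name__ == "__main__":
    for func, module in parse_frames(SAMPLE):
        print(func, module)
```

Offsets are deliberately ignored, since they differ between builds even when the call path is identical. Header lines such as `Call Trace:` or `Pid: ...` do not match the frame pattern and are skipped.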
      

      This issue appears to be the same as LU-3126, for which a patch has landed in Lustre 2.5. Unfortunately, no patch has been provided for the Lustre 2.4 release.

          Activity

            pjones Peter Jones added a comment -

            Yes this would be under consideration for 2.4.4.


            pichong Gregoire Pichon added a comment -

            Hi Bruno,

            Yes, this ticket can be closed, since our tests have shown the issue is fixed with patches #9929 and #9958.
            I hope both patches are planned for integration in 2.4 if a new version is released.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Gregoire,
            Since patch #9958 is planned for 2.4 integration, do you agree that we can close/resolve this issue as fixed?

            bfaccini Bruno Faccini (Inactive) added a comment -

            Gregoire, don't misunderstand me: I did not mean that you added patches without good reason, only that by doing so you fall outside our regression/interop testing process.
            As for your having added #5049 because of LU-2959, that may help #5049 and #9958 to finally land ...

            pichong Gregoire Pichon added a comment -

            Hello Bruno,

            Actually, patch #5049 "LU-2059 llog: MGC to use OSD API for backup logs" was integrated by Bull on top of release 2.4.x because the customer hit the LBUG ASSERTION(cli->cl_mgc_configs_dir) described in LU-2959. As mentioned in that ticket, the LBUG is fixed by patch #5049.

            These problems occurred in the Lustre 2.4.x release and need to be addressed.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Gregoire,
            I am not sure that my patch #9958 will finally be accepted and landed to b2_4 ... The main reason is that #5049 itself is still not in b2_4 and may never be, in which case #9958 is not necessary, as Mike commented, with reasons, in the patch review!
            This points to a limit in the process when people decide to add more patches on top of releases that we have tested for regressions and interoperability...

            pichong Gregoire Pichon added a comment -

            Thanks for the backport, Bruno. Our comments were interleaved!

            I have tested Lustre 2.4.3 with both additional patches:

            • #9929 LU-3126 osd: remove fld lookup during configuration
            • #9958 LU-4878 osd-ldiskfs: don't assert on possible upgrade (backport of LU-3915)

            The OSS is able to start without any problem. The filesystem is operational.

            I am now waiting for these patches to be fully approved and Maloo-tested so they can be delivered to the customer.

            bfaccini Bruno Faccini (Inactive) added a comment -

            You may have missed my previous update, which already confirmed what you finally found!
            So yes, I agree that you can add #7673, or its back-port, on top of your 2.4.3 version that also includes #5049 ...

            pichong Gregoire Pichon added a comment -

            The LBUG I hit when testing patch #9929 has the same stack trace as LU-3915, which reports an OSS crash when mounting OSTs after an upgrade from 2.4.0 to 2.5.

            In my case, I was upgrading from 2.4.2 to 2.4.3 with a few additional patches, including #5049 "LU-2059 llog: MGC to use OSD API for backup logs".

            Patch #7673 "LU-3915 osd-ldiskfs: don't assert on possible upgrade" seems to fix the upgrade issue caused by #5049. So I probably need to add #7673 to the list of patches on top of 2.4.3. Do you agree?

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Gregoire,
            Thanks for the update.
            But concerning the new and different stack trace you reported, in my opinion it looks like a different problem, one that has already been reported in LU-3915.
            So you may now try my b2_4 back-port of master patch #7673 from LU-3915, which I just pushed at http://review.whamcloud.com/9958.

            People

              bfaccini Bruno Faccini (Inactive)
              pichong Gregoire Pichon