LU-3915

After upgrade from 2.4.0 to 2.5, can not mount OST, (osd_handler.c:2668:osd_object_ref_del()) LBUG Pid: 9537, comm: mount.lustre

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.5.0
    • Affects Version/s: Lustre 2.5.0
    • Environment: before upgrade, server and client: 2.4.0
      after upgrade, server and client: lustre-master build #1652
    • 3
    • 10328

    Description

      After upgrading the server and client to 2.5, mounting the OST hits the following error:

      Lustre: DEBUG MARKER: == upgrade-downgrade End == 12:43:46 (1378755826)
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
      LustreError: 13a-8: Failed to get MGS log params and no local copy.
      LustreError: 9537:0:(fld_handler.c:150:fld_server_lookup()) srv-lustre-OST0000: lookup 0x54, but not connects to MDT0yet: rc = -5.
      LustreError: 9537:0:(osd_handler.c:2125:osd_fld_lookup()) lustre-OST0000-osd: cannot find FLD range for 0x54: rc = -5
      LustreError: 9537:0:(osd_handler.c:3344:osd_mdt_seq_exists()) lustre-OST0000-osd: Can not lookup fld for 0x54
      LustreError: 9537:0:(osd_handler.c:2668:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed: 
      LustreError: 9537:0:(osd_handler.c:2668:osd_object_ref_del()) LBUG
      Pid: 9537, comm: mount.lustre
      
      Call Trace:
       [<ffffffffa044c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa044ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa0cdd1d7>] osd_object_ref_del+0x1e7/0x220 [osd_ldiskfs]
       [<ffffffffa0586c0e>] llog_osd_destroy+0x48e/0xb20 [obdclass]
       [<ffffffffa0558d51>] llog_destroy+0x51/0x170 [obdclass]
       [<ffffffffa055d8e4>] llog_erase+0x1c4/0x1e0 [obdclass]
       [<ffffffffa055e141>] llog_backup+0x231/0x500 [obdclass]
       [<ffffffff81281b10>] ? sprintf+0x40/0x50
       [<ffffffffa0d69b79>] mgc_process_log+0x1629/0x18e0 [mgc]
       [<ffffffffa0d63350>] ? mgc_blocking_ast+0x0/0x800 [mgc]
       [<ffffffffa070fb90>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
       [<ffffffffa0d6b3c2>] mgc_process_config+0x5f2/0x1120 [mgc]
       [<ffffffffa05a3016>] lustre_process_log+0x256/0xa60 [obdclass]
       [<ffffffffa0572872>] ? class_name2dev+0x42/0xe0 [obdclass]
       [<ffffffff81168013>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
       [<ffffffffa057291e>] ? class_name2obd+0xe/0x30 [obdclass]
       [<ffffffffa05d5b67>] server_start_targets+0x1c57/0x1e10 [obdclass]
       [<ffffffffa05a660b>] ? lustre_start_mgc+0x48b/0x1e10 [obdclass]
       [<ffffffffa059e5b0>] ? class_config_llog_handler+0x0/0x1880 [obdclass]
       [<ffffffffa05dab6c>] server_fill_super+0xbac/0x19e4 [obdclass]
       [<ffffffffa05a8168>] lustre_fill_super+0x1d8/0x530 [obdclass]
       [<ffffffffa05a7f90>] ? lustre_fill_super+0x0/0x530 [obdclass]
       [<ffffffff811845af>] get_sb_nodev+0x5f/0xa0
       [<ffffffffa059ff35>] lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff81183beb>] vfs_kern_mount+0x7b/0x1b0
       [<ffffffff81183d92>] do_kern_mount+0x52/0x130
       [<ffffffff811a3f52>] do_mount+0x2d2/0x8d0
       [<ffffffff811a45e0>] sys_mount+0x90/0xe0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      
      
      Message from syslogd@wtm-88 at Sep  9 12:43:57 ...
       kernel:LustreError: 9537:0:(osd_handler.c:2668:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed: 
      
      Kernel panic - not syncing: LBUG
      Pid: 9537, comm: mount.lustre Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
      Call Trace:
       [<ffffffff8150de58>] ? panic+0xa7/0x16f
       [<ffffffffa044ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa0cdd1d7>] ? osd_object_ref_del+0x1e7/0x220 [osd_ldiskfs]
       [<ffffffffa0586c0e>] ? llog_osd_destroy+0x48e/0xb20 [obdclass]
       [<ffffffffa0558d51>] ? llog_destroy+0x51/0x170 [obdclass]
       [<ffffffffa055d8e4>] ? llog_erase+0x1c4/0x1e0 [obdclass]
       [<ffffffffa055e141>] ? llog_backup+0x231/0x500 [obdclass]
       [<ffffffff81281b10>] ? sprintf+0x40/0x50
       [<ffffffffa0d69b79>] ? mgc_process_log+0x1629/0x18e0 [mgc]
       [<ffffffffa0d63350>] ? mgc_blocking_ast+0x0/0x800 [mgc]
       [<ffffffffa070fb90>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
       [<ffffffffa0d6b3c2>] ? mgc_process_config+0x5f2/0x1120 [mgc]
       [<ffffffffa05a3016>] ? lustre_process_log+0x256/0xa60 [obdclass]
       [<ffffffffa0572872>] ? class_name2dev+0x42/0xe0 [obdclass]
       [<ffffffff81168013>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
       [<ffffffffa057291e>] ? class_name2obd+0xe/0x30 [obdclass]
       [<ffffffffa05d5b67>] ? server_start_targets+0x1c57/0x1e10 [obdclass]
       [<ffffffffa05a660b>] ? lustre_start_mgc+0x48b/0x1e10 [obdclass]
       [<ffffffffa059e5b0>] ? class_config_llog_handler+0x0/0x1880 [obdclass]
       [<ffffffffa05dab6c>] ? server_fill_super+0xbac/0x19e4 [obdclass]
       [<ffffffffa05a8168>] ? lustre_fill_super+0x1d8/0x530 [obdclass]
       [<ffffffffa05a7f90>] ? lustre_fill_super+0x0/0x530 [obdclass]
       [<ffffffff811845af>] ? get_sb_nodev+0x5f/0xa0
       [<ffffffffa059ff35>] ? lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff81183beb>] ? vfs_kern_mount+0x7b/0x1b0
       [<ffffffff81183d92>] ? do_kern_mount+0x52/0x130
       [<ffffffff811a3f52>] ? do_mount+0x2d2/0x8d0
       [<ffffffff811a45e0>] ? sys_mount+0x90/0xe0
       [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      Initializing cgroup subsys cpuset
      Initializing cgroup subsys cpu
      Linux version 2.6.32-358.18.1.el6_lustre.x86_64 (jenkins@builder-4-sde1-el6-x8664.lab.whamcloud.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Tue Sep 3 04:11:46 PDT 2013
      Command line: ro root=UUID=64a1c9cd-640f-46ab-919e-9f5419f9e521 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD console=ttyS0,115200 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off  memmap=exactmap memmap=574K@4K memmap=134574K@49726K elfcorehdr=184300K memmap=4K$0K memmap=62K$578K memmap=128K$896K memmap=488K#3031140K memmap=68K$3061492K memmap=36K$3061864K memmap=12K$3061924K memmap=8K$3062120K memmap=12K$3062168K memmap=52K$3062372K memmap=16K$3062680K memmap=60K$3063816K memmap=3076K$3064340K memmap=160K$3067448K memmap=2048K$3106676K memmap=1184K$3109628K memmap=656K#3110812K memmap=4K#3111468K memmap=476K#3111472K memmap=4K#3111948K memmap=8K#3111952K memmap=4K#3111960K memmap=4K#3111964K memmap=108K#3111968K memmap=564K#3112076K memmap=294912K$3112960K memmap=4K$4173824K memmap=4K$4174948K memmap=16K$4174960K memmap=4K$4175872K memmap=6016K$4188288K
      KERNEL supported cpus:
        Intel GenuineIntel
        AMD AuthenticAMD
        Centaur CentaurHauls
      BIOS-provided physical RAM map:
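
The `rc = -5` reported by fld_server_lookup() and osd_fld_lookup() in the log above follows the kernel convention of returning a negated errno value; on Linux, 5 is EIO. As a small illustration, the sketch below decodes such a return code in user space. The helper name `demo_describe_rc` is hypothetical and not part of Lustre.

```c
#include <errno.h>
#include <string.h>

/*
 * Map a kernel-style return code (0 or a negated errno) to a readable
 * string. demo_describe_rc is a hypothetical helper for illustration only.
 */
static const char *demo_describe_rc(int rc)
{
    if (rc >= 0)
        return "success";
    /* Kernel code returns -errno, so negate before looking it up. */
    return strerror(-rc);
}
```

Calling `demo_describe_rc(-5)` yields the platform's description of EIO, which matches the I/O error reported while the OST could not yet reach MDT0 for the FLD lookup.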
      

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.5.0

            simmonsja James A Simmons added a comment - edited

            Andreas, in the past I would delete the config logs on the MGS to get around the issue of not being able to mount a file system built before LU-2059. That led me to think a format change was causing the problem. I tested Mikhail's patch http://review.whamcloud.com/7673 and it addresses this problem perfectly. That patch shows the solution was much simpler than I thought. Thank you, Mikhail.
            sarah Sarah Liu added a comment -

            Upgraded from 2.4.1 to build http://build.whamcloud.com/job/lustre-reviews/18150/, which reverts patch 5049; the test passed.

            tappro Mikhail Pershin added a comment -

            http://review.whamcloud.com/7673
            This patch replaces the assertion with CERROR(), but only for local files. The LU-3349 patch should be rebased on top of this one to pass Maloo.

            tappro Mikhail Pershin added a comment -

            We could replace the assertion with a simple check for nlink == 0 and assume this is a possible case after an upgrade. But I wonder: shouldn't lfsck repair such files on start, or does this happen too early to be fixed by lfsck?
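
To make the suggestion above concrete, here is a minimal user-space sketch of replacing a hard assertion on the link count with a check that reports the problem and carries on. It is not the actual Lustre patch: `struct demo_inode` and `demo_ref_del` are hypothetical stand-ins for the kernel inode and osd_object_ref_del(), and `fprintf` stands in for CERROR().

```c
#include <stdio.h>

/* Hypothetical stand-in for the kernel inode; only the link count matters. */
struct demo_inode {
    unsigned int i_nlink;
};

/*
 * Drop one link from the inode. Instead of asserting i_nlink > 0 (which
 * would trigger an LBUG), report the inconsistency and leave the count
 * unchanged. Returns 0 if a link was dropped, -1 if it was already zero.
 */
static int demo_ref_del(struct demo_inode *inode)
{
    if (inode->i_nlink == 0) {
        /* Assumed behaviour: log and skip rather than crash the mount. */
        fprintf(stderr, "demo: nlink is already 0, skipping decrement\n");
        return -1;
    }
    inode->i_nlink--;
    return 0;
}
```

The trade-off in this sketch is that an object with a zero link count is tolerated at mount time and left for a later consistency check, rather than halting the node.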

            adilger Andreas Dilger added a comment -

            James, could you please clarify what you mean by "config log is now in OSD format" vs. "old config format"? Is this a change in the FID for the llog object, or how the config logs are named, or the name of the devices inside the config logs, or what?

            simmonsja James A Simmons added a comment -

            The revert will work. I ran into this problem before. I would recommend not reverting the patch for 2.5, since that breaks ZFS support on the MGS. The problem is that the config log on the MGS is now in the OSD format instead of using fsfilt, but a 2.4 MGT will still be in the old config format. If we had a way to convert the format, this wouldn't be a problem. The other solution is to rebuild the config log.
            sarah Sarah Liu added a comment -

            Here is the patch that reverts 5049: http://review.whamcloud.com/#/c/7624/

            jlevi Jodi Levi (Inactive) added a comment -

            Sarah,
            Can you try reverting change 5049, then try the upgrade again and post the results?

            jlevi Jodi Levi (Inactive) added a comment -

            Mike,
            Could you please have a look at this one?
            Thank you!

            People

              Assignee: tappro Mikhail Pershin
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 9
