
lod_initialize_objects()) ASSERTION( cfs_bitmap_check(md->lod_ost_descs.ltd_tgt_bitmap, idx) ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • Labels: None
    • Environment: lustre-2.4.0-15chaos_2.6.32_358.14.1.2chaos.ch5.1.1.ch5.1.1.x86_64
    • Severity: 3
    • Rank: 10350

    Description

      The MDS crashes with the assertion shown in the summary after an upgrade from 2.1 to 2.4. We added the -o abort_recov mount option, but the MDS still crashes each time it is started.

      PID: 4684   TASK: ffff880628b33540  CPU: 12  COMMAND: "tgt_recov"
       #0 [ffff880628b35a48] machine_kexec at ffffffff81035fcb
       #1 [ffff880628b35aa8] crash_kexec at ffffffff810c10b2
       #2 [ffff880628b35b78] panic at ffffffff81510333
       #3 [ffff880628b35bf8] lbug_with_loc at ffffffffa0507f4b [libcfs]
       #4 [ffff880628b35c18] lod_initialize_objects at ffffffffa1141d3b [lod]
       #5 [ffff880628b35ca8] lod_parse_striping at ffffffffa11421e1 [lod]
       #6 [ffff880628b35cd8] lod_load_striping at ffffffffa1143c44 [lod]
       #7 [ffff880628b35d18] lod_declare_object_destroy at ffffffffa114f6db [lod]
       #8 [ffff880628b35d48] __mdd_orphan_cleanup at ffffffffa0e190a9 [mdd]
       #9 [ffff880628b35de8] mdd_recovery_complete at ffffffffa0e2833d [mdd]
      #10 [ffff880628b35e18] mdt_postrecov at ffffffffa1079cb5 [mdt]
      #11 [ffff880628b35e38] mdt_obd_postrecov at ffffffffa107b178 [mdt]
      #12 [ffff880628b35ea8] target_recovery_thread at ffffffffa09a6ca4 [ptlrpc]
      #13 [ffff880628b35f48] kernel_thread at ffffffff8100c10a
      
      ZFS: Loaded module v0.6.2-1.2, ZFS pool version 5000, ZFS filesystem version 5
      Lustre: Lustre: Build Version: 2.4.0-15chaos-15chaos--PRISTINE-2.6.32-358.14.1.2chaos.ch5.1.1.x86_64
      LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=off. Opts: 
      Lustre: lsc-MDT0000: Not available for connect from 192.168.117.178@o2ib10 (not set up)
      Lustre: 4673:0:(mdt_handler.c:4947:mdt_process_config()) For interoperability, skip this mdd.quota_type. It is obsolete.
      LustreError: 11-0: lsc-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
      LustreError: 4469:0:(mdt_handler.c:5930:mdt_iocontrol()) lsc-MDT0000: Aborting recovery for device
      Lustre: lsc-MDT0000: Aborting recovery
      LustreError: 4684:0:(lod_lov.c:706:lod_initialize_objects()) ASSERTION( cfs_bitmap_check(md->lod_ost_descs.ltd_tgt_bitmap, idx) ) failed: 
      LustreError: 4684:0:(lod_lov.c:706:lod_initialize_objects()) LBUG
      Pid: 4684, comm: tgt_recov
      
      Call Trace:
       [<ffffffffa05078f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0507ef7>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa1141d3b>] lod_initialize_objects+0x98b/0xc30 [lod]
       [<ffffffffa11421e1>] lod_parse_striping+0x201/0x300 [lod]
       [<ffffffffa1143c44>] lod_load_striping+0x2a4/0x4b0 [lod]
       [<ffffffffa114f6db>] lod_declare_object_destroy+0x16b/0x390 [lod]
       [<ffffffffa0e190a9>] __mdd_orphan_cleanup+0x7d9/0xca0 [mdd]
       [<ffffffffa0e2833d>] mdd_recovery_complete+0xed/0x170 [mdd]
       [<ffffffffa1079cb5>] mdt_postrecov+0x35/0xd0 [mdt]
       [<ffffffffa107b178>] mdt_obd_postrecov+0x78/0x90 [mdt]
       [<ffffffffa09964e0>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
       [<ffffffffa099189e>] ? ldlm_reprocess_all_ns+0x3e/0x110 [ptlrpc]
       [<ffffffffa09a6ca4>] target_recovery_thread+0xc64/0x1980 [ptlrpc]
       [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
       [<ffffffff8100c10a>] child_rip+0xa/0x20
       [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
       [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
       [<ffffffff8100c100>] ? child_rip+0x0/0x20
      
      Kernel panic - not syncing: LBUG
      Pid: 4684, comm: tgt_recov Tainted: P           ---------------    2.6.32-358.14.1.2chaos.ch5.1.1.x86_64 #1
      Call Trace:
       [<ffffffff8151032c>] ? panic+0xa7/0x16f
       [<ffffffffa0507f4b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa1141d3b>] ? lod_initialize_objects+0x98b/0xc30 [lod]
       [<ffffffffa11421e1>] ? lod_parse_striping+0x201/0x300 [lod]
       [<ffffffffa1143c44>] ? lod_load_striping+0x2a4/0x4b0 [lod]
       [<ffffffffa114f6db>] ? lod_declare_object_destroy+0x16b/0x390 [lod]
       [<ffffffffa0e190a9>] ? __mdd_orphan_cleanup+0x7d9/0xca0 [mdd]
       [<ffffffffa0e2833d>] ? mdd_recovery_complete+0xed/0x170 [mdd]
       [<ffffffffa1079cb5>] ? mdt_postrecov+0x35/0xd0 [mdt]
       [<ffffffffa107b178>] ? mdt_obd_postrecov+0x78/0x90 [mdt]
       [<ffffffffa09964e0>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
       [<ffffffffa099189e>] ? ldlm_reprocess_all_ns+0x3e/0x110 [ptlrpc]
       [<ffffffffa09a6ca4>] ? target_recovery_thread+0xc64/0x1980 [ptlrpc]
       [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
       [<ffffffff8100c10a>] ? child_rip+0xa/0x20
       [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
       [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
       [<ffffffff8100c100>] ? child_rip+0x0/0x20
      REWRITING MCP55 CFG REG
      CFG = c1
      

          Activity

            [LU-3917] lod_initialize_objects()) ASSERTION( cfs_bitmap_check(md->lod_ost_descs.ltd_tgt_bitmap, idx) ) failed

            adilger Andreas Dilger added a comment -

            Closing this as a duplicate of LU-3161, which has a patch already.

            bobijam Zhenyu Xu added a comment - - edited

            I tried patch http://review.whamcloud.com/7234; it addresses this issue, and I can't reproduce the issue stated in LU-3918.

            pjones Peter Jones added a comment -

            Bobijam

            Could you please continue with this effort - Oleg will be at LAD next week

            Thanks

            Peter


            morrone Christopher Morrone (Inactive) added a comment -

            This is related to LU-3161, but not a duplicate. The LU-3161 work does not go far enough to deal with the problem in this ticket.

            In this ticket, we need to deal with files in the PENDING directory after a writeconf of the filesystem (a writeconf is currently required when upgrading to 2.4, otherwise 2.4 clients won't work).

            FYI, we are not doing any further 2.4 server upgrades at LLNL until a couple of things are fixed, and this is one of the things that needs to be fixed.
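
            For readers unfamiliar with the failing check, the following is a minimal sketch in C of what the assertion appears to test, inferred only from the assertion text and the comments in this ticket. It is not Lustre code: every name (model_tgt_table, model_tgt_add, model_tgt_check, model_first_unknown_stripe) is a hypothetical stand-in, and the 64-target limit is an assumption made to keep the example small. The idea is that the MDT tracks known OST indices in a bitmap, while a file left in PENDING still carries a layout naming OST indices; if the MDT starts after a writeconf before those OSTs have registered, the lookup for such an index fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, NOT the real struct lod_device or cfs_bitmap. */
#define MODEL_MAX_TGTS 64

struct model_tgt_table {
    uint64_t bitmap; /* bit i set => OST index i is known to the MDT */
};

/* Record that OST index idx has registered (assumed to happen at OST connect). */
static void model_tgt_add(struct model_tgt_table *t, unsigned int idx)
{
    assert(idx < MODEL_MAX_TGTS);
    t->bitmap |= (uint64_t)1 << idx;
}

/* Stand-in for the bitmap check named in the assertion. */
static bool model_tgt_check(const struct model_tgt_table *t, unsigned int idx)
{
    return idx < MODEL_MAX_TGTS && ((t->bitmap >> idx) & 1);
}

/*
 * Walk the OST indices named by one file's layout and return the position
 * of the first stripe whose OST is unknown, or -1 if all are known.
 * In the crash above, the equivalent of a non-negative result is an LBUG.
 */
static int model_first_unknown_stripe(const struct model_tgt_table *t,
                                      const unsigned int *ost_idx, int count)
{
    for (int i = 0; i < count; i++) {
        if (!model_tgt_check(t, ost_idx[i]))
            return i;
    }
    return -1;
}
```

            With an empty table (no OSTs registered yet), any orphan with at least one stripe trips the check on its first stripe, which matches the observation that the MDS crashes on every start.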

            jlevi Jodi Levi (Inactive) added a comment -

            Duplicate of LU-3161

            green Oleg Drokin added a comment -

            Just to be sure, I reproduced this on master with these steps:
            1. mount lustre
            2. cat /dev/zero >/mnt/lustre/file
            3. ^Z
            4. rm /mnt/lustre/file
            5. umount /mnt/mds1
            6. tunefs.lustre --writeconf /path/to/mds/device
            7. mount /path/to/mds/device /mnt -t lustre

            nedbass Ned Bass (Inactive) added a comment -

            Our plan for future updates is to restart the MDS before writeconf to force orphan cleanup.

            nedbass Ned Bass (Inactive) added a comment -

            Oleg, that's right, we started the MDT/MGS before the OSTs.

            green Oleg Drokin added a comment -

            BTW, MGS is still supposed to regenerate the config logs, but I imagine you did not start OSTs before starting MDT?

            green Oleg Drokin added a comment -

            Anyway, crashing is of course not normal behavior, so I'll update the patch that fixes this.
            The other problem is that object leakage is bound to occur here, because orphan cleanup cannot happen (no OSTs are connected to send destroy requests to).
            I wonder how to better address this.
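
            To make the two concerns in this comment concrete (no crash, but possible object leakage), here is a small illustrative sketch in C. It is not the patch referred to above and not Lustre code: the types and functions (sketch_orphan, sketch_result, sketch_ost_known, sketch_orphan_cleanup) are hypothetical, the "known" array stands in for the OST target bitmap, and returning -ENODEV is an assumed convention. The sketch skips any orphan that names an unregistered OST instead of asserting, and counts the skipped orphans so that the would-be leak is visible to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical orphan record: the OST indices named by its layout. */
struct sketch_orphan {
    const unsigned int *ost_idx;
    size_t stripe_count;
};

/* Outcome of one cleanup pass. */
struct sketch_result {
    size_t destroyed; /* orphans whose OSTs were all known */
    size_t skipped;   /* orphans left in place; objects leak unless retried */
};

/* Stand-in for the target bitmap lookup. */
static bool sketch_ost_known(const bool *known, size_t nr_known,
                             unsigned int idx)
{
    return idx < nr_known && known[idx];
}

/*
 * One cleanup pass. Returns 0 if every orphan could be handled, or
 * -ENODEV (an assumed error code) if at least one orphan was skipped
 * because it names an OST that has not registered yet.
 */
static int sketch_orphan_cleanup(const bool *known, size_t nr_known,
                                 const struct sketch_orphan *orphans,
                                 size_t nr_orphans,
                                 struct sketch_result *res)
{
    int rc = 0;

    assert(res != NULL);
    res->destroyed = 0;
    res->skipped = 0;

    for (size_t i = 0; i < nr_orphans; i++) {
        bool all_known = true;

        for (size_t s = 0; s < orphans[i].stripe_count; s++) {
            if (!sketch_ost_known(known, nr_known, orphans[i].ost_idx[s])) {
                all_known = false;
                break;
            }
        }
        if (all_known) {
            res->destroyed++;
        } else {
            res->skipped++; /* report instead of LBUG */
            rc = -ENODEV;
        }
    }
    return rc;
}
```

            A non-zero skipped count corresponds to the leakage concern raised above; whether such orphans should be retried once the OSTs connect is exactly the open question in this comment, and the sketch does not answer it.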

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 7
