Details

    • Bug
    • Resolution: Cannot Reproduce
    • Blocker
    • Lustre 2.2.0
    • Lustre 2.2.0
    • None
    • Lustre Master
    • 3
    • 6468

    Description

      Running on a real machine:

      $ mkfs.lustre --fsname=survey --mdt --index=0 /dev/sda3
      $ mount -t lustre /dev/sda3 /mnt
      $ thrhi=64 file_count=200000 sh mds-survey
      

      then it crashes:

      Build Version: jenkins-arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel-4610-g614
      Lustre: Added LNI 10.45.1.8@tcp [8/256/0/180]
      Lustre: Accept all, port 988
      LDISKFS-fs (sda3): recovery complete
      LDISKFS-fs (sda3): mounted filesystem with ordered data mode. Opts:
      LDISKFS-fs (sda3): mounted filesystem with ordered data mode. Opts:
      Lustre: MGC10.45.1.8@tcp: Reactivating import
      Lustre: survey-MDT0000: used disk, loading
      Lustre: Echo OBD driver; http://www.lustre.org/
      LustreError: 1821:0:(echo_client.c:1810:echo_md_destroy_internal())
      Can not unlink child tests: rc = -39
      LustreError: 1823:0:(echo_client.c:1810:echo_md_destroy_internal())
      Can not unlink child tests1: rc = -39
      LustreError: 1831:0:(osd_handler.c:2294:osd_object_ref_del())
      ASSERTION((oh)->ot_declare_ref_del > 0) failed
      LustreError: 1831:0:(osd_handler.c:2294:osd_object_ref_del()) LBUG
      Pid: 1831, comm: lctl
      
      Call Trace:
       [<ffffffffa038e855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa038ee95>] lbug_with_loc+0x75/0xe0 [libcfs]
       [<ffffffffa0399d96>] libcfs_assertion_failed+0x66/0x70 [libcfs]
       [<ffffffffa0a1781a>] osd_object_ref_del+0x14a/0x180 [osd_ldiskfs]
       [<ffffffffa096ecbb>] __mdd_ref_del+0x5b/0xa0 [mdd]
       [<ffffffffa09777a2>] mdd_create+0x1ae2/0x2470 [mdd]
       [<ffffffffa051190d>] ? htable_lookup+0xed/0x190 [obdclass]
       [<ffffffffa041b5a9>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
       [<ffffffff81275894>] ? vsnprintf+0x484/0x5f0
       [<ffffffffa0a6822b>] echo_md_create_internal+0xab/0x4b0 [obdecho]
       [<ffffffff81275a40>] ? sprintf+0x40/0x50
       [<ffffffffa0a6ff40>] echo_md_handler+0x1380/0x1dd0 [obdecho]
       [<ffffffffa040d87e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
       [<ffffffffa0a75ae6>] echo_client_iocontrol+0x1c86/0x2a30 [obdecho]
       [<ffffffff81127e77>] ? ____pagevec_lru_add+0x167/0x180
       [<ffffffffa040da13>] ? cfs_alloc+0x63/0x90 [libcfs]
       [<ffffffffa04c0f52>] ? obd_ioctl_getdata+0x172/0x1060 [obdclass]
       [<ffffffffa04d6264>] class_handle_ioctl+0x14d4/0x2340 [obdclass]
       [<ffffffff8120d5df>] ? security_inode_permission+0x1f/0x30
       [<ffffffffa04c0313>] obd_class_ioctl+0x53/0x240 [obdclass]
       [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
       [<ffffffff81189342>] vfs_ioctl+0x22/0xa0
       [<ffffffff811894c9>] ? do_vfs_ioctl+0x69/0x580
       [<ffffffff811894e4>] do_vfs_ioctl+0x84/0x580
       [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
       [<ffffffff81189a61>] sys_ioctl+0x81/0xa0
       [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      
      Kernel panic - not syncing: LBUG
      Pid: 7014, comm: lctl Not tainted 2.6.32-220.el6_lustre.x86_64 #1
      Call Trace:
       [<ffffffff814ec701>] ? panic+0x78/0x143
       [<ffffffffa040ceeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
       [<ffffffffa0417d96>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
       [<ffffffffa0a1781a>] ? osd_object_ref_del+0x14a/0x180 [osd_ldiskfs]
       [<ffffffffa096ecbb>] ? __mdd_ref_del+0x5b/0xa0 [mdd]
       [<ffffffffa09777a2>] ? mdd_create+0x1ae2/0x2470 [mdd]
       [<ffffffffa051190d>] ? htable_lookup+0xed/0x190 [obdclass]
       [<ffffffffa041b5a9>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
       [<ffffffff81275894>] ? vsnprintf+0x484/0x5f0
       [<ffffffffa0a6822b>] ? echo_md_create_internal+0xab/0x4b0 [obdecho]
       [<ffffffff81275a40>] ? sprintf+0x40/0x50
       [<ffffffffa0a6ff40>] ? echo_md_handler+0x1380/0x1dd0 [obdecho]
       [<ffffffffa040d87e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
       [<ffffffffa0a75ae6>] ? echo_client_iocontrol+0x1c86/0x2a30 [obdecho]
       [<ffffffff81127e77>] ? ____pagevec_lru_add+0x167/0x180
       [<ffffffffa040da13>] ? cfs_alloc+0x63/0x90 [libcfs]
       [<ffffffffa04c0f52>] ? obd_ioctl_getdata+0x172/0x1060 [obdclass]
       [<ffffffffa04d6264>] ? class_handle_ioctl+0x14d4/0x2340 [obdclass]
       [<ffffffff8120d5df>] ? security_inode_permission+0x1f/0x30
       [<ffffffffa04c0313>] ? obd_class_ioctl+0x53/0x240 [obdclass]
       [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
       [<ffffffff81189342>] ? vfs_ioctl+0x22/0xa0
       [<ffffffff811894c9>] ? do_vfs_ioctl+0x69/0x580
       [<ffffffff811894e4>] ? do_vfs_ioctl+0x84/0x580
       [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
       [<ffffffff81189a61>] ? sys_ioctl+0x81/0xa0
       [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
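
      For context on the assertion: the osd layer follows a declare/execute pattern where every
      nlink decrement must be declared on the transaction handle before the transaction starts.
      The following is a minimal sketch of that pattern (illustrative only, not the actual
      osd_ldiskfs code; names are simplified), showing why an undeclared decrement, e.g. from an
      error-cleanup path inside mdd_create(), trips the LASSERT above:

      #include <errno.h>

      /* Each transaction handle tracks how many nlink decrements
       * were declared before the transaction was started. */
      struct sketch_thandle {
              int ot_declare_ref_del;   /* declared nlink decrements */
      };

      /* declaration phase: record the intent (and reserve journal credits) */
      static void sketch_declare_ref_del(struct sketch_thandle *oh)
      {
              oh->ot_declare_ref_del++;
      }

      /* execution phase: every decrement must have been declared; an
       * undeclared one (e.g. an undo path after a failed create) is
       * exactly the case the real code turns into an LBUG */
      static int sketch_ref_del(struct sketch_thandle *oh)
      {
              if (oh->ot_declare_ref_del <= 0)
                      return -EINVAL;
              oh->ot_declare_ref_del--;
              return 0;
      }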
      

          Activity

            [LU-1080] mds-survey crash

            adilger Andreas Dilger added a comment -

            LU-1082 is tracking the test which will run mds-survey during normal testing to ensure it keeps working.


            rhenwood Richard Henwood (Inactive) added a comment - edited

            FYI: I have been running a more recent Lustre and I have not been able to reproduce this issue.

            # rpm -qa | grep lustre
            lustre-ldiskfs-3.3.0-2.6.32_220.el6_lustre.gfd1c51d.x86_64_g0204171.x86_64
            kernel-2.6.32-220.el6_lustre.gfd1c51d.x86_64
            lustre-modules-2.1.55-2.6.32_220.el6_lustre.gfd1c51d.x86_64_g0204171.x86_64
            kernel-firmware-2.6.32-220.el6_lustre.gfd1c51d.x86_64
            lustre-2.1.55-2.6.32_220.el6_lustre.gfd1c51d.x86_64_g0204171.x86_64
            

            These rpms are from build 480:
            http://build.whamcloud.com/job/lustre-master/480/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/

            pjones Peter Jones added a comment -

            Andreas/Johann

            Could one of you please comment on this 2.2 blocker?

            Thanks

            Peter


            bzzz Alex Zhuravlev added a comment -

            I'm beginning to think that if deletion costs 1 credit, then 1..2 more credits to set nlink,
            so 2..3 additional credits for the undo path won't hurt us, probably? At least in the mdd_create() case.
            Another (enormous) case is mdd_rename() - it'll take more credits for undo, but probably still
            acceptable. And at some point we're going to change the approach to be more object-based than
            just summing ops.
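
            As a rough illustration of the reasoning above (hypothetical names and numbers, not the
            actual mdd/osd credit accounting), reserving the undo path up front amounts to summing a
            few extra credits into the create declaration:

            enum {
                    CREDIT_INDEX_INSERT = 1,   /* add name to parent dir      */
                    CREDIT_INDEX_DELETE = 1,   /* remove it again on failure  */
                    CREDIT_NLINK        = 2,   /* 1..2 credits to set nlink   */
            };

            /* declare a create including its error/undo path, so a failed
             * mdd_create()-style operation can legally back out inside the
             * same transaction instead of hitting an undeclared ref_del */
            static int sketch_create_credits(void)
            {
                    int credits = 0;

                    credits += CREDIT_INDEX_INSERT;   /* forward path           */
                    credits += CREDIT_NLINK;

                    credits += CREDIT_INDEX_DELETE;   /* undo path, ~2..3 extra */
                    credits += CREDIT_NLINK;

                    return credits;
            }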


            niu Niu Yawei (Inactive) added a comment -

            > btw, any idea why [DTO_INDEX_DELETE] = 16? ldiskfs never shrinks a directory, nor does it update
            > neighbor blocks during entry removal, nor does it change quota usage. I'd think 1 should be enough?

            I have no idea why it was 16. You are the ext3/4 expert; I believe you are right, 1 block could be enough.

            bzzz Alex Zhuravlev added a comment - edited

            well, we can declare undo ops - that could be the easiest solution, but that results in more credits.

            btw, any idea why [DTO_INDEX_DELETE] = 16? ldiskfs never shrinks a directory, nor does it update
            neighbor blocks during entry removal, nor does it change quota usage. I'd think 1 should be enough?
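
            For readers unfamiliar with the table being discussed, a per-operation credit table of this
            shape is typically written with C designated initializers. The sketch below is illustrative
            (the SK_ names and values are assumptions, not the actual osd_ldiskfs table) and shows what
            dropping the index-delete entry from 16 to 1 would look like:

            enum sketch_dto_op {
                    SK_DTO_INDEX_INSERT,
                    SK_DTO_INDEX_DELETE,
                    SK_DTO_OP_LAST,
            };

            /* per-op journal credits; the question above is why an entry
             * removal was ever charged 16 blocks when it only modifies the
             * one directory block holding the entry */
            static const int sketch_dto_credits[SK_DTO_OP_LAST] = {
                    [SK_DTO_INDEX_INSERT] = 16,
                    [SK_DTO_INDEX_DELETE] = 1,   /* was 16; 1 block may be enough */
            };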


            niu Niu Yawei (Inactive) added a comment -

            Since we can't declare undo operations, I've removed this LASSERT in LU-993
            (see ec20be97b9f977d3f4944523baaffb1bf95cf76c, "LU-993 osd: code cleanup for directory nlink count"),
            but I'm not sure why the echo create failed. Is it a normal failure during the test?
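
            The shape of that change, as a hedged sketch rather than the actual LU-993 patch: the hard
            assertion on the declared-decrement counter gives way to a guarded decrement, so an
            undo-path nlink drop no longer brings the server down:

            struct sketch_handle {
                    int ot_declare_ref_del;
            };

            static void sketch_ref_del_relaxed(struct sketch_handle *oh)
            {
                    /* previously: LASSERT(oh->ot_declare_ref_del > 0); */
                    if (oh->ot_declare_ref_del > 0)
                            oh->ot_declare_ref_del--;
            }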


            bzzz Alex Zhuravlev added a comment -

            sorry, still thinking how to solve this easily ...


            adilger Andreas Dilger added a comment -

            Alex, do you have any ideas on how this might be fixed?

            di.wang Di Wang added a comment -

            Assign this to Alex.


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: rhenwood Richard Henwood (Inactive)
              Votes: 0
              Watchers: 6
