
obdfilter-survey test_3a: (lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • Lustre 2.8.0
    • None
    • client and server: lustre-master build # 3142 RHEL6.6 DNE
    • 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/72f11210-46d3-11e5-90a5-5254006e85c2.

      The sub-test test_3a failed with the following error:

      test failed to respond and timed out
      

      ost console:

      12:55:26:Lustre: DEBUG MARKER: == obdfilter-survey test 3a: Network survey == 05:48:19 (1439988499)
      12:55:28:LustreError: 11-0: lustre-MDT0000-lwp-OST0000: operation obd_ping to node 10.2.4.221@tcp failed: rc = -107
      12:55:30:LustreError: Skipped 7 previous similar messages
      12:55:31:Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.2.4.221@tcp) was lost; in progress operations using this service will wait for recovery to complete
      12:55:31:Lustre: Skipped 7 previous similar messages
      12:55:32:Lustre: 6155:0:(client.c:2014:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1439988511/real 1439988511]  req@ffff880014660980 x1509869039556728/t0(0) o400->MGC10.2.4.221@tcp@10.2.4.221@tcp:26/25 lens 224/224 e 0 to 1 dl 1439988518 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      12:55:32:Lustre: 6155:0:(client.c:2014:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
      12:55:32:LustreError: 166-1: MGC10.2.4.221@tcp: Connection to MGS (at 10.2.4.221@tcp) was lost; in progress operations using this service will fail
      12:55:32:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
      12:55:34:Lustre: DEBUG MARKER: umount -d -f /mnt/ost1
      12:55:34:Lustre: server umount lustre-OST0000 complete
      12:55:34:Lustre: Skipped 1 previous similar message
      12:55:34:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:34:Lustre: DEBUG MARKER: grep -c /mnt/ost2' ' /proc/mounts
      12:55:34:Lustre: DEBUG MARKER: umount -d -f /mnt/ost2
      12:55:34:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:35:Lustre: DEBUG MARKER: grep -c /mnt/ost3' ' /proc/mounts
      12:55:35:Lustre: DEBUG MARKER: umount -d -f /mnt/ost3
      12:55:35:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:35:Lustre: DEBUG MARKER: grep -c /mnt/ost4' ' /proc/mounts
      12:55:35:Lustre: DEBUG MARKER: umount -d -f /mnt/ost4
      12:55:35:Lustre: server umount lustre-OST0003 complete
      12:55:35:Lustre: Skipped 2 previous similar messages
      12:55:35:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:36:Lustre: DEBUG MARKER: grep -c /mnt/ost5' ' /proc/mounts
      12:55:36:Lustre: DEBUG MARKER: umount -d -f /mnt/ost5
      12:55:36:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:36:Lustre: DEBUG MARKER: grep -c /mnt/ost6' ' /proc/mounts
      12:55:36:Lustre: DEBUG MARKER: umount -d -f /mnt/ost6
      12:55:36:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:37:Lustre: DEBUG MARKER: grep -c /mnt/ost7' ' /proc/mounts
      12:55:37:Lustre: DEBUG MARKER: umount -d -f /mnt/ost7
      12:55:37:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:37:Lustre: DEBUG MARKER: grep -c /mnt/ost8' ' /proc/mounts
      12:55:37:Lustre: DEBUG MARKER: umount -d -f /mnt/ost8
      12:55:37:LustreError: 8532:0:(lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3
      12:55:37:LustreError: 8532:0:(lu_object.c:1224:lu_device_fini()) LBUG
      12:55:37:Pid: 8532, comm: umount
      12:55:38:
      12:55:38:Call Trace:
      12:55:38: [<ffffffffa049b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      12:55:38: [<ffffffffa049be77>] lbug_with_loc+0x47/0xb0 [libcfs]
      12:55:38: [<ffffffffa05f229b>] lu_device_fini+0xbb/0xc0 [obdclass]
      12:55:38: [<ffffffffa05d328d>] ls_device_put+0x7d/0x2e0 [obdclass]
      12:55:39: [<ffffffffa05d3662>] local_oid_storage_fini+0x172/0x410 [obdclass]
      12:55:40: [<ffffffffa0dc476f>] lfsck_instance_cleanup+0x20f/0x7e0 [lfsck]
      12:55:40: [<ffffffffa0dc6f7b>] lfsck_degister+0x4b/0x60 [lfsck]
      12:55:40: [<ffffffffa0e8f597>] ofd_device_fini+0x87/0x250 [ofd]
      12:55:40: [<ffffffffa05e1802>] class_cleanup+0x572/0xd30 [obdclass]
      12:55:40: [<ffffffffa05c1776>] ? class_name2dev+0x56/0xe0 [obdclass]
      12:55:41: [<ffffffffa05e3e56>] class_process_config+0x1e96/0x2800 [obdclass]
      12:55:41: [<ffffffffa04a7c01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      12:55:41: [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
      12:55:41: [<ffffffffa05e4c7f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
      12:55:41: [<ffffffffa05c1776>] ? class_name2dev+0x56/0xe0 [obdclass]
      12:55:41: [<ffffffffa061e102>] server_put_super+0x9e2/0xeb0 [obdclass]
      12:55:41: [<ffffffff811ac776>] ? invalidate_inodes+0xf6/0x190
      12:55:41: [<ffffffff81190b7b>] generic_shutdown_super+0x5b/0xe0
      12:55:41: [<ffffffff81190c66>] kill_anon_super+0x16/0x60
      12:55:41: [<ffffffffa05e7b36>] lustre_kill_super+0x36/0x60 [obdclass]
      12:55:42: [<ffffffff81191407>] deactivate_super+0x57/0x80
      12:55:42: [<ffffffff811b10df>] mntput_no_expire+0xbf/0x110
      12:55:42: [<ffffffff811b1c2b>] sys_umount+0x7b/0x3a0
      12:55:42: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
      12:55:42:
      12:55:42:Kernel panic - not syncing: LBUG
      12:55:42:Pid: 8532, comm: umount Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1
      12:55:42:Call Trace:
      12:55:43: [<ffffffff81529c9c>] ? panic+0xa7/0x16f
      12:55:43: [<ffffffffa049becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      12:55:43: [<ffffffffa05f229b>] ? lu_device_fini+0xbb/0xc0 [obdclass]
      12:55:43: [<ffffffffa05d328d>] ? ls_device_put+0x7d/0x2e0 [obdclass]
      12:55:43: [<ffffffffa05d3662>] ? local_oid_storage_fini+0x172/0x410 [obdclass]
      12:55:43: [<ffffffffa0dc476f>] ? lfsck_instance_cleanup+0x20f/0x7e0 [lfsck]
      12:55:43: [<ffffffffa0dc6f7b>] ? lfsck_degister+0x4b/0x60 [lfsck]
      12:55:43: [<ffffffffa0e8f597>] ? ofd_device_fini+0x87/0x250 [ofd]
      12:55:43: [<ffffffffa05e1802>] ? class_cleanup+0x572/0xd30 [obdclass]
      12:55:43: [<ffffffffa05c1776>] ? class_name2dev+0x56/0xe0 [obdclass]
      12:55:45: [<ffffffffa05e3e56>] ? class_process_config+0x1e96/0x2800 [obdclass]
      12:55:45: [<ffffffffa04a7c01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      12:55:45: [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
      12:55:46: [<ffffffffa05e4c7f>] ? class_manual_cleanup+0x4bf/0x8e0 [obdclass]
      12:55:46: [<ffffffffa05c1776>] ? class_name2dev+0x56/0xe0 [obdclass]
      12:55:46: [<ffffffffa061e102>] ? server_put_super+0x9e2/0xeb0 [obdclass]
      12:55:46: [<ffffffff811ac776>] ? invalidate_inodes+0xf6/0x190
      12:55:46: [<ffffffff81190b7b>] ? generic_shutdown_super+0x5b/0xe0
      12:55:46: [<ffffffff81190c66>] ? kill_anon_super+0x16/0x60
      12:55:47: [<ffffffffa05e7b36>] ? lustre_kill_super+0x36/0x60 [obdclass]
      12:55:47: [<ffffffff81191407>] ? deactivate_super+0x57/0x80
      12:55:47: [<ffffffff811b10df>] ? mntput_no_expire+0xbf/0x110
      12:55:48: [<ffffffff811b1c2b>] ? sys_umount+0x7b/0x3a0
      12:55:49: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
      12:55:50:Initializing cgroup subsys cpuset
      

          Activity

            [LU-7038] obdfilter-survey test_3a: (lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3
            bzzz Alex Zhuravlev added a comment - I think http://review.whamcloud.com/#/c/17415/ is OK, http://review.whamcloud.com/18505 should be a proper fix.

            gerrit Gerrit Updater added a comment -

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/18505
            Subject: LU-7038 obdclass: lu_site_purge() to handle purge-all
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 213ac02d3a6aef2f136457608ff01db9279bf1ab

            gerrit Gerrit Updater added a comment -

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/18484
            Subject: LU-7038 debug: print objects if device is still busy
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 62492126228310a2dc0c52d90d18423347829525

            sarah Sarah Liu added a comment -

            Hit this on master DNE mode
            https://testing.hpdd.intel.com/test_sets/de009266-bbfd-11e5-8506-5254006e85c2
            client and server: lustre-master # 3305 RHEL6.7 ldiskfs

            bzzz Alex Zhuravlev added a comment - - edited

            Something is wrong with lu_site_purge(): after calling it I still find unreferenced objects in the cache:
            [ 3278.750967] LustreError: 11754:0:(local_storage.c:193:ls_device_put()) header@ffff8800d61b7180[0x0, 0, [0x200000003:0x6:0x0] hash lru exist]{
            [ 3278.752782] LustreError: 11754:0:(local_storage.c:193:ls_device_put()) ....local_storage@ffff8800d61b71d0

            One more call to lu_site_purge() releases all of them. Or this is a race.
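
            To make the symptom concrete, here is a minimal toy model in C. It is not Lustre code: `toy_device`, `toy_object`, `toy_setup()`, `toy_site_purge()` and `toy_device_can_fini()` are hypothetical names, and the `busy` flag is an assumption standing in for whatever makes a real purge pass miss an object (a race, for example). Each cached object pins its device, so a purge pass that misses three objects leaves the device refcount at 3, and a second pass releases them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; these are not the Lustre types. */
struct toy_device {
	int ref;		/* plays the role of ld_ref */
};

struct toy_object {
	struct toy_device *dev;
	bool cached;		/* still sitting in the cache */
	bool busy;		/* assumption: missed by the next purge pass only */
};

/* Put n objects in the cache; each cached object pins its device. */
static void toy_setup(struct toy_device *d, struct toy_object *objs, int n)
{
	d->ref = 0;
	for (int i = 0; i < n; i++) {
		objs[i].dev = d;
		objs[i].cached = true;
		objs[i].busy = true;	/* the first pass will miss them */
		d->ref++;
	}
}

/* One purge pass; returns how many objects it released. */
static int toy_site_purge(struct toy_object *objs, int n)
{
	int freed = 0;

	for (int i = 0; i < n; i++) {
		if (!objs[i].cached)
			continue;
		if (objs[i].busy) {	/* missed now, eligible on the next pass */
			objs[i].busy = false;
			continue;
		}
		objs[i].cached = false;
		objs[i].dev->ref--;
		freed++;
	}
	return freed;
}

/* Teardown check in the spirit of the failing assertion. */
static bool toy_device_can_fini(const struct toy_device *d)
{
	if (d->ref != 0)
		fprintf(stderr, "Refcount is %d\n", d->ref);
	return d->ref == 0;
}
```

            Calling `toy_setup()` on three objects, then `toy_site_purge()` once, leaves `toy_device_can_fini()` false with a refcount of 3; a second `toy_site_purge()` call drops it to 0. Whether the real cause is a skipped pass or a race is exactly what the comment above leaves open.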


            bzzz Alex Zhuravlev added a comment -

            I'm hitting this quite often locally, mostly using sanity-benchmark.

            standan Saurabh Tandan (Inactive) added a comment -

            master, build# 3264, 2.7.64 tag
            Full test group: EL6.7 Server/EL6.7 Client
            https://testing.hpdd.intel.com/test_sets/6c6a9940-9f0a-11e5-ba94-5254006e85c2

            jamesanunez James Nunez (Inactive) added a comment -

            We've hit this with the full test group on tag 2.7.64 with the lnet-selftest test suite. Logs at
            2015-12-22 17:22:29 - https://testing.hpdd.intel.com/test_sets/e03f0150-a912-11e5-9286-5254006e85c2

            Although there are no logs for the following test session failures, they all hang on umount of ost7, like the one above, and are probably due to the same issue:
            2015-12-18 15:26:56 - https://testing.hpdd.intel.com/test_sets/100925aa-a5e4-11e5-a028-5254006e85c2
            2015-12-18 19:59:00 - https://testing.hpdd.intel.com/test_sets/53e5f6ba-a5ec-11e5-9f01-5254006e85c2

            sarah Sarah Liu added a comment -

            Hit this when unmounting an OST after upgrading the system from 2.5.5 RHEL6.6 ZFS to master/#3264 RHEL7 ZFS. It looks like it can be reproduced in this scenario.

            [ 3306.094757] Lustre: DEBUG MARKER: == upgrade-downgrade test completed at: Wed Dec 16 15:06:55 PST 2015 == 15:06:55 (1450307215)
            [ 3312.969766] LustreError: 11-0: lustre-MDT0000-lwp-OST0000: operation obd_ping to node 10.2.4.47@tcp failed: rc = -107
            [ 3312.981749] Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.2.4.47@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [ 3324.975299] Lustre: 13357:0:(client.c:1994:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1450307228/real 1450307228]  req@ffff8807ffa3aa00 x1520756527892904/t0(0) o400->MGC10.2.4.47@tcp@10.2.4.47@tcp:26/25 lens 224/224 e 0 to 1 dl 1450307235 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [ 3325.006351] LustreError: 166-1: MGC10.2.4.47@tcp: Connection to MGS (at 10.2.4.47@tcp) was lost; in progress operations using this service will fail
            [ 3329.514798] LustreError: 14661:0:(lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3
            [ 3329.528240] LustreError: 14661:0:(lu_object.c:1224:lu_device_fini()) LBUG
            [ 3329.535888] Pid: 14661, comm: umount
            [ 3329.539917] 
            [ 3329.539917] Call Trace:
            [ 3329.549151]  [<ffffffffa07457d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
            [ 3329.559376]  [<ffffffffa0745d75>] lbug_with_loc+0x45/0xc0 [libcfs]
            [ 3329.568633]  [<ffffffffa0b65148>] lu_device_fini+0xb8/0xc0 [obdclass]
            
            [ 3329.577952]  [<ffffffffa0b47cbd>] ls_device_put+0x7d/0x420 [obdclass]
            [ 3329.588351]  [<ffffffffa0b48161>] local_oid_storage_fini+0x101/0x340 [obdclass]
            [ 3329.599608]  [<ffffffffa11ae37e>] lfsck_instance_cleanup+0x20e/0xa50 [lfsck]
            [ 3329.610569]  [<ffffffffa11b10f3>] lfsck_degister+0x43/0x50 [lfsck]
            [ 3329.620541]  [<ffffffffa127936a>] ofd_device_fini+0xba/0x2a0 [ofd]
            [ 3329.630555]  [<ffffffffa0b534e4>] class_cleanup+0x734/0xcc0 [obdclass]
            [ 3329.639396]  [<ffffffffa0b55d83>] class_process_config+0x1bf3/0x2cf0 [obdclass]
            [ 3329.649155]  [<ffffffff811acf53>] ? __kmalloc+0x1f3/0x230
            [ 3329.656694]  [<ffffffffa0b500fb>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
            [ 3329.665770]  [<ffffffffa0b56f6f>] class_manual_cleanup+0xef/0xba0 [obdclass]
            [ 3329.675170]  [<ffffffffa0b8e40e>] server_put_super+0x84e/0xea0 [obdclass]
            [ 3329.684307]  [<ffffffff811c9426>] generic_shutdown_super+0x56/0xe0
            [ 3329.692720]  [<ffffffff811c9692>] kill_anon_super+0x12/0x20
            [ 3329.700560]  [<ffffffffa0b5ac42>] lustre_kill_super+0x32/0x50 [obdclass]
            [ 3329.709596]  [<ffffffff811c9a3d>] deactivate_locked_super+0x3d/0x60
            [ 3329.718066]  [<ffffffff811ca046>] deactivate_super+0x46/0x60
            [ 3329.725517]  [<ffffffff811e6f35>] mntput_no_expire+0xc5/0x120
            [ 3329.733057]  [<ffffffff811e806f>] SyS_umount+0x9f/0x3c0
            [ 3329.740000]  [<ffffffff81615309>] system_call_fastpath+0x16/0x1b

            Message from syslogd@onyx-26 at Dec 16 15:07:19 ...
             kernel:LustreError: 14661:0:(lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3

            Message from syslogd@onyx-26 at Dec 16 15:07:19 ...
             kernel:LustreError: 14661:0:(lu_object.c:1224:lu_device_fini()) LBUG

            [ 3329.747792] 
            [ 3329.750842] Kernel panic - not syncing: LBUG
            [ 3329.757928] CPU: 18 PID: 14661 Comm: umount Tainted: PF         IO--------------   3.10.0-229.20.1.el7_lustre.x86_64 #1
            [ 3329.772340] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.99.99.x045.022820121209 02/28/2012
            [ 3329.786192]  ffffffffa0762eaf 000000001b1f107a ffff8807fad17a88 ffffffff816053da
            [ 3329.796912]  ffff8807fad17b08 ffffffff815fec4e ffffffff00000008 ffff8807fad17b18
            [ 3329.807636]  ffff8807fad17ab8 000000001b1f107a ffffffffa0b9d255 0000000000000246
            [ 3329.818348] Call Trace:
            [ 3329.823426]  [<ffffffff816053da>] dump_stack+0x19/0x1b
            [ 3329.831478]  [<ffffffff815fec4e>] panic+0xd8/0x1e7
            [ 3329.839083]  [<ffffffffa0745ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
            [ 3329.848223]  [<ffffffffa0b65148>] lu_device_fini+0xb8/0xc0 [obdclass]
            [ 3329.857598]  [<ffffffffa0b47cbd>] ls_device_put+0x7d/0x420 [obdclass]
            [ 3329.866934]  [<ffffffffa0b48161>] local_oid_storage_fini+0x101/0x340 [obdclass]
            [ 3329.877178]  [<ffffffffa11ae37e>] lfsck_instance_cleanup+0x20e/0xa50 [lfsck]
            [ 3329.887074]  [<ffffffffa11b10f3>] lfsck_degister+0x43/0x50 [lfsck]
            [ 3329.895943]  [<ffffffffa127936a>] ofd_device_fini+0xba/0x2a0 [ofd]
            [ 3329.904788]  [<ffffffffa0b534e4>] class_cleanup+0x734/0xcc0 [obdclass]
            [ 3329.913967]  [<ffffffffa0b55d83>] class_process_config+0x1bf3/0x2cf0 [obdclass]
            [ 3329.923943]  [<ffffffff811acf53>] ? __kmalloc+0x1f3/0x230
            [ 3329.931759]  [<ffffffffa0b500fb>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
            [ 3329.940992]  [<ffffffffa0b56f6f>] class_manual_cleanup+0xef/0xba0 [obdclass]
            [ 3329.950604]  [<ffffffffa0b8e40e>] server_put_super+0x84e/0xea0 [obdclass]
            [ 3329.959882]  [<ffffffff811c9426>] generic_shutdown_super+0x56/0xe0
            [ 3329.968491]  [<ffffffff811c9692>] kill_anon_super+0x12/0x20
            [ 3329.976463]  [<ffffffffa0b5ac42>] lustre_kill_super+0x32/0x50 [obdclass]
            [ 3329.985661]  [<ffffffff811c9a3d>] deactivate_locked_super+0x3d/0x60
            [ 3329.994376]  [<ffffffff811ca046>] deactivate_super+0x46/0x60
            [ 3330.002400]  [<ffffffff811e6f35>] mntput_no_expire+0xc5/0x120
            [ 3330.010524]  [<ffffffff811e806f>] SyS_umount+0x9f/0x3c0
            [ 3330.018051]  [<ffffffff81615309>] system_call_fastpath+0x16/0x1b
            [ 3330.100294] drm_kms_helper: panic occurred, switching back to text console
            

            jamesanunez James Nunez (Inactive) added a comment -

            We hit this while unmounting an OST at the end of ost-pools; LU-7326. Logs are at https://testing.hpdd.intel.com/test_sets/ea392e2a-776b-11e5-a00c-5254006e85c2.

            green Oleg Drokin added a comment -

            I hit this now after a sanity run in cleanup.

            <3>[36131.936383] LustreError: Skipped 1 previous similar message
            <4>[36135.163704] Lustre: server umount lustre-MDT0000 complete
            <0>[36141.992203] LustreError: 26669:0:(lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3
            <0>[36141.993278] LustreError: 26669:0:(lu_object.c:1224:lu_device_fini()) LBUG
            <4>[36141.993812] Pid: 26669, comm: umount
            <4>[36141.994278] 
            <4>[36141.994279] Call Trace:
            <4>[36141.995167]  [<ffffffffa079b885>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            <4>[36141.995966]  [<ffffffffa079be87>] lbug_with_loc+0x47/0xb0 [libcfs]
            <4>[36141.996537]  [<ffffffffa102cfd8>] lu_device_fini+0xb8/0xc0 [obdclass]
            <4>[36141.997073]  [<ffffffffa100f0ad>] ls_device_put+0x8d/0x2d0 [obdclass]
            <4>[36141.997624]  [<ffffffffa100f3c5>] local_oid_storage_fini+0xd5/0x2e0 [obdclass]
            <4>[36141.998363]  [<ffffffffa05cc32f>] lfsck_instance_cleanup+0x22f/0x790 [lfsck]
            <4>[36141.998770]  [<ffffffffa05ce9ab>] lfsck_degister+0x4b/0x60 [lfsck]
            <4>[36141.999128]  [<ffffffffa0c0e0cb>] ofd_device_fini+0xab/0x260 [ofd]
            <4>[36141.999564]  [<ffffffffa101c142>] class_cleanup+0x572/0xd20 [obdclass]
            <4>[36141.999933]  [<ffffffffa0ffe0cc>] ? class_name2dev+0x7c/0xe0 [obdclass]
            <4>[36142.000310]  [<ffffffffa101e666>] class_process_config+0x1d76/0x26d0 [obdclass]
            <4>[36142.001033]  [<ffffffff8117757a>] ? cache_alloc_debugcheck_after+0x14a/0x210
            <4>[36142.001493]  [<ffffffff81179a55>] ? __kmalloc+0x1c5/0x2b0
            <4>[36142.001926]  [<ffffffffa101f218>] ? class_manual_cleanup+0x258/0xe10 [obdclass]
            <4>[36142.002690]  [<ffffffffa101f47f>] class_manual_cleanup+0x4bf/0xe10 [obdclass]
            <4>[36142.003090]  [<ffffffffa0ffe0cc>] ? class_name2dev+0x7c/0xe0 [obdclass]
            <4>[36142.003556]  [<ffffffffa105357c>] server_put_super+0x9bc/0xe80 [obdclass]
            <4>[36142.003987]  [<ffffffff811b141a>] ? invalidate_inodes+0xfa/0x180
            <4>[36142.004383]  [<ffffffff8119564b>] generic_shutdown_super+0x5b/0xe0
            <4>[36142.004796]  [<ffffffff81195736>] kill_anon_super+0x16/0x60
            <4>[36142.005165]  [<ffffffffa1022b76>] lustre_kill_super+0x36/0x60 [obdclass]
            <4>[36142.005792]  [<ffffffff81195ed7>] deactivate_super+0x57/0x80
            <4>[36142.006244]  [<ffffffff811b5e2f>] mntput_no_expire+0xbf/0x110
            <4>[36142.006926]  [<ffffffff811b699b>] sys_umount+0x7b/0x3a0
            <4>[36142.007516]  [<ffffffff8100b112>] system_call_fastpath+0x16/0x1b
            <4>[36142.008049] 
            <0>[36142.013691] Kernel panic - not syncing: LBUG
            

            Crashdump is in /exports/crashdumps/192.168.10.224-2015-10-14-11\:14\:17


            People

              bzzz Alex Zhuravlev
              maloo Maloo