Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7038

obdfilter-survey test_3a: (lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • Lustre 2.8.0
    • None
    • client and server: lustre-master build # 3142 RHEL6.6 DNE
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/72f11210-46d3-11e5-90a5-5254006e85c2.

      The sub-test test_3a failed with the following error:

      test failed to respond and timed out
      

      ost console:

      12:55:26:Lustre: DEBUG MARKER: == obdfilter-survey test 3a: Network survey == 05:48:19 (1439988499)
      12:55:28:LustreError: 11-0: lustre-MDT0000-lwp-OST0000: operation obd_ping to node 10.2.4.221@tcp failed: rc = -107
      12:55:30:LustreError: Skipped 7 previous similar messages
      12:55:31:Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.2.4.221@tcp) was lost; in progress operations using this service will wait for recovery to complete
      12:55:31:Lustre: Skipped 7 previous similar messages
      12:55:32:Lustre: 6155:0:(client.c:2014:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1439988511/real 1439988511]  req@ffff880014660980 x1509869039556728/t0(0) o400->MGC10.2.4.221@tcp@10.2.4.221@tcp:26/25 lens 224/224 e 0 to 1 dl 1439988518 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      12:55:32:Lustre: 6155:0:(client.c:2014:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
      12:55:32:LustreError: 166-1: MGC10.2.4.221@tcp: Connection to MGS (at 10.2.4.221@tcp) was lost; in progress operations using this service will fail
      12:55:32:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
      12:55:34:Lustre: DEBUG MARKER: umount -d -f /mnt/ost1
      12:55:34:Lustre: server umount lustre-OST0000 complete
      12:55:34:Lustre: Skipped 1 previous similar message
      12:55:34:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:34:Lustre: DEBUG MARKER: grep -c /mnt/ost2' ' /proc/mounts
      12:55:34:Lustre: DEBUG MARKER: umount -d -f /mnt/ost2
      12:55:34:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:35:Lustre: DEBUG MARKER: grep -c /mnt/ost3' ' /proc/mounts
      12:55:35:Lustre: DEBUG MARKER: umount -d -f /mnt/ost3
      12:55:35:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:35:Lustre: DEBUG MARKER: grep -c /mnt/ost4' ' /proc/mounts
      12:55:35:Lustre: DEBUG MARKER: umount -d -f /mnt/ost4
      12:55:35:Lustre: server umount lustre-OST0003 complete
      12:55:35:Lustre: Skipped 2 previous similar messages
      12:55:35:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:36:Lustre: DEBUG MARKER: grep -c /mnt/ost5' ' /proc/mounts
      12:55:36:Lustre: DEBUG MARKER: umount -d -f /mnt/ost5
      12:55:36:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:36:Lustre: DEBUG MARKER: grep -c /mnt/ost6' ' /proc/mounts
      12:55:36:Lustre: DEBUG MARKER: umount -d -f /mnt/ost6
      12:55:36:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:37:Lustre: DEBUG MARKER: grep -c /mnt/ost7' ' /proc/mounts
      12:55:37:Lustre: DEBUG MARKER: umount -d -f /mnt/ost7
      12:55:37:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      12:55:37:Lustre: DEBUG MARKER: grep -c /mnt/ost8' ' /proc/mounts
      12:55:37:Lustre: DEBUG MARKER: umount -d -f /mnt/ost8
      12:55:37:LustreError: 8532:0:(lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3
      12:55:37:LustreError: 8532:0:(lu_object.c:1224:lu_device_fini()) LBUG
      12:55:37:Pid: 8532, comm: umount
      12:55:38:
      12:55:38:Call Trace:
      12:55:38: [<ffffffffa049b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      12:55:38: [<ffffffffa049be77>] lbug_with_loc+0x47/0xb0 [libcfs]
      12:55:38: [<ffffffffa05f229b>] lu_device_fini+0xbb/0xc0 [obdclass]
      12:55:38: [<ffffffffa05d328d>] ls_device_put+0x7d/0x2e0 [obdclass]
      12:55:39: [<ffffffffa05d3662>] local_oid_storage_fini+0x172/0x410 [obdclass]
      12:55:40: [<ffffffffa0dc476f>] lfsck_instance_cleanup+0x20f/0x7e0 [lfsck]
      12:55:40: [<ffffffffa0dc6f7b>] lfsck_degister+0x4b/0x60 [lfsck]
      12:55:40: [<ffffffffa0e8f597>] ofd_device_fini+0x87/0x250 [ofd]
      12:55:40: [<ffffffffa05e1802>] class_cleanup+0x572/0xd30 [obdclass]
      12:55:40: [<ffffffffa05c1776>] ? class_name2dev+0x56/0xe0 [obdclass]
      12:55:41: [<ffffffffa05e3e56>] class_process_config+0x1e96/0x2800 [obdclass]
      12:55:41: [<ffffffffa04a7c01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      12:55:41: [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
      12:55:41: [<ffffffffa05e4c7f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
      12:55:41: [<ffffffffa05c1776>] ? class_name2dev+0x56/0xe0 [obdclass]
      12:55:41: [<ffffffffa061e102>] server_put_super+0x9e2/0xeb0 [obdclass]
      12:55:41: [<ffffffff811ac776>] ? invalidate_inodes+0xf6/0x190
      12:55:41: [<ffffffff81190b7b>] generic_shutdown_super+0x5b/0xe0
      12:55:41: [<ffffffff81190c66>] kill_anon_super+0x16/0x60
      12:55:41: [<ffffffffa05e7b36>] lustre_kill_super+0x36/0x60 [obdclass]
      12:55:42: [<ffffffff81191407>] deactivate_super+0x57/0x80
      12:55:42: [<ffffffff811b10df>] mntput_no_expire+0xbf/0x110
      12:55:42: [<ffffffff811b1c2b>] sys_umount+0x7b/0x3a0
      12:55:42: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
      12:55:42:
      12:55:42:Kernel panic - not syncing: LBUG
      12:55:42:Pid: 8532, comm: umount Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1
      12:55:42:Call Trace:
      12:55:43: [<ffffffff81529c9c>] ? panic+0xa7/0x16f
      12:55:43: [<ffffffffa049becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      12:55:43: [<ffffffffa05f229b>] ? lu_device_fini+0xbb/0xc0 [obdclass]
      12:55:43: [<ffffffffa05d328d>] ? ls_device_put+0x7d/0x2e0 [obdclass]
      12:55:43: [<ffffffffa05d3662>] ? local_oid_storage_fini+0x172/0x410 [obdclass]
      12:55:43: [<ffffffffa0dc476f>] ? lfsck_instance_cleanup+0x20f/0x7e0 [lfsck]
      12:55:43: [<ffffffffa0dc6f7b>] ? lfsck_degister+0x4b/0x60 [lfsck]
      12:55:43: [<ffffffffa0e8f597>] ? ofd_device_fini+0x87/0x250 [ofd]
      12:55:43: [<ffffffffa05e1802>] ? class_cleanup+0x572/0xd30 [obdclass]
      12:55:43: [<ffffffffa05c1776>] ? class_name2dev+0x56/0xe0 [obdclass]
      12:55:45: [<ffffffffa05e3e56>] ? class_process_config+0x1e96/0x2800 [obdclass]
      12:55:45: [<ffffffffa04a7c01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      12:55:45: [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
      12:55:46: [<ffffffffa05e4c7f>] ? class_manual_cleanup+0x4bf/0x8e0 [obdclass]
      12:55:46: [<ffffffffa05c1776>] ? class_name2dev+0x56/0xe0 [obdclass]
      12:55:46: [<ffffffffa061e102>] ? server_put_super+0x9e2/0xeb0 [obdclass]
      12:55:46: [<ffffffff811ac776>] ? invalidate_inodes+0xf6/0x190
      12:55:46: [<ffffffff81190b7b>] ? generic_shutdown_super+0x5b/0xe0
      12:55:46: [<ffffffff81190c66>] ? kill_anon_super+0x16/0x60
      12:55:47: [<ffffffffa05e7b36>] ? lustre_kill_super+0x36/0x60 [obdclass]
      12:55:47: [<ffffffff81191407>] ? deactivate_super+0x57/0x80
      12:55:47: [<ffffffff811b10df>] ? mntput_no_expire+0xbf/0x110
      12:55:48: [<ffffffff811b1c2b>] ? sys_umount+0x7b/0x3a0
      12:55:49: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
      12:55:50:Initializing cgroup subsys cpuset
      

      Attachments

        Issue Links

          Activity

            [LU-7038] obdfilter-survey test_3a: (lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18505/
            Subject: LU-7038 obdclass: lu_site_purge() to handle purge-all
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bcbcd5873589c71a5d1028c14e74f8897fc3ffc0

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18505/ Subject: LU-7038 obdclass: lu_site_purge() to handle purge-all Project: fs/lustre-release Branch: master Current Patch Set: Commit: bcbcd5873589c71a5d1028c14e74f8897fc3ffc0

            Another instance found for interop - EL7 Server/2.7.1 Client, tag 2.7.90.
            https://testing.hpdd.intel.com/test_sessions/495aabae-d306-11e5-be5c-5254006e85c2

            standan Saurabh Tandan (Inactive) added a comment - Another instance found for interop - EL7 Server/2.7.1 Client, tag 2.7.90. https://testing.hpdd.intel.com/test_sessions/495aabae-d306-11e5-be5c-5254006e85c2
            bzzz Alex Zhuravlev added a comment - I think http://review.whamcloud.com/#/c/17415/ is OK, http://review.whamcloud.com/18505 should be a proper fix.

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/18505
            Subject: LU-7038 obdclass: lu_site_purge() to handle purge-all
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 213ac02d3a6aef2f136457608ff01db9279bf1ab

            gerrit Gerrit Updater added a comment - Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/18505 Subject: LU-7038 obdclass: lu_site_purge() to handle purge-all Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 213ac02d3a6aef2f136457608ff01db9279bf1ab

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/18484
            Subject: LU-7038 debug: print objects if device is still busy
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 62492126228310a2dc0c52d90d18423347829525

            gerrit Gerrit Updater added a comment - Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/18484 Subject: LU-7038 debug: print objects if device is still busy Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 62492126228310a2dc0c52d90d18423347829525
            sarah Sarah Liu added a comment -

            Hit this on master DNE mode
            https://testing.hpdd.intel.com/test_sets/de009266-bbfd-11e5-8506-5254006e85c2
            client and server: lustre-master # 3305 RHEL6.7 ldiskfs

            sarah Sarah Liu added a comment - Hit this on master DNE mode https://testing.hpdd.intel.com/test_sets/de009266-bbfd-11e5-8506-5254006e85c2 client and server: lustre-master # 3305 RHEL6.7 ldiskfs
            bzzz Alex Zhuravlev added a comment - - edited

            something is wrong with lu_site_purge(), after call to that I still find non-referenced objects in the cache:
            [ 3278.750967] LustreError: 11754:0:(local_storage.c:193:ls_device_put()) header@ffff8800d61b7180[0x0, 0, [0x200000003:0x6:0x0] hash lru exist]{
            [ 3278.752782] LustreError: 11754:0:(local_storage.c:193:ls_device_put()) ....local_storage@ffff8800d61b71d0

            one more call to lu_site_purge() releases all of them. or this is a race..

            bzzz Alex Zhuravlev added a comment - - edited something is wrong with lu_site_purge(), after call to that I still find non-referenced objects in the cache: [ 3278.750967] LustreError: 11754:0:(local_storage.c:193:ls_device_put()) header@ffff8800d61b7180[0x0, 0, [0x200000003:0x6:0x0] hash lru exist]{ [ 3278.752782] LustreError: 11754:0:(local_storage.c:193:ls_device_put()) ....local_storage@ffff8800d61b71d0 one more call to lu_site_purge() releases all of them. or this is a race..

            I'm hitting this quite often locally, mostly using sanity-benchmark

            bzzz Alex Zhuravlev added a comment - I'm hitting this quite often locally, mostly using sanity-benchmark

            master, build# 3264, 2.7.64 tag
            Full test group :EL6.7 Server/EL6.7 Client
            https://testing.hpdd.intel.com/test_sets/6c6a9940-9f0a-11e5-ba94-5254006e85c2

            standan Saurabh Tandan (Inactive) added a comment - master, build# 3264, 2.7.64 tag Full test group :EL6.7 Server/EL6.7 Client https://testing.hpdd.intel.com/test_sets/6c6a9940-9f0a-11e5-ba94-5254006e85c2

            We've hit this with the full test group on tag 2.7.64 with the lnet-selftest test suite. Logs at
            2015-12-22 17:22:29 - https://testing.hpdd.intel.com/test_sets/e03f0150-a912-11e5-9286-5254006e85c2

            Although there are no logs for the following test session failures, they all hang on umount of ost7 as the one above and are probably due to the same issue:
            2015-12-18 15:26:56 - https://testing.hpdd.intel.com/test_sets/100925aa-a5e4-11e5-a028-5254006e85c2
            2015-12-18 19:59:00 - https://testing.hpdd.intel.com/test_sets/53e5f6ba-a5ec-11e5-9f01-5254006e85c2

            jamesanunez James Nunez (Inactive) added a comment - We've hit this with the full test group on tag 2.7.64 with the lnet-selftest test suite. Logs at 2015-12-22 17:22:29 - https://testing.hpdd.intel.com/test_sets/e03f0150-a912-11e5-9286-5254006e85c2 Although there are no logs for the following test session failures, they all hang on umount of ost7 as the one above and are probably due to the same issue: 2015-12-18 15:26:56 - https://testing.hpdd.intel.com/test_sets/100925aa-a5e4-11e5-a028-5254006e85c2 2015-12-18 19:59:00 - https://testing.hpdd.intel.com/test_sets/53e5f6ba-a5ec-11e5-9f01-5254006e85c2

            People

              bzzz Alex Zhuravlev
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: