
replay-ost-single test_3: ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.8.0
    • Lustre 2.8.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/6824daa8-65d5-11e5-8d98-5254006e85c2.

      The sub-test test_3 failed with the following error:

      07:42:45:Lustre: DEBUG MARKER: == replay-ost-single test 3: Fail OST during write, with verification == 00:42:38 (1443426158)
      07:42:45:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
      07:42:45:Lustre: DEBUG MARKER: umount -d /mnt/ost1
      07:42:45:Lustre: Failing over lustre-OST0000
      07:42:45:Lustre: lustre-OST0000: Not available for connect from 10.2.4.116@tcp (stopping)
      07:42:45:LustreError: 28990:0:(genops.c:815:class_export_put()) ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
      07:42:45:LustreError: 28990:0:(genops.c:815:class_export_put()) LBUG
      07:42:45:Pid: 28990, comm: ll_ost00_014
      07:42:45:
      07:42:45:Call Trace:
      07:42:45: [<ffffffffa04a6875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      07:42:45: [<ffffffffa04a6e77>] lbug_with_loc+0x47/0xb0 [libcfs]
      07:42:45: [<ffffffffa05c98b1>] class_export_put+0x271/0x310 [obdclass]
      07:42:45: [<ffffffffa05c9aa0>] obd_stale_export_put+0x150/0x290 [obdclass]
      07:42:45: [<ffffffffa05c9d01>] class_unlink_export+0x121/0x130 [obdclass]
      07:42:45: [<ffffffffa05e4a90>] class_decref+0x350/0x4d0 [obdclass]
      07:42:45: [<ffffffffa04b2b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      07:42:45: [<ffffffffa0531efc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
      07:42:45: [<ffffffffa07ea6ad>] target_handle_connect+0x23d/0x2bb0 [ptlrpc]
      07:42:45: [<ffffffffa04b2523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
      07:42:46: [<ffffffffa08907d2>] tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
      07:42:46: [<ffffffffa0838731>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      07:42:46: [<ffffffffa08378f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      07:42:46: [<ffffffff810a101e>] kthread+0x9e/0xc0
      07:42:46: [<ffffffff8100c28a>] child_rip+0xa/0x20
      07:42:46: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
      07:42:46: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      07:42:46:
      07:42:46:Kernel panic - not syncing: LBUG
      07:42:46:Pid: 28990, comm: ll_ost00_014 Not tainted 2.6.32-573.3.1.el6_lustre.g00880a0.x86_64 #1
      07:42:46:Call Trace:
      07:42:46: [<ffffffff815384e4>] ? panic+0xa7/0x16f
      07:42:47: [<ffffffffa04a6ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      07:42:47: [<ffffffffa05c98b1>] ? class_export_put+0x271/0x310 [obdclass]
      07:42:47: [<ffffffffa05c9aa0>] ? obd_stale_export_put+0x150/0x290 [obdclass]
      07:42:47: [<ffffffffa05c9d01>] ? class_unlink_export+0x121/0x130 [obdclass]
      07:42:47: [<ffffffffa05e4a90>] ? class_decref+0x350/0x4d0 [obdclass]
      07:42:47: [<ffffffffa04b2b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      07:42:47: [<ffffffffa0531efc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
      07:42:47: [<ffffffffa07ea6ad>] ? target_handle_connect+0x23d/0x2bb0 [ptlrpc]
      07:42:47: [<ffffffffa04b2523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
      07:42:47: [<ffffffffa08907d2>] ? tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
      07:42:47: [<ffffffffa0838731>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
      07:42:47: [<ffffffffa08378f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      07:42:47: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
      07:42:47: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      07:42:47: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
      07:42:47: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      test failed to respond and timed out
      

          Activity

            jgmitter Joseph Gmitter (Inactive) added a comment - Landed for 2.8

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16940/
            Subject: LU-7221 ldlm: do not take a reference on target if stopping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6ff417169a2ad84478cce1b0321e70a030ffed83

            di.wang Di Wang added a comment - Another one on master https://testing.hpdd.intel.com/test_sets/112b8818-896c-11e5-8ba4-5254006e85c2
            yong.fan nasf (Inactive) added a comment - Another failure instance: https://testing.hpdd.intel.com/test_sets/e4a06a3a-86a4-11e5-bf92-5254006e85c2
            bogl Bob Glossman (Inactive) added a comment - Another on master: https://testing.hpdd.intel.com/test_sets/fa2e57a4-7c59-11e5-9ca1-5254006e85c2
            jamesanunez James Nunez (Inactive) added a comment (edited) - Another failure on master at https://testing.hpdd.intel.com/test_sets/5c507be8-7b46-11e5-9ee6-5254006e85c2
            2015-10-26 04:00:03 - https://testing.hpdd.intel.com/test_sets/87e30d88-7bba-11e5-9851-5254006e85c2
            2015-11-11 13:07:17 - https://testing.hpdd.intel.com/test_sets/622a597e-88a9-11e5-b099-5254006e85c2

            bfaccini Bruno Faccini (Inactive) added a comment - Patch http://review.whamcloud.com/16940 attempts to revert one of the changes made by patch http://review.whamcloud.com/#/c/11750/ for LU-5569. That change allows a reference to be taken on a target, upon new connection requests, even while the target is stopping/unmounting, which eventually causes an LBUG during needless [obd_self_]export cleanup operations.
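
The idea in the patch subject ("do not take a reference on target if stopping") can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the actual change in patch 16940: `toy_target`, `toy_target_try_get()`, `toy_target_put()` and `toy_handle_connect()` are hypothetical names, the `-ENODEV` return value is an assumption, and the locking that kernel code would need around the check-and-increment is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a target device.
 * This is NOT the real obd_device structure. */
struct toy_target {
    bool stopping;  /* set once unmount/failover has begun */
    int  refcount;  /* references currently held on the target */
};

/* Take a reference only if the target is not stopping.
 * Real kernel code would do the check and the increment under a lock. */
bool toy_target_try_get(struct toy_target *t)
{
    if (t->stopping)
        return false;
    t->refcount++;
    return true;
}

/* Drop a reference; dropping one that was never taken is a bug. */
void toy_target_put(struct toy_target *t)
{
    assert(t->refcount > 0);
    t->refcount--;
}

/* Hypothetical connect handler: when the target is stopping it bails out
 * before taking any reference, so there is nothing to put or clean up. */
int toy_handle_connect(struct toy_target *t)
{
    if (!toy_target_try_get(t))
        return -ENODEV; /* assumed error code for this sketch */

    /* ... connection setup would happen here ... */

    toy_target_put(t);
    return 0;
}
```

In this model the early return is the whole point: a connect request that arrives during unmount never touches the reference count, so the get and put calls stay balanced.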

            gerrit Gerrit Updater added a comment - Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/16940
            Subject: LU-7221 ldlm: do not take a reference on target if stopping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a3008de85070ab7ce9388f1fad6ec37f4989dcdd

            bfaccini Bruno Faccini (Inactive) added a comment (edited)

            Just got a very similar occurrence during replay-single/test_80d at https://testing.hpdd.intel.com/test_sets/7466dbda-799a-11e5-a447-5254006e85c2.

            A quick look at the crash dump shows that three MDT threads triggered the same LBUG at the same time; the common part of their stacks looks like this:

            .............
             #3 [ffff88007cd7bab8] lbug_with_loc at ffffffffa04a6ecb [libcfs]
             #4 [ffff88007cd7bad8] class_export_put at ffffffffa05c98b1 [obdclass]
             #5 [ffff88007cd7baf8] obd_stale_export_put at ffffffffa05c9aa0 [obdclass]
             #6 [ffff88007cd7bb28] class_unlink_export at ffffffffa05c9d01 [obdclass]
             #7 [ffff88007cd7bb48] class_decref at ffffffffa05e4778 [obdclass]
             #8 [ffff88007cd7bbb8] target_handle_connect at ffffffffa07e97ed [ptlrpc]
             #9 [ffff88007cd7bd48] tgt_request_handle at ffffffffa088fb22 [ptlrpc]
            #10 [ffff88007cd7bda8] ptlrpc_main at ffffffffa08378c1 [ptlrpc]
            #11 [ffff88007cd7bee8] kthread at ffffffff810a0fce
            #12 [ffff88007cd7bf48] kernel_thread at ffffffff8100c28a
            

            All three are working on the same "lustre-MDT0000" obd_device. Its obd_refcount is 1, but some of its other counters (obd_num_exports=-3, obd_conn_inprogress=-5) suggest a possible race condition in which multiple threads executed the same connect-request handling path while the MDT target was being unmounted (obd_stopping).

            Looking at the code concerned, I think this was introduced by patch http://review.whamcloud.com/#/c/11750/ for LU-5569, where class_incref()/class_decref() calls are now done even when obd_stopping ...
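
For readers unfamiliar with the failed check, the assertion text `__v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a)` with `value: 0` says that a put was attempted on a counter that had already reached zero. The following user-space sketch models only the shape of that check. All names (`toy_export`, `toy_refcount_sane()`, `toy_export_put()`, `TOY_POISON`) are hypothetical, and the upper bound `0x5a5a5a5a` is an assumption about what the `(int)` cast yields on common compilers; the model returns `false` where the kernel code would LBUG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed upper bound: the poison pattern from the assertion text,
 * truncated to fit an int for this sketch. */
#define TOY_POISON ((int)0x5a5a5a5a)

/* Hypothetical stand-in for an export; not the real obd_export. */
struct toy_export {
    int refcount;
};

/* Same shape as the failed check: the value must be strictly positive
 * and below the poison pattern before a reference may be dropped. */
bool toy_refcount_sane(int v)
{
    return v > 0 && v < TOY_POISON;
}

/* Drop one reference. Returns false, instead of crashing, when the
 * sanity check would have failed. */
bool toy_export_put(struct toy_export *e)
{
    assert(e != NULL);
    if (!toy_refcount_sane(e->refcount))
        return false;   /* the kernel code would LBUG at this point */
    e->refcount--;
    return true;
}
```

With a starting count of 1, the first `toy_export_put()` succeeds and the second is rejected at value 0, which matches the `value: 0` reported in the log; several threads each issuing one put too many would also be consistent with the negative counters noted above.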


            adilger Andreas Dilger added a comment - Seems there are a variety of problems with unmounting the MDT or OST that need to be investigated.

            People

              bfaccini Bruno Faccini (Inactive)
              maloo Maloo