[LU-7221] replay-ost-single test_3: ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0 Created: 28/Sep/15 Updated: 15/Mar/19 Resolved: 24/Nov/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for wangdi <di.wang@intel.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/6824daa8-65d5-11e5-8d98-5254006e85c2.
The sub-test test_3 failed with the following error:
07:42:45:Lustre: DEBUG MARKER: == replay-ost-single test 3: Fail OST during write, with verification == 00:42:38 (1443426158)
07:42:45:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
07:42:45:Lustre: DEBUG MARKER: umount -d /mnt/ost1
07:42:45:Lustre: Failing over lustre-OST0000
07:42:45:Lustre: lustre-OST0000: Not available for connect from 10.2.4.116@tcp (stopping)
07:42:45:LustreError: 28990:0:(genops.c:815:class_export_put()) ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
07:42:45:LustreError: 28990:0:(genops.c:815:class_export_put()) LBUG
07:42:45:Pid: 28990, comm: ll_ost00_014
07:42:45:
07:42:45:Call Trace:
07:42:45: [<ffffffffa04a6875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
07:42:45: [<ffffffffa04a6e77>] lbug_with_loc+0x47/0xb0 [libcfs]
07:42:45: [<ffffffffa05c98b1>] class_export_put+0x271/0x310 [obdclass]
07:42:45: [<ffffffffa05c9aa0>] obd_stale_export_put+0x150/0x290 [obdclass]
07:42:45: [<ffffffffa05c9d01>] class_unlink_export+0x121/0x130 [obdclass]
07:42:45: [<ffffffffa05e4a90>] class_decref+0x350/0x4d0 [obdclass]
07:42:45: [<ffffffffa04b2b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
07:42:45: [<ffffffffa0531efc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
07:42:45: [<ffffffffa07ea6ad>] target_handle_connect+0x23d/0x2bb0 [ptlrpc]
07:42:45: [<ffffffffa04b2523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
07:42:46: [<ffffffffa08907d2>] tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
07:42:46: [<ffffffffa0838731>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
07:42:46: [<ffffffffa08378f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
07:42:46: [<ffffffff810a101e>] kthread+0x9e/0xc0
07:42:46: [<ffffffff8100c28a>] child_rip+0xa/0x20
07:42:46: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
07:42:46: [<ffffffff8100c280>] ? child_rip+0x0/0x20
07:42:46:
07:42:46:Kernel panic - not syncing: LBUG
07:42:46:Pid: 28990, comm: ll_ost00_014 Not tainted 2.6.32-573.3.1.el6_lustre.g00880a0.x86_64 #1
07:42:46:Call Trace:
07:42:46: [<ffffffff815384e4>] ? panic+0xa7/0x16f
07:42:47: [<ffffffffa04a6ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
07:42:47: [<ffffffffa05c98b1>] ? class_export_put+0x271/0x310 [obdclass]
07:42:47: [<ffffffffa05c9aa0>] ? obd_stale_export_put+0x150/0x290 [obdclass]
07:42:47: [<ffffffffa05c9d01>] ? class_unlink_export+0x121/0x130 [obdclass]
07:42:47: [<ffffffffa05e4a90>] ? class_decref+0x350/0x4d0 [obdclass]
07:42:47: [<ffffffffa04b2b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
07:42:47: [<ffffffffa0531efc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
07:42:47: [<ffffffffa07ea6ad>] ? target_handle_connect+0x23d/0x2bb0 [ptlrpc]
07:42:47: [<ffffffffa04b2523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
07:42:47: [<ffffffffa08907d2>] ? tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
07:42:47: [<ffffffffa0838731>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
07:42:47: [<ffffffffa08378f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
07:42:47: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
07:42:47: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
07:42:47: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
07:42:47: [<ffffffff8100c280>] ? child_rip+0x0/0x20
test failed to respond and timed out |
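For readers decoding the assertion text above, here is a minimal Python sketch of the range check it prints. It is a toy model, not Lustre code: it assumes the checked value `__v` is a reference count, and the names `POISON`, `refcount_in_range`, `CheckedRef`, and `demo` are hypothetical. The logged bound `(int)0x5a5a5a5a5a5a5a5a` is modeled as the low 32 bits of that constant, which is what a cast to a 32-bit `int` keeps in practice.

```python
# Hypothetical model of the logged check "__v > 0 && __v < poison".
# Not Lustre code; all names here are illustrative assumptions.

# Low 32 bits of the 64-bit pattern, as a cast to a 32-bit int would keep.
POISON = 0x5A5A5A5A5A5A5A5A & 0xFFFFFFFF


def refcount_in_range(value: int) -> bool:
    """Return True when the value is positive and below the poison pattern."""
    return 0 < value < POISON


class CheckedRef:
    """Toy reference counter that validates its value before every put."""

    def __init__(self, count: int = 1) -> None:
        self.count = count

    def get(self) -> None:
        self.count += 1

    def put(self) -> None:
        # A put on a counter that is already 0 (or looks like freed,
        # poisoned memory) is reported instead of being applied.
        if not refcount_in_range(self.count):
            raise AssertionError(f"failed: value: {self.count}")
        self.count -= 1


def demo() -> str:
    ref = CheckedRef()  # one reference held
    ref.put()           # balanced put: 1 -> 0
    try:
        ref.put()       # extra put with no matching get
    except AssertionError as exc:
        return str(exc)
    return "no failure"
```

Under these assumptions, a "value: 0" failure corresponds to one more put than there were references, rather than to a poisoned (already freed) object, which would show a value at or above the poison pattern.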
| Comments |
| Comment by James Nunez (Inactive) [ 06/Oct/15 ] |
|
Same LBUG seen hanging replay-single test_7 with logs at https://testing.hpdd.intel.com/test_sets/9d7825ca-6bf7-11e5-87fb-5254006e85c2
From the MDS console:
22:48:00:Lustre: lustre-MDT0000: Not available for connect from 10.1.4.104@tcp (stopping)
22:48:00:Lustre: Skipped 9 previous similar messages
22:48:00:LustreError: 21346:0:(genops.c:815:class_export_put()) ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
22:48:00:LustreError: 21346:0:(genops.c:815:class_export_put()) LBUG
22:48:00:Pid: 21346, comm: mdt00_001
22:48:00:
22:48:00:Call Trace:
22:48:00: [<ffffffffa049b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
22:48:00: [<ffffffffa049be77>] lbug_with_loc+0x47/0xb0 [libcfs]
22:48:00: [<ffffffffa05bea21>] class_export_put+0x271/0x310 [obdclass]
22:48:00: [<ffffffffa05bec10>] obd_stale_export_put+0x150/0x290 [obdclass]
22:48:00: [<ffffffffa05bee71>] class_unlink_export+0x121/0x130 [obdclass]
22:48:00: [<ffffffffa05d9c00>] class_decref+0x350/0x4d0 [obdclass]
22:48:00: [<ffffffffa04a7b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
22:48:00: [<ffffffffa0526efc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
22:48:00: [<ffffffffa07df6ad>] target_handle_connect+0x23d/0x2bb0 [ptlrpc]
22:48:00: [<ffffffffa04a7523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
22:48:00: [<ffffffffa05f1a45>] ? keys_fill+0x25/0x1b0 [obdclass]
22:48:00: [<ffffffffa08857d2>] tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
22:48:00: [<ffffffffa082d731>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
22:48:00: [<ffffffffa082c8f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
22:48:00: [<ffffffff810a101e>] kthread+0x9e/0xc0
22:48:00: [<ffffffff8100c28a>] child_rip+0xa/0x20
22:48:00: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
22:48:00: [<ffffffff8100c280>] ? child_rip+0x0/0x20
22:48:00:
22:48:00:
22:48:00:Kernel panic - not syncing: LBUG
22:48:00:Pid: 21346, comm: mdt00_001 Not tainted 2.6.32-573.3.1.el6_lustre.g00880a0.x86_64 #1
22:48:00:Call Trace:
22:48:00: [<ffffffff815384e4>] ? panic+0xa7/0x16f
22:48:00: [<ffffffffa049becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
22:48:00: [<ffffffffa05bea21>] ? class_export_put+0x271/0x310 [obdclass]
22:48:00: [<ffffffffa05bec10>] ? obd_stale_export_put+0x150/0x290 [obdclass]
22:48:00: [<ffffffffa05bee71>] ? class_unlink_export+0x121/0x130 [obdclass]
22:48:00: [<ffffffffa05d9c00>] ? class_decref+0x350/0x4d0 [obdclass]
22:48:00: [<ffffffffa04a7b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
22:48:00: [<ffffffffa0526efc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
22:48:00: [<ffffffffa07df6ad>] ? target_handle_connect+0x23d/0x2bb0 [ptlrpc]
22:48:00: [<ffffffffa04a7523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
22:48:00: [<ffffffffa05f1a45>] ? keys_fill+0x25/0x1b0 [obdclass]
22:48:00: [<ffffffffa08857d2>] ? tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
22:48:00: [<ffffffffa082d731>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
22:48:00: [<ffffffffa082c8f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
22:48:00: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
22:48:00: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
22:48:00: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
22:48:00: [<ffffffff8100c280>] ? child_rip+0x0/0x20
22:48:00:Initializing cgroup subsys cpuset |
| Comment by Bruno Faccini (Inactive) [ 06/Oct/15 ] |
|
+1 at https://testing.hpdd.intel.com/test_sets/9d7825ca-6bf7-11e5-87fb-5254006e85c2 |
| Comment by James Nunez (Inactive) [ 07/Oct/15 ] |
|
This LBUG was also seen in a replay-single test_53g timeout. Logs at https://testing.hpdd.intel.com/test_sets/13337056-6c72-11e5-9ae6-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 17/Oct/15 ] |
|
seen in replay-single, test 93 on master:
console log of OST shows:
05:39:24:LustreError: 9188:0:(genops.c:815:class_export_put()) ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
05:39:24:LustreError: 9188:0:(genops.c:815:class_export_put()) LBUG
05:39:24:Pid: 9188, comm: ll_ost00_012
05:39:24:
05:39:24:Call Trace:
05:39:24: [<ffffffffa079f875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
05:39:24: [<ffffffffa079fe77>] lbug_with_loc+0x47/0xb0 [libcfs]
05:39:24: [<ffffffffa08c28b1>] class_export_put+0x271/0x310 [obdclass]
05:39:24: [<ffffffffa08c2aa0>] obd_stale_export_put+0x150/0x290 [obdclass]
05:39:24: [<ffffffffa08c2d01>] class_unlink_export+0x121/0x130 [obdclass]
05:39:24: [<ffffffffa08dd778>] class_decref+0x348/0x4c0 [obdclass]
05:39:24: [<ffffffffa07abb61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
05:39:24: [<ffffffffa082aefc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
05:39:24: [<ffffffffa0ae27ed>] target_handle_connect+0x23d/0x2ba0 [ptlrpc]
05:39:24: [<ffffffffa07ab523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
05:39:24: [<ffffffffa0b88dc2>] tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
05:39:24: [<ffffffffa0b309c1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
05:39:24: [<ffffffffa0b2fb80>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
05:39:24: [<ffffffff810a0fce>] kthread+0x9e/0xc0
05:39:24: [<ffffffff8100c28a>] child_rip+0xa/0x20
05:39:24: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
05:39:24: [<ffffffff8100c280>] ? child_rip+0x0/0x20
05:39:24:
05:39:24:Kernel panic - not syncing: LBUG |
| Comment by Bob Glossman (Inactive) [ 18/Oct/15 ] |
|
another in replay-single test 62 on master: This appears to be happening all over the place in various tests in replay-single. |
| Comment by Bob Glossman (Inactive) [ 22/Oct/15 ] |
|
another seen in replay-single, test_35 on master: |
| Comment by Andreas Dilger [ 22/Oct/15 ] |
|
Seems there are a variety of problems with unmounting the MDT or OST that need to be investigated. |
| Comment by Bruno Faccini (Inactive) [ 26/Oct/15 ] |
|
Just got a very similar occurrence during replay-single/test_80d at https://testing.hpdd.intel.com/test_sets/7466dbda-799a-11e5-a447-5254006e85c2.
A quick look at the crash dump shows that 3 MDT threads triggered the same LBUG at the same time, and the common part of their stacks looks like:
.............
#3 [ffff88007cd7bab8] lbug_with_loc at ffffffffa04a6ecb [libcfs]
#4 [ffff88007cd7bad8] class_export_put at ffffffffa05c98b1 [obdclass]
#5 [ffff88007cd7baf8] obd_stale_export_put at ffffffffa05c9aa0 [obdclass]
#6 [ffff88007cd7bb28] class_unlink_export at ffffffffa05c9d01 [obdclass]
#7 [ffff88007cd7bb48] class_decref at ffffffffa05e4778 [obdclass]
#8 [ffff88007cd7bbb8] target_handle_connect at ffffffffa07e97ed [ptlrpc]
#9 [ffff88007cd7bd48] tgt_request_handle at ffffffffa088fb22 [ptlrpc]
#10 [ffff88007cd7bda8] ptlrpc_main at ffffffffa08378c1 [ptlrpc]
#11 [ffff88007cd7bee8] kthread at ffffffff810a0fce
#12 [ffff88007cd7bf48] kernel_thread at ffffffff8100c28a
All 3 threads are working on the same "lustre-MDT0000" obd_device. Its obd_refcount=1, but some of its other counters (obd_num_exports=-3, obd_conn_inprogress=-5) seem to indicate a possible race condition where multiple threads have executed the same code path handling connect requests while the MDT target is being unmounted (obd_stopping).
Looking at the code concerned, I think this was introduced by patch http://review.whamcloud.com/#/c/11750/ for
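To illustrate how counters such as obd_conn_inprogress can end up negative in the situation described above, here is a small Python sketch. It is a hypothetical model only: it does not reproduce the Lustre connect path or the actual defect. The `Device` class and both handlers are invented for illustration; only the counter name and the "stopping" state are borrowed from the comment.

```python
import threading


class Device:
    """Toy stand-in for an obd_device; models a single counter."""

    def __init__(self) -> None:
        self.stopping = False
        self.conn_inprogress = 0
        self.lock = threading.Lock()


def connect_unbalanced(dev: Device) -> None:
    """Hypothetical handler: skips the increment while the device is
    stopping, yet still runs the shared cleanup that decrements."""
    if not dev.stopping:
        with dev.lock:
            dev.conn_inprogress += 1
    with dev.lock:  # cleanup path, reached in both cases
        dev.conn_inprogress -= 1


def connect_balanced(dev: Device) -> None:
    """Hypothetical alternative: return before touching the counter."""
    if dev.stopping:
        return
    with dev.lock:
        dev.conn_inprogress += 1
    with dev.lock:
        dev.conn_inprogress -= 1


def run_connects(handler, nthreads: int) -> int:
    """Run nthreads connect attempts against a stopping device and
    return the final counter value."""
    dev = Device()
    dev.stopping = True
    threads = [threading.Thread(target=handler, args=(dev,))
               for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dev.conn_inprogress
```

In this model, `run_connects(connect_unbalanced, 5)` ends at -5 because each thread takes the decrement without the increment, while `run_connects(connect_balanced, 5)` ends at 0. The same kind of imbalance on a reference count would trip a "value: 0" check on the put.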
| Comment by Gerrit Updater [ 26/Oct/15 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/16940 |
| Comment by Bruno Faccini (Inactive) [ 26/Oct/15 ] |
|
Patch http://review.whamcloud.com/16940 is an attempt to revert one of the changes of patch http://review.whamcloud.com/#/c/11750/ for |
| Comment by James Nunez (Inactive) [ 26/Oct/15 ] |
|
Another failure on master at https://testing.hpdd.intel.com/test_sets/5c507be8-7b46-11e5-9ee6-5254006e85c2
2015-10-26 04:00:03 - https://testing.hpdd.intel.com/test_sets/87e30d88-7bba-11e5-9851-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 27/Oct/15 ] |
|
another on master; |
| Comment by nasf (Inactive) [ 09/Nov/15 ] |
|
Another failure instance: |
| Comment by Di Wang [ 12/Nov/15 ] |
|
Another one on master |
| Comment by Gerrit Updater [ 24/Nov/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16940/ |
| Comment by Joseph Gmitter (Inactive) [ 24/Nov/15 ] |
|
Landed for 2.8 |