[LU-7221] replay-ost-single test_3: ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
Created: 28/Sep/15  Updated: 15/Mar/19  Resolved: 24/Nov/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Bruno Faccini (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-7411 conf-sanity test_21b: short descripti... Closed
Related
is related to LU-7038 obdfilter-survey test_3a: (lu_object.... Resolved
is related to LU-7172 replay-single test_70d hung on MDT un... Resolved
is related to LU-7256 sanity-lfsck TIMEOUT on umount /mnt/mds4 Resolved
is related to LU-7326 ost-pools hangs on OST unmount Resolved
is related to LU-5569 recreating a reverse import produce a... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/6824daa8-65d5-11e5-8d98-5254006e85c2.

The sub-test test_3 failed with the following error:

07:42:45:Lustre: DEBUG MARKER: == replay-ost-single test 3: Fail OST during write, with verification == 00:42:38 (1443426158)
07:42:45:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
07:42:45:Lustre: DEBUG MARKER: umount -d /mnt/ost1
07:42:45:Lustre: Failing over lustre-OST0000
07:42:45:Lustre: lustre-OST0000: Not available for connect from 10.2.4.116@tcp (stopping)
07:42:45:LustreError: 28990:0:(genops.c:815:class_export_put()) ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
07:42:45:LustreError: 28990:0:(genops.c:815:class_export_put()) LBUG
07:42:45:Pid: 28990, comm: ll_ost00_014
07:42:45:
07:42:45:Call Trace:
07:42:45: [<ffffffffa04a6875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
07:42:45: [<ffffffffa04a6e77>] lbug_with_loc+0x47/0xb0 [libcfs]
07:42:45: [<ffffffffa05c98b1>] class_export_put+0x271/0x310 [obdclass]
07:42:45: [<ffffffffa05c9aa0>] obd_stale_export_put+0x150/0x290 [obdclass]
07:42:45: [<ffffffffa05c9d01>] class_unlink_export+0x121/0x130 [obdclass]
07:42:45: [<ffffffffa05e4a90>] class_decref+0x350/0x4d0 [obdclass]
07:42:45: [<ffffffffa04b2b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
07:42:45: [<ffffffffa0531efc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
07:42:45: [<ffffffffa07ea6ad>] target_handle_connect+0x23d/0x2bb0 [ptlrpc]
07:42:45: [<ffffffffa04b2523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
07:42:46: [<ffffffffa08907d2>] tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
07:42:46: [<ffffffffa0838731>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
07:42:46: [<ffffffffa08378f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
07:42:46: [<ffffffff810a101e>] kthread+0x9e/0xc0
07:42:46: [<ffffffff8100c28a>] child_rip+0xa/0x20
07:42:46: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
07:42:46: [<ffffffff8100c280>] ? child_rip+0x0/0x20
07:42:46:
07:42:46:Kernel panic - not syncing: LBUG
07:42:46:Pid: 28990, comm: ll_ost00_014 Not tainted 2.6.32-573.3.1.el6_lustre.g00880a0.x86_64 #1
07:42:46:Call Trace:
07:42:46: [<ffffffff815384e4>] ? panic+0xa7/0x16f
07:42:47: [<ffffffffa04a6ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
07:42:47: [<ffffffffa05c98b1>] ? class_export_put+0x271/0x310 [obdclass]
07:42:47: [<ffffffffa05c9aa0>] ? obd_stale_export_put+0x150/0x290 [obdclass]
07:42:47: [<ffffffffa05c9d01>] ? class_unlink_export+0x121/0x130 [obdclass]
07:42:47: [<ffffffffa05e4a90>] ? class_decref+0x350/0x4d0 [obdclass]
07:42:47: [<ffffffffa04b2b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
07:42:47: [<ffffffffa0531efc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
07:42:47: [<ffffffffa07ea6ad>] ? target_handle_connect+0x23d/0x2bb0 [ptlrpc]
07:42:47: [<ffffffffa04b2523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
07:42:47: [<ffffffffa08907d2>] ? tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
07:42:47: [<ffffffffa0838731>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
07:42:47: [<ffffffffa08378f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
07:42:47: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
07:42:47: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
07:42:47: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
07:42:47: [<ffffffff8100c280>] ? child_rip+0x0/0x20
test failed to respond and timed out


 Comments   
Comment by James Nunez (Inactive) [ 06/Oct/15 ]

Same LBUG seen hanging replay-single test_7 with logs at https://testing.hpdd.intel.com/test_sets/9d7825ca-6bf7-11e5-87fb-5254006e85c2

From the MDS console:

22:48:00:Lustre: lustre-MDT0000: Not available for connect from 10.1.4.104@tcp (stopping)
22:48:00:Lustre: Skipped 9 previous similar messages
22:48:00:LustreError: 21346:0:(genops.c:815:class_export_put()) ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
22:48:00:LustreError: 21346:0:(genops.c:815:class_export_put()) LBUG
22:48:00:Pid: 21346, comm: mdt00_001
22:48:00:
22:48:00:Call Trace:
22:48:00: [<ffffffffa049b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
22:48:00: [<ffffffffa049be77>] lbug_with_loc+0x47/0xb0 [libcfs]
22:48:00: [<ffffffffa05bea21>] class_export_put+0x271/0x310 [obdclass]
22:48:00: [<ffffffffa05bec10>] obd_stale_export_put+0x150/0x290 [obdclass]
22:48:00: [<ffffffffa05bee71>] class_unlink_export+0x121/0x130 [obdclass]
22:48:00: [<ffffffffa05d9c00>] class_decref+0x350/0x4d0 [obdclass]
22:48:00: [<ffffffffa04a7b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
22:48:00: [<ffffffffa0526efc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
22:48:00: [<ffffffffa07df6ad>] target_handle_connect+0x23d/0x2bb0 [ptlrpc]
22:48:00: [<ffffffffa04a7523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
22:48:00: [<ffffffffa05f1a45>] ? keys_fill+0x25/0x1b0 [obdclass]
22:48:00: [<ffffffffa08857d2>] tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
22:48:00: [<ffffffffa082d731>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
22:48:00: [<ffffffffa082c8f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
22:48:00: [<ffffffff810a101e>] kthread+0x9e/0xc0
22:48:00: [<ffffffff8100c28a>] child_rip+0xa/0x20
22:48:00: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
22:48:00: [<ffffffff8100c280>] ? child_rip+0x0/0x20
22:48:00:
22:48:00:
22:48:00:Kernel panic - not syncing: LBUG
22:48:00:Pid: 21346, comm: mdt00_001 Not tainted 2.6.32-573.3.1.el6_lustre.g00880a0.x86_64 #1
22:48:00:Call Trace:
22:48:00: [<ffffffff815384e4>] ? panic+0xa7/0x16f
22:48:00: [<ffffffffa049becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
22:48:00: [<ffffffffa05bea21>] ? class_export_put+0x271/0x310 [obdclass]
22:48:00: [<ffffffffa05bec10>] ? obd_stale_export_put+0x150/0x290 [obdclass]
22:48:00: [<ffffffffa05bee71>] ? class_unlink_export+0x121/0x130 [obdclass]
22:48:00: [<ffffffffa05d9c00>] ? class_decref+0x350/0x4d0 [obdclass]
22:48:00: [<ffffffffa04a7b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
22:48:00: [<ffffffffa0526efc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
22:48:00: [<ffffffffa07df6ad>] ? target_handle_connect+0x23d/0x2bb0 [ptlrpc]
22:48:00: [<ffffffffa04a7523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
22:48:00: [<ffffffffa05f1a45>] ? keys_fill+0x25/0x1b0 [obdclass]
22:48:00: [<ffffffffa08857d2>] ? tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
22:48:00: [<ffffffffa082d731>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
22:48:00: [<ffffffffa082c8f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
22:48:00: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
22:48:00: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
22:48:00: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
22:48:00: [<ffffffff8100c280>] ? child_rip+0x0/0x20
22:48:00:Initializing cgroup subsys cpuset
Comment by Bruno Faccini (Inactive) [ 06/Oct/15 ]

+1 at https://testing.hpdd.intel.com/test_sets/9d7825ca-6bf7-11e5-87fb-5254006e85c2

Comment by James Nunez (Inactive) [ 07/Oct/15 ]

This LBUG was also seen in a replay-single test_53g timeout. Logs at https://testing.hpdd.intel.com/test_sets/13337056-6c72-11e5-9ae6-5254006e85c2

Comment by Bob Glossman (Inactive) [ 17/Oct/15 ]

seen in replay-single, test 93 on master:
https://testing.hpdd.intel.com/test_sessions/50f53620-74c1-11e5-b8c1-5254006e85c2

console log of OST shows:

05:39:24:LustreError: 9188:0:(genops.c:815:class_export_put()) ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
05:39:24:LustreError: 9188:0:(genops.c:815:class_export_put()) LBUG
05:39:24:Pid: 9188, comm: ll_ost00_012
05:39:24:
05:39:24:Call Trace:
05:39:24: [<ffffffffa079f875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
05:39:24: [<ffffffffa079fe77>] lbug_with_loc+0x47/0xb0 [libcfs]
05:39:24: [<ffffffffa08c28b1>] class_export_put+0x271/0x310 [obdclass]
05:39:24: [<ffffffffa08c2aa0>] obd_stale_export_put+0x150/0x290 [obdclass]
05:39:24: [<ffffffffa08c2d01>] class_unlink_export+0x121/0x130 [obdclass]
05:39:24: [<ffffffffa08dd778>] class_decref+0x348/0x4c0 [obdclass]
05:39:24: [<ffffffffa07abb61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
05:39:24: [<ffffffffa082aefc>] ? libcfs_nid2str_r+0x11c/0x140 [lnet]
05:39:24: [<ffffffffa0ae27ed>] target_handle_connect+0x23d/0x2ba0 [ptlrpc]
05:39:24: [<ffffffffa07ab523>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
05:39:24: [<ffffffffa0b88dc2>] tgt_request_handle+0x5a2/0x12e0 [ptlrpc]
05:39:24: [<ffffffffa0b309c1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
05:39:24: [<ffffffffa0b2fb80>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
05:39:24: [<ffffffff810a0fce>] kthread+0x9e/0xc0
05:39:24: [<ffffffff8100c28a>] child_rip+0xa/0x20
05:39:24: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
05:39:24: [<ffffffff8100c280>] ? child_rip+0x0/0x20
05:39:24:
05:39:24:Kernel panic - not syncing: LBUG
Comment by Bob Glossman (Inactive) [ 18/Oct/15 ]

another in replay-single test 62 on master:
https://testing.hpdd.intel.com/test_sets/6c2d9470-7537-11e5-95e7-5254006e85c2

This appears to be happening all over the place in various tests in replay-single.

Comment by Bob Glossman (Inactive) [ 22/Oct/15 ]

another seen in replay-single, test_35 on master:
https://testing.hpdd.intel.com/test_sets/d890aae6-78d2-11e5-9aa5-5254006e85c2

Comment by Andreas Dilger [ 22/Oct/15 ]

Seems there are a variety of problems with unmounting the MDT or OST that need to be investigated.

Comment by Bruno Faccini (Inactive) [ 26/Oct/15 ]

Just got a very similar occurrence during replay-single/test_80d at https://testing.hpdd.intel.com/test_sets/7466dbda-799a-11e5-a447-5254006e85c2.

A quick look at the crash dump shows that 3 MDT threads triggered the same LBUG at the same time; the common part of their stacks looks like:

.............
 #3 [ffff88007cd7bab8] lbug_with_loc at ffffffffa04a6ecb [libcfs]
 #4 [ffff88007cd7bad8] class_export_put at ffffffffa05c98b1 [obdclass]
 #5 [ffff88007cd7baf8] obd_stale_export_put at ffffffffa05c9aa0 [obdclass]
 #6 [ffff88007cd7bb28] class_unlink_export at ffffffffa05c9d01 [obdclass]
 #7 [ffff88007cd7bb48] class_decref at ffffffffa05e4778 [obdclass]
 #8 [ffff88007cd7bbb8] target_handle_connect at ffffffffa07e97ed [ptlrpc]
 #9 [ffff88007cd7bd48] tgt_request_handle at ffffffffa088fb22 [ptlrpc]
#10 [ffff88007cd7bda8] ptlrpc_main at ffffffffa08378c1 [ptlrpc]
#11 [ffff88007cd7bee8] kthread at ffffffff810a0fce
#12 [ffff88007cd7bf48] kernel_thread at ffffffff8100c28a

All 3 are working on the same "lustre-MDT0000" obd_device. Its obd_refcount=1, but some of its other counters (obd_num_exports=-3, obd_conn_inprogress=-5) seem to indicate a possible race condition, where multiple threads have executed the same code path handling connect requests while the MDT target is being unmounted (obd_stopping).

Having looked at the code concerned, I think this was introduced by patch http://review.whamcloud.com/#/c/11750/ for LU-5569, where class_incref()/class_decref() calls are now made even when obd_stopping is set.
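
The assertion text in the logs can be read as a sanity check on a reference count taken before it is dropped. Below is a minimal user-space sketch of that idea, not the Lustre source: struct model_export, REF_POISON, ref_is_sane() and model_export_put() are hypothetical names, and the sketch returns false where the kernel calls LBUG. It assumes the (int) cast of 0x5a5a5a5a5a5a5a5a yields 0x5a5a5a5a, as it would on the x86_64 builds shown in the traces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the poison bound seen in the assertion text. */
#define REF_POISON 0x5a5a5a5a

/* Hypothetical, simplified export: only the reference count is modelled. */
struct model_export {
    int refcount;
};

/* Mirror of the logged condition: __v > 0 && __v < poison. */
static bool ref_is_sane(int v)
{
    return v > 0 && v < REF_POISON;
}

/*
 * Drop one reference. Returns false (instead of calling LBUG as the
 * kernel does) when the count is already outside the sane range,
 * which means this put has no matching get.
 */
static bool model_export_put(struct model_export *exp)
{
    assert(exp != NULL);
    if (!ref_is_sane(exp->refcount)) {
        fprintf(stderr, "unbalanced put, value: %d\n", exp->refcount);
        return false;
    }
    exp->refcount--;
    return true;
}
```

In this model, a second put on an export that held a single reference is rejected with "value: 0", which matches the value reported in the console logs above.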

Comment by Gerrit Updater [ 26/Oct/15 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/16940
Subject: LU-7221 ldlm: do not take a reference on target if stopping
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a3008de85070ab7ce9388f1fad6ec37f4989dcdd

Comment by Bruno Faccini (Inactive) [ 26/Oct/15 ]

Patch http://review.whamcloud.com/16940 is an attempt to revert one of the changes made by patch http://review.whamcloud.com/#/c/11750/ for LU-5569. That change allows a reference to be taken on a target upon new connection requests even if it is stopping/unmounting, which finally causes an LBUG during useless [obd_self_]export cleanup operations.
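
The general pattern named in the patch subject ("do not take a reference on target if stopping") can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the content of the patch: struct model_target and the model_* functions are hypothetical, and real kernel code would need the flag test and the increment to happen under the same lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified target: not the real struct obd_device. */
struct model_target {
    bool stopping;   /* set once unmount has begun (cf. obd_stopping) */
    int  refcount;   /* cf. obd_refcount */
};

/*
 * Take a reference only if the target is not stopping.
 * Returns true when a reference was taken and must later be dropped.
 */
static bool model_target_get_unless_stopping(struct model_target *t)
{
    assert(t != NULL);
    if (t->stopping)
        return false;   /* refused: no reference taken, nothing to undo */
    t->refcount++;
    return true;
}

/* Drop a reference previously taken with the function above. */
static void model_target_put(struct model_target *t)
{
    assert(t != NULL && t->refcount > 0);
    t->refcount--;
}

/*
 * Model of a connect handler. Returns 0 on success, -1 if refused.
 * The put runs only on the path where the get succeeded, so a
 * stopping target never sees an extra decrement or cleanup pass.
 */
static int model_handle_connect(struct model_target *t)
{
    if (!model_target_get_unless_stopping(t))
        return -1;
    /* ... connection work would go here ... */
    model_target_put(t);
    return 0;
}
```

With the guard in place, a connect request that arrives after stopping is set leaves the counters untouched, so the cleanup path that hit the LBUG in the traces is never entered for that request.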

Comment by James Nunez (Inactive) [ 26/Oct/15 ]

Another failure on master at https://testing.hpdd.intel.com/test_sets/5c507be8-7b46-11e5-9ee6-5254006e85c2

2015-10-26 04:00:03 - https://testing.hpdd.intel.com/test_sets/87e30d88-7bba-11e5-9851-5254006e85c2
2015-11-11 13:07:17 - https://testing.hpdd.intel.com/test_sets/622a597e-88a9-11e5-b099-5254006e85c2

Comment by Bob Glossman (Inactive) [ 27/Oct/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/fa2e57a4-7c59-11e5-9ca1-5254006e85c2

Comment by nasf (Inactive) [ 09/Nov/15 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/e4a06a3a-86a4-11e5-bf92-5254006e85c2

Comment by Di Wang [ 12/Nov/15 ]

Another one on master
https://testing.hpdd.intel.com/test_sets/112b8818-896c-11e5-8ba4-5254006e85c2

Comment by Gerrit Updater [ 24/Nov/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16940/
Subject: LU-7221 ldlm: do not take a reference on target if stopping
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6ff417169a2ad84478cce1b0321e70a030ffed83

Comment by Joseph Gmitter (Inactive) [ 24/Nov/15 ]

Landed for 2.8
