[LU-11801] replay-vbr test 0b crashes with an LBUG/ASSERTION( ctxt ) Created: 17/Dec/18  Updated: 15/Apr/20  Resolved: 15/Apr/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: ubuntu
Environment:

Ubuntu 18.04


Issue Links:
Related
is related to LU-9337 LBUG replay-single test_0b: test fail... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-vbr test_0b crashes for Ubuntu 18.04 clients with RHEL 7.6 servers. This test started crashing on 27 November 2018.

Looking at the kernel crash from https://testing.whamcloud.com/test_sets/9f692e08-fdc9-11e8-93ea-52540065bddc, we see:

[ 5308.450564] Lustre: DEBUG MARKER: == replay-vbr test 0b: getversion for non existent fid shouldn't cause kernel panic ================== 21:08:17 (1544562497)
[ 5308.527820] LustreError: 12286:0:(osp_sync.c:346:osp_sync_declare_add()) ASSERTION( ctxt ) failed: 
[ 5308.528714] LustreError: 12286:0:(osp_sync.c:346:osp_sync_declare_add()) LBUG
[ 5308.529382] Pid: 12286, comm: mdt00_000 3.10.0-957.el7_lustre.x86_64 #1 SMP Sat Dec 8 05:53:16 UTC 2018
[ 5308.530265] Call Trace:
[ 5308.530534]  [<ffffffffc079d7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 5308.531258]  [<ffffffffc079d87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 5308.531885]  [<ffffffffc11e0b89>] osp_sync_declare_add+0x3b9/0x3f0 [osp]
[ 5308.532569]  [<ffffffffc11d0ce3>] osp_declare_destroy+0x1a3/0x1f0 [osp]
[ 5308.533334]  [<ffffffffc111a85e>] lod_sub_declare_destroy+0xce/0x2d0 [lod]
[ 5308.534219]  [<ffffffffc10f7a3d>] lod_obj_stripe_destroy_cb+0x8d/0xa0 [lod]
[ 5308.534955]  [<ffffffffc110423e>] lod_obj_for_each_stripe+0x11e/0x2d0 [lod]
[ 5308.535718]  [<ffffffffc110504f>] lod_declare_destroy+0x45f/0x5e0 [lod]
[ 5308.536452]  [<ffffffffc116b081>] mdd_declare_finish_unlink+0x91/0x210 [mdd]
[ 5308.537193]  [<ffffffffc117a9af>] mdd_unlink+0x4bf/0xad0 [mdd]
[ 5308.537829]  [<ffffffffc1043089>] mdo_unlink+0x46/0x48 [mdt]
[ 5308.538539]  [<ffffffffc1005e69>] mdt_reint_unlink+0xb49/0x14a0 [mdt]
[ 5308.539308]  [<ffffffffc100c5e3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 5308.539937]  [<ffffffffc0fe9133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[ 5308.540621]  [<ffffffffc0ff4497>] mdt_reint+0x67/0x140 [mdt]
[ 5308.541262]  [<ffffffffc0c8535a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 5308.542296]  [<ffffffffc0c2992b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 5308.543087]  [<ffffffffc0c2d25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[ 5308.543835]  [<ffffffff9bcc1c31>] kthread+0xd1/0xe0
[ 5308.544389]  [<ffffffff9c374c37>] ret_from_fork_nospec_end+0x0/0x39
[ 5308.545029]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 5308.545592] Kernel panic - not syncing: LBUG
[ 5308.546008] CPU: 0 PID: 12286 Comm: mdt00_000 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.el7_lustre.x86_64 #1
[ 5308.547086] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 5308.547649] Call Trace:
[ 5308.547916]  [<ffffffff9c361dc1>] dump_stack+0x19/0x1b
[ 5308.548409]  [<ffffffff9c35b4d0>] panic+0xe8/0x21f
[ 5308.548865]  [<ffffffffc079d8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 5308.549448]  [<ffffffffc11e0b89>] osp_sync_declare_add+0x3b9/0x3f0 [osp]
[ 5308.550080]  [<ffffffffc11d0ce3>] osp_declare_destroy+0x1a3/0x1f0 [osp]
[ 5308.550705]  [<ffffffffc111a85e>] lod_sub_declare_destroy+0xce/0x2d0 [lod]
[ 5308.551377]  [<ffffffffc10f7a3d>] lod_obj_stripe_destroy_cb+0x8d/0xa0 [lod]
[ 5308.552040]  [<ffffffffc110423e>] lod_obj_for_each_stripe+0x11e/0x2d0 [lod]
[ 5308.552697]  [<ffffffffc110504f>] lod_declare_destroy+0x45f/0x5e0 [lod]
[ 5308.553459]  [<ffffffffc09e4ca4>] ? lu_env_refill+0x24/0x30 [obdclass]
[ 5308.554081]  [<ffffffffc10f79b0>] ? lod_xattr_list+0x150/0x150 [lod]
[ 5308.554674]  [<ffffffffc116b081>] mdd_declare_finish_unlink+0x91/0x210 [mdd]
[ 5308.555363]  [<ffffffffc117a9af>] mdd_unlink+0x4bf/0xad0 [mdd]
[ 5308.555929]  [<ffffffffc1043089>] mdo_unlink+0x46/0x48 [mdt]
[ 5308.556469]  [<ffffffffc1005e69>] mdt_reint_unlink+0xb49/0x14a0 [mdt]
[ 5308.557088]  [<ffffffffc100c5e3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 5308.557663]  [<ffffffffc0fe9133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[ 5308.558293]  [<ffffffffc0ff13f4>] ? mdt_thread_info_init+0xa4/0x1e0 [mdt]
[ 5308.558933]  [<ffffffffc0ff4497>] mdt_reint+0x67/0x140 [mdt]
[ 5308.559512]  [<ffffffffc0c8535a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 5308.560184]  [<ffffffffc07a3f07>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 5308.560834]  [<ffffffffc0c2992b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 5308.561558]  [<ffffffff9bccba9b>] ? __wake_up_common+0x5b/0x90
[ 5308.562151]  [<ffffffffc0c2d25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[ 5308.562750]  [<ffffffff9bcd0880>] ? finish_task_switch+0x50/0x1c0
[ 5308.563382]  [<ffffffffc0c2c760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[ 5308.564084]  [<ffffffff9bcc1c31>] kthread+0xd1/0xe0
[ 5308.564541]  [<ffffffff9bcc1b60>] ? insert_kthread_work+0x40/0x40
[ 5308.565108]  [<ffffffff9c374c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 5308.565710]  [<ffffffff9bcc1b60>] ? insert_kthread_work+0x40/0x40

There are several examples of this crash:
https://testing.whamcloud.com/test_sets/375d7040-fdc8-11e8-b837-52540065bddc
https://testing.whamcloud.com/test_sets/54b126fe-f955-11e8-b67f-52540065bddc
https://testing.whamcloud.com/test_sets/cec17f86-f6e7-11e8-815b-52540065bddc



 Comments   
Comment by Oleg Drokin [ 18/Dec/18 ]

This assertion seems to be a 100% match to LU-9337, the stack trace also matches.

Comment by Peter Jones [ 18/Dec/18 ]

Alex

Could you please assess this issue?

Peter

Comment by Alex Zhuravlev [ 18/Dec/18 ]

Checking the logs. It is worth mentioning that in all reported cases replay-dual preceded this test and failed with a very similar symptom:
"Restart of mds1 failed!" and the following in the log:

+ pm -h powerman --off trevis-27vm8
/usr/lib64/lustre/tests/test-framework.sh: line 2470: pm: command not found
waiting ! ping -w 3 -c 1 trevis-27vm8, 4 secs left ...
waiting ! ping -w 3 -c 1 trevis-27vm8, 3 secs left ...
waiting ! ping -w 3 -c 1 trevis-27vm8, 2 secs left ...
waiting ! ping -w 3 -c 1 trevis-27vm8, 1 secs left ...
waiting for trevis-27vm8 to fail attempts=3
+ pm -h powerman --off trevis-27vm8
/usr/lib64/lustre/tests/test-framework.sh: line 2470: pm: command not found
waiting ! ping -w 3 -c 1 trevis-27vm8, 4 secs left ...
waiting ! ping -w 3 -c 1 trevis-27vm8, 3 secs left ...
waiting ! ping -w 3 -c 1 trevis-27vm8, 2 secs left ...
waiting ! ping -w 3 -c 1 trevis-27vm8, 1 secs left ...
waiting for trevis-27vm8 to fail attempts=3
trevis-27vm8 still pingable after power down! attempts=3
reboot facets: mds1
+ pm -h powerman --on trevis-27vm8
/usr/lib64/lustre/tests/test-framework.sh: line 2560: pm: command not found
Failover mds1 to trevis-27vm7
03:17:22 (1543807042) waiting for trevis-27vm7 network 900 secs ...
03:17:22 (1543807042) network interface is UP
CMD: trevis-27vm7 hostname
mount facets: mds1
CMD: trevis-27vm7 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
CMD: trevis-27vm7 dmsetup status /dev/mapper/mds1_flakey 2>&1
CMD: trevis-27vm7 dmsetup table /dev/mapper/mds1_flakey
CMD: trevis-27vm7 dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
CMD: trevis-27vm7 dmsetup load /dev/mapper/mds1_flakey --table \"0 20971520 linear 252:0 0\"
CMD: trevis-27vm7 dmsetup resume /dev/mapper/mds1_flakey
CMD: trevis-27vm7 test -b /dev/mapper/mds1_flakey
CMD: trevis-27vm7 e2label /dev/mapper/mds1_flakey
Starting mds1: /dev/mapper/mds1_flakey /mnt/lustre-mds1
CMD: trevis-27vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/mapper/mds1_flakey /mnt/lustre-mds1
trevis-27vm7: mount.lustre: according to /etc/mtab /dev/mapper/mds1_flakey is already mounted on /mnt/lustre-mds1
Start of /dev/mapper/mds1_flakey on mds1 failed 17
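The repeated "pm: command not found" errors above show test-framework.sh invoking the powerman client on a node where it is not installed, so the power-off silently fails and the node stays pingable. A hypothetical guard (power_off_node is an illustrative helper, not an actual test-framework.sh function) could fail fast instead:

```shell
#!/bin/sh
# Sketch: refuse to proceed when the powerman client "pm" is missing,
# rather than letting "command not found" leave the node powered on.
power_off_node() {
    node=$1
    if command -v pm >/dev/null 2>&1; then
        pm -h powerman --off "$node"
    else
        echo "pm not found; cannot power off $node" >&2
        return 1
    fi
}

power_off_node trevis-27vm8 || echo "power-off of trevis-27vm8 skipped"
```

Failing early here would surface the misconfiguration in replay-dual instead of carrying a half-restarted MDS into replay-vbr.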

 

Comment by Alex Zhuravlev [ 15/Apr/20 ]

a duplicate of LU-12674

Generated at Sat Feb 10 02:47:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.