Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.15.5
- Labels: None
- Severity: 3
- Rank: 9223372036854775807
Description
This issue was created by maloo for sarah <sarah@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/375e5764-84b4-41ec-8cf2-b109216ea063
test_60a failed with the following error:
onyx-99vm1 crashed during sanity test_60a
Test session details:
clients: https://build.whamcloud.com/job/lustre-b2_15/88 - 5.14.0-427.18.1.el9_4.x86_64
servers: https://build.whamcloud.com/job/lustre-b2_15/88 - 4.18.0-513.24.1.el8_lustre.x86_64
[ 4362.533695] LustreError: 171557:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 1317 sleeping
[ 4363.083782] LustreError: 171577:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 1317 waking
[ 4363.085558] LustreError: 171557:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 1317 awake: rc=4450
[ 4364.107775] LustreError: 171577:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 1317 waking
[ 4364.109524] Lustre: 171577:0:(llog_test.c:1470:cat_check_old_cb()) seeing record at index 3 - [0x1:0x4be:0x0] in log [0xa:0x11:0x0]
[ 4364.688147] LustreError: 171557:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 1317 waking
[ 4365.708742] Lustre: 171557:0:(llog_test.c:2075:llog_test_10()) 10h: wrote 64767 records then 0 failed with ENOSPC
[ 4365.710822] Lustre: 171557:0:(llog_test.c:2088:llog_test_10()) 10: put newly-created catalog
[ 4366.910031] Lustre: DEBUG MARKER: /usr/sbin/lctl dk
[ 4368.148192] Lustre: DEBUG MARKER: which llog_reader 2> /dev/null
[ 4368.502516] Lustre: DEBUG MARKER: ls -d /usr/sbin/llog_reader
[ 4369.709382] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[ 4370.032161] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 4372.532113] Lustre: Failing over lustre-MDT0000
[ 4375.225434] Lustre: lustre-MDT0000: Not available for connect from 10.240.29.251@tcp (stopping)
[ 4378.342533] Lustre: lustre-MDT0000: Not available for connect from 10.240.29.252@tcp (stopping)
[ 4387.945411] Lustre: 11292:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1718047804/real 1718047804] req@000000009650aa81 x1801499268665600/t0(0) o400->lustre-OST0004-osc-MDT0000@10.240.29.253@tcp:28/4 lens 224/224 e 0 to 1 dl 1718047811 ref 1 fl Rpc:RXNQ/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
[ 4387.955128] Lustre: lustre-OST0004-osc-MDT0000: Connection to lustre-OST0004 (at 10.240.29.253@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 4387.958383] Lustre: Skipped 2 previous similar messages
[ 4387.959857] Lustre: lustre-MDT0000: Not available for connect from 10.240.29.251@tcp (stopping)
[ 4387.961600] Lustre: Skipped 15 previous similar messages
[ 4388.020779] Lustre: lustre-OST0004-osc-MDT0000: Connection restored to 10.240.29.253@tcp (at 10.240.29.253@tcp)
[ 4388.271300] Lustre: MGS: Client 6a7488a2-5141-4026-8a48-15608255ca8e (at 10.240.29.251@tcp) reconnecting
[ 4389.853159] LustreError: 172062:0:(lprocfs_jobstats.c:137:job_stat_exit()) should not have any items
[ 4389.855119] LustreError: 172062:0:(lprocfs_jobstats.c:137:job_stat_exit()) Skipped 6 previous similar messages
[ 4390.265347] LustreError: 172069:0:(client.c:1256:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@000000007642e2d0 x1801499268667136/t0(0) o101->lustre-MDT0000-lwp-MDT0000@0@lo:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:QU/0/ffffffff rc 0/-1 job:'qsd_reint_0.lus.0'
[ 4390.270035] LustreError: 172069:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
[ 4392.020862] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.240.29.251@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 4392.024378] LustreError: Skipped 8 previous similar messages
[ 4393.271120] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.240.29.253@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 4397.062969] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.240.29.251@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 4397.066497] LustreError: Skipped 7 previous similar messages
[ 4398.474282] Lustre: 172062:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1718047819/real 1718047819] req@000000007bbd2132 x1801499268667584/t0(0) o251->MGC10.240.28.44@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1718047825 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'umount.0'
[ 4398.479856] Lustre: 172062:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[ 4427.970012] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [umount:172062]
[ 4427.971533] Modules linked in: obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net net_failover virtio_blk failover [last unloaded: llog_test]
[ 4427.996085] CPU: 0 PID: 172062 Comm: umount Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1
[ 4427.998467] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 4427.999615] RIP: 0010:memset_erms+0x9/0x20
[ 4428.147666] Code: 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66
[ 4428.151228] RSP: 0018:ffffa9f9c133bb30 EFLAGS: 00010246 ORIG_RAX: ffffffffffffff13
[ 4428.152710] RAX: ffff96b543ea525a RBX: ffff96b580f3fc00 RCX: 0000000000100000
[ 4428.154121] RDX: 0000000000100000 RSI: 000000000000005a RDI: ffff96b579100000
[ 4428.155527] RBP: ffff96b580f3fc28 R08: 0000000000000001 R09: ffff96b579100000
[ 4428.156937] R10: ffff96b5787d5000 R11: 0000000000000001 R12: ffff96b580f3fc00
[ 4428.158342] R13: ffff96b5787d5800 R14: dead000000000200 R15: dead000000000100
[ 4428.171578] FS: 00007fc8771da080(0000) GS:ffff96b5ffc00000(0000) knlGS:0000000000000000
[ 4428.173156] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4428.174301] CR2: 00007f6943409050 CR3: 000000001bc80001 CR4: 00000000001706f0
[ 4428.175710] Call Trace:
[ 4428.252421] <IRQ>
[ 4428.254032] ? watchdog_timer_fn.cold.10+0x46/0x9e
[ 4428.278435] ? watchdog+0x30/0x30
[ 4428.279153] ? __hrtimer_run_queues+0x101/0x280
[ 4428.280620] ? hrtimer_interrupt+0x100/0x220
[ 4428.281501] ? smp_apic_timer_interrupt+0x6a/0x130
[ 4428.289409] ? apic_timer_interrupt+0xf/0x20
[ 4428.290302] </IRQ>
[ 4428.290773] ? memset_erms+0x9/0x20
[ 4428.291510] ptlrpc_service_purge_all+0x422/0xa80 [ptlrpc]
[ 4428.293182] ptlrpc_unregister_service+0x422/0x940 [ptlrpc]
[ 4428.294389] ? kmem_cache_alloc_trace+0x142/0x280
[ 4428.303771] ? lprocfs_counter_add+0xd2/0x140 [obdclass]
[ 4428.305122] mds_stop_ptlrpc_service+0x69/0x1b0 [mdt]
[ 4428.322721] mds_device_fini+0x28/0xd0 [mdt]
[ 4428.323661] class_cleanup+0x6f5/0xc90 [obdclass]
[ 4428.392742] class_process_config+0x3ad/0x2080 [obdclass]
[ 4428.393908] ? class_manual_cleanup+0x191/0x780 [obdclass]
[ 4428.395093] ? __kmalloc+0x113/0x250
[ 4428.402142] class_manual_cleanup+0x456/0x780 [obdclass]
[ 4428.403265] server_put_super+0xc8b/0x1350 [obdclass]
[ 4428.404575] ? evict_inodes+0x160/0x1b0
[ 4428.405889] generic_shutdown_super+0x6c/0x110
[ 4428.433075] kill_anon_super+0x14/0x30
[ 4428.433877] deactivate_locked_super+0x34/0x70
[ 4428.446805] cleanup_mnt+0x3b/0x70
[ 4428.448698] task_work_run+0x8a/0xb0
[ 4428.485783] exit_to_usermode_loop+0xef/0x100
[ 4428.542958] do_syscall_64+0x19c/0x1b0
[ 4428.543778] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 4428.560793] RIP: 0033:0x7fc876144e9b
[ 4428.561572] Code: ff d0 48 89 c7 b8 3c 00 00 00 0f 05 48 8b 0d e4 4f 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bd 4f 38 00 f7 d8 64 89 01 48
[ 4428.565120] RSP: 002b:00007fff58347e08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 4428.566604] RAX: 0000000000000000 RBX: 0000564708b059c0 RCX: 00007fc876144e9b
[ 4428.568005] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000564708b0b390
[ 4428.569417] RBP: 0000000000000000 R08: 0000564708b0bf20 R09: 0000564708b00010
[ 4428.570820] R10: 0000000000000000 R11: 0000000000000246 R12: 0000564708b0b390
[ 4428.572230] R13: 00007fc876fb6184 R14: 0000564708b05ba0 R15: 00000000ffffffff
[ 4428.573821] Kernel panic - not syncing: softlockup: hung tasks
[ 4428.574998] CPU: 0 PID: 172062 Comm: umount Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.24.1.el8_lustre.x86_64 #1
[ 4428.577378] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 4428.578518] Call Trace:
[ 4428.582866] <IRQ>
[ 4428.583355] dump_stack+0x41/0x60
[ 4428.604195] panic+0xe7/0x2ac
[ 4428.606288] watchdog_timer_fn.cold.10+0x85/0x9e
[ 4428.622294] ? watchdog+0x30/0x30
[ 4428.623035] __hrtimer_run_queues+0x101/0x280
[ 4428.635497] hrtimer_interrupt+0x100/0x220
[ 4428.636677] smp_apic_timer_interrupt+0x6a/0x130
[ 4428.650956] apic_timer_interrupt+0xf/0x20
[ 4428.651834] </IRQ>
[ 4428.652312] RIP: 0010:memset_erms+0x9/0x20
[ 4428.653162] Code: 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66
[ 4428.656714] RSP: 0018:ffffa9f9c133bb30 EFLAGS: 00010246 ORIG_RAX: ffffffffffffff13
[ 4428.658216] RAX: ffff96b543ea525a RBX: ffff96b580f3fc00 RCX: 0000000000100000
[ 4428.659631] RDX: 0000000000100000 RSI: 000000000000005a RDI: ffff96b579100000
[ 4428.661036] RBP: ffff96b580f3fc28 R08: 0000000000000001 R09: ffff96b579100000
[ 4428.677284] R10: ffff96b5787d5000 R11: 0000000000000001 R12: ffff96b580f3fc00
[ 4428.678708] R13: ffff96b5787d5800 R14: dead000000000200 R15: dead000000000100
[ 4428.680117] ptlrpc_service_purge_all+0x422/0xa80 [ptlrpc]
[ 4428.681308] ptlrpc_unregister_service+0x422/0x940 [ptlrpc]
[ 4428.682506] ? kmem_cache_alloc_trace+0x142/0x280
[ 4428.683465] ? lprocfs_counter_add+0xd2/0x140 [obdclass]
[ 4428.684603] mds_stop_ptlrpc_service+0x69/0x1b0 [mdt]
[ 4428.685667] mds_device_fini+0x28/0xd0 [mdt]
[ 4428.686591] class_cleanup+0x6f5/0xc90 [obdclass]
[ 4428.687609] class_process_config+0x3ad/0x2080 [obdclass]
[ 4428.688752] ? class_manual_cleanup+0x191/0x780 [obdclass]
[ 4428.689910] ? __kmalloc+0x113/0x250
[ 4428.690678] class_manual_cleanup+0x456/0x780 [obdclass]
[ 4428.691816] server_put_super+0xc8b/0x1350 [obdclass]
[ 4428.692893] ? evict_inodes+0x160/0x1b0
[ 4428.693699] generic_shutdown_super+0x6c/0x110
[ 4428.694611] kill_anon_super+0x14/0x30
[ 4428.695401] deactivate_locked_super+0x34/0x70
[ 4428.696311] cleanup_mnt+0x3b/0x70
[ 4428.697034] task_work_run+0x8a/0xb0
[ 4428.697791] exit_to_usermode_loop+0xef/0x100
[ 4428.698688] do_syscall_64+0x19c/0x1b0
[ 4428.699482] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 4428.700517] RIP: 0033:0x7fc876144e9b
[ 4428.701276] Code: ff d0 48 89 c7 b8 3c 00 00 00 0f 05 48 8b 0d e4 4f 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bd 4f 38 00 f7 d8 64 89 01 48
[ 4428.704828] RSP: 002b:00007fff58347e08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 4428.706314] RAX: 0000000000000000 RBX: 0000564708b059c0 RCX: 00007fc876144e9b
[ 4428.746088] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000564708b0b390
[ 4428.747498] RBP: 0000000000000000 R08: 0000564708b0bf20 R09: 0000564708b00010
[ 4428.748908] R10: 0000000000000000 R11: 0000000000000246 R12: 0000564708b0b390
[ 4428.750319] R13: 00007fc876fb6184 R14: 0000564708b05ba0 R15: 00000000ffffffff
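For context, the cfs_race()/cfs_fail_race() messages at the top of the log come from libcfs's two-thread race fault injection around fail id 1317: the first thread to reach the fail point logs "sleeping" and blocks until a second thread reaches the same point, logs "waking", and releases it (the first thread then reports "awake: rc=..."). The snippet below is only a minimal userspace sketch of that handshake using plain pthreads and hypothetical names (race_point(), worker()); it illustrates the synchronization pattern the messages describe and is not the libcfs implementation.

/*
 * Sketch (hypothetical, NOT libcfs code) of the race-style fault-injection
 * handshake seen in the log. Build with: cc -pthread race_sketch.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool waiter_present;     /* a thread is parked at the fail point */

static void race_point(int id)
{
        pthread_mutex_lock(&lock);
        if (!waiter_present) {
                /* first thread to arrive parks until its partner shows up */
                waiter_present = true;
                printf("cfs_race id %d sleeping\n", id);
                while (waiter_present)
                        pthread_cond_wait(&cond, &lock);
                printf("cfs_fail_race id %d awake\n", id);
        } else {
                /* second thread releases the sleeper and continues */
                printf("cfs_fail_race id %d waking\n", id);
                waiter_present = false;
                pthread_cond_broadcast(&cond);
        }
        pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
        race_point(1317);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

The crash itself happens later in the log: while umount -d /mnt/lustre-mds1 is tearing down the MDT, PID 172062 gets stuck in memset_erms called from ptlrpc_service_purge_all() (reached from server_put_super() and class_manual_cleanup() via mds_stop_ptlrpc_service() and ptlrpc_unregister_service()), CPU 0 makes no scheduling progress for more than 22 seconds, the soft-lockup watchdog fires, and the node panics with "softlockup: hung tasks". This is the same umount-time soft-lockup signature as the related LU-17946.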
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_60a - onyx-99vm1 crashed during sanity test_60a
Attachments
Issue Links
- is related to: LU-17946 sanity test_818: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! umount (Resolved)