[LU-15948] Interop conf-sanity test_32d: MDS hit NMI watchdog: BUG: soft lockup Created: 15/Jun/22 Updated: 15/Jun/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/77959384-2fe3-4f55-bb38-984f7bf61760

test_32d failed with the following error: onyx-124vm8 crashed during conf-sanity test_32d

MDS crash:

[ 4534.035258] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 4534.065338] Lustre: MGS: Connection restored to MGC10.240.30.39@tcp_0 (at 0@lo)
[ 4534.351344] Lustre: 2461:0:(obd_mount.c:968:lustre_check_exclusion()) Excluding t32fs-OST0000 (on exclusion list)
[ 4534.353004] Lustre: 2461:0:(obd_mount.c:968:lustre_check_exclusion()) Skipped 1 previous similar message
[ 4580.197426] Lustre: t32fs-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 4580.607165] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.t32fs-MDT0000.uuid
[ 4580.926690] Lustre: DEBUG MARKER: tunefs.lustre --dryrun /tmp/t32/ost
[ 4581.303203] Lustre: DEBUG MARKER: mount -t lustre -onomgs -oloop,mgsnode=10.240.30.39@tcp /tmp/t32/ost /tmp/t32/mnt/ost
[ 4581.548860] LDISKFS-fs (loop1): file extents enabled, maximum tree depth=5
[ 4581.550342] LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
[ 4620.308604] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [llog_process_th:2780]
[ 4620.312080] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper parport_pc cryptd pcspkr joydev virtio_balloon i2c_piix4 parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi
[ 4620.325003]  virtio_net net_failover failover ata_piix virtio_blk libata crct10dif_pclmul crct10dif_common floppy crc32c_intel serio_raw virtio_pci virtio_ring virtio [last unloaded: libcfs]
[ 4620.327797] CPU: 1 PID: 2780 Comm: llog_process_th Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.49.1.el7_lustre.x86_64 #1
[ 4620.329699] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 4620.330568] task: ffff98515cc4c200 ti: ffff985155fc8000 task.ti: ffff985155fc8000
[ 4620.331691] RIP: 0010:[<ffffffffb918b795>] [<ffffffffb918b795>] _raw_spin_unlock_irqrestore+0x15/0x20
[ 4620.333145] RSP: 0018:ffff985155fcb678 EFLAGS: 00000246
[ 4620.333954] RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff98515cc4c200
[ 4620.335028] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[ 4620.336102] RBP: ffff985155fcb678 R08: ffff985148661fb0 R09: 0000000000000001
[ 4620.337177] R10: 0000000000000001 R11: 000000000000000f R12: ffffffffb8a2b59e
[ 4620.338249] R13: ffff985155fcb678 R14: ffffffffb8ae6321 R15: ffff985155fcb610
[ 4620.339327] FS: 0000000000000000(0000) GS:ffff98517fd00000(0000) knlGS:0000000000000000
[ 4620.340538] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4620.341409] CR2: 00007fa275698504 CR3: 00000000bb604000 CR4: 00000000001606e0
[ 4620.342485] Call Trace:
[ 4620.342909] [<ffffffffb8ac6a46>] prepare_to_wait+0x56/0x90
[ 4620.343806] [<ffffffffc0966129>] lnet_discover_peer_locked+0x1e9/0x430 [lnet]
[ 4620.344899] [<ffffffffb8ac6f50>] ? wake_up_atomic_t+0x30/0x30
[ 4620.345794] [<ffffffffc0966425>] LNetPrimaryNID+0xb5/0x1f0 [lnet]
[ 4620.346781] [<ffffffffc0c906ce>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
[ 4620.347856] [<ffffffffc0c84b4c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
[ 4620.348993] [<ffffffffc0c563a2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
[ 4620.350002] [<ffffffffc0c57c39>] client_obd_setup+0xd19/0x1430 [ptlrpc]
[ 4620.351029] [<ffffffffc1586e03>] lwp_setup.isra.5+0x363/0xc40 [osp]
[ 4620.351999] [<ffffffffc085e217>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 4620.353007] [<ffffffffc15878d8>] lwp_device_alloc+0x1f8/0x590 [osp]
[ 4620.354009] [<ffffffffc09fa5e9>] obd_setup+0x119/0x280 [obdclass]
[ 4620.354962] [<ffffffffc09fa9f8>] class_setup+0x2a8/0x840 [obdclass]
[ 4620.355942] [<ffffffffc09fdaa6>] class_process_config+0x1726/0x2830 [obdclass]
[ 4620.357056] [<ffffffffc085e217>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 4620.358067] [<ffffffffc0a01fc8>] do_lcfg+0x258/0x500 [obdclass]
[ 4620.358998] [<ffffffffc0a067f8>] lustre_start_simple+0x88/0x210 [obdclass]
[ 4620.360070] [<ffffffffc0a2fcdc>] client_lwp_config_process+0xb4c/0xe10 [obdclass]
[ 4620.361223] [<ffffffffc09c285b>] llog_process_thread+0x94b/0x1af0 [obdclass]
[ 4620.362313] [<ffffffffc09c4414>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
[ 4620.363492] [<ffffffffc09c4370>] ? llog_backup+0x500/0x500 [obdclass]
[ 4620.364481] [<ffffffffb8ac5e61>] kthread+0xd1/0xe0
[ 4620.365226] [<ffffffffb8ac5d90>] ? insert_kthread_work+0x40/0x40
[ 4620.366149] [<ffffffffb9195df7>] ret_from_fork_nospec_begin+0x21/0x21
[ 4620.367137] [<ffffffffb8ac5d90>] ? insert_kthread_work+0x40/0x40
[ 4620.368065] Code: 06 bd 98 ff 66 90 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 e8 e6 bc 98 ff 66 90 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 48
[ 4620.372845] Kernel panic - not syncing: softlockup: hung tasks
[ 4620.373732] CPU: 1 PID: 2780 Comm: llog_process_th Kdump: loaded Tainted: G OEL ------------ 3.10.0-1160.49.1.el7_lustre.x86_64 #1
[ 4620.375636] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 4620.376516] Call Trace:
[ 4620.376903] <IRQ> [<ffffffffb9183539>] dump_stack+0x19/0x1b
[ 4620.377820] [<ffffffffb917d241>] panic+0xe8/0x21f
[ 4620.378568] [<ffffffffb8b4ee2a>] watchdog_timer_fn+0x20a/0x220
[ 4620.379471] [<ffffffffb8b4ec20>] ? watchdog+0x40/0x40
[ 4620.380267] [<ffffffffb8aca25e>] __hrtimer_run_queues+0x10e/0x270
[ 4620.381215] [<ffffffffb8aca7bf>] hrtimer_interrupt+0xaf/0x1d0
[ 4620.382108] [<ffffffffb8a5cdfb>] local_apic_timer_interrupt+0x3b/0x60
[ 4620.383104] [<ffffffffb919aa23>] smp_apic_timer_interrupt+0x43/0x60
[ 4620.384066] [<ffffffffb9196fba>] apic_timer_interrupt+0x16a/0x170
[ 4620.384997] <EOI> [<ffffffffb918b795>] ? _raw_spin_unlock_irqrestore+0x15/0x20
[ 4620.386144] [<ffffffffb8ac6a46>] prepare_to_wait+0x56/0x90
[ 4620.387001] [<ffffffffc0966129>] lnet_discover_peer_locked+0x1e9/0x430 [lnet]
[ 4620.388093] [<ffffffffb8ac6f50>] ? wake_up_atomic_t+0x30/0x30
[ 4620.388982] [<ffffffffc0966425>] LNetPrimaryNID+0xb5/0x1f0 [lnet]
[ 4620.389936] [<ffffffffc0c906ce>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
[ 4620.391016] [<ffffffffc0c84b4c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
[ 4620.392137] [<ffffffffc0c563a2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
[ 4620.393139] [<ffffffffc0c57c39>] client_obd_setup+0xd19/0x1430 [ptlrpc]
[ 4620.394157] [<ffffffffc1586e03>] lwp_setup.isra.5+0x363/0xc40 [osp]
[ 4620.395125] [<ffffffffc085e217>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 4620.396125] [<ffffffffc15878d8>] lwp_device_alloc+0x1f8/0x590 [osp]
[ 4620.397124] [<ffffffffc09fa5e9>] obd_setup+0x119/0x280 [obdclass]
[ 4620.398074] [<ffffffffc09fa9f8>] class_setup+0x2a8/0x840 [obdclass]
[ 4620.399054] [<ffffffffc09fdaa6>] class_process_config+0x1726/0x2830 [obdclass]
[ 4620.400165] [<ffffffffc085e217>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 4620.401185] [<ffffffffc0a01fc8>] do_lcfg+0x258/0x500 [obdclass]
[ 4620.402119] [<ffffffffc0a067f8>] lustre_start_simple+0x88/0x210 [obdclass]
[ 4620.403189] [<ffffffffc0a2fcdc>] client_lwp_config_process+0xb4c/0xe10 [obdclass]
[ 4620.404339] [<ffffffffc09c285b>] llog_process_thread+0x94b/0x1af0 [obdclass]
[ 4620.405425] [<ffffffffc09c4414>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
[ 4620.406598] [<ffffffffc09c4370>] ? llog_backup+0x500/0x500 [obdclass]
[ 4620.407583] [<ffffffffb8ac5e61>] kthread+0xd1/0xe0
[ 4620.408332] [<ffffffffb8ac5d90>] ? insert_kthread_work+0x40/0x40
[ 4620.409257] [<ffffffffb9195df7>] ret_from_fork_nospec_begin+0x21/0x21
[ 4620.410247] [<ffffffffb8ac5d90>] ? insert_kthread_work+0x40/0x40 |
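The trace shows the llog_process_th kthread stuck in LNetPrimaryNID() -> lnet_discover_peer_locked() while client_obd_setup() brings up the LWP device during the test_32d mount. As a rough illustration of how this can become a soft lockup, below is a hand-written sketch of the discovery wait pattern implied by the trace; it is not the actual Lustre source, and peer_is_uptodate() and queue_for_discovery() are hypothetical stand-ins for the real helpers:

#include <linux/types.h>
#include <linux/wait.h>
#include <linux/sched.h>

/* Minimal stand-in for the real struct lnet_peer (sketch only). */
struct lnet_peer {
	wait_queue_head_t lp_dc_waitq;	/* discovery completion waitq */
};

extern bool peer_is_uptodate(struct lnet_peer *lp);	/* hypothetical */
extern void queue_for_discovery(struct lnet_peer *lp);	/* hypothetical */

/*
 * The waiter parks itself on the peer's discovery waitq until the
 * discovery thread marks the peer up to date.  If the peer can never
 * reach that state (e.g. a pre-2.11 peer that does not answer
 * discovery pings) and wakeups arrive without progress, the thread
 * keeps cycling through prepare_to_wait() and the soft-lockup
 * watchdog fires once the CPU has been held for ~20s.
 */
static void discovery_wait_sketch(struct lnet_peer *lp)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&lp->lp_dc_waitq, &wait, TASK_INTERRUPTIBLE);
		if (peer_is_uptodate(lp))
			break;
		queue_for_discovery(lp);
		schedule();	/* an immediate or spurious wakeup sends
				 * the thread straight back around */
	}
	finish_wait(&lp->lp_dc_waitq, &wait);
}

In the healthy case the loop sleeps in schedule() until discovery completes; the "stuck for 22s" report above means this thread never left the CPU long enough to make progress.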
| Comments |
| Comment by Andreas Dilger [ 15/Jun/22 ] |
|
Chris, Serguei, is this a duplicate of a known bug? |
| Comment by Chris Horn [ 15/Jun/22 ] |
|
We've seen different flavors of that stack trace in a bunch of different tickets. It usually shows up when LNet discovery cannot complete, due to either network issues or a configuration issue, but there have also been some bugs that would manifest with stack traces like that. I don't have a good sense of what sort of interop issues might exist between 2.10 and 2.12 LTS, though. If you can provide a core dump, the vmlinux, and the .ko files, I could take a look at it. |
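For reference, since 2.10 servers predate LNet peer discovery, one general diagnostic step in 2.10/2.12 interop setups (offered as a suggestion, not a confirmed fix for this ticket) is to check and, if necessary, disable discovery on the 2.12 node with lnetctl, using the MGS NID from the log above:

# Show global LNet settings, including whether discovery is enabled
lnetctl global show

# Disable peer discovery while debugging interop with pre-2.11 peers
lnetctl set discovery 0

# Inspect the state of the peer the thread was waiting on
lnetctl peer show --nid 10.240.30.39@tcp

The intent is to avoid LNetPrimaryNID() waiting on a discovery round that a pre-discovery peer can never complete.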