[LU-12980] conf-sanity test 53a hangs in lustre_rmmod Created: 18/Nov/19 Updated: 18/Nov/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
SLES12SP4 clients |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
conf-sanity test_53a hangs in lustre_rmmod. Looking at the client test_log for the hang at https://testing.whamcloud.com/test_sets/fe1c7768-07a3-11ea-b934-52540065bddc for SLES12 SP4 client, the last thing we see is trevis-52vm12: == rpc test complete, duration -o sec ================================================================ 15:03:03 (1573743783) trevis-52vm11: CMD: trevis-52vm11.trevis.whamcloud.com /sbin/lctl get_param -n version 2>/dev/null || trevis-52vm11: /sbin/lctl lustre_build_version 2>/dev/null || trevis-52vm11: /sbin/lctl --version 2>/dev/null | cut -d' ' -f2 trevis-52vm11: == rpc test complete, duration -o sec ================================================================ 15:03:03 (1573743783) trevis-52vm12: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs trevis-52vm11: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs Everything prior to these last lines look like a typical successful test 53a test run. Looking at the client 1 (vm10) console log, we see [41930.736352] Lustre: DEBUG MARKER: == conf-sanity test 53a: check OSS thread count params =============================================== 07:02:08 (1573743728) [41981.099044] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us [41982.940633] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2 [41983.114571] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-52vm10: executing \/usr\/sbin\/lustre_rmmod ldiskfs [41983.389931] Lustre: DEBUG MARKER: trevis-52vm10: executing /usr/sbin/lustre_rmmod ldiskfs [41983.649199] Key type lgssc unregistered [42013.771188] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [rmmod:22596] [42013.772713] Modules linked in: obdecho(OEN) lnet_selftest(OEN) mgc(OEN) fid(OEN) osc(OEN-) ksocklnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet iscsi_ibft iscsi_boot_sysfs crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel 8139too aes_x86_64 crypto_simd glue_helper cryptd pcspkr 8139cp virtio_balloon mii qemu_fw_cfg i2c_piix4 joydev button ata_generic ext4 crc16 jbd2 mbcache ata_piix ahci libahci uhci_hcd virtio_blk ehci_hcd usbcore serio_raw virtio_pci virtio_ring virtio libata floppy sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 [last unloaded: mdc] [42013.782838] Supported: No, Unsupported modules are loaded [42013.907792] CPU: 0 PID: 22596 Comm: rmmod Tainted: G OE 4.12.14-95.29-default #1 SLE12-SP4 [42013.909560] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [42013.910614] task: ffff97ec204c5740 task.stack: ffffb0a8807a0000 [42017.314832] RIP: 0010:__memset+0x24/0x30 [42017.574572] RSP: 0018:ffffb0a8807a3ea8 EFLAGS: 00010206 ORIG_RAX: ffffffffffffff10 [42017.576187] RAX: 5a5a5a5a5a5a5a5a RBX: ffff97ec3b000908 RCX: 000000000000a600 [42017.577470] RDX: 0000000000000000 RSI: 000000000000005a RDI: ffff97ec2302d000 [42017.578736] RBP: ffff97ec3b000488 R08: 0000000000000000 R09: ffff97ec23000000 [42017.580000] R10: 0000000000000000 R11: ffffffffffffffec R12: ffff97ec3af57540 [42017.581266] R13: ffff97ec3b000480 R14: ffff97ec3af57548 R15: 00001fffffffffff [42017.582538] FS: 00007faf6fce5700(0000) GS:ffff97ec3fc00000(0000) knlGS:0000000000000000 [42017.583965] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [42017.585006] CR2: 000000000021e000 CR3: 000000007af06003 CR4: 00000000000606f0 [42017.586287] Call Trace: [42029.158312] ptlrpc_free_rq_pool+0x253/0x3e0 [ptlrpc] [42031.024082] SyS_delete_module+0x16b/0x250 [42034.276610] ? exit_to_usermode_loop+0x86/0x9c [42034.277828] do_syscall_64+0x74/0x160 [42035.491904] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [42037.828160] RIP: 0033:0x7faf6f3de457 [42037.829308] RSP: 002b:00007fff1802b988 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [42037.831252] RAX: ffffffffffffffda RBX: 00000000006f71d0 RCX: 00007faf6f3de457 [42037.833049] RDX: 00007faf6f4459c0 RSI: 0000000000000800 RDI: 00000000006f7238 [42037.834851] RBP: 0000000000000000 R08: 00007faf6f690f40 R09: 00007fff1802a901 [42037.836663] R10: 000000000000000a R11: 0000000000000206 R12: 00007fff1802cdd4 [42037.838495] R13: 0000000000000000 R14: 00000000006f71d0 R15: 00000000006f7010 [42037.840292] Code: 90 90 90 90 90 90 90 66 66 90 66 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 [42037.844629] Kernel panic - not syncing: softlockup: hung tasks [42037.846139] CPU: 0 PID: 22596 Comm: rmmod Tainted: G OEL 4.12.14-95.29-default #1 SLE12-SP4 [42037.848498] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [42037.849903] Call Trace: [42038.953574] <IRQ> [42038.954178] dump_stack+0x5a/0x75 [42039.421812] panic+0xdb/0x23e [42042.003586] watchdog_timer_fn+0x208/0x210 [42042.190190] ? watchdog_park_threads+0x70/0x70 [42042.191162] __hrtimer_run_queues+0xce/0x210 [42042.354362] hrtimer_interrupt+0x99/0x190 [42042.355315] smp_apic_timer_interrupt+0x3a/0x60 [42044.012800] apic_timer_interrupt+0x8f/0xa0 [42044.014198] </IRQ> [42044.015037] RIP: 0010:__memset+0x24/0x30 [42044.016375] RSP: 0018:ffffb0a8807a3ea8 EFLAGS: 00010206 ORIG_RAX: ffffffffffffff10 [42044.018779] RAX: 5a5a5a5a5a5a5a5a RBX: ffff97ec3b000908 RCX: 000000000000a600 [42044.021020] RDX: 0000000000000000 RSI: 000000000000005a RDI: ffff97ec2302d000 [42044.023076] RBP: ffff97ec3b000488 R08: 0000000000000000 R09: ffff97ec23000000 [42044.025330] R10: 0000000000000000 R11: ffffffffffffffec R12: ffff97ec3af57540 [42044.027548] R13: ffff97ec3b000480 R14: ffff97ec3af57548 R15: 00001fffffffffff [42044.029830] ptlrpc_free_rq_pool+0x253/0x3e0 [ptlrpc] [42044.031486] SyS_delete_module+0x16b/0x250 [42044.032957] ? exit_to_usermode_loop+0x86/0x9c [42044.034425] do_syscall_64+0x74/0x160 [42044.035679] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [42044.037311] RIP: 0033:0x7faf6f3de457 [42044.038540] RSP: 002b:00007fff1802b988 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [42044.040896] RAX: ffffffffffffffda RBX: 00000000006f71d0 RCX: 00007faf6f3de457 [42044.043043] RDX: 00007faf6f4459c0 RSI: 0000000000000800 RDI: 00000000006f7238 [42044.045258] RBP: 0000000000000000 R08: 00007faf6f690f40 R09: 00007fff1802a901 [42044.047518] R10: 000000000000000a R11: 0000000000000206 R12: 00007fff1802cdd4 [42044.049751] R13: 0000000000000000 R14: 00000000006f71d0 R15: 00000000006f7010 [ 1.686560] Kernel panic - not syncing: Out of memory and no killable processes... [ 1.686560] [ 1.688007] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.14-95.29-default #1 SLE12-SP4 [ 1.688007] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 1.688007] Call Trace: [ 1.688007] dump_stack+0x5a/0x75 [ 1.688007] panic+0xdb/0x23e [ 1.688007] out_of_memory+0x3a1/0x490 [ 1.688007] __alloc_pages_slowpath+0x7e5/0xa0d [ 1.688007] __alloc_pages_nodemask+0x1e9/0x210 [ 1.688007] alloc_page_interleave+0x36/0xa0 [ 1.688007] alloc_pages_current+0xa8/0xf0 [ 1.688007] pagecache_get_page+0x4d/0x250 [ 1.688007] ? xattr_resolve_name+0xa8/0xd0 [ 1.688007] grab_cache_page_write_begin+0x1c/0x40 [ 1.688007] simple_write_begin+0x24/0xe0 [ 1.688007] generic_perform_write+0xb2/0x190 [ 1.688007] __generic_file_write_iter+0x184/0x1c0 [ 1.688007] generic_file_write_iter+0xec/0x1d0 [ 1.688007] __vfs_write+0xdc/0x150 [ 1.688007] vfs_write+0xad/0x1a0 [ 1.688007] SyS_write+0x42/0x90 [ 1.688007] xwrite+0x2a/0x5b [ 1.688007] do_copy+0x8b/0xb8 [ 1.688007] write_buffer+0x27/0x37 [ 1.688007] flush_buffer+0x2b/0x84 [ 1.688007] ? write_buffer+0x37/0x37 [ 1.688007] unxz+0x195/0x297 [ 1.688007] ? unlzo+0x4c4/0x4c4 [ 1.688007] unpack_to_rootfs+0x15d/0x299 [ 1.688007] ? initrd_load+0x3f/0x3f [ 1.688007] ? parse_header+0x10e/0x10e [ 1.688007] populate_rootfs+0x61/0x116 [ 1.688007] ? parse_header+0x10e/0x10e [ 1.688007] do_one_initcall+0x50/0x1b0 [ 1.688007] kernel_init_freeable+0x19a/0x222 [ 1.688007] ? set_debug_rodata+0x11/0x11 [ 1.688007] ? rest_init+0x80/0x80 [ 1.688007] kernel_init+0xa/0x110 [ 1.688007] ret_from_fork+0x35/0x40 [ 1.688007] Rebooting in 1 seconds.. Looking at the OSS (vm11) console log, we don’t see any tasks in a ‘D’ state and see the following running [41861.995673] Lustre: DEBUG MARKER: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs [41862.206227] Lustre: DEBUG MARKER: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs [41862.209736] Lustre: DEBUG MARKER: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs [41862.636484] Key type lgssc unregistered [41862.725313] LNet: 32129:0:(timer.c:244:stt_shutdown()) waiting for 1 threads to terminate [41864.825960] LNet: Removed LNI 10.9.3.107@tcp [45533.730112] SysRq : Changing Loglevel [45533.730911] Loglevel set to 8 [45533.732495] SysRq : Show State … [45535.808726] kworker/0:2 R running task 0 25726 2 0x00000080 [45535.809966] Call Trace: [45535.810410] [<ffffffff9aebd19a>] ? process_one_work+0x21a/0x440 [45535.811500] [<ffffffff9b57f229>] schedule+0x29/0x70 [45535.812372] [<ffffffff9aebe2c9>] worker_thread+0x1d9/0x3c0 [45535.813338] [<ffffffff9aebe0f0>] ? manage_workers.isra.26+0x2a0/0x2a0 [45535.814456] [<ffffffff9aec50d1>] kthread+0xd1/0xe0 [45535.815372] [<ffffffff9aec5000>] ? insert_kthread_work+0x40/0x40 [45535.816448] [<ffffffff9b58cd37>] ret_from_fork_nospec_begin+0x21/0x21 [45535.817562] [<ffffffff9aec5000>] ? insert_kthread_work+0x40/0x40 … [45535.909490] bash R running task 0 32398 32396 0x00000088 [45535.910761] Call Trace: [45535.911203] [<ffffffff9aed9278>] sched_show_task+0xa8/0x110 [45535.912208] [<ffffffff9aed9362>] show_state_filter+0x82/0xd0 [45535.913206] [<ffffffff9b26de40>] sysrq_handle_showstate+0x10/0x20 [45535.914261] [<ffffffff9b26e5dd>] __handle_sysrq+0x10d/0x170 [45535.915310] [<ffffffff9b26ea48>] write_sysrq_trigger+0x28/0x40 [45535.916365] [<ffffffff9b0c0b10>] proc_reg_write+0x40/0x80 [45535.917331] [<ffffffff9b048690>] vfs_write+0xc0/0x1f0 [45535.918234] [<ffffffff9b58ce21>] ? system_call_after_swapgs+0xae/0x146 [45535.919366] [<ffffffff9b0494af>] SyS_write+0x7f/0xf0 [45535.920246] [<ffffffff9b58ce21>] ? system_call_after_swapgs+0xae/0x146 [45535.921387] [<ffffffff9b58cede>] system_call_fastpath+0x25/0x2a [45535.922429] [<ffffffff9b58ce21>] ? system_call_after_swapgs+0xae/0x146 Looking at the MDS (vm12) console log, we see [41862.417381] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-52vm12.trevis.whamcloud.com: executing \/usr\/sbin\/lustre_rmmod ldiskfs [41862.422703] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-52vm12.trevis.whamcloud.com: executing \/usr\/sbin\/lustre_rmmod ldiskfs [41862.675612] Lustre: DEBUG MARKER: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs [41862.685734] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-52vm11.trevis.whamcloud.com: executing \/usr\/sbin\/lustre_rmmod ldiskfs [41862.737460] Lustre: DEBUG MARKER: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs [41862.915808] Lustre: DEBUG MARKER: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs [41863.115812] Key type lgssc unregistered [41864.147550] LNet: 32315:0:(api-ni.c:2061:lnet_clear_zombies_nis_locked()) Waiting for zombie LNI 10.9.3.108@tcp [41868.148695] LNet: 32315:0:(api-ni.c:2061:lnet_clear_zombies_nis_locked()) Waiting for zombie LNI 10.9.3.108@tcp [41870.149762] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect [41874.150882] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect [41882.152156] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect [41898.153717] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect [41930.155818] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect [41966.158106] LNet: Removed LNI 10.9.3.108@tcp [45537.590798] SysRq : Changing Loglevel [45537.591604] Loglevel set to 8 Although we’ve seen conf-sanity tes_53a hang previously, this hang looks different from the three other occurrences from this year. |