[LU-12980] conf-sanity test 53a hangs in lustre_rmmod Created: 18/Nov/19  Updated: 18/Nov/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

SLES12SP4 clients


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

conf-sanity test_53a hangs in lustre_rmmod. Looking at the client test log for the hang at https://testing.whamcloud.com/test_sets/fe1c7768-07a3-11ea-b934-52540065bddc for the SLES12 SP4 client, the last thing we see is

trevis-52vm12: == rpc test complete, duration -o sec ================================================================ 15:03:03 (1573743783)
trevis-52vm11: CMD: trevis-52vm11.trevis.whamcloud.com /sbin/lctl get_param -n version 2>/dev/null ||
trevis-52vm11: 				/sbin/lctl lustre_build_version 2>/dev/null ||
trevis-52vm11: 				/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
trevis-52vm11: == rpc test complete, duration -o sec ================================================================ 15:03:03 (1573743783)
trevis-52vm12: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
trevis-52vm11: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs

Everything prior to these last lines looks like a typical successful test_53a run.
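The version probe visible in the excerpt above is a three-stage fallback chain: newer lctl exposes the version as a `get_param` parameter, while older builds only provide it via `lustre_build_version` or `--version`. A minimal standalone sketch of the same chain (the `LCTL` override is an assumption added for illustration, not part of the framework):

```shell
#!/bin/sh
# Fallback chain used to discover the Lustre version, mirroring the command
# visible in the test log above. LCTL defaults to the path seen in the log.
LCTL=${LCTL:-/usr/sbin/lctl}

lustre_version() {
	# Try the modern parameter first, then older subcommands in turn;
	# each stage only runs if the previous one failed.
	"$LCTL" get_param -n version 2>/dev/null ||
		"$LCTL" lustre_build_version 2>/dev/null ||
		"$LCTL" --version 2>/dev/null | cut -d' ' -f2
}
```

Calling `lustre_version` on a node prints whichever form of the version string the installed lctl supports.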

Looking at the client 1 (vm10) console log, we see

[41930.736352] Lustre: DEBUG MARKER: == conf-sanity test 53a: check OSS thread count params =============================================== 07:02:08 (1573743728)
[41981.099044] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
[41982.940633] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
[41983.114571] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-52vm10: executing \/usr\/sbin\/lustre_rmmod ldiskfs
[41983.389931] Lustre: DEBUG MARKER: trevis-52vm10: executing /usr/sbin/lustre_rmmod ldiskfs
[41983.649199] Key type lgssc unregistered
[42013.771188] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [rmmod:22596]
[42013.772713] Modules linked in: obdecho(OEN) lnet_selftest(OEN) mgc(OEN) fid(OEN) osc(OEN-) ksocklnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet iscsi_ibft iscsi_boot_sysfs crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel 8139too aes_x86_64 crypto_simd glue_helper cryptd pcspkr 8139cp virtio_balloon mii qemu_fw_cfg i2c_piix4 joydev button ata_generic ext4 crc16 jbd2 mbcache ata_piix ahci libahci uhci_hcd virtio_blk ehci_hcd usbcore serio_raw virtio_pci virtio_ring virtio libata floppy sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 [last unloaded: mdc]
[42013.782838] Supported: No, Unsupported modules are loaded
[42013.907792] CPU: 0 PID: 22596 Comm: rmmod Tainted: G           OE      4.12.14-95.29-default #1 SLE12-SP4
[42013.909560] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[42013.910614] task: ffff97ec204c5740 task.stack: ffffb0a8807a0000
[42017.314832] RIP: 0010:__memset+0x24/0x30
[42017.574572] RSP: 0018:ffffb0a8807a3ea8 EFLAGS: 00010206 ORIG_RAX: ffffffffffffff10
[42017.576187] RAX: 5a5a5a5a5a5a5a5a RBX: ffff97ec3b000908 RCX: 000000000000a600
[42017.577470] RDX: 0000000000000000 RSI: 000000000000005a RDI: ffff97ec2302d000
[42017.578736] RBP: ffff97ec3b000488 R08: 0000000000000000 R09: ffff97ec23000000
[42017.580000] R10: 0000000000000000 R11: ffffffffffffffec R12: ffff97ec3af57540
[42017.581266] R13: ffff97ec3b000480 R14: ffff97ec3af57548 R15: 00001fffffffffff
[42017.582538] FS:  00007faf6fce5700(0000) GS:ffff97ec3fc00000(0000) knlGS:0000000000000000
[42017.583965] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42017.585006] CR2: 000000000021e000 CR3: 000000007af06003 CR4: 00000000000606f0
[42017.586287] Call Trace:
[42029.158312]  ptlrpc_free_rq_pool+0x253/0x3e0 [ptlrpc]
[42031.024082]  SyS_delete_module+0x16b/0x250
[42034.276610]  ? exit_to_usermode_loop+0x86/0x9c
[42034.277828]  do_syscall_64+0x74/0x160
[42035.491904]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[42037.828160] RIP: 0033:0x7faf6f3de457
[42037.829308] RSP: 002b:00007fff1802b988 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[42037.831252] RAX: ffffffffffffffda RBX: 00000000006f71d0 RCX: 00007faf6f3de457
[42037.833049] RDX: 00007faf6f4459c0 RSI: 0000000000000800 RDI: 00000000006f7238
[42037.834851] RBP: 0000000000000000 R08: 00007faf6f690f40 R09: 00007fff1802a901
[42037.836663] R10: 000000000000000a R11: 0000000000000206 R12: 00007fff1802cdd4
[42037.838495] R13: 0000000000000000 R14: 00000000006f71d0 R15: 00000000006f7010
[42037.840292] Code: 90 90 90 90 90 90 90 66 66 90 66 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 
[42037.844629] Kernel panic - not syncing: softlockup: hung tasks
[42037.846139] CPU: 0 PID: 22596 Comm: rmmod Tainted: G           OEL     4.12.14-95.29-default #1 SLE12-SP4
[42037.848498] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[42037.849903] Call Trace:
[42038.953574]  <IRQ>
[42038.954178]  dump_stack+0x5a/0x75
[42039.421812]  panic+0xdb/0x23e
[42042.003586]  watchdog_timer_fn+0x208/0x210
[42042.190190]  ? watchdog_park_threads+0x70/0x70
[42042.191162]  __hrtimer_run_queues+0xce/0x210
[42042.354362]  hrtimer_interrupt+0x99/0x190
[42042.355315]  smp_apic_timer_interrupt+0x3a/0x60
[42044.012800]  apic_timer_interrupt+0x8f/0xa0
[42044.014198]  </IRQ>
[42044.015037] RIP: 0010:__memset+0x24/0x30
[42044.016375] RSP: 0018:ffffb0a8807a3ea8 EFLAGS: 00010206 ORIG_RAX: ffffffffffffff10
[42044.018779] RAX: 5a5a5a5a5a5a5a5a RBX: ffff97ec3b000908 RCX: 000000000000a600
[42044.021020] RDX: 0000000000000000 RSI: 000000000000005a RDI: ffff97ec2302d000
[42044.023076] RBP: ffff97ec3b000488 R08: 0000000000000000 R09: ffff97ec23000000
[42044.025330] R10: 0000000000000000 R11: ffffffffffffffec R12: ffff97ec3af57540
[42044.027548] R13: ffff97ec3b000480 R14: ffff97ec3af57548 R15: 00001fffffffffff
[42044.029830]  ptlrpc_free_rq_pool+0x253/0x3e0 [ptlrpc]
[42044.031486]  SyS_delete_module+0x16b/0x250
[42044.032957]  ? exit_to_usermode_loop+0x86/0x9c
[42044.034425]  do_syscall_64+0x74/0x160
[42044.035679]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[42044.037311] RIP: 0033:0x7faf6f3de457
[42044.038540] RSP: 002b:00007fff1802b988 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[42044.040896] RAX: ffffffffffffffda RBX: 00000000006f71d0 RCX: 00007faf6f3de457
[42044.043043] RDX: 00007faf6f4459c0 RSI: 0000000000000800 RDI: 00000000006f7238
[42044.045258] RBP: 0000000000000000 R08: 00007faf6f690f40 R09: 00007fff1802a901
[42044.047518] R10: 000000000000000a R11: 0000000000000206 R12: 00007fff1802cdd4
[42044.049751] R13: 0000000000000000 R14: 00000000006f71d0 R15: 00000000006f7010
[    1.686560] Kernel panic - not syncing: Out of memory and no killable processes...
[    1.686560] 
[    1.688007] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.14-95.29-default #1 SLE12-SP4
[    1.688007] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[    1.688007] Call Trace:
[    1.688007]  dump_stack+0x5a/0x75
[    1.688007]  panic+0xdb/0x23e
[    1.688007]  out_of_memory+0x3a1/0x490
[    1.688007]  __alloc_pages_slowpath+0x7e5/0xa0d
[    1.688007]  __alloc_pages_nodemask+0x1e9/0x210
[    1.688007]  alloc_page_interleave+0x36/0xa0
[    1.688007]  alloc_pages_current+0xa8/0xf0
[    1.688007]  pagecache_get_page+0x4d/0x250
[    1.688007]  ? xattr_resolve_name+0xa8/0xd0
[    1.688007]  grab_cache_page_write_begin+0x1c/0x40
[    1.688007]  simple_write_begin+0x24/0xe0
[    1.688007]  generic_perform_write+0xb2/0x190
[    1.688007]  __generic_file_write_iter+0x184/0x1c0
[    1.688007]  generic_file_write_iter+0xec/0x1d0
[    1.688007]  __vfs_write+0xdc/0x150
[    1.688007]  vfs_write+0xad/0x1a0
[    1.688007]  SyS_write+0x42/0x90
[    1.688007]  xwrite+0x2a/0x5b
[    1.688007]  do_copy+0x8b/0xb8
[    1.688007]  write_buffer+0x27/0x37
[    1.688007]  flush_buffer+0x2b/0x84
[    1.688007]  ? write_buffer+0x37/0x37
[    1.688007]  unxz+0x195/0x297
[    1.688007]  ? unlzo+0x4c4/0x4c4
[    1.688007]  unpack_to_rootfs+0x15d/0x299
[    1.688007]  ? initrd_load+0x3f/0x3f
[    1.688007]  ? parse_header+0x10e/0x10e
[    1.688007]  populate_rootfs+0x61/0x116
[    1.688007]  ? parse_header+0x10e/0x10e
[    1.688007]  do_one_initcall+0x50/0x1b0
[    1.688007]  kernel_init_freeable+0x19a/0x222
[    1.688007]  ? set_debug_rodata+0x11/0x11
[    1.688007]  ? rest_init+0x80/0x80
[    1.688007]  kernel_init+0xa/0x110
[    1.688007]  ret_from_fork+0x35/0x40
[    1.688007] Rebooting in 1 seconds..
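For triaging similar failures, the distinguishing signature in this console log is a soft lockup in rmmod followed by a softlockup panic. A hypothetical helper (the function name and log filename are illustrative, not part of the test framework) to scan a saved console log for that signature:

```shell
#!/bin/sh
# Hypothetical triage helper: return success if a saved console log shows
# the LU-12980 signature, i.e. an NMI-watchdog soft lockup in rmmod
# followed by the "softlockup: hung tasks" kernel panic.
has_rmmod_lockup() {
	grep -q 'soft lockup.*rmmod' "$1" &&
		grep -q 'Kernel panic - not syncing: softlockup' "$1"
}
```

Usage: `has_rmmod_lockup vm10-console.log && echo "matches LU-12980 signature"`.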

Looking at the OSS (vm11) console log, we don’t see any tasks in the ‘D’ (uninterruptible sleep) state, but we do see the following tasks running

[41861.995673] Lustre: DEBUG MARKER: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
[41862.206227] Lustre: DEBUG MARKER: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
[41862.209736] Lustre: DEBUG MARKER: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
[41862.636484] Key type lgssc unregistered
[41862.725313] LNet: 32129:0:(timer.c:244:stt_shutdown()) waiting for 1 threads to terminate
[41864.825960] LNet: Removed LNI 10.9.3.107@tcp
[45533.730112] SysRq : Changing Loglevel
[45533.730911] Loglevel set to 8
[45533.732495] SysRq : Show State
…
[45535.808726] kworker/0:2     R  running task        0 25726      2 0x00000080
[45535.809966] Call Trace:
[45535.810410]  [<ffffffff9aebd19a>] ? process_one_work+0x21a/0x440
[45535.811500]  [<ffffffff9b57f229>] schedule+0x29/0x70
[45535.812372]  [<ffffffff9aebe2c9>] worker_thread+0x1d9/0x3c0
[45535.813338]  [<ffffffff9aebe0f0>] ? manage_workers.isra.26+0x2a0/0x2a0
[45535.814456]  [<ffffffff9aec50d1>] kthread+0xd1/0xe0
[45535.815372]  [<ffffffff9aec5000>] ? insert_kthread_work+0x40/0x40
[45535.816448]  [<ffffffff9b58cd37>] ret_from_fork_nospec_begin+0x21/0x21
[45535.817562]  [<ffffffff9aec5000>] ? insert_kthread_work+0x40/0x40
…
[45535.909490] bash            R  running task        0 32398  32396 0x00000088
[45535.910761] Call Trace:
[45535.911203]  [<ffffffff9aed9278>] sched_show_task+0xa8/0x110
[45535.912208]  [<ffffffff9aed9362>] show_state_filter+0x82/0xd0
[45535.913206]  [<ffffffff9b26de40>] sysrq_handle_showstate+0x10/0x20
[45535.914261]  [<ffffffff9b26e5dd>] __handle_sysrq+0x10d/0x170
[45535.915310]  [<ffffffff9b26ea48>] write_sysrq_trigger+0x28/0x40
[45535.916365]  [<ffffffff9b0c0b10>] proc_reg_write+0x40/0x80
[45535.917331]  [<ffffffff9b048690>] vfs_write+0xc0/0x1f0
[45535.918234]  [<ffffffff9b58ce21>] ? system_call_after_swapgs+0xae/0x146
[45535.919366]  [<ffffffff9b0494af>] SyS_write+0x7f/0xf0
[45535.920246]  [<ffffffff9b58ce21>] ? system_call_after_swapgs+0xae/0x146
[45535.921387]  [<ffffffff9b58cede>] system_call_fastpath+0x25/0x2a
[45535.922429]  [<ffffffff9b58ce21>] ? system_call_after_swapgs+0xae/0x146

Looking at the MDS (vm12) console log, we see

[41862.417381] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-52vm12.trevis.whamcloud.com: executing \/usr\/sbin\/lustre_rmmod ldiskfs
[41862.422703] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-52vm12.trevis.whamcloud.com: executing \/usr\/sbin\/lustre_rmmod ldiskfs
[41862.675612] Lustre: DEBUG MARKER: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
[41862.685734] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-52vm11.trevis.whamcloud.com: executing \/usr\/sbin\/lustre_rmmod ldiskfs
[41862.737460] Lustre: DEBUG MARKER: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
[41862.915808] Lustre: DEBUG MARKER: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
[41863.115812] Key type lgssc unregistered
[41864.147550] LNet: 32315:0:(api-ni.c:2061:lnet_clear_zombies_nis_locked()) Waiting for zombie LNI 10.9.3.108@tcp
[41868.148695] LNet: 32315:0:(api-ni.c:2061:lnet_clear_zombies_nis_locked()) Waiting for zombie LNI 10.9.3.108@tcp
[41870.149762] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect
[41874.150882] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect
[41882.152156] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect
[41898.153717] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect
[41930.155818] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect
[41966.158106] LNet: Removed LNI 10.9.3.108@tcp
[45537.590798] SysRq : Changing Loglevel
[45537.591604] Loglevel set to 8

Although we’ve seen conf-sanity test_53a hang previously, this hang looks different from the three other occurrences seen this year.


Generated at Sat Feb 10 02:57:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.