Lustre / LU-12980

conf-sanity test 53a hangs in lustre_rmmod


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0
    • Component/s: None
    • Environment: SLES12SP4 clients
    • Severity: 3

    Description

      conf-sanity test_53a hangs in lustre_rmmod. Looking at the client test_log for the hang at https://testing.whamcloud.com/test_sets/fe1c7768-07a3-11ea-b934-52540065bddc for the SLES12 SP4 client, the last thing we see is

      trevis-52vm12: == rpc test complete, duration -o sec ================================================================ 15:03:03 (1573743783)
      trevis-52vm11: CMD: trevis-52vm11.trevis.whamcloud.com /sbin/lctl get_param -n version 2>/dev/null ||
      trevis-52vm11: 				/sbin/lctl lustre_build_version 2>/dev/null ||
      trevis-52vm11: 				/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
      trevis-52vm11: == rpc test complete, duration -o sec ================================================================ 15:03:03 (1573743783)
      trevis-52vm12: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
      trevis-52vm11: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
      

      Everything prior to these last lines looks like a typical, successful test_53a run.

      Looking at client 1's (vm10) console log, we see

      [41930.736352] Lustre: DEBUG MARKER: == conf-sanity test 53a: check OSS thread count params =============================================== 07:02:08 (1573743728)
      [41981.099044] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
      [41982.940633] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
      [41983.114571] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-52vm10: executing \/usr\/sbin\/lustre_rmmod ldiskfs
      [41983.389931] Lustre: DEBUG MARKER: trevis-52vm10: executing /usr/sbin/lustre_rmmod ldiskfs
      [41983.649199] Key type lgssc unregistered
      [42013.771188] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [rmmod:22596]
      [42013.772713] Modules linked in: obdecho(OEN) lnet_selftest(OEN) mgc(OEN) fid(OEN) osc(OEN-) ksocklnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet iscsi_ibft iscsi_boot_sysfs crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel 8139too aes_x86_64 crypto_simd glue_helper cryptd pcspkr 8139cp virtio_balloon mii qemu_fw_cfg i2c_piix4 joydev button ata_generic ext4 crc16 jbd2 mbcache ata_piix ahci libahci uhci_hcd virtio_blk ehci_hcd usbcore serio_raw virtio_pci virtio_ring virtio libata floppy sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 [last unloaded: mdc]
      [42013.782838] Supported: No, Unsupported modules are loaded
      [42013.907792] CPU: 0 PID: 22596 Comm: rmmod Tainted: G           OE      4.12.14-95.29-default #1 SLE12-SP4
      [42013.909560] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [42013.910614] task: ffff97ec204c5740 task.stack: ffffb0a8807a0000
      [42017.314832] RIP: 0010:__memset+0x24/0x30
      [42017.574572] RSP: 0018:ffffb0a8807a3ea8 EFLAGS: 00010206 ORIG_RAX: ffffffffffffff10
      [42017.576187] RAX: 5a5a5a5a5a5a5a5a RBX: ffff97ec3b000908 RCX: 000000000000a600
      [42017.577470] RDX: 0000000000000000 RSI: 000000000000005a RDI: ffff97ec2302d000
      [42017.578736] RBP: ffff97ec3b000488 R08: 0000000000000000 R09: ffff97ec23000000
      [42017.580000] R10: 0000000000000000 R11: ffffffffffffffec R12: ffff97ec3af57540
      [42017.581266] R13: ffff97ec3b000480 R14: ffff97ec3af57548 R15: 00001fffffffffff
      [42017.582538] FS:  00007faf6fce5700(0000) GS:ffff97ec3fc00000(0000) knlGS:0000000000000000
      [42017.583965] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [42017.585006] CR2: 000000000021e000 CR3: 000000007af06003 CR4: 00000000000606f0
      [42017.586287] Call Trace:
      [42029.158312]  ptlrpc_free_rq_pool+0x253/0x3e0 [ptlrpc]
      [42031.024082]  SyS_delete_module+0x16b/0x250
      [42034.276610]  ? exit_to_usermode_loop+0x86/0x9c
      [42034.277828]  do_syscall_64+0x74/0x160
      [42035.491904]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [42037.828160] RIP: 0033:0x7faf6f3de457
      [42037.829308] RSP: 002b:00007fff1802b988 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
      [42037.831252] RAX: ffffffffffffffda RBX: 00000000006f71d0 RCX: 00007faf6f3de457
      [42037.833049] RDX: 00007faf6f4459c0 RSI: 0000000000000800 RDI: 00000000006f7238
      [42037.834851] RBP: 0000000000000000 R08: 00007faf6f690f40 R09: 00007fff1802a901
      [42037.836663] R10: 000000000000000a R11: 0000000000000206 R12: 00007fff1802cdd4
      [42037.838495] R13: 0000000000000000 R14: 00000000006f71d0 R15: 00000000006f7010
      [42037.840292] Code: 90 90 90 90 90 90 90 66 66 90 66 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 
      [42037.844629] Kernel panic - not syncing: softlockup: hung tasks
      [42037.846139] CPU: 0 PID: 22596 Comm: rmmod Tainted: G           OEL     4.12.14-95.29-default #1 SLE12-SP4
      [42037.848498] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [42037.849903] Call Trace:
      [42038.953574]  <IRQ>
      [42038.954178]  dump_stack+0x5a/0x75
      [42039.421812]  panic+0xdb/0x23e
      [42042.003586]  watchdog_timer_fn+0x208/0x210
      [42042.190190]  ? watchdog_park_threads+0x70/0x70
      [42042.191162]  __hrtimer_run_queues+0xce/0x210
      [42042.354362]  hrtimer_interrupt+0x99/0x190
      [42042.355315]  smp_apic_timer_interrupt+0x3a/0x60
      [42044.012800]  apic_timer_interrupt+0x8f/0xa0
      [42044.014198]  </IRQ>
      [42044.015037] RIP: 0010:__memset+0x24/0x30
      [42044.016375] RSP: 0018:ffffb0a8807a3ea8 EFLAGS: 00010206 ORIG_RAX: ffffffffffffff10
      [42044.018779] RAX: 5a5a5a5a5a5a5a5a RBX: ffff97ec3b000908 RCX: 000000000000a600
      [42044.021020] RDX: 0000000000000000 RSI: 000000000000005a RDI: ffff97ec2302d000
      [42044.023076] RBP: ffff97ec3b000488 R08: 0000000000000000 R09: ffff97ec23000000
      [42044.025330] R10: 0000000000000000 R11: ffffffffffffffec R12: ffff97ec3af57540
      [42044.027548] R13: ffff97ec3b000480 R14: ffff97ec3af57548 R15: 00001fffffffffff
      [42044.029830]  ptlrpc_free_rq_pool+0x253/0x3e0 [ptlrpc]
      [42044.031486]  SyS_delete_module+0x16b/0x250
      [42044.032957]  ? exit_to_usermode_loop+0x86/0x9c
      [42044.034425]  do_syscall_64+0x74/0x160
      [42044.035679]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [42044.037311] RIP: 0033:0x7faf6f3de457
      [42044.038540] RSP: 002b:00007fff1802b988 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
      [42044.040896] RAX: ffffffffffffffda RBX: 00000000006f71d0 RCX: 00007faf6f3de457
      [42044.043043] RDX: 00007faf6f4459c0 RSI: 0000000000000800 RDI: 00000000006f7238
      [42044.045258] RBP: 0000000000000000 R08: 00007faf6f690f40 R09: 00007fff1802a901
      [42044.047518] R10: 000000000000000a R11: 0000000000000206 R12: 00007fff1802cdd4
      [42044.049751] R13: 0000000000000000 R14: 00000000006f71d0 R15: 00000000006f7010
      [    1.686560] Kernel panic - not syncing: Out of memory and no killable processes...
      [    1.686560] 
      [    1.688007] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.14-95.29-default #1 SLE12-SP4
      [    1.688007] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [    1.688007] Call Trace:
      [    1.688007]  dump_stack+0x5a/0x75
      [    1.688007]  panic+0xdb/0x23e
      [    1.688007]  out_of_memory+0x3a1/0x490
      [    1.688007]  __alloc_pages_slowpath+0x7e5/0xa0d
      [    1.688007]  __alloc_pages_nodemask+0x1e9/0x210
      [    1.688007]  alloc_page_interleave+0x36/0xa0
      [    1.688007]  alloc_pages_current+0xa8/0xf0
      [    1.688007]  pagecache_get_page+0x4d/0x250
      [    1.688007]  ? xattr_resolve_name+0xa8/0xd0
      [    1.688007]  grab_cache_page_write_begin+0x1c/0x40
      [    1.688007]  simple_write_begin+0x24/0xe0
      [    1.688007]  generic_perform_write+0xb2/0x190
      [    1.688007]  __generic_file_write_iter+0x184/0x1c0
      [    1.688007]  generic_file_write_iter+0xec/0x1d0
      [    1.688007]  __vfs_write+0xdc/0x150
      [    1.688007]  vfs_write+0xad/0x1a0
      [    1.688007]  SyS_write+0x42/0x90
      [    1.688007]  xwrite+0x2a/0x5b
      [    1.688007]  do_copy+0x8b/0xb8
      [    1.688007]  write_buffer+0x27/0x37
      [    1.688007]  flush_buffer+0x2b/0x84
      [    1.688007]  ? write_buffer+0x37/0x37
      [    1.688007]  unxz+0x195/0x297
      [    1.688007]  ? unlzo+0x4c4/0x4c4
      [    1.688007]  unpack_to_rootfs+0x15d/0x299
      [    1.688007]  ? initrd_load+0x3f/0x3f
      [    1.688007]  ? parse_header+0x10e/0x10e
      [    1.688007]  populate_rootfs+0x61/0x116
      [    1.688007]  ? parse_header+0x10e/0x10e
      [    1.688007]  do_one_initcall+0x50/0x1b0
      [    1.688007]  kernel_init_freeable+0x19a/0x222
      [    1.688007]  ? set_debug_rodata+0x11/0x11
      [    1.688007]  ? rest_init+0x80/0x80
      [    1.688007]  kernel_init+0xa/0x110
      [    1.688007]  ret_from_fork+0x35/0x40
      [    1.688007] Rebooting in 1 seconds..
      

      Looking at the OSS (vm11) console log, we don’t see any tasks in a ‘D’ (uninterruptible sleep) state, and we see the following tasks running

      [41861.995673] Lustre: DEBUG MARKER: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
      [41862.206227] Lustre: DEBUG MARKER: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
      [41862.209736] Lustre: DEBUG MARKER: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
      [41862.636484] Key type lgssc unregistered
      [41862.725313] LNet: 32129:0:(timer.c:244:stt_shutdown()) waiting for 1 threads to terminate
      [41864.825960] LNet: Removed LNI 10.9.3.107@tcp
      [45533.730112] SysRq : Changing Loglevel
      [45533.730911] Loglevel set to 8
      [45533.732495] SysRq : Show State
      …
      [45535.808726] kworker/0:2     R  running task        0 25726      2 0x00000080
      [45535.809966] Call Trace:
      [45535.810410]  [<ffffffff9aebd19a>] ? process_one_work+0x21a/0x440
      [45535.811500]  [<ffffffff9b57f229>] schedule+0x29/0x70
      [45535.812372]  [<ffffffff9aebe2c9>] worker_thread+0x1d9/0x3c0
      [45535.813338]  [<ffffffff9aebe0f0>] ? manage_workers.isra.26+0x2a0/0x2a0
      [45535.814456]  [<ffffffff9aec50d1>] kthread+0xd1/0xe0
      [45535.815372]  [<ffffffff9aec5000>] ? insert_kthread_work+0x40/0x40
      [45535.816448]  [<ffffffff9b58cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [45535.817562]  [<ffffffff9aec5000>] ? insert_kthread_work+0x40/0x40
      …
      [45535.909490] bash            R  running task        0 32398  32396 0x00000088
      [45535.910761] Call Trace:
      [45535.911203]  [<ffffffff9aed9278>] sched_show_task+0xa8/0x110
      [45535.912208]  [<ffffffff9aed9362>] show_state_filter+0x82/0xd0
      [45535.913206]  [<ffffffff9b26de40>] sysrq_handle_showstate+0x10/0x20
      [45535.914261]  [<ffffffff9b26e5dd>] __handle_sysrq+0x10d/0x170
      [45535.915310]  [<ffffffff9b26ea48>] write_sysrq_trigger+0x28/0x40
      [45535.916365]  [<ffffffff9b0c0b10>] proc_reg_write+0x40/0x80
      [45535.917331]  [<ffffffff9b048690>] vfs_write+0xc0/0x1f0
      [45535.918234]  [<ffffffff9b58ce21>] ? system_call_after_swapgs+0xae/0x146
      [45535.919366]  [<ffffffff9b0494af>] SyS_write+0x7f/0xf0
      [45535.920246]  [<ffffffff9b58ce21>] ? system_call_after_swapgs+0xae/0x146
      [45535.921387]  [<ffffffff9b58cede>] system_call_fastpath+0x25/0x2a
      [45535.922429]  [<ffffffff9b58ce21>] ? system_call_after_swapgs+0xae/0x146
      
      

      Looking at the MDS (vm12) console log, we see

      [41862.417381] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-52vm12.trevis.whamcloud.com: executing \/usr\/sbin\/lustre_rmmod ldiskfs
      [41862.422703] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-52vm12.trevis.whamcloud.com: executing \/usr\/sbin\/lustre_rmmod ldiskfs
      [41862.675612] Lustre: DEBUG MARKER: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
      [41862.685734] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-52vm11.trevis.whamcloud.com: executing \/usr\/sbin\/lustre_rmmod ldiskfs
      [41862.737460] Lustre: DEBUG MARKER: trevis-52vm12.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
      [41862.915808] Lustre: DEBUG MARKER: trevis-52vm11.trevis.whamcloud.com: executing /usr/sbin/lustre_rmmod ldiskfs
      [41863.115812] Key type lgssc unregistered
      [41864.147550] LNet: 32315:0:(api-ni.c:2061:lnet_clear_zombies_nis_locked()) Waiting for zombie LNI 10.9.3.108@tcp
      [41868.148695] LNet: 32315:0:(api-ni.c:2061:lnet_clear_zombies_nis_locked()) Waiting for zombie LNI 10.9.3.108@tcp
      [41870.149762] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect
      [41874.150882] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect
      [41882.152156] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect
      [41898.153717] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect
      [41930.155818] LNet: 32315:0:(socklnd.c:2547:ksocknal_shutdown()) waiting for 2 peers to disconnect
      [41966.158106] LNet: Removed LNI 10.9.3.108@tcp
      [45537.590798] SysRq : Changing Loglevel
      [45537.591604] Loglevel set to 8
      

      Although we’ve seen conf-sanity test_53a hang previously, this hang looks different from the three other occurrences seen this year.

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (Inactive) (jamesanunez)
