LU-10839: obdfilter-survey test 1a hangs with soft lockup on a client - socknal_cd

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version: Lustre 2.11.0

    Description

      obdfilter-survey test_1a hangs. For the failure at https://testing.hpdd.intel.com/test_sets/2a6d8a68-2d45-11e8-b74b-52540065bddc, the last thing seen in the client test_log is

      == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 17:00:36 (1521651636)
      Unable to detect ip address for host: ''
      + NETTYPE=tcp thrlo=8 nobjhi=1 thrhi=16 size=1024 case=disk rslt_loc=/tmp targets="10.9.6.3:lustre-OST0000 10.9.6.3:lustre-OST0001 10.9.6.3:lustre-OST0002 10.9.6.3:lustre-OST0003 10.9.6.3:lustre-OST0004 10.9.6.3:lustre-OST0005 10.9.6.3:lustre-OST0006 10.9.6.3:lustre-OST0007" /usr/bin/obdfilter-survey
      Warning: Permanently added '10.9.6.3' (ECDSA) to the list of known hosts.
      Wed Mar 21 17:00:46 UTC 2018 Obdfilter-survey for case=disk from trevis-43vm1.trevis.hpdd.intel.com
      ost  8 sz  8388608K rsz 1024K obj    8 thr   64 write   39.97 [   0.00,   11.00] rewrite   48.09 [   0.00,   17.00] read   91.69 [   0.00,  102.99]
      ost  8 sz  8388608K rsz 1024K obj    8 thr  128 write   33.90 [   0.00,   27.00] rewrite   46.32 [   0.00,   22.00] read
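
      For reference, the survey above can be re-run by hand on the OSS with the same parameters. The sketch below copies the thread, object, and size limits from this run; the target list is truncated here and would need to name the OSTs actually configured on the node.

      # Re-run the same case=disk survey directly on the OSS (10.9.6.3 in this run).
      # Limits are copied from the failed run above; the target list is shortened
      # for brevity and must list the real local OSTs.
      NETTYPE=tcp thrlo=8 thrhi=16 nobjhi=1 size=1024 case=disk rslt_loc=/tmp \
          targets="10.9.6.3:lustre-OST0000 10.9.6.3:lustre-OST0001" \
          /usr/bin/obdfilter-survey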
      

       On the client console we see

      [ 4793.948275] LustreError: 166-1: MGC10.9.6.4@tcp: Connection to MGS (at 10.9.6.4@tcp) was lost; in progress operations using this service will fail
      [ 4997.543363] INFO: rcu_sched self-detected stall on CPU { 0
      [ 4997.544362] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 1, t=213158 jiffies, g=449683, c=449682, q=159)
      [ 4997.544362] Task dump for CPU 0:
      [ 4997.544362] kworker/0:2     R  running task        0    46      2 0x00000008
      [ 4997.544339] }  (t=213158 jiffies g=449683 c=449682 q=159)
      [ 4997.544362] Call Trace:
      [ 4997.544362]  [<ffffffff816b3b0c>] ? __schedule+0x47c/0xa30
      [ 4997.544362]  [<ffffffff810abeca>] ? process_one_work+0x21a/0x440
      [ 4997.544362]  [<ffffffff816b40e9>] schedule+0x29/0x70
      [ 4997.544362]  [<ffffffff810acba9>] worker_thread+0x1d9/0x3c0
      [ 4997.544362]  [<ffffffff810ac9d0>] ? manage_workers.isra.24+0x2a0/0x2a0
      [ 4997.544362]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [ 4997.544362]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [ 4997.544362]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [ 4997.544362]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [ 4997.544339] Task dump for CPU 0:
      [ 4997.544339] kworker/0:2     R  running task        0    46      2 0x00000008
      [ 4997.544339] Call Trace:
      [ 4997.544339]  <IRQ>  [<ffffffff810c72c8>] sched_show_task+0xa8/0x110
      [ 4997.544339]  [<ffffffff810cade9>] dump_cpu_task+0x39/0x70
      [ 4997.544339]  [<ffffffff8113baf0>] rcu_dump_cpu_stacks+0x90/0xd0
      [ 4997.544339]  [<ffffffff8113f192>] rcu_check_callbacks+0x452/0x740
      [ 4997.544339]  [<ffffffff810eef7c>] ? update_wall_time+0x26c/0x6c0
      [ 4997.544339]  [<ffffffff810f6c00>] ? tick_sched_do_timer+0x50/0x50
      [ 4997.544339]  [<ffffffff8109d886>] update_process_times+0x46/0x80
      [ 4997.544339]  [<ffffffff810f6a00>] tick_sched_handle+0x30/0x70
      [ 4997.544339]  [<ffffffff810f6c39>] tick_sched_timer+0x39/0x80
      [ 4997.544339]  [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
      [ 4997.544339]  [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
      [ 4997.544339]  [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
      [ 4997.544339]  [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
      [ 4997.544339]  [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
      [ 4997.544339]  <EOI>  [<ffffffff810335b9>] ? sched_clock+0x9/0x10
      [ 4997.544339]  [<ffffffff810c28a4>] ? finish_task_switch+0x54/0x170
      [ 4997.544339]  [<ffffffff816b3b0c>] __schedule+0x47c/0xa30
      [ 4997.544339]  [<ffffffff810abeca>] ? process_one_work+0x21a/0x440
      [ 4997.544339]  [<ffffffff816b40e9>] schedule+0x29/0x70
      [ 4997.544339]  [<ffffffff810acba9>] worker_thread+0x1d9/0x3c0
      [ 4997.544339]  [<ffffffff810ac9d0>] ? manage_workers.isra.24+0x2a0/0x2a0
      [ 4997.544339]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [ 4997.544339]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [ 4997.544339]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [ 4997.544339]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [ 5061.726259] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 25s! [socknal_cd00:4048]
      [ 5061.727248] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi ppdev crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper virtio_blk syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix drm libata crct10dif_pclmul crct10dif_common crc32c_intel 8139too serio_raw virtio_pci 8139cp virtio_ring virtio mii i2c_core floppy [last unloaded: lnet_selftest]
      
      [ 5076.551703] CPU: 0 PID: 4048 Comm: socknal_cd00 Tainted: G           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
      [ 5076.551703] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [ 5076.551703] task: ffff8800359b0000 ti: ffff88007af5c000 task.ti: ffff88007af5c000
      [ 5076.551703] RIP: 0010:[<ffffffff816b6545>]  [<ffffffff816b6545>] _raw_spin_unlock_irqrestore+0x15/0x20
      [ 5076.551703] RSP: 0018:ffff88007af5fde0  EFLAGS: 00000246
      [ 5076.551703] RAX: dead000000000200 RBX: ffff88007af5fe08 RCX: dead000000000200
      [ 5076.551703] RDX: ffff88007a8dbe70 RSI: 0000000000000246 RDI: 0000000000000246
      [ 5076.551703] RBP: ffff88007af5fde0 R08: ffff88007af5fe70 R09: 0000000000000000
      [ 5076.551703] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88007af5fd60
      [ 5076.551703] R13: 0000000000000000 R14: 00000000000001bf R15: 00000000000007ca
      [ 5076.551703] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      [ 5076.551703] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5076.551703] CR2: 00000000000000b8 CR3: 000000007afd4000 CR4: 00000000000606f0
      [ 5076.551703] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 5076.551703] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 5076.551703] Call Trace:
      [ 5076.551703]  [<ffffffff810b4d21>] remove_wait_queue+0x31/0x40
      [ 5076.551703]  [<ffffffffc0798cd2>] ksocknal_connd+0x332/0xd60 [ksocklnd]
      [ 5076.551703]  [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
      [ 5076.551703]  [<ffffffffc07989a0>] ? ksocknal_thread_fini+0x30/0x30 [ksocklnd]
      [ 5076.551703]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [ 5076.551703]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [ 5076.551703]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [ 5076.551703]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [ 5076.551703] Code: 07 00 66 66 66 90 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 c6 07 00 66 66 66 90 48 89 f7 57 9d <66> 66 90 66 90 5d c3 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 48
      [ 5076.551703] Kernel panic - not syncing: softlockup: hung tasks
      [ 5076.551703] CPU: 0 PID: 4048 Comm: socknal_cd00 Tainted: G           OEL ------------   3.10.0-693.21.1.el7.x86_64 #1
      [ 5076.551703] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [ 5076.551703] Call Trace:
      [ 5076.551703]  <IRQ>  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
      [ 5076.551703]  [<ffffffff816a8634>] panic+0xe8/0x21f
      [ 5076.551703]  [<ffffffff8102d7cf>] ? show_regs+0x5f/0x210
      [ 5076.551703]  [<ffffffff811334e1>] watchdog_timer_fn+0x231/0x240
      [ 5076.551703]  [<ffffffff811332b0>] ? watchdog+0x40/0x40
      [ 5076.551703]  [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
      [ 5076.551703]  [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
      [ 5076.551703]  [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
      [ 5076.551703]  [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
      [ 5076.551703]  [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
      [ 5076.551703]  <EOI>  [<ffffffff816b6545>] ? _raw_spin_unlock_irqrestore+0x15/0x20
      [ 5076.551703]  [<ffffffff810b4d21>] remove_wait_queue+0x31/0x40
      [ 5076.551703]  [<ffffffffc0798cd2>] ksocknal_connd+0x332/0xd60 [ksocklnd]
      [ 5076.551703]  [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
      [ 5076.551703]  [<ffffffffc07989a0>] ? ksocknal_thread_fini+0x30/0x30 [ksocklnd]
      [ 5076.551703]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [ 5076.551703]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [ 5076.551703]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [ 5076.551703]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
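
      Both soft-lockup stacks end in remove_wait_queue() called from ksocknal_connd+0x332/0xd60 in ksocklnd. If a ksocklnd.ko with debug info matching this kernel (3.10.0-693.21.1.el7.x86_64) is available, that offset can be mapped back to a source line with gdb; the module path below is only an assumption, not the location on the test node.

      # Map the ksocknal_connd+0x332 offset from the stack trace to a source line.
      # Needs a ksocklnd.ko built with debug info for the running kernel; the path
      # shown here is an example only.
      gdb -batch \
          -ex 'list *(ksocknal_connd+0x332)' \
          /usr/lib/debug/lib/modules/3.10.0-693.21.1.el7.x86_64/extra/ksocklnd.ko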
      

       

      On the OSS, the following is seen in the console log and may be related

      [ 3746.842987] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 17:00:36 (1521651636) 
      [ 3747.568718] Lustre: Echo OBD driver; http://www.lustre.org/ 
      [ 3750.063791] LustreError: 0-0: lustre-OST0000: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already 
      [ 3750.069266] LustreError: Skipped 15 previous similar messages 
      [ 4697.745115] LNetError: 28867:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.6.1@tcp[1], ip 10.9.6.1:1023, with error, wanted: 136, left: 136, last alive is 18 secs ago 
      [ 4697.750855] LustreError: 28867:0:(events.c:304:request_in_callback()) event type 2, status -5, service ost 
      [ 4697.754178] LustreError: 12877:0:(pack_generic.c:591:__lustre_unpack_msg()) message length 0 too small for magic/version check 
      [ 4697.759336] LustreError: 12877:0:(sec.c:2070:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.9.6.1@tcp x1595563597557120
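
      The quota warning at 3750s appears at echo-device setup, well before the hang, and is likely a separate issue; the remedy is the one the message itself suggests, i.e. running tunefs.lustre --quota while the target is unmounted. A rough sketch, in which the device and mount point are assumptions:

      # Enable space accounting on an OST, per the "run tunefs.lustre --quota"
      # hint in the console message above.  /dev/sdb and /mnt/lustre-ost0 are
      # placeholders for the real OST device and mount point.
      umount /mnt/lustre-ost0
      tunefs.lustre --quota /dev/sdb
      mount -t lustre /dev/sdb /mnt/lustre-ost0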
      
      

       

      I've found only one instance of this hang so far.


            People

              Assignee: wc-triage (WC Triage)
              Reporter: jamesanunez (James Nunez) (Inactive)
              Votes: 0
              Watchers: 2
