Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.11.0
-
None
-
3
-
9223372036854775807
Description
obdfilter-survey test_1a hangs. For the failure at https://testing.hpdd.intel.com/test_sets/2a6d8a68-2d45-11e8-b74b-52540065bddc, the last thing seen in the client test_log is
== obdfilter-survey test 1a: Object Storage Targets survey =========================================== 17:00:36 (1521651636) Unable to detect ip address for host: '' + NETTYPE=tcp thrlo=8 nobjhi=1 thrhi=16 size=1024 case=disk rslt_loc=/tmp targets="10.9.6.3:lustre-OST0000 10.9.6.3:lustre-OST0001 10.9.6.3:lustre-OST0002 10.9.6.3:lustre-OST0003 10.9.6.3:lustre-OST0004 10.9.6.3:lustre-OST0005 10.9.6.3:lustre-OST0006 10.9.6.3:lustre-OST0007" /usr/bin/obdfilter-survey Warning: Permanently added '10.9.6.3' (ECDSA) to the list of known hosts. Wed Mar 21 17:00:46 UTC 2018 Obdfilter-survey for case=disk from trevis-43vm1.trevis.hpdd.intel.com ost 8 sz 8388608K rsz 1024K obj 8 thr 64 write 39.97 [ 0.00, 11.00] rewrite 48.09 [ 0.00, 17.00] read 91.69 [ 0.00, 102.99] ost 8 sz 8388608K rsz 1024K obj 8 thr 128 write 33.90 [ 0.00, 27.00] rewrite 46.32 [ 0.00, 22.00] read
On the client console we see
[ 4793.948275] LustreError: 166-1: MGC10.9.6.4@tcp: Connection to MGS (at 10.9.6.4@tcp) was lost; in progress operations using this service will fail
[ 4997.543363] INFO: rcu_sched self-detected stall on CPU { 0
[ 4997.544362] INFO: rcu_sched detected stalls on CPUs/tasks: \{ 0} (detected by 1, t=213158 jiffies, g=449683, c=449682, q=159)
[ 4997.544362] Task dump for CPU 0:
[ 4997.544362] kworker/0:2 R running task 0 46 2 0x00000008
[ 4997.544339] } (t=213158 jiffies g=449683 c=449682 q=159)
[ 4997.544362] Call Trace:
[ 4997.544362] [<ffffffff816b3b0c>] ? __schedule+0x47c/0xa30
[ 4997.544362] [<ffffffff810abeca>] ? process_one_work+0x21a/0x440
[ 4997.544362] [<ffffffff816b40e9>] schedule+0x29/0x70
[ 4997.544362] [<ffffffff810acba9>] worker_thread+0x1d9/0x3c0
[ 4997.544362] [<ffffffff810ac9d0>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 4997.544362] [<ffffffff810b4031>] kthread+0xd1/0xe0
[ 4997.544362] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[ 4997.544362] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[ 4997.544362] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[ 4997.544339] Task dump for CPU 0:
[ 4997.544339] kworker/0:2 R running task 0 46 2 0x00000008
[ 4997.544339] Call Trace:
[ 4997.544339] <IRQ> [<ffffffff810c72c8>] sched_show_task+0xa8/0x110
[ 4997.544339] [<ffffffff810cade9>] dump_cpu_task+0x39/0x70
[ 4997.544339] [<ffffffff8113baf0>] rcu_dump_cpu_stacks+0x90/0xd0
[ 4997.544339] [<ffffffff8113f192>] rcu_check_callbacks+0x452/0x740
[ 4997.544339] [<ffffffff810eef7c>] ? update_wall_time+0x26c/0x6c0
[ 4997.544339] [<ffffffff810f6c00>] ? tick_sched_do_timer+0x50/0x50
[ 4997.544339] [<ffffffff8109d886>] update_process_times+0x46/0x80
[ 4997.544339] [<ffffffff810f6a00>] tick_sched_handle+0x30/0x70
[ 4997.544339] [<ffffffff810f6c39>] tick_sched_timer+0x39/0x80
[ 4997.544339] [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
[ 4997.544339] [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
[ 4997.544339] [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
[ 4997.544339] [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
[ 4997.544339] [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
[ 4997.544339] <EOI> [<ffffffff810335b9>] ? sched_clock+0x9/0x10
[ 4997.544339] [<ffffffff810c28a4>] ? finish_task_switch+0x54/0x170
[ 4997.544339] [<ffffffff816b3b0c>] __schedule+0x47c/0xa30
[ 4997.544339] [<ffffffff810abeca>] ? process_one_work+0x21a/0x440
[ 4997.544339] [<ffffffff816b40e9>] schedule+0x29/0x70
[ 4997.544339] [<ffffffff810acba9>] worker_thread+0x1d9/0x3c0
[ 4997.544339] [<ffffffff810ac9d0>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 4997.544339] [<ffffffff810b4031>] kthread+0xd1/0xe0
[ 4997.544339] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[ 4997.544339] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[ 4997.544339] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[ 5061.726259] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 25s! [socknal_cd00:4048]
[ 5061.727248] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_modib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi ppdev crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper virtio_blk syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix drm libata crct10dif_pclmul crct10dif_common crc32c_intel 8139too serio_raw virtio_pci 8139cp virtio_ring virtio mii i2c_core floppy [last unloaded: lnet_selftest]
[ 5076.551703] CPU: 0 PID: 4048 Comm: socknal_cd00 Tainted: G OE ------------ 3.10.0-693.21.1.el7.x86_64 #1
[ 5076.551703] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 5076.551703] task: ffff8800359b0000 ti: ffff88007af5c000 task.ti: ffff88007af5c000
[ 5076.551703] RIP: 0010:[<ffffffff816b6545>] [<ffffffff816b6545>] _raw_spin_unlock_irqrestore+0x15/0x20
[ 5076.551703] RSP: 0018:ffff88007af5fde0 EFLAGS: 00000246
[ 5076.551703] RAX: dead000000000200 RBX: ffff88007af5fe08 RCX: dead000000000200
[ 5076.551703] RDX: ffff88007a8dbe70 RSI: 0000000000000246 RDI: 0000000000000246
[ 5076.551703] RBP: ffff88007af5fde0 R08: ffff88007af5fe70 R09: 0000000000000000
[ 5076.551703] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88007af5fd60
[ 5076.551703] R13: 0000000000000000 R14: 00000000000001bf R15: 00000000000007ca
[ 5076.551703] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 5076.551703] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5076.551703] CR2: 00000000000000b8 CR3: 000000007afd4000 CR4: 00000000000606f0
[ 5076.551703] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5076.551703] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 5076.551703] Call Trace:
[ 5076.551703] [<ffffffff810b4d21>] remove_wait_queue+0x31/0x40
[ 5076.551703] [<ffffffffc0798cd2>] ksocknal_connd+0x332/0xd60 [ksocklnd]
[ 5076.551703] [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
[ 5076.551703] [<ffffffffc07989a0>] ? ksocknal_thread_fini+0x30/0x30 [ksocklnd]
[ 5076.551703] [<ffffffff810b4031>] kthread+0xd1/0xe0
[ 5076.551703] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[ 5076.551703] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[ 5076.551703] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[ 5076.551703] Code: 07 00 66 66 66 90 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 c6 07 00 66 66 66 90 48 89 f7 57 9d <66> 66 90 66 90 5d c3 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 48
[ 5076.551703] Kernel panic - not syncing: softlockup: hung tasks
[ 5076.551703] CPU: 0 PID: 4048 Comm: socknal_cd00 Tainted: G OEL ------------ 3.10.0-693.21.1.el7.x86_64 #1
[ 5076.551703] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 5076.551703] Call Trace:
[ 5076.551703] <IRQ> [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[ 5076.551703] [<ffffffff816a8634>] panic+0xe8/0x21f
[ 5076.551703] [<ffffffff8102d7cf>] ? show_regs+0x5f/0x210
[ 5076.551703] [<ffffffff811334e1>] watchdog_timer_fn+0x231/0x240
[ 5076.551703] [<ffffffff811332b0>] ? watchdog+0x40/0x40
[ 5076.551703] [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
[ 5076.551703] [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
[ 5076.551703] [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
[ 5076.551703] [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
[ 5076.551703] [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
[ 5076.551703] <EOI> [<ffffffff816b6545>] ? _raw_spin_unlock_irqrestore+0x15/0x20
[ 5076.551703] [<ffffffff810b4d21>] remove_wait_queue+0x31/0x40
[ 5076.551703] [<ffffffffc0798cd2>] ksocknal_connd+0x332/0xd60 [ksocklnd]
[ 5076.551703] [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
[ 5076.551703] [<ffffffffc07989a0>] ? ksocknal_thread_fini+0x30/0x30 [ksocklnd]
[ 5076.551703] [<ffffffff810b4031>] kthread+0xd1/0xe0
[ 5076.551703] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[ 5076.551703] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[ 5076.551703] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
On the OSS, the following is seen in the console log and may be related
[ 3746.842987] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 17:00:36 (1521651636) [ 3747.568718] Lustre: Echo OBD driver; http://www.lustre.org/ [ 3750.063791] LustreError: 0-0: lustre-OST0000: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already [ 3750.069266] LustreError: Skipped 15 previous similar messages [ 4697.745115] LNetError: 28867:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.6.1@tcp[1], ip 10.9.6.1:1023, with error, wanted: 136, left: 136, last alive is 18 secs ago [ 4697.750855] LustreError: 28867:0:(events.c:304:request_in_callback()) event type 2, status -5, service ost [ 4697.754178] LustreError: 12877:0:(pack_generic.c:591:__lustre_unpack_msg()) message length 0 too small for magic/version check [ 4697.759336] LustreError: 12877:0:(sec.c:2070:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.9.6.1@tcp x1595563597557120
I've found only one instance of this hang so far.
Attachments
Issue Links
- mentioned in
-
Page Loading...