Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9706

conf-sanity test_53a: MDS soft lockup - CPU#0 stuck for 22s in osd_trans_commit_cb()

    XMLWordPrintable

Details

    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c25bf48e-53d1-11e7-a749-5254006e85c2.

      The sub-test test_53a failed with the following error:

      test failed to respond and timed out
      

      env: tag-2.9.59 #3603 ZFS

      MDS console

      05:24:43:[40506.686671] Lustre: MGS: Connection restored to 1a7ddb58-f96d-6d7b-dd19-c36909e007a9 (at 0@lo)
      05:24:43:[40506.826682] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
      05:24:43:[40506.829961] LustreError: 16342:0:(osd_oi.c:503:osd_oid()) lustre-MDT0000-osd: unsupported quota oid: 0x16
      05:24:43:[40507.031480] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
      05:24:43:[40507.336736] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust
      05:24:43:[40507.892722] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-46vm7.trevis.hpdd.intel.com: executing set_default_debug -1 all 4
      05:24:43:[40507.893599] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-46vm7.trevis.hpdd.intel.com: executing set_default_debug -1 all 4
      05:24:43:[40508.120605] Lustre: DEBUG MARKER: trevis-46vm7.trevis.hpdd.intel.com: executing set_default_debug -1 all 4
      05:24:43:[40508.128355] Lustre: DEBUG MARKER: trevis-46vm7.trevis.hpdd.intel.com: executing set_default_debug -1 all 4
      05:24:43:[40508.289854] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      05:24:43:[40508.591867] Lustre: DEBUG MARKER: zfs get -H -o value 				lustre:svname lustre-mdt1/mdt1 2>/dev/null | 				grep -E ':[a-zA-Z]{3}[0-9]{4}'
      05:24:43:[40508.901939] Lustre: DEBUG MARKER: zfs get -H -o value 				lustre:svname lustre-mdt1/mdt1 2>/dev/null | 				grep -E ':[a-zA-Z]{3}[0-9]{4}'
      05:24:43:[40509.212232] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname 		                           lustre-mdt1/mdt1 2>/dev/null
      05:24:43:[40509.520874] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      05:24:43:[40511.168162] Lustre: MGS: Connection restored to 00faafbe-57f8-e637-4480-1f8b3bfa5137 (at 10.9.6.35@tcp)
      05:24:43:[40511.800039] Lustre: 15064:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1497677067/real 1497677067]  req@ffff8800363c0300 x1570428225061024/t0(0) o8->lustre-OST0000-osc-MDT0000@10.9.6.35@tcp:28/4 lens 520/544 e 0 to 1 dl 1497677072 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      05:24:43:[40512.608633] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-46vm8.trevis.hpdd.intel.com: executing set_default_debug -1 all 4
      05:25:17:[40512.800854] Lustre: DEBUG MARKER: trevis-46vm8.trevis.hpdd.intel.com: executing set_default_debug -1 all 4
      05:25:17:[40513.793184] Lustre: MGS: Connection restored to 7310674c-545b-174e-1751-630f6aaa5a8a (at 10.9.6.28@tcp)
      05:25:17:[40513.796051] Lustre: Skipped 2 previous similar messages
      05:25:17:[40516.831820] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_statfs to node 10.9.6.35@tcp failed: rc = -107
      05:25:17:[40516.835465] Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 10.9.6.35@tcp) was lost; in progress operations using this service will wait for recovery to complete
      05:25:17:[40518.877430] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts
      05:25:17:[40519.195771] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
      05:25:17:[40544.062002] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [tx_commit_cb:16414]
      05:25:17:
      05:25:17:[40544.062002] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) osd_zfs(OE) lquota(OE) lustre(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel ppdev lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk cirrus crct10dif_pclmul crct10dif_common drm_kms_helper syscopyarea sysfillrect crc32c_intel 8139too serio_raw sysimgblt virtio_pci fb_sys_fops virtio_ring ttm virtio ata_piix 8139cp mii drm libata i2c_core floppy [last unloaded: libcfs]
      05:25:17:[40544.062002] CPU: 1 PID: 16414 Comm: tx_commit_cb Tainted: P           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
      05:25:17:[40544.062002] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      05:25:17:[40544.062002] task: ffff88005d50bec0 ti: ffff88004361c000 task.ti: ffff88004361c000
      05:25:17:[40544.062002] RIP: 0010:[<ffffffffa08c4fe4>]  [<ffffffffa08c4fe4>] dt_txn_hook_commit+0x34/0x60 [obdclass]
      05:25:17:[40544.062002] RSP: 0018:ffff88004361fd78  EFLAGS: 00000297
      05:25:17:[40544.062002] RAX: 0000000000000000 RBX: ffff8800694b94a0 RCX: ffff880069efc000
      05:25:17:[40544.062002] RDX: ffff880069efc048 RSI: ffffffffa1287600 RDI: ffff88005e026e00
      05:25:17:[40544.062002] RBP: ffff88004361fd88 R08: 20737365636f7250 R09: 0a64657265746e65
      05:25:17:[40544.062002] R10: ffff880014dc2535 R11: 0a64657265746e65 R12: 0000000000000000
      05:25:17:[40544.062002] R13: ffffffff810ce9ac R14: ffff88004361fd20 R15: 00000000a57bc182
      05:25:17:[40544.062002] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
      05:25:17:[40544.062002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      05:25:17:[40544.062002] CR2: 00007f23d5a9e000 CR3: 000000007b0d7000 CR4: 00000000000406e0
      05:25:17:[40544.062002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      05:25:17:[40544.062002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      05:25:17:[40544.062002] Stack:
      05:25:17:[40544.062002]  ffff8800694b94a0 ffff88005cbd0600 ffff88004361fdd0 ffffffffa0f676a8
      05:25:17:[40544.062002]  ffff880069efc000 ffff88007c0c0000 ffff8800694b94a0 ffff8800694b90f0
      05:25:17:[40544.062002]  0000000000000000 ffff8800694b90e0 ffff880079f78800 ffff88004361fe00
      05:25:17:[40544.062002] Call Trace:
      05:25:17:[40544.062002]  [<ffffffffa0f676a8>] osd_trans_commit_cb+0xe8/0x490 [osd_zfs]
      05:25:17:[40544.062002]  [<ffffffffa17d05e4>] dmu_tx_do_callbacks+0x44/0x70 [zfs]
      05:25:17:[40544.062002]  [<ffffffffa1814c74>] txg_do_callbacks+0x14/0x30 [zfs]
      05:25:17:[40544.062002]  [<ffffffffa06d36de>] taskq_thread+0x22e/0x440 [spl]
      05:25:17:[40544.062002]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
      05:25:17:[40544.062002]  [<ffffffffa06d34b0>] ? taskq_thread_spawn+0x60/0x60 [spl]
      05:25:17:[40544.062002]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      05:25:17:[40544.062002]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      05:25:17:[40544.062002]  [<ffffffff81697798>] ret_from_fork+0x58/0x90
      05:25:17:[40544.062002]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      05:25:17:[40544.062002] Code: 41 54 49 89 fc 53 f6 47 48 02 75 42 48 8b 0f 48 8b 41 48 48 8d 51 48 48 39 d0 48 8d 58 d8 74 2e 0f 1f 80 00 00 00 00 48 8b 43 10 <48> 85 c0 74 0d 48 8b 73 18 4c 89 e7 ff d0 49 8b 0c 24 48 8b 43 
      05:25:17:[40544.062002] Kernel panic - not syncing: softlockup: hung tasks
      05:25:17:[40544.062002] CPU: 1 PID: 16414 Comm: tx_commit_cb Tainted: P           OEL ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
      05:25:17:[40544.062002] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      05:25:17:[40544.062002]  ffffffff818da7c9 00000000a57bc182 ffff88007fd03e18 ffffffff8168717f
      05:25:17:[40544.062002]  ffff88007fd03e98 ffffffff816805aa 0000000000000008 ffff88007fd03ea8
      05:25:17:[40544.062002]  ffff88007fd03e48 00000000a57bc182 ffff88007fd03e67 0000000000000000
      05:25:17:[40544.062002] Call Trace:
      05:25:17:[40544.062002]  <IRQ>  [<ffffffff8168717f>] dump_stack+0x19/0x1b
      05:25:17:[40544.062002]  [<ffffffff816805aa>] panic+0xe3/0x1f2
      05:25:17:[40544.062002]  [<ffffffff8112f444>] watchdog_timer_fn+0x234/0x240
      05:25:17:[40544.062002]  [<ffffffff8112f210>] ? watchdog+0x50/0x50
      05:25:17:[40544.062002]  [<ffffffff810b4d72>] __hrtimer_run_queues+0xd2/0x260
      05:25:17:[40544.062002]  [<ffffffff810b5310>] hrtimer_interrupt+0xb0/0x1e0
      05:25:17:[40544.062002]  [<ffffffff816992dc>] ? call_softirq+0x1c/0x30
      05:25:17:[40544.062002]  [<ffffffff81050ff7>] local_apic_timer_interrupt+0x37/0x60
      05:25:17:[40544.062002]  [<ffffffff81699f4f>] smp_apic_timer_interrupt+0x3f/0x60
      05:25:17:[40544.062002]  [<ffffffff8169849d>] apic_timer_interrupt+0x6d/0x80
      05:25:17:[40544.062002]  <EOI>  [<ffffffffa08c4fe4>] ? dt_txn_hook_commit+0x34/0x60 [obdclass]
      05:25:17:[40544.062002]  [<ffffffffa08c4ff2>] ? dt_txn_hook_commit+0x42/0x60 [obdclass]
      05:25:17:[40544.062002]  [<ffffffffa0f676a8>] osd_trans_commit_cb+0xe8/0x490 [osd_zfs]
      05:25:17:[    0.000000] Initializing cgroup subsys cpuset
      05:25:17:[    0.000000] Initializing cgroup subsys cpu
      05:25:17:[    0.000000] Initializing cgroup subsys cpuacct
      05:25:17:[    0.000000] Linux version 3.10.0-514.21.1.el7_lustre.x86_64 (jenkins@trevis-310-el7
      

      Attachments

        Issue Links

          Activity

            People

              bzzz Alex Zhuravlev
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: