Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.9.0
    • lola
      build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168
    • 3
    • 9223372036854775807

    Description

      error happened during soaktesting of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727)
      OSTs formatted with zfs, MDSs formatted with ldiskfs
      DNE is enabled, HSM/robinhood enable and integrated
      4 MDSs with 1 MDT / MDS
      6 OSSs with 4 OSTs / OSS
      Server nodes configured in active-active HA confguration

      Sequence of events:

      • 2016-07-28 08:48:37 - Soak session started
      • 2016-07-28 08:50:34 - First LNet time out:
         
        Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
        Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib10 (62): c: 0, oc: 0, rc: 8
        Jul 28 08:50:43 lola-5 kernel: Lustre: Skipped 4 previous similar messages
        Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]
        Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]
        

        (see also attached file abrt-kernel-oops.tar.bz2; In total 1545 event records of this form
        had been written till the node crashed)

      • 2016-07-28 08:51:03 - First occurrenance of error below. These error flooded the console after some time. (see console log after entry 'Jul 28 08:45:01 lola-5 TIME: Time stamp for console')
        Jul 28 08:51:03 lola-5 kernel: Pid: 28758, comm: ll_ost_io00_048 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
        Jul 28 08:51:03 lola-5 kernel: RIP: 0010:[<ffffffff8129e8af>]  [<ffffffff8129e8af>] __write_lock_failed+0xf/0x20
        Jul 28 08:51:03 lola-5 kernel: RSP: 0018:ffff8803c8e2b918  EFLAGS: 00000287
        Jul 28 08:51:03 lola-5 kernel: RAX: 0000000000000000 RBX: ffff8803c8e2b920 RCX: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: RDX: ffff88044e415a00 RSI: ffff880335d78400 RDI: ffff8803fc143dd8
        Jul 28 08:51:03 lola-5 kernel: RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000008fa9fcc28
        Jul 28 08:51:03 lola-5 kernel: R13: 0000000200000008 R14: ffff8803bac2b0b8 R15: ffffffff810674be
        Jul 28 08:51:03 lola-5 kernel: FS:  0000000000000000(0000) GS:ffff880038640000(0000) knlGS:0000000000000000
        Jul 28 08:51:03 lola-5 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
        Jul 28 08:51:03 lola-5 kernel: CR2: 00007f88b9e46000 CR3: 0000000001a8d000 CR4: 00000000000407e0
        Jul 28 08:51:03 lola-5 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Jul 28 08:51:03 lola-5 kernel: Process ll_ost_io00_048 (pid: 28758, threadinfo ffff8803c8e28000, task ffff8803cb648040)
        
      • 2016-07-28-16 16:02 - oom-killer started (see entry in console) and last mtime update
        for collect data file:
        -rw-r--r-- 1 root root  738427536 Jul 28 16:02 lola-5-20160728-021116.raw.gz
        
      • Node neither accessible via ssh nor console. Node rebooted. No crash dump file was written. (Parameter set_param panic_on_lbug=1 was set).

      Attached files:
      message, console, and debug message (written inbetween Jul 28, 08:48 - 16:02), abrt-kernel-oops.tar.bz2 (content for single event)
      collectl memory and slab counters.

      We'll try to trigger a crashdump on the node which will be affected next and increase debug mask. Current debug files don't contain slab information, as far as I could see.

      Attachments

        1. abrt-kernel-oops.tar.bz2
          3 kB
        2. all-lustre-log.tar.bz2
          759 kB
        3. allocation-per-slab.tar.bz2
          1.81 MB
        4. console-lola-5.log.bz2
          698 kB
        5. lola-2-leak_finder.output.bz2
          221 kB
        6. lola-2-lustre-log.1470213950.128013.bz2
          0.3 kB
        7. lola-6.timeouts.txt
          0.2 kB
        8. lola-7.errors.txt
          619 kB
        9. memory-counter-lola-5-20160728-021116.dat.bz2
          80 kB
        10. messages-lola-5.log.bz2
          582 kB
        11. slab-details-counter-lola-5-20160728-021116.dat.bz2
          2.75 MB
        12. slab-sorted-alloaction.dat.bz2
          5 kB

        Issue Links

          Activity

            [LU-8449] OSS crash with oom-killer started

            Cliff, with the patch above?

            bzzz Alex Zhuravlev added a comment - Cliff, with the patch above?

            System did not oom-kill however all OSS are unusable due to massive timeous.

            cliffw Cliff White (Inactive) added a comment - System did not oom-kill however all OSS are unusable due to massive timeous.

            Will continue with run to see if it ooo-kills

            cliffw Cliff White (Inactive) added a comment - Will continue with run to see if it ooo-kills

            We have not seen an oom-killer, however the immediate soft lockups continue.
            Tends to freeze the node over time. Full console from timeouts attached.

            cliffw Cliff White (Inactive) added a comment - We have not seen an oom-killer, however the immediate soft lockups continue. Tends to freeze the node over time. Full console from timeouts attached.

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/21827
            Subject: LU-8449 osd: retake locks as planned
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7a6836c0e49a3065ef876b6fa41e233502adecd6

            gerrit Gerrit Updater added a comment - Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/21827 Subject: LU-8449 osd: retake locks as planned Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 7a6836c0e49a3065ef876b6fa41e233502adecd6

            Set kernel.panic = 60 on all OSS nodes. Re-ran test. Errors are attached

            cliffw Cliff White (Inactive) added a comment - Set kernel.panic = 60 on all OSS nodes. Re-ran test. Errors are attached
            jhammond John Hammond added a comment -

            This is likely introduced by:

            commit 6cd79ab5860c59c2a640a9e8ca4ee86eec050b43
            Author: Alex Zhuravlev <alexey.zhuravlev@intel.com>
            Date:   Fri Mar 25 12:21:16 2016 +0300
            
                LU-7899 osd: batch EA updates
                
                during file creation we set number of EAs: LMA, VBR, LinkEA, LOVEA, ACLs.
                calling into SA to refill spill again and again is expensive. thus it
                makes sense to postpone this to osd_trans_stop() where all changed EAs
                has been already collected in a temporary buffer.
                
                Change-Id: I8f02a287b96615c3aa550d63ffd9dd3da51b39ee
                Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
                Reviewed-on: http://review.whamcloud.com/19143
                Tested-by: Jenkins
                Tested-by: Maloo <hpdd-maloo@intel.com>
                Reviewed-by: Lai Siyao <lai.siyao@intel.com>
                Reviewed-by: Bobi Jam <bobijam@hotmail.com>
                Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
            

            Alex, can you comment here?

            jhammond John Hammond added a comment - This is likely introduced by: commit 6cd79ab5860c59c2a640a9e8ca4ee86eec050b43 Author: Alex Zhuravlev <alexey.zhuravlev@intel.com> Date: Fri Mar 25 12:21:16 2016 +0300 LU-7899 osd: batch EA updates during file creation we set number of EAs: LMA, VBR, LinkEA, LOVEA, ACLs. calling into SA to refill spill again and again is expensive. thus it makes sense to postpone this to osd_trans_stop() where all changed EAs has been already collected in a temporary buffer. Change-Id: I8f02a287b96615c3aa550d63ffd9dd3da51b39ee Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Reviewed-on: http://review.whamcloud.com/19143 Tested-by: Jenkins Tested-by: Maloo <hpdd-maloo@intel.com> Reviewed-by: Lai Siyao <lai.siyao@intel.com> Reviewed-by: Bobi Jam <bobijam@hotmail.com> Reviewed-by: Oleg Drokin <oleg.drokin@intel.com> Alex, can you comment here?

            The soft lockups appear within minutes of starting soak. Once the first one occurs, we get a cascade.

            Aug  4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580]
            Aug  4 19:58:26 lola-7 kernel: Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm raid1 dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core joydev i2c_i801 ioatdma ses enclosure sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            Aug  4 19:58:26 lola-7 kernel: CPU 3
            Aug  4 19:58:26 lola-7 kernel: Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm raid1 dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core joydev i2c_i801 ioatdma ses enclosure sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            Aug  4 19:58:26 lola-7 kernel:
            Aug  4 19:58:26 lola-7 kernel: Pid: 20580, comm: ll_ost_io00_031 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
            Aug  4 19:58:26 lola-7 kernel: RIP: 0010:[<ffffffff8129e8a9>]  [<ffffffff8129e8a9>] __write_lock_failed+0x9/0x20
            Aug  4 19:58:26 lola-7 kernel: RSP: 0018:ffff88030bfb3778  EFLAGS: 00000287
            Aug  4 19:58:26 lola-7 kernel: RAX: ffff8802a01ab780 RBX: ffff88030bfb3780 RCX: 0000000000026c0a
            Aug  4 19:58:26 lola-7 kernel: RDX: ffff8803c52a8000 RSI: ffff8803c52a8000 RDI: ffff8803a66afb88
            Aug  4 19:58:26 lola-7 kernel: RBP: ffffffff8100bc0e R08: ffff8802d28a96c0 R09: 0000000000000008
            Aug  4 19:58:26 lola-7 kernel: R10: 0000000a0007d83d R11: 0000000000000000 R12: ffff88030bfb36f0
            Aug  4 19:58:26 lola-7 kernel: R13: 0000000000000000 R14: ffffffffa0943036 R15: ffff880360c1f660
            Aug  4 19:58:26 lola-7 kernel: FS:  0000000000000000(0000) GS:ffff880038660000(0000) knlGS:0000000000000000
            Aug  4 19:58:26 lola-7 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
            Aug  4 19:58:26 lola-7 kernel: CR2: 0000000001620874 CR3: 0000000001a8d000 CR4: 00000000000407e0
            Aug  4 19:58:26 lola-7 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            Aug  4 19:58:26 lola-7 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Aug  4 19:58:26 lola-7 kernel: Process ll_ost_io00_031 (pid: 20580, threadinfo ffff88030bfb0000, task ffff88030bc72ab0)
            Aug  4 19:58:26 lola-7 kernel: Stack:
            Aug  4 19:58:26 lola-7 kernel: ffffffff8153d007 ffff88030bfb37b0 ffffffffa0fe2ab0 ffff8802d28a96d0
            Aug  4 19:58:26 lola-7 kernel: <d> ffff8803c52a8000 ffff8803a66afad0 ffff88030bfb38d0 ffff88030bfb37e0
            Aug  4 19:58:26 lola-7 kernel: <d> ffffffffa0fee79b 0000000000000000 ffff8803f6147180 ffff88030bfb37e0
            Aug  4 19:58:26 lola-7 kernel: Call Trace:
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8153d007>] ? _write_lock+0x17/0x20
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fe2ab0>] ? osd_object_sa_dirty_add+0x40/0x80 [osd_zfs]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fee79b>] ? __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0ff09af>] ? __osd_sa_xattr_set+0x25f/0x350 [osd_zfs]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0ff0d8c>] ? osd_xattr_set+0x2ec/0x4d0 [osd_zfs]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa090cdb2>] ? dt_version_set+0x72/0x230 [obdclass]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b5a443>] ? tgt_txn_stop_cb+0x2c3/0x1160 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa090b96e>] ? dt_txn_hook_stop+0x5e/0x90 [obdclass]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fddc3c>] ? osd_trans_stop+0x29c/0x520 [osd_zfs]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa1121a0f>] ? ofd_trans_stop+0x1f/0x60 [ofd]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa1129100>] ? ofd_commitrw_write+0x4d0/0xfa0 [ofd]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa112a18f>] ? ofd_commitrw+0x5bf/0xb10 [ofd]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b612e4>] ? obd_commitrw+0x114/0x380 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b6a080>] ? tgt_brw_write+0xc70/0x1530 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b688bc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b153c1>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff81539896>] ? schedule+0x176/0x3a0
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
            Aug  4 19:58:26 lola-7 kernel: Code: 00 00 48 8b 5b 20 48 83 eb 07 48 39 d9 73 06 48 89 01 31 c0 c3 b8 f2 ff ff ff c3 90 90 90 90 90 90 90 f0 81 07 00 00 00 01 f3 90 <81> 3f 00 00 00 01 75 f6 f0 81 2f 00 00 00 01 0f 85 e2 ff ff ff
            Aug  4 19:58:26 lola-7 kernel: Call Trace:
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8153d007>] ? _write_lock+0x17/0x20
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fe2ab0>] ? osd_object_sa_dirty_add+0x40/0x80 [osd_zfs]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fee79b>] ? __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0ff09af>] ? __osd_sa_xattr_set+0x25f/0x350 [osd_zfs]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0ff0d8c>] ? osd_xattr_set+0x2ec/0x4d0 [osd_zfs]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa090cdb2>] ? dt_version_set+0x72/0x230 [obdclass]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b5a443>] ? tgt_txn_stop_cb+0x2c3/0x1160 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa090b96e>] ? dt_txn_hook_stop+0x5e/0x90 [obdclass]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0fddc3c>] ? osd_trans_stop+0x29c/0x520 [osd_zfs]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa1121a0f>] ? ofd_trans_stop+0x1f/0x60 [ofd]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa1129100>] ? ofd_commitrw_write+0x4d0/0xfa0 [ofd]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa112a18f>] ? ofd_commitrw+0x5bf/0xb10 [ofd]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b612e4>] ? obd_commitrw+0x114/0x380 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b6a080>] ? tgt_brw_write+0xc70/0x1530 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b688bc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b153c1>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff81539896>] ? schedule+0x176/0x3a0
            Aug  4 19:58:26 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
            Aug  4 19:58:26 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
            Aug  4 19:59:41 lola-7 kernel: INFO: task txg_quiesce:12866 blocked for more than 120 seconds.
            Aug  4 19:59:41 lola-7 kernel:      Tainted: P           --L------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
            Aug  4 19:59:41 lola-7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            Aug  4 19:59:41 lola-7 kernel: txg_quiesce   D 0000000000000009     0 12866      2 0x00000080
            Aug  4 19:59:41 lola-7 kernel: ffff88040f2afd20 0000000000000046 0000000000000000 ffff88040f2afce4
            Aug  4 19:59:41 lola-7 kernel: 0000020a00000000 ffff88043fe83000 0000020be7124652 ffff880038675a00
            Aug  4 19:59:41 lola-7 kernel: 000000000000af7e 00000001001dbf93 ffff88041a5f7ad8 ffff88040f2affd8
            Aug  4 19:59:41 lola-7 kernel: Call Trace:
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0313e5d>] cv_wait_common+0x11d/0x130 [spl]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff812a0d8c>] ? __bitmap_weight+0x8c/0xb0
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0313ec5>] __cv_wait+0x15/0x20 [spl]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa03d19fb>] txg_quiesce_thread+0x32b/0x470 [zfs]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa03d16d0>] ? txg_quiesce_thread+0x0/0x470 [zfs]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa03d16d0>] ? txg_quiesce_thread+0x0/0x470 [zfs]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa030f238>] thread_generic_wrapper+0x68/0x80 [spl]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa030f1d0>] ? thread_generic_wrapper+0x0/0x80 [spl]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
            Aug  4 19:59:41 lola-7 kernel: INFO: task ll_ost_io01_001:13081 blocked for more than 120 seconds.
            Aug  4 19:59:41 lola-7 kernel:      Tainted: P           --L------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
            Aug  4 19:59:41 lola-7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            Aug  4 19:59:41 lola-7 kernel: ll_ost_io01_0 D 0000000000000005     0 13081      2 0x00000080
            Aug  4 19:59:41 lola-7 kernel: ffff88040abdf818 0000000000000046 ffff88040abdf7e0 ffff88040abdf7dc
            Aug  4 19:59:41 lola-7 kernel: 0000000000000000 ffff88043fe82800 0000020be1fe58b0 ffff8800386d5a00
            Aug  4 19:59:41 lola-7 kernel: 0000000000000400 00000001001dbf28 ffff88040abc45f8 ffff88040abdffd8
            Aug  4 19:59:41 lola-7 kernel: Call Trace:
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff8153c2e4>] ? down_read+0x24/0x30
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0fe88d7>] osd_attr_get+0x77/0x2e0 [osd_zfs]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa1122861>] ofd_attr_handle_ugid+0x161/0x430 [ofd]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa112842d>] ofd_write_attr_set+0xad/0x8b0 [ofd]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa1128e5e>] ofd_commitrw_write+0x22e/0xfa0 [ofd]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa112e23d>] ? ofd_fmd_find_nolock+0xad/0xd0 [ofd]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa112a18f>] ofd_commitrw+0x5bf/0xb10 [ofd]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0b612e4>] obd_commitrw+0x114/0x380 [ptlrpc]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0b6a080>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff8105ab9c>] ? check_preempt_curr+0x7c/0x90
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0b688bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0b153c1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
            Aug  4 19:59:41 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
            Aug  4 19:59:41 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
            
            cliffw Cliff White (Inactive) added a comment - The soft lockups appear within minutes of starting soak. Once the first one occurs, we get a cascade. Aug 4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580] Aug 4 19:58:26 lola-7 kernel: Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm raid1 dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core joydev i2c_i801 ioatdma ses enclosure sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Aug 4 19:58:26 lola-7 kernel: CPU 3 Aug 4 19:58:26 lola-7 kernel: Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm raid1 dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core joydev i2c_i801 ioatdma ses enclosure sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Aug 4 19:58:26 lola-7 kernel: Aug 4 19:58:26 lola-7 kernel: Pid: 20580, comm: ll_ost_io00_031 Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ Aug 4 19:58:26 lola-7 kernel: RIP: 0010:[<ffffffff8129e8a9>] [<ffffffff8129e8a9>] __write_lock_failed+0x9/0x20 Aug 4 19:58:26 lola-7 kernel: RSP: 0018:ffff88030bfb3778 EFLAGS: 00000287 Aug 4 19:58:26 lola-7 kernel: RAX: ffff8802a01ab780 RBX: ffff88030bfb3780 RCX: 0000000000026c0a Aug 4 19:58:26 lola-7 kernel: RDX: ffff8803c52a8000 RSI: ffff8803c52a8000 RDI: ffff8803a66afb88 Aug 4 19:58:26 lola-7 kernel: RBP: ffffffff8100bc0e R08: ffff8802d28a96c0 R09: 0000000000000008 Aug 4 19:58:26 lola-7 kernel: R10: 0000000a0007d83d R11: 0000000000000000 R12: ffff88030bfb36f0 Aug 4 19:58:26 lola-7 kernel: R13: 0000000000000000 R14: ffffffffa0943036 R15: ffff880360c1f660 Aug 4 19:58:26 lola-7 kernel: FS: 0000000000000000(0000) GS:ffff880038660000(0000) knlGS:0000000000000000 Aug 4 19:58:26 lola-7 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Aug 4 19:58:26 lola-7 kernel: CR2: 0000000001620874 CR3: 0000000001a8d000 CR4: 00000000000407e0 Aug 4 19:58:26 lola-7 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Aug 4 19:58:26 lola-7 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Aug 4 19:58:26 lola-7 kernel: Process ll_ost_io00_031 (pid: 20580, threadinfo ffff88030bfb0000, task ffff88030bc72ab0) Aug 4 19:58:26 lola-7 kernel: Stack: Aug 4 19:58:26 lola-7 kernel: ffffffff8153d007 ffff88030bfb37b0 ffffffffa0fe2ab0 ffff8802d28a96d0 Aug 4 19:58:26 lola-7 kernel: <d> ffff8803c52a8000 ffff8803a66afad0 ffff88030bfb38d0 ffff88030bfb37e0 Aug 4 19:58:26 lola-7 kernel: <d> ffffffffa0fee79b 0000000000000000 ffff8803f6147180 ffff88030bfb37e0 Aug 4 19:58:26 lola-7 kernel: Call Trace: Aug 4 19:58:26 lola-7 kernel: [<ffffffff8153d007>] ? _write_lock+0x17/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fe2ab0>] ? osd_object_sa_dirty_add+0x40/0x80 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fee79b>] ? __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff09af>] ? __osd_sa_xattr_set+0x25f/0x350 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff0d8c>] ? osd_xattr_set+0x2ec/0x4d0 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090cdb2>] ? dt_version_set+0x72/0x230 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b5a443>] ? tgt_txn_stop_cb+0x2c3/0x1160 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090b96e>] ? dt_txn_hook_stop+0x5e/0x90 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fddc3c>] ? osd_trans_stop+0x29c/0x520 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1121a0f>] ? ofd_trans_stop+0x1f/0x60 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1129100>] ? ofd_commitrw_write+0x4d0/0xfa0 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa112a18f>] ? ofd_commitrw+0x5bf/0xb10 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b612e4>] ? obd_commitrw+0x114/0x380 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b6a080>] ? tgt_brw_write+0xc70/0x1530 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b688bc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b153c1>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130 Aug 4 19:58:26 lola-7 kernel: [<ffffffff81539896>] ? schedule+0x176/0x3a0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Aug 4 19:58:26 lola-7 kernel: Code: 00 00 48 8b 5b 20 48 83 eb 07 48 39 d9 73 06 48 89 01 31 c0 c3 b8 f2 ff ff ff c3 90 90 90 90 90 90 90 f0 81 07 00 00 00 01 f3 90 <81> 3f 00 00 00 01 75 f6 f0 81 2f 00 00 00 01 0f 85 e2 ff ff ff Aug 4 19:58:26 lola-7 kernel: Call Trace: Aug 4 19:58:26 lola-7 kernel: [<ffffffff8153d007>] ? _write_lock+0x17/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fe2ab0>] ? osd_object_sa_dirty_add+0x40/0x80 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fee79b>] ? __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff09af>] ? __osd_sa_xattr_set+0x25f/0x350 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff0d8c>] ? osd_xattr_set+0x2ec/0x4d0 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090cdb2>] ? dt_version_set+0x72/0x230 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b5a443>] ? tgt_txn_stop_cb+0x2c3/0x1160 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090b96e>] ? dt_txn_hook_stop+0x5e/0x90 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fddc3c>] ? osd_trans_stop+0x29c/0x520 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1121a0f>] ? ofd_trans_stop+0x1f/0x60 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1129100>] ? ofd_commitrw_write+0x4d0/0xfa0 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa112a18f>] ? ofd_commitrw+0x5bf/0xb10 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b612e4>] ? obd_commitrw+0x114/0x380 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b6a080>] ? tgt_brw_write+0xc70/0x1530 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b688bc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b153c1>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130 Aug 4 19:58:26 lola-7 kernel: [<ffffffff81539896>] ? schedule+0x176/0x3a0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Aug 4 19:59:41 lola-7 kernel: INFO: task txg_quiesce:12866 blocked for more than 120 seconds. Aug 4 19:59:41 lola-7 kernel: Tainted: P --L------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Aug 4 19:59:41 lola-7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 4 19:59:41 lola-7 kernel: txg_quiesce D 0000000000000009 0 12866 2 0x00000080 Aug 4 19:59:41 lola-7 kernel: ffff88040f2afd20 0000000000000046 0000000000000000 ffff88040f2afce4 Aug 4 19:59:41 lola-7 kernel: 0000020a00000000 ffff88043fe83000 0000020be7124652 ffff880038675a00 Aug 4 19:59:41 lola-7 kernel: 000000000000af7e 00000001001dbf93 ffff88041a5f7ad8 ffff88040f2affd8 Aug 4 19:59:41 lola-7 kernel: Call Trace: Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0313e5d>] cv_wait_common+0x11d/0x130 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40 Aug 4 19:59:41 lola-7 kernel: [<ffffffff812a0d8c>] ? __bitmap_weight+0x8c/0xb0 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0313ec5>] __cv_wait+0x15/0x20 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa03d19fb>] txg_quiesce_thread+0x32b/0x470 [zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa03d16d0>] ? txg_quiesce_thread+0x0/0x470 [zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa03d16d0>] ? txg_quiesce_thread+0x0/0x470 [zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa030f238>] thread_generic_wrapper+0x68/0x80 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa030f1d0>] ? thread_generic_wrapper+0x0/0x80 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Aug 4 19:59:41 lola-7 kernel: INFO: task ll_ost_io01_001:13081 blocked for more than 120 seconds. Aug 4 19:59:41 lola-7 kernel: Tainted: P --L------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Aug 4 19:59:41 lola-7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 4 19:59:41 lola-7 kernel: ll_ost_io01_0 D 0000000000000005 0 13081 2 0x00000080 Aug 4 19:59:41 lola-7 kernel: ffff88040abdf818 0000000000000046 ffff88040abdf7e0 ffff88040abdf7dc Aug 4 19:59:41 lola-7 kernel: 0000000000000000 ffff88043fe82800 0000020be1fe58b0 ffff8800386d5a00 Aug 4 19:59:41 lola-7 kernel: 0000000000000400 00000001001dbf28 ffff88040abc45f8 ffff88040abdffd8 Aug 4 19:59:41 lola-7 kernel: Call Trace: Aug 4 19:59:41 lola-7 kernel: [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8153c2e4>] ? down_read+0x24/0x30 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0fe88d7>] osd_attr_get+0x77/0x2e0 [osd_zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa1122861>] ofd_attr_handle_ugid+0x161/0x430 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa112842d>] ofd_write_attr_set+0xad/0x8b0 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa1128e5e>] ofd_commitrw_write+0x22e/0xfa0 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa112e23d>] ? ofd_fmd_find_nolock+0xad/0xd0 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa112a18f>] ofd_commitrw+0x5bf/0xb10 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b612e4>] obd_commitrw+0x114/0x380 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b6a080>] tgt_brw_write+0xc70/0x1530 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffff8105ab9c>] ? check_preempt_curr+0x7c/0x90 Aug 4 19:59:41 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b688bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b153c1>] ptlrpc_main+0xd31/0x1800 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
            jhammond John Hammond added a comment -

            What is kernel.panic set to here? If it's set to 0 then please try sysctl -w kernel.panic=60 to see if we can get the full crash dump message on the console.

            jhammond John Hammond added a comment - What is kernel.panic set to here? If it's set to 0 then please try sysctl -w kernel.panic=60 to see if we can get the full crash dump message on the console.
            jhammond John Hammond added a comment -

            Do we have stack traces for the soft lockups?

            Build 3046 is 4af430de697249d07bcb7fe3401a6caa4415fe31. There have been 56 commits on master from 4af430de697249d07bcb7fe3401a6caa4415fe31 to 0f37c051158a399f7b00536eeec27f5dbdd54168. How long does this take to reproduce?

            jhammond John Hammond added a comment - Do we have stack traces for the soft lockups? Build 3046 is 4af430de697249d07bcb7fe3401a6caa4415fe31. There have been 56 commits on master from 4af430de697249d07bcb7fe3401a6caa4415fe31 to 0f37c051158a399f7b00536eeec27f5dbdd54168. How long does this take to reproduce?
            cliffw Cliff White (Inactive) added a comment - - edited

            Downgraded OSS nodes to build lustre-master 3406 - no oom-killer. Run stopped when MDS failed to complete recovery (existing bug)
            Upgrading OSS nodes to build 3409 Revision: 9d8469694738288fcd246a1e9f22c47918a32c58
            Soak restart Thu Aug 4 19:38:27 PDT 2016
            First soft lockup: Aug 4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580]

            cliffw Cliff White (Inactive) added a comment - - edited Downgraded OSS nodes to build lustre-master 3406 - no oom-killer. Run stopped when MDS failed to complete recovery (existing bug) Upgrading OSS nodes to build 3409 Revision: 9d8469694738288fcd246a1e9f22c47918a32c58 Soak restart Thu Aug 4 19:38:27 PDT 2016 First soft lockup: Aug 4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580]

            People

              bzzz Alex Zhuravlev
              heckes Frank Heckes (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: