Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
    • Environment: Soak test cluster, Lustre 2.9.0 GA release
    • Severity: 3

    Description

      This looks like a few bugs we've seen before, but not enough like any specific one.

      • Started soak
      • Started and completed router delay fault
      • LFSCK triggered.
      • LFSCK does not complete; it has been running for > 1905 s.
        MGS/MDS0 hits a soft lockup on a CPU:
        Jan 19 14:06:24 lola-8 kernel: BUG: soft lockup - CPU#9 stuck for 67s! [OI_scrub:19279]
        Jan 19 14:06:24 lola-8 kernel: Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
        Jan 19 14:06:24 lola-8 kernel: CPU 9
        Jan 19 14:06:24 lola-8 kernel: Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
        Jan 19 14:06:24 lola-8 kernel:
        Jan 19 14:06:24 lola-8 kernel: Pid: 19279, comm: OI_scrub Tainted: P           --L------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
        Jan 19 14:06:24 lola-8 kernel: RIP: 0010:[<ffffffffa10d1e2f>]  [<ffffffffa10d1e2f>] osd_scrub_exec+0xf/0x1dc0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: RSP: 0018:ffff88049472bc20  EFLAGS: 00000282
        Jan 19 14:06:24 lola-8 kernel: RAX: 0000000000000004 RBX: ffff88049472bd00 RCX: ffff8803ac48ced0
        Jan 19 14:06:24 lola-8 kernel: RDX: ffff88049472bda0 RSI: ffff880216078000 RDI: ffff88038a74d000
        Jan 19 14:06:24 lola-8 kernel: RBP: ffffffff8100bc0e R08: ffff88049472bdcf R09: 0000000000000004
        Jan 19 14:06:24 lola-8 kernel: R10: ffff8803ac48cec0 R11: ffff8803ac48ced0 R12: ffffffff81fe6920
        Jan 19 14:06:24 lola-8 kernel: R13: ffff8802a5151210 R14: ffff88038a74d000 R15: ffffffffa08549a8
        Jan 19 14:06:24 lola-8 kernel: FS:  0000000000000000(0000) GS:ffff88044e420000(0000) knlGS:0000000000000000
        Jan 19 14:06:24 lola-8 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
        Jan 19 14:06:24 lola-8 kernel: CR2: 0000003903673b90 CR3: 0000000001a8d000 CR4: 00000000000407e0
        Jan 19 14:06:24 lola-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        Jan 19 14:06:24 lola-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Jan 19 14:06:24 lola-8 kernel: Process OI_scrub (pid: 19279, threadinfo ffff880494728000, task ffff880802c30ab0)
        Jan 19 14:06:24 lola-8 kernel: Stack:
        Jan 19 14:06:24 lola-8 kernel: ffff880216078000 ffff88038a74d000 ffffffffffffff02 ffffffffa10d3c9a
        Jan 19 14:06:24 lola-8 kernel: <d> 0000000000000010 0000000000000202 ffff88049472bc60 0000000000000018
        Jan 19 14:06:24 lola-8 kernel: <d> ffff8803ac48cee0 0000000000000001 ffff8802160790e0 ffff880216079000
        Jan 19 14:06:24 lola-8 kernel: Call Trace:
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3c9a>] ? osd_scrub_next+0xba/0x4b0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d6ec9>] ? osd_inode_iteration+0x4c9/0xd80 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d1e20>] ? osd_scrub_exec+0x0/0x1dc0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3be0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8905>] ? osd_scrub_main+0x885/0xec0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8080>] ? osd_scrub_main+0x0/0xec0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
        Jan 19 14:06:24 lola-8 kernel: Code: 00 00 00 e8 54 fe 7b ff e9 14 ff ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 81 ec e0 00 00 00 48 89 5d d8 <4c> 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 0f 1f 44 00 00
        

        Attempting to get a stack dump, but it is not going well. The system is available if anyone wishes to have a look.
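
        For context on the symptom itself: the kernel's watchdog flags a soft lockup when a thread keeps a CPU for longer than the watchdog threshold (22 to 67 s in the traces here) without ever scheduling, which is exactly what a scan loop that stops making progress will do. Below is a minimal, hypothetical sketch of the loop shape involved and the conventional remedy, cond_resched(); this is illustrative only, not the actual osd_scrub code.

        /*
         * Illustrative sketch only; NOT the actual osd_scrub code.
         * A kthread that loops without ever calling cond_resched() keeps
         * its CPU until the watchdog prints "BUG: soft lockup", as in the
         * traces above. All names here are hypothetical.
         */
        #include <linux/kthread.h>
        #include <linux/sched.h>

        static int scan_thread(void *data)
        {
                unsigned long pos = 0;
                unsigned long end = *(unsigned long *)data;

                while (pos < end && !kthread_should_stop()) {
                        /* per-inode scrub work would happen here */
                        pos++;  /* if pos ever stops advancing, this spins forever */

                        /* yield the CPU periodically so the watchdog stays quiet */
                        cond_resched();
                }
                return 0;
        }

        Note that cond_resched() only silences the watchdog; if the scan position never advances, the thread still loops forever, which is what the "detect dead loop scanning" patch in the Activity below addresses.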

      Attachments

        1. lfsck_namespace.txt
          2 kB
        2. lola-7.console.20170208.txt.gz
          6.29 MB
        3. lola7.dmesg
          238 kB
        4. lola-8.lfsck.hang.stacks.txt
          1.13 MB
        5. lu-9040-soak-8.console.log.txt.gz
          218 kB
        6. lustre-log.1486672102.6086.txt.gz
          11.50 MB
        7. vmcore-dmesg.txt
          512 kB

        Issue Links

          Activity

            [LU-9040] Soft lockup on CPU during lfsck
            pjones Peter Jones added a comment -

            Landed for 2.10

            pjones Peter Jones added a comment - Landed for 2.10

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26751/
            Subject: LU-9040 scrub: detect dead loop scanning
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ad75291c70e67acb6fd607e373be045e9d16d3fc
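
            As background on the subject line above, "detect dead loop scanning" suggests the scrub noticing that its scan position has stopped advancing and bailing out rather than spinning. The sketch below only illustrates that general technique under assumed names (scan_state, scan_check_progress, SCAN_STUCK_LIMIT); the real change is the Gerrit patch linked above.

            /*
             * Illustration of dead-loop detection in a scan; NOT the code
             * from https://review.whamcloud.com/26751/. All names and the
             * threshold are hypothetical.
             */
            #include <linux/errno.h>
            #include <linux/printk.h>
            #include <linux/types.h>

            #define SCAN_STUCK_LIMIT 1000 /* assumed give-up threshold */

            struct scan_state {
                    __u64        pos;      /* current scan position */
                    __u64        prev_pos; /* position at the previous check */
                    unsigned int stuck;    /* consecutive checks without progress */
            };

            /* Call once per iteration; returns -ELOOP when the scan is stuck. */
            static int scan_check_progress(struct scan_state *ss)
            {
                    if (ss->pos != ss->prev_pos) {
                            ss->prev_pos = ss->pos;
                            ss->stuck = 0;
                            return 0;
                    }
                    if (++ss->stuck >= SCAN_STUCK_LIMIT) {
                            pr_warn("scrub: no progress at pos %llu, aborting scan\n",
                                    (unsigned long long)ss->pos);
                            return -ELOOP;
                    }
                    return 0;
            }

            The caller would then abort the scrub (or restart from a known-good position) on -ELOOP instead of iterating over the same inode or group forever.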

            yong.fan nasf (Inactive) added a comment - - edited

            The patch https://review.whamcloud.com/#/c/26751/ has been updated with additional fixes.
            Cliff, would you please try again? Thanks!


            cliffw Cliff White (Inactive) added a comment -

            Created a build with patches https://review.whamcloud.com/#/c/26751/ and https://review.whamcloud.com/#/c/26893/

            • Fails immediately on mount of the MDT.
              soak-8 is currently spinning very hard; it is available for examination.
              [ 6560.530576] CPU: 13 PID: 12279 Comm: OI_scrub Tainted: P           OEL ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1^M
              [ 6560.543976] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
              [ 6560.557569] task: ffff8803fca11f60 ti: ffff8803fca24000 task.ti: ffff8803fca24000^M
              [ 6560.566978] RIP: 0010:[<ffffffffa13a5be9>]  [<ffffffffa13a5be9>] osd_scrub_next+0x9/0x3c0 [osd_ldiskfs]^M
              [ 6560.578543] RSP: 0018:ffff8803fca27d08  EFLAGS: 00000287^M
              [ 6560.585506] RAX: ffffffffa13a5be0 RBX: 0000000000000001 RCX: ffff8803fca27d98^M
              [ 6560.594492] RDX: ffff8803fca27d78 RSI: ffff88082a370000 RDI: ffff88082da75000^M
              [ 6560.603462] RBP: ffff8803fca27d08 R08: 0000000000000001 R09: 0000000000000004^M
              [ 6560.612418] R10: 000000002a2da701 R11: ffffea0020a8b680 R12: 000000002a2da701^M
              [ 6560.621356] R13: ffffea0020a8b680 R14: ffff8803fca27c70 R15: ffff8803fca27d08^M
              [ 6560.630283] FS:  0000000000000000(0000) GS:ffff88082d940000(0000) knlGS:0000000000000000^M
              [ 6560.640284] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
              [ 6560.647665] CR2: 00007f6a3b9ab000 CR3: 00000000019be000 CR4: 00000000000407e0^M
              [ 6560.656614] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
              [ 6560.665553] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400^M
              [ 6560.674488] Stack:^M
              [ 6560.677690]  ffff8803fca27df0 ffffffffa13ab1c9 ffffffffa13a9b50 ffffffffa13a5be0^M
              [ 6560.686960]  ffff88082a370000 00000000810d354f ffff88082da75000 ffff88082a371468^M
              [ 6560.696224]  0000000020000000 0100000000200163 0000000000000000 0000000000000000^M
              [ 6560.705485] Call Trace:^M
              [ 6560.709202]  [<ffffffffa13ab1c9>] osd_inode_iteration+0x2a9/0xcc0 [osd_ldiskfs]^M
              [ 6560.718370]  [<ffffffffa13a9b50>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]^M
              [ 6560.727525]  [<ffffffffa13a5be0>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]^M
              [ 6560.736388]  [<ffffffffa13ac540>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]^M
              [ 6560.745052]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20^M
              [ 6560.752248]  [<ffffffffa13abbe0>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]^M
              [ 6560.761582]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0^M
              [ 6560.768004]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140^M
              [ 6560.776269]  [<ffffffff81697318>] ret_from_fork+0x58/0x90^M
              [ 6560.783271]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140^M
              [ 6560.791526] Code: 48 8d 4c 02 10 48 01 c2 e8 25 fc ff ff 48 83 c4 08 5b 41 5c 41 5d 41 5e 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 <41> 57 49 89 f7 41 56 41 55 49 89 fd 41 54 49 89 d4 53 48 89 cb ^M
              [ 6588.383881] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [OI_scrub:12279]^
              

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/26922
            Subject: LU-9040 scrub: detect dead loop scanning
            Project: fs/lustre-release
            Branch: pfl
            Current Patch Set: 1
            Commit: 533dd2f253c81e8061204da5c785dd9a27063108


            yong.fan nasf (Inactive) added a comment -

            Cliff, thanks for updating the ticket. I see that the CPU soft lockup has come back. I have made another patch, https://review.whamcloud.com/26751, for that. Would you please verify it in the lola test? Thanks!


            cliffw Cliff White (Inactive) added a comment -

            Changed to the latest lustre-master build. It seems we are back to the soft lockup from OI_scrub:

            [ 5964.665025] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [OI_scrub:12743]^M
            [ 5964.674913] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel dm_round_robin ipmi_devintf lrw gf128mul glue_helper ablk_helper cryptd sb_edac ipmi_ssif iTCO_wdt ntb ipmi_si iTCO_vendor_support mei_me sg mei ioatdma edac_core ipmi_msghandler lpc_ich wmi pcspkr shpchp i2c_i801 dm_multipath dm_mod nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb ahci crct10dif_pclmul isci crct10dif_common ptp libahci crc32c_intel libsas pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes^M
            [ 5964.813809] CPU: 6 PID: 12743 Comm: OI_scrub Tainted: P           OE  ------------   3.10.0-514.10.2.el7_lustre.x86_64 #1^M
            [ 5964.827453] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
            [ 5964.841399] task: ffff88083fdc0fb0 ti: ffff8807e5e48000 task.ti: ffff8807e5e48000^M
            [ 5964.851143] RIP: 0010:[<ffffffffa131fb46>]  [<ffffffffa131fb46>] ldiskfs_itable_unused_count+0x16/0x30 [ldiskfs]^M
            [ 5964.863926] RSP: 0018:ffff8807e5e4bd10  EFLAGS: 00000246^M
            [ 5964.871236] RAX: 0000000000000000 RBX: ffff8807e5e4bd57 RCX: ffff88042bacc000^M
            [ 5964.880575] RDX: 00000000000072ce RSI: ffff880405ca2a00 RDI: ffff88042bace000^M
            [ 5964.889892] RBP: ffff8807e5e4bd10 R08: ffff8807e5e4bd57 R09: 0000000000000004^M
            [ 5964.899191] R10: ffff88042e199a80 R11: ffffea000fbc98c0 R12: ffff88042e199a80^M
            [ 5964.908502] R13: 0000000000000018 R14: ffff8807e5e4bc80 R15: 0000000000000202^M
            [ 5964.917766] FS:  0000000000000000(0000) GS:ffff88042e180000(0000) knlGS:0000000000000000^M
            [ 5964.928090] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
            [ 5964.935780] CR2: 00007faa11571a90 CR3: 00000000019ba000 CR4: 00000000000407e0^M
            [ 5964.945021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
            [ 5964.954229] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400^M
            [ 5964.963412] Stack:^M
            [ 5964.966841]  ffff8807e5e4bdf0 ffffffffa13b2dfd ffffffffa13b17d0 ffff88017976346c^M
            [ 5964.976321]  ffffffffa13ad700 ffff880820000000 ffff88042bace000 ffff880179763468^M
            [ 5964.985788]  010000000000000c 0000000000000000 0000000000000000 ffff880179762000^M
            [ 5964.995235] Call Trace:^M
            [ 5964.999098]  [<ffffffffa13b2dfd>] osd_inode_iteration+0x21d/0xd90 [osd_ldiskfs]^M
            [ 5965.008393]  [<ffffffffa13b17d0>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]^M
            [ 5965.017668]  [<ffffffffa13ad700>] ? osd_preload_next+0xc0/0xc0 [osd_ldiskfs]^M
            [ 5965.026648]  [<ffffffffa13b42d0>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]^M
            [ 5965.035419]  [<ffffffff810c5080>] ? wake_up_state+0x20/0x20^M
            [ 5965.042742]  [<ffffffffa13b3970>] ? osd_inode_iteration+0xd90/0xd90 [osd_ldiskfs]^M
            [ 5965.052185]  [<ffffffff810b06ff>] kthread+0xcf/0xe0^M
            [ 5965.058694]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140^M
            [ 5965.067050]  [<ffffffff81696c98>] ret_from_fork+0x58/0x90^M
            [ 5965.074115]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140^M
            [ 5965.082428] Code: 07 0f b7 46 30 c1 e0 10 09 d0 5d c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 8b 8f a8 03 00 00 31 c0 0f b7 56 1c 48 89 e5 <48> 83 39 3f 76 07 0f b7 46 32 c1 e0 10 09 d0 5d c3 66 0f 1f 84 ^M
            [ 5992.664704] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [OI_scrub:12743]^M
            
            

            I am going to attempt to dump stacks and crash dump the node (soak-8)


            cliffw Cliff White (Inactive) added a comment -

            Testing this build on soak: lfsck has been run 3 times, and in all three cases it runs for 600 seconds (our timeout) and is then aborted by the soak framework. This is some progress, as previously lfsck would not abort.


            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26751
            Subject: LU-9040 scrub: detect dead loop scanning
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3f492eaff2f51368f5d2323b1f17e3192014905b


            jamesanunez James Nunez (Inactive) added a comment -

            Fan Yong,

            There are some recent sanity-sec test 9 hangs with a stack trace very similar to the one in this ticket, even though the builds include the patch for this ticket. Would you please review these failures and determine whether they are indeed the same issue as reported in this ticket or new failures?

            If the below logs show that these failures are unrelated to this ticket, then I can open a new one.

            Here are a few of the recent failures:
            2017-04-09 - https://testing.hpdd.intel.com/test_sets/1fb1e996-1d1c-11e7-8920-5254006e85c2
            2017-04-11 - https://testing.hpdd.intel.com/test_sets/da448a92-1ef3-11e7-9de9-5254006e85c2
            2017-04-12 - https://testing.hpdd.intel.com/test_sets/d251faa0-1fab-11e7-b742-5254006e85c2
            2017-04-17 - https://testing.hpdd.intel.com/test_sets/4ddd019c-2362-11e7-8920-5254006e85c2


            People

              Assignee:
              yong.fan nasf (Inactive)
              Reporter:
              cliffw Cliff White (Inactive)
              Votes:
              0
              Watchers:
              10

              Dates

                Created:
                Updated:
                Resolved: