Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
    • Environment: Soak test cluster. Lustre 2.9.0 GA Release
    • Severity: 3

    Description

      This looks like several bugs we have seen before, but it does not match any specific one closely enough.

      • Started soak
      • Started and completed router delay fault
      • LFSCK triggered.
      • LFSCK does not complete, running for > 1905s.
        MGS/MDS0 hits a soft lockup on a CPU:
        Jan 19 14:06:24 lola-8 kernel: BUG: soft lockup - CPU#9 stuck for 67s! [OI_scrub:19279]
        Jan 19 14:06:24 lola-8 kernel: Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
        Jan 19 14:06:24 lola-8 kernel: CPU 9
        Jan 19 14:06:24 lola-8 kernel: Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
        Jan 19 14:06:24 lola-8 kernel:
        Jan 19 14:06:24 lola-8 kernel: Pid: 19279, comm: OI_scrub Tainted: P           --L------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
        Jan 19 14:06:24 lola-8 kernel: RIP: 0010:[<ffffffffa10d1e2f>]  [<ffffffffa10d1e2f>] osd_scrub_exec+0xf/0x1dc0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: RSP: 0018:ffff88049472bc20  EFLAGS: 00000282
        Jan 19 14:06:24 lola-8 kernel: RAX: 0000000000000004 RBX: ffff88049472bd00 RCX: ffff8803ac48ced0
        Jan 19 14:06:24 lola-8 kernel: RDX: ffff88049472bda0 RSI: ffff880216078000 RDI: ffff88038a74d000
        Jan 19 14:06:24 lola-8 kernel: RBP: ffffffff8100bc0e R08: ffff88049472bdcf R09: 0000000000000004
        Jan 19 14:06:24 lola-8 kernel: R10: ffff8803ac48cec0 R11: ffff8803ac48ced0 R12: ffffffff81fe6920
        Jan 19 14:06:24 lola-8 kernel: R13: ffff8802a5151210 R14: ffff88038a74d000 R15: ffffffffa08549a8
        Jan 19 14:06:24 lola-8 kernel: FS:  0000000000000000(0000) GS:ffff88044e420000(0000) knlGS:0000000000000000
        Jan 19 14:06:24 lola-8 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
        Jan 19 14:06:24 lola-8 kernel: CR2: 0000003903673b90 CR3: 0000000001a8d000 CR4: 00000000000407e0
        Jan 19 14:06:24 lola-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        Jan 19 14:06:24 lola-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Jan 19 14:06:24 lola-8 kernel: Process OI_scrub (pid: 19279, threadinfo ffff880494728000, task ffff880802c30ab0)
        Jan 19 14:06:24 lola-8 kernel: Stack:
        Jan 19 14:06:24 lola-8 kernel: ffff880216078000 ffff88038a74d000 ffffffffffffff02 ffffffffa10d3c9a
        Jan 19 14:06:24 lola-8 kernel: <d> 0000000000000010 0000000000000202 ffff88049472bc60 0000000000000018
        Jan 19 14:06:24 lola-8 kernel: <d> ffff8803ac48cee0 0000000000000001 ffff8802160790e0 ffff880216079000
        Jan 19 14:06:24 lola-8 kernel: Call Trace:
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3c9a>] ? osd_scrub_next+0xba/0x4b0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d6ec9>] ? osd_inode_iteration+0x4c9/0xd80 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d1e20>] ? osd_scrub_exec+0x0/0x1dc0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3be0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8905>] ? osd_scrub_main+0x885/0xec0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
        Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8080>] ? osd_scrub_main+0x0/0xec0 [osd_ldiskfs]
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
        Jan 19 14:06:24 lola-8 kernel: Code: 00 00 00 e8 54 fe 7b ff e9 14 ff ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 81 ec e0 00 00 00 48 89 5d d8 <4c> 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 0f 1f 44 00 00
        

        Attempting to get a stack dump, but it is not going well. The system is available if anyone wishes to have a look.
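
        The "soft lockup" above means the OI_scrub kernel thread kept running on CPU#9 for tens of seconds without ever scheduling, so the watchdog fired; the trace shows it spinning through osd_scrub_exec()/osd_scrub_next()/osd_inode_iteration()/osd_scrub_main(). As a rough illustration only (hypothetical names, not the actual Lustre scrub code), the usual way a long-running kernel scan loop avoids tripping the watchdog is to yield periodically with cond_resched():

        /* Illustrative sketch only, NOT the Lustre OI scrub implementation.
         * A long-running scan thread must yield the CPU periodically or the
         * watchdog reports "BUG: soft lockup - CPU#N stuck for NNs".
         * scan_state, scan_one_step and scan_thread_main are invented names. */
        #include <linux/kthread.h>
        #include <linux/sched.h>

        struct scan_state {
                unsigned long pos;      /* e.g. current inode number */
                unsigned long end;      /* last position to visit */
        };

        /* Stub "do one unit of work"; returns 0 while there is more to scan. */
        static int scan_one_step(struct scan_state *ss)
        {
                return (ss->pos++ < ss->end) ? 0 : -1;
        }

        static int scan_thread_main(void *data)
        {
                struct scan_state *ss = data;

                while (!kthread_should_stop()) {
                        if (scan_one_step(ss) < 0)
                                break;

                        /* Let other tasks run; omitting this in a tight loop
                         * is what trips the soft-lockup watchdog. */
                        cond_resched();
                }
                return 0;
        }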

      Attachments

        1. lfsck_namespace.txt
          2 kB
        2. lola-7.console.20170208.txt.gz
          6.29 MB
        3. lola7.dmesg
          238 kB
        4. lola-8.lfsck.hang.stacks.txt
          1.13 MB
        5. lu-9040-soak-8.console.log.txt.gz
          218 kB
        6. lustre-log.1486672102.6086.txt.gz
          11.50 MB
        7. vmcore-dmesg.txt
          512 kB

        Issue Links

          Activity

            [LU-9040] Soft lockup on CPU during lfsck
            pjones Peter Jones added a comment -

            Landed for 2.10


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26751/
            Subject: LU-9040 scrub: detect dead loop scanning
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ad75291c70e67acb6fd607e373be045e9d16d3fc
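
            The patch subject, "detect dead loop scanning", suggests the failure mode is the scrub iterator ceasing to make forward progress and spinning on the same position. A minimal, hypothetical sketch of that detection idea (not the actual patch; all names are invented) is to remember the previous scan position and abort once it stops advancing for too many iterations:

            /* Hypothetical sketch of dead-loop detection in a cursor-based scan.
             * This is NOT the actual LU-9040 patch; it only illustrates the
             * concept named in the patch subject. */
            #include <stdint.h>
            #include <stdio.h>

            #define MAX_STUCK 1000  /* iterations tolerated without forward progress */

            struct scanner {
                    uint64_t pos;       /* current position, e.g. an inode number */
                    uint64_t last_pos;  /* position at the previous check */
                    unsigned stuck;     /* consecutive checks without progress */
            };

            /* Returns 0 to keep scanning, -1 if a dead loop is suspected. */
            static int check_progress(struct scanner *s)
            {
                    if (s->pos <= s->last_pos) {
                            if (++s->stuck > MAX_STUCK) {
                                    fprintf(stderr, "scan stuck at %llu, aborting\n",
                                            (unsigned long long)s->pos);
                                    return -1;
                            }
                    } else {
                            s->stuck = 0;
                            s->last_pos = s->pos;
                    }
                    return 0;
            }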

            yong.fan nasf (Inactive) added a comment - - edited

            The patch https://review.whamcloud.com/#/c/26751/ has been updated with additional fixes.
            Cliff, would you please try it again? Thanks!


            cliffw Cliff White (Inactive) added a comment -

            Created a build with patches https://review.whamcloud.com/#/c/26751/ and https://review.whamcloud.com/#/c/26893/

            • Fails immediately on mount of the MDT.
              soak-8 is currently spinning very hard and is available for examination:
              [ 6560.530576] CPU: 13 PID: 12279 Comm: OI_scrub Tainted: P           OEL ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
              [ 6560.543976] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
              [ 6560.557569] task: ffff8803fca11f60 ti: ffff8803fca24000 task.ti: ffff8803fca24000
              [ 6560.566978] RIP: 0010:[<ffffffffa13a5be9>]  [<ffffffffa13a5be9>] osd_scrub_next+0x9/0x3c0 [osd_ldiskfs]
              [ 6560.578543] RSP: 0018:ffff8803fca27d08  EFLAGS: 00000287
              [ 6560.585506] RAX: ffffffffa13a5be0 RBX: 0000000000000001 RCX: ffff8803fca27d98
              [ 6560.594492] RDX: ffff8803fca27d78 RSI: ffff88082a370000 RDI: ffff88082da75000
              [ 6560.603462] RBP: ffff8803fca27d08 R08: 0000000000000001 R09: 0000000000000004
              [ 6560.612418] R10: 000000002a2da701 R11: ffffea0020a8b680 R12: 000000002a2da701
              [ 6560.621356] R13: ffffea0020a8b680 R14: ffff8803fca27c70 R15: ffff8803fca27d08
              [ 6560.630283] FS:  0000000000000000(0000) GS:ffff88082d940000(0000) knlGS:0000000000000000
              [ 6560.640284] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
              [ 6560.647665] CR2: 00007f6a3b9ab000 CR3: 00000000019be000 CR4: 00000000000407e0
              [ 6560.656614] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
              [ 6560.665553] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
              [ 6560.674488] Stack:
              [ 6560.677690]  ffff8803fca27df0 ffffffffa13ab1c9 ffffffffa13a9b50 ffffffffa13a5be0
              [ 6560.686960]  ffff88082a370000 00000000810d354f ffff88082da75000 ffff88082a371468
              [ 6560.696224]  0000000020000000 0100000000200163 0000000000000000 0000000000000000
              [ 6560.705485] Call Trace:
              [ 6560.709202]  [<ffffffffa13ab1c9>] osd_inode_iteration+0x2a9/0xcc0 [osd_ldiskfs]
              [ 6560.718370]  [<ffffffffa13a9b50>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
              [ 6560.727525]  [<ffffffffa13a5be0>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]
              [ 6560.736388]  [<ffffffffa13ac540>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
              [ 6560.745052]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
              [ 6560.752248]  [<ffffffffa13abbe0>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]
              [ 6560.761582]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
              [ 6560.768004]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
              [ 6560.776269]  [<ffffffff81697318>] ret_from_fork+0x58/0x90
              [ 6560.783271]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
              [ 6560.791526] Code: 48 8d 4c 02 10 48 01 c2 e8 25 fc ff ff 48 83 c4 08 5b 41 5c 41 5d 41 5e 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 <41> 57 49 89 f7 41 56 41 55 49 89 fd 41 54 49 89 d4 53 48 89 cb
              [ 6588.383881] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [OI_scrub:12279]
              

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/26922
            Subject: LU-9040 scrub: detect dead loop scanning
            Project: fs/lustre-release
            Branch: pfl
            Current Patch Set: 1
            Commit: 533dd2f253c81e8061204da5c785dd9a27063108


            yong.fan nasf (Inactive) added a comment -

            Cliff, thanks for updating the ticket. I noticed that the CPU soft lockup has come back. I have made another patch https://review.whamcloud.com/26751 for that. Would you please verify it on the lola test cluster? Thanks!


            People

              Assignee:
              yong.fan nasf (Inactive)
              Reporter:
              cliffw Cliff White (Inactive)
              Votes:
              0
              Watchers:
              10

              Dates

                Created:
                Updated:
                Resolved: