
[LU-10268] rcu_sched self-detected stall in lfsck

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.3
    • Labels: None
    • Environment: toss 3.2-0rc8
      kernel-3.10.0-693.5.2.1chaos.ch6.x86_64
      lustre-2.8.0_13.chaos-1.ch6.x86_64

      See lustre-release-fe-llnl project in gerrit
    • Severity: 3

    Description

      lquake-MDT0001 ran out of space while multiple invocations of "lfs migrate --mdt-index XX" were running in parallel. Space was freed up by deleting snapshots, and then an "lctl lfsck_start --all" was invoked on the node hosting the MGS and MDT0000.
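      For reference, the admin command sequence described above looks roughly like this; the MDT index, directory path, dataset, and snapshot names are illustrative placeholders, not the actual ones used:

      # several parallel directory migrations onto specific MDTs (index and path illustrative)
      lfs migrate --mdt-index 8 /mnt/lquake/some/directory
      # space on the full MDT was recovered by destroying snapshots on its backing pool (names illustrative)
      zfs destroy lquake-mdt1/mdt1@old-snapshot
      # a full LFSCK was then started from the node hosting the MGS and MDT0000
      lctl lfsck_start -M lquake-MDT0000 --all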

      After the layout portion of the lfsck completed and the namespace portion started, we began seeing console messages like this on the node hosting MDT0008:

      INFO: rcu_sched self-detected stall on CPU[ 1678.988863] INFO: rcu_sched detected stalls on CPUs/tasks: { 12} (detected by 2, t=600017 jiffies, g=17401, c=17400, q=850241)
      Task dump for CPU 12:
      lfsck_namespace R  running task        0 36441      2 0x00000088
       0000000000000000 ffff88807ffd8000 0000000000000000 0000000000000002
       ffff88807ffd8008 ffff883f00000141 ffff8840a7003f40 ffff88807ffd7000
       0000000000000010 0000000000000000 fffffffffffffff8 0000000000000001
      Call Trace:
       [<ffffffff8119649f>] ? __alloc_pages_nodemask+0x17f/0x470
       [<ffffffffc030e35d>] ? spl_kmem_alloc_impl+0xcd/0x180 [spl]
       [<ffffffffc030e35d>] ? spl_kmem_alloc_impl+0xcd/0x180 [spl]
       [<ffffffffc0315cb4>] ? xdrmem_dec_bytes+0x64/0xa0 [spl]
       [<ffffffff8119355e>] ? __rmqueue+0xee/0x4a0
       [<ffffffff811ad598>] ? zone_statistics+0x88/0xa0
       [<ffffffff81195e22>] ? get_page_from_freelist+0x502/0xa00
       [<ffffffffc0328a50>] ? nvs_operation+0xf0/0x2e0 [znvpair]
       [<ffffffff816c88d5>] ? mutex_lock+0x25/0x42
       [<ffffffff8119649f>] ? __alloc_pages_nodemask+0x17f/0x470
       [<ffffffff811dd008>] ? alloc_pages_current+0x98/0x110
       [<ffffffffc032afc2>] ? nvlist_lookup_common.part.71+0xa2/0xb0 [znvpair]
       [<ffffffffc032b4b6>] ? nvlist_lookup_byte_array+0x26/0x30 [znvpair]
       [<ffffffffc123d2f3>] ? lfsck_namespace_filter_linkea_entry.isra.64+0x83/0x180 [lfsck]
       [<ffffffffc124f4da>] ? lfsck_namespace_double_scan_one+0x3aa/0x19d0 [lfsck]
       [<ffffffffc08356d6>] ? dbuf_rele+0x36/0x40 [zfs]
       [<ffffffffc11f9c17>] ? osd_index_it_rec+0x1a7/0x240 [osd_zfs]
       [<ffffffffc1250ead>] ? lfsck_namespace_double_scan_one_trace_file+0x3ad/0x830 [lfsck]
       [<ffffffffc1254af5>] ? lfsck_namespace_assistant_handler_p2+0x795/0xa70 [lfsck]
       [<ffffffff811ec173>] ? kfree+0x133/0x170
       [<ffffffffc10283e8>] ? ptlrpc_set_destroy+0x208/0x4f0 [ptlrpc]
       [<ffffffffc1238afe>] ? lfsck_assistant_engine+0x13de/0x21d0 [lfsck]
       [<ffffffff816ca33b>] ? __schedule+0x38b/0x780
       [<ffffffff810c9de0>] ? wake_up_state+0x20/0x20
       [<ffffffffc1237720>] ? lfsck_master_engine+0x1370/0x1370 [lfsck]
       [<ffffffff810b4eef>] ? kthread+0xcf/0xe0
       [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40
       [<ffffffff816d6818>] ? ret_from_fork+0x58/0x90
       [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40
      
      

      Also, the lfsck_namespace process was reported as stuck by the NMI watchdog. The stacks all look like this:

      NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [lfsck_namespace:36441]
      Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) nfsv3 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib iTCO_wdt iTCO_vendor_support ib_core sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass mlx5_core pcspkr devlink joydev i2c_i801 ioatdma lpc_ich zfs(POE) zunicode(POE) zavl(POE) ses icp(POE) enclosure sg shpchp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc zcommon(POE) znvpair(POE) spl(OE) msr_safe(OE) nfsd nfs_acl ip_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic scsi_transport_iscsi dm_multipath mgag200 8021q i2c_algo_bit garp drm_kms_helper stp syscopyarea crct10dif_pclmul llc sysfillrect crct10dif_common mrp crc32_pclmul sysimgblt fb_sys_fops crc32c_intel ttm ghash_clmulni_intel ixgbe(OE) drm ahci mpt3sas aesni_intel mxm_wmi libahci dca lrw gf128mul glue_helper ablk_helper cryptd ptp raid_class libata i2c_core scsi_transport_sas pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
      CPU: 12 PID: 36441 Comm: lfsck_namespace Tainted: P           OEL ------------   3.10.0-693.5.2.1chaos.ch6.x86_64 #1
      Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
      task: ffff883f1eed3f40 ti: ffff883f12b04000 task.ti: ffff883f12b04000
      RIP: 0010:[<ffffffffc123d2f5>]  [<ffffffffc123d2f5>] lfsck_namespace_filter_linkea_entry.isra.64+0x85/0x180 [lfsck]
      RSP: 0018:ffff883f12b07ad0  EFLAGS: 00000246
      RAX: 0000000000000000 RBX: ffffffffc032b4b6 RCX: ffff887f19214971
      RDX: 0000000000000000 RSI: ffff883ef42f1010 RDI: ffff883f12b07ba8
      RBP: ffff883f12b07b18 R08: 0000000000000000 R09: 0000000000000025
      R10: ffff883ef42f1010 R11: 0000000000000000 R12: ffff883f12b07ab4
      R13: ffff883ef42f1040 R14: ffff887f1c31a7e0 R15: ffffffffc1282fa3
      FS:  0000000000000000(0000) GS:ffff887f7df00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffff7ad74f0 CR3: 0000000001a16000 CR4: 00000000001407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
       ffff883f12b07ba8 ffff883ef42f1040 0000000000000001 ffff883f12b07b18
       ffff887f18d35ce8 ffff887f1c31a7e0 ffff883ef42f1000 ffff887f2c933c00
       ffff883ef42f1010 ffff883f12b07c18 ffffffffc124f4da ffffffffc08356d6
      Call Trace:
       [<ffffffffc124f4da>] lfsck_namespace_double_scan_one+0x3aa/0x19d0 [lfsck]
       [<ffffffffc08356d6>] ? dbuf_rele+0x36/0x40 [zfs]
       [<ffffffffc11f9c17>] ? osd_index_it_rec+0x1a7/0x240 [osd_zfs]
       [<ffffffffc1250ead>] lfsck_namespace_double_scan_one_trace_file+0x3ad/0x830 [lfsck]
       [<ffffffffc1254af5>] lfsck_namespace_assistant_handler_p2+0x795/0xa70 [lfsck]
       [<ffffffff811ec173>] ? kfree+0x133/0x170
       [<ffffffffc10283e8>] ? ptlrpc_set_destroy+0x208/0x4f0 [ptlrpc]
       [<ffffffffc1238afe>] lfsck_assistant_engine+0x13de/0x21d0 [lfsck]
       [<ffffffff816ca33b>] ? __schedule+0x38b/0x780
       [<ffffffff810c9de0>] ? wake_up_state+0x20/0x20
       [<ffffffffc1237720>] ? lfsck_master_engine+0x1370/0x1370 [lfsck]
       [<ffffffff810b4eef>] kthread+0xcf/0xe0
       [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40
       [<ffffffff816d6818>] ret_from_fork+0x58/0x90
       [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40
      Code: c7 47 10 00 00 00 00 45 31 e4 45 31 c0 4d 63 ce 66 0f 1f 44 00 00 4d 85 e4 74 41 41 0f b6 1c 24 41 0f b6 44 24 01 c1 e3 08 09 c3 <41> 39 de 41 89 5d 18 74 47 49 8b 4d 08 48 85 c9 0f 84 ad 00 00
      
      


        Activity


          ofaaland Olaf Faaland added a comment -

          I see that you did need to make changes. I'll try with your backport.
          ofaaland Olaf Faaland added a comment -

          I used the original commit from LU-8084 applied to master, https://review.whamcloud.com/#/c/19877/. I didn't see that you had started a backport. Did you find changes were required?
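          For reference, applying a Gerrit change like that to a local checkout looks roughly like this (the trailing patch-set number N is a placeholder):

          git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/77/19877/N
          git cherry-pick FETCH_HEAD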


          yong.fan nasf (Inactive) added a comment -

          Regarding the LU-8084 patch: did you backport it yourself, or did you use the following one directly?
          https://review.whamcloud.com/#/c/30370/
          ofaaland Olaf Faaland added a comment - edited

          I built Lustre with the patch from LU-8084, upgraded the RPMs on the servers, and power cycled them.

          All the servers were started without any mount options. When MDT0008 started, the lfsck and lfsck_namespace threads started. The lfsck_namespace thread was again reported as stuck by the NMI watchdog, with the same stack as originally reported and RIP lfsck_namespace_filter_linkea_entry.isra.64+0x8e as before.

          So, the patch appeared not to make a difference. What next?

          ofaaland Olaf Faaland added a comment -

          I'll build with the LU-8084 patch and see if that allows the lfsck to finish, and report back.


          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30421
          Subject: LU-10268 lfsck: postpone lfsck start until initialized
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set: 1
          Commit: bdc80f63b0e278040de8e2ba82b3f82e640733bf
          ofaaland Olaf Faaland added a comment -

          I made a mistake and need to do the experiment again. Ignore the below comment for now.

          I still don't understand something about the lfsck_namespace thread.

          Yes, I used the "-A" option to "lfsck_start" about 3 weeks ago. That was the last time I ran "lctl lfsck_start".

          Today I:
          1. Powered off all the servers, then powered them all on.
          2. Mounted Lustre on all of them with "mount -t lustre <pool>/<dataset> <mntpoint>", that is, with no mount options
          3. On MDT0008 the lfsck_namespace thread was started and got stuck in the loop scanning a linkEA
          4. I powered off the node running MDT0008 and powered it on again
          5. I mounted MDT0008, this time using the "-o skip_fsck" option
          6. The lfsck_namespace thread was started and got stuck in the loop

          Why didn't the skip_lfsck option prevent lfsck_namespace from continuing today (steps #5 and #6)?

          ofaaland Olaf Faaland added a comment -

          No, we do not have the LU-8084 patch (https://review.whamcloud.com/#/c/19877/) in our patch stack.

          I still don't understand something about the lfsck_namespace thread.

          Yes, I used the "-A" option to "lfsck_start" about 3 weeks ago. That was the last time I ran "lctl lfsck_start".

          Today I:
          1. Powered off all the servers, then powered them all on.
          2. Mounted Lustre on all of them with "mount -t lustre <pool>/<dataset> <mntpoint>", that is, with no mount options
          3. On MDT0008 the lfsck_namespace thread was started and got stuck in the loop scanning a linkEA
          4. I powered off the node running MDT0008 and powered it on again
          5. I mounted MDT0008, this time using the "-o skip_fsck" option
          6. The lfsck_namespace thread was started and got stuck in the loop

          Why didn't the skip_lfsck option prevent lfsck_namespace from continuing today (steps #5 and #6)?

          thanks!
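          For reference, the mount option and a status check for the namespace LFSCK look roughly like this; the dataset, mount point, and target names are illustrative:

          # mount the MDT with automatic LFSCK resume disabled
          mount -t lustre -o skip_lfsck lquake-mdt8/mdt8 /mnt/lustre/mdt8
          # check whether the namespace LFSCK is still running on this target
          lctl get_param -n mdd.lquake-MDT0008.lfsck_namespace
          # stop it explicitly if it resumed anyway
          lctl lfsck_stop -M lquake-MDT0008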


          yong.fan nasf (Inactive) added a comment -

          I said your case may be #2, not #3. That is why your earlier "-o skip_lfsck" mount did not work.

          As for the original RCU symptoms, it is hard to locate the root cause from the stack traces alone. It appears that the namespace LFSCK dropped into a loop while scanning some file's linkEA; the most likely cause is that the linkEA is corrupted. Do you have the LU-8084 patch (https://review.whamcloud.com/#/c/19877/) applied on your system?
          ofaaland Olaf Faaland added a comment -

          Thanks for the updates!

          I don't believe my observations fit with your description of namespace scan trigger method #3, but I will retest in case I made an error. I'll update the ticket about that tomorrow.

          Have you been able to learn anything about what could have led to the RCU symptoms?


          yong.fan nasf (Inactive) added a comment -

          > I now see MDT0001 working really hard in __mdd_orphan_cleanup(). This doesn't appear to be hurting anything, but is it a side effect of the lfsck?

          For an ldiskfs backend, during the second-phase scanning the namespace LFSCK scans the backend "/lost+found" directory and tries to recover ldiskfs orphans back into the normal namespace. The logic in __mdd_orphan_cleanup(), on the other hand, handles Lustre orphans under the "/PENDING" directory. I do not think the two parts affect each other; in particular, since your system uses a ZFS backend, LFSCK should not affect the Lustre orphan handling.

          If there are too many Lustre orphans under the "/PENDING" directory, then __mdd_orphan_cleanup() may be very slow. You can check that by mounting the backend as "zfs" directly.
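          A rough sketch of that check (the dataset and mount point names are placeholders; mount read-only and only when it is safe to do so):

          mount -t zfs -o ro lquake-mdt1/mdt1 /mnt/mdt1-backend
          ls /mnt/mdt1-backend/PENDING | wc -l    # number of Lustre orphans awaiting cleanup
          umount /mnt/mdt1-backend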

    People

      Assignee: yong.fan nasf (Inactive)
      Reporter: ofaaland Olaf Faaland
      Votes: 0
      Watchers: 4
