Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.10.2
    • None
    • 3
    • 9223372036854775807

    Description

      We get quite a few soft lockups on our Lustre gateways (Lustre clients that export Lustre filesystems over NFS). Example:

      Nov 13 00:26:06 foxtrot2 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [nfsd:11973]
      Nov 13 00:26:06 foxtrot2 kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [rsync:36079]
      Nov 13 00:26:06 foxtrot2 kernel: Modules linked in: vfat fat dm_service_time mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase nfsv3 nfs fscache osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) dell_rbu libcfs(OE) bonding sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt iTCO_vendor_support kvm joydev dcdbas irqbypass sg shpchp ipmi_si ipmi_devintf ipmi_msghandler lpc_ich mei_me mei acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp stp llc mrp mgag200 i2c_algo_bit drm_kms_helper scsi_transport_iscsi bnx2x syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ahci drm ghash_clmulni_intel
      Nov 13 00:26:06 foxtrot2 kernel: libahci aesni_intel dm_multipath libata lrw gf128mul glue_helper ablk_helper cryptd megaraid_sas i2c_core ptp pps_core mdio libcrc32c wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: usb_storage]
      Nov 13 00:26:06 foxtrot2 kernel: CPU: 1 PID: 36079 Comm: rsync Tainted: G W OE ------------ 3.10.0-693.5.2.el7_lustre.x86_64 #1
      Nov 13 00:26:06 foxtrot2 kernel: Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.5.4 01/22/2016
      Nov 13 00:26:06 foxtrot2 kernel: task: ffff883ff8a04f10 ti: ffff8815a1200000 task.ti: ffff8815a1200000
      Nov 13 00:26:06 foxtrot2 kernel: RIP: 0010:[<ffffffff810fa332>] [<ffffffff810fa332>] native_queued_spin_lock_slowpath+0x112/0x1e0
      Nov 13 00:26:06 foxtrot2 kernel: RSP: 0018:ffff8815a1203700 EFLAGS: 00000246
      Nov 13 00:26:06 foxtrot2 kernel: RAX: 0000000000000000 RBX: ffff883fff017880 RCX: 0000000000090000
      Nov 13 00:26:06 foxtrot2 kernel: RDX: ffff883fff4d7880 RSI: 0000000001390101 RDI: ffff881ff99da818
      Nov 13 00:26:06 foxtrot2 kernel: RBP: ffff8815a1203700 R08: ffff883fff017880 R09: 0000000000000000
      Nov 13 00:26:06 foxtrot2 kernel: R10: 0004c5dab524ba0b R11: 0000000000000000 R12: 0004c5dab524ba0b
      Nov 13 00:26:06 foxtrot2 kernel: R13: 0000000000000000 R14: 0004c5dab39dc857 R15: ffff8815a12036e8
      Nov 13 00:26:06 foxtrot2 kernel: FS: 00007f0ff1094740(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
      Nov 13 00:26:06 foxtrot2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Nov 13 00:26:06 foxtrot2 kernel: CR2: 00007fd6cb1e9000 CR3: 000000163eff9000 CR4: 00000000001407e0
      Nov 13 00:26:06 foxtrot2 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      Nov 13 00:26:06 foxtrot2 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Nov 13 00:26:06 foxtrot2 kernel: Stack:
      Nov 13 00:26:06 foxtrot2 kernel: ffff8815a1203710 ffffffff8169e6bf ffff8815a1203720 ffffffff816abbf0
      Nov 13 00:26:06 foxtrot2 kernel: ffff8815a12037a0 ffffffffc0c2d421 ffff8815a12037e0 ffffffffc0c2ba60
      Nov 13 00:26:06 foxtrot2 kernel: 0000000000000000 00000161000ab602 0004c5dab524ba0b ffff88130fb65c00
      Nov 13 00:26:06 foxtrot2 kernel: Call Trace:
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff8169e6bf>] queued_spin_lock_slowpath+0xb/0xf
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff816abbf0>] _raw_spin_lock+0x20/0x30
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c2d421>] ldlm_prepare_lru_list+0x361/0x4e0 [ptlrpc]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c2ba60>] ? ldlm_cancel_aged_no_wait_policy+0x70/0x70 [ptlrpc]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c30c5a>] ldlm_cancel_lru_local+0x1a/0x30 [ptlrpc]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c30e8e>] ldlm_prep_elc_req+0x21e/0x490 [ptlrpc]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c31128>] ldlm_prep_enqueue_req+0x28/0x30 [ptlrpc]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc07c67a3>] mdc_intent_getattr_pack.isra.15+0x93/0x280 [mdc]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc07c8f3b>] mdc_enqueue_base+0x9fb/0x18f0 [mdc]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff810c45a3>] ? try_to_wake_up+0x183/0x340
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff810ba598>] ? __wake_up_common+0x58/0x90
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc07ca6cb>] mdc_intent_lock+0x26b/0x520 [mdc]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c66243>] ? reply_in_callback+0x143/0x5e0 [ptlrpc]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0972e30>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0c2c7a0>] ? ldlm_expired_completion_wait+0x240/0x240 [ptlrpc]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0910e4f>] lmv_intent_lock+0x5cf/0x1b50 [lmv]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff810b8a01>] ? in_group_p+0x31/0x40
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc09738c5>] ? ll_i2suppgid+0x15/0x40 [lustre]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0973914>] ? ll_i2gids+0x24/0xb0 [lustre]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff81114b02>] ? from_kgid+0x12/0x20
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0972e30>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0974feb>] ll_lookup_it+0x29b/0xee0 [lustre]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff810c8f28>] ? __enqueue_entity+0x78/0x80
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffffc0976fbb>] ll_lookup_nd+0xbb/0x190 [lustre]
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff8120b3dd>] lookup_real+0x1d/0x50
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff8120bcb2>] __lookup_hash+0x42/0x60
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff816a13e2>] lookup_slow+0x42/0xa7
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff8120f25b>] path_lookupat+0x77b/0x7b0
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff811df623>] ? kmem_cache_alloc+0x193/0x1e0
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff81211c9f>] ? getname_flags+0x4f/0x1a0
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff8120f2bb>] filename_lookup+0x2b/0xc0
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff81212e37>] user_path_at_empty+0x67/0xc0
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff81212ea1>] user_path_at+0x11/0x20
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff812063e3>] vfs_fstatat+0x63/0xc0
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff812069b1>] SYSC_newlstat+0x31/0x60
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff81206c3e>] SyS_newlstat+0xe/0x10
      Nov 13 00:26:06 foxtrot2 kernel: [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
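
To gauge how often the watchdog fires and which processes are affected, the reports can be counted per process name. The following is a minimal sketch, assuming a syslog-style file whose soft-lockup lines end in `[name:pid]` as in the example above; `lockup_summary` is a hypothetical helper name, and the log path in the usage comment is an assumption about where kernel messages are written.

```shell
# Count "soft lockup" reports per process name in a syslog-style file.
# Hypothetical helper; it only reads the file given as its first argument.
lockup_summary() {
    grep 'BUG: soft lockup' "$1" |
        # Keep only the process name from the trailing "[name:pid]" tag.
        sed -E 's/.*\[([^]:]+):[0-9]+\].*/\1/' |
        sort | uniq -c | sort -rn
}

# Usage (path is an assumption; adjust to your syslog configuration):
#   lockup_summary /var/log/messages
```

Each output line is a count followed by a process name, with the most frequently reported process first.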

       

          Activity

            [LU-11693] Soft lockups on Lustre clients

            cmcl Campbell Mcleay (Inactive) added a comment -

            Just some feedback: we got some soft lockups on one of our clients, but it only happened once. The other clients have been fine.
            pjones Peter Jones added a comment -

            Campbell

            Even the servers only need to be patched if you are using the project quotas feature. The patches that gave performance improvements in past versions have now been upstreamed and many customers prefer the simplified admin over project quotas...

            Peter

            yujian Jian Yu added a comment -

            Hi Campbell,
            The Lustre client is patchless, which means that building the Lustre code for a client does not require patching the vendor or vanilla Linux kernel. All of the regression testing was performed on patchless Lustre clients, so we suggest using the vendor kernel.


            cmcl Campbell Mcleay (Inactive) added a comment -

            I've built the rpms fine, but I have another question. The client has the lustre kernel package installed (I am told it was installed because the lustre kernel performs better than a vanilla kernel), and that package provides the fs and net kernel modules. The kmod-lustre-client package also provides the kernel modules, but it installs them in /lib/modules/`uname -r`/extra/lustre-client rather than /lib/modules/`uname -r`/extra/lustre. Will this cause any kind of issue if both are installed, or is it better to install, e.g., a vanilla kernel and rebuild the packages against that?

            Thanks,

            Campbell
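
One way to check whether two packages have installed the same module in different directories is to list every copy under the running kernel's module tree. This is a sketch only and does not answer whether the two sets conflict; `find_module_copies` is a hypothetical helper name, and `ptlrpc` is used as the example module simply because it appears in the trace above.

```shell
# List every copy of a kernel module (compressed or not) under a module tree.
# Hypothetical helper: $1 is the tree to search, $2 is the module name.
find_module_copies() {
    find "$1" -type f -name "$2.ko*" | sort
}

# Usage:
#   find_module_copies "/lib/modules/$(uname -r)" ptlrpc
#   modinfo -n ptlrpc    # prints the file that modprobe would actually load
```

If more than one path is printed, comparing it with the `modinfo -n` result shows which copy takes effect.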
            pjones Peter Jones added a comment -

            Glad to hear that you've got this sorted out. Let us know whether the fix works as expected.


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi Andreas,

            I cloned the lustre repo and then checked out the b2_10 branch. I then ran autogen, copied the spec file to my rpmbuild tree, tarred up the source and copied it to rpmbuild/SOURCES. I was expecting b2_10 to already be patched, but a comparison showed it hadn't been. I created a patch file from a recursive diff and then modified the spec file to apply that patch. I then built a source rpm and tried an rpm rebuild. I was getting build errors, e.g.,

             /u/cmcl/rpmbuild/BUILD/lustre-2.10.2/lustre/include/lustre_lib.h:357:9: error: implicit declaration of function 'is_bl_done' [-Werror=implicit-function-declaration]
             struct l_wait_info *__info = (info); \
             ^
            /u/cmcl/rpmbuild/BUILD/lustre-2.10.2/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.c:2330:3: note: in expansion of macro 'l_wait_event'
             l_wait_event(lock->l_waitq, is_bl_done(lock), &lwi)

            I'm doing something wrong and/or in an overly complicated way. I thought the b2_10 branch would have already been patched.
            I'd seen the whamcloud wiki page you'd mentioned but thought that was for the server rather than the client. I hadn't seen wiki.lustre.org.
            Anyway, I found a build on https://review.whamcloud.com/33798 linked by Jian which has the patches in it, so I'll build from that. Sorry for wasting your time with this, but hopefully I'll be on the right track from here on in.

            Cheers,

            Campbell

            adilger Andreas Dilger added a comment -

            Campbell, what process are you using to build, and what files are "unpatched"? I'd recommend following, e.g., https://wiki.whamcloud.com/pages/viewpage.action?pageId=52104622 or http://wiki.lustre.org/Compiling_Lustre if you've never done this before. At its simplest, doing "sh autogen.sh; ./configure; make rpms" is all that is needed once you have the kernel source RPMs, but it can become more complex if you are using OFED, ZFS, etc.

            As Jian wrote, it is a lot easier to use a pre-built package if that has the features you need.
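
The three-step build quoted above can be wrapped so that it stops at the first failing step instead of continuing. This is a sketch under stated assumptions: `build_lustre_rpms` is a hypothetical helper name, the source path in the usage comment is a placeholder, and `--disable-server` is shown only as an example of passing extra options through to `./configure` for a client-only build.

```shell
# Run the autogen / configure / make rpms sequence in a Lustre source tree.
# Hypothetical helper: $1 is the source directory; any remaining arguments
# are passed to ./configure. Uses && so a failed step aborts the rest.
build_lustre_rpms() {
    src=$1
    shift
    ( cd "$src" && sh autogen.sh && ./configure "$@" && make rpms )
}

# Usage (path is a placeholder):
#   build_lustre_rpms "$HOME/lustre-release"
#   build_lustre_rpms "$HOME/lustre-release" --disable-server
```

The subshell keeps the caller's working directory unchanged, and the function's exit status is that of the first step that failed.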
            yujian Jian Yu added a comment -

            Hi Campbell,
            Build https://build.whamcloud.com/job/lustre-reviews/60480/ in https://review.whamcloud.com/33798 is ready. It contains both the patches for LU-11693/LU-9230 and LU-11692/LU-11647 applied on the tip of Lustre b2_10 branch (tag 2.10.6-RC3).


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi Andreas,

            I'm doing something wrong here: I cloned git://git.whamcloud.com/fs/lustre-release.git and checked out the b2_10 branch, but the files are unpatched and I'm not quite sure how to add that patch via git. I can't find it to cherry-pick it. Or can I just add the patches manually via diff and patch? I was doing it this way before, but the build fails (whereas an unpatched tree compiles fine). Sorry for my ignorance here.

            regards,

            Campbell
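
A change that is still under review does not appear on the branch itself. On a standard Gerrit server each patch set is published under a ref of the form `refs/changes/<last two digits of change>/<change>/<patch set>`, which can be fetched directly. The sketch below only builds that ref name; `gerrit_ref` is a hypothetical helper, and the remote URL and patch set number in the usage comments are placeholders that do not come from this thread.

```shell
# Print the Gerrit ref for a change number ($1) and patch set number ($2).
# Hypothetical helper; follows the standard refs/changes/NN/CHANGE/PATCHSET layout.
gerrit_ref() {
    printf 'refs/changes/%02d/%s/%s\n' "$(( $1 % 100 ))" "$1" "$2"
}

# Usage (<remote> and the patch set number are placeholders):
#   git ls-remote <remote> 'refs/changes/98/33798/*'   # list available patch sets
#   git fetch <remote> "$(gerrit_ref 33798 1)"
#   git checkout FETCH_HEAD      # the change together with its parent commits
#   git cherry-pick FETCH_HEAD   # or: apply only that commit to the current branch
```

Checking out `FETCH_HEAD` is usually the simpler route when a change was stacked on another unmerged change, because the parent commits come along with it.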

            adilger Andreas Dilger added a comment -

            Campbell, I'm not sure what build problem you are seeing (we build this branch daily), but I've cherry-picked the LU-11692 patch on top of 33130. It looks like the builders are a bit backed up, but there should be a link to a build reported in https://review.whamcloud.com/33798 in a couple of hours. Feel free to attach your build logs here, in case it is a trivial problem to fix.

            People

              yujian Jian Yu
              cmcl Campbell Mcleay (Inactive)
              Votes: 0
              Watchers: 5