[LU-11693] Soft lockups on Lustre clients Created: 22/Nov/18 Updated: 28/Feb/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Campbell Mcleay (Inactive) | Assignee: | Jian Yu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Attachments: | build.log |
| Issue Links: | |
| Severity: | 3 |
| Description |
|
We get quite a few soft lockups on our Lustre gateways (Lustre clients that export Lustre filesystems over NFS). Example:

  Nov 13 00:26:06 foxtrot2 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [nfsd:11973]
|
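For anyone triaging a report like this, the watchdog's backtrace identifies the code path the CPU is spinning in. A minimal sketch of how to gather more context with standard kernel facilities (nothing Lustre-specific; assumes root and sysrq enabled):

  # Threshold (in seconds) that governs when the soft-lockup BUG fires:
  sysctl kernel.watchdog_thresh

  # Dump backtraces into dmesg to include in the bug report:
  echo l > /proc/sysrq-trigger    # backtrace of all active CPUs
  echo w > /proc/sysrq-trigger    # tasks stuck in uninterruptible sleep
  dmesg | tail -n 200             # collect the resulting console output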
| Comments |
| Comment by Campbell Mcleay (Inactive) [ 22/Nov/18 ] |
|
Versions are: lustre-2.10.2-1.el7 kernel-3.10.0-693.5.2.el7_lustre |
| Comment by Andreas Dilger [ 22/Nov/18 ] |
|
This looks like a duplicate of a previously-reported issue. Please try:

  # lctl set_param ldlm.namespaces.*.lru_size=10000

on these clients to see if this avoids the issue. |
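For context, the same parameter can be inspected per lock namespace; a quick sketch (the namespace names are illustrative and follow the <fsname>-<target>-<type> pattern seen later in this ticket):

  # Show the current LRU cap for every lock namespace on this client:
  lctl get_param ldlm.namespaces.*.lru_size

  # Cap all namespaces (MGC, MDC, and each OSC) at 10000 cached locks:
  lctl set_param ldlm.namespaces.*.lru_size=10000

  # Note: lru_size=0 switches a namespace back to dynamic LRU sizing.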
| Comment by Campbell Mcleay (Inactive) [ 22/Nov/18 ] |
|
Hi Andreas, I think we set this already. When I run:

  lctl get_param 'ldlm.namespaces.*.lru_size'

I get:

  ldlm.namespaces.MGC10.21.22.10@tcp.lru_size=10000

where 10.21.22.10 is our MDS. Thanks, Campbell |
| Comment by Campbell Mcleay (Inactive) [ 22/Nov/18 ] |
|
I should mention that it also prints values for all of the OSTs and the MDT, which are larger than this value, e.g.:

  ldlm.namespaces.foxtrot-MDT0000-mdc-ffff883ff9b89000.lru_size=246531

in case that is important. |
| Comment by Andreas Dilger [ 22/Nov/18 ] |
|
The patch https://review.whamcloud.com/33130 has been landed on the master branch (for 2.12) for over 6 months and has already seen a lot of testing. It is in the process of landing on the b2_10 branch for the next 2.10.x release, so there is not yet a release package available with this patch included. It is a client-only patch, so it could be installed on the affected nodes without taking down the whole system. |
| Comment by Campbell Mcleay (Inactive) [ 05/Dec/18 ] |
|
Thanks Andreas. I was looking for a compatibility matrix to see whether 2.12 on the client is compatible with 2.10 on the server. Is there something available online that shows compatibility of releases? regards, Campbell |
| Comment by Campbell Mcleay (Inactive) [ 05/Dec/18 ] |
|
Actually, I see that 2.12 is not listed as supported by Whamcloud. I'll patch it then. |
| Comment by Peter Jones [ 05/Dec/18 ] |
|
Campbell, 2.12 is very close to release - we tagged the first RC yesterday. So, upon GA, an alternative to patching would be to use a 2.12 client, as it interoperates fine with 2.10.x servers. Peter |
| Comment by Campbell Mcleay (Inactive) [ 05/Dec/18 ] |
|
Thanks Peter. I applied the patches for both the crashes and the lockups to the 2.10.2-1 source, and it fails to build. Can you tell me what I need to do here? Attached is the build log: build.log |
| Comment by Peter Jones [ 05/Dec/18 ] |
|
Jian, could you please assist Campbell in porting the fix? Thanks, Peter |
| Comment by Campbell Mcleay (Inactive) [ 05/Dec/18 ] |
|
I'm also happy to apply the patches to a later supported release if that is easier... |
| Comment by Peter Jones [ 05/Dec/18 ] |
|
Jian, the port to b2_10 already exists - https://review.whamcloud.com/#/c/33130/ - but does it need refreshing to apply to the tip of b2_10, or to the 2.10.5 release? Peter |
| Comment by Jian Yu [ 05/Dec/18 ] |
|
Hi Peter, |
| Comment by Jian Yu [ 05/Dec/18 ] |
|
Hi Campbell, |
| Comment by Campbell Mcleay (Inactive) [ 06/Dec/18 ] |
|
Thanks Jian. I still have to add a patch for a kernel panic issue. -Campbell |
| Comment by Campbell Mcleay (Inactive) [ 06/Dec/18 ] |
|
2.10.5 fails to build. Should I send the build log? |
| Comment by Andreas Dilger [ 06/Dec/18 ] |
|
Campbell, I'm not sure what build problem you are seeing (we build this branch daily), but I've cherry-picked the patch. |
| Comment by Campbell Mcleay (Inactive) [ 06/Dec/18 ] |
|
Hi Andreas, I'm doing something wrong here: I cloned git://git.whamcloud.com/fs/lustre-release.git and checked out the b2_10 branch, but the files are unpatched, and I'm not quite sure how to apply that patch via git; I can't find it to cherry-pick it. Or can I just apply the patches manually via diff and patch? I was doing it that way before, but the build fails (whereas an unpatched tree compiles fine). Sorry for my ignorance here. regards, Campbell
|
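For reference, the usual way to pull a single Gerrit change into a local clone is to fetch its change ref and cherry-pick it. A sketch, assuming the patchset number (the trailing /6 below) is taken from the download links on the review page rather than guessed:

  # In a clone of git://git.whamcloud.com/fs/lustre-release.git:
  git checkout b2_10

  # Gerrit change refs are refs/changes/<last 2 digits>/<change>/<patchset>:
  git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/30/33130/6
  git cherry-pick FETCH_HEAD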
| Comment by Jian Yu [ 06/Dec/18 ] |
|
Hi Campbell, |
| Comment by Andreas Dilger [ 07/Dec/18 ] |
|
Campbell, what process are you using to build, and what files are "unpatched"? I'd recommend following e.g. https://wiki.whamcloud.com/pages/viewpage.action?pageId=52104622 or http://wiki.lustre.org/Compiling_Lustre if you've never done this before. At its simplest, "sh autogen.sh; ./configure; make rpms" is all that is needed once you have the kernel source RPMs, but it can become more complex if you are using OFED, ZFS, etc. As Jian wrote, it is a lot easier to use a pre-built package if that has the features you need. |
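Expanded into separate steps, that sequence looks roughly like the following on an EL7 client (the --disable-server flag is an assumption based on this being a client-only gateway; kernel-devel and the usual rpm-build tooling must already be installed):

  # From the top of a lustre-release checkout on the desired branch:
  sh autogen.sh

  # Client-only configure; add --with-linux=/path/to/kernel/source if
  # building against a kernel other than the running one:
  ./configure --disable-server

  # Produce the lustre-client and kmod-lustre-client RPMs:
  make rpms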
| Comment by Campbell Mcleay (Inactive) [ 07/Dec/18 ] |
|
Hi Andreas, I cloned the lustre repo and then checked out the b2_10 branch. I then ran autogen, copied the spec file to my rpmbuild tree, tarred the source up, and copied it to rpmbuild/SOURCES. I was expecting b2_10 to already be patched, but a comparison showed it hadn't been. I created a patch file from a recursive diff and then modified the spec file to apply that patch. I then built a source rpm and tried an rpm rebuild. I was getting build errors, e.g.:

  /u/cmcl/rpmbuild/BUILD/lustre-2.10.2/lustre/include/lustre_lib.h:357:9: error: implicit declaration of function 'is_bl_done' [-Werror=implicit-function-declaration]
    struct l_wait_info *__info = (info); \
           ^
  /u/cmcl/rpmbuild/BUILD/lustre-2.10.2/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.c:2330:3: note: in expansion of macro 'l_wait_event'
    l_wait_event(lock->l_waitq, is_bl_done(lock), &lwi)

I'm doing something wrong and/or going about it in an overly complicated way. I thought the b2_10 branch would have already been patched. Cheers, Campbell |
| Comment by Peter Jones [ 07/Dec/18 ] |
|
Glad to hear that you've got this sorted out. Let us know whether the fix works as expected. |
| Comment by Campbell Mcleay (Inactive) [ 07/Dec/18 ] |
|
I've built the rpms fine, but I have another question: the client has the lustre kernel package installed (I am told it was installed because the lustre-patched kernel has better performance than a vanilla kernel), which provides the fs and net kernel modules. The kmod-lustre-client package also provides the kernel modules, though it installs them in /lib/modules/`uname -r`/extra/lustre-client rather than /lib/modules/`uname -r`/extra/lustre. Will this cause any kind of issue if both are installed, or is it better to install e.g. a vanilla kernel and rebuild the packages against that? Thanks, Campbell |
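One way to check which copy of the modules would actually be loaded (these are standard module tools, nothing Lustre-specific):

  # Where the kmod package installs its modules:
  rpm -ql kmod-lustre-client | grep '\.ko'

  # Which lustre.ko modprobe would actually pick for the running kernel:
  modinfo -n lustre

  # Look for duplicate copies under the running kernel:
  find /lib/modules/$(uname -r) -name 'lustre.ko*'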
| Comment by Jian Yu [ 07/Dec/18 ] |
|
Hi Campbell, |
| Comment by Peter Jones [ 07/Dec/18 ] |
|
Campbell, even the servers only need a patched kernel if you are using the project quotas feature. The patches that gave performance improvements in past versions have now been upstreamed, and many customers prefer the simplified administration of an unpatched kernel over having project quotas... Peter |
| Comment by Campbell Mcleay (Inactive) [ 28/Feb/19 ] |
|
Just some feedback: we got some soft lockups on one of our clients, though it only happened once. The other clients have been fine. |