[LU-8634] 2.8.0 MDS (layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server) Created: 22/Sep/16  Updated: 07/Nov/16  Resolved: 07/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Josh Samuelson (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: mdt, ptlrpc, quota

Issue Links:
Related
is related to LU-8246 Leaks on ldlm granted locks counter o... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Synopsis:

Lustre version 2.8.0 on servers and clients; sudden degraded namespace lookups on clients. Errors from the RPC subsystem related to LDLM_INTENT_QUOTA are emitted to the Lustre debug log on all Lustre servers. The degraded performance is resolved by unmounting the MDT and rebooting the MDS.

I haven't found any JIRA matches on what we're seeing in particular, so opening this ticket.

Servers are running Lustre 2.8 from the latest el6 Community Releases:
https://downloads.hpdd.intel.com/public/lustre/latest-release/el6.7/server/RPMS/x86_64/

More details follow; any suggestions on how to collect more meaningful debugging information are appreciated, as we've experienced this degraded performance twice on this production cluster.


Detailed:

The Lustre system consists of 1 MDS with 1 MDT and 6 OSSs with 6 OSTs each, 36 OSTs in total.

The first time we experienced this issue (2016-09-07), we suspected something was wrong with the MDT, since all the errors pointed to that component (mds1 warnings on nid2str of "lo", and the OSSs warning on nid2str of "10.137.32.37@o2ib", mds1's NID).

The degraded namespace performance on the clients appeared as a total hang to some users/processes. Simple tests showed that a 'stat /lustre' took 6 seconds before returning valid data. Running stat on directories deeper in /lustre added 6 seconds per path component, e.g. 'stat /lustre/work' took 12 seconds, and so on. Path components with more entries took a bit longer to stat.
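
A minimal C sketch of the kind of per-component timing test described above (illustrative only; the default path is a placeholder):

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/stat.h>

/* Time a stat() of each successive prefix of a path (e.g. /lustre, then
 * /lustre/work, ...) to see how the latency grows per path component. */
int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/lustre/work";  /* placeholder */
    char prefix[4096];
    const char *p = path;

    while (*p) {
        const char *next = strchr(p + 1, '/');
        size_t len = next ? (size_t)(next - path) : strlen(path);

        if (len >= sizeof(prefix))
            break;
        memcpy(prefix, path, len);
        prefix[len] = '\0';

        struct stat st;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        int rc = stat(prefix, &st);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("stat(%s) rc=%d took %.3f s\n", prefix, rc, secs);

        if (!next)
            break;
        p = next;
    }
    return 0;
}

Compile with something like 'gcc -O2 -o statwalk statwalk.c' and point it at a progressively deeper path.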

The corrective actions we took at that time were to unmount the MDT, unload all the Lustre kernel modules, unconfigure the LNET network, and unload the remaining LND kernel modules. We then reversed this process without rebooting the MDS. Remounting the MDT succeeded; however, we experienced copious LustreError messages on the console and most of our clients were evicted. Snippets of the errors I still have on hand from then:

[1993495.191766] LustreError: 24447:0:(upcall_cache.c:237:upcall_cache_get_entry()) acquire for key 4254: error -110
[1993695.996095] Lustre: lstrFS-MDT0000: Recovery over after 5:00, of 111 clients 16 recovered and 95 were evicted.

'lctl dl' didn't show anything and /proc/fs/lustre was empty on the MDS after mounting the MDT. Calling 'stat /lustre' on a client was still heavily delayed; most namespace-related calls seemed to be completely hung.

Having arrived at this point, we determined our only option was to reboot the MDS. Upon our attempt to again unmount /lustre, messages were emitted to the console about various Lustre service threads being hung. The umount process had the following back trace:

cat /proc/25973/stack

[<ffffffffa12ab1e0>] __ldlm_namespace_free+0x1c0/0x560 [ptlrpc]
[<ffffffffa12ab5ef>] ldlm_namespace_free_prior+0x6f/0x220 [ptlrpc]
[<ffffffffa19055d2>] mdt_device_fini+0x6a2/0x12e0 [mdt]
[<ffffffffa10c3332>] class_cleanup+0x572/0xd20 [obdclass]
[<ffffffffa10c5646>] class_process_config+0x1b66/0x24c0 [obdclass]
[<ffffffffa10c645f>] class_manual_cleanup+0x4bf/0xc90 [obdclass]
[<ffffffffa10f7a1c>] server_put_super+0x8bc/0xcd0 [obdclass]
[<ffffffff811944bb>] generic_shutdown_super+0x5b/0xe0
[<ffffffff811945a6>] kill_anon_super+0x16/0x60
[<ffffffffa10c9646>] lustre_kill_super+0x36/0x60 [obdclass]
[<ffffffff81194d47>] deactivate_super+0x57/0x80
[<ffffffff811b4d3f>] mntput_no_expire+0xbf/0x110
[<ffffffff811b588b>] sys_umount+0x7b/0x3a0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

We waited ~15 minutes but the umount attempt seemed thoroughly wedged, so we reset the MDS.

Upon reboot, we ran 'e2fsck -fp /dev/path/to/mdt' prior to remounting. When we remounted, the recovery process seemed normal.


Fast-forward to today:
We ran without issue again until this morning around 5am, when the same symptoms appeared on the clients and the same errors on the servers:

mds1 lctl dk entry

00000100:00020000:7.0:1474538545.022548:0:3082:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
req@ffff8805c33040c0 x1544837681171836/t0(0) o101->lstrFS-MDT0000-lwp-MDT0000@0@lo:23/10 lens 456/192 e 0 to 0 dl 1474538552 ref 1 fl Interpret:RN/0/0 rc -115/-115

oss1 (lctl dk log rolled over, so /var/log/messages entry)

Sep 22 05:02:06 oss1 kernel: [3264568.614229] LustreError: 11270:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
Sep 22 05:02:06 oss1 kernel: [3264568.614231] req@ffff8800705e3c80 x1542797516212496/t0(0) o101->lstrFS-MDT0000-lwp-OST0000@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538533 ref 1 fl Interpret:RN/0/0 rc -115/-115

oss2 (lctl dk log rolled over, so /var/log/messages entry)

Sep 22 05:01:16 oss2 kernel: [3264426.880193] LustreError: 17980:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
Sep 22 05:01:16 oss2 kernel: [3264426.880195] req@ffff88030a58ccc0 x1542838780945432/t0(0) o101->lstrFS-MDT0000-lwp-OST0007@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538483 ref 1 fl Interpret:RN/0/0 rc -115/-115

oss3 (lctl dk log rolled over, so /var/log/messages entry)

Sep 22 05:01:40 oss3 kernel: [3264396.120233] LustreError: 17958:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
Sep 22 05:01:40 oss3 kernel: [3264396.120235] req@ffff880362c2f980 x1542838278617176/t0(0) o101->lstrFS-MDT0000-lwp-OST0011@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538507 ref 1 fl Interpret:RN/0/0 rc -115/-115

oss4 (lctl dk log rolled over, so /var/log/messages entry)

Sep 22 05:01:46 oss4 kernel: [3264365.930765] LustreError: 17958:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
Sep 22 05:01:46 oss4 kernel: [3264365.930767] req@ffff8800af9fc6c0 x1542838806272068/t0(0) o101->lstrFS-MDT0000-lwp-OST0012@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538513 ref 1 fl Interpret:RN/0/0 rc -115/-115

oss5 lctl dk

00000100:00020000:7.0:1474538507.425216:0:18722:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
req@ffff8801d42fb980 x1542838266300652/t0(0) o101->lstrFS-MDT0000-lwp-OST001b@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538514 ref 1 fl Interpret:RN/0/0 rc -115/-115

oss6 lctl dk

00000100:00020000:10.0:1474538538.449115:0:18717:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
req@ffff8805ae9ad380 x1542838948349576/t0(0) o101->lstrFS-MDT0000-lwp-OST001e@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538545 ref 1 fl Interpret:RN/0/0 rc -115/-115

The corrective actions we took today involved the same unmount/module-unload steps as above (from 2016-09-07). However, we rebooted the MDS rather than reloading the modules and remounting the MDT, to avoid any similar issues. After the MDS booted, we still ran 'e2fsck -fp /dev/path/to/mdt' to ensure the consistency of the cleanly unmounted MDT.

The following was output:

[QUOTA WARNING] Usage inconsistent for ID 0:actual (7314026496, 402396) != expected (7313580032, 402359)
lstrFS-MDT0000: Update quota info for quota type 0.
[QUOTA WARNING] Usage inconsistent for ID 0:actual (7312138240, 402074) != expected (7311691776, 402037)
lstrFS-MDT0000: Update quota info for quota type 1.
lstrFS-MDT0000: 47314184/2147483648 files (0.1% non-contiguous), 273274458/536870912 blocks

After remounting the MDT, all clients reconnected without being evicted.



 Comments   
Comment by Niu Yawei (Inactive) [ 26/Sep/16 ]

Were there lots of client activities when this error message showed up? I'm wondering whether the MDS could have been overloaded by locking requests. Could you check the following three proc files on the MDS when you see this message ("__req_capsule_get() ... 'quota_body'") again?

/proc/fs/lustre/ldlm/lock_granted_count
/proc/fs/lustre/ldlm/lock_limit_mb
/proc/fs/lustre/ldlm/lock_reclaim_threshold_mb

To avoid memory being filled up by ldlm locks, the MDS will reject any new locking request with -EINPROGRESS (-115) once the number of cached locks reaches a threshold. I checked the quota code, and it looks like it mistakenly interprets this error and tries to unpack the reply buffer anyway.
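
As an illustration of that failure mode (simplified stand-in types, not the actual Lustre code; per the error message above, the real path goes through req_capsule_server_get() on the quota_body field): when the intent is rejected, the reply carries no quota_body, so unpacking it unconditionally trips the "Wrong buffer for field" check:

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the RPC reply and quota body. */
struct quota_body { int qb_flags; };

struct intent_reply {
    int rc;                        /* intent result, e.g. -EINPROGRESS (-115) */
    struct quota_body *quota_body; /* not packed when the request was rejected */
};

/* Problematic pattern: unpack the quota body without first checking the
 * intent result.  When the server rejected the lock with -EINPROGRESS the
 * buffer was never packed, which is what trips the
 * "Wrong buffer for field 'quota_body'" check in __req_capsule_get(). */
static int quota_intent_interpret(struct intent_reply *rep)
{
    struct quota_body *qb = rep->quota_body;  /* "req_capsule_server_get()" */

    if (qb == NULL) {
        fprintf(stderr, "Wrong buffer for field 'quota_body'\n");
        return -EPROTO;
    }
    return 0;
}

int main(void)
{
    struct intent_reply rejected = { .rc = -EINPROGRESS, .quota_body = NULL };
    printf("rc = %d\n", quota_intent_interpret(&rejected));
    return 0;
}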

Comment by Josh Samuelson (Inactive) [ 27/Sep/16 ]

Hi Niu,

Thank you for taking a look into this.

I looked at the source for a bit and see that lock_reclaim_threshold_mb and lock_limit_mb are 20%/30% of totalram_pages, so our values are:

grep -H . /proc/fs/lustre/ldlm/{lock_reclaim_threshold_mb,lock_limit_mb}

/proc/fs/lustre/ldlm/lock_reclaim_threshold_mb:9638
/proc/fs/lustre/ldlm/lock_limit_mb:14457

Using the size reported in /proc/slabinfo for the ldlm_locks cache, i.e. for 'struct ldlm_lock' (512 bytes in our case), the low/high lock counts for us would be (using lustre/ldlm/ldlm_reclaim.c:ldlm_ratio2locknr() to find the counts):

low: 9638*2^20/(100*512) == ~197386
high: 14457*2^20/(100*512) == ~296079
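
A quick C sketch of that arithmetic (values taken from our proc files and slabinfo; the formula is my reading of ldlm_ratio2locknr() and may not match the code exactly):

#include <stdio.h>

int main(void)
{
    /* Values from /proc/fs/lustre/ldlm/ and /proc/slabinfo on mds1. */
    unsigned long reclaim_mb = 9638;   /* lock_reclaim_threshold_mb */
    unsigned long limit_mb   = 14457;  /* lock_limit_mb */
    unsigned long lock_size  = 512;    /* objsize of the ldlm_locks slab */

    /* MB -> bytes, then divide by 100 * sizeof(struct ldlm_lock); this
     * mirrors the hand calculation above and may be off by the same factor. */
    unsigned long low  = reclaim_mb * (1UL << 20) / (100 * lock_size);
    unsigned long high = limit_mb   * (1UL << 20) / (100 * lock_size);

    printf("low  (reclaim threshold): ~%lu locks\n", low);   /* ~197386 */
    printf("high (limit):             ~%lu locks\n", high);  /* ~296079 */
    return 0;
}
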

If I found the counts correctly, what are those threshold counts checked against, the number of slab ldlm_locks active objects?

I'm curious how the following proc values relate:

In particular, should active_objs for ldlm_locks closely match the lock values Lustre outputs under the /proc/fs/lustre/ldlm paths?

mds1# grep -E '(active_objs)
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ldlm_locks 6669 8183 512 7 1 : tunables 54 27 8 : slabdata 1169 1169 255
ldlm_resources 1820 3580 192 20 1 : tunables 120 60 8 : slabdata 179 179 616
mds1# grep -H . /proc/fs/lustre/ldlm/lock_granted_count

/proc/fs/lustre/ldlm/lock_granted_count:3549033

Should ldlm_locks active_objs of 6669 be close to the granted count of 3549033?

mds1# grep -H . /proc/fs/lustre/ldlm/namespaces/*/lock_count

/proc/fs/lustre/ldlm/namespaces/lstrFS-MDT0000-lwp-MDT0000/lock_count:232
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0000-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0001-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0002-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0003-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0004-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0005-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0006-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0007-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0008-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0009-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST000a-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST000b-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST000c-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST000d-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST000e-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST000f-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0010-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0011-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0012-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0013-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0014-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0015-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0016-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0017-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0018-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0019-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST001a-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST001b-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST001c-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST001d-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST001e-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST001f-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0020-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0021-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0022-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/lstrFS-OST0023-osc-MDT0000/lock_count:0
/proc/fs/lustre/ldlm/namespaces/mdt-lstrFS-MDT0000_UUID/lock_count:5156
/proc/fs/lustre/ldlm/namespaces/MGC10.137.32.37@o2ib/lock_count:6
/proc/fs/lustre/ldlm/namespaces/MGS/lock_count:600

Should these values add up (or be close) to the ldlm_locks active_objs count also?

mds1# cat /proc/fs/lustre/ldlm/namespaces/mdt-lstrFS-MDT0000_UUID/pool/state

LDLM pool state (ldlm-pool-mdt-lstrFS-MDT0000_UUID-1):
SLV: 1
CLV: 0
LVF: 1
GSP: 1%
GP: 3011936
GR: 33
CR: 50
GS: -17
G: 3549033
L: 2409549

Above, G (granted) > L (limit)?

With your hint to look at the LDLM counters, I shared the odd granted-vs-limit lock counts with a coworker, and he came across https://jira.hpdd.intel.com/browse/LU-8246, which I believe matches what we're seeing.

Comment by Niu Yawei (Inactive) [ 27/Sep/16 ]

Right, it's related to LU-8246, which drives 'lock_granted_count' to an unreasonably high value, so the server starts to reclaim locks and reject new locking requests. Another defect revealed here is that the quota code doesn't handle this situation (the quota intent lock being rejected with -EINPROGRESS) well; I'll try to cook up a patch soon.
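
For illustration only, the general direction of such a fix, shown with the same simplified stand-ins as the earlier sketch (the actual patch may well differ), is to check the intent result before unpacking the reply and to pass -EINPROGRESS back to the caller as a retryable condition:

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Same simplified stand-ins as the earlier sketch; not the actual patch. */
struct quota_body { int qb_flags; };

struct intent_reply {
    int rc;                        /* intent result from the server */
    struct quota_body *quota_body; /* only packed when rc == 0 */
};

/* Check the intent result first: a -EINPROGRESS rejection (server busy
 * reclaiming locks) is returned as retryable instead of being treated
 * as a malformed reply. */
static int quota_intent_interpret(struct intent_reply *rep)
{
    if (rep->rc == -EINPROGRESS)
        return -EINPROGRESS;

    if (rep->quota_body == NULL)
        return -EPROTO;            /* genuinely malformed reply */

    return 0;
}

int main(void)
{
    struct intent_reply rejected = { .rc = -EINPROGRESS, .quota_body = NULL };
    struct quota_body qb = { 0 };
    struct intent_reply granted  = { .rc = 0, .quota_body = &qb };

    printf("rejected: %d\n", quota_intent_interpret(&rejected)); /* -115, retry */
    printf("granted:  %d\n", quota_intent_interpret(&granted));  /* 0 */
    return 0;
}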

Comment by Niu Yawei (Inactive) [ 28/Oct/16 ]

Patch was uploaded at: http://review.whamcloud.com/#/c/22751/

Comment by Gerrit Updater [ 07/Nov/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22751/
Subject: LU-8634 quota: fix return code of intent quota lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 11387730bac0ebb7940657ac6c463f4afd9b0fe8

Comment by Peter Jones [ 07/Nov/16 ]

Landed for 2.9
