[LU-8634] 2.8.0 MDS (layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server) Created: 22/Sep/16 Updated: 07/Nov/16 Resolved: 07/Nov/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Josh Samuelson (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mdt, ptlrpc, quota |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
Synopsis: Lustre version 2.8.0 on servers and clients; sudden degraded namespace lookups on clients. RPC-subsystem ERRORS related to LDLM_INTENT_QUOTA are emitted to the Lustre debug log on all Lustre servers. The degraded performance is resolved by unmounting the MDT and rebooting the MDS. I haven't found any JIRA matches for what we're seeing in particular, so I'm opening this ticket. Servers are running Lustre 2.8 from the latest el6 Community Releases. More details follow; any suggestions on how to collect more meaningful debugging information are appreciated, as we've experienced this degraded performance twice on this production cluster.

Detailed: The Lustre system consists of 1 MDS with 1 MDT and 6 OSSs with 6 OSTs per OSS, 36 OSTs total. The first time we experienced this issue (2016-09-07) we suspected something was wrong with the MDT, since all the errors pointed to that component (mds1 warnings on nid2str of "lo" and OSS warnings on nid2str of "10.137.32.37@o2ib", the mds1's NID). The degraded namespace performance on the clients appeared to be a total hang for some users/processes. Simple tests showed that a 'stat /lustre' took 6 seconds before returning valid data. Running stat on directories deeper in /lustre added 6 seconds per path component, i.e. 'stat /lustre/work' took 12 seconds, etc. Path components with more entries took a bit longer to stat.

The corrective actions we took at that time were to unmount the MDT, unload all the Lustre kernel modules, unconfigure the LNet network and unload the remaining LND kernel modules. We then reversed this process without rebooting the MDS. Remounting the MDT succeeded; however, we experienced copious LustreError messages on the console and most of our clients were evicted. Snippets of the errors I still have on hand from then:

[1993495.191766] LustreError: 24447:0:(upcall_cache.c:237:upcall_cache_get_entry()) acquire for key 4254: error -110

'lctl dl' didn't show anything and /proc/fs/lustre was empty on the MDS after mounting the MDT. Calling 'stat /lustre' on a client was still very heavily delayed, and most namespace-related calls seemed to be completely hung. Having arrived at this point, we determined our only option was to reboot the MDS. Upon our attempt to again 'unmount /lustre', messages were emitted to the console about various Lustre service threads being hung. The umount process had the following back trace:

cat /proc/25973/stack
[<ffffffffa12ab1e0>] __ldlm_namespace_free+0x1c0/0x560 [ptlrpc]

We waited ~15 minutes but the umount attempt seemed thoroughly wedged, so we reset the MDS. Upon reboot, we ran 'e2fsck -fp /dev/path/to/mdt' prior to remounting. When we remounted, the recovery process seemed normal.

Fast-forward to today:

mds1 lctl dk entry
00000100:00020000:7.0:1474538545.022548:0:3082:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)

oss1 (lctl dk log rolled over, so /var/log/messages entry)
Sep 22 05:02:06 oss1 kernel: [3264568.614229] LustreError: 11270:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)

oss2 (lctl dk log rolled over, so /var/log/messages entry)
Sep 22 05:01:16 oss2 kernel: [3264426.880193] LustreError: 17980:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)

oss3 (lctl dk log rolled over, so /var/log/messages entry)
Sep 22 05:01:40 oss3 kernel: [3264396.120233] LustreError: 17958:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)

oss4 (lctl dk log rolled over, so /var/log/messages entry)
Sep 22 05:01:46 oss4 kernel: [3264365.930765] LustreError: 17958:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)

oss5 lctl dk
00000100:00020000:7.0:1474538507.425216:0:18722:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)

oss6 lctl dk
00000100:00020000:10.0:1474538538.449115:0:18717:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)

The corrective actions we took today involved the same unmount/unloading-of-kernel-modules steps as described above from 2016-09-07. However, we rebooted the MDS rather than reloading modules and remounting the MDT, to avoid any similar issues. Upon MDS boot, we still ran 'e2fsck -fp /dev/path/to/mdt' to ensure consistency of the cleanly unmounted MDT. The following was output:

[QUOTA WARNING] Usage inconsistent for ID 0:actual (7314026496, 402396) != expected (7313580032, 402359)

After remounting the MDT, all clients reconnected without being evicted. |
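For reference, a minimal shell sketch of the per-path-component stat timing check described above; the deeper paths here are placeholders, not directories taken from this system:

# Hypothetical timing check: each additional path component added ~6 seconds.
# /lustre/work and /lustre/work/proj are placeholder paths.
for p in /lustre /lustre/work /lustre/work/proj; do
    printf '%s: ' "$p"
    /usr/bin/time -f '%e s' stat "$p" > /dev/null
done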
| Comments |
| Comment by Niu Yawei (Inactive) [ 26/Sep/16 ] |
|
Was there a lot of client activity when this error message showed up? I'm wondering whether the MDS could be overloaded by locking requests. Could you check the following three proc files on the MDS the next time you see this message ("__req_capsule_get() ... 'quota_body'")?

/proc/fs/lustre/ldlm/lock_granted_count
/proc/fs/lustre/ldlm/lock_limit_mb
/proc/fs/lustre/ldlm/lock_reclaim_threshold_mb

To avoid memory being filled up by ldlm locks, the MDS will reject any new locking request with -EINPROGRESS (-115) once the number of cached locks reaches a threshold. I checked the quota code, and it looks like it would mistakenly interpret this error and try to unpack the reply buffer. |
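A one-liner along these lines could capture all three values in one pass the next time the message appears (a sketch; the proc paths are the ones listed above):

# Collect the ldlm lock counter and reclaim tunables on the MDS when the error recurs.
grep -H . /proc/fs/lustre/ldlm/lock_granted_count \
          /proc/fs/lustre/ldlm/lock_limit_mb \
          /proc/fs/lustre/ldlm/lock_reclaim_threshold_mb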
| Comment by Josh Samuelson (Inactive) [ 27/Sep/16 ] |
|
Hi Niu,

Thank you for taking a look into this. I looked at the source for a bit and see that lock_reclaim_threshold_mb and lock_limit_mb are 20%/30% of totalram_pages, so our values are:

grep -H . /proc/fs/lustre/ldlm/{lock_reclaim_threshold_mb,lock_limit_mb}
/proc/fs/lustre/ldlm/lock_reclaim_threshold_mb:9638

Using the size reported in /proc/slabinfo for cache type ldlm_locks for data structure 'struct ldlm_locks' (512 in our case), the low/high lock counts for us would be (using lustre/ldlm/ldlm_reclaim.c:ldlm_ratio2locknr() to find the counts):

low: 9638*2^20/(100*512) == ~197386

If I found the counts correctly, what are those threshold counts checked against, the number of slab ldlm_locks active objects? I'm curious how the following proc values relate. In particular, should active_objs for ldlm_locks closely match the lock values Lustre reports under the /proc/fs/lustre/ldlm paths?

mds1# grep -E '(active_objs)

mds1# grep -H . /proc/fs/lustre/ldlm/lock_granted_count
/proc/fs/lustre/ldlm/lock_granted_count:3549033

Should the ldlm_locks active_objs count of 6669 be close to the granted count of 3549033?

mds1# grep -H . /proc/fs/lustre/ldlm/namespaces/*/lock_count
/proc/fs/lustre/ldlm/namespaces/lstrFS-MDT0000-lwp-MDT0000/lock_count:232

Should these values add up (or be close) to the ldlm_locks active_objs count as well?

mds1# cat /proc/fs/lustre/ldlm/namespaces/mdt-lstrFS-MDT0000_UUID/pool/state
LDLM pool state (ldlm-pool-mdt-lstrFS-MDT0000_UUID-1):

Above, G (granted) > L (limit)?

With your hint to look at the LDLM counters, and after I shared the odd granted vs. limit lock counts with a coworker, he came across https://jira.hpdd.intel.com/browse/LU-8246, which I believe matches what we're seeing. |
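For what it's worth, the reclaim-threshold arithmetic above can be redone directly in the shell (a sketch that simply reproduces the calculation in this comment, assuming the 512-byte ldlm_locks slab object size and the 9638 MB threshold reported here):

# 9638 MB * 2^20 bytes/MB / (100 * 512 bytes per ldlm_locks slab object)
awk -v mb=9638 -v objsz=512 'BEGIN { printf "~%d locks\n", (mb * 2^20) / (100 * objsz) }'
# prints: ~197386 locks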
| Comment by Niu Yawei (Inactive) [ 27/Sep/16 ] |
|
Right, it's related to LU-8246.
| Comment by Niu Yawei (Inactive) [ 28/Oct/16 ] |
|
Patch was uploaded at: http://review.whamcloud.com/#/c/22751/ |
| Comment by Gerrit Updater [ 07/Nov/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22751/ |
| Comment by Peter Jones [ 07/Nov/16 ] |
|
Landed for 2.9 |