Details
Type: Bug
Resolution: Fixed
Priority: Major
Affects Version: Lustre 2.8.0
Severity: 3
Description
Synopsis:
Lustre version 2.8.0 on servers and clients; suddenly degraded namespace lookups on clients. RPC-subsystem errors related to LDLM_INTENT_QUOTA are emitted to the Lustre debug log on all Lustre servers. The degraded performance is resolved by unmounting the MDT and rebooting the MDS.
I haven't found any existing JIRA tickets matching what we're seeing, so I'm opening this one.
Servers are running Lustre 2.8 from latest el6 Community Releases:
https://downloads.hpdd.intel.com/public/lustre/latest-release/el6.7/server/RPMS/x86_64/
More details follow; any suggestions on how to collect more meaningful debugging information are appreciated, as we've now experienced this degraded performance twice on this production cluster.
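For reference, the server-side debug-log entries quoted later in this ticket were pulled roughly as follows; the output path is just an example:
lctl dk /tmp/lustre-debug.$(hostname).txt     # dump and clear the kernel debug buffer to a file
grep LDLM_INTENT_QUOTA /tmp/lustre-debug.$(hostname).txt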
Detailed:
The Lustre system consists of 1 MDS with 1 MDT and 6 OSSs with 6 OSTs per OSS, 36 OSTs in total.
The first time we experienced this issue (2016-09-07), we suspected something was wrong with the MDT, since all the errors pointed to that component: mds1 warned on a nid2str of "lo", and the OSSs warned on a nid2str of "10.137.32.37@o2ib", which is mds1's NID.
The degraded namespace performance on the clients appeared to be a total hang for some users/processes. Simple tests showed that a 'stat /lustre' took 6 seconds before returning valid data. Stat'ing directories deeper in /lustre added 6 seconds per path component, i.e. 'stat /lustre/work' took 12 seconds, and so on. Path components with more entries took a bit longer to stat.
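For illustration, the simple latency test amounted to the following (timings are approximate):
time stat /lustre          # ~6 seconds before returning
time stat /lustre/work     # ~12 seconds; each additional path component added ~6 seconds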
The corrective actions we took at that time were to unmount the MDT, unload all the Lustre kernel modules, unconfigure the LNET network, and unload the remaining LND kernel modules (a rough sketch of this sequence follows the error snippets below). We then reversed this process without rebooting the MDS. Remounting the MDT succeeded; however, we saw copious LustreError messages on the console and most of our clients were evicted. Snippets of the errors I still have on hand from then:
[1993495.191766] LustreError: 24447:0:(upcall_cache.c:237:upcall_cache_get_entry()) acquire for key 4254: error -110
[1993695.996095] Lustre: lstrFS-MDT0000: Recovery over after 5:00, of 111 clients 16 recovered and 95 were evicted.
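The teardown sequence mentioned above looked roughly like this; the MDT mount point and the exact LND/LNET module names are assumptions for illustration:
umount /mnt/lustre/mdt              # unmount the MDT (mount point assumed)
lustre_rmmod                        # unload the Lustre kernel modules
lctl network down                   # unconfigure the LNET network
modprobe -r ko2iblnd lnet libcfs    # unload the remaining LND/LNET modules (names assumed)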
'lctl dl' didn't show anything and /proc/fs/lustre was empty on the MDS after the MDT was mounted. Calling 'stat /lustre' on a client was still heavily delayed, and most namespace-related calls seemed to be completely hung.
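The checks on the MDS at that point were simply:
lctl dl                 # listed no devices
ls /proc/fs/lustre      # directory was empty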
Having arrived at this point, we determined our only option was to reboot the MDS. When we attempted to 'umount /lustre' again, messages were emitted to the console about various Lustre service threads being hung. The umount process had the following back trace:
[<ffffffffa12ab1e0>] __ldlm_namespace_free+0x1c0/0x560 [ptlrpc]
[<ffffffffa12ab5ef>] ldlm_namespace_free_prior+0x6f/0x220 [ptlrpc]
[<ffffffffa19055d2>] mdt_device_fini+0x6a2/0x12e0 [mdt]
[<ffffffffa10c3332>] class_cleanup+0x572/0xd20 [obdclass]
[<ffffffffa10c5646>] class_process_config+0x1b66/0x24c0 [obdclass]
[<ffffffffa10c645f>] class_manual_cleanup+0x4bf/0xc90 [obdclass]
[<ffffffffa10f7a1c>] server_put_super+0x8bc/0xcd0 [obdclass]
[<ffffffff811944bb>] generic_shutdown_super+0x5b/0xe0
[<ffffffff811945a6>] kill_anon_super+0x16/0x60
[<ffffffffa10c9646>] lustre_kill_super+0x36/0x60 [obdclass]
[<ffffffff81194d47>] deactivate_super+0x57/0x80
[<ffffffff811b4d3f>] mntput_no_expire+0xbf/0x110
[<ffffffff811b588b>] sys_umount+0x7b/0x3a0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
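For reference, a stack like the one above can be captured on demand for a hung process; this is a sketch of one way to do it, not necessarily how the trace above was obtained:
pid=$(pgrep -x umount)      # PID of the stuck umount
cat /proc/$pid/stack        # kernel stack of the hung task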
We waited ~15 minutes but the umount attempt seemed thoroughly wedged, so we reset the MDS.
Upon reboot, we ran 'e2fsck -fp /dev/path/to/mdt' prior to remounting. When we then mounted the MDT, the recovery process seemed normal.
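The check-and-remount step was essentially the following; the mount point is an assumption on our part:
e2fsck -fp /dev/path/to/mdt                        # preen-mode check of the MDT
mount -t lustre /dev/path/to/mdt /mnt/lustre/mdt   # remount; clients then went through recovery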
Fast-forward to today:
We ran without issue again until this morning around 5am, with the same symptoms on clients and errors on the servers:
00000100:00020000:7.0:1474538545.022548:0:3082:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
req@ffff8805c33040c0 x1544837681171836/t0(0) o101->lstrFS-MDT0000-lwp-MDT0000@0@lo:23/10 lens 456/192 e 0 to 0 dl 1474538552 ref 1 fl Interpret:RN/0/0 rc -115/-115
Sep 22 05:02:06 oss1 kernel: [3264568.614229] LustreError: 11270:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
Sep 22 05:02:06 oss1 kernel: [3264568.614231] req@ffff8800705e3c80 x1542797516212496/t0(0) o101->lstrFS-MDT0000-lwp-OST0000@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538533 ref 1 fl Interpret:RN/0/0 rc -115/-115
Sep 22 05:01:16 oss2 kernel: [3264426.880193] LustreError: 17980:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
Sep 22 05:01:16 oss2 kernel: [3264426.880195] req@ffff88030a58ccc0 x1542838780945432/t0(0) o101->lstrFS-MDT0000-lwp-OST0007@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538483 ref 1 fl Interpret:RN/0/0 rc -115/-115
Sep 22 05:01:40 oss3 kernel: [3264396.120233] LustreError: 17958:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
Sep 22 05:01:40 oss3 kernel: [3264396.120235] req@ffff880362c2f980 x1542838278617176/t0(0) o101->lstrFS-MDT0000-lwp-OST0011@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538507 ref 1 fl Interpret:RN/0/0 rc -115/-115
Sep 22 05:01:46 oss4 kernel: [3264365.930765] LustreError: 17958:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
Sep 22 05:01:46 oss4 kernel: [3264365.930767] req@ffff8800af9fc6c0 x1542838806272068/t0(0) o101->lstrFS-MDT0000-lwp-OST0012@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538513 ref 1 fl Interpret:RN/0/0 rc -115/-115
00000100:00020000:7.0:1474538507.425216:0:18722:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
req@ffff8801d42fb980 x1542838266300652/t0(0) o101->lstrFS-MDT0000-lwp-OST001b@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538514 ref 1 fl Interpret:RN/0/0 rc -115/-115
00000100:00020000:10.0:1474538538.449115:0:18717:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server)
req@ffff8805ae9ad380 x1542838948349576/t0(0) o101->lstrFS-MDT0000-lwp-OST001e@10.137.32.37@o2ib:23/10 lens 456/192 e 0 to 0 dl 1474538545 ref 1 fl Interpret:RN/0/0 rc -115/-115
The corrective actions we took today involved the same unmount/module-unload steps described above from 2016-09-07; however, we rebooted the MDS rather than reloading the modules and remounting the MDT, to avoid any similar issues. Upon MDS boot, we still ran 'e2fsck -fp /dev/path/to/mdt' to ensure consistency of the cleanly unmounted MDT.
The following was output:
[QUOTA WARNING] Usage inconsistent for ID 0:actual (7314026496, 402396) != expected (7313580032, 402359)
lstrFS-MDT0000: Update quota info for quota type 0.
[QUOTA WARNING] Usage inconsistent for ID 0:actual (7312138240, 402074) != expected (7311691776, 402037)
lstrFS-MDT0000: Update quota info for quota type 1.
lstrFS-MDT0000: 47314184/2147483648 files (0.1% non-contiguous), 273274458/536870912 blocks
After remounting the MDT, all clients reconnected without being evicted.
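As a sanity check after the remount, recovery progress can be watched on the MDS with something like the following (the target name matches the logs above):
lctl get_param mdt.lstrFS-MDT0000.recovery_status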
Attachments
Issue Links
- is related to: LU-8246 Leaks on ldlm granted locks counter on MDS leading to canceling loop (Resolved)