Description
Lustre versions involved:
Clients (Sherlock): 2.12.0 + patches from LU-11964 [mdc: prevent glimpse lock count grow]
Servers (Fir): 2.12.0 + patches from LU-12037 [mdt: call mdt_dom_discard_data() after rename unlock]+[mdt: add option for cross-MDT rename]
Last night, while fir-md1-s1 was relatively quiet, we had a lot of call traces showing up on the second MDS, fir-md1-s2 (serving MDT0001 and MDT0003); the first trace was:
Mar 19 21:19:02 fir-md1-s2 kernel: LustreError: 90840:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.9.108.46@o2ib4 ns: mdt-fir-MDT0001_UUID lock: ffff8ecffb20f500/0xefacb2c18c9ab3c7 lrc: 3/0,0 mode: PW/PW res: [0x24000dd55:0x1cca7:0x0].0x0 bits 0x40/0x0 rrc: 79 type: IBT flags: 0x60200400000020 nid: 10.9.108.46@o2ib4 remote: 0x8374126d39604757 expref: 38560 pid: 91529 timeout: 900085 lvb_type: 0
Mar 19 21:19:43 fir-md1-s2 kernel: Lustre: 91243:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply req@ffff8eb9af2cb000 x1627853341443984/t0(0) o101->df44ff7c-4e8a-070f-774f-84780b4dab3d@10.9.108.48@o2ib4:18/0 lens 600/3264 e 0 to 0 dl 1553055588 ref 2 fl Interpret:/0/0 rc 0/0
Mar 19 21:19:43 fir-md1-s2 kernel: Lustre: 91243:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 31 previous similar messages
Mar 19 21:19:48 fir-md1-s2 kernel: LustreError: 90840:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 30s: evicting client at 10.9.108.44@o2ib4 ns: mdt-fir-MDT0001_UUID lock: ffff8eba9122a400/0xefacb2c18c9ab3dc lrc: 3/0,0 mode: PW/PW res: [0x24000dd55:0x1cca7:0x0].0x0 bits 0x40/0x0 rrc: 77 type: IBT flags: 0x60200400000020 nid: 10.9.108.44@o2ib4 remote: 0xe63432aea7141892 expref: 920 pid: 91353 timeout: 900131 lvb_type: 0
Mar 19 21:19:53 fir-md1-s2 kernel: LNet: Service thread pid 91651 was inactive for 200.05s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 19 21:19:53 fir-md1-s2 kernel: Pid: 91651, comm: mdt03_027 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 19 21:19:53 fir-md1-s2 kernel: Call Trace:
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc0f930bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc0f93dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc14e64bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc14e6b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc14e6f0c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc14fe1ac>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc1501b68>] mdt_reint_setattr+0x6c8/0x12d0 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc1503c53>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc14e2143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc14ed4a7>] mdt_reint+0x67/0x140 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc103035a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc0fd492b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffc0fd825c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffb36c1c31>] kthread+0xd1/0xe0
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffb3d74c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 19 21:19:53 fir-md1-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Mar 19 21:19:53 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1553055593.91651
One thing I noticed is a high lock count on MDT0001:
$ clush -w@mds 'lctl get_param ldlm.namespaces.mdt-fir-MDT000*_UUID.lock_count'
fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=972046
fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=480758
fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=3661720
fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_count=186735
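In case it's useful, something along these lines could be used to keep an eye on the lock counts over time and, on a suspect client, inspect or trim the client-side LDLM LRU (untested sketch; the clush group is the same one used above, and the timestamped output file is just an example):

$ clush -w@mds 'lctl get_param ldlm.namespaces.mdt-fir-MDT000*_UUID.lock_count' | sort > /tmp/ldlm_lock_count.$(date +%s)
# on a client, show MDC lock counts and LRU size, then optionally drop unused locks from the LRU
$ lctl get_param ldlm.namespaces.*mdc*.lock_count ldlm.namespaces.*mdc*.lru_size
$ lctl set_param ldlm.namespaces.*mdc*.lru_size=clear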
I'm attaching last night's kernel logs for fir-md1-s2 as fir-md1-s2-kern.log.
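For reference, the evicted client NIDs can be pulled out of that log with something like the following (rough sketch; the sed expression is just one way to do it):

$ grep 'evicting client at' fir-md1-s2-kern.log | sed -e 's/.*evicting client at \([^ ]*\).*/\1/' | sort | uniq -c | sort -rn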
This is not like LU-12037, as I don't think the filesystem was globally stuck. The problem seemed focused on a few clients, such as:
- sh-101-09 10.9.101.9@o2ib4 – logs attached as sh-101-09-kern.log
- sh-101-19 10.9.101.19@o2ib4 – logs attached as sh-101-19-kern.log
On sh-101-09, the Lustre eviction impacted running jobs, as shown here:
Mar 20 02:22:15 sh-101-09 kernel: LustreError: 167-0: fir-MDT0001-mdc-ffff9dc7b03c0000: This client was evicted by fir-MDT0001; in progress operations using this service will fail.
Mar 20 02:22:15 sh-101-09 kernel: LustreError: 78084:0:(llite_lib.c:1551:ll_md_setattr()) md_setattr fails: rc = -5
Mar 20 02:22:15 sh-101-09 kernel: LustreError: 78079:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x240005ab2:0x1e49e:0x0] error: rc = -5
Mar 20 02:22:15 sh-101-09 kernel: Lustre: 93503:0:(llite_lib.c:2733:ll_dirty_page_discard_warn()) fir: dirty page discard: 10.0.10.51@o2ib7:10.0.10.52@o2ib7:/fir/fid: [0x24000ed00:0x10:0x0]// may get corrupted (rc -108)
Mar 20 02:22:15 sh-101-09 kernel: LustreError: 107259:0:(vvp_io.c:1495:vvp_io_init()) fir: refresh file layout [0x24000ed00:0x10:0x0] error -108.
Mar 20 02:22:20 sh-101-09 kernel: julia[117508]: segfault at 18 ip 00007fc5da76f3d2 sp 00007fff3acf86e0 error 4 in libhdf5.so.100.0.1[7fc5da68f000+37c000]
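After an eviction like this, a quick way to check from the client whether its MDC imports have reconnected is something like the following (sketch; the parameter glob assumes the fir fsname, and the import state should read FULL again once recovered):

$ lctl get_param mdc.fir-MDT*.import | grep -E 'target:|state:'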
There is also a call trace involving mdt_dom_discard_data:
Mar 20 09:16:05 fir-md1-s2 kernel: Pid: 14629, comm: mdt02_105 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 20 09:16:05 fir-md1-s2 kernel: Call Trace:
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffc0f930bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffc0f93dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffc1523ef1>] mdt_dom_discard_data+0x101/0x130 [mdt]
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffc14fecb1>] mdt_reint_unlink+0x331/0x14b0 [mdt]
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffc1503c53>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffc14e2143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffc14ed4a7>] mdt_reint+0x67/0x140 [mdt]
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffc103035a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffc0fd492b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffc0fd825c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffb36c1c31>] kthread+0xd1/0xe0
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffb3d74c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 20 09:16:05 fir-md1-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
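Since mdt_dom_discard_data is involved, the resource FID from the first trace could be checked for a Data-on-MDT component from a client with something like this (sketch; /fir as the client mount point is an assumption, and fid2path will fail if the file has already been unlinked):

$ lfs fid2path /fir '[0x24000dd55:0x1cca7:0x0]'
$ lfs getstripe -v "$(lfs fid2path /fir '[0x24000dd55:0x1cca7:0x0]')"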
However, it seems to have stopped since 09:16. We rebooted sh-101-09 and a few other clients this morning, so perhaps that helped.
Issue Links
- duplicates LU-11359: racer test 1 times out with client hung in dir_create.sh, ls, … and MDS in ldlm_completion_ast() (Resolved)