Lustre / LU-12091

DNE/DOM: client evictions -108


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.12.0

    Description

      Lustre versions involved:
      Clients (Sherlock): 2.12.0 + patches from LU-11964 [mdc: prevent glimpse lock count grow]
      Servers (Fir): 2.12.0 + patches from LU-12037 [mdt: call mdt_dom_discard_data() after rename unlock]+[mdt: add option for cross-MDT rename]

      Last night, while fir-md1-s1 was relatively quiet, a lot of call traces showed up on the second MDS, fir-md1-s2 (serving MDT0001 and MDT0003). The first trace was:

      Mar 19 21:19:02 fir-md1-s2 kernel: LustreError: 90840:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.9.108.46@o2ib4  ns: mdt-fir-MDT0001_UUID lock: ffff8ecffb20f500/0xefacb2c18c9ab3c7 lrc: 3/0,0 mode: PW/PW res: [0x24000dd55:0x1cca7:0x0].0x0 bits 0x40/0x0 rrc: 79 type: IBT flags: 0x60200400000020 nid: 10.9.108.46@o2ib4 remote: 0x8374126d39604757 expref: 38560 pid: 91529 timeout: 900085 lvb_type: 0
      Mar 19 21:19:43 fir-md1-s2 kernel: Lustre: 91243:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply
                                           req@ffff8eb9af2cb000 x1627853341443984/t0(0) o101->df44ff7c-4e8a-070f-774f-84780b4dab3d@10.9.108.48@o2ib4:18/0 lens 600/3264 e 0 to 0 dl 1553055588 ref 2 fl Interpret:/0/0 rc 0/0
      Mar 19 21:19:43 fir-md1-s2 kernel: Lustre: 91243:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 31 previous similar messages
      Mar 19 21:19:48 fir-md1-s2 kernel: LustreError: 90840:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 30s: evicting client at 10.9.108.44@o2ib4  ns: mdt-fir-MDT0001_UUID lock: ffff8eba9122a400/0xefacb2c18c9ab3dc lrc: 3/0,0 mode: PW/PW res: [0x24000dd55:0x1cca7:0x0].0x0 bits 0x40/0x0 rrc: 77 type: IBT flags: 0x60200400000020 nid: 10.9.108.44@o2ib4 remote: 0xe63432aea7141892 expref: 920 pid: 91353 timeout: 900131 lvb_type: 0
      Mar 19 21:19:53 fir-md1-s2 kernel: LNet: Service thread pid 91651 was inactive for 200.05s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Mar 19 21:19:53 fir-md1-s2 kernel: Pid: 91651, comm: mdt03_027 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      Mar 19 21:19:53 fir-md1-s2 kernel: Call Trace:
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc0f930bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc0f93dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc14e64bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc14e6b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc14e6f0c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc14fe1ac>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc1501b68>] mdt_reint_setattr+0x6c8/0x12d0 [mdt]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc1503c53>] mdt_reint_rec+0x83/0x210 [mdt]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc14e2143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc14ed4a7>] mdt_reint+0x67/0x140 [mdt]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc103035a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc0fd492b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffc0fd825c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffb36c1c31>] kthread+0xd1/0xe0
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffb3d74c24>] ret_from_fork_nospec_begin+0xe/0x21
      Mar 19 21:19:53 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      Mar 19 21:19:53 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1553055593.91651
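
      The resource FID in the lock dump above ([0x24000dd55:0x1cca7:0x0]) can be mapped back to a pathname to see which file the contended PW lock is queued on. A minimal sketch, assuming the filesystem is mounted at /fir on a client:

      $ lfs fid2path /fir '[0x24000dd55:0x1cca7:0x0]'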
      

       

      One thing I noticed was a high lock count on MDT0001:

      $ clush -w@mds 'lctl get_param ldlm.namespaces.mdt-fir-MDT000*_UUID.lock_count'
       fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=972046
       fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=480758
       fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=3661720
       fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_count=186735
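
      To see whether the MDT0001 count keeps growing, the same parameter can be sampled in a loop; a rough sketch reusing the clush group above (the 60-second interval is arbitrary):

      $ while true; do date; clush -w@mds 'lctl get_param ldlm.namespaces.mdt-fir-MDT000*_UUID.lock_count'; sleep 60; done

      On a suspect client, unused DLM locks can also be dropped by clearing the lock LRU (this only cancels unused locks, it does not release locks a job is actively holding):

      $ lctl set_param ldlm.namespaces.*mdc*.lru_size=clear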
      

      I'm attaching last night's kernel logs for fir-md1-s2 as fir-md1-s2-kern.log.

      This is not like LU-12037: I don't think the filesystem was globally stuck. The problem seemed focused on a few clients, such as:

      • sh-101-09 10.9.101.9@o2ib4 – logs attached as sh-101-09-kern.log
      • sh-101-19 10.9.101.19@o2ib4 – logs attached as sh-101-19-kern.log

      On sh-101-09, the Lustre eviction impacted running jobs, as shown here:

      Mar 20 02:22:15 sh-101-09 kernel: LustreError: 167-0: fir-MDT0001-mdc-ffff9dc7b03c0000: This client was evicted by fir-MDT0001; in progress operations using this service will fail.
      Mar 20 02:22:15 sh-101-09 kernel: LustreError: 78084:0:(llite_lib.c:1551:ll_md_setattr()) md_setattr fails: rc = -5
      Mar 20 02:22:15 sh-101-09 kernel: LustreError: 78079:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x240005ab2:0x1e49e:0x0] error: rc = -5
      Mar 20 02:22:15 sh-101-09 kernel: Lustre: 93503:0:(llite_lib.c:2733:ll_dirty_page_discard_warn()) fir: dirty page discard: 10.0.10.51@o2ib7:10.0.10.52@o2ib7:/fir/fid: [0x24000ed00:0x10:0x0]// may get corrupted (rc -108)
      Mar 20 02:22:15 sh-101-09 kernel: LustreError: 107259:0:(vvp_io.c:1495:vvp_io_init()) fir: refresh file layout [0x24000ed00:0x10:0x0] error -108.
      Mar 20 02:22:20 sh-101-09 kernel: julia[117508]: segfault at 18 ip 00007fc5da76f3d2 sp 00007fff3acf86e0 error 4 in libhdf5.so.100.0.1[7fc5da68f000+37c000]
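
      Once an evicted node is reachable again, the state of its MDC import to fir-MDT0001 can be checked to confirm whether it reconnected or is still stuck; a small sketch, hedged on the exact parameter layout in 2.12:

      $ lctl get_param mdc.fir-MDT0001-mdc-*.import | grep -E 'state|connection'
      $ lctl get_param mdc.fir-MDT0001-mdc-*.state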
      

      There is also a call trace involving mdt_dom_discard_data:

      Mar 20 09:16:05 fir-md1-s2 kernel: Pid: 14629, comm: mdt02_105 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      Mar 20 09:16:05 fir-md1-s2 kernel: Call Trace:
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffc0f930bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffc0f93dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffc1523ef1>] mdt_dom_discard_data+0x101/0x130 [mdt]
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffc14fecb1>] mdt_reint_unlink+0x331/0x14b0 [mdt]
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffc1503c53>] mdt_reint_rec+0x83/0x210 [mdt]
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffc14e2143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffc14ed4a7>] mdt_reint+0x67/0x140 [mdt]
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffc103035a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffc0fd492b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffc0fd825c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffb36c1c31>] kthread+0xd1/0xe0
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffb3d74c24>] ret_from_fork_nospec_begin+0xe/0x21
      Mar 20 09:16:05 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
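
      To confirm that the files involved in these unlinks actually use Data-on-MDT layouts, the layout of a path recovered via lfs fid2path (as above) can be inspected; /fir/path/to/file below is a hypothetical placeholder:

      $ lfs getstripe -v /fir/path/to/file   # a DOM file shows a first component with pattern: mdt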
      

      However, since 09:16 the evictions seem to have stopped. We rebooted sh-101-09 and a few other clients this morning, so that may have helped.

      Attachments

        1. fir-md1-s2-kern.log
          375 kB
        2. sh-101-09-kern.log
          38 kB
        3. sh-101-19-kern.log
          785 kB

            People

              Assignee: Lai Siyao (laisiyao)
              Reporter: Stephane Thiell (sthiell)
