LU-11301: hung threads on MDT and MDT won't umount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.12.0
    • Affects Version: Lustre 2.10.4
    • Labels: None
    • Environment: x86_64, zfs, 3 MDTs, all on 1 MDS, or across 2 MDS's, 2.10.4 + many patches.
    • Severity: 2
    • 9223372036854775807

    Description

      Hi,

      we've had 2 more MDT hangs with similar symptoms to LU-11082. file operations on clients hang, usually spreading gradually to more and more clients. in every case MDS failover restores functionality, one of the 3 MDTs won't umount, and the MDS has to be stonith'd. only our large DNE filesystem (dagg) is affected, and not home, images, or apps, so it's likely a DNE issue.

      the first hang was yesterday. the 3 MDTs were distributed across the 2 MDS's. dagg-mdt2 would not umount during the MDS failovers that were needed to restore operation. warble1 and warble2 logs are attached.
      in the hang today all MDTs were on warble1 and dagg-mdt0 wouldn't umount. logs are attached.

      prior to these 2 hangs, we hadn't had similar MDS issues for 3 weeks. we can't see an obvious change in user behaviour that would trigger this.

      the hangs yesterday started with

      Aug 29 14:19:37 warble2 kernel: LNet: Service thread pid 51461 was inactive for 200.46s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Aug 29 14:19:37 warble2 kernel: Pid: 51461, comm: mdt_rdpg01_013 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
      Aug 29 14:19:37 warble2 kernel: Call Trace:
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc0fb7037>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc0f9890b>] top_trans_stop+0x42b/0x930 [ptlrpc]
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc18655f9>] lod_trans_stop+0x259/0x340 [lod]
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc1ac623a>] mdd_trans_stop+0x2a/0x46 [mdd]
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc1abbbcb>] mdd_attr_set+0x5eb/0xce0 [mdd]
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc1a1b206>] mdt_mfd_close+0x1a6/0x610 [mdt]
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc1a20981>] mdt_close_internal+0x121/0x220 [mdt]
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc1a20ca0>] mdt_close+0x220/0x780 [mdt]
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc0f8538a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc0f2de4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Aug 29 14:19:37 warble2 kernel: [<ffffffffc0f31592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Aug 29 14:19:37 warble2 kernel: [<ffffffff9f8bb621>] kthread+0xd1/0xe0
      Aug 29 14:19:37 warble2 kernel: [<ffffffff9ff205dd>] ret_from_fork_nospec_begin+0x7/0x21
      Aug 29 14:19:37 warble2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Aug 29 14:19:37 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1535516377.51461
      

      and the hangs today started with

      Aug 30 12:25:32 warble1 kernel: Lustre: dagg-MDT0000-osp-MDT0002: Connection to dagg-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      Aug 30 12:25:32 warble1 kernel: LustreError: 212573:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1535595632, 300s ago), entering recovery for dagg-MDT0000_UUID@192.168.44.21@o2ib44 ns: dagg-MDT0000-osp-MDT0002 lock: ffff93869b37ea00/0xddb792bb3962e7e2 lrc: 4/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 13 type: IBT flags: 0x1000001000000 nid: local remote: 0xddb792bb3962e7e9 expref: -99 pid: 212573 timeout: 0 lvb_type: 0
      Aug 30 12:25:32 warble1 kernel: Lustre: dagg-MDT0000: Received new LWP connection from 0@lo, removing former export from same NID
      Aug 30 12:25:32 warble1 kernel: Lustre: Skipped 1 previous similar message
      Aug 30 12:25:32 warble1 kernel: Lustre: dagg-MDT0000: Connection restored to 192.168.44.21@o2ib44 (at 0@lo)
      Aug 30 12:25:32 warble1 kernel: Lustre: Skipped 5 previous similar messages
      Aug 30 12:25:32 warble1 kernel: LustreError: 212573:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 4 previous similar messages
      Aug 30 12:25:34 warble1 kernel: LustreError: 132075:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1535595633, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff931f1afa8a00/0xddb792bb39655687 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 20 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 132075 timeout: 0 lvb_type: 0
      Aug 30 12:25:34 warble1 kernel: LustreError: dumping log to /tmp/lustre-log.1535595934.132075
      

      I've attached the first lustre-logs and syslog for both days.
      there's nothing extra in the console logs that I can see.
      MDS's are warble, OSS's are arkle/umlaut, and clients are john, farnarkle, and bryan.

      on Lustre servers we have a bunch of patches applied on top of standard 2.10.4; it's pretty much 2.10.5. to be precise, it's Lustre 2.10.4 plus these:
      lu10683-lu11093-checksum-overquota-gerrit32788-1fb85e7e.patch
      lu10988-lfsck2-gerrit32522-21d33c11.patch
      lu11074-mdc-xattr-gerrit32739-dea1cde9.patch
      lu11107-xattr-gerrit32753-c96a8f08.patch
      lu11111-lfsck-gerrit32796-693fe452.patch
      lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
      lu11062-stacktrace-gerrit32972-7232c445.patch

      on clients we have 2.10.4 plus:
      lu11074-gerrit32739_a1ae6014.diff

      cheers,
      robin

      Attachments

        Activity

          laisiyao Lai Siyao added a comment -

           I saw the -116 error in one of them, thanks.

          laisiyao Lai Siyao added a comment -

          yes, I saw the 5 log files, and I'm looking into them.

          scadmin SC Admin added a comment -

           I've attached the syslog that matches the -1 logs that are uploading.

           the 3 most likely -1 logs (listed above) are on the ftp site now in the LU-11301 dir.
           I've also uploaded one extra from before and one extra from after (but the earliest is still uploading, ETA 5 mins),
           so there should be 5 in total.

           let me know if you'd like more from the list below:

          lu-11301.debug_daemon.warble2.1536078317.gz
          lu-11301.debug_daemon.warble2.1536078327.gz
          lu-11301.debug_daemon.warble2.1536078337.gz
          lu-11301.debug_daemon.warble2.1536078347.gz
          lu-11301.debug_daemon.warble2.1536078357.gz
          lu-11301.debug_daemon.warble2.1536078367.gz
          lu-11301.debug_daemon.warble2.1536078377.gz
          lu-11301.debug_daemon.warble2.1536078387.gz
          lu-11301.debug_daemon.warble2.1536078397.gz
          lu-11301.debug_daemon.warble2.1536078407.gz
          lu-11301.debug_daemon.warble2.1536078417.gz
          lu-11301.debug_daemon.warble2.1536078427.gz
          lu-11301.debug_daemon.warble2.1536078437.gz
          lu-11301.debug_daemon.warble2.1536078447.gz
          lu-11301.debug_daemon.warble2.1536078457.gz
          lu-11301.debug_daemon.warble2.1536078467.gz
          lu-11301.debug_daemon.warble2.1536078477.gz
          lu-11301.debug_daemon.warble2.1536078487.gz
          lu-11301.debug_daemon.warble2.1536078497.gz
          lu-11301.debug_daemon.warble2.1536078507.gz
          lu-11301.debug_daemon.warble2.1536078517.gz
          
          laisiyao Lai Siyao added a comment - - edited

          ahh, I just noticed that in LU-9157, and I'll check that.

          pjones Peter Jones added a comment -

          Ah good - I was just rummaging for the instructions

          scadmin SC Admin added a comment -

          ah, I figured out the ftp from an old intel ticket we had. cool. uploading now.

          scadmin SC Admin added a comment -

          also Simon found that LU-9157 has the same -116 error. but I guess you know that already.

          scadmin SC Admin added a comment -

          it won't let me upload. do I have to do it to a specific dir?

          laisiyao Lai Siyao added a comment -

          you can upload to ftp.whamcloud.com.
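
           a minimal sketch of one way to do that, assuming the usual anonymous uploads area and a curl built with FTP support (the uploads/LU-11301 path is an assumption, adjust if support gives a different one):

           # sketch: gzip the three dumps and push them to the Whamcloud FTP server
           gzip lu-11301.debug_daemon.warble2.15360783{77,87,97}
           for f in lu-11301.debug_daemon.warble2.15360783{77,87,97}.gz; do
               curl -T "$f" ftp://ftp.whamcloud.com/uploads/LU-11301/   # uploads under the same filename
           done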

          scadmin SC Admin added a comment -

          I spent all day trying to reproduce it in a VM or on a 1-node lustre setup on a compute node, but (of course) couldn't.

          so I ran it on the real machine as you suggested. all 3 MDTs are on one MDS (warble2).

           the largest possible -1 debug logs still overflow very quickly, so I kept cycling them every 10s (a sketch of the cycling loop is after the log excerpt below).
           I think I caught your -116 cleanly:

          ...
          Sep  5 02:26:17 warble2 kernel: Lustre: debug daemon will attempt to start writing to /mnt/root/root/lu-11301.debug_daemon.warble2.1536078377 (20480000kB max)
          Sep  5 02:26:21 warble2 kernel: debug daemon buffer overflowed; discarding 10% of pages (103 of 1024)
          Sep  5 02:26:21 warble2 kernel: debug daemon buffer overflowed; discarding 10% of pages (103 of 1024)
          Sep  5 02:26:21 warble2 kernel: debug daemon buffer overflowed; discarding 10% of pages (103 of 1024)
          Sep  5 02:26:21 warble2 kernel: debug daemon buffer overflowed; discarding 10% of pages (103 of 1024)
          Sep  5 02:26:21 warble2 kernel: debug daemon buffer overflowed; discarding 10% of pages (103 of 1024)
          Sep  5 02:26:21 warble2 kernel: debug daemon buffer overflowed; discarding 10% of pages (103 of 1024)
          Sep  5 02:26:21 warble2 kernel: debug daemon buffer overflowed; discarding 10% of pages (103 of 1024)
          Sep  5 02:26:21 warble2 kernel: debug daemon buffer overflowed; discarding 10% of pages (103 of 1024)
          Sep  5 02:26:21 warble2 kernel: debug daemon buffer overflowed; discarding 10% of pages (103 of 1024)
          Sep  5 02:26:21 warble2 kernel: debug daemon buffer overflowed; discarding 10% of pages (103 of 1024)
          Sep  5 02:26:27 warble2 kernel: Lustre: shutting down debug daemon thread...
          Sep  5 02:26:27 warble2 kernel: Lustre: debug daemon will attempt to start writing to /mnt/root/root/lu-11301.debug_daemon.warble2.1536078387 (20480000kB max)
          Sep  5 02:26:36 warble2 kernel: LustreError: 20423:0:(llog_cat.c:795:llog_cat_cancel_records()) dagg-MDT0001-osp-MDT0000: fail to cancel 0 of 1 llog-records: rc = -116
          Sep  5 02:26:37 warble2 kernel: Lustre: shutting down debug daemon thread...
          Sep  5 02:26:37 warble2 kernel: Lustre: debug daemon will attempt to start writing to /mnt/root/root/lu-11301.debug_daemon.warble2.1536078397 (20480000kB max)
          ...
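
           for reference, a minimal sketch of such a 10 s cycling loop, assuming standard lctl debug_daemon usage on 2.10.x (the 20000 MB size matches the 20480000 kB max shown in the log; the exact script used isn't in the ticket):

           # sketch: capture full (-1) debug while rotating the dump file every 10 s
           lctl set_param debug=-1                     # enable all debug flags
           while true; do
               ts=$(date +%s)
               # size argument is in MB; 20000 MB is roughly the 20480000 kB max reported above
               lctl debug_daemon start /mnt/root/root/lu-11301.debug_daemon.warble2.${ts} 20000
               sleep 10
               lctl debug_daemon stop                  # flush and close the current dump file
           done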
          

          so I'll upload

          lu-11301.debug_daemon.warble2.1536078377
          lu-11301.debug_daemon.warble2.1536078387
          lu-11301.debug_daemon.warble2.1536078397

           but I'll need an ftp site to upload them to. they are huge. please let me know the write-only location.

          BTW, almost all the clients were evicted on one of the power cycles to get back from this, so it's complete cluster reboot time again. very sad. I hope these are enough logs.

          cheers,
          robin

          scadmin SC Admin added a comment -

          Hi,

          these events are quite traumatic. we have at least 8 new corrupted and unremovable dirs from yesterday and we can't lfsck because of LU-11111.

           I would very much prefer not to damage the production machine any further.
           also, last time we turned on -1 debugging, IIRC a bunch of clients got evicted.
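
           for reference, a minimal sketch of how -1 debugging is typically toggled with lctl (the buffer size and the restore mask below are illustrative defaults, not necessarily this site's settings):

           # sketch: enable full debugging, dump the buffer, then restore a typical default mask
           lctl set_param debug=-1                 # all debug flags on
           lctl set_param debug_mb=1024            # enlarge the in-kernel debug buffer (illustrative size)
           # ... reproduce the hang, then dump and restore:
           lctl dk /tmp/lustre-debug.$(date +%s)   # dump the kernel debug buffer to a file
           lctl set_param debug="ioctl neterror warning error emerg ha config console lfsck"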

          I'll see if I can reproduce it in a VM instead...

          cheers,
          robin


           People

             Assignee: laisiyao Lai Siyao
             Reporter: scadmin SC Admin
             Votes: 0
             Watchers: 4
