
LU-4733: All MDT threads stuck in cfs_waitq_wait

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Labels: None
    • Severity: 3
    • Rank: 13008

    Description

      Hi,

      We are seeing all MDT threads on the MDS stuck in "cfs_waitq_wait". At the same time, we have a lot of RPC requests (15k/s).
      Looking closely at the console and the 'bt' output from crash, we can see that those threads come from qos_statfs_update(), where they block in l_wait_event and never wake up.

      What is strange is that cfs_time_beforeq_64(max_age, obd->obd_osfs_age) should be true.
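
      For context, the wait described above sits in qos_statfs_update() in lov/lov_qos.c. The sketch below is a simplified reconstruction based on the function and field names cited in this ticket (locking, error handling and the statfs RPC handling are omitted, and names not cited here, such as lq_statfs_waitq, are approximate); it is not the exact 2.1.6 source:

      /* Simplified sketch of qos_statfs_update(), not the exact 2.1.6 code. */
      void qos_statfs_update(struct obd_device *obd, __u64 max_age, int wait)
      {
              struct lov_obd *lov = &obd->u.lov;
              struct l_wait_info lwi = { 0 };

              if (cfs_time_beforeq_64(max_age, obd->obd_osfs_age))
                      /* cached statfs data is recent enough: return without waiting */
                      return;

              /* ... otherwise OST_STATFS requests are queued; their completion
               * is expected to refresh obd->obd_osfs_age, clear
               * lq_statfs_in_progress and wake up the waiters ... */

              if (wait)
                      /* the stuck MDT threads in the backtraces are parked here */
                      l_wait_event(lov->lov_qos.lq_statfs_waitq,
                                   qos_statfs_ready(obd, max_age), &lwi);
      }

      If cfs_time_beforeq_64(max_age, obd->obd_osfs_age) is true at the time of the crash, then qos_statfs_ready() should also evaluate to true (assuming it rechecks the same age comparison) and the waiters should have been woken, which is what makes the hang surprising.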

      This issue was hit 4 times during February.

      Please find attached the dmesg and 'foreach bt' outputs.

      Sebastien.

      Attachments

        1. bt-all.txt
          604 kB
        2. dmesg(1).txt
          173 kB
        3. log_2014-03-02-03.tgz
          1.36 MB
        4. log_OSS_2014-03-02_04.log2.gz
          105 kB
        5. lctl-dk.tgz
          322 kB
        6. backtrace-20140407
          1.65 MB

        Activity

          pjones Peter Jones added a comment -

          ok thanks Sebastien


          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Hi,

          The customer has upgraded to Lustre 2.4 and has not been able to reproduce the issue since.
          So this ticket can be closed.

          Thanks,
          Sebastien.

          niu Niu Yawei (Inactive) added a comment -

          bcluster231.log-20140309.gz:2014-03-03 23:18:53 LustreError: 25883:0:(o2iblnd_cb.c:2992:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
          bcluster231.log-20140309.gz:2014-03-03 23:18:53 LustreError: 25883:0:(o2iblnd_cb.c:3055:kiblnd_check_conns()) Timed out RDMA with JO.BOO.AL.IL@o2ib2 (151): c: 8, oc: 0, rc: 8
          bcluster231.log-20140309.gz:2014-03-03 23:18:53 Mar  3 23:18:53 bcluster231 kernel: LustreError: 25883:0:(o2iblnd_cb.c:2992:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
          bcluster231.log-20140309.gz:2014-03-03 23:18:53 Mar  3 23:18:53 bcluster231 kernel: LustreError: 25883:0:(o2iblnd_cb.c:3055:kiblnd_check_conns()) Timed out RDMA with JO.BOO.AL.IL@o2ib2 (151): c: 8, oc: 0, rc: 8
          bcluster231.log-20140309.gz:2014-03-03 23:22:50 Lustre: scratch3-OST01df: haven't heard from client scratch3-MDT0000-mdtlov_UUID (at JO.BOO.AL.IL@o2ib2) in 902 seconds. I think it's dead, and I am evicting it. exp ffff880dc9df1000, cur 1393885370 expire 1393884770 last 1393884468

          The OSS log shows that something was wrong with the network when the problem happened on the MDS. Could you check your network, as Amir suggested in LU-4195?

          patrick.valentin Patrick Valentin (Inactive) added a comment -

          Hi,
          I have attached the stack trace of all processes (2014/04/07) provided by the on-site support team.
          They also mentioned that the JO.BOO.PA.xxx addresses in the lctl-dk_rpc file are clients' (compute nodes) addresses, NOT OSS.

          niu Niu Yawei (Inactive) added a comment -

          > Several things were done:
          > a list of disconnected clients was used to attempt to access the filesystem (df/ls/touch); all attempts failed.
          > umount of the device on the MDS failed; the umount process got stuck in cfs_waitq_wait.

          Could you provide the stack trace of all processes on the MDS while the device is being unmounted?

          00000100:00000400:2.0F:1396898402.999140:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
            req@ffff880f02f9b800 x1463563674626235/t0(0) o400->4b5b918b-511a-1159-a7ee-db25715da035@JO.BOO.PA.BLW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
          00000100:00000400:2.0:1396898402.999154:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
            req@ffff880f02f9bc00 x1463563600312609/t0(0) o400->23a5a711-176e-4fea-3690-e6e8f81b7ba3@JO.BOO.IW.WW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
          00000100:00000400:2.0:1396898402.999165:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
            req@ffff88091d277000 x1463563647245622/t0(0) o400->e081ff72-cc00-6425-c96f-b093a974f1a4@JO.BOO.WB.AW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
          00000100:00000400:2.0:1396898402.999176:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
            req@ffff88091d277400 x1463563646360851/t0(0) o400->822cdf71-bac5-e141-3e77-a03c858db938@JO.BOO.PA.BBA@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1

          Is JO.BOO.PA.BLW@o2ib2 one of the OSTs? It looks like the connections between the MDT and some OSTs are not stable. I tend to think this is the same issue as LU-4335 & LU-4195.

          patrick.valentin Patrick Valentin (Inactive) added a comment -

          Hi,
          two new occurrences of this issue were detected at TGCC on April 7 and 9.

          Several things were done:

          • a list of disconnected clients was used to attempt to access the filesystem (df/ls/touch); all attempts failed.
          • umount of the device on the MDS failed; the umount process got stuck in cfs_waitq_wait.

          The attached file "lctl-dk.tgz" contains the following traces:

          • mantis518_dk.log (collected with an echo -1 > /proc/sys/lnet/debug)
          • mantis518_rpc.log (collected with an echo +other +rpctrace > /proc/sys/lnet/debug)

          Patrick

          niu Niu Yawei (Inactive) added a comment -

          Thank you, Sebastien. The code looks correct to me.

          It's possible that the connection between the MDT and some OST is broken, so the statfs data can't be updated; and since there were many ongoing create requests, lq_statfs_in_progress could be set back to 1 immediately after a statfs completed, so the waiters would go back to waiting in qos_statfs_ready().
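
          The scenario described above can be illustrated with a small self-contained userspace model (pthreads). The names osfs_age, statfs_in_progress and statfs_ready() below are stand-ins for the Lustre fields and helpers mentioned in this ticket; this is not Lustre code, just a sketch of the suspected livelock:

          #include <pthread.h>
          #include <stdbool.h>
          #include <stdio.h>

          static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
          static pthread_cond_t  statfs_waitq = PTHREAD_COND_INITIALIZER;

          static unsigned long long osfs_age = 0;   /* stand-in for obd->obd_osfs_age */
          static bool statfs_in_progress = true;    /* stand-in for lq_statfs_in_progress */
          static int  times_slept = 0;

          /* stand-in for the qos_statfs_ready() predicate: data fresh enough,
           * or no statfs refresh currently in flight */
          static bool statfs_ready(unsigned long long max_age)
          {
                  return osfs_age >= max_age || !statfs_in_progress;
          }

          static void *waiter(void *arg)
          {
                  unsigned long long max_age = *(unsigned long long *)arg;

                  pthread_mutex_lock(&lock);
                  while (!statfs_ready(max_age)) {
                          times_slept++;
                          pthread_cond_wait(&statfs_waitq, &lock);
                  }
                  pthread_mutex_unlock(&lock);
                  printf("waiter released after sleeping %d time(s)\n", times_slept);
                  return NULL;
          }

          int main(void)
          {
                  unsigned long long max_age = 100;  /* waiter wants data at least this new */
                  pthread_t tid;

                  pthread_create(&tid, NULL, waiter, &max_age);

                  /* Model the broken OST connection: every statfs cycle completes
                   * WITHOUT refreshing osfs_age, and a new create request starts the
                   * next cycle right away, so statfs_in_progress is set again before
                   * the waiter can observe it clear. */
                  for (int cycle = 0; cycle < 1000; cycle++) {
                          pthread_mutex_lock(&lock);
                          statfs_in_progress = false;         /* cycle done, wake waiters */
                          pthread_cond_broadcast(&statfs_waitq);
                          statfs_in_progress = true;          /* next cycle re-arms the flag */
                          pthread_mutex_unlock(&lock);
                  }

                  /* Only once an OST finally answers does the cached data become fresh,
                   * letting the waiter leave the loop even if a refresh is in flight. */
                  pthread_mutex_lock(&lock);
                  osfs_age = max_age;
                  pthread_cond_broadcast(&statfs_waitq);
                  pthread_mutex_unlock(&lock);

                  pthread_join(tid, NULL);
                  return 0;
          }

          In this model (build with cc -pthread) the waiter keeps going back to sleep for as long as the cached age never becomes fresh and the in-progress flag is re-armed between the wake-up and the re-check, which would match threads appearing permanently parked in cfs_waitq_wait.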

          When the problem arises again, could you collect the Lustre log on the MDT with D_QOS & D_RPCTRACE enabled? Thank you.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Hi,

          On Whamcloud's FTP server (ftp.whamcloud.com), I have created a directory "LU-4733" containing the file lustre_216_bull_sources.tgz.

          Sebastien.

          niu Niu Yawei (Inactive) added a comment -

          Could you upload the code to the Whamcloud FTP server (just as you upload logs)? Or show me where I can check out the code? Thanks.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          > BTW: Is there any way to check the code running on the servers? (to make sure all the patches were applied properly)

          Sure, but what do you suggest? I mean, how can we proceed to 'check the code'?

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 11
