[LU-4733] All mdt thread stuck in cfs_waitq_wait Created: 07/Mar/14  Updated: 02/Oct/14  Resolved: 02/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Sebastien Buisson (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File backtrace-20140407     Text File bt-all.txt     Text File dmesg(1).txt     File lctl-dk.tgz     File log_2014-03-02-03.tgz     File log_OSS_2014-03-02_04.log2.gz    
Severity: 3
Rank (Obsolete): 13008

 Description   

Hi,

We are seeing all mdt threads on the MDS stuck in "cfs_waitq_wait". At the same time, we have a lot of RPC requests (15k/s).
Looking closely at the console and the 'bt' output from crash, we can see that those threads are coming from qos_statfs_update(), where they block in l_wait_event() and never wake up.

What is strange is that cfs_time_beforeq_64(max_age, obd->obd_osfs_age) should be true.
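
For context, here is a minimal, self-contained restatement of the condition those threads wait for, reconstructed from the fields and helpers quoted in this ticket; the struct and function names below are simplified stand-ins, not the actual Lustre 2.1 source.

/* Illustrative stand-in for the qos_statfs_ready() predicate checked by
 * l_wait_event(): the waiter may proceed once the cached statfs data is
 * recent enough OR no statfs refresh is in flight any more. */
#include <stdbool.h>
#include <stdint.h>

struct statfs_state {
        uint64_t osfs_age;            /* stand-in for obd->obd_osfs_age */
        bool     statfs_in_progress;  /* stand-in for lq_statfs_in_progress */
};

/* stand-in for cfs_time_beforeq_64(a, b): "a is before or equal to b" */
static bool time_beforeq(uint64_t a, uint64_t b)
{
        return a <= b;
}

/* the wait condition: cached data fresh enough, or no refresh in progress */
static bool statfs_ready(const struct statfs_state *s, uint64_t max_age)
{
        return time_beforeq(max_age, s->osfs_age) || !s->statfs_in_progress;
}

int main(void)
{
        /* cached age 100 >= max_age 90: a waiter would be allowed to run */
        struct statfs_state s = { .osfs_age = 100, .statfs_in_progress = true };
        return statfs_ready(&s, 90) ? 0 : 1;
}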

This issue was hit 4 times during February.

Please find attached the dmesg and 'foreach bt' outputs.

Sebastien.



 Comments   
Comment by Bruno Faccini (Inactive) [ 07/Mar/14 ]

Looks like a dup of LU-4335 ...

Comment by Sebastien Buisson (Inactive) [ 07/Mar/14 ]

Hi Bruno,

I was not so sure, as the problem described in LU-4335 seems to be related to network problems, whereas there is no trace of such problems here.

Comment by Peter Jones [ 07/Mar/14 ]

Niu

Could you please confirm whether this is a duplicate of LU-4335 or not?

Thanks

Peter

Comment by Liang Zhen (Inactive) [ 07/Mar/14 ]

This issue should have been fixed on master; please see lov_update_set() in master, where set->set_completes is atomic, which is safe.

void lov_update_set(struct lov_request_set *set,
                    struct lov_request *req, int rc)
{
        req->rq_complete = 1;
        req->rq_rc = rc;

        cfs_atomic_inc(&set->set_completes);
        if (rc == 0)
                cfs_atomic_inc(&set->set_success);

        wake_up(&set->set_waitq);
}

And please see 2.1.x, where set->set_completes is just a plain integer, which is not safe (yes, increments may be lost under a race). This means lov_set_finished() may never return true, and qos_statfs_done() will never be called if this happens (a standalone illustration of such a lost update follows after the code below).

void lov_update_set(struct lov_request_set *set,
                    struct lov_request *req, int rc)
{
        req->rq_complete = 1;
        req->rq_rc = rc;

        set->set_completes++;
        if (rc == 0)
                set->set_success++;

        cfs_waitq_signal(&set->set_waitq);
}
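
To make the race concrete, here is a small standalone demo (userspace C with pthreads, not Lustre code) of how a plain integer increment can lose updates while an atomic one cannot, which is how a racy set_completes could fall short of set_count and keep lov_set_finished() from ever returning true:

/* Two threads each increment both counters N times.  The plain "counter++"
 * is a non-atomic read-modify-write, so increments can be lost and the
 * final value is typically below 2*N; the atomic counter always ends up
 * at exactly 2*N.  Build with: gcc -pthread demo.c (hypothetical filename) */
#include <pthread.h>
#include <stdio.h>

#define N 1000000

static int plain_counter;   /* like the plain int set_completes in 2.1.x  */
static int atomic_counter;  /* like the cfs_atomic_t set_completes on master */

static void *worker(void *arg)
{
        for (int i = 0; i < N; i++) {
                plain_counter++;                          /* racy */
                __sync_fetch_and_add(&atomic_counter, 1); /* safe */
        }
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("plain:  %d (often < %d)\n", plain_counter, 2 * N);
        printf("atomic: %d (always %d)\n", atomic_counter, 2 * N);
        return 0;
}
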
Comment by Niu Yawei (Inactive) [ 09/Mar/14 ]

Liang, 2.1 has only one ptlrpcd thread to process the statfs requests, so I think the ++ operation should be safe. Perhaps statfs can't finish on a certain OST?

Sebastien, were there any abnormal messages on the OSTs? Also, I see the following messages on the MDT:

[655444.910448] Uhhuh. NMI received for unknown reason 3d on CPU 0.
[655444.916528] Do you have a strange power saving mode enabled?
[655444.922337] Kernel panic - not syncing: NMI: Not continuing
[655444.928060] Pid: 0, comm: swapper Not tainted 2.6.32-220.23.1.bl6.Bull.28.10.x86_64 #1
[655444.936173] Call Trace:
[655444.938768]  <NMI>  [<ffffffff814851a0>] ? panic+0x78/0x143
[655444.944518]  [<ffffffff81488ff0>] ? do_nmi+0x240/0x2b0
[655444.949801]  [<ffffffff81488830>] ? nmi+0x20/0x30
[655444.954655]  [<ffffffff812a9831>] ? intel_idle+0xb1/0x170
[655444.960200]  <<EOE>>  [<ffffffff813aa267>] ? cpuidle_idle_call+0xa7/0x140
[655444.967165]  [<ffffffff81001e06>] ? cpu_idle+0xb6/0x110
[655444.972536]  [<ffffffff8146ff2a>] ? rest_init+0x7a/0x80
[655444.977909]  [<ffffffff81adff8c>] ? start_kernel+0x481/0x48d
[655444.983721]  [<ffffffff81adf31e>] ? x86_64_start_reservations+0x125/0x129
[655444.990651]  [<ffffffff81adf41c>] ? x86_64_start_kernel+0xfa/0x109

This looks like a hardware error; could it cause a system hang under load?

Comment by Bruno Faccini (Inactive) [ 09/Mar/14 ]

Niu, my 2 cents: this comes from a forced crash-dump attempt or HA fencing via NMI ...

Comment by Alexandre Louvet [ 10/Mar/14 ]

The Lustre version running at the customer site is based on Lustre 2.1.6 plus the following list of patches:

  • LU-3240 check for all attributes validity in revalidate
  • LU-3071 call cfs_cap_raise() before journal start
  • LU-3929 deny LWP connection for system upgrade
  • LU-3941 only modify extent of sub-lock and preserve flags
  • LU-4298 do not create file with no striping info
  • LU-3848 fix memory corruption
  • LU-3311 mdt: handle -ENOENT && !MDS_OPEN_CREAT in reint open
  • LU-4363 open: backport MDS_OPEN_BY_FID flag
  • LU-2665 refine LU-2665 patch for POSIX compliance
  • LU-1306 Protect l_flags with locking to prevent race on signal reception
  • LU-2943 LBUG in mdt_reconstruct_open() (patch set #3)
  • ORNL-22 general ptlrpcd threads pool support
  • LU-1144 implement a NUMA aware ptlrpcd binding policy
  • LU-1110 MDS Oops in osd_xattr_get() during file open by FID
  • LU-2613 to much unreclaimable slab space
  • LU-2624 ptlrpc fix thread stop
  • LU-2683 client deadlock in cl_lock_mutex_get
  • LU-2665 keep resend flocks

so:

  • multi ptlrpcd is built in
  • while I haven't identified which patch brought in the safe lov_update_set() routine, we do have it
  • the backtrace pointed out by Niu is the result of a hardware NMI that was deliberately initiated to take the dump (so this is a managed hardware event, not a consequence of the problem nor an unexpected error)
Comment by Niu Yawei (Inactive) [ 11/Mar/14 ]

Alexandre, any error messages on OSTs?

Comment by Patrick Valentin (Inactive) [ 11/Mar/14 ]

Niu,
To complete the previous note from Alexandre, multi ptlrpcd and the safe lov_update_set() routine come from the ORNL-22 patch, which is a Bull backport into Lustre 2.1.6 of the b2_2 branch ORNL-22 patch (commit 71350744).

I have attached the logs provided by the support team, corresponding to the two last crashes, which occurred on 2 and 3 of March.

Below is a copy of the analysis they have done:

crash> struct lov_qos.lq_statfs_in_progress 0xffff881eff05a6b0
  lq_statfs_in_progress = 1

crash> struct lov_statfs_data 0xffff881e3b9a2b80
struct lov_statfs_data {
  lsd_oi = {
    oi_policy = {
    .....
    oi_flags = 5, <--- OBD_STATFS_PTLRPCD | OBD_STATFS_NODELAY

crash> p *ptlrpcds
$1 = {
  pd_size = 6352,
  pd_index = 0,
  pd_nthreads = 32,
  pd_thread_rcv = {
    pc_flags = 258,
    pc_lock = {
      raw_lock = {
        slock = 0
      }
    }
.....
crash> struct ptlrpc_request_set 0xffff881fa6c702c0
struct ptlrpc_request_set {
  set_refcount = {
    counter = 1
  },
  set_new_count = {
    counter = 0
  },
  set_remaining = {
    counter = 0
  }
......

It seems the RPCs have been sent but no responses were received.

In cb_statfs_update, the conditions are not satisfied:

        lov_update_set(lovreq->rq_rqset, lovreq, rc);
        if (rc)
                GOTO(out, rc);
out:
        if (lovreq->rq_rqset->set_oi->oi_flags & OBD_STATFS_PTLRPCD &&
            lov_set_finished(lovreq->rq_rqset, 0)) {
               lov_statfs_interpret(NULL, lovreq->rq_rqset,
                                    lovreq->rq_rqset->set_success !=
                                                  lovreq->rq_rqset->set_count);
               qos_statfs_done(lov);
Comment by Sebastien Buisson (Inactive) [ 20/Mar/14 ]

Hi,

Please find attached the error logs on the OSS.
We cannot see any errors at the time the problem occurred (first occurrence on March 2nd at 09:45am and second occurrence on March 3rd at 10:04pm).

Sebastien.

Comment by Niu Yawei (Inactive) [ 24/Mar/14 ]

I see some OSTs reported slow read in the log:

bcluster210.log-20140309.gz:2014-03-02 13:09:40 Lustre: 24804:0:(filter_io.c:475:filter_preprw_read()) scratch3-OST002c: slow start_page_read 30s
bcluster210.log-20140309.gz:2014-03-02 13:09:40 Lustre: 24804:0:(filter_io.c:475:filter_preprw_read()) Skipped 30 previous similar messages

Maybe the OST is too busy to reply to the statfs in time?

Will the MDT get out of the stuck state after the statfs RPCs time out and the burst of create requests from the clients levels off?

Comment by Sebastien Buisson (Inactive) [ 24/Mar/14 ]

I do not have answers to your questions, but given their time of occurrence, the 'slow' messages seem to be completely independent of the problem where all mdt threads are stuck in "cfs_waitq_wait", right?

Comment by Niu Yawei (Inactive) [ 25/Mar/14 ]

I do not have answers to your questions, but given their time of occurrence, the 'slow' messages seem to be completely independent of the problem where all mdt threads are stuck in "cfs_waitq_wait", right?

I don't know for sure.

I just want to figure out whether it is a deadlock (or missing wakeup) problem: if the threads recovered when the system load got lower, then it is unlikely to be a deadlock/missing-wakeup problem.

BTW: Is there any way to check the code running on the servers? (to make sure all the patches were applied properly)

Comment by Sebastien Buisson (Inactive) [ 25/Mar/14 ]

> BTW: Is there any way to check the code running on the servers? (to make sure all the patches were applied properly)

Sure, but what do you suggest? I mean, how can we proceed to 'check the code'?

Comment by Niu Yawei (Inactive) [ 25/Mar/14 ]

Could you upload the code to the Whamcloud FTP server (just as you upload logs)? Or show me where I can check out the code? Thanks.

Comment by Sebastien Buisson (Inactive) [ 27/Mar/14 ]

Hi,

On Whamcloud's FTP (ftp.whamcloud.com), I have created a directory "LU-4733" containing a file lustre_216_bull_sources.tgz.

Sebastien.

Comment by Niu Yawei (Inactive) [ 02/Apr/14 ]

Thank you, Sebastien. The code looks correct to me.

It's possible that the connection between the MDT and some OST is broken, so the statfs can't be updated; and since there were many ongoing create requests, lq_statfs_in_progress could be changed back to 1 immediately after a statfs completes, so the waiters go back to waiting in qos_statfs_ready().
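
A minimal standalone simulation of that re-waiting behaviour (userspace C with pthreads; the flags are stand-ins for the real Lustre fields): the waiter re-checks its condition after every wakeup, as l_wait_event() does, so if a new statfs round always starts before the cached data becomes fresh, the waiter simply goes back to sleep.

/* Standalone sketch, not Lustre code: the waiter blocks until "cached data
 * is fresh OR no statfs round is in progress", re-checking the predicate
 * after every wakeup.  If every completed round is immediately followed by
 * a new one, the predicate is always false when the waiter runs, so it
 * keeps going back to sleep.  Build with: gcc -pthread sim.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  waitq = PTHREAD_COND_INITIALIZER;
static bool data_fresh;          /* stand-in for obd_osfs_age being recent  */
static bool in_progress = true;  /* stand-in for lq_statfs_in_progress == 1 */

static void *waiter(void *arg)
{
        pthread_mutex_lock(&lock);
        while (!(data_fresh || !in_progress)) {  /* qos_statfs_ready()-like check */
                printf("waiter: condition still false, sleeping again\n");
                pthread_cond_wait(&waitq, &lock);
        }
        pthread_mutex_unlock(&lock);
        printf("waiter: condition true, running\n");
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, waiter, NULL);

        /* Complete a statfs round and immediately start the next one: the
         * waiter is signalled each time but always finds the predicate
         * false again, so it stays blocked. */
        for (int i = 0; i < 5; i++) {
                sleep(1);
                pthread_mutex_lock(&lock);
                in_progress = false;   /* round done ...                   */
                in_progress = true;    /* ... but a new one starts at once */
                pthread_cond_signal(&waitq);
                pthread_mutex_unlock(&lock);
        }

        /* Finally mark the cached data fresh so the demo terminates. */
        pthread_mutex_lock(&lock);
        data_fresh = true;
        pthread_cond_signal(&waitq);
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        return 0;
}

If something like this is happening on the MDS, the threads would stay parked in cfs_waitq_wait() even though there is no deadlock in the strict sense.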

When the problem arises again, could you try to collect the Lustre log on the MDT with D_QOS & D_RPCTRACE enabled? Thank you.

Comment by Patrick Valentin (Inactive) [ 09/Apr/14 ]

Hi,
Two new occurrences of this issue were detected at TGCC on April 7 and 9.

Several things were done:

  • attempts were made from the clients listed as disconnected to access the filesystem with df/ls/touch; all attempts failed
  • an umount of the device on the MDS failed; the umount process was stuck in cfs_waitq_wait

The attached file "lctl-dk.tgz" contains the following traces:

  • mantis518_dk.log (collected with an echo -1 > /proc/sys/lnet/debug)
  • mantis518_rpc.log (collected with an echo +other +rpctrace > /proc/sys/lnet/debug)

Patrick

Comment by Niu Yawei (Inactive) [ 21/Apr/14 ]

Several things were done:
attempts were made from the clients listed as disconnected to access the filesystem with df/ls/touch; all attempts failed
an umount of the device on the MDS failed; the umount process was stuck in cfs_waitq_wait

Could you provide the stack traces of all processes on the MDS when umounting the device? Also, I see messages like the following:

00000100:00000400:2.0F:1396898402.999140:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
  req@ffff880f02f9b800 x1463563674626235/t0(0) o400->4b5b918b-511a-1159-a7ee-db25715da035@JO.BOO.PA.BLW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
00000100:00000400:2.0:1396898402.999154:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
  req@ffff880f02f9bc00 x1463563600312609/t0(0) o400->23a5a711-176e-4fea-3690-e6e8f81b7ba3@JO.BOO.IW.WW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
00000100:00000400:2.0:1396898402.999165:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
  req@ffff88091d277000 x1463563647245622/t0(0) o400->e081ff72-cc00-6425-c96f-b093a974f1a4@JO.BOO.WB.AW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
00000100:00000400:2.0:1396898402.999176:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
  req@ffff88091d277400 x1463563646360851/t0(0) o400->822cdf71-bac5-e141-3e77-a03c858db938@JO.BOO.PA.BBA@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1

Is JO.BOO.PA.BLW@o2ib2 one of the OSTs? It looks like the connections between the MDT and some OSTs are not stable. I tend to think this is the same issue as LU-4335 & LU-4195.

Comment by Patrick Valentin (Inactive) [ 25/Apr/14 ]

Hi,
I have attached the stack traces of all processes (2014/04/07) provided by the on-site support team.
They also mentioned that the JO.BOO.PA.xxx addresses in the lctl-dk_rpc file are clients' (compute node) addresses, NOT OSS addresses.

Comment by Niu Yawei (Inactive) [ 17/Jun/14 ]
bcluster231.log-20140309.gz:2014-03-03 23:18:53 LustreError: 25883:0:(o2iblnd_cb.c:2992:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
bcluster231.log-20140309.gz:2014-03-03 23:18:53 LustreError: 25883:0:(o2iblnd_cb.c:3055:kiblnd_check_conns()) Timed out RDMA with JO.BOO.AL.IL@o2ib2 (151): c: 8, oc: 0, rc: 8
bcluster231.log-20140309.gz:2014-03-03 23:18:53 Mar  3 23:18:53 bcluster231 kernel: LustreError: 25883:0:(o2iblnd_cb.c:2992:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
bcluster231.log-20140309.gz:2014-03-03 23:18:53 Mar  3 23:18:53 bcluster231 kernel: LustreError: 25883:0:(o2iblnd_cb.c:3055:kiblnd_check_conns()) Timed out RDMA with JO.BOO.AL.IL@o2ib2 (151): c: 8, oc: 0, rc: 8
bcluster231.log-20140309.gz:2014-03-03 23:22:50 Lustre: scratch3-OST01df: haven't heard from client scratch3-MDT0000-mdtlov_UUID (at JO.BOO.AL.IL@o2ib2) in 902 seconds. I think it's dead, and I am evicting it. exp ffff880dc9df1000, cur 1393885370 expire 1393884770 last 1393884468

The OSS log showed that something was wrong with the network when the problem happened on the MDS; could you check your network as Amir suggested in LU-4195?

Comment by Sebastien Buisson (Inactive) [ 02/Oct/14 ]

Hi,

The customer has upgraded to Lustre 2.4 and since then has not been able to reproduce the issue.
So this ticket can be closed.

Thanks,
Sebastien.

Comment by Peter Jones [ 02/Oct/14 ]

ok thanks Sebastien
