[LU-6269] Unable to mount /nobackupp8 Created: 20/Feb/15  Updated: 17/Sep/16  Resolved: 22/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.7.0, Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Herbert Yeung Assignee: Emoly Liu
Resolution: Duplicate Votes: 0
Labels: None
Environment:

nbp8-mds running Lustre 2.4.3, CentOS 6.5


Attachments: File nbp8-mds1.syslog     File service200.LBUG.gz     File service200.log.gz     File service200.ps.gz    
Issue Links:
Related
is related to LU-5297 osp_sync_thread can't handle invalid ... Resolved
is related to LU-5188 nbp6-OST002f-osc-MDT0000: invalid set... Resolved
Severity: 1
Rank (Obsolete): 17570

 Description   

There appear to be some IB-related congestion problems that are destabilizing our Lustre servers. The various OSS and MDS systems will stop communicating over IB, or the metadata traffic on the file system slows down to where it can take minutes for an ls -l to return information. The problems come in waves: everything can be okay for a few hours, then we get a slow file system; after an hour or two it can recover and be fine for a few hours, then the same or a different Lustre file system has problems. This has been happening for a few days. Last night /nobackupp8 went down. When the system was brought back up and we mounted the MDT, we got a lot of traffic to the MDS, which locked the system up. The file system has been down since last night. We'd like to talk to a Lustre engineer about this problem.

service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 36 seconds
service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (186): c: 0, oc: 0, rc: 8
service221: Feb 20 10:24:11 kern:err:nbp8-oss20 LNetError: 9776:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 31 seconds
service221: Feb 20 10:24:11 kern:err:nbp8-oss20 LNetError: 9776:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (181): c: 0, oc: 0, rc: 8
service216: Feb 20 09:49:41 kern:err:nbp8-oss15 LNetError: 9816:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 43 seconds
service216: Feb 20 09:49:41 kern:err:nbp8-oss15 LNetError: 9816:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (193): c: 0, oc: 0, rc: 8
service203: Feb 20 09:49:43 kern:err:nbp8-oss2 LNetError: 9880:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 21 seconds
service203: Feb 20 09:49:43 kern:err:nbp8-oss2 LNetError: 9880:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.151.27.60@o2ib (171): c: 0, oc: 0, rc: 8
service226: Feb 20 10:33:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457198/real 1424457198] req@ffff881f62464800 x1493641412487452/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457303 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service226: Feb 20 10:43:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457798/real 1424457798] req@ffff881f42990000 x1493641412488128/t0(0) o38->nbp8-MDT0000-lwp-OST0080@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424457903 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service226: Feb 20 10:53:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458398/real 1424458398] req@ffff881f4758f400 x1493641412488828/t0(0) o38->nbp8-MDT0000-lwp-OST009a@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424458503 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service226: Feb 20 11:03:18 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458998/real 1424458998] req@ffff881f64bad000 x1493641412489500/t0(0) o38->nbp8-MDT0000-lwp-OST009a@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424459103 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service226: Feb 20 09:51:43 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424454598/real 0] req@ffff881f46825400 x1493641412486068/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424454703 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
service226: Feb 20 10:02:08 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424455223/real 0] req@ffff881f405d9400 x1493641412486328/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455328 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
service226: Feb 20 10:12:33 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424455848/real 0] req@ffff881f48c69c00 x1493641412486588/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455953 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
service226: Feb 20 10:22:58 kern:warning:nbp8-oss25 Lustre: 7920:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424456473/real 0] req@ffff881f4568e000 x1493641412486848/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424456578 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
service202: Feb 20 10:41:39 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424457699/real 1424457699] req@ffff880ff2c3d400 x1493641414590864/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457804 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service202: Feb 20 10:52:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458324/real 1424458324] req@ffff881028c58400 x1493641414591540/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424458429 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service202: Feb 20 11:02:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424458924/real 1424458924] req@ffff880e3c2cb800 x1493641414592216/t0(0) o38->nbp8-MDT0000-lwp-OST011e@10.151.27.60@o2ib:12/10 lens 400/544 e 0 to 1 dl 1424459029 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service202: Feb 20 10:31:19 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1424456974/real 0] req@ffff880e669c6000 x1493641414590140/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424457079 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
service202: Feb 20 09:49:39 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424454474/real 1424454474] req@ffff880ff480b000 x1493641414589100/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424454579 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service202: Feb 20 10:00:04 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424455099/real 1424455099] req@ffff880ff4584c00 x1493641414589360/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455204 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service202: Feb 20 10:10:29 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424455724/real 1424455724] req@ffff880ff310c000 x1493641414589620/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424455829 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
service202: Feb 20 10:20:54 kern:warning:nbp8-oss1 Lustre: 8544:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424456349/real 1424456349] req@ffff880db2b0d000 x1493641414589880/t0(0) o250->MGC10.151.27.60@o2ib@10.151.27.60@o2ib:26/25 lens 400/544 e 0 to 1 dl 1424456454 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1



 Comments   
Comment by Peter Jones [ 20/Feb/15 ]

Herbert

To be clear - are all your filesystems affected or just this one?

Peter

Comment by Peter Jones [ 20/Feb/15 ]

Herbert

An LNET engineer is examining the code in light of the messages you are seeing. Do you have any particular questions at this time? Do you want to pose these via the ticket or set up a concall?

Peter

Comment by Amir Shehata (Inactive) [ 20/Feb/15 ]

In the syslog, I'm seeing:

Feb 20 07:50:21 kern:err:nbp8-mds1 LNetError: 2497:0:(o2iblnd_cb.c:2267:kiblnd_passive_connect()) Can't accept 10.151.40.87@o2ib on NA (ib1:0:10.151.27.60): bad dst nid 10.151.27.60@o2ib

This occurs when a node receives an RDMA_CM_EVENT_CONNECT_REQUEST, but the destination nid in the message doesn't correspond to any local NIs.

From the errors you pasted in the description, all of them seem to revolve around nid 10.151.27.60@o2ib.

Are there any issues around that node? Is it configured properly?

It seems like 10.151.40.87@o2ib is attempting to establish a connection with 10.151.27.60@o2ib (which looks like the MGC in your setup), but that node doesn't have a network interface configured with that NID.
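For reference, here is a heavily simplified sketch of the kind of validation behind that message. This is not the actual o2iblnd code; the reduced types, field names, and helper below are hypothetical. The idea is that the connect request carries the NID the peer believes it reached, and the receiver refuses the connection when that NID is not one of its locally configured NIDs.

#include <stdio.h>

typedef unsigned long long nid_t;	/* stand-in for an LNet NID */

struct connect_msg {
	nid_t srcnid;			/* peer that is connecting        */
	nid_t dstnid;			/* NID the peer thinks it reached */
};

/* Accept only if dstnid matches one of the NIDs configured on this node. */
static int check_dst_nid(const struct connect_msg *msg,
			 const nid_t *local_nids, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (local_nids[i] == msg->dstnid)
			return 0;	/* ok to accept */

	/* this is the case that logs "Can't accept ... bad dst nid ..." */
	return -1;
}

int main(void)
{
	nid_t local[] = { 0x0a971b3cULL };		/* hypothetical local NID */
	struct connect_msg m = { 0x0a972857ULL, 0x0a97003cULL };

	/* dstnid is not configured locally, so the request is refused */
	printf("%s\n", check_dst_nid(&m, local, 1) ? "reject" : "accept");
	return 0;
}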

Comment by Mahmoud Hanafi [ 21/Feb/15 ]

More info... It looks like the MDS is stuck running in osp_sync_process_queues, and it blocks all other processes. It seems to be responding to clients, but all console and host access is hung. I am uploading a process trace and a list of running processes.

Comment by Oleg Drokin [ 21/Feb/15 ]

Looking at the newly uploaded log, it seems the problem is LU-5188.
I know you have this patch in your tree, but it appears to be incomplete. Here's the patch from your tree:

commit f55937464476f48c6e24841f724fcedc463962c5
Author: Oleg Drokin <oleg.drokin@intel.com>
Date:   Thu Jun 12 19:13:25 2014 -0400

    LU-5188 osp: Correctly check for invalid setattr record
    
    Patch for LU-4345 (commit 80f90fcde73e2faff8b7b0ffc7c19bc52982e027 )
    has a correct comment about lsr_valid member being either 0 or
    having UID and GID fields set, but the check has a typo causing
    it to check for lsr_valid to be both 0 and have the bits set which
    is impossible.
    
    Change-Id: I32e6f261b4707391d6e96f8434aae7bf8f55bf8e
    Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>

diff --git a/lustre/osp/osp_sync.c b/lustre/osp/osp_sync.c
index 595c7c0..328cf50 100644
--- a/lustre/osp/osp_sync.c
+++ b/lustre/osp/osp_sync.c
@@ -482,7 +482,7 @@ static int osp_sync_new_setattr_job(struct osp_device *d,
 	LASSERT(h->lrh_type == MDS_SETATTR64_REC);
 
 	/* lsr_valid can only be 0 or LA_UID/GID set */
-	if (!rec->lsr_valid && !(rec->lsr_valid & ~(LA_UID | LA_GID))) {
+	if (rec->lsr_valid && !(rec->lsr_valid & ~(LA_UID | LA_GID))) {
 		CERROR("%s: invalid setattr record, lsr_valid:"LPU64"\n",
 		       d->opd_obd->obd_name, rec->lsr_valid);
 		RETURN(-EINVAL);

And the proper patch from our tree is:

commit 79dd530f1352e6b57fcb870a1e0f2c2a05a0648d
Author: Oleg Drokin <oleg.drokin@intel.com>
Date:   Thu Jun 12 19:13:25 2014 -0400

    LU-5188 osp: Correctly check for invalid setattr record
    
    Patch for LU-4345 (commit 80f90fcde73e2faff8b7b0ffc7c19bc52982e027 )
    has a correct comment about lsr_valid member being either 0 or
    having UID and GID fields set, but the check has a typo causing
    it to check for lsr_valid to be both 0 and have the bits set which
    is impossible.
    
    The osp_sync_new_setattr_job() should return 0 for invalid record,
    so that sync thread can continue processing on other records.
    
    Change-Id: I32e6f261b4707391d6e96f8434aae7bf8f55bf8e
    Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
    Reviewed-on: http://review.whamcloud.com/10706
    Tested-by: Jenkins
    Reviewed-by: John L. Hammond <john.hammond@intel.com>
    Reviewed-by: Mike Pershin <mike.pershin@intel.com>
    Tested-by: Maloo <hpdd-maloo@intel.com>
    Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>

diff --git a/lustre/osp/osp_sync.c b/lustre/osp/osp_sync.c
index 34a0e2c..1dd8e85 100644
--- a/lustre/osp/osp_sync.c
+++ b/lustre/osp/osp_sync.c
@@ -484,10 +484,12 @@ static int osp_sync_new_setattr_job(struct osp_device *d,
 	LASSERT(h->lrh_type == MDS_SETATTR64_REC);
 
 	/* lsr_valid can only be 0 or LA_UID/GID set */
-	if (!rec->lsr_valid && !(rec->lsr_valid & ~(LA_UID | LA_GID))) {
+	if ((rec->lsr_valid & ~(LA_UID | LA_GID)) != 0) {
 		CERROR("%s: invalid setattr record, lsr_valid:"LPU64"\n",
 		       d->opd_obd->obd_name, rec->lsr_valid);
-		RETURN(-EINVAL);
+		/* return 0 so that sync thread can continue processing
+		 * other records. */
+		RETURN(0);
 	}
 
 	req = osp_sync_new_job(d, llh, h, OST_SETATTR, &RQF_OST_SETATTR);

So if you update the RETURN part of it, the problem should go away. Did you pick the wrong version of the patch, perhaps?
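As an aside, here is a minimal standalone sketch comparing what the two conditions evaluate to for a few sample lsr_valid values. The LA_* bit values are assumptions for illustration only; the real definitions live in the Lustre headers.

#include <stdio.h>

/* Assumed bit values for illustration only; see the Lustre headers for the
 * real LA_* definitions. */
#define LA_CTIME 0x04ULL
#define LA_UID   0x20ULL
#define LA_GID   0x40ULL

int main(void)
{
	unsigned long long samples[] = {
		0,			/* valid: nothing set             */
		LA_UID | LA_GID,	/* valid: plain chown record      */
		LA_GID | LA_CTIME,	/* a stray CTIME bit from a chgrp */
	};
	int i;

	for (i = 0; i < 3; i++) {
		unsigned long long v = samples[i];
		/* pre-LU-5188 condition: only ever true for v == 0, so it
		 * never flags a record carrying unexpected bits */
		int old_check = !v && !(v & ~(LA_UID | LA_GID));
		/* corrected condition from the master patch: true whenever
		 * any bit outside LA_UID/LA_GID is set */
		int new_check = (v & ~(LA_UID | LA_GID)) != 0;

		printf("lsr_valid=%llu old_check=%d new_check=%d\n",
		       v, old_check, new_check);
	}
	return 0;
}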

You might also want to add LU-4743 to your tree; it is of a somewhat similar nature.

Comment by Oleg Drokin [ 21/Feb/15 ]

Also, looking through LU-5188, you also need http://review.whamcloud.com/#/c/10828/4 as a follow-on to that fix; we have it in our tree too.

Comment by Jay Lan (Inactive) [ 21/Feb/15 ]

Our LU-5188 patch was an early version. We tested with that patch and it seemed to have solved our problem.

Comment by Jay Lan (Inactive) [ 21/Feb/15 ]

Could you take a look at our code at https://github.com/jlan/lustre-nas, 'nas-2.4.3' branch, Oleg? Review http://review.whamcloud.com/#/c/10828/4 cannot be applied cleanly.

In our code in osp_sync_process_record(), the default case of "switch (rec->lrh_type)" is as below:

	default:
		CERROR("unknown record type: %x\n", rec->lrh_type);
		rc = -EINVAL;
		break;

It sets rc and does a break. Your patch does not set rc and does not have a break.

Comment by Jay Lan (Inactive) [ 21/Feb/15 ]

Ah, if I picked up LU-4743 before LU-5188, then there would be no conflict.

Comment by Peter Jones [ 21/Feb/15 ]

Oleg likely will not be online for a few hours, but he was talking about LU-4743 when he initially started looking at the info from Mahmoud. LU-4743 was landed for 2.5.2 and so has been widely exposed.

Comment by Jay Lan (Inactive) [ 21/Feb/15 ]

Thanks, Peter. I have resolved the conflict and built a set of new server rpms.

Comment by Peter Jones [ 21/Feb/15 ]

Great! Please continue to keep us informed

Comment by Mahmoud Hanafi [ 21/Feb/15 ]

Got an LBUG with the new patches in play:

nbp8-mds1 login: Lustre: 4864:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424540000/real 1424540001]  req@ffff883fee2e5000 x1493738361717656/t0(0) o8->nbp8-OST005d-osc-MDT0000@10.151.27.77@o2ib:28/4 lens 400/544 e 0 to 1 dl 1424540105 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 4864:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424540000/real 1424540001]  req@ffff883fee0da400 x1493738361717612/t0(0) o8->nbp8-OST0076-osc-MDT0000@10.151.27.76@o2ib:28/4 lens 400/544 e 0 to 1 dl 1424540105 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 4864:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424540001/real 1424540001]  req@ffff883fec2ee800 x1493738361718180/t0(0) o8->nbp8-OST0103-osc-MDT0000@10.151.27.87@o2ib:28/4 lens 400/544 e 0 to 1 dl 1424540106 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 4864:0:(client.c:1878:ptlrpc_expire_one_request()) Skipped 179 previous similar messages
LustreError: 5529:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp8-OST001b-osc-MDT0000: invalid setattr record, lsr_valid:68
LustreError: 5511:0:(osp_sync.c:888:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 1 changes, 1 in progress, 1 in flight
LustreError: 5511:0:(osp_sync.c:888:osp_sync_thread()) LBUG
Pid: 5511, comm: osp-syn-0
 31 out of 32 cpus in kdb, waiting for the rest, timeout in 10 second(s)

Call Trace:
 [<ffffffffa0489895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0489e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0f6c07a>] osp_sync_thread+0x76a/0x7e0 [osp]
 [<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 5511, comm: osp-syn-0 Not tainted 2.6.32-358.23.2.el6.20140115.x86_64.lustre243 #1
Call Trace:
 [<ffffffff8153ef9e>] ? panic+0xa7/0x190
 [<ffffffffa0489eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0f6c07a>] ? osp_sync_thread+0x76a/0x7e0 [osp]
 [<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Comment by Mahmoud Hanafi [ 21/Feb/15 ]

Attaching full process trace for LBUG.

see service200.LBUG.gz

Comment by Oleg Drokin [ 21/Feb/15 ]

Hm, so the sync thread exits prematurely. That's certainly unexpected.

I'll try to trigger this on my test system with some extra debug to better understand how it exits like that all of a sudden.

Comment by Oleg Drokin [ 21/Feb/15 ]

OK, so I see this problem also exists in master.

I am still digging around to see whether this is the proper solution or whether we want to change something else (hence no patch in Gerrit), but meanwhile this in-line patch should help you out; it certainly does for me.

Please give it a try (against your tree; only needed on the MDS):

diff --git a/lustre/osp/osp_sync.c b/lustre/osp/osp_sync.c
index 76eec62..755071c 100644
--- a/lustre/osp/osp_sync.c
+++ b/lustre/osp/osp_sync.c
@@ -485,6 +485,9 @@ static int osp_sync_new_setattr_job(struct osp_device *d,
        if ((rec->lsr_valid & ~(LA_UID | LA_GID)) != 0) {
                CERROR("%s: invalid setattr record, lsr_valid:"LPU64"\n",
                       d->opd_obd->obd_name, rec->lsr_valid);
+               spin_lock(&d->opd_syn_lock);
+               d->opd_syn_changes--;
+               spin_unlock(&d->opd_syn_lock);
                /* return 0 so that sync thread can continue processing
                 * other records. */
                RETURN(0);
Comment by Mahmoud Hanafi [ 22/Feb/15 ]

no luck....

------------------------------------------------------------------------

nbp8-mds1 login: LustreError: 5418:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp8-OST0000-osc-MDT0000: invalid setattr record, lsr_valid:68
LustreError: 5416:0:(osp_sync.c:891:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 18446744073709551615 changes, 0 in progress, 0 in flight
LustreError: 5416:0:(osp_sync.c:891:osp_sync_thread()) LBUG
  31 out of 32 cpus in kdb, waiting for the rest, timeout in 10 second(s)
Pid: 5416, comm: osp-syn-78

Call Trace:
 [<ffffffffa0489895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0489e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0f6c08a>] osp_sync_thread+0x76a/0x7e0 [osp]
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

LustreError: 5430:0:(osp_sync.c:891:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 18446744073709551615 changes, 0 in progress, 0 in flight
LustreError: 5430:0:(osp_sync.c:891:osp_sync_thread()) LBUG
Pid: 5430, comm: osp-syn-156

Call Trace:
 [<ffffffffa0489895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0489e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0f6c08a>] osp_sync_thread+0x76a/0x7e0 [osp]
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

LustreError: 5426:0:(osp_sync.c:891:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 18446744073709551615 changes, 0 in progress, 0 in flight
LustreError: 5426:0:(osp_sync.c:891:osp_sync_thread()) LBUG
Pid: 5426, comm: osp-syn-234

Call Trace:
 [<ffffffffa0489895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0489e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0f6c08a>] osp_sync_thread+0x76a/0x7e0 [osp]
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 5416, comm: osp-syn-78 Tainted: G        W  ---------------    2.6.32-358.23.2.el6.20140115.x86_64.lustre243 #1
Call Trace:
 [<ffffffff8153ef9e>] ? panic+0xa7/0x190
 [<ffffffffa0489eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0f6c08a>] ? osp_sync_thread+0x76a/0x7e0 [osp]
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
.All cpus are now in kdb

Entering kdb (current=0xffff88200dcea040, pid 5416) on processor 10 Oops: (null)
due to oops @ 0x0
kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
[10]kdb> 
Comment by Oleg Drokin [ 22/Feb/15 ]

Hm, so there's a changes-counter underflow in this case, then, that I don't see in my testing.

Thanks, I'll keep digging.
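For reference, 18446744073709551615 is 2^64 - 1: the counter wrapped around after being decremented below zero. A trivial sketch of the effect, assuming the counter behaves as a 64-bit unsigned value (which is what the printed number implies):

#include <stdio.h>

int main(void)
{
	/* stand-in for opd_syn_changes; assumed 64-bit unsigned here */
	unsigned long long changes = 0;

	/* decrementing an already-zero unsigned counter wraps around */
	changes--;

	/* prints 18446744073709551615, the value seen in the ASSERTION */
	printf("%llu\n", changes);
	return 0;
}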

Comment by Mahmoud Hanafi [ 22/Feb/15 ]

We reverted to running on 2.4.3-7nasS and the MDS came up. Will run on this until we get a more complete patch.

Comment by Oleg Drokin [ 22/Feb/15 ]

Hm, so it came up and there was no more chatter in the logs about invalid records? Perhaps my patch was just successful enough to clear them out, so your reversion was okay?
Did the crash happen soon after/immediately after the MDS restart?

The invalid record you show was a change of GID and CTIME, which technically should not have been possible; but examining the code, I see that in fact we did not ensure the exclusivity of the value until a much later patch that you don't have (LU-5296).

Something as simple as a chgrp on an old file (as in, one that has existed for some time) should have caused this. I am not advocating that you try this on a production system, but if you have a test system where you can give this a try, that would be interesting.

Comment by Jay Lan (Inactive) [ 22/Feb/15 ]

Running the 2.4.3-7nasS version risks a different kind of panic.

Comment by Mahmoud Hanafi [ 22/Feb/15 ]

We are moving back up to 13nasS.

Comment by Mahmoud Hanafi [ 22/Feb/15 ]

The crash would happen as soon as the OST connected.

Comment by Oleg Drokin [ 22/Feb/15 ]

OK, so with this extra bit of info I see what was missing in my patch.

It should have been this:

diff --git a/lustre/osp/osp_sync.c b/lustre/osp/osp_sync.c
index 76eec62..6d74249 100644
--- a/lustre/osp/osp_sync.c
+++ b/lustre/osp/osp_sync.c
@@ -485,6 +485,10 @@ static int osp_sync_new_setattr_job(struct osp_device *d,
        if ((rec->lsr_valid & ~(LA_UID | LA_GID)) != 0) {
                CERROR("%s: invalid setattr record, lsr_valid:"LPU64"\n",
                       d->opd_obd->obd_name, rec->lsr_valid);
+               spin_lock(&d->opd_syn_lock);
+               if (d->opd_syn_prev_done)
+                       d->opd_syn_changes--;
+               spin_unlock(&d->opd_syn_lock);
                /* return 0 so that sync thread can continue processing
                 * other records. */
                RETURN(0);

But with the bad record gone, you do not really need it immediately, until you obtain another one.
I did some more probing around, and it seems that any chgrp (and potentially chown) on your filesystem will generate a bad record, though.
You can probably test this theory on a test filesystem of some sort.

Comment by Peter Jones [ 22/Feb/15 ]

Mahmoud/Jay

What is your current status? Is nobackupp8 back in service?

Peter

Comment by Stephen Champion [ 22/Feb/15 ]

Hi Peter - The last status before the site team went quiet is that the file system is online with jobs running.

Comment by Jay Lan (Inactive) [ 23/Feb/15 ]

Mahmoud put 2.4.3-13nasS on nobackupp8, the version we were running when we ran into problems. However, it seems the situation that hit the site is no longer occurring and the production system has been stable.

Nevertheless, the bug is still there. We just hope it stays dormant until you have a fix.

Comment by Mahmoud Hanafi [ 23/Feb/15 ]

With Lustre 2.4.3, how do we run a full lfsck? I can run it on the MDT, but I'd like to check the objects on the OSTs.

Comment by Oleg Drokin [ 24/Feb/15 ]

Do you mean online lfsck or the offline version? I am not sure we support offline lfsck in 2.4, and the online one was pretty basic at the time. Do you suspect any corruption?

Additionally, I wonder if you did a chgrp/chown experiment on a test fs by any chance, to see if that is what generates the bad log records?

Comment by Mahmoud Hanafi [ 24/Feb/15 ]

We are continuing to see the osp_sync errors, but they are not locking up the MDS. We are running with the last patch you provided here. It looks like these errors started showing up around Feb 19 18:00:00. I have attached the console log for reference.

I haven't tested the chgrp/chown yet but will try it out on the test filesystem.

I know we have had bad group/user ownership on OST objects due to a bug (I forget the LU), so we have been struggling to get our block quota correct. (I think https://jira.hpdd.intel.com/browse/LU-1267 fixed that issue.) We must have some type of corruption and bad records that we need to clear up.

Also, after recovery of the filesystem, we have some users reporting file creation dates in the future.

Comment by Oleg Drokin [ 25/Feb/15 ]

Hm, so looking into your logs I see towards the end:

LustreError: 7315:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp8-OST002d-osc-MDT0000: invalid setattr record, lsr_valid:64
LustreError: 7315:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 281069252 previous similar messages
Lustre: MGS: haven't heard from client 1d6b484d-708f-0eaa-a72c-7d0d6e13068f (at 10.153.10.48@o2ib233) in 227 seconds. I think it's dead, and I am evicting it. exp ffff883f72e3fc00, cur 1424413862 expire 1424413712 last 1424413635

There are no log timestamps, but this certainly looks like the symptoms from before my patch.
Converting 1424413862 from Unix seconds into normal time in the EST timezone yields "Fri Feb 20 01:31:02 2015" - is this the correct log uploaded? I wonder what current messages you are actually seeing.

Also, please note that if you have any chown/chmod activity on your real filesystem, this is the message you will most likely see (but with no crash and not as many repetitions, of course - my patch in the comments here is what prevents the repetitions/crashes). This is because you are also missing another follow-on patch, from LU-5296.

When LU-4345 was ported to b2_5 it actually also included a number of other fixes, as you can see in http://review.whamcloud.com/11435, and you are missing some of them.

The timestamps in the future might be a different problem - how far into the future?

Regarding the fsck, I just confirmed that the old offline lfsck did not really pay any attention to object ownership, so running it would not help bring your system into order.

I imagine the best way to ensure correct ownership of objects in 2.4.3 is to just chown the files to the correct user/group (after you have the LU-5296 fix!) all over again.

Comment by Jay Lan (Inactive) [ 26/Feb/15 ]

>Also please note that if you have any chown/chmod activity in your real filesystem - this is
> the message you will most likely see (but with no crash and not as many repetitions of
> course - my patch in comments here is what prevents repetitions/crashes). This is because
> you are also missing another follow on patch - from LU-5296.

LU-5296 is a fix to a regression introduced by LU-4690 (commit #9511). Since we do not have LU-4690, should the regression affect us? Commit #9511 is a MAJOR patch that I am scared to pick up unless you provide us a 2.4.x (or at least a 2.5.x) version.

Comment by Jay Lan (Inactive) [ 26/Feb/15 ]

> When LU-4345 was ported to b2_5 it actually also included a number of other fixes as you
> can see in http://review.whamcloud.com/11435 and you miss some of them.

The version of the LU-4345 patch we have here is patch set 1. Your final b2_4 patch for LU-4345 is patch set 7.

I will upgrade to patch set 7. BTW, it would be #10462/7 rather than #11435, which is for b2_5.

Comment by Oleg Drokin [ 27/Feb/15 ]

Well, the fact of the matter is that currently you are running with a unique set of patches that nobody else runs, so exposure is very limited, as opposed to b2_5, which runs with patch 11435 that incorporates a number of fixes.
I specifically pointed at LU-5296 because it seems to fix a problem you seem to be hitting, namely this part (don't apply it alone, this is just a demonstration):

@@ -237,7 +237,9 @@ static int osp_sync_add_rec(const struct lu_env *env, struct
                LASSERT(attr);
                osi->osi_setattr.lsr_uid = attr->la_uid;
                osi->osi_setattr.lsr_gid = attr->la_gid;
-               osi->osi_setattr.lsr_valid = attr->la_valid;
+               osi->osi_setattr.lsr_valid =
+                       ((attr->la_valid & LA_UID) ? OBD_MD_FLUID : 0) |
+                       ((attr->la_valid & LA_GID) ? OBD_MD_FLGID : 0);
                break;
        default:
                LBUG();
@@ -483,8 +485,9 @@ static int osp_sync_new_setattr_job(struct osp_device *d,
        ENTRY;
        LASSERT(h->lrh_type == MDS_SETATTR64_REC);
 
-       /* lsr_valid can only be 0 or LA_UID/GID set */
-       if ((rec->lsr_valid & ~(LA_UID | LA_GID)) != 0) {
+       /* lsr_valid can only be 0 or have OBD_MD_{FLUID,FLGID} set,
+        * so no bits other than these should be set. */
+       if ((rec->lsr_valid & ~(OBD_MD_FLUID | OBD_MD_FLGID)) != 0) {
                CERROR("%s: invalid setattr record, lsr_valid:"LPU64"\n",
                       d->opd_obd->obd_name, rec->lsr_valid);
                /* return 0 so that sync thread can continue processing

The first hunk starts to really enforce filtering of what's placed into lsr_valid, to ensure we only place there the two valid bits and nothing else. Before the patch, all sorts of unrelated bits might be set (CTIME is the most common - also seen in your logs) that trip the check in the second hunk. This is a regression of sorts introduced by LU-4345, and that's why you need it, regardless of possible interactions with LU-4690, which you are not carrying anyway.
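A minimal sketch of what that filtering changes for a chgrp-style record (the LA_* and OBD_MD_* values below are assumptions for illustration; the real flags are defined in the Lustre headers):

#include <stdio.h>

/* Assumed bit values for illustration only. */
#define LA_CTIME      0x04ULL
#define LA_UID        0x20ULL
#define LA_GID        0x40ULL
#define OBD_MD_FLUID  0x00000200ULL
#define OBD_MD_FLGID  0x00000400ULL

int main(void)
{
	/* a chgrp on an existing file: the MDT-side la_valid carries the
	 * GID change plus a CTIME update */
	unsigned long long la_valid = LA_GID | LA_CTIME;

	/* before the first hunk: la_valid was copied wholesale into the
	 * llog record, so the stray CTIME bit later trips the check */
	unsigned long long lsr_valid_before = la_valid;

	/* after the first hunk: only the UID/GID intent survives,
	 * translated into the flags the second hunk checks against */
	unsigned long long lsr_valid_after =
		((la_valid & LA_UID) ? OBD_MD_FLUID : 0) |
		((la_valid & LA_GID) ? OBD_MD_FLGID : 0);

	printf("before=%#llx after=%#llx\n",
	       lsr_valid_before, lsr_valid_after);
	return 0;
}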

Change 10462 looks like a good one based on the list of patches in there (and some quick checks). The main risk remains that this is code that (next to) nobody has run, and even though it looks almost identical to b2_5, there might be some other code paths not hit in b2_5, for example (there were, after all, changes in b2_5 compared to b2_4).

Comment by Jay Lan (Inactive) [ 27/Feb/15 ]

OK, the two code hunks you quoted above are in both the LU-5296 patch (#10989) and in LU-4345 (#10462/7 and #11435). By saying "this is a regression of sorts introduced by LU-4345", I bet you meant an earlier patch set, especially patch set #1 of #10462. So either #10462/7 or #11435 of LU-4345 should fix the regression already. Am I correct?
If I am correct, then do I really not need LU-5296 (#10989)? Or do I still need #10989 for other code fixes besides the two hunks?

Between #10462/7 and #11435, which one do you recommend we pick? Remember we are running code based on b2_4.

It is kind of scary to know that we are probably the only ones running b2_4... code paths that (next to) nobody has run...

Comment by Oleg Drokin [ 27/Feb/15 ]

You are correct that the regression was only part of patch set 1, and if you use 10462 patch set 7 then you should be fine.

11435 is not really applicable to your tree then, as it is b2_5 based; I only brought it up as an example of the things you really needed, but I did not know there was a right-sized b2_4 backport too.

Comment by Jay Lan (Inactive) [ 27/Feb/15 ]

With patchset 7 of LU-4345 in place, do we still need the patch you provided at 21/Feb/15 7:47 PM?

Comment by Oleg Drokin [ 28/Feb/15 ]

Well, once you have LU-4345 patch set 7, the only reason for my patch would be to combat on-disk corruption of an llog record - i.e., if somehow an llog record becomes corrupted in such a way that lsr_valid becomes invalid and triggers the check, you'll have a crash exactly like the one you had after applying the updated patch from LU-5188. The chance of that is much, much smaller than before, so I imagine it's OK not to carry it until we hash out a final version.

Comment by Jay Lan (Inactive) [ 03/Mar/15 ]

Oleg, I just updated nas-2.5.3 to today's b2_5. It already has LU-4345, but b2_5 does not have LU-5188. Do I need to cherry-pick LU-5188 from master into nas-2.5.3?

Comment by Oleg Drokin [ 04/Mar/15 ]

The b2_5 version of LU-4345, http://review.whamcloud.com/11435, already includes the LU-5188 patch as well, so you should be fine.

Comment by Peter Jones [ 11/May/15 ]

Emoly

Could you please turn Oleg's suggestion here - https://jira.hpdd.intel.com/browse/LU-6269?focusedCommentId=107620&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-107620 - into a proper patch? Oleg has also suggested that you use a fail_loc to trigger this condition.

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 14/May/15 ]

The OSP sync thread can't handle invalid records properly; see LU-5297.

Comment by Gerrit Updater [ 14/May/15 ]

Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/14806
Subject: LU-6269 osp: decrease opd_syn_changes for an invalid rec
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1ecf3aff71b3ab3b6c6432d033535e31243878dc

Comment by Peter Jones [ 22/May/15 ]

The remaining work here will be tracked under LU-5297

Comment by Jay Lan (Inactive) [ 02/Jun/15 ]

Peter, do we need LU-5297 in 2.4.3 and 2.5.3? And, I guess I need Review #14806?

Comment by Peter Jones [ 04/Jun/15 ]

Jay

Review #14806 has been deprecated in favor of the approach being taken under LU-5297. This is a long-term improvement and need not be something you rush to pick up as a fix. You only ran into this issue in the first place due to confusion about patch versions that should no longer be a factor.

Peter

Comment by Jay Lan (Inactive) [ 04/Jun/15 ]

Thank you for the clarification, Peter!
