[LU-6269] Unable to mount /nobackupp8 Created: 20/Feb/15 Updated: 17/Sep/16 Resolved: 22/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.7.0, Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Herbert Yeung | Assignee: | Emoly Liu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: | nbp8-mds running Lustre 2.4.3, CentOS 6.5 |
| Attachments: | |
| Issue Links: | |
| Severity: | 1 |
| Rank (Obsolete): | 17570 |
| Description |
|
There appears to be some IB-related congestion problem that is destabilizing our Lustre servers. The various OSS and MDS systems will stop communicating over IB, or the metadata traffic on the file system slows down to the point where an ls -l can take minutes to return information. The problems come in waves: everything can be okay for a few hours, then we get a slow file system; after an hour or two it can recover and be fine for a few hours, and then the same or a different Lustre file system has problems. This has been happening for a few days. Last night, /nobackupp8 went down. When the system was brought up and we mounted the MDT, we got a lot of traffic to the MDS, which locks the system up. The file system has been down since last night. We'd like to talk to a Lustre engineer about this problem.

service214: Feb 20 09:50:29 kern:err:nbp8-oss13 LNetError: 9772:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 36 seconds |
| Comments |
| Comment by Peter Jones [ 20/Feb/15 ] |
|
Herbert, to be clear - are all your filesystems affected, or just this one? Peter |
| Comment by Peter Jones [ 20/Feb/15 ] |
|
Herbert, an LNet engineer is examining the code in light of the messages you are seeing. Do you have any particular questions at this time? Do you want to pose these via the ticket or set up a concall? Peter |
| Comment by Amir Shehata (Inactive) [ 20/Feb/15 ] |
|
In the syslog, I'm seeing:

Feb 20 07:50:21 kern:err:nbp8-mds1 LNetError: 2497:0:(o2iblnd_cb.c:2267:kiblnd_passive_connect()) Can't accept 10.151.40.87@o2ib on NA (ib1:0:10.151.27.60): bad dst nid 10.151.27.60@o2ib

This occurs when a node receives an RDMA_CM_EVENT_CONNECT_REQUEST but the destination NID in the message doesn't correspond to any local NIs. From the errors you pasted in the description, all of them seem to revolve around NID 10.151.27.60@o2ib. Are there any issues around that node? Is it configured properly? It seems 10.151.40.87@o2ib is attempting to establish a connection with 10.151.27.60@o2ib (which looks like the MGC in your setup), but that node doesn't have a network interface configured with that NID.
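For reference, the rejection comes from the connect-request path in o2iblnd_cb.c: on RDMA_CM_EVENT_CONNECT_REQUEST, the passive side looks up a local NI matching the destination NID carried in the connect message and refuses the connection if none matches. A rough sketch of the check in kiblnd_passive_connect() - paraphrased from the 2.4-era code, so treat names and structure as approximate rather than exact:

	/* find the local NI on the net of the requested destination NID */
	ni = lnet_net2ni(LNET_NIDNET(reqmsg->ibm_dstnid));

	if (ni == NULL ||                         /* no local net matches */
	    ni->ni_nid != reqmsg->ibm_dstnid) {   /* right net, wrong NID */
		CERROR("Can't accept %s on %s: bad dst nid %s\n", ...);
		goto failed;                      /* reject the connect request */
	}

So the "bad dst nid 10.151.27.60@o2ib" message means the receiving node had no LNet NI configured as 10.151.27.60@o2ib at that moment. |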
| Comment by Mahmoud Hanafi [ 21/Feb/15 ] |
|
More info... It looks like the MDS is running in osp_sync_process_queues and it blocks all other processes. It looks like it is responding to clients, but all console and host access is hung. I am uploading a process trace and a list of running processes. |
| Comment by Oleg Drokin [ 21/Feb/15 ] |
|
Looking at this newly uploaded log, it seems the problem is this commit:

commit f55937464476f48c6e24841f724fcedc463962c5
Author: Oleg Drokin <oleg.drokin@intel.com>
Date: Thu Jun 12 19:13:25 2014 -0400
LU-5188 osp: Correctly check for invalid setattr record
Patch for LU-4345 (commit 80f90fcde73e2faff8b7b0ffc7c19bc52982e027 )
has a correct comment about lsr_valid member being either 0 or
having UID and GID fields set, but the check has a typo causing
it to check for lsr_valid to be both 0 and have the bits set which
is impossible.
Change-Id: I32e6f261b4707391d6e96f8434aae7bf8f55bf8e
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
diff --git a/lustre/osp/osp_sync.c b/lustre/osp/osp_sync.c
index 595c7c0..328cf50 100644
--- a/lustre/osp/osp_sync.c
+++ b/lustre/osp/osp_sync.c
@@ -482,7 +482,7 @@ static int osp_sync_new_setattr_job(struct osp_device *d,
LASSERT(h->lrh_type == MDS_SETATTR64_REC);
/* lsr_valid can only be 0 or LA_UID/GID set */
- if (!rec->lsr_valid && !(rec->lsr_valid & ~(LA_UID | LA_GID))) {
+ if (rec->lsr_valid && !(rec->lsr_valid & ~(LA_UID | LA_GID))) {
CERROR("%s: invalid setattr record, lsr_valid:"LPU64"\n",
d->opd_obd->obd_name, rec->lsr_valid);
RETURN(-EINVAL);
And the proper patch from our tree is:

commit 79dd530f1352e6b57fcb870a1e0f2c2a05a0648d
Author: Oleg Drokin <oleg.drokin@intel.com>
Date: Thu Jun 12 19:13:25 2014 -0400
LU-5188 osp: Correctly check for invalid setattr record
Patch for LU-4345 (commit 80f90fcde73e2faff8b7b0ffc7c19bc52982e027 )
has a correct comment about lsr_valid member being either 0 or
having UID and GID fields set, but the check has a typo causing
it to check for lsr_valid to be both 0 and have the bits set which
is impossible.
The osp_sync_new_setattr_job() should return 0 for invalid record,
so that sync thread can continue processing on other records.
Change-Id: I32e6f261b4707391d6e96f8434aae7bf8f55bf8e
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
Reviewed-on: http://review.whamcloud.com/10706
Tested-by: Jenkins
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Reviewed-by: Mike Pershin <mike.pershin@intel.com>
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
diff --git a/lustre/osp/osp_sync.c b/lustre/osp/osp_sync.c
index 34a0e2c..1dd8e85 100644
--- a/lustre/osp/osp_sync.c
+++ b/lustre/osp/osp_sync.c
@@ -484,10 +484,12 @@ static int osp_sync_new_setattr_job(struct osp_device *d,
LASSERT(h->lrh_type == MDS_SETATTR64_REC);
/* lsr_valid can only be 0 or LA_UID/GID set */
- if (!rec->lsr_valid && !(rec->lsr_valid & ~(LA_UID | LA_GID))) {
+ if ((rec->lsr_valid & ~(LA_UID | LA_GID)) != 0) {
CERROR("%s: invalid setattr record, lsr_valid:"LPU64"\n",
d->opd_obd->obd_name, rec->lsr_valid);
- RETURN(-EINVAL);
+ /* return 0 so that sync thread can continue processing
+ * other records. */
+ RETURN(0);
}
req = osp_sync_new_job(d, llh, h, OST_SETATTR, &RQF_OST_SETATTR);
So if you update the RETURN part of it, the problem should go away. Did you perhaps pick the wrong version of the patch? You might want to add |
| Comment by Oleg Drokin [ 21/Feb/15 ] |
|
Also looking through |
| Comment by Jay Lan (Inactive) [ 21/Feb/15 ] |
|
Our |
| Comment by Jay Lan (Inactive) [ 21/Feb/15 ] |
|
Could you take a look at our code at https://github.com/jlan/lustre-nas ? In our osp_sync_process_record(), the default case of "switch (rec->lrh_type)" sets rc and does a break. Your patch does not set rc and does not have a break.
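For clarity, a sketch of the default case being described - reconstructed from the wording above, with hypothetical CERROR text (the tree at the URL above is authoritative):

	default:
		CERROR("%s: unknown record type: %x\n",
		       d->opd_obd->obd_name, rec->lrh_type);
		rc = -EINVAL;   /* rc is set... */
		break;          /* ...and the switch exits cleanly */

The point is that an unrecognized record yields an error return instead of falling through with rc never assigned. |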
| Comment by Jay Lan (Inactive) [ 21/Feb/15 ] |
|
Ah, if I picked up |
| Comment by Peter Jones [ 21/Feb/15 ] |
|
Oleg likely will not be online for a few hours, but he was talking about LU-4743 when he initially started looking at the info from Mahmoud. |
| Comment by Jay Lan (Inactive) [ 21/Feb/15 ] |
|
Thanks, Peter. I have resolved the conflict and built a set of new server rpms. |
| Comment by Peter Jones [ 21/Feb/15 ] |
|
Great! Please continue to keep us informed |
| Comment by Mahmoud Hanafi [ 21/Feb/15 ] |
|
Got an LBUG with the new patches in play:

nbp8-mds1 login: Lustre: 4864:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424540000/real 1424540001] req@ffff883fee2e5000 x1493738361717656/t0(0) o8->nbp8-OST005d-osc-MDT0000@10.151.27.77@o2ib:28/4 lens 400/544 e 0 to 1 dl 1424540105 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 4864:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424540000/real 1424540001] req@ffff883fee0da400 x1493738361717612/t0(0) o8->nbp8-OST0076-osc-MDT0000@10.151.27.76@o2ib:28/4 lens 400/544 e 0 to 1 dl 1424540105 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 4864:0:(client.c:1878:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1424540001/real 1424540001] req@ffff883fec2ee800 x1493738361718180/t0(0) o8->nbp8-OST0103-osc-MDT0000@10.151.27.87@o2ib:28/4 lens 400/544 e 0 to 1 dl 1424540106 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 4864:0:(client.c:1878:ptlrpc_expire_one_request()) Skipped 179 previous similar messages
LustreError: 5529:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp8-OST001b-osc-MDT0000: invalid setattr record, lsr_valid:68
LustreError: 5511:0:(osp_sync.c:888:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 1 changes, 1 in progress, 1 in flight
LustreError: 5511:0:(osp_sync.c:888:osp_sync_thread()) LBUG
Pid: 5511, comm: osp-syn-0
31 out of 32 cpus in kdb, waiting for the rest, timeout in 10 second(s)
Call Trace:
[<ffffffffa0489895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0489e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0f6c07a>] osp_sync_thread+0x76a/0x7e0 [osp]
[<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Kernel panic - not syncing: LBUG
Pid: 5511, comm: osp-syn-0 Not tainted 2.6.32-358.23.2.el6.20140115.x86_64.lustre243 #1
Call Trace:
[<ffffffff8153ef9e>] ? panic+0xa7/0x190
[<ffffffffa0489eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[<ffffffffa0f6c07a>] ? osp_sync_thread+0x76a/0x7e0 [osp]
[<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0ca>] ? child_rip+0xa/0x20
[<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffffa0f6b910>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20 |
| Comment by Mahmoud Hanafi [ 21/Feb/15 ] |
|
Attaching full process trace for the LBUG; see service200.LBUG.gz. |
| Comment by Oleg Drokin [ 21/Feb/15 ] |
|
Hm, I'll try to trigger this on my test system with some extra debug to better understand how it exits like that all of a sudden. |
| Comment by Oleg Drokin [ 21/Feb/15 ] |
|
OK, so I see this problem also exists in master. I am still digging around to see if this is the proper solution or if we want to change something else (hence no patch in Gerrit). Please give it a try (against your tree; only needed on the MDS):

diff --git a/lustre/osp/osp_sync.c b/lustre/osp/osp_sync.c
index 76eec62..755071c 100644
--- a/lustre/osp/osp_sync.c
+++ b/lustre/osp/osp_sync.c
@@ -485,6 +485,9 @@ static int osp_sync_new_setattr_job(struct osp_device *d,
if ((rec->lsr_valid & ~(LA_UID | LA_GID)) != 0) {
CERROR("%s: invalid setattr record, lsr_valid:"LPU64"\n",
d->opd_obd->obd_name, rec->lsr_valid);
+ spin_lock(&d->opd_syn_lock);
+ d->opd_syn_changes--;
+ spin_unlock(&d->opd_syn_lock);
/* return 0 so that sync thread can continue processing
* other records. */
RETURN(0);
|
| Comment by Mahmoud Hanafi [ 22/Feb/15 ] |
|
No luck....

------------------------------------------------------------------------
nbp8-mds1 login: LustreError: 5418:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp8-OST0000-osc-MDT0000: invalid setattr record, lsr_valid:68
LustreError: 5416:0:(osp_sync.c:891:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 18446744073709551615 changes, 0 in progress, 0 in flight
LustreError: 5416:0:(osp_sync.c:891:osp_sync_thread()) LBUG
31 out of 32 cpus in kdb, waiting for the rest, timeout in 10 second(s)
Pid: 5416, comm: osp-syn-78
Call Trace:
[<ffffffffa0489895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0489e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0f6c08a>] osp_sync_thread+0x76a/0x7e0 [osp]
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
LustreError: 5430:0:(osp_sync.c:891:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 18446744073709551615 changes, 0 in progress, 0 in flight
LustreError: 5430:0:(osp_sync.c:891:osp_sync_thread()) LBUG
Pid: 5430, comm: osp-syn-156
Call Trace:
[<ffffffffa0489895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0489e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0f6c08a>] osp_sync_thread+0x76a/0x7e0 [osp]
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
LustreError: 5426:0:(osp_sync.c:891:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 18446744073709551615 changes, 0 in progress, 0 in flight
LustreError: 5426:0:(osp_sync.c:891:osp_sync_thread()) LBUG
Pid: 5426, comm: osp-syn-234
Call Trace:
[<ffffffffa0489895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0489e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0f6c08a>] osp_sync_thread+0x76a/0x7e0 [osp]
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Kernel panic - not syncing: LBUG
Pid: 5416, comm: osp-syn-78 Tainted: G W --------------- 2.6.32-358.23.2.el6.20140115.x86_64.lustre243 #1
Call Trace:
[<ffffffff8153ef9e>] ? panic+0xa7/0x190
[<ffffffffa0489eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[<ffffffffa0f6c08a>] ? osp_sync_thread+0x76a/0x7e0 [osp]
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0ca>] ? child_rip+0xa/0x20
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffffa0f6b920>] ? osp_sync_thread+0x0/0x7e0 [osp]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
All cpus are now in kdb
Entering kdb (current=0xffff88200dcea040, pid 5416) on processor 10
Oops: (null) due to oops @ 0x0
kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
[10]kdb> |
| Comment by Oleg Drokin [ 22/Feb/15 ] |
|
Hm, so there's a changes-counter underflow in this case, then, that I don't see in my testing. Thanks, I'll keep digging.
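For what it's worth, the number in the assertion message is the giveaway: 18446744073709551615 is (__u64)-1, i.e. an unsigned 64-bit counter decremented past zero. A minimal standalone illustration (plain C; the variable merely stands in for the real opd_syn_changes field):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t opd_syn_changes = 0;   /* counter already drained */

		opd_syn_changes--;              /* unconditional decrement wraps to 2^64 - 1 */

		printf("%llu changes\n", (unsigned long long)opd_syn_changes);
		/* prints "18446744073709551615 changes", the value in the
		 * ASSERTION message above */
		return 0;
	}

So the decrement needs to be guarded rather than unconditional. |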
| Comment by Mahmoud Hanafi [ 22/Feb/15 ] |
|
We reverted to running on 2.4.3-7nasS and the MDS came up. We will run on this until we get a more complete patch. |
| Comment by Oleg Drokin [ 22/Feb/15 ] |
|
Hm, so it came up and there was no chatter in the logs about invalid records anymore? Perhaps my patch was just successful enough to clear them out, so your reversion was OK? The invalid record you show was a change of GID and CTIME, which technically should not have been possible; examining the code, I see that in fact we did not ensure the exclusivity of the value until a much later patch that you don't have ( Something as simple as chgrp on an (old, as in existing for some time) file should have caused this. I am not advocating for you to try this on a production system, but if you have a test system where you can give this a try - that would be interesting.
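As a side note on decoding that record: lsr_valid:68 from the LBUG above is 0x44, i.e. LA_GID | LA_CTIME under the la_valid bits from Lustre's lu_object.h (values reproduced here from memory, so double-check against your tree):

	#include <stdio.h>
	#include <stdint.h>

	/* enum la_valid bits as defined in lustre/include/lu_object.h */
	#define LA_CTIME (1 << 2)       /* 0x04 */
	#define LA_UID   (1 << 5)       /* 0x20 */
	#define LA_GID   (1 << 6)       /* 0x40 */

	int main(void)
	{
		uint64_t lsr_valid = 68;        /* from the LBUG console log */

		printf("lsr_valid = 0x%llx\n", (unsigned long long)lsr_valid);
		printf("GID set:   %d\n", !!(lsr_valid & LA_GID));      /* 1 */
		printf("CTIME set: %d\n", !!(lsr_valid & LA_CTIME));    /* 1 */

		/* CTIME falls outside LA_UID|LA_GID, so the record trips the
		 * "invalid setattr record" check in osp_sync_new_setattr_job() */
		printf("trips check: %d\n",
		       (int)((lsr_valid & ~(uint64_t)(LA_UID | LA_GID)) != 0));
		return 0;
	}

This matches the GID-and-CTIME reading above. |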
| Comment by Jay Lan (Inactive) [ 22/Feb/15 ] |
|
Running the 2.4.3-7nasS version risks a different kind of panic. |
| Comment by Mahmoud Hanafi [ 22/Feb/15 ] |
|
We are moving back to 13nasS. |
| Comment by Mahmoud Hanafi [ 22/Feb/15 ] |
|
The crash would happen as soon as the OST connected. |
| Comment by Oleg Drokin [ 22/Feb/15 ] |
|
OK, so with this extra bit of info I see what was missing in my patch. It should have been this:

diff --git a/lustre/osp/osp_sync.c b/lustre/osp/osp_sync.c
index 76eec62..6d74249 100644
--- a/lustre/osp/osp_sync.c
+++ b/lustre/osp/osp_sync.c
@@ -485,6 +485,10 @@ static int osp_sync_new_setattr_job(struct osp_device *d,
if ((rec->lsr_valid & ~(LA_UID | LA_GID)) != 0) {
CERROR("%s: invalid setattr record, lsr_valid:"LPU64"\n",
d->opd_obd->obd_name, rec->lsr_valid);
+ spin_lock(&d->opd_syn_lock);
+ if (d->opd_syn_prev_done)
+ d->opd_syn_changes--;
+ spin_unlock(&d->opd_syn_lock);
/* return 0 so that sync thread can continue processing
* other records. */
RETURN(0);
But with the bad record gone, you do not really need it immediately - not until you obtain another one. |
| Comment by Peter Jones [ 22/Feb/15 ] |
|
Mahmoud/Jay What is your current status? Is nobackupp8 back in service? Peter |
| Comment by Stephen Champion [ 22/Feb/15 ] |
|
Hi Peter - The last status before the site team went quiet is that the file system is online with jobs running. |
| Comment by Jay Lan (Inactive) [ 23/Feb/15 ] |
|
Mahmoud put 2.4.3-13nasS on nobackupp8, the version we were running when we ran into problems. However, it seems the situation that hit the site is not there, and the production system has been stable. Nevertheless, the bug is still there. We just hope it remains sleeping until you have a fix. |
| Comment by Mahmoud Hanafi [ 23/Feb/15 ] |
|
With Lustre 2.4.3, how do we run a full lfsck? I can run it on the MDT, but I would like to check the objects on the OSTs. |
| Comment by Oleg Drokin [ 24/Feb/15 ] |
|
Do you mean online lfsck or the offline version? I am not sure we support offline lfsck in 2.4, and the online one was pretty basic at the time. Do you suspect any corruption? Additionally, I wonder if you did a chgrp/chown experiment on a test fs by any chance, to see if that is what generates the bad log records?
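If useful, the experiment can be as small as a single group-only ownership change on an existing file on a test mount. A minimal sketch, equivalent to chgrp <gid> <file> (a uid of -1 leaves the owner unchanged):

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		if (argc != 3) {
			fprintf(stderr, "usage: %s <gid> <file>\n", argv[0]);
			return 1;
		}
		/* group-only setattr: on the MDS this should become a
		 * MDS_SETATTR64_REC, the record type at issue in this ticket */
		if (chown(argv[2], (uid_t)-1, (gid_t)atol(argv[1])) != 0) {
			perror("chown");
			return 1;
		}
		return 0;
	}

Run it against a pre-existing file on the test filesystem and watch the MDS console for the invalid setattr record message. |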
| Comment by Mahmoud Hanafi [ 24/Feb/15 ] |
|
We are continuing to see the osp_sync error, but we are not locking up the MDS. We are running with your last patch provided here. It looks like these errors started showing up ~Feb 19 18:00:00. I have attached the console log for reference. I haven't tested the chgrp/chown yet but will try it out on the test filesystem. I know we have had bad group/user ownership for OST objects due to a bug (I forget the LU), so we have been struggling to get our block quota correct. (I think https://jira.hpdd.intel.com/browse/LU-1267 fixed that issue.) We must have some type of corruption and bad records that we need to clear up. Also, after recovery of the filesystem we have some users reporting file creation dates in the future. |
| Comment by Oleg Drokin [ 25/Feb/15 ] |
|
Hm, so looking into your logs I see towards the end:

LustreError: 7315:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp8-OST002d-osc-MDT0000: invalid setattr record, lsr_valid:64
LustreError: 7315:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 281069252 previous similar messages
Lustre: MGS: haven't heard from client 1d6b484d-708f-0eaa-a72c-7d0d6e13068f (at 10.153.10.48@o2ib233) in 227 seconds. I think it's dead, and I am evicting it. exp ffff883f72e3fc00, cur 1424413862 expire 1424413712 last 1424413635

There are no log timestamps, but this certainly looks like the symptoms from before my patch. Also please note that if you have any chown/chmod activity in your real filesystem, this is the message you will most likely see (though with no crash and not as many repetitions, of course - my patch in the comments here is what prevents the repetitions/crashes). This is because you are also missing another follow-on patch - from When Time in the future might be a different problem - how much into the future? Regarding the fsck, I just confirmed that the old offline lfsck did not really pay any attention to object ownership, so running that would not help bring your system into order. I imagine the best way to ensure correct ownership of objects in 2.4.3 is to just chown the files to the correct user/group (after you have |
| Comment by Jay Lan (Inactive) [ 26/Feb/15 ] |
|
>Also please note that if you have any chown/chmod activity in your real filesystem - this is The |
| Comment by Jay Lan (Inactive) [ 26/Feb/15 ] |
|
> When The version of I will upgrade to patch set 7. BTW, it would be #10462/7, rather than #11435, which is for b2_5. |
| Comment by Oleg Drokin [ 27/Feb/15 ] |
|
Well, the fact of the matter is that currently you are running with a unique set of patches that nobody else runs, so exposure is very little, as opposed to b2_5, which runs with patch 11435 that incorporates a number of fixes.

@@ -237,7 +237,9 @@ static int osp_sync_add_rec(const struct lu_env *env, struct
LASSERT(attr);
osi->osi_setattr.lsr_uid = attr->la_uid;
osi->osi_setattr.lsr_gid = attr->la_gid;
- osi->osi_setattr.lsr_valid = attr->la_valid;
+ osi->osi_setattr.lsr_valid =
+ ((attr->la_valid & LA_UID) ? OBD_MD_FLUID : 0) |
+ ((attr->la_valid & LA_GID) ? OBD_MD_FLGID : 0);
break;
default:
LBUG();
@@ -483,8 +485,9 @@ static int osp_sync_new_setattr_job(struct osp_device *d,
ENTRY;
LASSERT(h->lrh_type == MDS_SETATTR64_REC);
- /* lsr_valid can only be 0 or LA_UID/GID set */
- if ((rec->lsr_valid & ~(LA_UID | LA_GID)) != 0) {
+ /* lsr_valid can only be 0 or have OBD_MD_{FLUID,FLGID} set,
+ * so no bits other than these should be set. */
+ if ((rec->lsr_valid & ~(OBD_MD_FLUID | OBD_MD_FLGID)) != 0) {
CERROR("%s: invalid setattr record, lsr_valid:"LPU64"\n",
d->opd_obd->obd_name, rec->lsr_valid);
/* return 0 so that sync thread can continue processing
The first hunk starts to really enforce filtering of what's placed into lsr_valid, to ensure we only place the two valid bits there and nothing else. Before the patch, all sorts of unrelated stuff might be set (CTIME is the most common - also seen in your logs) that trips the check in the second hunk - this is a regression of sorts introduced by

Change 10462 looks like a good one based on the list of patches in there (and some quick checks). The main risk remains that this is code that (next to) nobody has run, and even though it looks almost identical to b2_5, there might be some other code paths not hit in b2_5, for example (there were, after all, changes in b2_5 compared to b2_4). |
| Comment by Jay Lan (Inactive) [ 27/Feb/15 ] |
|
OK, the two code hunks you quoted above are in both patches. Between #10462/7 and #11435, which one do you recommend we pick? Remember we are running code based on b2_4. It is kind of scary to know that we are probably the only ones running b2_4..., code paths that (next to) nobody has run... |
| Comment by Oleg Drokin [ 27/Feb/15 ] |
|
You are correct that the regression was only part of patch set 1, and if you use 10462 patch set 7 then you should be fine. 11435 is not really applicable to your tree, then, as it's b2_5-based; I only brought it in as an example of what you really needed, but I did not know there was a right-sized b2_4 backport too. |
| Comment by Jay Lan (Inactive) [ 27/Feb/15 ] |
|
With patchset 7 of |
| Comment by Oleg Drokin [ 28/Feb/15 ] |
|
Well. Once you have |
| Comment by Jay Lan (Inactive) [ 03/Mar/15 ] |
|
Oleg, I just updated nas-2.5.3 up to today's b2_5. It already has |
| Comment by Oleg Drokin [ 04/Mar/15 ] |
|
b2_5 version of |
| Comment by Peter Jones [ 11/May/15 ] |
|
Emoly, could you please turn Oleg's suggestion here - https://jira.hpdd.intel.com/browse/LU-6269?focusedCommentId=107620&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-107620 - into a proper patch? Oleg has also suggested that you use a fail_loc to trigger this condition. Thanks Peter |
| Comment by Niu Yawei (Inactive) [ 14/May/15 ] |
|
The OSP sync thread can't handle an invalid record properly; see |
| Comment by Gerrit Updater [ 14/May/15 ] |
|
Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/14806 |
| Comment by Peter Jones [ 22/May/15 ] |
|
The remaining work here will be tracked under |
| Comment by Jay Lan (Inactive) [ 02/Jun/15 ] |
|
Peter, do we need |
| Comment by Peter Jones [ 04/Jun/15 ] |
|
Jay, review #14806 has been deprecated in favor of the approach being taken under Peter |
| Comment by Jay Lan (Inactive) [ 04/Jun/15 ] |
|
Thank you for clarification, Peter! |