[LU-9825] Multiple errors on OST/MDS Created: 03/Aug/17  Updated: 02/Oct/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Soak cluster - testing LU-7899 patch


Issue Links:
Related
is related to LU-10055 mdt_fill_lvbo() message spew on MDS c... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running soak on LU-7899 patch, seeing multiple repeated errors not tied to any system halts.
First error, on OST, appeared after an OST failover, during LFSCK run. (LFSCK exceeded timeout and was aborted)
There is a hiccup in network and simul job dies:

sys-recov.today:/scratch/logs/syslog/soak-6.log:Aug  3 22:14:38 soak-6 kernel: Lustre: soaked-OST0016: Recovery over after 0:05, of 35 clients 35 recovered and 0 were evicted.
sys-recov.today:/scratch/logs/syslog/soak-6.log:Aug  3 22:14:49 soak-6 kernel: Lustre: soaked-OST0004: Recovery over after 0:06, of 35 clients 35 recovered and 0 were evicted.
/scratch/logs/syslog/soak-5.log:Aug  3 22:17:08 soak-5 kernel: LustreError: 24771:0:(osd_index.c:224:__osd_xattr_load_by_oid()) Skipped 13 previous similar messages
/scratch/logs/syslog/soak-5.log:Aug  3 22:17:08 soak-5 kernel: LustreError: 24771:0:(osd_index.c:224:__osd_xattr_load_by_oid()) soaked-OST0003: can't get bonus, rc = -17
/scratch/logs/syslog/soak-2.log:Aug  3 22:18:07 soak-2 kernel: LustreError: 24114:0:(osd_index.c:224:__osd_xattr_load_by_oid()) Skipped 534 previous similar messages
/scratch/logs/syslog/soak-2.log:Aug  3 22:18:07 soak-2 kernel: LustreError: 24114:0:(osd_index.c:224:__osd_xattr_load_by_oid()) soaked-OST0012: can't get bonus, rc = -17

Second set of errors comes up on MDS

scratch/logs/syslog/soak-8.log:Aug  3 22:13:46 soak-8 kernel: LustreError: 4491:0:(client.c:3006:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff8807d76d5700 x1574746919153872/t167504869448(167504869448) o6->soaked-OST0016-osc-MDT0000@192.168.1.107@o2ib:28/4 lens 664/400 e 0 to 0 dl 1501798434 ref 2 fl Interpret:R/4/0 rc -2/-2
/scratch/logs/syslog/soak-8.log:Aug  3 22:13:50 soak-8 kernel: LustreError: 4491:0:(client.c:3006:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff8807f37a1500 x1574746919157120/t167504861127(167504861127) o6->soaked-OST0004-osc-MDT0000@192.168.1.107@o2ib:28/4 lens 664/400 e 0 to 0 dl 1501798438 ref 2 fl Interpret:R/4/0 rc -2/-2
/scratch/logs/syslog/soak-8.log:Aug  3 22:13:56 soak-8 kernel: LustreError: 4939:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 1 previous similar message
/scratch/logs/syslog/soak-8.log:Aug  3 22:13:56 soak-8 kernel: LustreError: 4939:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 968 actual 344.
/scratch/logs/syslog/soak-8.log:Aug  3 22:14:11 soak-8 kernel: LustreError: 11-0: soaked-OST0016-osc-MDT0000: operation ost_destroy to node 192.168.1.107@o2ib failed: rc = -107
/scratch/logs/syslog/soak-8.log:Aug  3 22:14:33 soak-8 kernel: LustreError: 11-0: soaked-OST0004-osc-MDT0000: operation ost_create to node 192.168.1.107@o2ib failed: rc = -107
/scratch/logs/syslog/soak-8.log:Aug  3 22:14:33 soak-8 kernel: LustreError: 4596:0:(osp_precreate.c:619:osp_precreate_send()) soaked-OST0004-osc-MDT0000: can't precreate: rc = -107
/scratch/logs/syslog/soak-8.log:Aug  3 22:14:40 soak-8 kernel: LustreError: 11-0: soaked-OST000a-osc-MDT0000: operation ost_create to node 192.168.1.107@o2ib failed: rc = -107
/scratch/logs/syslog/soak-8.log:Aug  3 22:14:40 soak-8 kernel: LustreError: 4609:0:(osp_precreate.c:619:osp_precreate_send()) soaked-OST000a-osc-MDT0000: can't precreate: rc = -107
/scratch/logs/syslog/soak-8.log:Aug  3 22:14:54 soak-8 kernel: LustreError: 11-0: soaked-OST0010-osc-MDT0000: operation ost_statfs to node 192.168.1.107@o2ib failed: rc = -107
/scratch/logs/syslog/soak-8.log:Aug  3 22:22:14 soak-8 kernel: LustreError: 4692:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 23 previous similar messages
/scratch/logs/syslog/soak-8.log:Aug  3 22:22:14 soak-8 kernel: LustreError: 4692:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 416 actual 344.
/scratch/logs/syslog/soak-8.log:Aug  3 22:22:15 soak-8 kernel: LustreError: 4853:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 25 previous similar messages
/scratch/logs/syslog/soak-8.log:Aug  3 22:22:15 soak-8 kernel: LustreError: 4853:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 968 actual 416.
/scratch/logs/syslog/soak-8.log:Aug  3 22:22:28 soak-8 kernel: LustreError: 4769:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 968 actual 344.
/scratch/logs/syslog/soak-8.log:Aug  3 22:25:20 soak-8 kernel: LustreError: 5065:0:(osp_object.c:582:osp_attr_get()) soaked-OST000b-osc-MDT0000:osp_attr_get update error [0x1000b0000:0x111cd18:0x0]: rc = -4

Servers are not dying, but this is new with the patch, so possibly of interest.



 Comments   
Comment by Alex Zhuravlev [ 04/Aug/17 ]

I can reproduce locally messages like the following:
/scratch/logs/syslog/soak-5.log:Aug 3 22:17:08 soak-5 kernel: LustreError: 24771:0:(osd_index.c:224:__osd_xattr_load_by_oid()) Skipped 13 previous similar messages
this is OI scrubber meeting objects being destroyed.
in my local testing this don't lead to any test issues.

Comment by Andreas Dilger [ 07/Aug/17 ]

The unexpected errors on the MDS are confusing for users, especially errors that appear during normal usage and do not indicate any kind of problem.

Lai, the "mdt_lvbo_fill() expected N got M" messages are from PFL and should be quieted (2.11 and 2.10.1) since they are expected for composite files.

Nasf, _osd_xattr_load_by_oid() are from LFSCK and should be investigated. I don't know if they indicate a problem or not.

Comment by nasf (Inactive) [ 08/Aug/17 ]

There are two callers in current osd-zfs for _osd_xattr_load_by_oid(): one is osd_zfs_otable_it_next(), the other is osd_get_fid_by_oid(). The former one is LFSCK related. I am not sure which one triggered the failure in test. But since LFSCK was triggered during the test, it is possible that the LFSCK called the _osd_xattr_load_by_oid() and caused the failure message. I checked the ZFS code, one suspected point is that:

int
dnode_hold_impl(objset_t *os, uint64_t object, int flag, int slots,
    void *tag, dnode_t **dnp)
{
...
        mutex_enter(&dn->dn_mtx);
        type = dn->dn_type;
        if (dn->dn_free_txg ||
            ((flag & DNODE_MUST_BE_FREE) && !refcount_is_zero(&dn->dn_holds))) {
                mutex_exit(&dn->dn_mtx);
                zrl_remove(&dnh->dnh_zrlock);
                dbuf_rele(db, FTAG);
                return (type == DMU_OT_NONE ? ENOENT : EEXIST);
        }
...
}

It seems that the object (that we are loading xattr) is removed. From current osd-zfs implementation, we do NOT hold reference on the object, instead, we hope to load the xattr via the oid directly. So that is normal race case.

On the other hand, our zfs otable based iteration osd_zfs_otable_it_next() will ignore __osd_xattr_load_by_oid() failure, means that if it is the LFSCK triggered the __osd_xattr_load_by_oid() failure message, it is just some warning information, will NOT affect LFSCK running.

static int osd_zfs_otable_it_next(const struct lu_env *env, struct dt_it *di)
{
...
                rc = __osd_xattr_load_by_oid(dev, it->mit_pos, &nvbuf);
                if (unlikely(rc != 0))
                        continue;
...
}
Comment by Alex Zhuravlev [ 08/Aug/17 ]

there is no point to hold a reference to object in this case, especially given this would be the same call to dmu_bonus_hold()..
I agree we should avoid error messages in -ENOENT case at least - it's an object being removed.

Generated at Sat Feb 10 02:29:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.