[LU-9825] Multiple errors on OST/MDS Created: 03/Aug/17 Updated: 02/Oct/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Soak cluster - testing |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Running soak on sys-recov.today:/scratch/logs/syslog/soak-6.log:Aug 3 22:14:38 soak-6 kernel: Lustre: soaked-OST0016: Recovery over after 0:05, of 35 clients 35 recovered and 0 were evicted. sys-recov.today:/scratch/logs/syslog/soak-6.log:Aug 3 22:14:49 soak-6 kernel: Lustre: soaked-OST0004: Recovery over after 0:06, of 35 clients 35 recovered and 0 were evicted. /scratch/logs/syslog/soak-5.log:Aug 3 22:17:08 soak-5 kernel: LustreError: 24771:0:(osd_index.c:224:__osd_xattr_load_by_oid()) Skipped 13 previous similar messages /scratch/logs/syslog/soak-5.log:Aug 3 22:17:08 soak-5 kernel: LustreError: 24771:0:(osd_index.c:224:__osd_xattr_load_by_oid()) soaked-OST0003: can't get bonus, rc = -17 /scratch/logs/syslog/soak-2.log:Aug 3 22:18:07 soak-2 kernel: LustreError: 24114:0:(osd_index.c:224:__osd_xattr_load_by_oid()) Skipped 534 previous similar messages /scratch/logs/syslog/soak-2.log:Aug 3 22:18:07 soak-2 kernel: LustreError: 24114:0:(osd_index.c:224:__osd_xattr_load_by_oid()) soaked-OST0012: can't get bonus, rc = -17 Second set of errors comes up on MDS scratch/logs/syslog/soak-8.log:Aug 3 22:13:46 soak-8 kernel: LustreError: 4491:0:(client.c:3006:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff8807d76d5700 x1574746919153872/t167504869448(167504869448) o6->soaked-OST0016-osc-MDT0000@192.168.1.107@o2ib:28/4 lens 664/400 e 0 to 0 dl 1501798434 ref 2 fl Interpret:R/4/0 rc -2/-2 /scratch/logs/syslog/soak-8.log:Aug 3 22:13:50 soak-8 kernel: LustreError: 4491:0:(client.c:3006:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff8807f37a1500 x1574746919157120/t167504861127(167504861127) o6->soaked-OST0004-osc-MDT0000@192.168.1.107@o2ib:28/4 lens 664/400 e 0 to 0 dl 1501798438 ref 2 fl Interpret:R/4/0 rc -2/-2 /scratch/logs/syslog/soak-8.log:Aug 3 22:13:56 soak-8 kernel: LustreError: 4939:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 1 previous similar message /scratch/logs/syslog/soak-8.log:Aug 3 22:13:56 soak-8 kernel: LustreError: 4939:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 968 actual 344. /scratch/logs/syslog/soak-8.log:Aug 3 22:14:11 soak-8 kernel: LustreError: 11-0: soaked-OST0016-osc-MDT0000: operation ost_destroy to node 192.168.1.107@o2ib failed: rc = -107 /scratch/logs/syslog/soak-8.log:Aug 3 22:14:33 soak-8 kernel: LustreError: 11-0: soaked-OST0004-osc-MDT0000: operation ost_create to node 192.168.1.107@o2ib failed: rc = -107 /scratch/logs/syslog/soak-8.log:Aug 3 22:14:33 soak-8 kernel: LustreError: 4596:0:(osp_precreate.c:619:osp_precreate_send()) soaked-OST0004-osc-MDT0000: can't precreate: rc = -107 /scratch/logs/syslog/soak-8.log:Aug 3 22:14:40 soak-8 kernel: LustreError: 11-0: soaked-OST000a-osc-MDT0000: operation ost_create to node 192.168.1.107@o2ib failed: rc = -107 /scratch/logs/syslog/soak-8.log:Aug 3 22:14:40 soak-8 kernel: LustreError: 4609:0:(osp_precreate.c:619:osp_precreate_send()) soaked-OST000a-osc-MDT0000: can't precreate: rc = -107 /scratch/logs/syslog/soak-8.log:Aug 3 22:14:54 soak-8 kernel: LustreError: 11-0: soaked-OST0010-osc-MDT0000: operation ost_statfs to node 192.168.1.107@o2ib failed: rc = -107 /scratch/logs/syslog/soak-8.log:Aug 3 22:22:14 soak-8 kernel: LustreError: 4692:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 23 previous similar messages /scratch/logs/syslog/soak-8.log:Aug 3 22:22:14 soak-8 kernel: LustreError: 4692:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 416 actual 344. /scratch/logs/syslog/soak-8.log:Aug 3 22:22:15 soak-8 kernel: LustreError: 4853:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 25 previous similar messages /scratch/logs/syslog/soak-8.log:Aug 3 22:22:15 soak-8 kernel: LustreError: 4853:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 968 actual 416. /scratch/logs/syslog/soak-8.log:Aug 3 22:22:28 soak-8 kernel: LustreError: 4769:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 968 actual 344. /scratch/logs/syslog/soak-8.log:Aug 3 22:25:20 soak-8 kernel: LustreError: 5065:0:(osp_object.c:582:osp_attr_get()) soaked-OST000b-osc-MDT0000:osp_attr_get update error [0x1000b0000:0x111cd18:0x0]: rc = -4 Servers are not dying, but this is new with the patch, so possibly of interest. |
| Comments |
| Comment by Alex Zhuravlev [ 04/Aug/17 ] |
|
I can reproduce locally messages like the following: |
| Comment by Andreas Dilger [ 07/Aug/17 ] |
|
The unexpected errors on the MDS are confusing for users, especially errors that appear during normal usage and do not indicate any kind of problem. Lai, the "mdt_lvbo_fill() expected N got M" messages are from PFL and should be quieted (2.11 and 2.10.1) since they are expected for composite files. Nasf, _osd_xattr_load_by_oid() are from LFSCK and should be investigated. I don't know if they indicate a problem or not. |
| Comment by nasf (Inactive) [ 08/Aug/17 ] |
|
There are two callers in current osd-zfs for _osd_xattr_load_by_oid(): one is osd_zfs_otable_it_next(), the other is osd_get_fid_by_oid(). The former one is LFSCK related. I am not sure which one triggered the failure in test. But since LFSCK was triggered during the test, it is possible that the LFSCK called the _osd_xattr_load_by_oid() and caused the failure message. I checked the ZFS code, one suspected point is that: int
dnode_hold_impl(objset_t *os, uint64_t object, int flag, int slots,
void *tag, dnode_t **dnp)
{
...
mutex_enter(&dn->dn_mtx);
type = dn->dn_type;
if (dn->dn_free_txg ||
((flag & DNODE_MUST_BE_FREE) && !refcount_is_zero(&dn->dn_holds))) {
mutex_exit(&dn->dn_mtx);
zrl_remove(&dnh->dnh_zrlock);
dbuf_rele(db, FTAG);
return (type == DMU_OT_NONE ? ENOENT : EEXIST);
}
...
}
It seems that the object (that we are loading xattr) is removed. From current osd-zfs implementation, we do NOT hold reference on the object, instead, we hope to load the xattr via the oid directly. So that is normal race case. On the other hand, our zfs otable based iteration osd_zfs_otable_it_next() will ignore __osd_xattr_load_by_oid() failure, means that if it is the LFSCK triggered the __osd_xattr_load_by_oid() failure message, it is just some warning information, will NOT affect LFSCK running. static int osd_zfs_otable_it_next(const struct lu_env *env, struct dt_it *di)
{
...
rc = __osd_xattr_load_by_oid(dev, it->mit_pos, &nvbuf);
if (unlikely(rc != 0))
continue;
...
}
|
| Comment by Alex Zhuravlev [ 08/Aug/17 ] |
|
there is no point to hold a reference to object in this case, especially given this would be the same call to dmu_bonus_hold().. |