Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Labels: None
- Affects Versions: Lustre 2.4.0, Lustre 2.1.5
Description
During acceptance testing at KIT, OSS failover was tested while several applications were running, and the applications received I/O errors ("can't create file" and similar messages). This should not happen: I/O should simply stall until failover completes.
The clients were running 2.4 and the servers 2.1.5. We tried 2.1.5 clients and did not see this issue. I have attached the client and server logs.
Attachments
- ll10987.out.gz (0.2 kB)
- LU-XXXX.tgz (463 kB)
- mds1.llog.gz (224 kB)
- mds2.llog.gz (0.2 kB)
Activity
Hi Hongchao,
Do you think something like this patch is necessary on master? The osc appears to be a lot different in that version.
Thanks,
Kit
Perfect! I confirmed with the customer that this patch fixes their issue, so once this lands, I guess we can close the ticket.
Okay, the patch has been updated to suppress this error message if the corresponding request failed.
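For context, a minimal sketch of the kind of check being described (this is not the actual patch; it reuses the variable names from the layout.c excerpt quoted later in this thread, and the condition in the landed change may differ):

/* Sketch only: downgrade the "Wrong buffer for field" report when the
 * request itself already failed, since a failed ost_create reply is not
 * expected to carry an ost_body at all. */
if (value == NULL) {
        if (pill->rc_req->rq_status != 0)
                /* request failed: a missing reply body is expected */
                CDEBUG(D_INFO, "no field `%s' in reply of failed request\n",
                       field->rmf_name);
        else
                DEBUG_REQ(D_ERROR, pill->rc_req,
                          "Wrong buffer for field `%s' in format `%s'",
                          field->rmf_name, fmt->rf_name);
}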
Thanks
Would it be possible to suppress the warning message in this case? I think it is going to confuse sysadmins.
This error message should not cause issues: it only indicates that there was no "ost_body" in the reply (server side) of the failed ost_create request, and 208 is just the size of "ost_body".
if (field->rmf_flags & RMF_F_STRUCT_ARRAY) {
        /*
         * We've already asserted that field->rmf_size > 0 in
         * req_layout_init().
         */
        len = lustre_msg_buflen(msg, offset);
        if ((len % field->rmf_size) != 0) {
                CERROR("%s: array field size mismatch "
                       "%d modulo %d != 0 (%d)\n",
                       field->rmf_name, len, field->rmf_size, loc);
                return NULL;
        }
} else if (pill->rc_area[loc][offset] != -1) {
        len = pill->rc_area[loc][offset];
} else {
        len = max(field->rmf_size, 0);
}

value = getter(msg, offset, len);

if (value == NULL) {
        DEBUG_REQ(D_ERROR, pill->rc_req,
                  "Wrong buffer for field `%s' (%d of %d) "
                  "in format `%s': %d vs. %d (%s)\n",
                  field->rmf_name, offset, lustre_msg_bufcount(msg),
                  fmt->rf_name, lustre_msg_buflen(msg, offset), len,
                  rcl_names[loc]);
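To connect the "0 vs. 208" in the reported message to this code, the DEBUG_REQ arguments map onto the log text as follows (annotations added here, not present in the source):

        DEBUG_REQ(D_ERROR, pill->rc_req,
                  "Wrong buffer for field `%s' (%d of %d) "
                  "in format `%s': %d vs. %d (%s)\n",
                  field->rmf_name,                  /* "ost_body" */
                  offset, lustre_msg_bufcount(msg), /* "(1 of 1)" */
                  fmt->rf_name,                     /* "OST_CREATE" */
                  lustre_msg_buflen(msg, offset),   /* 0: no body in the failed reply */
                  len,                              /* 208: expected size of ost_body */
                  rcl_names[loc]);                  /* "server" */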
Yes, the two patches should be combined, and the patch has been updated.
Thanks
With both patchsets 4 and 5 applied, the issue seems to be solved:
LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_create operation failed with -107
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 9266:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `ost_body' (1 of 1) in format `OST_CREATE': 0 vs. 208 (server) req@ffff880808005800 x1450889175112924/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/192 e 0 to 0 dl 1383677590 ref 1 fl Interpret:RN/0/0 rc -107/-107
LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_connect operation failed with -19
LustreError: 9266:0:(osc_create.c:177:osc_interpret_create()) @@@ Unknown rc -107 from async create: failing oscc req@ffff880808005800 x1450889175112924/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/192 e 0 to 0 dl 1383677590 ref 1 fl Interpret:RN/0/0 rc -107/-107
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_statfs operation failed with -107
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib)
Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_statfs operation failed with -107
LustreError: Skipped 1 previous similar message
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 9266:0:(osc_create.c:177:osc_interpret_create()) @@@ Unknown rc -19 from async create: failing oscc req@ffff881001a70000 x1450889176260853/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/400 e 0 to 0 dl 1383678377 ref 1 fl Interpret:RN/0/0 rc -19/-19
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The obd_ping operation failed with -107
LustreError: Skipped 2 previous similar messages
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib)
Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
This error message looks somewhat concerning:
LustreError: 9266:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `ost_body' (1 of 1) in format `OST_CREATE': 0 vs. 208 (server)
Do you know if this should cause issues?
Do you think that patchset 4 would fix this issue? It actually seems as if we are hitting two separate issues, depending on where in the code path the OSS fails.
Hi Hongchao,
Still getting errors:
Nov 4 21:33:56 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_create operation failed with -107
Nov 4 21:33:56 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Nov 4 21:33:56 pfscn2 kernel: : LustreError: 9244:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `ost_body' (1 of 1) in format `OST_CREATE': 0 vs. 208 (server)
Nov 4 21:33:56 pfscn2 kernel: : req@ffff8808092fa000 x1450802648479850/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/192 e 0 to 0 dl 1383597243 ref 1 fl Interpret:RN/0/0 rc -107/-107
Nov 4 21:33:56 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_connect operation failed with -19
Nov 4 21:33:56 pfscn2 kernel: : LustreError: 9244:0:(osc_create.c:177:osc_interpret_create()) @@@ Unknown rc -107 from async create: failing oscc req@ffff8808092fa000 x1450802648479850/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/192 e 0 to 0 dl 1383597243 ref 1 fl Interpret:RN/0/0 rc -107/-107
Nov 4 21:33:56 pfscn2 kernel: : LustreError: 12037:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x1377d sub-object on OST idx 0/1: rc = -5
Nov 4 21:34:48 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
Are you able to reproduce the error? Basically, we reproduce it by having 8 clients write small files and then failing over an OSS.
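For illustration, a minimal sketch of the kind of small-file create workload described here (file names, sizes, and the single-process loop are arbitrary; the real test used application I/O across 8 clients):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Create many small files in a loop; during a clean OSS failover the
 * creates should stall and then continue, not return an error. */
int main(int argc, char **argv)
{
        const char *dir = argc > 1 ? argv[1] : ".";
        char path[4096], buf[4096];

        memset(buf, 'x', sizeof(buf));
        for (int i = 0; ; i++) {
                snprintf(path, sizeof(path), "%s/file.%d", dir, i);
                int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);
                if (fd < 0) {
                        /* this is the failure mode reported in the ticket */
                        fprintf(stderr, "create %s failed: %s\n",
                                path, strerror(errno));
                        return 1;
                }
                if (write(fd, buf, sizeof(buf)) < 0)
                        fprintf(stderr, "write %s failed: %s\n",
                                path, strerror(errno));
                close(fd);
        }
        return 0;
}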
Hi Kit,
On master, the object precreation path already waits in "osp_declare_object_create"/"osp_precreate_reserve"/"osp_precreate_ready_condition" for the recovery case.
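Roughly, the pattern being described is a wait for precreated objects to become available again instead of failing the create; a simplified sketch follows (the function names come from the comment above, but the body, field names, and signatures here are assumptions, not the actual master code):

/* Illustrative sketch only: block the MDS create path while the OST is in
 * recovery and no precreated objects are available, rather than failing. */
static int osp_precreate_reserve(struct osp_device *d)
{
        struct l_wait_info lwi = LWI_TIMEOUT(cfs_time_seconds(obd_timeout),
                                             NULL, NULL);
        int rc;

        /* wait until the precreate pool is usable again */
        rc = l_wait_event(d->opd_pre_waitq,
                          osp_precreate_ready_condition(d), &lwi);
        return rc;
}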