[LU-14372] LustreError: 38823:0:(vvp_io.c:1562:vvp_io_init()) nbp11: refresh file layout [0x2400498d9:0x1642:0x0] error -5 Created: 27/Jan/21  Updated: 07/Oct/22  Resolved: 07/Oct/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: John Hammond
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Client running 2.12.5
server running 2.12.4/5


Issue Links:
Duplicate
duplicates LU-14644 IOR SSF PFL ill-formed I/O job aborte... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

We are seeing this error with hdf5 files that results in a job crash due to I/O error.

 Wed Jan 27 10:51:01 2021 C r901i5n2 [1611773461.647472] LustreError: 38823:0:(vvp_io.c:1562:vvp_io_init()) nbp11: refresh file layout [0x2400498d9:0x1642:0x0] error -5.
Wed Jan 27 10:51:01 2021 M r901i5n2 kernel: [1611773461.663472] LustreError: 38823:0:(vvp_io.c:1562:vvp_io_init()) nbp11: refresh file layout [0x2400498d9:0x1642:0x0] error -5.

Here is the only thing relevant that I found in debug file. I have enable additional debugging hoping to gather additional info.

0000100 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (import.c 1078 ptlrpc_connect_interpret()) nbp11-OST0024-osc-ffff8f3f503c4000  connect to target with instance 4
00000100 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (import.c 939 ptlrpc_connect_set_flags()) nbp11-OST0024-osc-ffff8f3f503c4000  Resetting ns_connect_flags to server flags  0xa0425af2e3440478
00000080 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize  72/1704
00000100 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (import.c 1162 ptlrpc_connect_interpret()) connected to replayable target  nbp11-OST0024_UUID
00000100 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f57200e7000 nbp11-OST0024_UUID  changing import state from CONNECTING to FULL
00000080 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize  72/1704
00000100 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (recover.c 223 ptlrpc_wake_delayed()) @@@ waking (set ffff8f4de56df380)   req@ffff8f533af8edc0 x1683453707600256/t0(0) o101->nbp11-OST0024-osc-ffff8f3f503c4000@10.151.26.168@o2ib 28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl
 Rpc W/0/ffffffff rc 0/-1
00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 1078 ptlrpc_connect_interpret()) nbp11-OST0015-osc-ffff8f3f503c4000  connect to target with instance 5
00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 939 ptlrpc_connect_set_flags()) nbp11-OST0015-osc-ffff8f3f503c4000  Resetting ns_connect_flags to server flags  0xa0425af2e3440478
00000080 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize  72/1704
00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 1162 ptlrpc_connect_interpret()) connected to replayable target  nbp11-OST0015_UUID
00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f4faa569000 nbp11-OST0015_UUID  changing import state from CONNECTING to FULL
00000080 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize  72/1704
00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (recover.c 223 ptlrpc_wake_delayed()) @@@ waking (set ffff8f3bbd4ef380)   req@ffff8f3f3491edc0 x1683453707789248/t0(0) o101->nbp11-OST0015-osc-ffff8f3f503c4000@10.151.26.167@o2ib 28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl
 Rpc W/0/ffffffff rc 0/-1
00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 1078 ptlrpc_connect_interpret()) nbp11-OST001a-osc-ffff8f3f503c4000  connect to target with instance 8
00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 939 ptlrpc_connect_set_flags()) nbp11-OST001a-osc-ffff8f3f503c4000  Resetting ns_connect_flags to server flags  0xa0425af2e3440478
00000080 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize  72/1704
00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 1162 ptlrpc_connect_interpret()) connected to replayable target  nbp11-OST001a_UUID
00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f5764398800 nbp11-OST001a_UUID  changing import state from CONNECTING to FULL
00000080 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize  72/1704
00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (recover.c 223 ptlrpc_wake_delayed()) @@@ waking (set ffff8f3bbd4efb80)   req@ffff8f36219ca000 x1683453707789888/t0(0) o101->nbp11-OST001a-osc-ffff8f3f503c4000@10.151.26.171@o2ib 28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl
 Rpc W/0/ffffffff rc 0/-1
00000100 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (import.c 1078 ptlrpc_connect_interpret()) nbp11-OST000a-osc-ffff8f3f503c4000  connect to target with instance 4
00000100 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (import.c 939 ptlrpc_connect_set_flags()) nbp11-OST000a-osc-ffff8f3f503c4000  Resetting ns_connect_flags to server flags  0xa0425af2e3440478
00000080 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize  72/1704
00000100 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (import.c 1162 ptlrpc_connect_interpret()) connected to replayable target  nbp11-OST000a_UUID
00000100 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f53b9dde800 nbp11-OST000a_UUID  changing import state from CONNECTING to FULL
00000080 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize  72/1704
00000100 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (recover.c 223 ptlrpc_wake_delayed()) @@@ waking (set ffff8f4de56df700)   req@ffff8f57a8e38dc0 x1683453707790848/t0(0) o101->nbp11-OST000a-osc-ffff8f3f503c4000@10.151.26.169@o2ib 28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl
 Rpc W/0/ffffffff rc 0/-1
00000100 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (import.c 1078 ptlrpc_connect_interpret()) nbp16-OST0005-osc-ffff8f3d83ba4000  connect to target with instance 4
00000100 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (import.c 939 ptlrpc_connect_set_flags()) nbp16-OST0005-osc-ffff8f3d83ba4000  Resetting ns_connect_flags to server flags  0xa0425af2e3440478
00000080 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize  144/600
00000100 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (import.c 1162 ptlrpc_connect_interpret()) connected to replayable target  nbp16-OST0005_UUID
00000100 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f57bb391800 nbp16-OST0005_UUID  changing import state from CONNECTING to FULL
00000080 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize  144/600
00000100 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (recover.c 223 ptlrpc_wake_delayed()) @@@ waking (set ffff8f3f99b8cc00)   req@ffff8f3c5228b240 x1683453707981184/t0(0) o101->nbp16-OST0005-osc-ffff8f3d83ba4000@10.151.26.195@o2ib 28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl
 Rpc W/0/ffffffff rc 0/-1
00000100 00080000 3.0 Wed Jan 27 11:05:33 PST 2021 0 4789 0 (import.c 1753 ptlrpc_disconnect_idle_interpret()) @@@ inflight=1, refcount=5  rc = 0   req@ffff8f3fb2bb16c0 x1683453708172736/t0(0) o9->nbp11-OST0030-osc-ffff8f3f503c4000@10.151.26.168@o2ib 28/4 lens 224/192 e 0 to 0 d
l 1611774613 ref 1 fl Interpret RN/0/0 rc 0/0
00000100 00080000 3.0 Wed Jan 27 11:05:33 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f5764398000 nbp11-OST0030_UUID  changing import state from CONNECTING to IDLE
00000100 00080000 18.0 Wed Jan 27 11:06:00 PST 2021 0 4789 0 (import.c 1753 ptlrpc_disconnect_idle_interpret()) @@@ inflight=1, refcount=5  rc = 0   req@ffff8f29b0455240 x1683453708177472/t0(0) o9->nbp11-OST0031-osc-ffff8f3f503c4000@10.151.26.170@o2ib 28/4 lens 224/192 e 0 to 0 
dl 1611774640 ref 1 fl Interpret RN/0/0 rc 0/0
00000100 00080000 18.0 Wed Jan 27 11:06:00 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f571ce55000 nbp11-OST0031_UUID  changing import state from CONNECTING to IDLE
00000002 00080000 45.0 Wed Jan 27 11:07:35 PST 2021 0 39214 0 (mdc_request.c 903 mdc_close()) @@@ matched open  req@ffff8f348f370040 x1683453708196032/t41664090920(41664090920) o101->nbp11-MDT0002-mdc-ffff8f3f503c4000@10.151.26.171@o2ib 12/10 lens 976/1008 e 0 to 0 dl 1611775078
 ref 1 fl Complete RP/4/ffffffff rc 0/-1
00000002 00080000 45.0 Wed Jan 27 11:07:35 PST 2021 0 39214 0 (mdc_request.c 903 mdc_close()) @@@ matched open  req@ffff8f348f3716c0 x1683453708196416/t41664090924(41664090924) o101->nbp11-MDT0002-mdc-ffff8f3f503c4000@10.151.26.171@o2ib 12/10 lens 1080/1080 e 0 to 0 dl 161177507
8 ref 2 fl Complete RP/4/0 rc 0/0

I did see this error from MDT to OST around the same time. So this could be a network issue.

Jan 27 10:59:06 nbp11-srv1 kernel: [21636687.904930] Lustre: 24143:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1611773322/real 1611773322]  req@ffff9e2aca8f6300 x1669611070093696/t0(0) o6->nbp11-OST0035-osc-MDT0000@10.151.26.172@o2ib:28/4 lens 544/432 e 1 to 1 dl 1611773946 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Jan 27 10:59:06 nbp11-srv1 kernel: [21636687.991757] Lustre: nbp11-OST0035-osc-MDT0000: Connection to nbp11-OST0035 (at 10.151.26.172@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jan 27 10:59:06 nbp11-srv6 kernel: [21636637.466674] Lustre: nbp11-OST0035: Client nbp11-MDT0000-mdtlov_UUID (at 10.151.26.167@o2ib) reconnecting



 Comments   
Comment by Peter Jones [ 28/Jan/21 ]

Serguei

Does this seem like a network issue to you?

Peter

Comment by Serguei Smirnov [ 28/Jan/21 ]

I'm not sure yet if this is caused by a network issue or OST fails to respond for some other reason.

Would it be possible to collect net debug logs?

 

Comment by John Hammond [ 29/Mar/22 ]

This is not a network issue. This is comes from vvp_prune() failing to

$LUSTRE/tests/llmount.sh
lfs setstripe \
  -E 1M -c 4 \
  -E 2M -c 4 \
  -E 3M -c 4 \
  -E 4M -c 4 \
  -E 5M -c 4 \
  -E 6M -c 4 \
  -E 7M -c 4 \
  -E 8M -c 4 \
  -E 9M -c 4 \
  -E 10M -c 4 \
  -E 11M -c 4 \
  -E 12M -c 4 \
  -E 13M -c 4 \
  -E 14M -c 4 \
  -E 15M -c 4 \
  -E eof -c 4 \
  /mnt/lustre
while dd if=/dev/zero of=f0 bs=1M count=16 && rm f0; do true; done

Eventually dd will fail with -EIO

dd: error writing 'f0': Input/output error
11+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.274785 s, 38.2 MB/s

[20055.549945] LustreError: 2005822:0:(vvp_io.c:1820:vvp_io_init()) lustre: refresh file layout [0x200000bd1:0x1a67a:0x0] error -5.

Tracing shows that the -EIO comes from:

static int vvp_prune(const struct lu_env *env, struct cl_object *obj)
{
        struct inode *inode = vvp_object_inode(obj);
        int rc;
        ENTRY;

        rc = cl_sync_file_range(inode, 0, OBD_OBJECT_EOF, CL_FSYNC_LOCAL, 1);
        if (rc < 0) {
                CDEBUG(D_VFSTRACE, DFID ": writeback failed: %d\n",
                       PFID(lu_object_fid(&obj->co_lu)), rc);
                RETURN(rc);
        }

        truncate_inode_pages(inode->i_mapping, 0);
        if (inode->i_mapping->nrpages) {
                CDEBUG(D_VFSTRACE, DFID ": still has %lu pages remaining\n",
                       PFID(lu_object_fid(&obj->co_lu)),
                       inode->i_mapping->nrpages);
                RETURN(-EIO);
        }

        RETURN(0);
}

This appears to be fixed in master by 7d5d004506 (LU-14644 vvp: wait for nrpages to be updated).

Comment by John Hammond [ 06/Apr/22 ]

"John L. Hammond <jhammond@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46948
Subject: LU-14644 vvp: wait for nrpages to be updated
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 1c73a734375a05fa75a9a55fa89494f743a10114

Comment by Peter Jones [ 07/Oct/22 ]

Fix landed to b2_12

Generated at Sat Feb 10 03:09:10 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.