[LU-14372] LustreError: 38823:0:(vvp_io.c:1562:vvp_io_init()) nbp11: refresh file layout [0x2400498d9:0x1642:0x0] error -5 Created: 27/Jan/21 Updated: 07/Oct/22 Resolved: 07/Oct/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | John Hammond |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Client running 2.12.5 |
||
| Issue Links: |
|
||||||||
| Severity: | 2 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We are seeing this error with hdf5 files that results in a job crash due to I/O error. Wed Jan 27 10:51:01 2021 C r901i5n2 [1611773461.647472] LustreError: 38823:0:(vvp_io.c:1562:vvp_io_init()) nbp11: refresh file layout [0x2400498d9:0x1642:0x0] error -5. Wed Jan 27 10:51:01 2021 M r901i5n2 kernel: [1611773461.663472] LustreError: 38823:0:(vvp_io.c:1562:vvp_io_init()) nbp11: refresh file layout [0x2400498d9:0x1642:0x0] error -5. Here is the only thing relevant that I found in debug file. I have enable additional debugging hoping to gather additional info. 0000100 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (import.c 1078 ptlrpc_connect_interpret()) nbp11-OST0024-osc-ffff8f3f503c4000 connect to target with instance 4 00000100 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (import.c 939 ptlrpc_connect_set_flags()) nbp11-OST0024-osc-ffff8f3f503c4000 Resetting ns_connect_flags to server flags 0xa0425af2e3440478 00000080 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize 72/1704 00000100 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (import.c 1162 ptlrpc_connect_interpret()) connected to replayable target nbp11-OST0024_UUID 00000100 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f57200e7000 nbp11-OST0024_UUID changing import state from CONNECTING to FULL 00000080 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize 72/1704 00000100 00080000 1.0 Wed Jan 27 10:32:40 PST 2021 0 4789 0 (recover.c 223 ptlrpc_wake_delayed()) @@@ waking (set ffff8f4de56df380) req@ffff8f533af8edc0 x1683453707600256/t0(0) o101->nbp11-OST0024-osc-ffff8f3f503c4000@10.151.26.168@o2ib 28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl Rpc W/0/ffffffff rc 0/-1 00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 1078 ptlrpc_connect_interpret()) nbp11-OST0015-osc-ffff8f3f503c4000 connect to target with instance 5 00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 939 ptlrpc_connect_set_flags()) nbp11-OST0015-osc-ffff8f3f503c4000 Resetting ns_connect_flags to server flags 0xa0425af2e3440478 00000080 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize 72/1704 00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 1162 ptlrpc_connect_interpret()) connected to replayable target nbp11-OST0015_UUID 00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f4faa569000 nbp11-OST0015_UUID changing import state from CONNECTING to FULL 00000080 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize 72/1704 00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (recover.c 223 ptlrpc_wake_delayed()) @@@ waking (set ffff8f3bbd4ef380) req@ffff8f3f3491edc0 x1683453707789248/t0(0) o101->nbp11-OST0015-osc-ffff8f3f503c4000@10.151.26.167@o2ib 28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl Rpc W/0/ffffffff rc 0/-1 00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 1078 ptlrpc_connect_interpret()) nbp11-OST001a-osc-ffff8f3f503c4000 connect to target with instance 8 00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 939 ptlrpc_connect_set_flags()) nbp11-OST001a-osc-ffff8f3f503c4000 Resetting ns_connect_flags to server flags 0xa0425af2e3440478 00000080 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize 72/1704 00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 1162 ptlrpc_connect_interpret()) connected to replayable target nbp11-OST001a_UUID 00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f5764398800 nbp11-OST001a_UUID changing import state from CONNECTING to FULL 00000080 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize 72/1704 00000100 00080000 1.0 Wed Jan 27 10:41:58 PST 2021 0 4789 0 (recover.c 223 ptlrpc_wake_delayed()) @@@ waking (set ffff8f3bbd4efb80) req@ffff8f36219ca000 x1683453707789888/t0(0) o101->nbp11-OST001a-osc-ffff8f3f503c4000@10.151.26.171@o2ib 28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl Rpc W/0/ffffffff rc 0/-1 00000100 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (import.c 1078 ptlrpc_connect_interpret()) nbp11-OST000a-osc-ffff8f3f503c4000 connect to target with instance 4 00000100 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (import.c 939 ptlrpc_connect_set_flags()) nbp11-OST000a-osc-ffff8f3f503c4000 Resetting ns_connect_flags to server flags 0xa0425af2e3440478 00000080 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize 72/1704 00000100 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (import.c 1162 ptlrpc_connect_interpret()) connected to replayable target nbp11-OST000a_UUID 00000100 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f53b9dde800 nbp11-OST000a_UUID changing import state from CONNECTING to FULL 00000080 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize 72/1704 00000100 00080000 1.0 Wed Jan 27 10:41:59 PST 2021 0 4789 0 (recover.c 223 ptlrpc_wake_delayed()) @@@ waking (set ffff8f4de56df700) req@ffff8f57a8e38dc0 x1683453707790848/t0(0) o101->nbp11-OST000a-osc-ffff8f3f503c4000@10.151.26.169@o2ib 28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl Rpc W/0/ffffffff rc 0/-1 00000100 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (import.c 1078 ptlrpc_connect_interpret()) nbp16-OST0005-osc-ffff8f3d83ba4000 connect to target with instance 4 00000100 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (import.c 939 ptlrpc_connect_set_flags()) nbp16-OST0005-osc-ffff8f3d83ba4000 Resetting ns_connect_flags to server flags 0xa0425af2e3440478 00000080 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize 144/600 00000100 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (import.c 1162 ptlrpc_connect_interpret()) connected to replayable target nbp16-OST0005_UUID 00000100 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f57bb391800 nbp16-OST0005_UUID changing import state from CONNECTING to FULL 00000080 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (lcommon_misc.c 70 cl_init_ea_size()) updating def/max_easize 144/600 00000100 00080000 1.0 Wed Jan 27 10:51:01 PST 2021 0 4789 0 (recover.c 223 ptlrpc_wake_delayed()) @@@ waking (set ffff8f3f99b8cc00) req@ffff8f3c5228b240 x1683453707981184/t0(0) o101->nbp16-OST0005-osc-ffff8f3d83ba4000@10.151.26.195@o2ib 28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl Rpc W/0/ffffffff rc 0/-1 00000100 00080000 3.0 Wed Jan 27 11:05:33 PST 2021 0 4789 0 (import.c 1753 ptlrpc_disconnect_idle_interpret()) @@@ inflight=1, refcount=5 rc = 0 req@ffff8f3fb2bb16c0 x1683453708172736/t0(0) o9->nbp11-OST0030-osc-ffff8f3f503c4000@10.151.26.168@o2ib 28/4 lens 224/192 e 0 to 0 d l 1611774613 ref 1 fl Interpret RN/0/0 rc 0/0 00000100 00080000 3.0 Wed Jan 27 11:05:33 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f5764398000 nbp11-OST0030_UUID changing import state from CONNECTING to IDLE 00000100 00080000 18.0 Wed Jan 27 11:06:00 PST 2021 0 4789 0 (import.c 1753 ptlrpc_disconnect_idle_interpret()) @@@ inflight=1, refcount=5 rc = 0 req@ffff8f29b0455240 x1683453708177472/t0(0) o9->nbp11-OST0031-osc-ffff8f3f503c4000@10.151.26.170@o2ib 28/4 lens 224/192 e 0 to 0 dl 1611774640 ref 1 fl Interpret RN/0/0 rc 0/0 00000100 00080000 18.0 Wed Jan 27 11:06:00 PST 2021 0 4789 0 (import.c 86 import_set_state_nolock()) ffff8f571ce55000 nbp11-OST0031_UUID changing import state from CONNECTING to IDLE 00000002 00080000 45.0 Wed Jan 27 11:07:35 PST 2021 0 39214 0 (mdc_request.c 903 mdc_close()) @@@ matched open req@ffff8f348f370040 x1683453708196032/t41664090920(41664090920) o101->nbp11-MDT0002-mdc-ffff8f3f503c4000@10.151.26.171@o2ib 12/10 lens 976/1008 e 0 to 0 dl 1611775078 ref 1 fl Complete RP/4/ffffffff rc 0/-1 00000002 00080000 45.0 Wed Jan 27 11:07:35 PST 2021 0 39214 0 (mdc_request.c 903 mdc_close()) @@@ matched open req@ffff8f348f3716c0 x1683453708196416/t41664090924(41664090924) o101->nbp11-MDT0002-mdc-ffff8f3f503c4000@10.151.26.171@o2ib 12/10 lens 1080/1080 e 0 to 0 dl 161177507 8 ref 2 fl Complete RP/4/0 rc 0/0 I did see this error from MDT to OST around the same time. So this could be a network issue. Jan 27 10:59:06 nbp11-srv1 kernel: [21636687.904930] Lustre: 24143:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1611773322/real 1611773322] req@ffff9e2aca8f6300 x1669611070093696/t0(0) o6->nbp11-OST0035-osc-MDT0000@10.151.26.172@o2ib:28/4 lens 544/432 e 1 to 1 dl 1611773946 ref 1 fl Rpc:X/2/ffffffff rc -11/-1 Jan 27 10:59:06 nbp11-srv1 kernel: [21636687.991757] Lustre: nbp11-OST0035-osc-MDT0000: Connection to nbp11-OST0035 (at 10.151.26.172@o2ib) was lost; in progress operations using this service will wait for recovery to complete Jan 27 10:59:06 nbp11-srv6 kernel: [21636637.466674] Lustre: nbp11-OST0035: Client nbp11-MDT0000-mdtlov_UUID (at 10.151.26.167@o2ib) reconnecting |
| Comments |
| Comment by Peter Jones [ 28/Jan/21 ] |
|
Serguei Does this seem like a network issue to you? Peter |
| Comment by Serguei Smirnov [ 28/Jan/21 ] |
|
I'm not sure yet if this is caused by a network issue or OST fails to respond for some other reason. Would it be possible to collect net debug logs?
|
| Comment by John Hammond [ 29/Mar/22 ] |
|
This is not a network issue. This is comes from vvp_prune() failing to $LUSTRE/tests/llmount.sh lfs setstripe \ -E 1M -c 4 \ -E 2M -c 4 \ -E 3M -c 4 \ -E 4M -c 4 \ -E 5M -c 4 \ -E 6M -c 4 \ -E 7M -c 4 \ -E 8M -c 4 \ -E 9M -c 4 \ -E 10M -c 4 \ -E 11M -c 4 \ -E 12M -c 4 \ -E 13M -c 4 \ -E 14M -c 4 \ -E 15M -c 4 \ -E eof -c 4 \ /mnt/lustre while dd if=/dev/zero of=f0 bs=1M count=16 && rm f0; do true; done Eventually dd will fail with -EIO dd: error writing 'f0': Input/output error 11+0 records in 10+0 records out 10485760 bytes (10 MB, 10 MiB) copied, 0.274785 s, 38.2 MB/s [20055.549945] LustreError: 2005822:0:(vvp_io.c:1820:vvp_io_init()) lustre: refresh file layout [0x200000bd1:0x1a67a:0x0] error -5. Tracing shows that the -EIO comes from: static int vvp_prune(const struct lu_env *env, struct cl_object *obj) { struct inode *inode = vvp_object_inode(obj); int rc; ENTRY; rc = cl_sync_file_range(inode, 0, OBD_OBJECT_EOF, CL_FSYNC_LOCAL, 1); if (rc < 0) { CDEBUG(D_VFSTRACE, DFID ": writeback failed: %d\n", PFID(lu_object_fid(&obj->co_lu)), rc); RETURN(rc); } truncate_inode_pages(inode->i_mapping, 0); if (inode->i_mapping->nrpages) { CDEBUG(D_VFSTRACE, DFID ": still has %lu pages remaining\n", PFID(lu_object_fid(&obj->co_lu)), inode->i_mapping->nrpages); RETURN(-EIO); } RETURN(0); } This appears to be fixed in master by 7d5d004506 ( |
| Comment by John Hammond [ 06/Apr/22 ] |
|
"John L. Hammond <jhammond@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46948 |
| Comment by Peter Jones [ 07/Oct/22 ] |
|
Fix landed to b2_12 |