[LU-14598] Too many FIDs to precreate OST replaced. Wrong IDIF logic. Created: 09/Apr/21 Updated: 11/Apr/23 Resolved: 28/Apr/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.12.6, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.12.8, Lustre 2.15.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During testing, a job failure was observed:

V-1: Entering mdtest_stat...
12/01/2020 04:46:28: Process 30(nid00514): FAILED in mdtest_stat, unable to stat file: No such file or directory
Rank 30 [Tue Dec 1 04:46:28 2020] [c0-1c2s0n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 30
12/01/2020 04:46:29: Process 3(nid00514): FAILED in mdtest_stat, unable to stat file: No such file or directory
Rank 3 [Tue Dec 1 04:46:29 2020] [c0-1c2s0n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
_pmiu_daemon(SIGCHLD): [NID 00514] [c0-1c2s0n2] [Tue Dec 1 04:46:29 2020] PE RANK 3 exit signal Aborted
[NID 00514] 2020-12-01 04:46:29 Apid 7421368: initiated application termination
Application 7421368 exit codes: 134
Application 7421368 resources: utime ~0s, stime ~8s, Rss ~9528, inblocks ~0, outblocks ~0
Job Script: command stopped at Tue Dec 1 04:46:36 CST 2020
Job Script: command runtime was 240 seconds
drwxr-xr-x 3 vers tsttool 4096 Dec 1 04:42 /lus/snx11281/ostest.vers/alsorun.20201130141403.23327.saturn-p4/CL_mdtest_pfl_1s_fo.6.cdR49p.1606819353/CL_mdtest_pfl_1s_fo
ls: cannot access '/lus/snx11281/ostest.vers/alsorun.20201130141403.23327.saturn-p4/CL_mdtest_pfl_1s_fo.6.cdR49p.1606819353/CL_mdtest_pfl_1s_fo/#test-dir.0/mdtest_tree.0.0/file.mdtest.0.5023': No such file or directory
ls: cannot access '/lus/snx11281/ostest.vers/alsorun.20201130141403.23327.saturn-p4/CL_mdtest_pfl_1s_fo.6.cdR49p.1606819353/CL_mdtest_pfl_1s_fo/#test-dir.0/mdtest_tree.0.0/file.mdtest.0.4854': No such file or directory
...

On the MDS console, precreates to OST0001 were failing:

Dec 1 04:45:12 snx11281n002 kernel: Lustre: 15664:0:(client.c:2188:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606819417/real 1606819417] req@ffff94b0eb7d9680 x1684868787957760/t0(0) o5->snx11281-OST0001-osc-MDT0000@10.10.100.6@o2ib3:28/4 lens 432/432 e 0 to 1 dl 1606819512 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 job:''
Dec 1 04:45:12 snx11281n002 kernel: Lustre: snx11281-OST0001-osc-MDT0000: Connection to snx11281-OST0001 (at 10.10.100.6@o2ib3) was lost; in progress operations using this service will wait for recovery to complete
Dec 1 04:45:12 snx11281n002 kernel: LustreError: 15664:0:(osp_precreate.c:687:osp_precreate_send()) snx11281-OST0001-osc-MDT0000: can't precreate: rc = -11
Dec 1 04:45:12 snx11281n002 kernel: LustreError: 15664:0:(osp_precreate.c:1358:osp_precreate_thread()) snx11281-OST0001-osc-MDT0000: cannot precreate objects: rc = -11

With a debug patch we caught a vmcore with additional info:

[ 1401.294579] Lustre: format at ofd_dev.c:1651:ofd_create_hdl doesn't end in newline
[ 1401.294587] Lustre: snx11281-OST0004: Too many FIDs to precreate OST replaced or reformatted: LFSCK will clean up
[ 1401.294587] LustreError: 19714:0:(ofd_dev.c:1665:ofd_create_hdl()) ASSERTION( diff <= 5 * OST_MAX_PRECREATE ) failed: snx11281-OST0004: precreate FID 0x0:4395085801 is over 100000 larger than the LAST_ID 0x0:4294967297 diff 100118504
[ 1401.294589] LustreError: 19714:0:(ofd_dev.c:1665:ofd_create_hdl()) LBUG

Here is the root cause. An IDIF FID packs a 48-bit OST object ID into the FID: the low 32 bits are stored in f_oid, and the high 16 bits are stored in the low bits of f_seq, next to the OST index. Once object IDs on OST0004 grew past 2^32, precreate FIDs arrived with f_seq 0x100040001, and the wrong IDIF logic treated that as a sequence distinct from 0x100040000 instead of as the same IDIF sequence with a larger OID. The OFD therefore instantiated a "new" sequence with its own fresh LAST_ID, and the precreate request appeared to be ~100M objects ahead of it. |
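To make the numbers in the LBUG concrete, here is a tiny standalone check of the assertion arithmetic (a sketch only; OST_MAX_PRECREATE = 20000 as defined in the Lustre headers, but nothing below is the real ofd_create_hdl() code):

```c
#include <stdio.h>
#include <stdint.h>

#define OST_MAX_PRECREATE 20000	/* as defined in the Lustre headers */

int main(void)
{
	uint64_t last_id = 4294967297ULL;	/* LAST_ID from the LBUG      */
	uint64_t req_oid = 4395085801ULL;	/* precreate FID OID from MDT */

	/* LAST_ID has just crossed the 32-bit boundary: 0x100000001 */
	printf("last_id = 0x%llx = 2^32 + %llu\n",
	       (unsigned long long)last_id,
	       (unsigned long long)(last_id - (1ULL << 32)));

	/* The requested OID is ~100M objects ahead of LAST_ID */
	uint64_t diff = req_oid - last_id;
	printf("diff = %llu, limit = %u\n",
	       (unsigned long long)diff, 5 * OST_MAX_PRECREATE);

	if (diff > 5 * OST_MAX_PRECREATE)
		printf("ASSERTION( diff <= 5 * OST_MAX_PRECREATE ) would fail\n");
	return 0;
}
```

This prints diff = 100118504 against a limit of 100000, matching the LBUG message above.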
| Comments |
| Comment by Gerrit Updater [ 09/Apr/21 ] |
|
Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/43248 |
| Comment by Andreas Dilger [ 14/Apr/21 ] |
|
Looking at this, I want to make sure I understand the details correctly. The "IDIF" sequence actually has 2^48 OIDs for each OST, so the code should not be interpreting "seq" 0x100040001 as a different sequence number from 0x100040000, but rather as carrying the high 16 bits (= 0x0001) of the OID for OST0004. |
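To illustrate, here is a minimal standalone sketch of that IDIF packing (the helpers below are illustrative, patterned after fid_idif_seq() in lustre_fid.h, not copied from the Lustre tree):

```c
#include <stdio.h>
#include <stdint.h>

/* Build an IDIF f_seq from an OST index and a 48-bit object ID:
 * the high 16 bits of the OID land in the low bits of f_seq. */
static uint64_t idif_seq(uint64_t oid, uint32_t ost_idx)
{
	return (1ULL << 32) | ((uint64_t)(ost_idx & 0xffff) << 16) |
	       ((oid >> 32) & 0xffff);
}

/* Recover the full 48-bit object ID from an IDIF f_seq/f_oid pair. */
static uint64_t idif_id(uint64_t f_seq, uint32_t f_oid)
{
	return ((f_seq & 0xffff) << 32) | f_oid;
}

int main(void)
{
	/* OST0004, object IDs on both sides of the 2^32 boundary: */
	printf("oid 2^32-1 -> f_seq 0x%llx\n",
	       (unsigned long long)idif_seq((1ULL << 32) - 1, 4));
	printf("oid 2^32+1 -> f_seq 0x%llx\n",
	       (unsigned long long)idif_seq((1ULL << 32) + 1, 4));

	/* 0x100040001 is not a new sequence; its low 16 bits are the
	 * high bits of the OID: */
	printf("idif_id(0x100040001, 1) = %llu\n",
	       (unsigned long long)idif_id(0x100040001ULL, 1));
	return 0;
}
```

Both OIDs map into the same OST0004 IDIF range (f_seq 0x100040000 and 0x100040001), and decoding f_seq 0x100040001 with f_oid 1 gives back 4294967297 = 2^32 + 1, the LAST_ID value seen in the LBUG.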
| Comment by Alexander Boyko [ 14/Apr/21 ] |
|
@Andreas Dilger, yes, the right sequence should be used: 0x100040000 instead of 0x100040001. I'm adding a bit more info from the vmcore and debugfs. It shows that a new sequence was loaded on the OFD and a new LAST_ID was created for it. In the vmcore, OST0004 has the following loaded sequences, including a freshly allocated file/sequence with oid 1 for MDT0:

crash> list ofd_seq.os_list -s ofd_seq.os_oi,os_lastid_obj -H ffffa13a6823f630 -x
ffffa143f47e5e80
os_oi = {
{
oi = {
oi_id = 0x400000400,
oi_seq = 0xd6b439bb
},
oi_fid = {
f_seq = 0x400000400,
f_oid = 0xd6b439bb,
f_ver = 0x0
}
}
}
os_lastid_obj = 0xffffa13d3808ce00
ffffa139ff995480
os_oi = {
{
oi = {
oi_id = 0x100040001,
oi_seq = 0x1
},
oi_fid = {
f_seq = 0x100040001,
f_oid = 0x1,
f_ver = 0x0
}
}
}
os_lastid_obj = 0xffffa13da8c5b700
crash> osd_object 0xffffa13da8c5b700
struct osd_object {
oo_dt = {
do_lu = {
lo_header = 0xffffa13d4ca06d20,
lo_dev = 0xffffa13db9aa5000,
lo_ops = 0xffffffffc1f3b100 <osd_lu_obj_ops>,
lo_linkage = {
next = 0xffffa13d4ca06d60,
prev = 0xffffa13d4ca06d88
},
lo_dev_ref = {<No data fields>}
},
do_ops = 0xffffffffc1f3b000 <osd_obj_ops>,
do_body_ops = 0xffffffffc1f3cf40 <osd_body_ops>,
do_index_ops = 0x0
},
oo_inode = 0xffffa13eae1bedc8,
....
crash> inode 0xffffa13eae1bedc8
struct inode {
i_mode = 33188,
i_opflags = 0,
i_uid = {
val = 0
},
i_gid = {
val = 0
},
i_flags = 128,
i_acl = 0xffffffffffffffff,
i_default_acl = 0xffffffffffffffff,
i_op = 0xffffffffc24b5b40 <ldiskfs_file_inode_operations>,
i_sb = 0xffffa13f77734000,
i_mapping = 0xffffa13eae1bef18,
i_security = 0x0,
i_ino = 426717,
The creation time of inode 426717 is exact crash time. The sequence loaded at ofd was a new one and LAST_ID was created for it, instead of O/0/LAST_ID. # debugfs /dev/md0 debugfs 1.45.6.cr1 (14-Aug-2020) debugfs: stat <426717> Inode: 426717 Type: regular Mode: 0644 Flags: 0x84000 Generation: 1901991170 Version: 0x00000000:00000000 User: 0 Group: 0 Project: 0 Size: 8 File ACL: 0 Links: 1 Blockcount: 8 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x605a7ccc:a3bdb8b8 -- Tue Mar 23 18:42:04 2021 atime: 0x605a7ccc:a3bdb8b8 -- Tue Mar 23 18:42:04 2021 mtime: 0x605a7ccc:a3bdb8b8 -- Tue Mar 23 18:42:04 2021 crtime: 0x605a7ccc:a3bdb8b8 -- Tue Mar 23 18:42:04 2021 Size of extra inode fields: 32 Extended attributes: trusted.lma (24) = 08 00 00 00 00 00 00 00 01 00 04 00 01 00 00 00 00 00 00 00 00 00 00 00 lma: fid=[0x100040001:0x0:0x0] compat=8 incompat=0 EXTENTS: (0):1193148674 debugfs: stat O/0/LAST_ID Inode: 96 Type: regular Mode: 0644 Flags: 0x84000 Generation: 132363868 Version: 0x00000000:00000000 User: 0 Group: 0 Project: 0 Size: 8 File ACL: 0 Links: 1 Blockcount: 8 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x5e9f512f:d323ef10 -- Tue Apr 21 15:01:51 2020 atime: 0x5d016f96:3b027e8c -- Wed Jun 12 16:33:10 2019 mtime: 0x5d016f96:3b027e8c -- Wed Jun 12 16:33:10 2019 crtime: 0x5d016f96:3b027e8c -- Wed Jun 12 16:33:10 2019 Size of extra inode fields: 32 Extended attributes: trusted.lma (24) = 08 00 00 00 00 00 00 00 00 00 04 00 01 00 00 00 00 00 00 00 00 00 00 00 lma: fid=[0x100040000:0x0:0x0] compat=8 incompat=0 EXTENTS: (0):36993 |
| Comment by Alexander Boyko [ 14/Apr/21 ] |
|
Related to this issue, I see one more problem: ofd_preprw_write() has a loop that creates missing objects with no limit like the 100k cap OSP has. It could hang recovery even with small numbers, and break recovery when millions of objects would have to be created. Maybe it is better to return to the old logic, where a single object is created for this write IO? |
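For illustration only, here is a sketch of the kind of cap being suggested; ofd_preprw_write() is real, but every name and the limit below are hypothetical, patterned after the OSP-side bound mentioned above:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical cap, patterned after the 100k (5 * OST_MAX_PRECREATE)
 * bound that the OSP enforces for precreation. */
#define MAX_MISSING_CREATE 100000ULL

/* Stand-in for the real per-object creation step. */
static int create_one_object(uint64_t oid)
{
	(void)oid;
	return 0;
}

/* Sketch: create the missing objects (last_id, req_oid] needed by a
 * write during recovery, but refuse absurd gaps instead of looping
 * for millions of creations; the alternative is the old behaviour of
 * creating only the single object targeted by the write. */
static int create_missing_objects(uint64_t last_id, uint64_t req_oid)
{
	uint64_t oid;
	int rc;

	if (req_oid <= last_id)
		return 0;
	if (req_oid - last_id > MAX_MISSING_CREATE)
		return -EINVAL;	/* gap too large: likely a stale/bad FID */

	for (oid = last_id + 1; oid <= req_oid; oid++) {
		rc = create_one_object(oid);
		if (rc != 0)
			return rc;
	}
	return 0;
}

int main(void)
{
	/* A ~100M-object gap, as in this ticket, is rejected: */
	return create_missing_objects(4294967297ULL, 4395085801ULL) == -EINVAL ? 0 : 1;
}
```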
| Comment by Gerrit Updater [ 28/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43248/ |
| Comment by Peter Jones [ 28/Apr/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 04/May/21 ] |
|
Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/43541 |
| Comment by Gerrit Updater [ 14/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43541/ |
| Comment by Gerrit Updater [ 09/Jan/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49583 |
| Comment by Gerrit Updater [ 10/Jan/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49586 |
| Comment by Gerrit Updater [ 19/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49583/ |
| Comment by Gerrit Updater [ 11/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49586/ |