Details
-
Bug
-
Resolution: Duplicate
-
Critical
-
None
-
Lustre 2.16.0
-
None
-
3
-
9223372036854775807
Description
Data corruption/miscompare issue observed during 48 hours lustre FOFB.
[63] FAILED comparison of buffer containing 8-byte ints: [63] File name = /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m [63] In transfer 129, 131072 errors between buffer indices 0 and 131071. [63] File byte offset = 33957085184: [63] Expected: 0x0000003f671ad176 0000000000000008 0000003f671ad176 0000000000000018 [63] Actual: 0x6161616161616161 6161616161616161 6161616161616161 6161616161616161 ** error ** ERROR in IOR.c (line 448): data check error, aborting execution. ERROR: No such file or directory ** exiting ** Rank 63 [Thu Oct 24 18:07:48 2024] [c0-0c2s11n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 63 _pmiu_daemon(SIGCHLD): [NID 00173] [c0-0c2s11n1] [Thu Oct 24 18:07:49 2024] PE RANK 63 exit signal Aborted [NID 00173] 2024-10-24 18:07:49 Apid 8319126: initiated application termination
First issue with timestamp :
read 5759 524288 1024.00 0.042503 5.65 0.000407 5.69 6 XXCEL Using Time Stamp 1729810806 (0x671ad176) for Data Signature delaying 15 seconds . . . Commencing write performance test. Thu Oct 24 18:00:25 2024 WARNING: Expected aggregate file size = 34359738368. WARNING: Stat() of aggregate file size = 33957085184. WARNING: Using actual aggregate bytes moved = 34359738368. write 74.34 524288 1024.00 0.010731 440.78 0.002005 440.79 7 XXCEL Verifying contents of the file(s) just written. Thu Oct 24 18:07:45 2024 [63] At transfer buffer #128, index #0 (file byte offset 33957085184): [63] Expected: 0x0000003f671ad176 [63] Actual: 0x6161616161616161 [63] At transfer buffer #128, index #1 (file byte offset 33957085192): [63] Expected: 0x0000000000000008 [63] Actual: 0x6161616161616161 [63] At transfer buffer #128, index #2 (file byte offset 33957085200):
FOFB operation during the corruption time :
2024-10-24t17:45:15 Test 17 -- operation kill failover of kjcf05n03 2024-10-24t18:11:14 Test 17 -- operation kill failback of kjcf05n03
Recovery after kjcf05n03 Failover:
Oct 24 18:04:49 kjcf05n06 kernel: Lustre: kjcf05-MDT0001-lwp-OST0002: Connection restored to 10.16.100.52@o2ib (at 10.16.100.52@o2ib) Oct 24 18:05:31 kjcf05n02 kernel: LustreError: 1970850:0:(mdt_open.c:1442:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid req@00000000cb79bd48 x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:341/0 lens 520/4576 e 0 to 0 dl 1729811181 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121 Oct 24 18:05:31 kjcf05n02 kernel: LustreError: 1970850:0:(mdt_handler.c:5119:mdt_intent_open()) @@@ Replay open failed with -14 req@00000000cb79bd48 x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:341/0 lens 520/608 e 0 to 0 dl 1729811181 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121 Oct 24 18:05:33 kjcf05n00 kernel: Lustre: kjcf05-MDT0001-mdc-ffff9734e40bd800: Connection restored to 10.16.100.52@o2ib (at 10.16.100.52@o2ib)
Recovery after kjcf05n03 Failback :
Oct 24 18:27:05 kjcf05n03 kernel: LustreError: 52805:0:(mdt_open.c:1442:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid req@00000000b48954bd x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:119/0 lens 520/4176 e 0 to 0 dl 1729812474 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121 Oct 24 18:27:05 kjcf05n03 kernel: LustreError: 52805:0:(mdt_handler.c:5119:mdt_intent_open()) @@@ Replay open failed with -14 req@00000000b48954bd x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:119/0 lens 520/608 e 0 to 0 dl 1729812474 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121 Oct 24 18:27:14 kjcf05n00 kernel: Lustre: kjcf05-MDT0001-mdc-ffff9734e40bd800: Connection restored to 10.16.100.53@o2ib (at 10.16.100.53@o2ib) Oct 24 18:27:14 kjcf05n02 kernel: Lustre: kjcf05-MDT0001-osp-MDT0000: Connection restored to 10.16.100.53@o2ib (at 10.16.100.53@o2ib)
Timestamp for the IOR job :
<150>1 2024-10-24T17:52:40.433638-05:00 c0-0c0s1n1 aprun 25706 p1-20241023t024851 [alps_msgs@34] apid=8319126, Starting, user=1356, cmd_line="aprun -n 64 /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR -o /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m -a MPIIO -i 8 -w -r -W -t 1m -b 512m -C -k -vv -q -d 15 ", num_nodes=2, node_list=172-173, cwd="/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.25633", wdir="/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.25633" <150>1 2024-10-24T18:07:53.597430-05:00 c0-0c0s1n1 apsys 25844 p1-20241023t024851 [alps_msgs@34] apid=8319126, Finishing, user=1356, exit_code=137, exitcode_array=134:0, exitsignal_array=9:0 <142>1 2024-10-24T18:07:55.386375-05:00 c0-0c0s1n1 RUR 29745 p1-20241023t024851 [RUR@34] uid: 1356, apid: 8319126, jobid: 0, cmdname: /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR, plugin: taskstats {"core": 0, "exitcode:signal": 0, "rur_errors": 0, "rur_truncated_record_errors": 0, "max_rss": 0, "stime": 0, "wchar": 0, "rchar": 0, "utime": 0} <142>1 2024-10-24T18:07:55.386441-05:00 c0-0c0s1n1 RUR 29745 p1-20241023t024851 [RUR@34] uid: 1356, apid: 8319126, jobid: 0, cmdname: /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR, plugin: timestamp APP_START 2024-10-24T17:52:40CDT APP_STOP 2024-10-24T18:07:54CDT
lfs getstripe O/P :
./CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m lcm_layout_gen: 7 lcm_mirror_count: 1 lcm_entry_count: 4 lcme_id: 1 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 131072 lmm_stripe_count: 0 lmm_stripe_size: 131072 lmm_pattern: mdt lmm_layout_gen: 65535 lmm_stripe_offset: 0 lcme_id: 2 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 131072 lcme_extent.e_end: 134217728 lmm_stripe_count: 1 lmm_stripe_size: 131072 lmm_pattern: raid0 lmm_layout_gen: 65535 lmm_stripe_offset: 2 lmm_pool: flash lmm_objects: - 0: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44f1:0x0] } lcme_id: 4 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 134217728 lcme_extent.e_end: 34359738368 lmm_stripe_count: 32 lmm_stripe_size: 131072 lmm_pattern: raid0,overstriped lmm_layout_gen: 65535 lmm_stripe_offset: 2 lmm_pool: flash lmm_objects: - 0: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44f2:0x0] } - 1: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c414:0x0] } - 2: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44f3:0x0] } - 3: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c415:0x0] } - 4: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44f4:0x0] } - 5: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c416:0x0] } - 6: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44f5:0x0] } - 7: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c417:0x0] } - 8: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44f6:0x0] } - 9: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c418:0x0] } - 10: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44f7:0x0] } - 11: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c419:0x0] } - 12: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44f8:0x0] } - 13: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c41a:0x0] } - 14: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44f9:0x0] } - 15: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c41b:0x0] } - 16: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44fa:0x0] } - 17: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c41c:0x0] } - 18: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44fb:0x0] } - 19: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c41d:0x0] } - 20: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44fc:0x0] } - 21: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c41e:0x0] } - 22: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44fd:0x0] } - 23: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c41f:0x0] } - 24: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44fe:0x0] } - 25: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c420:0x0] } - 26: { l_ost_idx: 2, l_fid: [0x300007930:0xfa44ff:0x0] } - 27: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c421:0x0] } - 28: { l_ost_idx: 2, l_fid: [0x300007930:0xfa4500:0x0] } - 29: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c422:0x0] } - 30: { l_ost_idx: 2, l_fid: [0x300007930:0xfa4501:0x0] } - 31: { l_ost_idx: 3, l_fid: [0x3400061c0:0xf4c423:0x0] } lcme_id: 5 lcme_mirror_id: 0 lcme_flags: extension lcme_extent.e_start: 34359738368 lcme_extent.e_end: EOF lmm_stripe_count: 0 lmm_extension_size: 67108864 lmm_pattern: raid0,overstriped lmm_layout_gen: 0 lmm_stripe_offset: -1 lmm_pool: flash
debugfs stat:
[root@kjcf05n03 ~]# debugfs -c -R "stat <481925412>" /dev/md66 debugfs 1.46.5.cr1 (16-Jan-2023) Inode: 481925412 Type: regular Mode: 0644 Flags: 0x80000 Generation: 2621827007 Version: 0x000005f9:19d97c78 User: 1356 Group: 11121 Project: 0 Size: 131072 File ACL: 620866789 Links: 1 Blockcount: 264 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x671ad2fa:00000000 -- Thu Oct 24 18:06:34 2024 atime: 0x671f95bc:00000000 -- Mon Oct 28 08:46:36 2024 mtime: 0x671ad2fa:00000000 -- Thu Oct 24 18:06:34 2024 crtime: 0x671ad189:07bed478 -- Thu Oct 24 18:00:25 2024 Size of extra inode fields: 32 Extended attributes: lma: fid=[0x2403375e4:0x14053:0x0] compat=0 incompat=0 linkea: idx=0 parent=[0x240337296:0x454:0x0] name='IORfile_1m' trusted.dataver (8) = 78 7c d9 19 f9 05 00 00 trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 e8 07 00 00 00 10 01 f4 03 00 00 00 00 trusted.lov (1216) EXTENTS: (0-31):671298939-671298970
[root@kjcf05n03 ~]# jupiter-p1:/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1 m # lfs path2fid /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m [0x2403375e4:0x14053:0x0] jupiter-p1:/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1 m #
[root@kjcf05n03 ~]# debugfs -c -R " dump <481925412> /tmp/mdt.IORfile.1024" /dev/md66 debugfs 1.46.5.cr1 (16-Jan-2023) [root@kjcf05n03 ~]# hexdump -Cv /tmp/mdt.IORfile.1024 | head -n3 00000000 76 d1 1a 67 00 00 00 00 08 00 00 00 00 00 00 00 |v..g............| 00000010 76 d1 1a 67 00 00 00 00 18 00 00 00 00 00 00 00 |v..g............| 00000020 76 d1 1a 67 00 00 00 00 28 00 00 00 00 00 00 00 |v..g....(.......| [root@kjcf05n03 ~]# hexdump -Cv /tmp/mdt.IORfile.1024 | grep -v '76 d1 1a 67' 00020000 [root@kjcf05n03 ~]#
00020000 > 131072/1024 = 128k.