  Lustre / LU-18416

Data corruption/miscompare observed during 48hr FOFB

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.16.0
    • Fix Version/s: None
    • Severity: 3

    Description

      Data corruption / miscompare observed during a 48-hour Lustre FOFB (failover/failback) test.

       [63] FAILED comparison of buffer containing 8-byte ints:
       [63]   File name = /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
       [63]   In transfer 129, 131072 errors between buffer indices 0 and 131071.
       [63]   File byte offset = 33957085184:
       [63]     Expected: 0x0000003f671ad176 0000000000000008 0000003f671ad176 0000000000000018 
       [63]     Actual:   0x6161616161616161 6161616161616161 6161616161616161 6161616161616161 
       ** error **
       ERROR in IOR.c (line 448): data check error, aborting execution.
       ERROR: No such file or directory
       ** exiting **
       Rank 63 [Thu Oct 24 18:07:48 2024] [c0-0c2s11n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 63
       _pmiu_daemon(SIGCHLD): [NID 00173] [c0-0c2s11n1] [Thu Oct 24 18:07:49 2024] PE RANK 63 exit signal Aborted
       [NID 00173] 2024-10-24 18:07:49 Apid 8319126: initiated application termination 

      IOR output around the first occurrence of the issue, with timestamps:

       read      5759       524288     1024.00    0.042503   5.65       0.000407   5.69       6    XXCEL
       Using Time Stamp 1729810806 (0x671ad176) for Data Signature
       delaying 15 seconds . . .
       Commencing write performance test.
       Thu Oct 24 18:00:25 2024
       
       WARNING: Expected aggregate file size       = 34359738368.
       WARNING: Stat() of aggregate file size      = 33957085184.
       WARNING: Using actual aggregate bytes moved = 34359738368.
       write     74.34      524288     1024.00    0.010731   440.78     0.002005   440.79     7    XXCEL
       Verifying contents of the file(s) just written.
       Thu Oct 24 18:07:45 2024
       
       [63] At transfer buffer #128, index #0 (file byte offset 33957085184):
       [63] Expected: 0x0000003f671ad176
       [63] Actual:   0x6161616161616161
       [63] At transfer buffer #128, index #1 (file byte offset 33957085192):
       [63] Expected: 0x0000000000000008
       [63] Actual:   0x6161616161616161
       [63] At transfer buffer #128, index #2 (file byte offset 33957085200): 

      FOFB operations around the time of the corruption:

      2024-10-24t17:45:15 Test 17  -- operation kill failover of kjcf05n03
      2024-10-24t18:11:14 Test 17  -- operation kill failback of kjcf05n03 

      Recovery after kjcf05n03 Failover:

      Oct 24 18:04:49 kjcf05n06 kernel: Lustre: kjcf05-MDT0001-lwp-OST0002: Connection restored to 10.16.100.52@o2ib (at 10.16.100.52@o2ib)
      Oct 24 18:05:31 kjcf05n02 kernel: LustreError: 1970850:0:(mdt_open.c:1442:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid  req@00000000cb79bd48 x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:341/0 lens 520/4576 e 0 to 0 dl 1729811181 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:05:31 kjcf05n02 kernel: LustreError: 1970850:0:(mdt_handler.c:5119:mdt_intent_open()) @@@ Replay open failed with -14  req@00000000cb79bd48 x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:341/0 lens 520/608 e 0 to 0 dl 1729811181 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:05:33 kjcf05n00 kernel: Lustre: kjcf05-MDT0001-mdc-ffff9734e40bd800: Connection restored to 10.16.100.52@o2ib (at 10.16.100.52@o2ib) 

      Recovery after kjcf05n03 Failback:

      Oct 24 18:27:05 kjcf05n03 kernel: LustreError: 52805:0:(mdt_open.c:1442:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid  req@00000000b48954bd x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:119/0 lens 520/4176 e 0 to 0 dl 1729812474 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:27:05 kjcf05n03 kernel: LustreError: 52805:0:(mdt_handler.c:5119:mdt_intent_open()) @@@ Replay open failed with -14  req@00000000b48954bd x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:119/0 lens 520/608 e 0 to 0 dl 1729812474 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:27:14 kjcf05n00 kernel: Lustre: kjcf05-MDT0001-mdc-ffff9734e40bd800: Connection restored to 10.16.100.53@o2ib (at 10.16.100.53@o2ib)
      Oct 24 18:27:14 kjcf05n02 kernel: Lustre: kjcf05-MDT0001-osp-MDT0000: Connection restored to 10.16.100.53@o2ib (at 10.16.100.53@o2ib) 

      Timestamps for the IOR job:

      <150>1 2024-10-24T17:52:40.433638-05:00 c0-0c0s1n1 aprun 25706 p1-20241023t024851 [alps_msgs@34] apid=8319126, Starting, user=1356, cmd_line="aprun -n 64 /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR -o /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m -a MPIIO -i 8 -w -r -W -t 1m -b 512m -C -k -vv -q -d 15 ", num_nodes=2, node_list=172-173, cwd="/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.25633", wdir="/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.25633"
      <150>1 2024-10-24T18:07:53.597430-05:00 c0-0c0s1n1 apsys 25844 p1-20241023t024851 [alps_msgs@34] apid=8319126, Finishing, user=1356, exit_code=137, exitcode_array=134:0, exitsignal_array=9:0
      <142>1 2024-10-24T18:07:55.386375-05:00 c0-0c0s1n1 RUR 29745 p1-20241023t024851 [RUR@34] uid: 1356, apid: 8319126, jobid: 0, cmdname: /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR, plugin: taskstats {"core": 0, "exitcode:signal": 0, "rur_errors": 0, "rur_truncated_record_errors": 0, "max_rss": 0, "stime": 0, "wchar": 0, "rchar": 0, "utime": 0}
      <142>1 2024-10-24T18:07:55.386441-05:00 c0-0c0s1n1 RUR 29745 p1-20241023t024851 [RUR@34] uid: 1356, apid: 8319126, jobid: 0, cmdname: /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR, plugin: timestamp APP_START 2024-10-24T17:52:40CDT APP_STOP 2024-10-24T18:07:54CDT 
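
      For reference, the key IOR options in the aprun command above, interpreted per standard IOR usage (the binary here is a Cray-built IOR, so this is an assumption about its option handling), together with the size arithmetic behind the "Expected aggregate file size" warning:

       # -a MPIIO   : use the MPI-IO backend (the POSIX variant is discussed in the comments below)
       # -i 8       : 8 iterations
       # -w -r -W   : write, read, and verify (read back and check) each iteration
       # -t 1m      : 1 MiB transfer size
       # -b 512m    : 512 MiB block per task
       # -C         : reorder tasks so data is read back by a different task than wrote it
       # -k         : keep the test file
       # -d 15      : delay 15 seconds between tests (matches "delaying 15 seconds" above)
       # 64 tasks x 512 MiB per task = 32 GiB, matching the expected aggregate size:
       echo $((64 * 512 * 1024 * 1024))    # 34359738368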

      lfs getstripe output:

      ./CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
         lcm_layout_gen:    7
         lcm_mirror_count:  1
         lcm_entry_count:   4
           lcme_id:             1
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 0
           lcme_extent.e_end:   131072
             lmm_stripe_count:  0
             lmm_stripe_size:   131072
             lmm_pattern:       mdt
             lmm_layout_gen:    65535
             lmm_stripe_offset: 0
       
           lcme_id:             2
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 131072
           lcme_extent.e_end:   134217728
             lmm_stripe_count:  1
             lmm_stripe_size:   131072
             lmm_pattern:       raid0
             lmm_layout_gen:    65535
             lmm_stripe_offset: 2
             lmm_pool:          flash
             lmm_objects:
             -   0: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f1:0x0] }
       
           lcme_id:             4
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 134217728
           lcme_extent.e_end:   34359738368
             lmm_stripe_count:  32
             lmm_stripe_size:   131072
             lmm_pattern:       raid0,overstriped
             lmm_layout_gen:    65535
             lmm_stripe_offset: 2
             lmm_pool:          flash
             lmm_objects:
             -   0: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f2:0x0] }
             -   1: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c414:0x0] }
             -   2: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f3:0x0] }
             -   3: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c415:0x0] }
             -   4: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f4:0x0] }
             -   5: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c416:0x0] }
             -   6: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f5:0x0] }
             -   7: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c417:0x0] }
             -   8: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f6:0x0] }
             -   9: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c418:0x0] }
             -  10: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f7:0x0] }
             -  11: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c419:0x0] }
             -  12: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f8:0x0] }
             -  13: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41a:0x0] }
             -  14: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f9:0x0] }
             -  15: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41b:0x0] }
             -  16: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fa:0x0] }
             -  17: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41c:0x0] }
             -  18: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fb:0x0] }
             -  19: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41d:0x0] }
             -  20: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fc:0x0] }
             -  21: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41e:0x0] }
             -  22: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fd:0x0] }
             -  23: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41f:0x0] }
             -  24: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fe:0x0] }
             -  25: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c420:0x0] }
             -  26: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44ff:0x0] }
             -  27: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c421:0x0] }
             -  28: { l_ost_idx:   2, l_fid: [0x300007930:0xfa4500:0x0] }
             -  29: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c422:0x0] }
             -  30: { l_ost_idx:   2, l_fid: [0x300007930:0xfa4501:0x0] }
             -  31: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c423:0x0] }
       
           lcme_id:             5
           lcme_mirror_id:      0
           lcme_flags:          extension
           lcme_extent.e_start: 34359738368
           lcme_extent.e_end:   EOF
             lmm_stripe_count:  0
             lmm_extension_size: 67108864
             lmm_pattern:       raid0,overstriped
             lmm_layout_gen:    0
             lmm_stripe_offset: -1
             lmm_pool:          flash 

      debugfs stat:

      [root@kjcf05n03 ~]# debugfs -c -R "stat <481925412>" /dev/md66
      debugfs 1.46.5.cr1 (16-Jan-2023)
      Inode: 481925412  Type: regular  Mode: 0644  Flags: 0x80000
      Generation: 2621827007  Version: 0x000005f9:19d97c78
      User: 1356  Group: 11121  Project:   0  Size: 131072
      File ACL: 620866789
      Links: 1  Blockcount: 264
      Fragment: Address: 0  Number: 0  Size: 0
       ctime: 0x671ad2fa:00000000 -- Thu Oct 24 18:06:34 2024
       atime: 0x671f95bc:00000000 -- Mon Oct 28 08:46:36 2024
       mtime: 0x671ad2fa:00000000 -- Thu Oct 24 18:06:34 2024
      crtime: 0x671ad189:07bed478 -- Thu Oct 24 18:00:25 2024
      Size of extra inode fields: 32
      Extended attributes:
       lma: fid=[0x2403375e4:0x14053:0x0] compat=0 incompat=0
       linkea: idx=0 parent=[0x240337296:0x454:0x0] name='IORfile_1m'
       trusted.dataver (8) = 78 7c d9 19 f9 05 00 00 
       trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 e8 07 00 00 00 10 01 f4 03 00 00 00 00 
       trusted.lov (1216)
      EXTENTS:
      (0-31):671298939-671298970
      
      [root@kjcf05n03 ~]#
      jupiter-p1:/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1
      m # lfs path2fid /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
      [0x2403375e4:0x14053:0x0]
      jupiter-p1:/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1
      m # 
      
      [root@kjcf05n03 ~]# debugfs -c -R " dump <481925412> /tmp/mdt.IORfile.1024" /dev/md66
      debugfs 1.46.5.cr1 (16-Jan-2023)
      [root@kjcf05n03 ~]# hexdump -Cv /tmp/mdt.IORfile.1024 | head -n3
      00000000  76 d1 1a 67 00 00 00 00  08 00 00 00 00 00 00 00  |v..g............|
      00000010  76 d1 1a 67 00 00 00 00  18 00 00 00 00 00 00 00  |v..g............|
      00000020  76 d1 1a 67 00 00 00 00  28 00 00 00 00 00 00 00  |v..g....(.......|
      [root@kjcf05n03 ~]# hexdump -Cv /tmp/mdt.IORfile.1024 | grep -v '76 d1 1a 67'
      00020000
      [root@kjcf05n03 ~]#
      

      Every data line of the dump carries the expected signature bytes (76 d1 1a 67); the only non-matching line is hexdump's final offset marker, 0x00020000 = 131072 bytes = 128 KiB, i.e. the full size of the DoM component.
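
      The offset and signature arithmetic can be double-checked with standard shell tools (a quick sketch, nothing Lustre-specific):

       # hexdump's end marker 0x20000 is the MDT object size: 131072 bytes = 128 KiB,
       # matching the 0-128K DoM component and the inode Size shown by debugfs above.
       echo $((0x20000))           # 131072
       echo $((0x20000 / 1024))    # 128

       # The IOR data signature 0x671ad176 is the write-phase timestamp
       # ("Using Time Stamp 1729810806 (0x671ad176) for Data Signature" above);
       # stored little-endian it appears in the hexdump as the byte sequence 76 d1 1a 67.
       echo $((0x671ad176))        # 1729810806
       date -u -d @1729810806      # 2024-10-24 23:00:06 UTC, i.e. 18:00:06 CDT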

      Attachments

        1. alternate_posix_signature_Command.out15537.txt
          1.39 MB
        2. bad_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_20241107005243.txt
          286 kB
        3. bad_dklog.nid00037.20241107005243.llog.gz
          57.30 MB
        4. bad_dklog.nid00037.20241107005243.llog-1.gz
          57.30 MB
        5. bad_dklog.nid00038.20241107005243.llog.gz
          46.37 MB
        6. bad_dklog.nid00038.20241107005243.llog-1.gz
          46.37 MB
        7. bad_dklog.nid00039.20241107005243.llog.gz
          37.98 MB
        8. bad_dklog.nid00039.20241107005243.llog-1.gz
          37.98 MB
        9. CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584.txt
          297 kB
        10. CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.txt
          207 kB
        11. CL_IOR_sel_32ovs_all_mpiio_wr_8iter_1m.txt
          206 kB
        12. client_dk.tar.gz
          153.44 MB
        13. client_dk.tar-1.gz
          153.44 MB
        14. client_dmesg.tar.gz
          54 kB
        15. client_dmesg.tar-1.gz
          54 kB
        16. client_logs_20241111.tar.gz
          185.82 MB
        17. data.20241108.tar.gz
          369.31 MB
        18. dmesg.log
          150 kB
        19. good_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_20241106181830_output.txt
          201 kB
        20. good_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_202411077072444.txt
          7 kB
        21. good_dklog.nid00172_202411077072444.log.tar.gz
          91.56 MB
        22. good_dklog.nid00172.20241106181830.llog.gz
          82.55 MB
        23. good_dklog.nid00173_202411077072444.log.tar.gz
          91.21 MB
        24. good_dklog.nid00173.20241106181830.llog.gz
          83.21 MB
        25. kjcf05_llog.tar.gz
          581 kB
        26. kjcf05n02_llog.tar.gz
          68 kB
        27. kjcf05n03_llog.tar.gz
          56 kB
        28. mds_mgs_dklog.tar.gz
          3.03 MB
        29. mds_mgs_dklog.tar-1.gz
          3.03 MB
        30. mds_mgs_dmesg.tar.gz
          70 kB
        31. mgs_mds_logs_20241111.tar.gz
          11.69 MB
        32. README.lu18416
          41 kB
        33. test_output_20241111.txt
          298 kB

        Issue Links

          Activity

            [LU-18416] Data corruption/miscompare observed during 48hr FOFB
            spitzcor Cory Spitz added a comment -

            > It looks like D_HA is not enabled on those clients - it is by default in master/2.16 but I guess HPE turns it off?
            paf, no, bzzz asked for a debug mask without D_HA:

            please try to collect again with debug=vfstrace (not +vfstrace)

            Sure, we can add D_HA to clients and +layout on the MDS.

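            For reference, the debug settings being discussed here map onto lctl set_param roughly as follows (a minimal sketch; the exact masks to use are whatever is agreed in this thread):

            # Clients: replace the debug mask entirely (what bzzz asked for) ...
            lctl set_param debug=vfstrace
            # ... or add D_HA on top of the existing mask (what paf asked for);
            # '+flag' adds to the mask, '=flag' replaces it:
            lctl set_param debug=+ha

            # MDS: add layout debugging so SEL/PFL layout changes are logged:
            lctl set_param debug=+layout

            # Dump the kernel debug buffer to a file for attachment:
            lctl dk /tmp/lustre-debug.$(hostname).log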
            bzzz Alex Zhuravlev added a comment - - edited

            > It looks like D_HA is not enabled on those clients - it is by default in master/2.16 but I guess HPE turns it off? - so Alex's additional debug from lov_io.c was not captured.

            I tried to enable full debugging on the condition in the patch, but it didn't help, given that the test needs a few runs to reproduce.


            paf Patrick Farrell (Inactive) added a comment -

            Can we please run this with +layout debug enabled on the MDS so we can see what's happening in SEL?

            paf Patrick Farrell (Inactive) added a comment -

            It looks like D_HA is not enabled on those clients - it is by default in master/2.16 but I guess HPE turns it off? - so Alex's additional debug from lov_io.c was not captured. Could you please enable +ha in the debug on the client?

            bzzz Alex Zhuravlev added a comment -

            I can add an LASSERT for the case, but I'm not sure whether you'll be able to fetch Lustre logs afterwards; that would be an exact catch, though.
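
            On the "fetch lustre logs after" concern: whether the debug buffer survives an LASSERT depends on panic behaviour. One option, sketched here with standard lctl knobs (not something prescribed anywhere in this ticket), is to disable the panic and/or stream the debug log to disk:

            # Keep the node from panicking on LBUG/LASSERT so the debug buffer
            # can still be dumped afterwards:
            lctl set_param panic_on_lbug=0

            # Or stream the debug log continuously to a file (size in MB) so it
            # survives even if the node goes down:
            lctl debug_daemon start /var/tmp/lustre-debug.bin 1024
            # ... reproduce ...
            lctl debug_daemon stop
            lctl debug_file /var/tmp/lustre-debug.bin /var/tmp/lustre-debug.txt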

            "
            so this might also provide another avenue for debugging the issue if it's easier to debug the POSIX than the MPIIO. This is probably a little earlier / less misleading since the root cause of the MPIIO "miscompare" may just be this partial write that retries many times and then returns 0 as you noted causing IOR to think it was successful and then discover it later during a read and declare it a miscompare.
            "
            IOR wouldn't consider a return of 0 to be success - that's what it's complaining about elsewhere here.

            paf Patrick Farrell (Inactive) added a comment - " so this might also provide another avenue for debugging the issue if it's easier to debug the POSIX than the MPIIO. This is probably a little earlier / less misleading since the root cause of the MPIIO "miscompare" may just be this partial write that retries many times and then returns 0 as you noted causing IOR to think it was successful and then discover it later during a read and declare it a miscompare. " IOR wouldn't consider a return of 0 to be success - that's what it's complaining about elsewhere here.
            prasannakumar Prasannakumar Nagasubramani added a comment - - edited

            bzzz Reproduced the issue with the https://review.whamcloud.com/c/fs/lustre-release/+/56926 patch on the clients. Attaching logs collected with the recommended debug settings (vfstrace): mds_mgs_dmesg.tar.gz mds_mgs_dklog.tar.gz client_dmesg.tar.gz client_dk.tar.gz kjcf05n02_llog.tar.gz kjcf05n03_llog.tar.gz CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584.txt

            bzzz Alex Zhuravlev added a comment - - edited

            > However, when we just removed the setstripe entirely and used the default layout, e.g., we can no longer reproduce the problem.

            Exactly, this is what I asked for above when I mentioned a "more regular layout" - a good sign.

            > Prior to the bisect we tried a few other things that people have suggested and one of those was just to use the default layout; so far we have not been able to reproduce the problem that way, so that leads me to suspect that PFL/SEL is related somehow.

            Yes, the retries I described are due to the changing PFL layout.

            Please try to reproduce with https://review.whamcloud.com/c/fs/lustre-release/+/56926 and collect debug logs from the client. debug=vfstrace is fine; the patch raises the debugging level internally during retries, so hopefully we will get detailed info.
            jschwartz Josh Schwartz added a comment - - edited

            I think the following looks extremely suspicious:

            > ...

            > basically we try to restart write 1000 times with no success (because of losing layout lock, but this is file per process and no cancel/enqueue for LL has happened) and just exit with result=0 finally.
            > I'm not very familiar with this code, still digging.. any advice is welcome.

            This caught my attention because one of the other things we tried was running this same test with POSIX rather than MPIIO. We were unable to reproduce the miscompare signature and concluded that the problem was MPIIO-specific, but we have also been dealing with a number of other (presumably) unrelated failures in these tests that generate a lot of noise. Going back through the prior results in more detail, I noticed that the POSIX runs had failures that were not miscompares but a different case, where IOR failed after exhausting retries much like you describe above, which I wanted to point out:

             WARNING: Task 63 requested transfer of 1048576 bytes,
                      but transferred 0 bytes at offset 33957085184
             WARNING: This file system requires support of partial write()s.
             WARNING: Requested xfer of 1048576 bytes, but xferred 0 bytes
             Only transferred 0 of 1048576 bytes
            

            and it eventually fails with:

            WARNING: Task 63 requested transfer of 1048576 bytes,
                      but transferred 0 bytes at offset 33957085184
             ** error **
             ERROR in aiori-POSIX.c (line 266): too many retries -- aborting.
             ERROR: No such file or directory
             ** exiting **
            

            See the attached complete log: alternate_posix_signature_Command.out15537.txt

            So this might also provide another avenue for debugging the issue if it's easier to debug the POSIX than the MPIIO. This is probably a little earlier / less misleading, since the root cause of the MPIIO "miscompare" may just be this partial write that retries many times and then returns 0 as you noted, causing IOR to think it was successful, only to discover it later during a read and declare it a miscompare.

            jschwartz Josh Schwartz added a comment - - edited

            Prior to the bisect we tried a few other things that people have suggested and one of those was just to use the default layout; so far we have not been able to reproduce the problem that way, so that leads me to suspect that PFL/SEL is related somehow.

            Here are the layouts we have tried:

            original failure (DoM+SEL+OVS):

            lfs setstripe --component-end 128K -L mdt --component-end 128M --extension-size 64M --component-end -1 --overstripe-count 32 --extension-size 64M $dirname 

            removed DoM, still failed:

            lfs setstripe --component-end 128M --extension-size 64M --component-end -1 --overstripe-count 32 --extension-size 64M $dirname 

            removed OVS, still failed:

            lfs setstripe --component-end 128M --extension-size 64M --component-end -1 --extension-size 64M $dirname 

            However, when we just removed the setstripe entirely and used the default layout, e.g.

             vers@jupiter-p1:/lus/kjcf05/flash/jbs/rerun/CL_IOR_b256m_1s_fpp_mpiio_wr_8iter_1m_LUS_12584> lfs getstripe .
            .
            stripe_count:  1 stripe_size:   4194304 pattern:       raid0 stripe_offset: -1 pool:          flash
            
            ./IORfile_1m.00000059
            lmm_stripe_count:  1
            lmm_stripe_size:   4194304
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 3
            lmm_pool:          flash
                    obdidx           objid           objid           group
                         3        14734531       0xe0d4c3      0x3400088d1
            
            ...

            we can no longer reproduce the problem.

             

            You requested:

            2) with a simpler layout like a single component, or a few components, but no extension?
            thanks in advance.

            Can you provide a specific layout you would like to see that you suspect would still hit the problem but be sufficiently simple to debug?

            We can use up to four OSTs: two are flash and two are disk. The problem seems to occur more often/readily on the disk OSTs, so a layout limited to those two OSTs will likely yield the best reproducibility:

            Pool: kjcf05.all
            kjcf05-OST0000_UUID
            kjcf05-OST0001_UUID
            kjcf05-OST0002_UUID
            kjcf05-OST0003_UUID
            Pool: kjcf05.disk
            kjcf05-OST0000_UUID
            kjcf05-OST0001_UUID
            ...
            Pool: kjcf05.flash
            kjcf05-OST0002_UUID
            kjcf05-OST0003_UUID
            
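            Purely for illustration (this is not a layout proposed anywhere in the ticket): a simplified two-component PFL layout pinned to the disk pool, with no DoM and no SEL extension components, could be expressed as follows; the pool name is taken from the listing above and the component boundary is arbitrary:

            # Plain PFL, no DoM, no SEL, restricted to the two disk OSTs:
            lfs setstripe -E 128M -c 1 -p kjcf05.disk \
                          -E -1   -c 2 -p kjcf05.disk \
                          $dirname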

            bzzz Alex Zhuravlev added a comment -

            prasannakumar: next time, please also grab dmesg logs from the MDSs.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: prasannakumar Prasannakumar Nagasubramani
              Votes: 0
              Watchers: 13

              Dates

                Created:
                Updated:
                Resolved: