[LU-15776] 2.15 RC3: lost writes during server fofb by forced panics Created: 22/Apr/22 Updated: 01/Jun/22 Resolved: 01/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Prabu Manoharan | Assignee: | Alexey Lyashkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | corruption |
| Environment: |
Server Details:
[root@snx11922n000 admin]# cscli fs_info
[root@snx11922n000 admin]# cscli show_node_versions -g all
Lustre Version :
Debug_mb Size :
Subsystem_debug:
saturn-p2:/ # pcmd -s -n ALL_COMPUTE "mount -t lustre"
saturn-p2:/ # xtprocadmin| grep compute |wc
No pending applications are present
Total placed applications: 6
No applications or reservations are being cleaned up |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Data miscompare seen during server failover (triggered by forced panics) for an IOR job (ill-formed IO). Error output:
[13] Expected: 0x0000000000029ff8
[13] Actual: 0x0000000000000000
[13] FAILED comparison of buffer containing 8-byte ints:
[13] File name = /lus/snx11922/ost2/ostest.vers/alsorun.20220421145202.27664.saturn-p2/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse.1.qSxF7f.1650570751/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse/IORfile.00000021
[13] In transfer 152, 21504 errors between buffer indices 0 and 21503.
[13] File byte offset = 37576704:
[13] Expected: 0x000000156261b63a 0000000000000008 000000156261b63a 0000000000000018
[13] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
ERROR in IOR.c (line 448): data check error, aborting execution. ERROR: No such file or directory
Clients involved during this corruption:
task 13 on nid00097
task 21 on nid00099
saturn-p2:/ # lfs getstripe -r -y /lus/snx11922/ost2/ostest.vers/alsorun.20220421145202.27664.saturn-p2/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse.1.qSxF7f.1650570751/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse/IORfile.00000021
lcm_layout_gen: 7
lcm_mirror_count: 1
lcm_entry_count: 5
component0:
  lcme_id: 1
  lcme_mirror_id: 0
  lcme_flags: init
  lcme_extent.e_start: 0
  lcme_extent.e_end: 1048576
  sub_layout:
    lmm_stripe_count: 0
    lmm_stripe_size: 1048576
    lmm_pattern: mdt
    lmm_layout_gen: 0
    lmm_stripe_offset: 0
component1:
  lcme_id: 2
  lcme_mirror_id: 0
  lcme_flags: init
  lcme_extent.e_start: 1048576
  lcme_extent.e_end: 269484032
  sub_layout:
    lmm_stripe_count: 1
    lmm_stripe_size: 1048576
    lmm_pattern: raid0
    lmm_layout_gen: 0
    lmm_stripe_offset: 3
    lmm_pool: ost2
    lmm_objects:
      - l_ost_idx: 3
        l_fid: 0x100030000:0x83fae34:0x0
component2:
  lcme_id: 3
  lcme_mirror_id: 0
  lcme_flags: extension
  lcme_extent.e_start: 269484032
  lcme_extent.e_end: 536870912
  sub_layout:
    lmm_stripe_count: 0
    lmm_extension_size: 134217728
    lmm_pattern: raid0
    lmm_layout_gen: 0
    lmm_stripe_offset: -1
    lmm_pool: ost2
component3:
  lcme_id: 4
  lcme_mirror_id: 0
  lcme_flags: 0
  lcme_extent.e_start: 536870912
  lcme_extent.e_end: 536870912
  sub_layout:
    lmm_stripe_count: -1
    lmm_stripe_size: 1048576
    lmm_pattern: raid0
    lmm_layout_gen: 0
    lmm_stripe_offset: -1
    lmm_pool: ost2
component4:
  lcme_id: 5
  lcme_mirror_id: 0
  lcme_flags: extension
  lcme_extent.e_start: 536870912
  lcme_extent.e_end: EOF
  sub_layout:
    lmm_stripe_count: 0
    lmm_extension_size: 268435456
    lmm_pattern: raid0
    lmm_layout_gen: 0
    lmm_stripe_offset: -1
    lmm_pool: ost2
Console Logs:
2022-04-21T14:55:08.086885-05:00 c0-0c1s8n0 Lustre: 9811:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887] req@000000007bfb28c1 x1730745432703552/t0(0) o400->snx11922-OST0003-osc-ffff88878818e800@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'' 2022-04-21T14:55:08.086967-05:00 c0-0c1s8n0 Lustre: snx11922-OST0003-osc-ffff88878818e800: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete 2022-04-21T14:55:08.086988-05:00 c0-0c1s8n0 Lustre: 9809:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887] req@000000008ee911d3 x1730745432703488/t0(0) o400->snx11922-OST0002-osc-ffff88878818e800@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'' 2022-04-21T14:55:08.087004-05:00 c0-0c1s8n0 Lustre: snx11922-OST0002-osc-ffff88878818e800: Connection to snx11922-OST0002 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete 2022-04-21T14:55:08.087023-05:00 c0-0c1s8n2 Lustre: 9808:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887] req@00000000e7ca3f51 x1730745432704128/t0(0) o400->snx11922-OST0003-osc-ffff8887854c2000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'' 2022-04-21T14:55:08.087040-05:00 c0-0c1s8n2 Lustre: snx11922-OST0003-osc-ffff8887854c2000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete 2022-04-21T14:55:08.087076-05:00 c0-0c1s8n1 Lustre: 9797:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570889/real 1650570889] req@0000000039643501 x1730745432705728/t0(0) o400->snx11922-OST0003-osc-ffff88838195c000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570905 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'' 2022-04-21T14:55:08.087086-05:00 c0-0c1s8n1 Lustre: snx11922-OST0003-osc-ffff88838195c000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete 2022-04-21T14:55:08.087114-05:00 c0-0c1s8n3 Lustre: 9786:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570889/real 1650570889] req@000000008cb8442c x1730745432700352/t0(0) o400->snx11922-OST0002-osc-ffff888386f33000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570905 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'' 2022-04-21T14:55:08.087129-05:00 c0-0c1s8n3 Lustre: snx11922-OST0003-osc-ffff888386f33000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete
Kern logs : Apr 21 14:55:25 snx11922n004 kernel: md: md3: resync interrupted. Apr 21 14:55:26 snx11922n004 kernel: LDISKFS-fs (md3): recovery complete Apr 21 14:55:26 snx11922n004 kernel: LDISKFS-fs (md3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=panic,errors=panic,journal_checksum,no_mbcache,nodelalloc Apr 21 14:55:27 snx11922n003 kernel: Lustre: snx11922-MDT0001: Received LWP connection from 10.12.0.52@o2ib4, removing former export from 10.12.0.53@o2ib4 Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: Imperative Recovery enabled, recovery window shrunk from 900-2700 down to 450-2700 Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: in recovery but waiting for the first client to connect Apr 21 14:55:27 snx11922n004 kernel: md: resync of RAID array md3 Apr 21 14:55:27 snx11922n004 kernel: [204833]:md.c:10632:md_do_sync(): md: md3: resync range: [0, 30316376064) Apr 21 14:55:27 snx11922n002 kernel: Lustre: snx11922-MDT0000: Received LWP connection from 10.12.0.52@o2ib4, removing former export from 10.12.0.53@o2ib4 Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: Will be in recovery for at least 7:30, or until 16 clients reconnect Apr 21 14:55:40 snx11922n001 kernel: LNet: 2575200:0:(o2iblnd_cb.c:3369:kiblnd_check_conns()) Timed out tx for 10.12.0.53@o2ib4: 2 seconds Apr 21 14:55:40 snx11922n001 kernel: LNet: 2575200:0:(o2iblnd_cb.c:3369:kiblnd_check_conns()) Skipped 1 previous similar message Apr 21 14:55:48 snx11922n003 kernel: Lustre: snx11922-OST0002-osc-MDT0001: Connection restored to 10.12.0.52@o2ib4 (at 10.12.0.52@o2ib4) Apr 21 14:55:48 snx11922n001 kernel: Lustre: snx11922-OST0002-osc-ffff998fa4d8d000: Connection restored to 10.12.0.52@o2ib4 (at 10.12.0.52@o2ib4) Apr 21 14:55:48 snx11922n004 kernel: Lustre: snx11922-OST0002: Recovery over after 0:32, of 16 clients 16 recovered and 0 were evicted.
|
| Comments |
| Comment by Colin Faber [ 22/Apr/22 ] |
|
Are you guys working on this bug now? |
| Comment by Peter Jones [ 25/Apr/22 ] |
|
spitzcor could you please advise if someone from your team is working on this? Is this a blocker for 2.15? |
| Comment by Cory Spitz [ 25/Apr/22 ] |
|
shadow is working on this for HPE, so I've assigned it to him. |
| Comment by Cory Spitz [ 30/Apr/22 ] |
|
At this week's LWG call I reported that our attention was in debugging the client. New information has been brought to light, which suggests that it is a server-side issue. Quoted from Peggy G. at HPE: In an attempt to get more complete debug log data, a change was made to the test run to force panic on all computes as soon as IOR detected data corruption, rather than dump debug logs. This change was made during the middle of the run mentioned in the previous comment by Prabu, under the alsorun.20220426124703.28248.saturn-p2 test dirs. 1119374182 : 0000 2755 6962 0900 0000 08f8 0c00 0000 0000 'Uib............ 0020 2755 6962 0900 0000 18f8 0c00 0000 0000 'Uib............ 0040 2755 6962 0900 0000 28f8 0c00 0000 0000 'Uib....(....... 0060 2755 6962 0900 0000 38f8 0c00 0000 0000 'Uib....8....... 7720 2755 6962 0900 0000 d807 0d00 0000 0000 'Uib............ 7740 2755 6962 0900 0000 e807 0d00 0000 0000 'Uib............ 7760 2755 6962 0900 0000 f807 0d00 0000 0000 'Uib............ 1119374183 : 0000 2755 6962 0900 0000 0800 0000 0000 0000 'Uib............ 0020 2755 6962 0900 0000 1800 0000 0000 0000 'Uib............ 0040 2755 6962 0900 0000 2800 0000 0000 0000 'Uib....(....... 0060 2755 6962 0900 0000 3800 0000 0000 0000 'Uib....8....... ... 7720 2755 6962 0900 0000 d80f 0000 0000 0000 'Uib............ 7740 2755 6962 0900 0000 e80f 0000 0000 0000 'Uib............ 7760 2755 6962 0900 0000 f80f 0000 0000 0000 'Uib............ 1119374335 : 0000 2755 6962 0900 0000 0880 0900 0000 0000 'Uib............ 0020 2755 6962 0900 0000 1880 0900 0000 0000 'Uib............ 0040 2755 6962 0900 0000 2880 0900 0000 0000 'Uib....(....... 0060 2755 6962 0900 0000 3880 0900 0000 0000 'Uib....8....... ... 7720 2755 6962 0900 0000 d88f 0900 0000 0000 'Uib............ 7740 2755 6962 0900 0000 e88f 0900 0000 0000 'Uib............ 7760 2755 6962 0900 0000 f88f 0900 0000 0000 'Uib............ 1119374336 : 0000 2755 6962 0900 0000 0890 0900 0000 0000 'Uib............ 0020 2755 6962 0900 0000 1890 0900 0000 0000 'Uib............ 0040 2755 6962 0900 0000 2890 0900 0000 0000 'Uib....(....... 0060 2755 6962 0900 0000 3890 0900 0000 0000 'Uib....8....... ... 7720 2755 6962 0900 0000 d89f 0900 0000 0000 'Uib............ 7740 2755 6962 0900 0000 e89f 0900 0000 0000 'Uib............ 7760 2755 6962 0900 0000 f89f 0900 0000 0000 'Uib............ All the intervening physical disk blocks show the same, correct timestamp for this iteration of the job (i.e. 0x62695527). None of the other IOR jobs used that same timestamp, so the data must have come from this one. The 'hole' is present according to the inode extents, but the actual data did land on disk. 813111137 : 0000 2755 6962 0300 0000 08f8 0c00 0000 0000 'Uib............ 0020 2755 6962 0300 0000 18f8 0c00 0000 0000 'Uib............ 0040 2755 6962 0300 0000 28f8 0c00 0000 0000 'Uib....(....... 0060 2755 6962 0300 0000 38f8 0c00 0000 0000 'Uib....8....... ... 7720 2755 6962 0300 0000 d807 0d00 0000 0000 'Uib............ 7740 2755 6962 0300 0000 e807 0d00 0000 0000 'Uib............ 7760 2755 6962 0300 0000 f807 0d00 0000 0000 'Uib............ 813111138 : 0000 2755 6962 0300 0000 0800 0000 0000 0000 'Uib............ 0020 2755 6962 0300 0000 1800 0000 0000 0000 'Uib............ 0040 2755 6962 0300 0000 2800 0000 0000 0000 'Uib....(....... 0060 2755 6962 0300 0000 3800 0000 0000 0000 'Uib....8....... ... 7720 2755 6962 0300 0000 d80f 0000 0000 0000 'Uib............ 7740 2755 6962 0300 0000 e80f 0000 0000 0000 'Uib............ 7760 2755 6962 0300 0000 f80f 0000 0000 0000 'Uib............ 
but this only continues through block 813111167: 813111167 : 0000 2755 6962 0300 0000 08d0 0100 0000 0000 'Uib............ 0020 2755 6962 0300 0000 18d0 0100 0000 0000 'Uib............ 0040 2755 6962 0300 0000 28d0 0100 0000 0000 'Uib....(....... 0060 2755 6962 0300 0000 38d0 0100 0000 0000 'Uib....8....... ... 7720 2755 6962 0300 0000 d8df 0100 0000 0000 'Uib............ 7740 2755 6962 0300 0000 e8df 0100 0000 0000 'Uib............ 7760 2755 6962 0300 0000 f8df 0100 0000 0000 'Uib............ 813111168 : 0000 1c55 6962 0f00 0000 0800 0000 0000 0000 .Uib............ 0020 1c55 6962 0f00 0000 1800 0000 0000 0000 .Uib............ 0040 1c55 6962 0f00 0000 2800 0000 0000 0000 .Uib....(....... 0060 1c55 6962 0f00 0000 3800 0000 0000 0000 .Uib....8....... ... 7720 1c55 6962 0f00 0000 d80f 0000 0000 0000 .Uib............ 7740 1c55 6962 0f00 0000 e80f 0000 0000 0000 .Uib............ 7760 1c55 6962 0f00 0000 f80f 0000 0000 0000 .Uib............ Starting at physical block 813111168 within the gap in the object, the data signature is different. It is still obviously IOR data, but with a different timestamp. As it happens, that timestamp, 0x6269551c, is the timestamp used by the 2nd failed IOR job, CL_IOR_dom_sel_all_wr_20iter_834K_rand_noreuse.1.V88PUz.1651069606. That job had 4 corrupt files, one of which was IORfile.00000015, which would have been the writer for this data (3rd 2-byte field in the debugfs blockdump output is the IOR write task #, task #15 writes IORfile.00000015). Since those blocks were free (i.e. not associated with the object for IORfile.00000003 from the other job), they were allocated to this file. IORfile.00000015 object inode does show an extent covering this range: 812573620 : 0000 1c55 6962 0f00 0000 0800 0d00 0000 0000 .Uib............ 0020 1c55 6962 0f00 0000 1800 0d00 0000 0000 .Uib............ 0040 1c55 6962 0f00 0000 2800 0d00 0000 0000 .Uib....(....... 0060 1c55 6962 0f00 0000 3800 0d00 0000 0000 .Uib....8....... ... 7720 1c55 6962 0f00 0000 d807 0000 0000 0000 .Uib............ 7740 1c55 6962 0f00 0000 e807 0000 0000 0000 .Uib............ 7760 1c55 6962 0f00 0000 f807 0000 0000 0000 .Uib............ 812573621 : 0000 1c55 6962 0f00 0000 0808 0000 0000 0000 .Uib............ 0020 1c55 6962 0f00 0000 1808 0000 0000 0000 .Uib............ 0040 1c55 6962 0f00 0000 2808 0000 0000 0000 .Uib....(....... 0060 1c55 6962 0f00 0000 3808 0000 0000 0000 .Uib....8....... ... 7720 1c55 6962 0f00 0000 d817 0000 0000 0000 .Uib............ 7740 1c55 6962 0f00 0000 e817 0000 0000 0000 .Uib............ 7760 1c55 6962 0f00 0000 f817 0000 0000 0000 .Uib............ 812573828 : 0000 1c55 6962 0f00 0000 08f8 0c00 0000 0000 .Uib............ 0020 1c55 6962 0f00 0000 18f8 0c00 0000 0000 .Uib............ 0040 1c55 6962 0f00 0000 28f8 0c00 0000 0000 .Uib....(....... 0060 1c55 6962 0f00 0000 38f8 0c00 0000 0000 .Uib....8....... ... 7720 1c55 6962 0f00 0000 d807 0d00 0000 0000 .Uib............ 7740 1c55 6962 0f00 0000 e807 0d00 0000 0000 .Uib............ 7760 1c55 6962 0f00 0000 f807 0d00 0000 0000 .Uib............ 812573829 : 0000 1c55 6962 0f00 0000 0800 0000 0000 0000 .Uib............ 0020 1c55 6962 0f00 0000 1800 0000 0000 0000 .Uib............ 0040 1c55 6962 0f00 0000 2800 0000 0000 0000 .Uib....(....... 0060 1c55 6962 0f00 0000 3800 0000 0000 0000 .Uib....8....... ... 7720 1c55 6962 0f00 0000 d80f 0000 0000 0000 .Uib............ 7740 1c55 6962 0f00 0000 e80f 0000 0000 0000 .Uib............ 7760 1c55 6962 0f00 0000 f80f 0000 0000 0000 .Uib............ 
Every physical block in that gap has the expected, valid IOR data, matching timestamp, matching task #. |
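[Editorial note] As a hedged illustration of the block-by-block check Peggy and Cory describe above, the C sketch below scans a block image for the IOR data signature. The 16-byte record layout (little-endian 4-byte timestamp, 4-byte task id, 8-byte file offset) is inferred from the debugfs dumps and from the remark that the 3rd 2-byte field is the IOR write task number; it is an assumption, not a documented IOR format, and this is not a tool that was used in the ticket.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if every 16-byte record in the block carries the expected
 * timestamp and task id, 0 otherwise.  Record layout is assumed from the
 * debugfs dumps above; a little-endian host is assumed for the memcpy reads. */
static int block_matches(const unsigned char *blk, size_t len,
                         uint32_t timestamp, uint32_t task)
{
	size_t off;

	for (off = 0; off + 16 <= len; off += 16) {
		uint32_t ts, id;

		memcpy(&ts, blk + off, sizeof(ts));      /* bytes 0-3: timestamp */
		memcpy(&id, blk + off + 4, sizeof(id));  /* bytes 4-7: task id   */
		if (ts != timestamp || id != task)
			return 0;  /* some record does not carry the expected signature */
	}
	return 1;  /* every record matches timestamp and task */
}
For example, the blocks dumped at 1119374182 above would match timestamp 0x62695527 and task 9, which is essentially what the manual comparison in the comments confirms by eye.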
| Comment by Peggy Gazzola [ 02/May/22 ] |
|
There's one correction to the previous comment, about the file IORfile.00000015 from job CL_IOR_dom_sel_all_wr_20iter_834K_rand_noreuse.1.V88PUz.1651069606. The hole in that file: For reference, here's debugfs 'stat' output for IORfile.00000015 ost object: [root@snx11922n004 ~]# debugfs -c -R "stat /O/0/d4/127076516" /dev/md2 debugfs 1.46.2.cr1 (13-Jan-2022) /dev/md2: catastrophic mode - not reading inode or group bitmaps Inode: 1146884 Type: regular Mode: 0666 Flags: 0x80000 Generation: 1271779983 Version: 0x00000141:00000ba2 User: 1356 Group: 11121 Project: 0 Size: 479102976 File ACL: 0 Links: 1 Blockcount: 932048 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x626955dd:00000000 -- Wed Apr 27 09:40:29 2022 atime: 0x00000000:00000000 -- Wed Dec 31 18:00:00 1969 mtime: 0x626955dd:00000000 -- Wed Apr 27 09:40:29 2022 crtime: 0x626954e4:ede07d30 -- Wed Apr 27 09:36:20 2022 Size of extra inode fields: 32 Extended attributes: lma: fid=[0x100010000:0x79308a4:0x0] compat=8 incompat=0 fid: parent=[0x200019284:0xf3a:0x0] stripe=0 stripe_size=1048576 stripe_count=1 component_id=2 component_start=1048576 component_end=536870912 layout_version=0 range=0 EXTENTS: (ETB0):293636096, (256-1250):811688192-811689186, (1251-1535):814882449-814882733, (1536-1663):812577152-812577279, (1664-1919):811673856-811674111, (1920-2047):815736064-81573619 1, (2048-2292):812651520-812651764, (2293-2559):813565312-813565578, (2560-2687):815838976-815839103, (2688-2918):813935756-813935986, (2919-3335):811690855-811691271, (3336-3543) :815174805-815175012, (3544-3839):814450656-814450951, (3840-3960):813526339-813526459, (3961-7807):811691897-811695743, (7808-8130):811671552-811671874, (8131-8191):811696067-811 696127, (8192-8339):811704320-811704467, (8340-8447):815992960-815993067, (8448-8756):815736192-815736500, (8757-9727):811704885-811705855, (9728-9798):811687936-811688006, (9799- 13552):811705927-811709680, (13553-13760):812029184-812029391, (13761-14847):811709889-811710975, (14848-14975):812573696-812573823, (14976-15219):812633581-812633824, (15220-1542 8):811711348-811711556, (15429-15636):815843328-815843 |
| Comment by Andreas Dilger [ 04/May/22 ] |
|
Peggy, not a solution to the problem, but some background info on the OST IO path that may help track down a fix:
If there is a crash after the RDMA and write complete but before the journal transaction is committed, then you will see exactly what you describe: the data is present in the right spot on disk, but not "allocated" to the inode. The client is supposed to replay those RPCs if the server crashed before the allocation transaction was committed to disk, but it doesn't appear that this is happening in this case.
Initially I wondered whether this was a case of uninitialized ext4 extents (where the blocks are allocated, but carry a flag that tells readers to just report "all zero" for those blocks), but it looks like this is a case where the blocks are not allocated to the file at all.
Note that if the clients do replay their BRW RPCs to the OST as they should, there is no guarantee that the filesystem will allocate exactly the same blocks to each inode. I would hope it does, given the proximity of the block numbers, but that might be a reason why the same data is on disk in two places and/or why some of it is allocated to different files.
IMHO, this looks like a problem in the BRW RPC replay mechanism (or lack thereof), and is unlikely to be at the ldiskfs level, unless there is a failure in the commit callbacks that somehow breaks last_committed and implies the BRW RPC was committed to disk when it was not. |
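[Editorial note] For readers unfamiliar with the replay mechanism Andreas refers to, here is a minimal C sketch of the client-side rule: a replied request is kept for replay until the server advertises a last_committed transno at or above the request's reply transno. This is an illustrative paraphrase of the behaviour implemented around ptlrpc_free_committed(), not the actual ptlrpc code; the structure and helper names are assumptions.
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch only: the client keeps every replied-but-uncommitted
 * request on a replay list sorted by the transno returned in the server
 * reply.  Requests are freed only once the server-advertised last_committed
 * reaches that transno; anything newer is resent ("replayed") after the
 * server recovers. */
struct replay_req {
	struct replay_req *rr_next;
	uint64_t           rr_transno;  /* transno from the server reply */
};

static void free_committed(struct replay_req **list, uint64_t peer_last_committed)
{
	while (*list != NULL) {
		struct replay_req *req = *list;

		if (req->rr_transno > peer_last_committed)
			break;  /* not yet committed on the server: keep for replay */

		/* server says this transno reached stable storage: safe to drop */
		*list = req->rr_next;
		free(req);
	}
}
The failure pattern in this ticket is consistent with a BRW request being dropped from such a list (the server-reported last_committed implied it was stable) even though the transaction that allocated the blocks never reached the journal.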
| Comment by Alexey Lyashkov [ 06/May/22 ] |
|
Andreas, the picture is stranger than a simple request being lost or failing during recovery.
00000008:00200020:3.0:1651709243.919899:0:28609:0:(osc_cache.c:1084:osc_extent_make_ready()) extent 00000000c6c59345@{[19456 -> 20015/20479], [1|0|+|locking|wiuY|00000000a3d0090d], [2318336|560|+|-|0000000047a524c6|1024|000000004ef973c1]} make ready
00000008:00100000:3.0:1651709243.920226:0:28609:0:(osc_request.c:1804:osc_brw_prep_request()) brw rpc ffff8887c10f3680/1731924107812288 - object 0x0:137211497 offset 62914560<>83886080
00000100:00100000:6.0:1651709243.920235:0:17045:0:(client.c:1733:ptlrpc_send_new_req()) Sending RPC req@ffff8887c10f3680 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_06:3a768119-6203-4fb5-a916-2d4dabdd8e84:17045:1731924107812288:10.12.2.53@o2ib4001:4:
00000100:00000400:19.0:1651709261.180872:0:17045:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1651709243/real 1651709243] req@ffff8887c10f3680 x1731924107812288/t0(0) o4->snx11922-OST0002-osc-ffff888f8ae7c800@10.12.2.53@o2ib4001:6/4 lens 536/456 e 0 to 1 dl 1651709261 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
00000100:00080000:18.0:1651709866.209843:0:17038:0:(recover.c:218:ptlrpc_wake_delayed()) @@@ waking (set ffff888788289200): req@ffff8887c10f3680 x1731924107812288/t0(0) o4->snx11922-OST0002-osc-ffff888f8ae7c800@10.12.0.52@o2ib4000:6/4 lens 536/456 e 0 to 1 dl 1651709261 ref 2 fl Rpc:XQU/0/ffffffff rc 0/-1 job:''
00000100:00100000:7.0:1651709867.863582:0:17045:0:(client.c:2215:ptlrpc_check_set()) Completed RPC req@ffff8887c10f3680 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_06:3a768119-6203-4fb5-a916-2d4dabdd8e84:17045:1731924107812288:10.12.0.52@o2ib4000:4:
thr ffff8aebb2b88600 - pid 82928
req ffff8ad91927f080
request ffff8ad91927f080
from 0.0.0.117@13:4
to 10.12.2.53@5:4001
xid 1731924107812288:1731924107812291
transno 1262720737376
time 923905344:1651709256:0
flags 3
lens 184 208 24 64 0 0
ptlrpc body(v3) ffff8aed93688698
type 4711
tag 0 - last_xid 1731924107811071
conn 22
opc 4 OST_WRITE
num ios 1 (ffff8aed93688820)
obj [0] : 0:0x82dae69 bufs:4
data segments
0 <> 62914560 : 282624 : 460
1 <> 75153408 : 856064 : 460
2 <> 79691776 : 2293760 : 460
3 <> 83693568 : 192512 : 460
not replied
-----
crash> bt 82928
PID: 82928 TASK: ffff8aeb524097c0 CPU: 15 COMMAND: "ll_ost_io00_442"
#0 [ffffa62e6504f880] __schedule at ffffffff8774e1d4
#1 [ffffa62e6504f918] schedule at ffffffff8774e648
#2 [ffffa62e6504f928] osd_trans_stop at ffffffffc1acce24 [osd_ldiskfs]
#3 [ffffa62e6504fa00] ofd_commitrw_write at ffffffffc18f4453 [ofd]
#4 [ffffa62e6504faa0] ofd_commitrw at ffffffffc18f97b1 [ofd]
#5 [ffffa62e6504fb60] finish_wait at ffffffff86f2e5ac
#6 [ffffa62e6504fbd8] tgt_brw_write at ffffffffc1389bb5 [ptlrpc]
#7 [ffffa62e6504fd50] tgt_request_handle at ffffffffc138ade3 [ptlrpc]
#8 [ffffa62e6504fdd0] ptlrpc_server_handle_request at ffffffffc1337953 [ptlrpc]
#9 [ffffa62e6504fe38] ptlrpc_main at ffffffffc13393c6 [ptlrpc]
#10 [ffffa62e6504ff10] kthread at ffffffff86f043a6
#11 [ffffa62e6504ff50] ret_from_fork at ffffffff8780023f
but part of the data in segment 2 was lost. [15] File byte offset = 79425536: ... the file has 1 hole at byte offset 79425536 ((15429-19390):680934469-680938430, (19456-25645):680938496-680944685); the hole is 65 blocks. The lost extent has a "written" state in the extent cache. That would be possible if some data had been written into extents allocated earlier (IOR writes at random offsets) and the RPC resend had been lost, but the client logs say the RPC completed without errors. |
| Comment by Peter Jones [ 09/May/22 ] |
|
spitzcor peggy shadow have you managed to narrow down when this problem was introduced? |
| Comment by Alexey Lyashkov [ 09/May/22 ] |
|
Peter, we don't have that information yet. Currently we only have confirmation that this request was not executed after recovery, while the client thinks the execution was OK. Peggy is working hard to collect logs showing this, as the issue does not reproduce on every run. |
| Comment by Alexey Lyashkov [ 10/May/22 ] |
|
I have a small update. From the logs obtained yesterday, it looks like the server skips execution of a request with only a single error/warning like:
00000001:00080000:3.0:1652110683.220558:0:1080451:0:(tgt_lastrcvd.c:2000:tgt_txn_stop_cb()) More than one transaction 1408749330374 |
| Comment by Mikhail Pershin [ 11/May/22 ] |
|
I assume that multiple transactions during a write can occur due to the 'restart/retry' logic inside ofd_commitrw_write(). To resolve that, it is enough to set the tti_mult_trans flag so that last_rcvd will be updated properly. I think the better place for that is in tgt_brw_write(), so that both the MDT and OFD code paths are covered:
diff --git a/lustre/target/tgt_handler.c b/lustre/target/tgt_handler.c
index 3198cc8fc1..d81aff759a 100644
--- a/lustre/target/tgt_handler.c
+++ b/lustre/target/tgt_handler.c
@@ -2605,6 +2605,7 @@ static void tgt_warn_on_cksum(struct ptlrpc_request *req,
 int tgt_brw_write(struct tgt_session_info *tsi)
 {
 	struct ptlrpc_request *req = tgt_ses_req(tsi);
+	struct tgt_thread_info *tti = tgt_th_info(tsi->tsi_env);
 	struct ptlrpc_bulk_desc *desc = NULL;
 	struct obd_export *exp = req->rq_export;
 	struct niobuf_remote *remote_nb;
@@ -2846,6 +2847,9 @@ out_commitrw:
 			nob += len;
 		}
+	/* allow multiple transactions to be assigned during write commit */
+	tti->tti_mult_trans = 1;
+
 	/* Must commit after prep above in all cases */
 	rc = obd_commitrw(tsi->tsi_env, OBD_BRW_WRITE, exp, &repbody->oa,
 			  objcount, ioo, remote_nb, npages, local_nb, rc, nob, |
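[Editorial note] As a hedged illustration of what the flag changes: without it, a second transaction started for the same request appears only as the "More than one transaction" warning and the transno recorded for the reply stays the one assigned by the first, aborted attempt; with the flag set, each restarted transaction refreshes the transno bookkeeping. The snippet below is not the real tgt_txn_stop_cb()/tgt_last_rcvd_update() code; apart from tti_mult_trans, the field and function names are assumptions.
#include <stdint.h>

/* Simplified sketch, not the actual lustre/target code. */
struct tti_sketch {
	int      tti_mult_trans; /* set by tgt_brw_write() in the patch above   */
	int      tti_has_trans;  /* assumed: a transaction was already seen     */
	uint64_t tti_transno;    /* assumed: transno recorded for the reply     */
};

static void txn_stop_sketch(struct tti_sketch *tti, uint64_t cur_transno)
{
	if (tti->tti_has_trans && !tti->tti_mult_trans) {
		/* Unexpected extra transaction for one request: only warn and
		 * keep the transno from the first attempt ("More than one
		 * transaction" in the logs above). */
		return;
	}
	/* First transaction, or restarts explicitly allowed: record the
	 * transno of the transaction actually being committed. */
	tti->tti_has_trans = 1;
	tti->tti_transno = cur_transno;
}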
| Comment by Alexey Lyashkov [ 11/May/22 ] |
|
Mike, it looks like you are right about the source of those messages for writes.
00002000:00000002:11.0:1652209931.209295:0:82589:0:(ofd_io.c:1391:ofd_commitrw_write()) retry transaction, retries:2
00080000:00000002:11.0:1652209931.209342:0:82589:0:(osd_io.c:1544:osd_declare_write_commit()) snx11922-OST0002/: inode #1638412 extent_bytes 851393 extents 2 credits 33
00000001:00080000:11.0:1652209931.209400:0:82589:0:(tgt_lastrcvd.c:2000:tgt_txn_stop_cb()) More than one transaction 1451699035998
It does not look so good from a performance point of view :/ |
| Comment by Alexey Lyashkov [ 11/May/22 ] |
|
It looks like this is the bug behind today's corruption hit: a commit lands during the transaction restart, but the previously assigned transno is used for the reply (highly likely).
00000100:00100000:13.0:1652209930.000194:0:82206:0:(service.c:2134:ptlrpc_server_handle_req_in()) got req x1731924859869568
00002000:00100000:13.0:1652209930.000202:0:82206:0:(ofd_dev.c:2619:ofd_rw_hpreq_check()) @@@ snx11922-OST0002 ll_ost_io00_288: refresh rw locks for [0x2c0000400:0x86ece4e:0x0] (9392128->40992767) req@00000000bd622443 x1731924859869568/t0(0) o4->ea32ef78-799e-4d3c-a8e4-6a210cb165c2@102@gni4:440/0 lens 536/0 e 0 to 0 dl 1652209950 ref 1 fl New:/0/ffffffff rc 0/-1 job:''
00002000:00100000:13.0:1652209930.000205:0:82206:0:(ofd_dev.c:2478:ofd_prolong_extent_locks()) Prolong locks for req 00000000bd622443 with x1731924859869568 ext(9392128->40992767)
00000100:00100000:13.0:1652209930.000214:0:82206:0:(service.c:2303:ptlrpc_server_handle_request()) Handling RPC req@00000000bd622443 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost_io00_288:ea32ef78-799e-4d3c-a8e4-6a210cb165c2+191:16994:x1731924859869568:12345-102@gni4:4:
00080000:00000002:13.0:1652209930.519952:0:82206:0:(osd_io.c:1544:osd_declare_write_commit()) snx11922-OST0002/: inode #1638560 extent_bytes 899332 extents 4 credits 65
00080000:00000002:13.0:1652209930.520032:0:82206:0:(osd_handler.c:4814:osd_xattr_set()) Set version 0x15200015f47 (old 0x15200015f44) for inode 1638560
00000001:00000002:13.0:1652209930.520033:0:82206:0:(tgt_lastrcvd.c:1424:tgt_last_rcvd_update()) transno = 1451699035975, last_committed = 1451699035951
00002000:00000002:13.0:1652209930.638804:0:82206:0:(ofd_io.c:1391:ofd_commitrw_write()) retry transaction, retries:1
00080000:00000002:13.0:1652209930.638833:0:82206:0:(osd_io.c:1544:osd_declare_write_commit()) snx11922-OST0002/: inode #1638560 extent_bytes 871323 extents 3 credits 49
00002000:00000002:13.0:1652209930.694482:0:82206:0:(ofd_io.c:1391:ofd_commitrw_write()) retry transaction, retries:2
00080000:00000002:13.0:1652209930.694508:0:82206:0:(osd_io.c:1544:osd_declare_write_commit()) snx11922-OST0002/: inode #1638560 extent_bytes 867716 extents 1 credits 11
>>> 00000001:00000002:8.0:1652209930.697117:0:82431:0:(tgt_lastrcvd.c:1424:tgt_last_rcvd_update()) transno = 1451699035985, last_committed = 1451699035983 <<<< commit hit.
00002000:00000002:13.0:1652209930.727496:0:82206:0:(ofd_io.c:1391:ofd_commitrw_write()) retry transaction, retries:3
00080000:00000002:13.0:1652209930.727526:0:82206:0:(osd_io.c:1544:osd_declare_write_commit()) snx11922-OST0002/: inode #1638560 extent_bytes 854149 extents 1 credits 11
00000100:00100000:13.0:1652209930.832622:0:82206:0:(service.c:2352:ptlrpc_server_handle_request()) Handled RPC req@00000000bd622443 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost_io00_288:ea32ef78-799e-4d3c-a8e4-6a210cb165c2+11:16994:x1731924859869568:12345-102@gni4:4: Request processed in 832406us (832433us total) trans 1451699035975 rc 0/0
00002000:00100000:13.0:1652209930.832630:0:82206:0:(ofd_dev.c:2619:ofd_rw_hpreq_check()) @@@ snx11922-OST0002 ll_ost_io00_288: refresh rw locks for [0x2c0000400:0x86ece4e:0x0] (9392128->40992767) req@00000000bd622443 x1731924859869568/t1451699035975(0) o4->ea32ef78-799e-4d3c-a8e4-6a210cb165c2@102@gni4:440/0 lens 536/456 e 0 to 0 dl 1652209950 ref 1 fl Complete:/0/0 rc 0/0 job:''
00002000:00100000:13.0:1652209930.832650:0:82206:0:(ofd_dev.c:2478:ofd_prolong_extent_locks()) Prolong locks for req 00000000bd622443 with x1731924859869568 ext(9392128->40992767) |
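[Editorial note] To make the suspected race easier to follow, here is a heavily simplified pseudo-C timeline of the path above. It is not the real ofd_commitrw_write()/ptlrpc code: the helpers and control flow are assumptions, and only the ordering of events and the quoted transno are taken from the log.
#include <errno.h>
#include <stdint.h>

/* Stub helpers, defined only to keep the sketch self-contained. */
static void     start_transaction(void) { }
static void     stop_transaction(void)  { }
static uint64_t assign_transno(void)    { return 1451699035975ULL; /* T from the log */ }
static int      send_reply(uint64_t transno, int rc) { (void)transno; return rc; }
static int      write_commit(void)
{
	static int attempt;
	return ++attempt < 3 ? -EDQUOT : 0;  /* pretend: fail twice, then succeed */
}

/* Pseudo-C sketch of the suspected lost-write race (simplified). */
static int commitrw_write_sketch(void)
{
	uint64_t reply_transno = 0;
	int retries = 0;
	int rc;

retry:
	start_transaction();
	if (reply_transno == 0)
		reply_transno = assign_transno();  /* transno T kept for the reply */

	rc = write_commit();
	stop_transaction();

	if (rc != 0 && retries++ < 3)
		goto retry;  /* "retry transaction, retries:N" in the log */

	/* The reply still carries T from the first attempt.  If unrelated
	 * requests committed in the meantime, last_committed may already be
	 * >= T, so the client drops the request from its replay list even
	 * though the transaction that actually allocated the blocks never
	 * reached the journal.  A panic at this point loses the write - the
	 * "hole with data on disk" pattern seen earlier in this ticket. */
	return send_reply(reply_transno, rc);
}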
| Comment by Andreas Dilger [ 19/May/22 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47371 |
| Comment by Colin Faber [ 19/May/22 ] |
|
Hi spitzcor, we have a potential fix: https://review.whamcloud.com/#/c/47371/
Can you please try this patch with your reproducer and let us know if it addresses the problem? Thank you! |
| Comment by Cory Spitz [ 19/May/22 ] |
|
Yes, we will test it and report back next week. |
| Comment by Cory Spitz [ 24/May/22 ] |
|
The reverts that Alexey suggested here and in the review of https://review.whamcloud.com/47371/ have survived 72 hours of the reproducer (also noted in the review), whereas it would fail within 30 minutes otherwise. Performance testing is in progress. The fix from https://review.whamcloud.com/47371/ has also survived an initial level of testing; we're looking to complete at least 24 hours here as well. We can provide another update later this week. In the meantime, the discussion in the review of https://review.whamcloud.com/47371 should continue. |
| Comment by Colin Faber [ 26/May/22 ] |
|
Hi spitzcor Any further updates? |
| Comment by Cory Spitz [ 26/May/22 ] |
|
45 hours of testing the reproducer with https://review.whamcloud.com/47371/ have completed successfully, without failure. |
| Comment by Cory Spitz [ 31/May/22 ] |
|
Status update from HPE: additional regression tests of https://review.whamcloud.com/47371/ have completed without issue. |
| Comment by Colin Faber [ 31/May/22 ] |
|
Thanks for the update, spitzcor. I'm assuming, based on this, that the performance testing came back without issue? |
| Comment by Gerrit Updater [ 01/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47371/ |
| Comment by Cory Spitz [ 01/Jun/22 ] |
|
HPE did not complete performance testing of https://review.whamcloud.com/47371/ in isolation. We did, however, test the reverts that Alexey mentioned before, which checked out without a performance delta. Further, we did measure workload-mix job throughput with https://review.whamcloud.com/47371/ and it does not deviate from those reverts or from a baseline. |
| Comment by Peter Jones [ 01/Jun/22 ] |
|
That seems thorough enough to me to consider this issue resolved in 2.15 RC5 - thanks Cory! |