[LU-15776] 2.15 RC3: lost writes during server fofb by forced panics Created: 22/Apr/22  Updated: 01/Jun/22  Resolved: 01/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Prabu Manoharan Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: corruption
Environment:

Server Details:

[root@snx11922n000 admin]# cscli fs_info
----------------------------------------------------------------------------------------------------
OST Redundancy style: Declustered Parity gridraid
Disk I/O Integrity guard (ANSI T10-PI) is in use ✔
----------------------------------------------------------------------------------------------------
Information about "snx11922" file system:
----------------------------------------------------------------------------------------------------
Node Role Targets Failover partner Devices
----------------------------------------------------------------------------------------------------
snx11922n000 mgmt 0 / 0 snx11922n001
snx11922n001 mgmt 0 / 0 snx11922n000
snx11922n002 mgs 1 / 1 snx11922n003 /dev/md65
snx11922n003 mds 1 / 1 snx11922n002 /dev/md66
snx11922n004 oss 2 / 2 snx11922n005 /dev/md0, /dev/md2
snx11922n005 oss 2 / 2 snx11922n004 /dev/md1, /dev/md3

[root@snx11922n000 admin]# cscli show_node_versions -g all
snx11922n000 6.1-010.39
snx11922n001 6.1-010.39
snx11922n002 6.1-010.39-cm-22.04.19-g3cedd6d
snx11922n003 6.1-010.39-cm-22.04.19-g3cedd6d
snx11922n004 6.1-010.39-cm-22.04.19-g3cedd6d
snx11922n005 6.1-010.39-cm-22.04.19-g3cedd6d

Lustre Version :
version=2.15.0_RC3_14_g3cedd6d

Debug Level:
debug=ioctl neterror warning error emerg ha config console lfsck

Debug_mb Size :
debug_mb=101

Subsystem_debug:
undefined mdc mds osc ost class log llite rpc mgmt lnet lnd pinger filter libcfs echo ldlm lov lquota osd lfsck snapshot lmv sec gss mgc mgs fid fld

saturn-p2:/ # pcmd -s -n ALL_COMPUTE "mount -t lustre"
Output from 56-63,96-127:
10.12.0.50@o2ib4000:10.12.2.51@o2ib4001:/snx11922 on /lus/snx11922 type lustre (rw,nochecksum,flock,nouser_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,encrypt)
Node(s) 56-63,96-127 had exit code 0
saturn-p2:/ # pcmd -s -n ALL_COMPUTE "lctl get_param version"
Output from 56-63,96-127:
version=2.15.0_RC3_14_g3cedd6d
Node(s) 56-63,96-127 had exit code 0
saturn-p2:/ # pcmd -s -n ALL_COMPUTE "lctl get_param debug"
Output from 56-63,96-127:
debug=ioctl neterror warning error emerg ha config console lfsck
Node(s) 56-63,96-127 had exit code 0
saturn-p2:/ # pcmd -s -n ALL_COMPUTE "lctl get_param debug_mb"
Output from 60-63,112-115,120-127:
debug_mb=201
Output from 104-111:
debug_mb=321
Output from 56-59,96-103,116-119:
debug_mb=161
Node(s) 56-63,96-127 had exit code 0
saturn-p2:/ # pcmd -s -n ALL_COMPUTE "lctl get_param subsystem_debug"
Output from 56-63,96-127:
subsystem_debug=
undefined mdc mds osc ost class log llite rpc mgmt lnet lnd pinger filter libcfs echo ldlm lov lquota osd lfsck snapshot lmv sec gss mgc mgs fid fld
Node(s) 56-63,96-127 had exit code 0

saturn-p2:/ # xtprocadmin| grep compute |wc
40 240 2360
saturn-p2:/ # apstat
Compute node summary
arch config up resv use avail down rebootq
XT 40 40 36 36 4 0 0

No pending applications are present

Total placed applications: 6
Apid ResId User PEs Nodes Age State Command
3152339 51624314 vers 24 6 0h14m run IOR
3152346 51624321 vers 24 6 0h10m run IOR
3152348 51624323 vers 24 6 0h09m run IOR
3152352 51624327 vers 24 6 0h04m run IOR
3152353 51624328 vers 24 6 0h04m run IOR
3152354 51624329 vers 24 6 0h02m run IOR

No applications or reservations are being cleaned up


Issue Links:
Related
is related to LU-15847 Resolve 'More than one transno' situa... Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Data miscompare seen during server failover by forced panics for an IOR job (ill-formed I/O).

Error Output :

 

[13] Expected: 0x0000000000029ff8

[13] Actual:   0x0000000000000000

[13] FAILED comparison of buffer containing 8-byte ints:

[13]   File name = /lus/snx11922/ost2/ostest.vers/alsorun.20220421145202.27664.saturn-p2/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse.1.qSxF7f.1650570751/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse/IORfile.00000021

[13]   In transfer 152, 21504 errors between buffer indices 0 and 21503.

[13]   File byte offset = 37576704:

[13]     Expected: 0x000000156261b63a 0000000000000008 000000156261b63a 0000000000000018

[13]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000

** error **

ERROR in IOR.c (line 448): data check error, aborting execution.

ERROR: No such file or directory

** exiting **

 

Clients involved during this corruption:

 

task 13 on nid00097

task 21 on nid00099

 

saturn-p2:/ # lfs getstripe -r -y /lus/snx11922/ost2/ostest.vers/alsorun.20220421145202.27664.saturn-p2/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse.1.qSxF7f.1650570751/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse/IORfile.00000021

  lcm_layout_gen:    7

  lcm_mirror_count:  1

  lcm_entry_count:   5

  component0:

    lcme_id:             1

    lcme_mirror_id:      0

    lcme_flags:          init

    lcme_extent.e_start: 0

    lcme_extent.e_end:   1048576

    sub_layout:

      lmm_stripe_count:  0

      lmm_stripe_size:   1048576

      lmm_pattern:       mdt

      lmm_layout_gen:    0

      lmm_stripe_offset: 0

 

  component1:

    lcme_id:             2

    lcme_mirror_id:      0

    lcme_flags:          init

    lcme_extent.e_start: 1048576

    lcme_extent.e_end:   269484032

    sub_layout:

      lmm_stripe_count:  1

      lmm_stripe_size:   1048576

      lmm_pattern:       raid0

      lmm_layout_gen:    0

      lmm_stripe_offset: 3

      lmm_pool:          ost2

      lmm_objects:

      - l_ost_idx: 3

        l_fid:     0x100030000:0x83fae34:0x0

 

  component2:

    lcme_id:             3

    lcme_mirror_id:      0

    lcme_flags:          extension

    lcme_extent.e_start: 269484032

    lcme_extent.e_end:   536870912

    sub_layout:

      lmm_stripe_count:  0

      lmm_extension_size: 134217728

      lmm_pattern:       raid0

      lmm_layout_gen:    0

      lmm_stripe_offset: -1

      lmm_pool:          ost2

 

  component3:

    lcme_id:             4

    lcme_mirror_id:      0

    lcme_flags:          0

    lcme_extent.e_start: 536870912

    lcme_extent.e_end:   536870912

    sub_layout:

      lmm_stripe_count:  -1

      lmm_stripe_size:   1048576

      lmm_pattern:       raid0

      lmm_layout_gen:    0

      lmm_stripe_offset: -1

      lmm_pool:          ost2

 

  component4:

    lcme_id:             5

    lcme_mirror_id:      0

    lcme_flags:          extension

    lcme_extent.e_start: 536870912

    lcme_extent.e_end:   EOF

    sub_layout:

      lmm_stripe_count:  0

      lmm_extension_size: 268435456

      lmm_pattern:       raid0

      lmm_layout_gen:    0

      lmm_stripe_offset: -1

      lmm_pool:          ost2

 

 

 

Console Logs:

 

2022-04-21T14:55:08.086885-05:00 c0-0c1s8n0 Lustre: 9811:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887]  req@000000007bfb28c1 x1730745432703552/t0(0) o400->snx11922-OST0003-osc-ffff88878818e800@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''

2022-04-21T14:55:08.086967-05:00 c0-0c1s8n0 Lustre: snx11922-OST0003-osc-ffff88878818e800: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete

2022-04-21T14:55:08.086988-05:00 c0-0c1s8n0 Lustre: 9809:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887]  req@000000008ee911d3 x1730745432703488/t0(0) o400->snx11922-OST0002-osc-ffff88878818e800@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''

2022-04-21T14:55:08.087004-05:00 c0-0c1s8n0 Lustre: snx11922-OST0002-osc-ffff88878818e800: Connection to snx11922-OST0002 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete

2022-04-21T14:55:08.087023-05:00 c0-0c1s8n2 Lustre: 9808:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887]  req@00000000e7ca3f51 x1730745432704128/t0(0) o400->snx11922-OST0003-osc-ffff8887854c2000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''

2022-04-21T14:55:08.087040-05:00 c0-0c1s8n2 Lustre: snx11922-OST0003-osc-ffff8887854c2000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete

2022-04-21T14:55:08.087076-05:00 c0-0c1s8n1 Lustre: 9797:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570889/real 1650570889]  req@0000000039643501 x1730745432705728/t0(0) o400->snx11922-OST0003-osc-ffff88838195c000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570905 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''

2022-04-21T14:55:08.087086-05:00 c0-0c1s8n1 Lustre: snx11922-OST0003-osc-ffff88838195c000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete

2022-04-21T14:55:08.087114-05:00 c0-0c1s8n3 Lustre: 9786:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570889/real 1650570889]  req@000000008cb8442c x1730745432700352/t0(0) o400->snx11922-OST0002-osc-ffff888386f33000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570905 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''

2022-04-21T14:55:08.087129-05:00 c0-0c1s8n3 Lustre: snx11922-OST0003-osc-ffff888386f33000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete

 

Kern logs :

Apr 21 14:55:25 snx11922n004 kernel: md: md3: resync interrupted.

Apr 21 14:55:26 snx11922n004 kernel: LDISKFS-fs (md3): recovery complete

Apr 21 14:55:26 snx11922n004 kernel: LDISKFS-fs (md3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=panic,errors=panic,journal_checksum,no_mbcache,nodelalloc

Apr 21 14:55:27 snx11922n003 kernel: Lustre: snx11922-MDT0001: Received LWP connection from 10.12.0.52@o2ib4, removing former export from 10.12.0.53@o2ib4

Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: Imperative Recovery enabled, recovery window shrunk from 900-2700 down to 450-2700

Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: in recovery but waiting for the first client to connect

Apr 21 14:55:27 snx11922n004 kernel: md: resync of RAID array md3

Apr 21 14:55:27 snx11922n004 kernel: [204833]:md.c:10632:md_do_sync(): md: md3: resync range: [0, 30316376064)

Apr 21 14:55:27 snx11922n002 kernel: Lustre: snx11922-MDT0000: Received LWP connection from 10.12.0.52@o2ib4, removing former export from 10.12.0.53@o2ib4

Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: Will be in recovery for at least 7:30, or until 16 clients reconnect

Apr 21 14:55:40 snx11922n001 kernel: LNet: 2575200:0:(o2iblnd_cb.c:3369:kiblnd_check_conns()) Timed out tx for 10.12.0.53@o2ib4: 2 seconds

Apr 21 14:55:40 snx11922n001 kernel: LNet: 2575200:0:(o2iblnd_cb.c:3369:kiblnd_check_conns()) Skipped 1 previous similar message

Apr 21 14:55:48 snx11922n003 kernel: Lustre: snx11922-OST0002-osc-MDT0001: Connection restored to 10.12.0.52@o2ib4 (at 10.12.0.52@o2ib4)

Apr 21 14:55:48 snx11922n001 kernel: Lustre: snx11922-OST0002-osc-ffff998fa4d8d000: Connection restored to 10.12.0.52@o2ib4 (at 10.12.0.52@o2ib4)

Apr 21 14:55:48 snx11922n004 kernel: Lustre: snx11922-OST0002: Recovery over after 0:32, of 16 clients 16 recovered and 0 were evicted.

 



 Comments   
Comment by Colin Faber [ 22/Apr/22 ]

Are you guys working on this bug now?

Comment by Peter Jones [ 25/Apr/22 ]

spitzcor could you please advise if someone from your team is working on this? Is this a blocker for 2.15?

Comment by Cory Spitz [ 25/Apr/22 ]

shadow is working on this for HPE, so I've assigned it to him.
I guess it wasn't evident in the report, but the instance reported here was seen with 2.15.0-RC3 (clients & servers).

Comment by Cory Spitz [ 30/Apr/22 ]

At this week's LWG call I reported that our attention was on debugging the client. New information has been brought to light which suggests that it is a server-side issue. Quoted from Peggy G. at HPE:

In an attempt to get more complete debug log data, a change was made to the test run to force a panic on all computes as soon as IOR detected data corruption, rather than dumping debug logs. This change was made in the middle of the run mentioned in the previous comment by Prabu, under the alsorun.20220426124703.28248.saturn-p2 test dirs.
There were 2 IOR jobs that hit corruption around the same time:
CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse.1.lDTLio.1651070204, 2 corrupt files each with 1 hole.
CL_IOR_dom_sel_all_wr_20iter_834K_rand_noreuse.1.V88PUz.1651069606, 4 corrupt files, 2 with 2 holes, 2 with 1 hole.
Taking a closer look at the data today, I noticed that in some cases, the gap between physical blocks was the same size as the gap between logical blocks, which seemed unusual. For example, for the 10iter job above, with lDTLio string:
IORfile.00000009 has 1 hole in the OST object starting at logical block 2919
(256-2918):1119371520-1119374182, (3072-16383):1119374336-1119387647,
the physical blocks have exactly the same gap as the logical blocks: 3072-2918=154, 1119374336-1119374182=154. That seemed strange. Checking the physical blocks in that gap, they do have the expected IOR data, and the timestamp in the IOR data confirms this. Here is output from debugfs blockdump for blocks 111937418[2-3] and 111937433[5-6], showing the first 4 and last 3 lines of the bd output for each of those 4 blocks:

1119374182 :
0000  2755 6962 0900 0000 08f8 0c00 0000 0000  'Uib............
0020  2755 6962 0900 0000 18f8 0c00 0000 0000  'Uib............
0040  2755 6962 0900 0000 28f8 0c00 0000 0000  'Uib....(.......
0060  2755 6962 0900 0000 38f8 0c00 0000 0000  'Uib....8.......

7720  2755 6962 0900 0000 d807 0d00 0000 0000  'Uib............
7740  2755 6962 0900 0000 e807 0d00 0000 0000  'Uib............
7760  2755 6962 0900 0000 f807 0d00 0000 0000  'Uib............

1119374183 :
0000  2755 6962 0900 0000 0800 0000 0000 0000  'Uib............
0020  2755 6962 0900 0000 1800 0000 0000 0000  'Uib............
0040  2755 6962 0900 0000 2800 0000 0000 0000  'Uib....(.......
0060  2755 6962 0900 0000 3800 0000 0000 0000  'Uib....8.......
...
7720  2755 6962 0900 0000 d80f 0000 0000 0000  'Uib............
7740  2755 6962 0900 0000 e80f 0000 0000 0000  'Uib............
7760  2755 6962 0900 0000 f80f 0000 0000 0000  'Uib............

1119374335 :
0000  2755 6962 0900 0000 0880 0900 0000 0000  'Uib............
0020  2755 6962 0900 0000 1880 0900 0000 0000  'Uib............
0040  2755 6962 0900 0000 2880 0900 0000 0000  'Uib....(.......
0060  2755 6962 0900 0000 3880 0900 0000 0000  'Uib....8.......
...
7720  2755 6962 0900 0000 d88f 0900 0000 0000  'Uib............
7740  2755 6962 0900 0000 e88f 0900 0000 0000  'Uib............
7760  2755 6962 0900 0000 f88f 0900 0000 0000  'Uib............

1119374336 :
0000  2755 6962 0900 0000 0890 0900 0000 0000  'Uib............
0020  2755 6962 0900 0000 1890 0900 0000 0000  'Uib............
0040  2755 6962 0900 0000 2890 0900 0000 0000  'Uib....(.......
0060  2755 6962 0900 0000 3890 0900 0000 0000  'Uib....8.......
...
7720  2755 6962 0900 0000 d89f 0900 0000 0000  'Uib............
7740  2755 6962 0900 0000 e89f 0900 0000 0000  'Uib............
7760  2755 6962 0900 0000 f89f 0900 0000 0000  'Uib............

All the intervening physical disk blocks show the same, correct timestamp for this iteration of the job (i.e. 0x62695527). None of the other IOR jobs used that same timestamp, so the data must have come from this one. The 'hole' is present according to the inode extents, but the actual data did land on disk.
The other problem file from this job, IORfile.00000003, shows a similar signature: 1 hole in object at logical block 14178
...(256-14177):813097216-813111137, (14336-16383):813111296-813113343...
The gap in physical blocks is the same size as the gap in logical blocks. This time, though, there is a twist. The first 'gap' physical block, 813111138, has the correct, expected IOR data: the next logical data in sequence from the previous physical block.

813111137 :
0000  2755 6962 0300 0000 08f8 0c00 0000 0000  'Uib............
0020  2755 6962 0300 0000 18f8 0c00 0000 0000  'Uib............
0040  2755 6962 0300 0000 28f8 0c00 0000 0000  'Uib....(.......
0060  2755 6962 0300 0000 38f8 0c00 0000 0000  'Uib....8.......
...
7720  2755 6962 0300 0000 d807 0d00 0000 0000  'Uib............
7740  2755 6962 0300 0000 e807 0d00 0000 0000  'Uib............
7760  2755 6962 0300 0000 f807 0d00 0000 0000  'Uib............

813111138 :
0000  2755 6962 0300 0000 0800 0000 0000 0000  'Uib............
0020  2755 6962 0300 0000 1800 0000 0000 0000  'Uib............
0040  2755 6962 0300 0000 2800 0000 0000 0000  'Uib....(.......
0060  2755 6962 0300 0000 3800 0000 0000 0000  'Uib....8.......
...
7720  2755 6962 0300 0000 d80f 0000 0000 0000  'Uib............
7740  2755 6962 0300 0000 e80f 0000 0000 0000  'Uib............
7760  2755 6962 0300 0000 f80f 0000 0000 0000  'Uib............

but this only continues through block 813111167:

813111167 :
0000  2755 6962 0300 0000 08d0 0100 0000 0000  'Uib............
0020  2755 6962 0300 0000 18d0 0100 0000 0000  'Uib............
0040  2755 6962 0300 0000 28d0 0100 0000 0000  'Uib....(.......
0060  2755 6962 0300 0000 38d0 0100 0000 0000  'Uib....8.......
...
7720  2755 6962 0300 0000 d8df 0100 0000 0000  'Uib............
7740  2755 6962 0300 0000 e8df 0100 0000 0000  'Uib............
7760  2755 6962 0300 0000 f8df 0100 0000 0000  'Uib............

813111168 :
0000  1c55 6962 0f00 0000 0800 0000 0000 0000  .Uib............
0020  1c55 6962 0f00 0000 1800 0000 0000 0000  .Uib............
0040  1c55 6962 0f00 0000 2800 0000 0000 0000  .Uib....(.......
0060  1c55 6962 0f00 0000 3800 0000 0000 0000  .Uib....8.......
...
7720  1c55 6962 0f00 0000 d80f 0000 0000 0000  .Uib............
7740  1c55 6962 0f00 0000 e80f 0000 0000 0000  .Uib............
7760  1c55 6962 0f00 0000 f80f 0000 0000 0000  .Uib............

Starting at physical block 813111168 within the gap in the object, the data signature is different. It is still obviously IOR data, but with a different timestamp. As it happens, that timestamp, 0x6269551c, is the timestamp used by the 2nd failed IOR job, CL_IOR_dom_sel_all_wr_20iter_834K_rand_noreuse.1.V88PUz.1651069606. That job had 4 corrupt files, one of which was IORfile.00000015, which would have been the writer for this data (3rd 2-byte field in the debugfs blockdump output is the IOR write task #, task #15 writes IORfile.00000015). Since those blocks were free (i.e. not associated with the object for IORfile.00000003 from the other job), they were allocated to this file. IORfile.00000015 object inode does show an extent covering this range:
(80481-80608):813111168-813111295.
From the V88PUz job, IORfile.00000015 has 1 hole:
...(41700-41908):812573412-812573620, (42117-45439):812573829-812577151, ...
Again, the logical/physical gaps are the same size. And like IORfile.00000009 from the other job, the "missing" IOR data is present on those physical blocks. From debugfs 'bd' output:

812573620 :
0000  1c55 6962 0f00 0000 0800 0d00 0000 0000  .Uib............
0020  1c55 6962 0f00 0000 1800 0d00 0000 0000  .Uib............
0040  1c55 6962 0f00 0000 2800 0d00 0000 0000  .Uib....(.......
0060  1c55 6962 0f00 0000 3800 0d00 0000 0000  .Uib....8.......
...
7720  1c55 6962 0f00 0000 d807 0000 0000 0000  .Uib............
7740  1c55 6962 0f00 0000 e807 0000 0000 0000  .Uib............
7760  1c55 6962 0f00 0000 f807 0000 0000 0000  .Uib............

812573621 :
0000  1c55 6962 0f00 0000 0808 0000 0000 0000  .Uib............
0020  1c55 6962 0f00 0000 1808 0000 0000 0000  .Uib............
0040  1c55 6962 0f00 0000 2808 0000 0000 0000  .Uib....(.......
0060  1c55 6962 0f00 0000 3808 0000 0000 0000  .Uib....8.......
...
7720  1c55 6962 0f00 0000 d817 0000 0000 0000  .Uib............
7740  1c55 6962 0f00 0000 e817 0000 0000 0000  .Uib............
7760  1c55 6962 0f00 0000 f817 0000 0000 0000  .Uib............

812573828 :
0000  1c55 6962 0f00 0000 08f8 0c00 0000 0000  .Uib............
0020  1c55 6962 0f00 0000 18f8 0c00 0000 0000  .Uib............
0040  1c55 6962 0f00 0000 28f8 0c00 0000 0000  .Uib....(.......
0060  1c55 6962 0f00 0000 38f8 0c00 0000 0000  .Uib....8.......
...
7720  1c55 6962 0f00 0000 d807 0d00 0000 0000  .Uib............
7740  1c55 6962 0f00 0000 e807 0d00 0000 0000  .Uib............
7760  1c55 6962 0f00 0000 f807 0d00 0000 0000  .Uib............

812573829 :
0000  1c55 6962 0f00 0000 0800 0000 0000 0000  .Uib............
0020  1c55 6962 0f00 0000 1800 0000 0000 0000  .Uib............
0040  1c55 6962 0f00 0000 2800 0000 0000 0000  .Uib....(.......
0060  1c55 6962 0f00 0000 3800 0000 0000 0000  .Uib....8.......
...
7720  1c55 6962 0f00 0000 d80f 0000 0000 0000  .Uib............
7740  1c55 6962 0f00 0000 e80f 0000 0000 0000  .Uib............
7760  1c55 6962 0f00 0000 f80f 0000 0000 0000  .Uib............

Every physical block in that gap has the expected, valid IOR data, matching timestamp, matching task #.
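
For reference, block dumps like the ones above can be reproduced with the debugfs block_dump ("bd") command in catastrophic mode (-c), in the same invocation style as the stat command shown in the next comment; the device and block number below are taken from the dumps above (one of the gap blocks of the IORfile.00000015 object):

[root@snx11922n004 ~]# debugfs -c -R "bd 812573620" /dev/md2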

Comment by Peggy Gazzola [ 02/May/22 ]

There's one correction to the previous comment, about the file IORfile.00000015 from job CL_IOR_dom_sel_all_wr_20iter_834K_rand_noreuse.1.V88PUz.1651069606. The hole in that file:
(41700-41908):812573412-812573620, (42117-45439):812573829-812577151
does not completely contain the expected data from the hole. I ran icheck on all the blocks in that range and found some blocks in the middle of the hole that are in use by this same inode, in extent (14848-14975):812573696-812573823. Physical blocks 812573621-812573695 have valid IOR data for that hole. Blocks 812573696-812573823 have data from the middle of the xfer started at file byte offset 60635136. Blocks 812573824-812573828 contain data that matches the expected last 5 blks of the 'hole'.

For reference, here's debugfs 'stat' output for IORfile.00000015 ost object:

[root@snx11922n004 ~]# debugfs -c -R "stat /O/0/d4/127076516" /dev/md2
debugfs 1.46.2.cr1 (13-Jan-2022)
/dev/md2: catastrophic mode - not reading inode or group bitmaps
Inode: 1146884   Type: regular    Mode:  0666   Flags: 0x80000
Generation: 1271779983    Version: 0x00000141:00000ba2
User:  1356   Group: 11121   Project:     0   Size: 479102976
File ACL: 0
Links: 1   Blockcount: 932048
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x626955dd:00000000 -- Wed Apr 27 09:40:29 2022
 atime: 0x00000000:00000000 -- Wed Dec 31 18:00:00 1969
 mtime: 0x626955dd:00000000 -- Wed Apr 27 09:40:29 2022
crtime: 0x626954e4:ede07d30 -- Wed Apr 27 09:36:20 2022
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x100010000:0x79308a4:0x0] compat=8 incompat=0
  fid: parent=[0x200019284:0xf3a:0x0] stripe=0 stripe_size=1048576 stripe_count=1 component_id=2 component_start=1048576 component_end=536870912 layout_version=0 range=0
EXTENTS:
(ETB0):293636096, (256-1250):811688192-811689186, (1251-1535):814882449-814882733, (1536-1663):812577152-812577279, (1664-1919):811673856-811674111, (1920-2047):815736064-815736191, (2048-2292):812651520-812651764, (2293-2559):813565312-813565578, (2560-2687):815838976-815839103, (2688-2918):813935756-813935986, (2919-3335):811690855-811691271, (3336-3543):815174805-815175012, (3544-3839):814450656-814450951, (3840-3960):813526339-813526459, (3961-7807):811691897-811695743, (7808-8130):811671552-811671874, (8131-8191):811696067-811696127, (8192-8339):811704320-811704467, (8340-8447):815992960-815993067, (8448-8756):815736192-815736500, (8757-9727):811704885-811705855, (9728-9798):811687936-811688006, (9799-13552):811705927-811709680, (13553-13760):812029184-812029391, (13761-14847):811709889-811710975, (14848-14975):812573696-812573823, (14976-15219):812633581-812633824, (15220-15428):811711348-811711556, (15429-15636):815843328-815843

Comment by Andreas Dilger [ 04/May/22 ]

Peggy, not a solution to the problem, but some background info on the OST IO path that may help track down a fix:

  • on bulk write, the OST will allocate blocks for the inode in an in-memory journal transaction
  • if the write is "large" (osd-ldiskfs.testfs-OST0001.writethrough_max_io_mb=8 or over by default) then preallocated pages are used from a per-thread stash, otherwise they are allocated in the inode's mapping in page cache
  • the RDMA from the client will land in these pages and be submitted directly to disk
  • the client RPC will complete after the RDMA data is written to disk and is assigned a Lustre transno
  • the journal transaction with the metadata changes to allocate the blocks for the inode will commit asynchronously (5s later by default)
  • the client is supposed to keep these dirty pages in RAM for RPC replay until it gets a last_committed value >= transno

If there is a crash after the RDMA+write completes but before the journal transaction is committed, then you will see exactly what you describe: the data is present in the right spot on disk, but not "allocated" to the inode. The client is supposed to replay those RPCs if the server crashed before the allocation transaction was committed to disk, but it doesn't appear that this is happening in this case. Initially I was wondering if this was a case of uninitialized ext4 extents (where the blocks are allocated, but have a flag that tells readers to just report "all zero" for those blocks), but it looks like this is a case where the blocks are not allocated to the file at all.

Note that if the clients do replay their BRW RPCs to the OST as they should, then there is no guarantee that the filesystem is going to allocate exactly the same blocks to each inode. I would hope they should, given the proximity of the block numbers, but that might be a reason why the same data is on disk in two places and/or why some of it is allocated to different files.

IMHO, this looks like the problem is in the BRW RPC replay mechanism (or lack thereof), and is unlikely to be at the ldiskfs level, unless there is a failure in the commit callbacks that somehow breaks last_committed and implies the BRW RPC was committed to disk when it was not.
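
As a rough illustration of the replay rule described above (a sketch only, with invented names, not the actual ptlrpc structures or functions):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: one entry on the client's replay list. While this
 * entry exists, the client keeps the dirty pages needed to resend the
 * bulk write after a server restart. */
struct replay_entry {
	uint64_t transno;	/* transno the server assigned in its reply */
};

/* Each server reply also carries last_committed. A request whose transno
 * is at or below it is durable on the OST and can be forgotten; anything
 * newer must stay on the replay list and be resent if the server crashes
 * before committing. */
static bool can_drop_from_replay(const struct replay_entry *entry,
				 uint64_t server_last_committed)
{
	return entry->transno != 0 &&
	       entry->transno <= server_last_committed;
}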

Comment by Alexey Lyashkov [ 06/May/22 ]

Andreas, the picture is much stranger than a simple request being lost or failing during recovery.
The lost region does not map onto any extent sent from the client to the server, and the request was not finished on the server at crash time.
So it is a simple resend-over-recovery case. In this example:

00000008:00200020:3.0:1651709243.919899:0:28609:0:(osc_cache.c:1084:osc_extent_make_ready()) extent 00000000c6c59345@{[19456 -> 20015/20479], [1|0|+|locking|wiuY|00000000a3d0090d], [2318336|560|+|-|0000000047a524c6|1024|000000004ef973c1]} make ready


00000008:00100000:3.0:1651709243.920226:0:28609:0:(osc_request.c:1804:osc_brw_prep_request()) brw rpc ffff8887c10f3680/1731924107812288 - object 0x0:137211497 offset 62914560<>83886080

00000100:00100000:6.0:1651709243.920235:0:17045:0:(client.c:1733:ptlrpc_send_new_req()) Sending RPC req@ffff8887c10f3680 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_06:3a768119-6203-4fb5-a916-2d4dabdd8e84:17045:1731924107812288:10.12.2.53@o2ib4001:4:
00000100:00000400:19.0:1651709261.180872:0:17045:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1651709243/real 1651709243]  req@ffff8887c10f3680 x1731924107812288/t0(0) o4->snx11922-OST0002-osc-ffff888f8ae7c800@10.12.2.53@o2ib4001:6/4 lens 536/456 e 0 to 1 dl 1651709261 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
00000100:00080000:18.0:1651709866.209843:0:17038:0:(recover.c:218:ptlrpc_wake_delayed()) @@@ waking (set ffff888788289200):  req@ffff8887c10f3680 x1731924107812288/t0(0) o4->snx11922-OST0002-osc-ffff888f8ae7c800@10.12.0.52@o2ib4000:6/4 lens 536/456 e 0 to 1 dl 1651709261 ref 2 fl Rpc:XQU/0/ffffffff rc 0/-1 job:''
00000100:00100000:7.0:1651709867.863582:0:17045:0:(client.c:2215:ptlrpc_check_set()) Completed RPC req@ffff8887c10f3680 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_06:3a768119-6203-4fb5-a916-2d4dabdd8e84:17045:1731924107812288:10.12.0.52@o2ib4000:4:
thr ffff8aebb2b88600 - pid 82928
                        req ffff8ad91927f080
request ffff8ad91927f080
from 0.0.0.117@13:4
to 10.12.2.53@5:4001
xid 1731924107812288:1731924107812291
transno 1262720737376
time 923905344:1651709256:0
flags 3
lens 184 208 24 64 0 0
ptlrpc body(v3) ffff8aed93688698
        type 4711
        tag 0 - last_xid 1731924107811071
        conn 22
        opc 4 OST_WRITE
num ios 1 (ffff8aed93688820)
obj [0] : 0:0x82dae69 bufs:4
data segments
0 <> 62914560 : 282624 : 460
1 <> 75153408 : 856064 : 460
2 <> 79691776 : 2293760 : 460
3 <> 83693568 : 192512 : 460
not replied

-----
crash> bt 82928
PID: 82928  TASK: ffff8aeb524097c0  CPU: 15  COMMAND: "ll_ost_io00_442"
 #0 [ffffa62e6504f880] __schedule at ffffffff8774e1d4
 #1 [ffffa62e6504f918] schedule at ffffffff8774e648
 #2 [ffffa62e6504f928] osd_trans_stop at ffffffffc1acce24 [osd_ldiskfs]
 #3 [ffffa62e6504fa00] ofd_commitrw_write at ffffffffc18f4453 [ofd]
 #4 [ffffa62e6504faa0] ofd_commitrw at ffffffffc18f97b1 [ofd]
 #5 [ffffa62e6504fb60] finish_wait at ffffffff86f2e5ac
 #6 [ffffa62e6504fbd8] tgt_brw_write at ffffffffc1389bb5 [ptlrpc]
 #7 [ffffa62e6504fd50] tgt_request_handle at ffffffffc138ade3 [ptlrpc]
 #8 [ffffa62e6504fdd0] ptlrpc_server_handle_request at ffffffffc1337953 [ptlrpc]
 #9 [ffffa62e6504fe38] ptlrpc_main at ffffffffc13393c6 [ptlrpc]
#10 [ffffa62e6504ff10] kthread at ffffffff86f043a6
#11 [ffffa62e6504ff50] ret_from_fork at ffffffff8780023f

But the lost data is a part of segment 2.

[15]   File byte offset = 79425536:
...
file has 1 hole at byte offset 79425536

(15429-19390):680934469-680938430, (19456-25645):680938496-680944685,

hole is 65 blks.

And the lost extent has a 'written' state in the extent cache.

It might be possible if some data had been written into extents allocated earlier (IOR uses a random point to write) and the RPC resend was lost, but the client logs say the RPC completed without errors.

Comment by Peter Jones [ 09/May/22 ]

spitzcor peggy shadow have you managed to narrow down when this problem was introduced?

Comment by Alexey Lyashkov [ 09/May/22 ]

Peter, we don't have such info. Currently we just have confirmation that this request was not executed after recovery, while the client thinks execution was OK. Peggy is working hard to collect logs to show this, as this issue does not reproduce on every run.

Comment by Alexey Lyashkov [ 10/May/22 ]

I have a small update. From the logs obtained yesterday, it looks like the server skips execution of a request with a single error/warning like:

00000001:00080000:3.0:1652110683.220558:0:1080451:0:(tgt_lastrcvd.c:2000:tgt_txn_stop_cb()) More than one transaction 1408749330374

Comment by Mikhail Pershin [ 11/May/22 ]

I assume that multiple transactions during a write can occur due to the 'restart/retry' logic inside ofd_commitrw_write(). To resolve that, it is enough to set the tti_mult_trans flag so that last_rcvd will be updated properly. I think the better place for that is in tgt_brw_write(), so both MDT and OFD codepaths are covered:

diff --git a/lustre/target/tgt_handler.c b/lustre/target/tgt_handler.c
index 3198cc8fc1..d81aff759a 100644
--- a/lustre/target/tgt_handler.c
+++ b/lustre/target/tgt_handler.c
@@ -2605,6 +2605,7 @@ static void tgt_warn_on_cksum(struct ptlrpc_request *req,
 int tgt_brw_write(struct tgt_session_info *tsi)
 {
        struct ptlrpc_request   *req = tgt_ses_req(tsi);
+       struct tgt_thread_info *tti = tgt_th_info(tsi->tsi_env);
        struct ptlrpc_bulk_desc *desc = NULL;
        struct obd_export       *exp = req->rq_export;
        struct niobuf_remote    *remote_nb;
@@ -2846,6 +2847,9 @@ out_commitrw:
                nob += len;
        }
 
+       /* allow multiple transactions to be assigned during write commit */
+       tti->tti_mult_trans = 1;
+
        /* Must commit after prep above in all cases */
        rc = obd_commitrw(tsi->tsi_env, OBD_BRW_WRITE, exp, &repbody->oa,
                          objcount, ioo, remote_nb, npages, local_nb, rc, nob,
Comment by Alexey Lyashkov [ 11/May/22 ]

Mike, it looks like you are right about the source of those messages for writes.

00002000:00000002:11.0:1652209931.209295:0:82589:0:(ofd_io.c:1391:ofd_commitrw_write()) retry transaction, retries:2
00080000:00000002:11.0:1652209931.209342:0:82589:0:(osd_io.c:1544:osd_declare_write_commit()) snx11922-OST0002/: inode #1638412 extent_bytes 851393 extents 2 credits 33
00000001:00080000:11.0:1652209931.209400:0:82589:0:(tgt_lastrcvd.c:2000:tgt_txn_stop_cb()) More than one transaction 1451699035998

It looks like LU-14134 ('osd-ldiskfs: reduce credits for new writing') is not as good as expected.

grep ofd_commitrw_write lustre.txt | grep -c retry
21543
..
Sometimes it needs 4 retries:
00002000:00000002:18.0:1652209930.195716:0:82181:0:(ofd_io.c:1391:ofd_commitrw_write()) retry transaction, retries:4

Not so good from a performance view :/

Comment by Alexey Lyashkov [ 11/May/22 ]

It looks like this is the bug responsible for today's corruption hit: a commit occurs during a transaction restart, but the previously assigned transno is used for the reply (highly likely).

00000100:00100000:13.0:1652209930.000194:0:82206:0:(service.c:2134:ptlrpc_server_handle_req_in()) got req x1731924859869568
00002000:00100000:13.0:1652209930.000202:0:82206:0:(ofd_dev.c:2619:ofd_rw_hpreq_check()) @@@ snx11922-OST0002 ll_ost_io00_288: refresh rw locks for [0x2c0000400:0x86ece4e:0x0] (9392128->40992767)  req@00000000bd622443 x1731924859869568/t0(0) o4->ea32ef78-799e-4d3c-a8e4-6a210cb165c2@102@gni4:440/0 lens 536/0 e 0 to 0 dl 1652209950 ref 1 fl New:/0/ffffffff rc 0/-1 job:''
00002000:00100000:13.0:1652209930.000205:0:82206:0:(ofd_dev.c:2478:ofd_prolong_extent_locks()) Prolong locks for req 00000000bd622443 with x1731924859869568 ext(9392128->40992767)
00000100:00100000:13.0:1652209930.000214:0:82206:0:(service.c:2303:ptlrpc_server_handle_request()) Handling RPC req@00000000bd622443 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost_io00_288:ea32ef78-799e-4d3c-a8e4-6a210cb165c2+191:16994:x1731924859869568:12345-102@gni4:4:
00080000:00000002:13.0:1652209930.519952:0:82206:0:(osd_io.c:1544:osd_declare_write_commit()) snx11922-OST0002/: inode #1638560 extent_bytes 899332 extents 4 credits 65
00080000:00000002:13.0:1652209930.520032:0:82206:0:(osd_handler.c:4814:osd_xattr_set()) Set version 0x15200015f47 (old 0x15200015f44) for inode 1638560
00000001:00000002:13.0:1652209930.520033:0:82206:0:(tgt_lastrcvd.c:1424:tgt_last_rcvd_update()) transno = 1451699035975, last_committed = 1451699035951
00002000:00000002:13.0:1652209930.638804:0:82206:0:(ofd_io.c:1391:ofd_commitrw_write()) retry transaction, retries:1
00080000:00000002:13.0:1652209930.638833:0:82206:0:(osd_io.c:1544:osd_declare_write_commit()) snx11922-OST0002/: inode #1638560 extent_bytes 871323 extents 3 credits 49
00002000:00000002:13.0:1652209930.694482:0:82206:0:(ofd_io.c:1391:ofd_commitrw_write()) retry transaction, retries:2
00080000:00000002:13.0:1652209930.694508:0:82206:0:(osd_io.c:1544:osd_declare_write_commit()) snx11922-OST0002/: inode #1638560 extent_bytes 867716 extents 1 credits 11

>>> 00000001:00000002:8.0:1652209930.697117:0:82431:0:(tgt_lastrcvd.c:1424:tgt_last_rcvd_update()) transno = 1451699035985, last_committed = 1451699035983 <<<< commit hit.

00002000:00000002:13.0:1652209930.727496:0:82206:0:(ofd_io.c:1391:ofd_commitrw_write()) retry transaction, retries:3
00080000:00000002:13.0:1652209930.727526:0:82206:0:(osd_io.c:1544:osd_declare_write_commit()) snx11922-OST0002/: inode #1638560 extent_bytes 854149 extents 1 credits 11
00000100:00100000:13.0:1652209930.832622:0:82206:0:(service.c:2352:ptlrpc_server_handle_request()) Handled RPC req@00000000bd622443 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost_io00_288:ea32ef78-799e-4d3c-a8e4-6a210cb165c2+11:16994:x1731924859869568:12345-102@gni4:4: Request processed in 832406us (832433us total) trans 1451699035975 rc 0/0
00002000:00100000:13.0:1652209930.832630:0:82206:0:(ofd_dev.c:2619:ofd_rw_hpreq_check()) @@@ snx11922-OST0002 ll_ost_io00_288: refresh rw locks for [0x2c0000400:0x86ece4e:0x0] (9392128->40992767)  req@00000000bd622443 x1731924859869568/t1451699035975(0) o4->ea32ef78-799e-4d3c-a8e4-6a210cb165c2@102@gni4:440/0 lens 536/456 e 0 to 0 dl 1652209950 ref 1 fl Complete:/0/0 rc 0/0 job:''
00002000:00100000:13.0:1652209930.832650:0:82206:0:(ofd_dev.c:2478:ofd_prolong_extent_locks()) Prolong locks for req 00000000bd622443 with x1731924859869568 ext(9392128->40992767)
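
A minimal sketch of the suspected race under this interpretation (illustrative C only; the real logic lives in ofd_commitrw_write() and tgt_last_rcvd_update(), and these helper names are invented): the reply carries the transno assigned on the first attempt, while the block allocation ends up in a later, restarted transaction with a newer transno. If last_committed overtakes the stale reply transno before that newer transaction commits, the client stops keeping the request for replay, and a crash at that point loses the write.

#include <stdbool.h>
#include <stdint.h>

/* Client side: the request is pruned from the replay list once its
 * reply transno appears committed. With a stale reply transno this
 * belief can be wrong. */
static bool client_skips_replay(uint64_t reply_transno,
				uint64_t server_last_committed)
{
	return reply_transno != 0 &&
	       reply_transno <= server_last_committed;
}

/* Server side: the transaction that actually allocated the blocks was
 * restarted and got a newer transno than the one carried in the reply. */
static bool write_lost_on_crash(uint64_t reply_transno,
				uint64_t data_txn_transno,
				uint64_t last_committed_at_crash)
{
	return data_txn_transno > last_committed_at_crash &&
	       client_skips_replay(reply_transno, last_committed_at_crash);
}

Under this reading, the tti_mult_trans change proposed above is what keeps last_rcvd (and the replied transno) consistent when more than one transaction is started for a single write.
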
Comment by Andreas Dilger [ 19/May/22 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47371
Subject: LU-15776 tgt: fix transaction handling in tgt_brw_write()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e3329972df95f788929dec15846bc00593c64bdf

Comment by Colin Faber [ 19/May/22 ]

Hi spitzcor,

We have a potential fix https://review.whamcloud.com/#/c/47371/

Can you please try this patch with your reproducer and let us know if it addresses the problem?

Thank you!

Comment by Cory Spitz [ 19/May/22 ]

Yes, we will test it and report back next week.

Comment by Cory Spitz [ 24/May/22 ]

The reverts that Alexey suggested here and in the review of https://review.whamcloud.com/47371/ have survived 72 hours of the reproducer (also noted in the review), whereas it would fail within 30 minutes otherwise. Performance testing is in progress.

The fix from https://review.whamcloud.com/47371/ has also survived an initial level of tests. We're looking to complete at least 24 hours here as well. We can provide another update later this week. In the meantime, the discussion in review of https://review.whamcloud.com/47371 should continue.

Comment by Colin Faber [ 26/May/22 ]

Hi spitzcor 

Any further updates?

Comment by Cory Spitz [ 26/May/22 ]

45 hours of testing the reproducer with https://review.whamcloud.com/47371/ have completed successfully, without failure.

Comment by Cory Spitz [ 31/May/22 ]

Status update from HPE: additional regression tests of https://review.whamcloud.com/47371/ have completed without issue.

Comment by Colin Faber [ 31/May/22 ]

Thanks for the update spitzcor. I'm assuming, based on this, that the performance testing came back without issue?

Comment by Gerrit Updater [ 01/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47371/
Subject: LU-15776 tgt: fix transaction handling in tgt_brw_write()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 10a29ad7616aeaadabf6d96146f52ef348a3a6f1

Comment by Cory Spitz [ 01/Jun/22 ]

HPE did not complete performance testing of https://review.whamcloud.com/47371/ in isolation. We did, however, test the reverts that Alexey mentioned before, which checked out without a perf delta. Further, we did measure workload-mix job throughput with https://review.whamcloud.com/47371/ and it does not deviate from those reverts or from a baseline.

Comment by Peter Jones [ 01/Jun/22 ]

That seems thorough enough to me to consider this issue resolved in 2.15 RC5 - thanks Cory!
