Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version: Lustre 2.15.0
Server Details:
[root@snx11922n000 admin]# cscli fs_info
----------------------------------------------------------------------------------------------------
OST Redundancy style: Declustered Parity gridraid
Disk I/O Integrity guard (ANSI T10-PI) is in use ✔
----------------------------------------------------------------------------------------------------
Information about "snx11922" file system:
----------------------------------------------------------------------------------------------------
Node Role Targets Failover partner Devices
----------------------------------------------------------------------------------------------------
snx11922n000 mgmt 0 / 0 snx11922n001
snx11922n001 mgmt 0 / 0 snx11922n000
snx11922n002 mgs 1 / 1 snx11922n003 /dev/md65
snx11922n003 mds 1 / 1 snx11922n002 /dev/md66
snx11922n004 oss 2 / 2 snx11922n005 /dev/md0, /dev/md2
snx11922n005 oss 2 / 2 snx11922n004 /dev/md1, /dev/md3
[root@snx11922n000 admin]# cscli show_node_versions -g all
snx11922n000 6.1-010.39
snx11922n001 6.1-010.39
snx11922n002 6.1-010.39-cm-22.04.19-g3cedd6d
snx11922n003 6.1-010.39-cm-22.04.19-g3cedd6d
snx11922n004 6.1-010.39-cm-22.04.19-g3cedd6d
snx11922n005 6.1-010.39-cm-22.04.19-g3cedd6d
Lustre Version:
version=2.15.0_RC3_14_g3cedd6d
Debug Level:
debug=ioctl neterror warning error emerg ha config console lfsck
Debug_mb Size:
debug_mb=101
Subsystem_debug:
undefined mdc mds osc ost class log llite rpc mgmt lnet lnd pinger filter libcfs echo ldlm lov lquota osd lfsck snapshot lmv sec gss mgc mgs fid fld
saturn-p2:/ # pcmd -s -n ALL_COMPUTE "mount -t lustre"
Output from 56-63,96-127:
10.12.0.50@o2ib4000:10.12.2.51@o2ib4001:/snx11922 on /lus/snx11922 type lustre (rw,nochecksum,flock,nouser_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,encrypt)
Node(s) 56-63,96-127 had exit code 0
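For reference, the mount entry above corresponds to a client mount of roughly the form below. This is a sketch of the equivalent manual invocation (the compute clients are actually mounted by the provisioning tooling), and only the explicitly settable options are shown; most of the reported option string is what the client reports back after mounting.
# Hypothetical manual client mount matching the entry above
mount -t lustre -o rw,flock,nochecksum \
    10.12.0.50@o2ib4000:10.12.2.51@o2ib4001:/snx11922 /lus/snx11922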
saturn-p2:/ # pcmd -s -n ALL_COMPUTE "lctl get_param version"
Output from 56-63,96-127:
version=2.15.0_RC3_14_g3cedd6d
Node(s) 56-63,96-127 had exit code 0
saturn-p2:/ # pcmd -s -n ALL_COMPUTE "lctl get_param debug"
Output from 56-63,96-127:
debug=ioctl neterror warning error emerg ha config console lfsck
Node(s) 56-63,96-127 had exit code 0
saturn-p2:/ # pcmd -s -n ALL_COMPUTE "lctl get_param debug_mb"
Output from 60-63,112-115,120-127:
debug_mb=201
Output from 104-111:
debug_mb=321
Output from 56-59,96-103,116-119:
debug_mb=161
Node(s) 56-63,96-127 had exit code 0
saturn-p2:/ # pcmd -s -n ALL_COMPUTE "lctl get_param subsystem_debug"
Output from 56-63,96-127:
subsystem_debug=
undefined mdc mds osc ost class log llite rpc mgmt lnet lnd pinger filter libcfs echo ldlm lov lquota osd lfsck snapshot lmv sec gss mgc mgs fid fld
Node(s) 56-63,96-127 had exit code
saturn-p2:/ # xtprocadmin| grep compute |wc
40 240 2360
saturn-p2:/ # apstat
Compute node summary
arch config up resv use avail down rebootq
XT 40 40 36 36 4 0 0
No pending applications are present
Total placed applications: 6
Apid ResId User PEs Nodes Age State Command
3152339 51624314 vers 24 6 0h14m run IOR
3152346 51624321 vers 24 6 0h10m run IOR
3152348 51624323 vers 24 6 0h09m run IOR
3152352 51624327 vers 24 6 0h04m run IOR
3152353 51624328 vers 24 6 0h04m run IOR
3152354 51624329 vers 24 6 0h02m run IOR
No applications or reservations are being cleaned up
Description
Data miscompare seen during server failover (forced panics) for an IOR job with ill-formed I/O.
Error Output:
[13] Expected: 0x0000000000029ff8
[13] Actual: 0x0000000000000000
[13] FAILED comparison of buffer containing 8-byte ints:
[13] File name = /lus/snx11922/ost2/ostest.vers/alsorun.20220421145202.27664.saturn-p2/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse.1.qSxF7f.1650570751/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse/IORfile.00000021
[13] In transfer 152, 21504 errors between buffer indices 0 and 21503.
[13] File byte offset = 37576704:
[13] Expected: 0x000000156261b63a 0000000000000008 000000156261b63a 0000000000000018
[13] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
** error **
ERROR in IOR.c (line 448): data check error, aborting execution.
ERROR: No such file or directory
** exiting **
Clients involved during this corruption:
task 13 on nid00097
task 21 on nid00099
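For context, the failure is reported by IOR's write data check ("data check error" in IOR.c). Judging from the test name (CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse) and the 24-PE placements shown by apstat, the job was roughly of the form sketched below; every flag here is inferred from the test name, not taken from the harness's actual command line.
# Hypothetical IOR invocation approximating the failing test:
#   -w writes the file, -W re-reads and verifies the written data,
#   -i 10 iterations, -t 834k transfer size, -z random offsets,
#   -F one file per task (suggested by the per-file numbering).
# Server failover was forced by panics while this was running.
mpirun -np 24 ior -a POSIX -w -W -i 10 -t 834k -b <per-task-size> -z -F \
    -o /lus/snx11922/ost2/ostest.vers/<run-dir>/IORfile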
saturn-p2:/ # lfs getstripe -r -y /lus/snx11922/ost2/ostest.vers/alsorun.20220421145202.27664.saturn-p2/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse.1.qSxF7f.1650570751/CL_IOR_dom_sel_all_wr_10iter_834K_rand_noreuse/IORfile.00000021
lcm_layout_gen: 7
lcm_mirror_count: 1
lcm_entry_count: 5
component0:
lcme_id: 1
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 1048576
sub_layout:
lmm_stripe_count: 0
lmm_stripe_size: 1048576
lmm_pattern: mdt
lmm_layout_gen: 0
lmm_stripe_offset: 0
component1:
lcme_id: 2
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 1048576
lcme_extent.e_end: 269484032
sub_layout:
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 3
lmm_pool: ost2
lmm_objects:
- l_ost_idx: 3
l_fid: 0x100030000:0x83fae34:0x0
component2:
lcme_id: 3
lcme_mirror_id: 0
lcme_flags: extension
lcme_extent.e_start: 269484032
lcme_extent.e_end: 536870912
sub_layout:
lmm_stripe_count: 0
lmm_extension_size: 134217728
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
lmm_pool: ost2
component3:
lcme_id: 4
lcme_mirror_id: 0
lcme_flags: 0
lcme_extent.e_start: 536870912
lcme_extent.e_end: 536870912
sub_layout:
lmm_stripe_count: -1
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
lmm_pool: ost2
component4:
lcme_id: 5
lcme_mirror_id: 0
lcme_flags: extension
lcme_extent.e_start: 536870912
lcme_extent.e_end: EOF
sub_layout:
lmm_stripe_count: 0
lmm_extension_size: 268435456
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
lmm_pool: ost2
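The file carries a Data-on-MDT plus self-extending layout (DoM + SEL) composite layout, consistent with the "dom_sel" test name. A layout of this shape could be created with an lfs setstripe call along the lines of the sketch below; this is reconstructed from the getstripe output above and is not necessarily the exact command the test harness used.
# Hypothetical lfs setstripe producing a similar DoM + SEL layout:
#   1 MiB Data-on-MDT component, then a single-stripe SEL component on
#   pool ost2 (128 MiB extension size) up to 512 MiB, then a fully
#   striped SEL component (256 MiB extension size) to EOF.
lfs setstripe -E 1M -L mdt \
              -E 512M -z 128M -c 1 -S 1M -p ost2 \
              -E -1 -z 256M -c -1 -p ost2 \
              <file>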
Console Logs:
2022-04-21T14:55:08.086885-05:00 c0-0c1s8n0 Lustre: 9811:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887] req@000000007bfb28c1 x1730745432703552/t0(0) o400->snx11922-OST0003-osc-ffff88878818e800@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
2022-04-21T14:55:08.086967-05:00 c0-0c1s8n0 Lustre: snx11922-OST0003-osc-ffff88878818e800: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete
2022-04-21T14:55:08.086988-05:00 c0-0c1s8n0 Lustre: 9809:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887] req@000000008ee911d3 x1730745432703488/t0(0) o400->snx11922-OST0002-osc-ffff88878818e800@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
2022-04-21T14:55:08.087004-05:00 c0-0c1s8n0 Lustre: snx11922-OST0002-osc-ffff88878818e800: Connection to snx11922-OST0002 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete
2022-04-21T14:55:08.087023-05:00 c0-0c1s8n2 Lustre: 9808:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570887/real 1650570887] req@00000000e7ca3f51 x1730745432704128/t0(0) o400->snx11922-OST0003-osc-ffff8887854c2000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570903 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
2022-04-21T14:55:08.087040-05:00 c0-0c1s8n2 Lustre: snx11922-OST0003-osc-ffff8887854c2000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete
2022-04-21T14:55:08.087076-05:00 c0-0c1s8n1 Lustre: 9797:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570889/real 1650570889] req@0000000039643501 x1730745432705728/t0(0) o400->snx11922-OST0003-osc-ffff88838195c000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570905 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
2022-04-21T14:55:08.087086-05:00 c0-0c1s8n1 Lustre: snx11922-OST0003-osc-ffff88838195c000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete
2022-04-21T14:55:08.087114-05:00 c0-0c1s8n3 Lustre: 9786:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1650570889/real 1650570889] req@000000008cb8442c x1730745432700352/t0(0) o400->snx11922-OST0002-osc-ffff888386f33000@10.12.2.53@o2ib4001:28/4 lens 224/224 e 0 to 1 dl 1650570905 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
2022-04-21T14:55:08.087129-05:00 c0-0c1s8n3 Lustre: snx11922-OST0003-osc-ffff888386f33000: Connection to snx11922-OST0003 (at 10.12.2.53@o2ib4001) was lost; in progress operations using this service will wait for recovery to complete
Kernel logs:
Apr 21 14:55:25 snx11922n004 kernel: md: md3: resync interrupted.
Apr 21 14:55:26 snx11922n004 kernel: LDISKFS-fs (md3): recovery complete
Apr 21 14:55:26 snx11922n004 kernel: LDISKFS-fs (md3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=panic,errors=panic,journal_checksum,no_mbcache,nodelalloc
Apr 21 14:55:27 snx11922n003 kernel: Lustre: snx11922-MDT0001: Received LWP connection from 10.12.0.52@o2ib4, removing former export from 10.12.0.53@o2ib4
Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: Imperative Recovery enabled, recovery window shrunk from 900-2700 down to 450-2700
Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: in recovery but waiting for the first client to connect
Apr 21 14:55:27 snx11922n004 kernel: md: resync of RAID array md3
Apr 21 14:55:27 snx11922n004 kernel: [204833]:md.c:10632:md_do_sync(): md: md3: resync range: [0, 30316376064)
Apr 21 14:55:27 snx11922n002 kernel: Lustre: snx11922-MDT0000: Received LWP connection from 10.12.0.52@o2ib4, removing former export from 10.12.0.53@o2ib4
Apr 21 14:55:27 snx11922n004 kernel: Lustre: snx11922-OST0003: Will be in recovery for at least 7:30, or until 16 clients reconnect
Apr 21 14:55:40 snx11922n001 kernel: LNet: 2575200:0:(o2iblnd_cb.c:3369:kiblnd_check_conns()) Timed out tx for 10.12.0.53@o2ib4: 2 seconds
Apr 21 14:55:40 snx11922n001 kernel: LNet: 2575200:0:(o2iblnd_cb.c:3369:kiblnd_check_conns()) Skipped 1 previous similar message
Apr 21 14:55:48 snx11922n003 kernel: Lustre: snx11922-OST0002-osc-MDT0001: Connection restored to 10.12.0.52@o2ib4 (at 10.12.0.52@o2ib4)
Apr 21 14:55:48 snx11922n001 kernel: Lustre: snx11922-OST0002-osc-ffff998fa4d8d000: Connection restored to 10.12.0.52@o2ib4 (at 10.12.0.52@o2ib4)
Apr 21 14:55:48 snx11922n004 kernel: Lustre: snx11922-OST0002: Recovery over after 0:32, of 16 clients 16 recovered and 0 were evicted.
Issue Links
- is related to LU-15847 Resolve 'More than one transno' situations (Resolved)