[LU-12258] sanity test_101d timeout when doing a rolling upgrade of the OSS from 2.10.7 to 2.12.1 with ZFS Created: 01/May/19  Updated: 20/May/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Attachments: HTML File dmesg    
Issue Links:
Related
is related to LU-12234 sanity-benchmark test iozone hangs in... Open
is related to LU-9845 ost-pools test_22 hangs with ‘WARNING... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

1. Set up the system at 2.10.7 with 1 MDS (ZFS), 2 OSTs (ZFS), and 1 client.
2. Upgrade the OSS from 2.10.7 to 2.12.1 while the other nodes remain at 2.10.7, then run sanity. test_101d times out and the OSS side shows the trace below (a rough repro sketch follows the steps); ldiskfs does not have the problem.
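For reference, a minimal sketch of the flow in steps 1-2, assuming a standard lustre/tests configuration on all nodes; the hostnames, package globs, and mount points below are illustrative placeholders, not details from the report:

# Hypothetical node names; the report does not identify the hosts.
OSS=oss1 CLIENT=client1

# Step 2: upgrade only the OSS to 2.12.1; MDS and client stay on 2.10.7.
ssh $OSS 'umount -a -t lustre'                 # stop both OSTs
ssh $OSS 'yum -y upgrade "lustre*2.12.1*" "kmod-lustre*2.12.1*"'
ssh $OSS 'mount -t lustre lustre-ost1/ost1 /mnt/lustre-ost1'
ssh $OSS 'mount -t lustre lustre-ost2/ost2 /mnt/lustre-ost2'

# Re-run only the failing subtest from the client.
ssh $CLIENT 'cd /usr/lib64/lustre/tests && FSTYPE=zfs ONLY=101d bash sanity.sh'

The OSS console output during the timeout: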

[ 3268.291244] Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================================== 00:21:09 (1556670069)
[ 3280.980154] WARNING: MMP writes to pool 'lustre-ost1' have not succeeded in over 5s; suspending pool
[ 3280.981448] WARNING: Pool 'lustre-ost1' has encountered an uncorrectable I/O failure and has been suspended.

[ 3281.091886] WARNING: MMP writes to pool 'lustre-ost2' have not succeeded in over 5s; suspending pool
[ 3281.092868] WARNING: Pool 'lustre-ost2' has encountered an uncorrectable I/O failure and has been suspended.

[ 3474.405076] LNet: Service thread pid 30189 was inactive for 200.40s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 3474.409289] Pid: 30189, comm: ll_ost_io00_000 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
[ 3474.410361] Call Trace:
[ 3474.410694]  [<ffffffffc07922d5>] cv_wait_common+0x125/0x150 [spl]
[ 3474.411403]  [<ffffffffc0792315>] __cv_wait+0x15/0x20 [spl]
[ 3474.412004]  [<ffffffffc08d32bf>] txg_wait_synced+0xef/0x140 [zfs]
[ 3474.412817]  [<ffffffffc0888c95>] dmu_tx_wait+0x275/0x3c0 [zfs]
[ 3474.413488]  [<ffffffffc0888e72>] dmu_tx_assign+0x92/0x490 [zfs]
[ 3474.414163]  [<ffffffffc11f6009>] osd_trans_start+0x199/0x440 [osd_zfs]
[ 3474.414896]  [<ffffffffc131cc85>] ofd_trans_start+0x75/0xf0 [ofd]
[ 3474.415596]  [<ffffffffc1323881>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
[ 3474.416312]  [<ffffffffc1327c6c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[ 3474.416962]  [<ffffffffc102947c>] tgt_brw_write+0x10cc/0x1cf0 [ptlrpc]
[ 3474.417923]  [<ffffffffc10251da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 3474.418699]  [<ffffffffc0fca80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 3474.419550]  [<ffffffffc0fce13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[ 3474.420261]  [<ffffffffa0cc1c71>] kthread+0xd1/0xe0
[ 3474.420817]  [<ffffffffa1375c37>] ret_from_fork_nospec_end+0x0/0x39
[ 3474.421507]  [<ffffffffffffffff>] 0xffffffffffffffff
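The suspension comes from ZFS multihost protection (MMP): once MMP writes miss their deadline (roughly zfs_multihost_fail_intervals * zfs_multihost_interval, the "over 5s" in the warning), the pool is suspended and anything reaching txg_wait_synced(), such as the ll_ost_io thread above, blocks indefinitely. As a diagnostic sketch (not a fix for the underlying bug), the state can be inspected and cleared on the OSS with stock ZFS tooling:

# Is multihost/MMP enabled on the affected pools?
zpool get multihost lustre-ost1 lustre-ost2

# MMP write cadence (ms) and how many missed intervals trigger suspension.
cat /sys/module/zfs/parameters/zfs_multihost_interval
cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals

# Once the underlying I/O path is healthy again, a suspended pool can be
# resumed with zpool clear; blocked service threads should then continue.
zpool clear lustre-ost1
zpool clear lustre-ost2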


 Comments   
Comment by James Nunez (Inactive) [ 20/May/19 ]

I see a very similar issue in 2.10.8 RC1 full testing with ZFS, in sanity-benchmark test bonnie; logs at https://testing.whamcloud.com/test_sets/0f1e6704-7606-11e9-92d8-52540065bddc.
