[LU-10084] replay-ost-single test_3: test failed to respond and timed out
Created: 05/Oct/17  Updated: 19/Mar/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.7
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: James Casper
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None
Environment:

trevis, full, x86_64 servers, ppc clients
servers: el7.4, ldiskfs, branch master, v2.10.53.1, b3642
clients: el7.4, branch master, v2.10.53.1, b3642


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/ba995751-659c-4e63-9b5b-fbf101137b78

From client dmesg:

[  600.102679] sync            D 00003fff9cdd6448     0 16435  16247 0x00000080
[  600.102719] Call Trace:
[  600.102734] [c000000079b835d0] [c00000007ffdbb80] 0xc00000007ffdbb80 (unreliable)
[  600.102780] [c000000079b837a0] [c000000000019634] .__switch_to+0x254/0x460
[  600.102818] [c000000079b83850] [c0000000009a9c1c] .__schedule+0x43c/0xb00
[  600.102858] [c000000079b83980] [c0000000009a5e78] .schedule_timeout+0x398/0x460
[  600.102903] [c000000079b83a90] [c0000000009aa7c8] .wait_for_completion+0x148/0x1d0
[  600.102955] [c000000079b83b60] [c0000000003688b4] .sync_inodes_sb+0xc4/0x260
[  600.102994] [c000000079b83c70] [c000000000371c9c] .sync_inodes_one_sb+0x1c/0x30
[  600.103039] [c000000079b83ce0] [c0000000003243cc] .iterate_supers+0x22c/0x2f0
[  600.103078] [c000000079b83da0] [c000000000371fb8] .sys_sync+0x48/0xd0
[  600.103117] [c000000079b83e30] [c00000000000a184] system_call+0x38/0xb4


 Comments   
Comment by James Nunez (Inactive) [ 19/Mar/19 ]

We continue to see replay-ost-single test_3 hang. A recent example is from 2.10.7 RC1, https://testing.whamcloud.com/test_sets/42434dfe-4332-11e9-92fe-52540065bddc, with the following in the client (vm1) dmesg:

[  480.120291] INFO: task tee:18534 blocked for more than 120 seconds.
[  480.120391] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  480.120446] tee             D 00003fff8b241588     0 18534  18350 0x00000080
[  480.120510] Call Trace:
[  480.120546] [c000000077ddabe0] [0000000000003d15] 0x3d15 (unreliable)
[  480.120635] [c000000077ddadb0] [c00000000001b76c] .__switch_to+0x25c/0x470
[  480.120707] [c000000077ddae60] [c000000000aa40fc] .__schedule+0x42c/0xae0
[  480.120766] [c000000077ddaf90] [c000000000a9fea4] .schedule_timeout+0x394/0x470
[  480.120830] [c000000077ddb090] [c000000000aa3c1c] .io_schedule+0xcc/0x180
[  480.120885] [c000000077ddb120] [c000000000aa01a0] .bit_wait_io+0x20/0x80
[  480.120939] [c000000077ddb1a0] [c000000000aa043c] .__wait_on_bit+0x17c/0x210
[  480.121000] [c000000077ddb250] [c000000000291d50] .wait_on_page_bit+0x100/0x120
[  480.121094] [c000000077ddb310] [d000000003d70848] .vvp_page_assume+0x48/0xe0 [lustre]
[  480.121202] [c000000077ddb390] [d000000002a46ad0] .cl_page_assume+0xf0/0x490 [obdclass]
[  480.121278] [c000000077ddb450] [d000000003d59c68] .ll_write_begin+0x198/0xb60 [lustre]
[  480.121342] [c000000077ddb550] [c00000000028f924] .generic_file_buffered_write+0x134/0x320
[  480.121408] [c000000077ddb670] [c0000000002919c0] .__generic_file_aio_write+0x320/0x4a0
[  480.121490] [c000000077ddb740] [d000000003d76948] .vvp_io_write_start+0x378/0x1210 [lustre]
[  480.121575] [c000000077ddb870] [d000000002a4b828] .cl_io_start+0xc8/0x240 [obdclass]
[  480.121657] [c000000077ddb910] [d000000002a51f78] .cl_io_loop+0x948/0x1180 [obdclass]
[  480.121731] [c000000077ddba40] [d000000003cea04c] .ll_file_io_generic+0x27c/0x1020 [lustre]
[  480.121803] [c000000077ddbbd0] [d000000003ceb2dc] .ll_file_aio_write+0x20c/0x320 [lustre]
[  480.121876] [c000000077ddbca0] [d000000003ceb508] .ll_file_write+0x118/0x310 [lustre]
[  480.121945] [c000000077ddbd80] [c000000000371a64] .SyS_write+0x164/0x440
[  480.122002] [c000000077ddbe30] [c00000000000a284] system_call+0x38/0xfc