[LU-11183] sanity test 244 hangs with no information in the logs Created: 26/Jul/18  Updated: 17/Mar/22  Resolved: 17/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-11128 replay-single test timeout Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test 244 hangs in recent testing. The last thing seen in the test_log shows the hang occurring during test10 or test11 of sendfile_grouplock:

== sanity test 244: sendfile with group lock tests =================================================== 18:46:27 (1531334787)
35+0 records in
35+0 records out
36700160 bytes (37 MB) copied, 0.482129 s, 76.1 MB/s
Starting test test10 at 1531334788

In all cases, the stack_dump is empty, and the only things seen in the console logs and dmesg are the test starting, the node rebooting, and sanity-sec starting. There is about a one-hour gap between when the test last reports in and when the node is rebooted:

[ 5769.663115] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 244: sendfile with group lock tests =================================================== 18:46:27 \(1531334787\)
[ 5769.858388] Lustre: DEBUG MARKER: == sanity test 244: sendfile with group lock tests =================================================== 18:46:27 (1531334787)

<ConMan> Console [trevis-12vm4] disconnected from <trevis-12:6003> at 07-11 19:49.

<ConMan> Console [trevis-12vm4] connected to <trevis-12:6003> at 07-11 19:49.

It looks like there is no information on why this test hung.

We have several instances of this failure, with logs at:
https://testing.whamcloud.com/test_sets/fca99f92-6fcd-11e8-aa24-52540065bddc
https://testing.whamcloud.com/test_sets/553d0058-80cd-11e8-b441-52540065bddc
https://testing.whamcloud.com/test_sets/f52eaf70-8d67-11e8-87f3-52540065bddc



 Comments   
Comment by Charlie Olmstead [ 01/Aug/18 ]

Stack dumps were empty due to a bug in AT (ATM-1046). Are there new instances of this issue after 7/23?

Comment by Andreas Dilger [ 05/Aug/18 ]

+1 on master

https://testing.whamcloud.com/test_sets/3ef30c02-97ed-11e8-b0aa-52540065bddc

Comment by Mikhail Pershin [ 28/Aug/18 ]

+1 on master
https://testing.whamcloud.com/test_sets/b74c176a-aacc-11e8-80f7-52540065bddc

Comment by Andreas Dilger [ 04/Sep/18 ]

16 failures out of 1160 test runs in the past 4 weeks, a roughly 1.4% failure rate. Definitely not one of the top 10 at this point.

Comment by Nathaniel Clark [ 14/Sep/18 ]

master review-ldiskfs

https://testing.whamcloud.com/test_sets/8eb7ed38-b7c4-11e8-a7de-52540065bddc

The console log for client 1 (onyx-30vm1) contains the stack traces of all running processes, including sendfile_grouplock:

[ 9676.429140] sendfile_groupl S ffff974e39416eb0     0 22342  22160 0x00000080
[ 9676.429921] Call Trace:
[ 9676.430178]  [<ffffffffb8714029>] schedule+0x29/0x70
[ 9676.430681]  [<ffffffffb87118d4>] schedule_timeout+0x174/0x2c0
[ 9676.431389]  [<ffffffffb80a3750>] ? internal_add_timer+0x70/0x70
[ 9676.431985]  [<ffffffffc09c2650>] ? ptlrpc_init_rq_pool+0x110/0x110 [ptlrpc]
[ 9676.432705]  [<ffffffffc09cc3b0>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
[ 9676.433440]  [<ffffffffb80cf670>] ? wake_up_state+0x20/0x20
[ 9676.433990]  [<ffffffffc09cc73d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[ 9676.434666]  [<ffffffffc09b16a2>] ldlm_cli_enqueue+0x3d2/0x920 [ptlrpc]
[ 9676.435414]  [<ffffffffc09ac7c0>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[ 9676.436210]  [<ffffffffc0ba2d50>] ? osc_lock_lockless_cancel+0xe0/0xe0 [osc]
[ 9676.436929]  [<ffffffffc0ba1ad0>] ? osc_lock_upcall+0x580/0x580 [osc]
[ 9676.437653]  [<ffffffffc0b98965>] osc_enqueue_base+0x2b5/0x6a0 [osc]
[ 9676.438288]  [<ffffffffc0ba1550>] ? osc_lock_lvb_update+0x330/0x330 [osc]
[ 9676.438979]  [<ffffffffc0ba37bb>] osc_lock_enqueue+0x38b/0x840 [osc]
[ 9676.439738]  [<ffffffffc0ba1550>] ? osc_lock_lvb_update+0x330/0x330 [osc]
[ 9676.440443]  [<ffffffffc0802d95>] cl_lock_enqueue+0x65/0x120 [obdclass]
[ 9676.441176]  [<ffffffffc0bfc285>] lov_lock_enqueue+0x95/0x150 [lov]
[ 9676.441817]  [<ffffffffc0802d95>] cl_lock_enqueue+0x65/0x120 [obdclass]
[ 9676.442567]  [<ffffffffc0803327>] cl_lock_request+0x67/0x1f0 [obdclass]
[ 9676.443227]  [<ffffffffc080721b>] cl_io_lock+0x2bb/0x3d0 [obdclass]
[ 9676.443866]  [<ffffffffc08075ab>] cl_io_loop+0x11b/0xc70 [obdclass]
[ 9676.444634]  [<ffffffffc0c54db2>] ll_file_io_generic+0x4e2/0xd10 [lustre]
[ 9676.445301]  [<ffffffffc0695395>] ? cfs_trace_unlock_tcd+0x35/0x90 [libcfs]
[ 9676.445990]  [<ffffffffb8357e01>] ? vsnprintf+0x1c1/0x6a0
[ 9676.446644]  [<ffffffffb824cca0>] ? splice_write_to_file+0x120/0x120
[ 9676.447271]  [<ffffffffc0c55b32>] ll_file_aio_write+0x372/0x540 [lustre]
[ 9676.447945]  [<ffffffffc0c55da4>] ll_file_write+0xa4/0x170 [lustre]
[ 9676.448648]  [<ffffffffb821c082>] __kernel_write+0x72/0x140
[ 9676.449194]  [<ffffffffb824ccfe>] write_pipe_buf+0x5e/0xa0
[ 9676.449748]  [<ffffffffb824c386>] splice_from_pipe_feed+0x86/0x130
[ 9676.450446]  [<ffffffffb824cca0>] ? splice_write_to_file+0x120/0x120
[ 9676.451054]  [<ffffffffb824c96e>] __splice_from_pipe+0x6e/0x90
[ 9676.451639]  [<ffffffffb824cca0>] ? splice_write_to_file+0x120/0x120
[ 9676.452363]  [<ffffffffb824e34e>] splice_from_pipe+0x5e/0x90
[ 9676.452960]  [<ffffffffb824e399>] default_file_splice_write+0x19/0x30
[ 9676.453610]  [<ffffffffb824d190>] do_splice_from+0xb0/0xf0
[ 9676.454208]  [<ffffffffb824d1f0>] direct_splice_actor+0x20/0x30
[ 9676.454793]  [<ffffffffb824cf27>] splice_direct_to_actor+0xd7/0x200
[ 9676.455486]  [<ffffffffb824d1d0>] ? do_splice_from+0xf0/0xf0
[ 9676.456039]  [<ffffffffb824d0b2>] do_splice_direct+0x62/0x90
[ 9676.456611]  [<ffffffffb821bb38>] do_sendfile+0x1d8/0x3c0
[ 9676.457212]  [<ffffffffb821d1ea>] SyS_sendfile64+0x9a/0xb0
[ 9676.457764]  [<ffffffffb87206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 9676.458508]  [<ffffffffb8720795>] system_call_fastpath+0x1c/0x21
[ 9676.459099]  [<ffffffffb87206e1>] ? system_call_after_swapgs+0xae/0x146
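
For reference, the trace shows sendfile_grouplock blocked in ptlrpc_set_wait() waiting for an OSC extent-lock enqueue to complete, underneath a sendfile() write on the group-locked file. The userspace side of that path looks roughly like the following (a minimal sketch, not the actual sendfile_grouplock.c from the test suite; the header path, file names, and group-lock id are assumptions):

/* Minimal sketch of the pattern the test exercises -- not the actual
 * sendfile_grouplock.c from the Lustre test suite.  The header path,
 * file names, and group-lock id below are assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/lustre/lustre_user.h>   /* older trees: <lustre/lustre_user.h> */

int main(void)
{
	int gid = 1234;		/* arbitrary group-lock id */
	int src = open("/mnt/lustre/f244.src", O_RDONLY);
	int dst = open("/mnt/lustre/f244.dst", O_WRONLY | O_CREAT, 0644);
	struct stat st;

	if (src < 0 || dst < 0 || fstat(src, &st) < 0) {
		perror("open/fstat");
		return 1;
	}

	/* take a Lustre group lock on the destination file */
	if (ioctl(dst, LL_IOC_GROUP_LOCK, gid) < 0) {
		perror("LL_IOC_GROUP_LOCK");
		return 1;
	}

	/* the write path under this call is where the trace above is blocked,
	 * waiting in ptlrpc_set_wait() for the extent-lock enqueue reply */
	if (sendfile(dst, src, NULL, st.st_size) < 0)
		perror("sendfile");

	ioctl(dst, LL_IOC_GROUP_UNLOCK, gid);
	close(src);
	close(dst);
	return 0;
}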
Comment by Andreas Dilger [ 15/Sep/18 ]

This looks like a duplicate of LU-11128, which has a fix about to land. I've also got an additional debugging patch that should land in case Alex's patch doesn't fix this, to improve debuggability beyond "deafening silence in the logs".

Comment by Patrick Farrell [ 17/Mar/22 ]

Probably a dupe of LU-11128
