[LU-11183] sanity test 244 hangs with no information in the logs Created: 26/Jul/18 Updated: 17/Mar/22 Resolved: 17/Mar/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity test 244 hangs in recent testing. The last thing seen in the test_log is a hang during test 10 or 11:

== sanity test 244: sendfile with group lock tests =================================================== 18:46:27 (1531334787)
35+0 records in
35+0 records out
36700160 bytes (37 MB) copied, 0.482129 s, 76.1 MB/s
Starting test test10 at 1531334788

In all cases, the stack_dump is empty, and the only things seen in the console logs and dmesg are the test starting, the node rebooting, and sanity-sec starting. There is about a one-hour gap between when the test last reports in and when the node is rebooted:

[ 5769.663115] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 244: sendfile with group lock tests =================================================== 18:46:27 \(1531334787\)
[ 5769.858388] Lustre: DEBUG MARKER: == sanity test 244: sendfile with group lock tests =================================================== 18:46:27 (1531334787)
<ConMan> Console [trevis-12vm4] disconnected from <trevis-12:6003> at 07-11 19:49.
<ConMan> Console [trevis-12vm4] connected to <trevis-12:6003> at 07-11 19:49.

It looks like there is no information on why this test hung. We have several instances of this with logs at |
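For context on what the test exercises: sanity test 244 drives a sendfile_grouplock helper, and the stack trace attached in a later comment shows it blocked inside a sendfile(2) call issued while a Lustre group lock is held. The following is only a minimal sketch of that pattern, not the actual test helper; the file paths and group-lock id are hypothetical, and the lustre_user.h header path varies between releases (LL_IOC_GROUP_LOCK/LL_IOC_GROUP_UNLOCK are the standard group-lock ioctls).

{code:c}
/*
 * Minimal sketch (not the sanity/244 helper itself): take a Lustre group
 * lock on the source file, then copy it to a destination with sendfile(2).
 * Paths and the group-lock id below are made up for illustration.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/lustre/lustre_user.h>  /* LL_IOC_GROUP_LOCK; header path varies by release */

int main(void)
{
    const char *src_path = "/mnt/lustre/f244.src";  /* hypothetical */
    const char *dst_path = "/mnt/lustre/f244.dst";  /* hypothetical */
    int gid = 14242;                                /* arbitrary group-lock id */
    struct stat st;
    off_t off = 0;

    int src = open(src_path, O_RDONLY);
    int dst = open(dst_path, O_WRONLY | O_CREAT, 0644);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }
    if (fstat(src, &st) < 0) {
        perror("fstat");
        return 1;
    }

    /* Hold the group lock on the source for the whole transfer. */
    if (ioctl(src, LL_IOC_GROUP_LOCK, gid) < 0) {
        perror("LL_IOC_GROUP_LOCK");
        return 1;
    }

    /* In the reported hang, a sendfile() call like this never returns;
     * the client-side write path is stuck waiting for a lock reply. */
    if (sendfile(dst, src, &off, st.st_size) < 0)
        perror("sendfile");

    if (ioctl(src, LL_IOC_GROUP_UNLOCK, gid) < 0)
        perror("LL_IOC_GROUP_UNLOCK");

    close(src);
    close(dst);
    return 0;
}
{code}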
| Comments |
| Comment by Charlie Olmstead [ 01/Aug/18 ] |
|
Stack dumps were empty due to a bug in AT (ATM-1046). Are there new instances of this issue after 7/23? |
| Comment by Andreas Dilger [ 05/Aug/18 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/3ef30c02-97ed-11e8-b0aa-52540065bddc |
| Comment by Mikhail Pershin [ 28/Aug/18 ] |
|
+1 on master |
| Comment by Andreas Dilger [ 04/Sep/18 ] |
|
16 failures in the past 4 weeks, of 1160 test runs, or 0.13% failure rate. Definitely not one of the top 10 at this point. |
| Comment by Nathaniel Clark [ 14/Sep/18 ] |
|
master review-ldiskfs https://testing.whamcloud.com/test_sets/8eb7ed38-b7c4-11e8-a7de-52540065bddc

The console log for client 1 (onyx-30vm1) contains the stack traces of all running processes, including sendfile_grouplock:

[ 9676.429140] sendfile_groupl S ffff974e39416eb0 0 22342 22160 0x00000080
[ 9676.429921] Call Trace:
[ 9676.430178] [<ffffffffb8714029>] schedule+0x29/0x70
[ 9676.430681] [<ffffffffb87118d4>] schedule_timeout+0x174/0x2c0
[ 9676.431389] [<ffffffffb80a3750>] ? internal_add_timer+0x70/0x70
[ 9676.431985] [<ffffffffc09c2650>] ? ptlrpc_init_rq_pool+0x110/0x110 [ptlrpc]
[ 9676.432705] [<ffffffffc09cc3b0>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
[ 9676.433440] [<ffffffffb80cf670>] ? wake_up_state+0x20/0x20
[ 9676.433990] [<ffffffffc09cc73d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[ 9676.434666] [<ffffffffc09b16a2>] ldlm_cli_enqueue+0x3d2/0x920 [ptlrpc]
[ 9676.435414] [<ffffffffc09ac7c0>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[ 9676.436210] [<ffffffffc0ba2d50>] ? osc_lock_lockless_cancel+0xe0/0xe0 [osc]
[ 9676.436929] [<ffffffffc0ba1ad0>] ? osc_lock_upcall+0x580/0x580 [osc]
[ 9676.437653] [<ffffffffc0b98965>] osc_enqueue_base+0x2b5/0x6a0 [osc]
[ 9676.438288] [<ffffffffc0ba1550>] ? osc_lock_lvb_update+0x330/0x330 [osc]
[ 9676.438979] [<ffffffffc0ba37bb>] osc_lock_enqueue+0x38b/0x840 [osc]
[ 9676.439738] [<ffffffffc0ba1550>] ? osc_lock_lvb_update+0x330/0x330 [osc]
[ 9676.440443] [<ffffffffc0802d95>] cl_lock_enqueue+0x65/0x120 [obdclass]
[ 9676.441176] [<ffffffffc0bfc285>] lov_lock_enqueue+0x95/0x150 [lov]
[ 9676.441817] [<ffffffffc0802d95>] cl_lock_enqueue+0x65/0x120 [obdclass]
[ 9676.442567] [<ffffffffc0803327>] cl_lock_request+0x67/0x1f0 [obdclass]
[ 9676.443227] [<ffffffffc080721b>] cl_io_lock+0x2bb/0x3d0 [obdclass]
[ 9676.443866] [<ffffffffc08075ab>] cl_io_loop+0x11b/0xc70 [obdclass]
[ 9676.444634] [<ffffffffc0c54db2>] ll_file_io_generic+0x4e2/0xd10 [lustre]
[ 9676.445301] [<ffffffffc0695395>] ? cfs_trace_unlock_tcd+0x35/0x90 [libcfs]
[ 9676.445990] [<ffffffffb8357e01>] ? vsnprintf+0x1c1/0x6a0
[ 9676.446644] [<ffffffffb824cca0>] ? splice_write_to_file+0x120/0x120
[ 9676.447271] [<ffffffffc0c55b32>] ll_file_aio_write+0x372/0x540 [lustre]
[ 9676.447945] [<ffffffffc0c55da4>] ll_file_write+0xa4/0x170 [lustre]
[ 9676.448648] [<ffffffffb821c082>] __kernel_write+0x72/0x140
[ 9676.449194] [<ffffffffb824ccfe>] write_pipe_buf+0x5e/0xa0
[ 9676.449748] [<ffffffffb824c386>] splice_from_pipe_feed+0x86/0x130
[ 9676.450446] [<ffffffffb824cca0>] ? splice_write_to_file+0x120/0x120
[ 9676.451054] [<ffffffffb824c96e>] __splice_from_pipe+0x6e/0x90
[ 9676.451639] [<ffffffffb824cca0>] ? splice_write_to_file+0x120/0x120
[ 9676.452363] [<ffffffffb824e34e>] splice_from_pipe+0x5e/0x90
[ 9676.452960] [<ffffffffb824e399>] default_file_splice_write+0x19/0x30
[ 9676.453610] [<ffffffffb824d190>] do_splice_from+0xb0/0xf0
[ 9676.454208] [<ffffffffb824d1f0>] direct_splice_actor+0x20/0x30
[ 9676.454793] [<ffffffffb824cf27>] splice_direct_to_actor+0xd7/0x200
[ 9676.455486] [<ffffffffb824d1d0>] ? do_splice_from+0xf0/0xf0
[ 9676.456039] [<ffffffffb824d0b2>] do_splice_direct+0x62/0x90
[ 9676.456611] [<ffffffffb821bb38>] do_sendfile+0x1d8/0x3c0
[ 9676.457212] [<ffffffffb821d1ea>] SyS_sendfile64+0x9a/0xb0
[ 9676.457764] [<ffffffffb87206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 9676.458508] [<ffffffffb8720795>] system_call_fastpath+0x1c/0x21
[ 9676.459099] [<ffffffffb87206e1>] ? system_call_after_swapgs+0xae/0x146 |
| Comment by Andreas Dilger [ 15/Sep/18 ] |
|
This looks like a duplicate of |
| Comment by Patrick Farrell [ 17/Mar/22 ] |
|
Probably a dupe of |