[LU-11975] sanity test 60a hangs in llog_test_10 Created: 19/Feb/19  Updated: 08/Apr/19  Resolved: 27/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.1

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_60a hangs while running the llog regression tests in llog_test.c. The test started hanging on 18 February 2019 with 2.12.51.51.

Looking at the logs for one of the failures, https://testing.whamcloud.com/test_sets/41ea8004-348f-11e9-a05f-52540065bddc , the last output seen in the client test_log is the lctl invocation for llog_test:

trevis-10vm12: + RC=0
trevis-10vm12: + eval '/sbin/lctl <<-EOF || RC=2
trevis-10vm12: 	attach llog_test llt_name llt_uuid
trevis-10vm12: 	ignore_errors
trevis-10vm12: 	setup MGS
trevis-10vm12: 	device llt_name
trevis-10vm12: 	cleanup
trevis-10vm12: 	detach
trevis-10vm12: EOF'
trevis-10vm12: ++ /sbin/lctl

Looking at the MDS console log, we see that the test is hung in llog_test subtest 10h:

[ 2700.758677] Lustre: 29346:0:(llog_test.c:2045:llog_test_10()) 10h: write 64767 more log records
[ 2700.758851] Lustre: 29365:0:(llog_test.c:1469:cat_check_old_cb()) seeing record at index 2 - [0x1:0x4e:0x0] in log [0xa:0x11:0x0]
[ 2700.760785] LustreError: 29346:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 1317 sleeping
[ 2880.228053] INFO: task jbd2/dm-3-8:6113 blocked for more than 120 seconds.
[ 2880.228819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2880.229589] jbd2/dm-3-8     D ffff9862a41e9040     0  6113      2 0x00000080
[ 2880.230319] Call Trace:
[ 2880.230599]  [<ffffffff85767c49>] schedule+0x29/0x70
[ 2880.231117]  [<ffffffffc039c2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2]
[ 2880.231941]  [<ffffffff8502a59e>] ? __switch_to+0xce/0x580
[ 2880.232676]  [<ffffffff850d0880>] ? finish_task_switch+0x50/0x1c0
[ 2880.233269]  [<ffffffff850c2d00>] ? wake_up_atomic_t+0x30/0x30
[ 2880.233825]  [<ffffffff8576778f>] ? __schedule+0x3ff/0x890
[ 2880.234424]  [<ffffffff850aa0de>] ? try_to_del_timer_sync+0x5e/0x90
[ 2880.235057]  [<ffffffffc03a2e89>] kjournald2+0xc9/0x260 [jbd2]
[ 2880.235611]  [<ffffffff850c2d00>] ? wake_up_atomic_t+0x30/0x30
[ 2880.236169]  [<ffffffffc03a2dc0>] ? commit_timeout+0x10/0x10 [jbd2]
[ 2880.236834]  [<ffffffff850c1c31>] kthread+0xd1/0xe0
[ 2880.237302]  [<ffffffff850c1b60>] ? insert_kthread_work+0x40/0x40
[ 2880.237882]  [<ffffffff85774c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 2880.238500]  [<ffffffff850c1b60>] ? insert_kthread_work+0x40/0x40

The sanity test 60a hang at https://testing.whamcloud.com/test_sets/80bd5466-339d-11e9-bd83-52540065bddc shows a different call trace in the MDS console log:

[ 4208.005408] Lustre: 9960:0:(llog_test.c:2045:llog_test_10()) 10h: write 64767 more log records
[ 4208.005528] Lustre: 13221:0:(llog_test.c:1469:cat_check_old_cb()) seeing record at index 2 - [0x1:0x35:0x0] in log [0xa:0x15:0x0]
[ 4208.008970] LustreError: 9960:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 1317 sleeping
[ 4440.279385] INFO: task txg_quiesce:17743 blocked for more than 120 seconds.
[ 4440.280697] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4440.282045] txg_quiesce     D ffff9910f6268000     0 17743      2 0x00000080
[ 4440.283468] Call Trace:
[ 4440.283971]  [<ffffffffa2f67c49>] schedule+0x29/0x70
[ 4440.284879]  [<ffffffffc048c2d5>] cv_wait_common+0x125/0x150 [spl]
[ 4440.285989]  [<ffffffffa28c2d00>] ? wake_up_atomic_t+0x30/0x30
[ 4440.287197]  [<ffffffffc048c315>] __cv_wait+0x15/0x20 [spl]
[ 4440.288234]  [<ffffffffc05f2c2b>] txg_quiesce_thread+0x2fb/0x410 [zfs]
[ 4440.289475]  [<ffffffffc05f2930>] ? txg_init+0x2b0/0x2b0 [zfs]
[ 4440.290514]  [<ffffffffc0487063>] thread_generic_wrapper+0x73/0x80 [spl]
[ 4440.291768]  [<ffffffffc0486ff0>] ? __thread_exit+0x20/0x20 [spl]
[ 4440.292837]  [<ffffffffa28c1c31>] kthread+0xd1/0xe0
[ 4440.293814]  [<ffffffffa28c1b60>] ? insert_kthread_work+0x40/0x40
[ 4440.294891]  [<ffffffffa2f74c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 4440.296120]  [<ffffffffa28c1b60>] ? insert_kthread_work+0x40/0x40
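
In both console logs the llog_test thread's last message is "cfs_race id 1317 sleeping". cfs_race() in libcfs_fail.h implements a two-party synchronization point: when the race id is armed, the first thread to reach it goes to sleep and is only woken when a partner thread reaches the same point. Below is a minimal user-space sketch of that pairing (pthreads; this is not the libcfs implementation, and toy_race()/partner() are hypothetical names used only for illustration), showing why the sleeping side never runs again if its partner never reaches the matching race point.

/*
 * Hypothetical user-space sketch of the cfs_race() pairing semantics:
 * the first caller sleeps, the second caller wakes it.  If the partner
 * call never happens, the sleeper waits forever.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t race_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  race_waitq = PTHREAD_COND_INITIALIZER;
static bool race_armed = true;   /* analogue of the armed race id */
static bool race_state;          /* analogue of cfs_race_state */

static void toy_race(unsigned int id)          /* hypothetical helper */
{
	pthread_mutex_lock(&race_lock);
	if (race_armed) {
		/* first participant: disarm and sleep until the partner arrives */
		race_armed = false;
		printf("toy race id %x sleeping\n", id);
		while (!race_state)
			pthread_cond_wait(&race_waitq, &race_lock);
		printf("toy race id %x awake\n", id);
	} else {
		/* second participant: wake the sleeper */
		printf("toy race id %x waking\n", id);
		race_state = true;
		pthread_cond_broadcast(&race_waitq);
	}
	pthread_mutex_unlock(&race_lock);
}

static void *partner(void *arg)
{
	toy_race(0x1317);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, partner, NULL);
	/*
	 * Comment this call out to simulate a partner that never reaches
	 * its race point: partner() then sleeps in toy_race() forever and
	 * pthread_join() below never returns -- the shape of the 10h hang.
	 */
	toy_race(0x1317);
	pthread_join(t, NULL);
	return 0;
}

Neither console log above shows a matching "awake" message for id 1317, i.e. the partner side never reached its race point before the hung-task watchdog fired.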

The patch for LU-11591, https://review.whamcloud.com/#/c/33683/ , was the last patch to modify llog_test test 10 and landed around the time sanity test 60a started failing.

Logs for additional failures can be found at
https://testing.whamcloud.com/test_sets/6f903f48-348b-11e9-ae87-52540065bddc
https://testing.whamcloud.com/test_sets/86731228-3376-11e9-ae87-52540065bddc



 Comments   
Comment by Gerrit Updater [ 21/Feb/19 ]

Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/34287
Subject: LU-11975 test: fix for llog test 10h
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6cd6fa06bfc34ef84d775ccdaa369f4dd101d6ab

Comment by Gerrit Updater [ 27/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34287/
Subject: LU-11975 test: fix for llog test 10h
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0670d5aed457196121c843fd24877d3f2670d478

Comment by Peter Jones [ 27/Feb/19 ]

Landed for 2.13

Comment by Minh Diep [ 08/Apr/19 ]

b2_12 is part of https://review.whamcloud.com/#/c/34507/
