[LU-10778] racer test 1 hangs in racer process cleanup Created: 06/Mar/18 Updated: 11/Aug/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.10.4, Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
racer test_1 hangs with very little information on what is hung. In the suite_log, the only information that looks different from a successful run is:

layout: raid0 raid0
layout: raid0 raid0
racer cleanup
Trace dump:
= ./file_create.sh:1:main()
: FAIL: test-framework exiting on error
racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
there should be NO racer processes:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
Filesystem 1K-blocks Used Available Use% Mounted on
10.2.4.96@tcp:/lustre 75566092 298872 71166892 1% /mnt/lustre2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
Filesystem 1K-blocks Used Available Use% Mounted on
10.2.4.96@tcp:/lustre 75566092 298872 71166892 1% /mnt/lustre2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.

Looking at the above output and at racer_cleanup() in tests/racer/racer.sh (lines 27-60):

racer_cleanup()
{
	echo "racer cleanup"
	for P in $RACER_PROGS; do
		killall -q $P.sh
	done
	trap 0

	local TOT_WAIT=0
	local MAX_WAIT=$DURATION
	local SHORT_WAIT=5

	local rc
	while [[ $TOT_WAIT -le $MAX_WAIT ]]; do
		rc=0
		echo sleeping $SHORT_WAIT sec ...
		sleep $SHORT_WAIT
		# this only checks whether processes exist
		for P in $RACER_PROGS; do
			killall -0 $P.sh
			[[ $? -eq 0 ]] && (( rc+=1 ))
		done
		if [[ $rc -eq 0 ]]; then
			echo there should be NO racer processes:
			ps uww -C "${RACER_PROGS// /,}"
			return 0
		fi
		echo -n "Waited $(( TOT_WAIT + SHORT_WAIT)), rc=$rc "
		(( SHORT_WAIT+=SHORT_WAIT ))
		(( TOT_WAIT+=SHORT_WAIT ))
	done
	ps uww -C "${RACER_PROGS// /,}"
	return 1
}
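For reference, the wait loop above is bounded: SHORT_WAIT doubles on every pass and TOT_WAIT accumulates the doubled value, so with the DURATION=300 used in these runs the loop gives up after a handful of passes rather than waiting forever. A standalone sketch of just that arithmetic (not part of racer.sh, sleeps replaced by echo):

#!/bin/bash
# Standalone illustration of the back-off arithmetic in racer_cleanup()
# above, with the sleeps replaced by echo. DURATION=300 matches the runs
# quoted in this ticket; this is a sketch, not part of racer.sh.
DURATION=300
TOT_WAIT=0
SHORT_WAIT=5
pass=0
while [[ $TOT_WAIT -le $DURATION ]]; do
	(( pass+=1 ))
	echo "pass $pass: would sleep $SHORT_WAIT sec"
	(( SHORT_WAIT+=SHORT_WAIT ))
	(( TOT_WAIT+=SHORT_WAIT ))
done
# prints passes of 5, 10, 20, 40 and 80 seconds, i.e. the loop stops
# retrying after roughly 155 seconds of sleeping
echo "loop exits after $pass passes, TOT_WAIT=$TOT_WAIT"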
We see the output "there should be NO racer processes:", but we don't see "Waited …". Could the test be hung in cleaning up processes?

There are stack traces in the console messages for each of the nodes/VMs, with Lustre processes listed in some of the traces. For example, on one of the nodes:

[ 9869.370497] SysRq : Changing Loglevel
[ 9869.370996] Loglevel set to 8
[ 9869.860527] SysRq : Show State
[ 9869.860942] task PC stack pid father
[ 9869.861552] systemd S ffff88007c140000 0 1 0 0x00000000
[ 9869.862317] Call Trace:
[ 9869.862579] [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 9869.863062] [<ffffffff816aa9bd>] schedule_hrtimeout_range_clock+0x12d/0x150
[ 9869.863824] [<ffffffff8124da09>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
[ 9869.864588] [<ffffffff816aa9f3>] schedule_hrtimeout_range+0x13/0x20
[ 9869.865246] [<ffffffff8124dc9e>] ep_poll+0x23e/0x360
[ 9869.865790] [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[ 9869.866337] [<ffffffff8124f12d>] SyS_epoll_wait+0xed/0x120
[ 9869.866933] [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
[ 9869.867586] [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 9869.868228] [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
…
[ 9871.032843] Call Trace:
[ 9871.033091] [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 9871.033588] [<ffffffffc0691884>] cfs_wi_scheduler+0x304/0x450 [libcfs]
[ 9871.034328] [<ffffffff810b3690>] ? wake_up_atomic_t+0x30/0x30
[ 9871.034918] [<ffffffffc0691580>] ? cfs_wi_sched_create+0x680/0x680 [libcfs]
[ 9871.035593] [<ffffffff810b270f>] kthread+0xcf/0xe0
[ 9871.036123] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 9871.036728] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[ 9871.037314] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 9871.037927] cfs_rh_01 S ffff880079f50000 0 13072 2 0x00000080

but no process is clearly hung. Logs for these hangs are at |
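Since the console traces don't show which racer helper (if any) is stuck, one option would be for racer_cleanup() to dump the kernel stacks of any surviving helpers before giving up, so the suite_log carries more than the bare "there should be NO racer processes:" line. A hypothetical sketch, not currently in racer.sh (the function name and output format are assumptions):

#!/bin/bash
# Hypothetical debugging helper, not part of racer.sh: for every racer
# helper script still alive, print its kernel stack so a hang during
# cleanup leaves some evidence of where the process is blocked.
dump_surviving_racers() {
	local P pid
	for P in $RACER_PROGS; do
		# same existence check racer_cleanup() uses
		killall -q -0 $P.sh || continue
		for pid in $(pgrep -f $P.sh); do
			echo "== surviving $P.sh pid $pid =="
			# kernel stack of the task, if it is still present
			cat /proc/$pid/stack 2>/dev/null
		done
	done
}

Called right before the final "ps uww" in racer_cleanup(), something like this would show whether a lingering helper is blocked in a Lustre call or whether the cleanup loop itself is what stalls.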
| Comments |
| Comment by Saurabh Tandan (Inactive) [ 08/May/18 ] |
|
+1 on 2.10.3 https://testing.hpdd.intel.com/test_sets/16cb3064-50c2-11e8-b9d3-52540065bddc |
| Comment by Andreas Dilger [ 06/Feb/19 ] |
|
This is still happening fairly frequently, about 20% of recent test runs. |
| Comment by Arshad Hussain [ 09/Jul/21 ] |
|
+1 On Master: https://testing.whamcloud.com/test_sets/98916306-88da-4f75-a874-58d0b7c9f8fb |
| Comment by Alena Nikitenko [ 30/Nov/21 ] |
|
Might have encountered this in 2.12.8 testing: https://testing.whamcloud.com/test_sets/d46c6037-83f1-4fed-8f56-d03de5a7ff1e

...
racer cleanup
racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
./file_exec.sh: line 16: 6184 Terminated $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
there should be NO racer processes:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
Filesystem 1K-blocks Used Available Use% Mounted on
10.240.23.189@tcp:/lustre 13362580 3687184 8803420 30% /mnt/lustre2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 900 seconds.
...
[Sat Nov 20 06:05:59 2021] systemd S ffff9485783e8640 0 1 0 0x00000000
[Sat Nov 20 06:05:59 2021] Call Trace:
[Sat Nov 20 06:05:59 2021] [<ffffffff93789179>] schedule+0x29/0x70
[Sat Nov 20 06:05:59 2021] [<ffffffff937885bd>] schedule_hrtimeout_range_clock+0x12d/0x150
[Sat Nov 20 06:05:59 2021] [<ffffffff9329d5e9>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
[Sat Nov 20 06:05:59 2021] [<ffffffff937885f3>] schedule_hrtimeout_range+0x13/0x20
[Sat Nov 20 06:05:59 2021] [<ffffffff9329d87e>] ep_poll+0x23e/0x360
[Sat Nov 20 06:05:59 2021] [<ffffffff93260a81>] ? do_unlinkat+0xf1/0x2d0
[Sat Nov 20 06:05:59 2021] [<ffffffff930dadf0>] ? wake_up_state+0x20/0x20
[Sat Nov 20 06:05:59 2021] [<ffffffff9329ed5d>] SyS_epoll_wait+0xed/0x120
[Sat Nov 20 06:05:59 2021] [<ffffffff93795ec9>] ? system_call_after_swapgs+0x96/0x13a
[Sat Nov 20 06:05:59 2021] [<ffffffff93795f92>] system_call_fastpath+0x25/0x2a
[Sat Nov 20 06:05:59 2021] [<ffffffff93795ed5>] ? system_call_after_swapgs+0xa2/0x13a
...
cfs_rh_00 S ffff9484f6656300 0 9208 2 0x00000080
[Sat Nov 20 06:06:01 2021] Call Trace:
[Sat Nov 20 06:06:01 2021] [<ffffffff93789179>] schedule+0x29/0x70
[Sat Nov 20 06:06:01 2021] [<ffffffffc09a8e6c>] cfs_wi_scheduler+0x30c/0x460 [libcfs]
[Sat Nov 20 06:06:01 2021] [<ffffffff930c6f50>] ? wake_up_atomic_t+0x30/0x30
[Sat Nov 20 06:06:01 2021] [<ffffffffc09a8b60>] ? cfs_wi_sched_create+0x680/0x680 [libcfs]
[Sat Nov 20 06:06:01 2021] [<ffffffff930c5e61>] kthread+0xd1/0xe0
[Sat Nov 20 06:06:01 2021] [<ffffffff930c5d90>] ? insert_kthread_work+0x40/0x40
[Sat Nov 20 06:06:01 2021] [<ffffffff93795df7>] ret_from_fork_nospec_begin+0x21/0x21
[Sat Nov 20 06:06:01 2021] [<ffffffff930c5d90>] ? insert_kthread_work+0x40/0x40
[Sat Nov 20 06:06:01 2021] cfs_rh_01 S ffff9484f6652100 0 9209 2 0x00000080 |