Details
Type: Bug
Priority: Major
Resolution: Unresolved
Affects Versions: Lustre 2.11.0, Lustre 2.10.4, Lustre 2.13.0
Description
racer test_1 hangs with very little information on what is hung. In the suite_log, the only information that looks different from a successful run is:
layout: raid0 raid0
layout: raid0 raid0
racer cleanup
Trace dump:
= ./file_create.sh:1:main()
: FAIL: test-framework exiting on error
racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
there should be NO racer processes:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem            1K-blocks   Used Available Use% Mounted on
10.2.4.96@tcp:/lustre  75566092 298872  71166892   1% /mnt/lustre2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem            1K-blocks   Used Available Use% Mounted on
10.2.4.96@tcp:/lustre  75566092 298872  71166892   1% /mnt/lustre2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
Looking at the above output and at racer_cleanup() (lines 27-60 of tests/racer/racer.sh):
racer_cleanup()
{
        echo "racer cleanup"
        for P in $RACER_PROGS; do
                killall -q $P.sh
        done
        trap 0

        local TOT_WAIT=0
        local MAX_WAIT=$DURATION
        local SHORT_WAIT=5

        local rc
        while [[ $TOT_WAIT -le $MAX_WAIT ]]; do
                rc=0
                echo sleeping $SHORT_WAIT sec ...
                sleep $SHORT_WAIT
                # this only checks whether processes exist
                for P in $RACER_PROGS; do
                        killall -0 $P.sh
                        [[ $? -eq 0 ]] && (( rc+=1 ))
                done
                if [[ $rc -eq 0 ]]; then
                        echo there should be NO racer processes:
                        ps uww -C "${RACER_PROGS// /,}"
                        return 0
                fi
                echo -n "Waited $(( TOT_WAIT + SHORT_WAIT)), rc=$rc "
                (( SHORT_WAIT+=SHORT_WAIT ))
                (( TOT_WAIT+=SHORT_WAIT ))
        done
        ps uww -C "${RACER_PROGS// /,}"
        return 1
}
We see the output “there should be NO racer processes:”, but we never see “Waited …”, which racer_cleanup() prints only when some $P.sh script survives a check. Could the test be hung in cleaning up processes?
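To get more information on what is actually still running at that point, one option would be extra diagnostics in the cleanup path. The following is only a sketch of such a helper (report_racer_survivors is a hypothetical name, not part of racer.sh); it assumes RACER_PROGS is exported the way racer.sh sets it and that it runs as root on the client:

#!/bin/bash
# Hypothetical debugging helper (not part of racer.sh): report which racer
# scripts are still alive during cleanup and where each survivor is blocked.

# Expect RACER_PROGS to be exported the same way racer.sh sets it.
: "${RACER_PROGS:?export RACER_PROGS as racer.sh does}"

report_racer_survivors() {
        local prog pid
        for prog in $RACER_PROGS; do
                # killall -0 only tests for existence, as in racer_cleanup()
                killall -q -0 $prog.sh || continue
                echo "$prog.sh still running:"
                for pid in $(pgrep -f $prog.sh); do
                        # wchan is the kernel function the task sleeps in;
                        # /proc/<pid>/stack (root only) gives the full trace
                        echo "  pid $pid wchan=$(cat /proc/$pid/wchan 2>/dev/null)"
                        sed 's/^/    /' /proc/$pid/stack 2>/dev/null
                done
        done
}

report_racer_survivors

Calling something like this just before racer_cleanup() returns would show whether any of the $P.sh scripts, or a process they spawned, actually survives the killall.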
There are stack traces in the console messages for each of the nodes/VMs, with Lustre processes listed in some of the traces. For example, on one of the nodes:
[ 9869.370497] SysRq : Changing Loglevel
[ 9869.370996] Loglevel set to 8
[ 9869.860527] SysRq : Show State
[ 9869.860942]   task                        PC stack   pid father
[ 9869.861552] systemd         S ffff88007c140000     0     1      0 0x00000000
[ 9869.862317] Call Trace:
[ 9869.862579]  [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 9869.863062]  [<ffffffff816aa9bd>] schedule_hrtimeout_range_clock+0x12d/0x150
[ 9869.863824]  [<ffffffff8124da09>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
[ 9869.864588]  [<ffffffff816aa9f3>] schedule_hrtimeout_range+0x13/0x20
[ 9869.865246]  [<ffffffff8124dc9e>] ep_poll+0x23e/0x360
[ 9869.865790]  [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[ 9869.866337]  [<ffffffff8124f12d>] SyS_epoll_wait+0xed/0x120
[ 9869.866933]  [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
[ 9869.867586]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 9869.868228]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
…
[ 9871.032843] Call Trace:
[ 9871.033091]  [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 9871.033588]  [<ffffffffc0691884>] cfs_wi_scheduler+0x304/0x450 [libcfs]
[ 9871.034328]  [<ffffffff810b3690>] ? wake_up_atomic_t+0x30/0x30
[ 9871.034918]  [<ffffffffc0691580>] ? cfs_wi_sched_create+0x680/0x680 [libcfs]
[ 9871.035593]  [<ffffffff810b270f>] kthread+0xcf/0xe0
[ 9871.036123]  [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 9871.036728]  [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[ 9871.037314]  [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 9871.037927] cfs_rh_01       S ffff880079f50000     0 13072      2 0x00000080
[
However, no process is clearly hung.
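The dump above comes from SysRq (loglevel 8, then Show State), which lists every task on the node, so a stuck thread is easy to miss. A minimal sketch (not taken from the test logs; assumes root on the affected node) of restricting the view to tasks in uninterruptible sleep, which is where a hung Lustre thread would normally show up:

#!/bin/bash
# Show only tasks in uninterruptible sleep (state D), together with the
# kernel function each one is blocked in.
ps axo pid,stat,wchan:40,comm | awk 'NR == 1 || $2 ~ /D/'

# Log just the blocked tasks to the console (SysRq-w), which is much smaller
# than the full SysRq-t "Show State" dump quoted above.  Requires root.
echo w > /proc/sysrq-trigger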
Logs for these hangs are at
https://testing.whamcloud.com/test_sets/7ffca4b0-200b-11e8-9ec4-52540065bddc