Details
Type: Bug
Priority: Major
Resolution: Unresolved
Affects Versions: Lustre 2.11.0, Lustre 2.10.4, Lustre 2.13.0
Description
racer test_1 hangs with very little information on what is hung. In the suite_log, the only information that looks different from a successful run is:
layout: raid0 raid0
layout: raid0 raid0
racer cleanup
Trace dump:
= ./file_create.sh:1:main()
: FAIL: test-framework exiting on error
racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
there should be NO racer processes:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem            1K-blocks   Used Available Use% Mounted on
10.2.4.96@tcp:/lustre  75566092 298872  71166892   1% /mnt/lustre2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem            1K-blocks   Used Available Use% Mounted on
10.2.4.96@tcp:/lustre  75566092 298872  71166892   1% /mnt/lustre2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
Looking at the above output and at racer_cleanup() (lines 27-60 of tests/racer/racer.sh):
racer_cleanup()
{
        echo "racer cleanup"
        for P in $RACER_PROGS; do
                killall -q $P.sh
        done
        trap 0

        local TOT_WAIT=0
        local MAX_WAIT=$DURATION
        local SHORT_WAIT=5

        local rc
        while [[ $TOT_WAIT -le $MAX_WAIT ]]; do
                rc=0
                echo sleeping $SHORT_WAIT sec ...
                sleep $SHORT_WAIT
                # this only checks whether processes exist
                for P in $RACER_PROGS; do
                        killall -0 $P.sh
                        [[ $? -eq 0 ]] && (( rc+=1 ))
                done
                if [[ $rc -eq 0 ]]; then
                        echo there should be NO racer processes:
                        ps uww -C "${RACER_PROGS// /,}"
                        return 0
                fi
                echo -n "Waited $(( TOT_WAIT + SHORT_WAIT)), rc=$rc "
                (( SHORT_WAIT+=SHORT_WAIT ))
                (( TOT_WAIT+=SHORT_WAIT ))
        done
        ps uww -C "${RACER_PROGS// /,}"
        return 1
}
We see the output “there should be NO racer processes:”, but we never see “Waited …”, which racer_cleanup() prints only when some $P.sh script survives a check. Could the test be hung in cleaning up processes?
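To get more information on what is actually still running at that point, one option would be extra diagnostics in the cleanup path. The following is only a sketch of such a helper (report_racer_survivors is a hypothetical name, not part of racer.sh); it assumes RACER_PROGS is exported the way racer.sh sets it and that it runs as root on the client:

#!/bin/bash
# Hypothetical debugging helper (not part of racer.sh): report which racer
# scripts are still alive during cleanup and where each survivor is blocked.

# Expect RACER_PROGS to be exported the same way racer.sh sets it.
: "${RACER_PROGS:?export RACER_PROGS as racer.sh does}"

report_racer_survivors() {
        local prog pid
        for prog in $RACER_PROGS; do
                # killall -0 only tests for existence, as in racer_cleanup()
                killall -q -0 $prog.sh || continue
                echo "$prog.sh still running:"
                for pid in $(pgrep -f $prog.sh); do
                        # wchan is the kernel function the task sleeps in;
                        # /proc/<pid>/stack (root only) gives the full trace
                        echo "  pid $pid wchan=$(cat /proc/$pid/wchan 2>/dev/null)"
                        sed 's/^/    /' /proc/$pid/stack 2>/dev/null
                done
        done
}

report_racer_survivors

Calling something like this just before racer_cleanup() returns would show whether any of the $P.sh scripts, or a process they spawned, actually survives the killall.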
There are stack traces in the console messages for each of the nodes/VMs, with Lustre processes listed in some of the traces. For example, on one of the nodes:
[ 9869.370497] SysRq : Changing Loglevel
[ 9869.370996] Loglevel set to 8
[ 9869.860527] SysRq : Show State
[ 9869.860942]   task                        PC stack   pid father
[ 9869.861552] systemd         S ffff88007c140000     0     1      0 0x00000000
[ 9869.862317] Call Trace:
[ 9869.862579]  [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 9869.863062]  [<ffffffff816aa9bd>] schedule_hrtimeout_range_clock+0x12d/0x150
[ 9869.863824]  [<ffffffff8124da09>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
[ 9869.864588]  [<ffffffff816aa9f3>] schedule_hrtimeout_range+0x13/0x20
[ 9869.865246]  [<ffffffff8124dc9e>] ep_poll+0x23e/0x360
[ 9869.865790]  [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[ 9869.866337]  [<ffffffff8124f12d>] SyS_epoll_wait+0xed/0x120
[ 9869.866933]  [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
[ 9869.867586]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 9869.868228]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
…
[ 9871.032843] Call Trace:
[ 9871.033091]  [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 9871.033588]  [<ffffffffc0691884>] cfs_wi_scheduler+0x304/0x450 [libcfs]
[ 9871.034328]  [<ffffffff810b3690>] ? wake_up_atomic_t+0x30/0x30
[ 9871.034918]  [<ffffffffc0691580>] ? cfs_wi_sched_create+0x680/0x680 [libcfs]
[ 9871.035593]  [<ffffffff810b270f>] kthread+0xcf/0xe0
[ 9871.036123]  [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 9871.036728]  [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[ 9871.037314]  [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 9871.037927] cfs_rh_01       S ffff880079f50000     0 13072      2 0x00000080
[
However, no process is clearly hung.
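The dump above comes from SysRq (loglevel 8, then Show State), which lists every task on the node, so a stuck thread is easy to miss. A minimal sketch (not taken from the test logs; assumes root on the affected node) of restricting the view to tasks in uninterruptible sleep, which is where a hung Lustre thread would normally show up:

#!/bin/bash
# Show only tasks in uninterruptible sleep (state D), together with the
# kernel function each one is blocked in.
ps axo pid,stat,wchan:40,comm | awk 'NR == 1 || $2 ~ /D/'

# Log just the blocked tasks to the console (SysRq-w), which is much smaller
# than the full SysRq-t "Show State" dump quoted above.  Requires root.
echo w > /proc/sysrq-trigger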
Logs for these hangs are at
https://testing.whamcloud.com/test_sets/7ffca4b0-200b-11e8-9ec4-52540065bddc