[LU-10778] racer test 1 hangs in racer process cleanup Created: 06/Mar/18  Updated: 11/Aug/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4, Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

racer test_1 hangs with very little information about what is hung. In the suite_log, the only output that differs from a successful run is:

layout: raid0 raid0
layout: raid0 raid0
racer cleanup
  Trace dump:
  = ./file_create.sh:1:main()
: FAIL: test-framework exiting on error
racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
there should be NO racer processes:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem            1K-blocks   Used Available Use% Mounted on
10.2.4.96@tcp:/lustre  75566092 298872  71166892   1% /mnt/lustre2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem            1K-blocks   Used Available Use% Mounted on
10.2.4.96@tcp:/lustre  75566092 298872  71166892   1% /mnt/lustre2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.

Comparing the output above with racer_cleanup() in tests/racer/racer.sh:

racer_cleanup()
{
        echo "racer cleanup"
        for P in $RACER_PROGS; do
                killall -q $P.sh
        done
        trap 0

        local TOT_WAIT=0
        local MAX_WAIT=$DURATION
        local SHORT_WAIT=5

        local rc
        while [[ $TOT_WAIT -le $MAX_WAIT ]]; do
                rc=0
                echo sleeping $SHORT_WAIT sec ...
                sleep $SHORT_WAIT
                # this only checks whether processes exist
                for P in $RACER_PROGS; do
                        killall -0 $P.sh
                        [[ $? -eq 0 ]] && (( rc+=1 ))
                done
                if [[ $rc -eq 0 ]]; then
                        echo there should be NO racer processes:
                        ps uww -C "${RACER_PROGS// /,}"
                        return 0
                fi
                echo -n "Waited $(( TOT_WAIT + SHORT_WAIT)), rc=$rc "
                (( SHORT_WAIT+=SHORT_WAIT ))
                (( TOT_WAIT+=SHORT_WAIT ))
        done
        ps uww -C "${RACER_PROGS// /,}"
        return 1
}

We see the output "there should be NO racer processes:", but we never see "Waited …", which is printed only when a racer process survives the first check and the loop has to retry. Could the test be hung in cleaning up processes?
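
For reference, the liveness check in that loop relies on signal 0: "killall -0 NAME" exits 0 only if a matching process exists, without actually delivering a signal. A minimal stand-alone sketch of the same check (the two script names below are only placeholders for entries in RACER_PROGS):

# Sketch only: count surviving racer scripts the same way racer_cleanup() does.
# "file_create" and "file_rm" stand in for the real RACER_PROGS list.
survivors=0
for P in file_create file_rm; do
        # Signal 0 delivers nothing; killall exits 0 iff a process named $P.sh exists.
        if killall -0 $P.sh > /dev/null 2>&1; then
                echo "$P.sh still running:"
                ps uww -C $P.sh
                (( survivors += 1 ))
        fi
done
echo "surviving racer scripts: $survivors"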

There are stack traces in the console logs for each of the nodes/VMs, with Lustre processes listed in some of the traces. For example:

 [ 9869.370497] SysRq : Changing Loglevel
[ 9869.370996] Loglevel set to 8
[ 9869.860527] SysRq : Show State
[ 9869.860942]   task                        PC stack   pid father
[ 9869.861552] systemd         S ffff88007c140000     0     1      0 0x00000000
[ 9869.862317] Call Trace:
[ 9869.862579]  [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 9869.863062]  [<ffffffff816aa9bd>] schedule_hrtimeout_range_clock+0x12d/0x150
[ 9869.863824]  [<ffffffff8124da09>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
[ 9869.864588]  [<ffffffff816aa9f3>] schedule_hrtimeout_range+0x13/0x20
[ 9869.865246]  [<ffffffff8124dc9e>] ep_poll+0x23e/0x360
[ 9869.865790]  [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[ 9869.866337]  [<ffffffff8124f12d>] SyS_epoll_wait+0xed/0x120
[ 9869.866933]  [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
[ 9869.867586]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 9869.868228]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
…
[ 9871.032843] Call Trace:
[ 9871.033091]  [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 9871.033588]  [<ffffffffc0691884>] cfs_wi_scheduler+0x304/0x450 [libcfs]
[ 9871.034328]  [<ffffffff810b3690>] ? wake_up_atomic_t+0x30/0x30
[ 9871.034918]  [<ffffffffc0691580>] ? cfs_wi_sched_create+0x680/0x680 [libcfs]
[ 9871.035593]  [<ffffffff810b270f>] kthread+0xcf/0xe0
[ 9871.036123]  [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 9871.036728]  [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[ 9871.037314]  [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 9871.037927] cfs_rh_01       S ffff880079f50000     0 13072      2 0x00000080

However, no process is clearly hung in any of the traces.
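
For reference, the "Changing Loglevel" and "Show State" lines at the top of that dump are the standard SysRq operations; a minimal sketch of collecting the same task-state dump by hand on a hung node (assumes root and that SysRq is enabled):

# Sketch only; run as root on the node that appears hung.
echo 1 > /proc/sys/kernel/sysrq     # allow all SysRq functions
echo 8 > /proc/sysrq-trigger        # "Changing Loglevel": raise console loglevel to 8
echo t > /proc/sysrq-trigger        # "Show State": dump every task's state and stack
dmesg | tail -n 1000                # the traces land in the kernel log / serial console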

Logs for these hangs are at
https://testing.whamcloud.com/test_sets/7ffca4b0-200b-11e8-9ec4-52540065bddc



 Comments   
Comment by Saurabh Tandan (Inactive) [ 08/May/18 ]

+1 on 2.10.3 https://testing.hpdd.intel.com/test_sets/16cb3064-50c2-11e8-b9d3-52540065bddc

Comment by Andreas Dilger [ 06/Feb/19 ]

This is still happening fairly frequently, about 20% of recent test runs.

Comment by Arshad Hussain [ 09/Jul/21 ]

+1 On Master: https://testing.whamcloud.com/test_sets/98916306-88da-4f75-a874-58d0b7c9f8fb

Comment by Arshad Hussain [ 09/Jul/21 ]

>+1 On Master: https://testing.whamcloud.com/test_sets/98916306-88da-4f75-a874-58d0b7c9f8fb
Sorry. This is unrelated.

Comment by Alena Nikitenko [ 30/Nov/21 ]

Might have encountered this in 2.12.8 testing: https://testing.whamcloud.com/test_sets/d46c6037-83f1-4fed-8f56-d03de5a7ff1e

...
racer cleanup
racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
./file_exec.sh: line 16:  6184 Terminated              $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
there should be NO racer processes:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem                1K-blocks    Used Available Use% Mounted on
10.240.23.189@tcp:/lustre  13362580 3687184   8803420  30% /mnt/lustre2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 900 seconds. 
...
[Sat Nov 20 06:05:59 2021] systemd         S ffff9485783e8640     0     1      0 0x00000000
[Sat Nov 20 06:05:59 2021] Call Trace:
[Sat Nov 20 06:05:59 2021]  [<ffffffff93789179>] schedule+0x29/0x70
[Sat Nov 20 06:05:59 2021]  [<ffffffff937885bd>] schedule_hrtimeout_range_clock+0x12d/0x150
[Sat Nov 20 06:05:59 2021]  [<ffffffff9329d5e9>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
[Sat Nov 20 06:05:59 2021]  [<ffffffff937885f3>] schedule_hrtimeout_range+0x13/0x20
[Sat Nov 20 06:05:59 2021]  [<ffffffff9329d87e>] ep_poll+0x23e/0x360
[Sat Nov 20 06:05:59 2021]  [<ffffffff93260a81>] ? do_unlinkat+0xf1/0x2d0
[Sat Nov 20 06:05:59 2021]  [<ffffffff930dadf0>] ? wake_up_state+0x20/0x20
[Sat Nov 20 06:05:59 2021]  [<ffffffff9329ed5d>] SyS_epoll_wait+0xed/0x120
[Sat Nov 20 06:05:59 2021]  [<ffffffff93795ec9>] ? system_call_after_swapgs+0x96/0x13a
[Sat Nov 20 06:05:59 2021]  [<ffffffff93795f92>] system_call_fastpath+0x25/0x2a
[Sat Nov 20 06:05:59 2021]  [<ffffffff93795ed5>] ? system_call_after_swapgs+0xa2/0x13a
...
cfs_rh_00       S ffff9484f6656300     0  9208      2 0x00000080
[Sat Nov 20 06:06:01 2021] Call Trace:
[Sat Nov 20 06:06:01 2021]  [<ffffffff93789179>] schedule+0x29/0x70
[Sat Nov 20 06:06:01 2021]  [<ffffffffc09a8e6c>] cfs_wi_scheduler+0x30c/0x460 [libcfs]
[Sat Nov 20 06:06:01 2021]  [<ffffffff930c6f50>] ? wake_up_atomic_t+0x30/0x30
[Sat Nov 20 06:06:01 2021]  [<ffffffffc09a8b60>] ? cfs_wi_sched_create+0x680/0x680 [libcfs]
[Sat Nov 20 06:06:01 2021]  [<ffffffff930c5e61>] kthread+0xd1/0xe0
[Sat Nov 20 06:06:01 2021]  [<ffffffff930c5d90>] ? insert_kthread_work+0x40/0x40
[Sat Nov 20 06:06:01 2021]  [<ffffffff93795df7>] ret_from_fork_nospec_begin+0x21/0x21
[Sat Nov 20 06:06:01 2021]  [<ffffffff930c5d90>] ? insert_kthread_work+0x40/0x40
[Sat Nov 20 06:06:01 2021] cfs_rh_01       S ffff9484f6652100     0  9209      2 0x00000080 