  Lustre / LU-10778

racer test 1 hangs in racer process cleanup


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.4, Lustre 2.13.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      racer test_1 hangs with very little information about what is hung. In the suite_log, the only output that looks different from a successful run is:

      layout: raid0 raid0
      layout: raid0 raid0
      racer cleanup
        Trace dump:
        = ./file_create.sh:1:main()
      : FAIL: test-framework exiting on error
      racer cleanup
      sleeping 5 sec ...
      sleeping 5 sec ...
      there should be NO racer processes:
      USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
      Filesystem            1K-blocks   Used Available Use% Mounted on
      10.2.4.96@tcp:/lustre  75566092 298872  71166892   1% /mnt/lustre2
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      there should be NO racer processes:
      USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
      Filesystem            1K-blocks   Used Available Use% Mounted on
      10.2.4.96@tcp:/lustre  75566092 298872  71166892   1% /mnt/lustre2
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
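
      A rough sketch of reproducing the run above by hand, assuming an already-mounted test filesystem and the installed test scripts (the exact configuration of the failed session is in the logs linked below; DURATION and MOUNT2 match the output above, everything else is illustrative):

        # racer.sh (the test-framework wrapper) runs test_1, which drives
        # /usr/lib64/lustre/tests/racer/racer.sh on each client mount.
        cd /usr/lib64/lustre/tests
        DURATION=300 MOUNT2=/mnt/lustre2 bash racer.sh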
      

      Comparing the output above with racer_cleanup() in tests/racer/racer.sh:

        racer_cleanup()
        {
                echo "racer cleanup"
                for P in $RACER_PROGS; do
                        killall -q $P.sh
                done
                trap 0

                local TOT_WAIT=0
                local MAX_WAIT=$DURATION
                local SHORT_WAIT=5

                local rc
                while [[ $TOT_WAIT -le $MAX_WAIT ]]; do
                        rc=0
                        echo sleeping $SHORT_WAIT sec ...
                        sleep $SHORT_WAIT
                        # this only checks whether processes exist
                        for P in $RACER_PROGS; do
                                killall -0 $P.sh
                                [[ $? -eq 0 ]] && (( rc+=1 ))
                        done
                        if [[ $rc -eq 0 ]]; then
                                echo there should be NO racer processes:
                                ps uww -C "${RACER_PROGS// /,}"
                                return 0
                        fi
                        echo -n "Waited $(( TOT_WAIT + SHORT_WAIT)), rc=$rc "
                        (( SHORT_WAIT+=SHORT_WAIT ))
                        (( TOT_WAIT+=SHORT_WAIT ))
                done
                ps uww -C "${RACER_PROGS// /,}"
                return 1
        }
      

      We see the output “there should be NO racer processes:”, but we don’t see “Waited …”. Could the test be hung in cleaning up processes?
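      killall -0 only tests whether a process with the given name still exists; it says nothing about where a survivor is blocked. If this reproduces, a diagnostic along the following lines could be dropped into racer_cleanup() to show which racer script is still alive and what it is waiting on (a sketch only, not part of racer.sh; it assumes root and the same $RACER_PROGS list the script already uses):

        # For each racer helper still running, print its PID, its scheduler
        # state (S/D/R, field 3 of /proc/<pid>/stat) and its kernel stack.
        for P in $RACER_PROGS; do
                for pid in $(pgrep -f "$P.sh"); do
                        echo "== $P.sh pid $pid state" \
                                "$(awk '{print $3}' /proc/$pid/stat 2>/dev/null)"
                        cat /proc/$pid/stack 2>/dev/null
                done
        done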

      There are stack traces in the console messages for each of the nodes/VMs, with Lustre processes listed in some of the traces. For example, from one of the nodes:

       [ 9869.370497] SysRq : Changing Loglevel
      [ 9869.370996] Loglevel set to 8
      [ 9869.860527] SysRq : Show State
      [ 9869.860942]   task                        PC stack   pid father
      [ 9869.861552] systemd         S ffff88007c140000     0     1      0 0x00000000
      [ 9869.862317] Call Trace:
      [ 9869.862579]  [<ffffffff816ab8a9>] schedule+0x29/0x70
      [ 9869.863062]  [<ffffffff816aa9bd>] schedule_hrtimeout_range_clock+0x12d/0x150
      [ 9869.863824]  [<ffffffff8124da09>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
      [ 9869.864588]  [<ffffffff816aa9f3>] schedule_hrtimeout_range+0x13/0x20
      [ 9869.865246]  [<ffffffff8124dc9e>] ep_poll+0x23e/0x360
      [ 9869.865790]  [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
      [ 9869.866337]  [<ffffffff8124f12d>] SyS_epoll_wait+0xed/0x120
      [ 9869.866933]  [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
      [ 9869.867586]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
      [ 9869.868228]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
      …
      [ 9871.032843] Call Trace:
      [ 9871.033091]  [<ffffffff816ab8a9>] schedule+0x29/0x70
      [ 9871.033588]  [<ffffffffc0691884>] cfs_wi_scheduler+0x304/0x450 [libcfs]
      [ 9871.034328]  [<ffffffff810b3690>] ? wake_up_atomic_t+0x30/0x30
      [ 9871.034918]  [<ffffffffc0691580>] ? cfs_wi_sched_create+0x680/0x680 [libcfs]
      [ 9871.035593]  [<ffffffff810b270f>] kthread+0xcf/0xe0
      [ 9871.036123]  [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
      [ 9871.036728]  [<ffffffff816b8798>] ret_from_fork+0x58/0x90
      [ 9871.037314]  [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
      [ 9871.037927] cfs_rh_01       S ffff880079f50000     0 13072      2 0x00000080
      …

      However, no process in these traces is clearly hung.
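
      If console access is available the next time this hangs, narrowing the "Show State" dump down to blocked tasks should make a culprit easier to spot; a sketch of what could be run on each node (plain ps and SysRq, nothing Lustre-specific):

        # List only uninterruptible (D state) tasks with their kernel wait channel.
        ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'

        # SysRq "w" dumps just the blocked tasks to the console/dmesg, rather
        # than the full task list that SysRq "t" (Show State above) produces.
        echo w > /proc/sysrq-trigger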

      Logs for these hangs are at
      https://testing.whamcloud.com/test_sets/7ffca4b0-200b-11e8-9ec4-52540065bddc

            People

              Assignee: WC Triage
              Reporter: James Nunez (Inactive)
