Lustre / LU-11540

racer test 1 fails with 'test_1 failed with 1'

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.5, Lustre 2.12.1
    • Environment: SLES12 SP3 server and clients
    • 3
    • 9223372036854775807

    Description

      racer test_1 fails. The client test_log at https://testing.whamcloud.com/test_sets/fb82b210-cf05-11e8-9238-52540065bddc
      shows that one of the racer jobs returned 254:

      pid=8389 rc=0
      pid=8390 rc=0
      pid=8392 rc=254
      pid=8395 rc=0
      pid=8396 rc=0
      pid=8398 rc=0
      pid=8400 rc=0
      pid=8404 rc=0
       racer test_1: @@@@@@ FAIL: test_1 failed with 1 
      

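      The racer test_1 code starts one racer job per directory in the background, then waits on each pid and counts the jobs that exit non-zero: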

        local rpids=""
        for rdir in $RDIRS; do
                do_nodes $clients "DURATION=$DURATION \
                        MDSCOUNT=$MDSCOUNT OSTCOUNT=$OSTCOUNT\
                        RACER_ENABLE_REMOTE_DIRS=$RACER_ENABLE_REMOTE_DIRS \
                        RACER_ENABLE_STRIPED_DIRS=$RACER_ENABLE_STRIPED_DIRS \
                        RACER_ENABLE_MIGRATION=$RACER_ENABLE_MIGRATION \
                        RACER_ENABLE_PFL=$RACER_ENABLE_PFL \
                        RACER_ENABLE_DOM=$RACER_ENABLE_DOM \
                        RACER_ENABLE_FLR=$RACER_ENABLE_FLR \
                        LFS=$LFS \
                        $racer $rdir $NUM_RACER_THREADS" &
                pid=$!
                rpids="$rpids $pid"
        done

      …

        echo racers pids: $rpids
        for pid in $rpids; do
                wait $pid
                rc=$?
                echo "pid=$pid rc=$rc"
                if [ $rc != 0 ]; then
                    rrc=$((rrc + 1))
                fi
        done
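      The wait loop counts failed jobs instead of propagating their exit codes, which would explain why a single job returning 254 is reported as 'failed with 1'. The excerpt does not show how rrc is returned, so that link is an assumption. A minimal standalone sketch of the same pattern, where fake_racer is a hypothetical stand-in for the real racer job:

```shell
#!/bin/bash
# Hypothetical stand-in for a racer job: exits with the status given as $1.
fake_racer() {
	sleep 0.1
	return "$1"
}

# Launch the jobs in the background and record their pids.
rpids=""
for status in 0 0 254 0; do
	fake_racer "$status" &
	rpids="$rpids $!"
done

# Wait on each pid; count the non-zero exits rather than keeping the value.
rrc=0
for pid in $rpids; do
	wait "$pid"
	rc=$?
	echo "pid=$pid rc=$rc"
	if [ "$rc" != 0 ]; then
		rrc=$((rrc + 1))
	fi
done
echo "failed jobs: $rrc"
```

      Here one job exits with 254 and the final count is 1, matching the shape of the output in the test_log.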
      

      The console logs of both clients show a fork rejected by the cgroup pids controller, followed later by segfaults and systemd-coredump errors:

      [42276.361879] cgroup: fork rejected by pids controller in /system.slice/xinetd.service
      [42311.453149] LustreError: 28228:0:(namei.c:87:ll_set_inode()) Can not initialize inode [0x200000406:0x45:0x0] without object type: valid = 0x100000001
      [42311.453157] LustreError: 28228:0:(llite_lib.c:2407:ll_prep_inode()) new_inode -fatal: rc -12
      [42352.382819] Lustre: 19021:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539417453/real 1539417453]  req@ffff8800621803c0 x1614195941033888/t0(0) o36->lustre-MDT0000-mdc-ffff88007b01c800@10.2.8.135@tcp:12/10 lens 488/4528 e 0 to 1 dl 1539417497 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [42352.382831] Lustre: 19021:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [42352.382849] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection to lustre-MDT0000 (at 10.2.8.135@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [42352.389206] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection restored to 10.2.8.135@tcp (at 10.2.8.135@tcp)
      [42352.389212] Lustre: Skipped 1 previous similar message
      [42649.104856] Lustre: 7938:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539417749/real 1539417749]  req@ffff88004543c3c0 x1614195947256656/t0(0) o36->lustre-MDT0000-mdc-ffff88007b01c800@10.2.8.135@tcp:12/10 lens 488/4528 e 0 to 1 dl 1539417793 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [42649.104876] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection to lustre-MDT0000 (at 10.2.8.135@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [42649.113726] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection restored to 10.2.8.135@tcp (at 10.2.8.135@tcp)
      [42718.880275] 9[21199]: segfault at 0 ip           (null) sp 00007ffe335fa0d8 error 14 in 9[400000+7000]
      [42718.902756] systemd-coredump[21474]: Not enough arguments passed from kernel (0, expected 6).
      [42879.211671] 19[24911]: segfault at 8 ip 00007f777e099b50 sp 00007fffd06ec020 error 4 in ld-2.22.so[7f777e08d000+21000]
      [42879.251815] systemd-coredump[24974]: Not enough arguments passed from kernel (0, expected 6).
      [42880.639477] 19[26804]: segfault at 8 ip 00007f8cfa368418 sp 00007ffffd6e0ae0 error 4 in ld-2.22.so[7f8cfa35d000+21000]
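      The "fork rejected by pids controller" message means the named cgroup reached its task limit (pids.max, which systemd sets from the unit's TasksMax). A hedged sketch for checking how close a cgroup is to that limit follows. The pids_usage helper and the CGROUP_DIR variable are hypothetical, and the default path assumes a cgroup v1 layout, which may not match the test nodes:

```shell
#!/bin/bash
# Print task usage for one cgroup directory of the pids controller.
# $1: cgroup directory containing pids.current and pids.max
pids_usage() {
	local dir=$1 current max
	current=$(cat "$dir/pids.current") || return 1
	max=$(cat "$dir/pids.max") || return 1
	if [ "$max" = "max" ]; then
		echo "$dir: $current tasks, no limit"
	else
		echo "$dir: $current of $max tasks, $((max - current)) free"
	fi
}

# Assumed cgroup v1 mount point; adjust for the node being inspected.
cgroup_dir=${CGROUP_DIR:-/sys/fs/cgroup/pids/system.slice/xinetd.service}
if [ -d "$cgroup_dir" ]; then
	pids_usage "$cgroup_dir"
fi
```

      Running this on a client while racer is active would show whether the xinetd.service cgroup is near its limit when the fork rejection appears.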
      

      So far, this has only been seen when testing SLES12 SP3 servers and clients.

      Logs for more failed racer test suites are at:
      https://testing.whamcloud.com/test_sets/2cfb6602-cbaf-11e8-b589-52540065bddc
      https://testing.whamcloud.com/test_sets/2f5b1916-c609-11e8-b143-52540065bddc
      https://testing.whamcloud.com/test_sets/fb82b210-cf05-11e8-9238-52540065bddc

      Although the following test session has additional failures, it looks like we have also seen this on the b2_10 branch:
      https://testing.whamcloud.com/test_sessions/9e10fa14-a8d5-4e71-853f-3a4a653c3b52

            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)