[LU-11540] racer test 1 fails with 'test_1 failed with 1' Created: 17/Oct/18  Updated: 10/Apr/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.5, Lustre 2.12.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: sles12
Environment:

SLES12 SP3 server and clients


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

racer test_1 fails. Looking at the client test_log at https://testing.whamcloud.com/test_sets/fb82b210-cf05-11e8-9238-52540065bddc ,
we see that one of the racer jobs returned 254:

pid=8389 rc=0
pid=8390 rc=0
pid=8392 rc=254
pid=8395 rc=0
pid=8396 rc=0
pid=8398 rc=0
pid=8400 rc=0
pid=8404 rc=0
 racer test_1: @@@@@@ FAIL: test_1 failed with 1 

Looking at the racer test_1 code:

  90         local rpids=""
  91         for rdir in $RDIRS; do
  92                 do_nodes $clients "DURATION=$DURATION \
  94                         MDSCOUNT=$MDSCOUNT OSTCOUNT=$OSTCOUNT \
  94                         RACER_ENABLE_REMOTE_DIRS=$RACER_ENABLE_REMOTE_DIRS \
  95                         RACER_ENABLE_STRIPED_DIRS=$RACER_ENABLE_STRIPED_DIRS \
  96                         RACER_ENABLE_MIGRATION=$RACER_ENABLE_MIGRATION \
  97                         RACER_ENABLE_PFL=$RACER_ENABLE_PFL \
  98                         RACER_ENABLE_DOM=$RACER_ENABLE_DOM \
  99                         RACER_ENABLE_FLR=$RACER_ENABLE_FLR \
 100                         LFS=$LFS \
 101                         $racer $rdir $NUM_RACER_THREADS" &
 102                 pid=$!
 103                 rpids="$rpids $pid"
 104         done
 105 
…
 118 
 119         echo racers pids: $rpids
 120         for pid in $rpids; do
 121                 wait $pid
 122                 rc=$?
 123                 echo "pid=$pid rc=$rc"
 124                 if [ $rc != 0 ]; then
 125                     rrc=$((rrc + 1))
 126                 fi
 127         done
 128 
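The failure path is the aggregation loop above: each background racer's exit status is collected with `wait`, and every non-zero status bumps `rrc`, which test_1 then reports as its failure count. A minimal stand-alone sketch of that pattern (using `false` as a stand-in for the one racer job that returned 254):

```shell
# Reproduce racer test_1's wait/rc aggregation in isolation: launch
# background jobs, collect each exit status with wait, and count the
# failures in rrc. 'false' plays the role of a failing racer instance.
rpids=""
for cmd in true true false true; do
	$cmd &
	rpids="$rpids $!"
done

rrc=0
for pid in $rpids; do
	wait $pid
	rc=$?
	echo "pid=$pid rc=$rc"
	if [ $rc != 0 ]; then
		rrc=$((rrc + 1))
	fi
done
echo "rrc=$rrc"
```

With one failing job this prints `rrc=1`, matching the "test_1 failed with 1" message: the test's return value is the number of racer jobs that exited non-zero, not any particular job's rc.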

Looking at both client console logs, we see a fork rejected by the cgroup pids controller and systemd-coredump problems:

[42276.361879] cgroup: fork rejected by pids controller in /system.slice/xinetd.service
[42311.453149] LustreError: 28228:0:(namei.c:87:ll_set_inode()) Can not initialize inode [0x200000406:0x45:0x0] without object type: valid = 0x100000001
[42311.453157] LustreError: 28228:0:(llite_lib.c:2407:ll_prep_inode()) new_inode -fatal: rc -12
[42352.382819] Lustre: 19021:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539417453/real 1539417453]  req@ffff8800621803c0 x1614195941033888/t0(0) o36->lustre-MDT0000-mdc-ffff88007b01c800@10.2.8.135@tcp:12/10 lens 488/4528 e 0 to 1 dl 1539417497 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[42352.382831] Lustre: 19021:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[42352.382849] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection to lustre-MDT0000 (at 10.2.8.135@tcp) was lost; in progress operations using this service will wait for recovery to complete
[42352.389206] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection restored to 10.2.8.135@tcp (at 10.2.8.135@tcp)
[42352.389212] Lustre: Skipped 1 previous similar message
[42649.104856] Lustre: 7938:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539417749/real 1539417749]  req@ffff88004543c3c0 x1614195947256656/t0(0) o36->lustre-MDT0000-mdc-ffff88007b01c800@10.2.8.135@tcp:12/10 lens 488/4528 e 0 to 1 dl 1539417793 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[42649.104876] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection to lustre-MDT0000 (at 10.2.8.135@tcp) was lost; in progress operations using this service will wait for recovery to complete
[42649.113726] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection restored to 10.2.8.135@tcp (at 10.2.8.135@tcp)
[42718.880275] 9[21199]: segfault at 0 ip           (null) sp 00007ffe335fa0d8 error 14 in 9[400000+7000]
[42718.902756] systemd-coredump[21474]: Not enough arguments passed from kernel (0, expected 6).
[42879.211671] 19[24911]: segfault at 8 ip 00007f777e099b50 sp 00007fffd06ec020 error 4 in ld-2.22.so[7f777e08d000+21000]
[42879.251815] systemd-coredump[24974]: Not enough arguments passed from kernel (0, expected 6).
[42880.639477] 19[26804]: segfault at 8 ip 00007f8cfa368418 sp 00007ffffd6e0ae0 error 4 in ld-2.22.so[7f8cfa35d000+21000]
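The "fork rejected by pids controller" line indicates that xinetd.service hit the task limit of its cgroup. On a cgroup-v1 layout, which is typical for SLES12 SP3, the limit can be inspected as below; the paths are the conventional ones, not taken from these logs, and the hierarchy may differ on other systems:

```shell
# Inspect the pids-controller limits for xinetd.service (cgroup v1
# layout; path is an assumption, adjust for the system under test).
cg=/sys/fs/cgroup/pids/system.slice/xinetd.service
if [ -d "$cg" ]; then
	cat "$cg/pids.max"      # maximum number of tasks allowed
	cat "$cg/pids.current"  # tasks currently charged to the cgroup
else
	echo "pids cgroup for xinetd.service not present on this host"
fi
```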

So far, this has only been seen when testing SLES12 SP3 servers and clients.

Logs for more failed racer test suites are at
https://testing.whamcloud.com/test_sets/2cfb6602-cbaf-11e8-b589-52540065bddc
https://testing.whamcloud.com/test_sets/2f5b1916-c609-11e8-b143-52540065bddc
https://testing.whamcloud.com/test_sets/fb82b210-cf05-11e8-9238-52540065bddc

Although the following session contains additional, unrelated failures, it looks like we’ve also seen this issue on the b2_10 branch:
https://testing.whamcloud.com/test_sessions/9e10fa14-a8d5-4e71-853f-3a4a653c3b52


Generated at Sat Feb 10 02:44:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.