Details
Bug
Resolution: Cannot Reproduce
Minor
None
Lustre 2.12.0, Lustre 2.10.5, Lustre 2.12.1
SLES12 SP3 server and clients
3
9223372036854775807
Description
racer test_1 fails. The client test_log at https://testing.whamcloud.com/test_sets/fb82b210-cf05-11e8-9238-52540065bddc
shows that one of the racer jobs returned 254:
pid=8389 rc=0
pid=8390 rc=0
pid=8392 rc=254
pid=8395 rc=0
pid=8396 rc=0
pid=8398 rc=0
pid=8400 rc=0
pid=8404 rc=0
racer test_1: @@@@@@ FAIL: test_1 failed with 1
Looking at the racer test 1 code,
	local rpids=""
	for rdir in $RDIRS; do
		do_nodes $clients "DURATION=$DURATION \
			MDSCOUNT=$MDSCOUNT OSTCOUNT=$OSTCOUNT\
			RACER_ENABLE_REMOTE_DIRS=$RACER_ENABLE_REMOTE_DIRS \
			RACER_ENABLE_STRIPED_DIRS=$RACER_ENABLE_STRIPED_DIRS \
			RACER_ENABLE_MIGRATION=$RACER_ENABLE_MIGRATION \
			RACER_ENABLE_PFL=$RACER_ENABLE_PFL \
			RACER_ENABLE_DOM=$RACER_ENABLE_DOM \
			RACER_ENABLE_FLR=$RACER_ENABLE_FLR \
			LFS=$LFS \
			$racer $rdir $NUM_RACER_THREADS" &
		pid=$!
		rpids="$rpids $pid"
	done
	…

	echo racers pids: $rpids
	for pid in $rpids; do
		wait $pid
		rc=$?
		echo "pid=$pid rc=$rc"
		if [ $rc != 0 ]; then
			rrc=$((rrc + 1))
		fi
	done
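To illustrate how a wait loop of this shape turns a single non-zero exit status into a failure count of 1, here is a minimal, self-contained sketch. The function name count_failed_jobs and the stand-in commands are hypothetical and not part of the Lustre test framework; the real racer jobs are started through do_nodes on the client nodes.

```shell
#!/bin/bash
# Hypothetical stand-in for the racer wait loop: run each argument as a
# background job, wait for every pid, and count the non-zero exit statuses.
count_failed_jobs() {
	local pids="" pid rc rrc=0 cmd

	for cmd in "$@"; do
		bash -c "$cmd" &
		pids="$pids $!"
	done

	for pid in $pids; do
		rc=0
		wait "$pid" || rc=$?
		# Per-job status goes to stderr so stdout carries only the count.
		echo "pid=$pid rc=$rc" >&2
		if [ "$rc" -ne 0 ]; then
			rrc=$((rrc + 1))
		fi
	done

	echo "$rrc"
}

# Three stand-in jobs; the second exits with 254, as pid 8392 did.
count_failed_jobs "exit 0" "exit 254" "exit 0"
```

The loop counts failed jobs rather than propagating their exit status, so one job returning 254 yields a count of 1, which is consistent with the "failed with 1" message in the test_log.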
Looking at both client console logs, we see a rejected fork and problems with systemd-coredump:
[42276.361879] cgroup: fork rejected by pids controller in /system.slice/xinetd.service
[42311.453149] LustreError: 28228:0:(namei.c:87:ll_set_inode()) Can not initialize inode [0x200000406:0x45:0x0] without object type: valid = 0x100000001
[42311.453157] LustreError: 28228:0:(llite_lib.c:2407:ll_prep_inode()) new_inode -fatal: rc -12
[42352.382819] Lustre: 19021:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539417453/real 1539417453] req@ffff8800621803c0 x1614195941033888/t0(0) o36->lustre-MDT0000-mdc-ffff88007b01c800@10.2.8.135@tcp:12/10 lens 488/4528 e 0 to 1 dl 1539417497 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[42352.382831] Lustre: 19021:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[42352.382849] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection to lustre-MDT0000 (at 10.2.8.135@tcp) was lost; in progress operations using this service will wait for recovery to complete
[42352.389206] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection restored to 10.2.8.135@tcp (at 10.2.8.135@tcp)
[42352.389212] Lustre: Skipped 1 previous similar message
[42649.104856] Lustre: 7938:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1539417749/real 1539417749] req@ffff88004543c3c0 x1614195947256656/t0(0) o36->lustre-MDT0000-mdc-ffff88007b01c800@10.2.8.135@tcp:12/10 lens 488/4528 e 0 to 1 dl 1539417793 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[42649.104876] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection to lustre-MDT0000 (at 10.2.8.135@tcp) was lost; in progress operations using this service will wait for recovery to complete
[42649.113726] Lustre: lustre-MDT0000-mdc-ffff88007b01c800: Connection restored to 10.2.8.135@tcp (at 10.2.8.135@tcp)
[42718.880275] 9[21199]: segfault at 0 ip (null) sp 00007ffe335fa0d8 error 14 in 9[400000+7000]
[42718.902756] systemd-coredump[21474]: Not enough arguments passed from kernel (0, expected 6).
[42879.211671] 19[24911]: segfault at 8 ip 00007f777e099b50 sp 00007fffd06ec020 error 4 in ld-2.22.so[7f777e08d000+21000]
[42879.251815] systemd-coredump[24974]: Not enough arguments passed from kernel (0, expected 6).
[42880.639477] 19[26804]: segfault at 8 ip 00007f8cfa368418 sp 00007ffffd6e0ae0 error 4 in ld-2.22.so[7f8cfa35d000+21000]
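When comparing console logs across the failed runs, it may help to count each kind of event quoted above. The following is a rough triage sketch, not part of the Lustre test suite; the function name summarize_console_log and the grep patterns are assumptions based only on the messages in this ticket.

```shell
#!/bin/bash
# Hypothetical triage helper: count the event types quoted in this ticket
# in a saved console log. The patterns are taken from the messages above.
summarize_console_log() {
	local log=$1

	# grep -c prints the number of matching lines (0 if none); "|| true"
	# keeps the zero-match exit status from aborting under "set -e".
	echo "fork_rejected=$(grep -c 'fork rejected by pids controller' "$log" || true)"
	echo "segfaults=$(grep -c 'segfault at' "$log" || true)"
	echo "coredump_errors=$(grep -c 'systemd-coredump.*Not enough arguments' "$log" || true)"
	echo "rpc_timeouts=$(grep -c 'ptlrpc_expire_one_request.*timed out' "$log" || true)"
}
```

It would be called with the path to a saved console log, for example summarize_console_log client-console.log (the file name here is illustrative). The counts only show which messages appear in a given run; they do not establish which event caused the racer job to return 254.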
So far, this has only been seen when testing SLES12 SP3 servers and clients.
Logs for more failed racer test suites are at
https://testing.whamcloud.com/test_sets/2cfb6602-cbaf-11e8-b589-52540065bddc
https://testing.whamcloud.com/test_sets/2f5b1916-c609-11e8-b143-52540065bddc
https://testing.whamcloud.com/test_sets/fb82b210-cf05-11e8-9238-52540065bddc
Although the following test session has more failures, it looks like this has also been seen on the b2_10 branch:
https://testing.whamcloud.com/test_sessions/9e10fa14-a8d5-4e71-853f-3a4a653c3b52