[LU-13634] racer test 1 failed with ‘Illegal instruction (core dumped)’ Created: 04/Jun/20  Updated: 10/Sep/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

ARM clients


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

racer test_1 fails with the error message 'test_1 failed with 8'. Looking at the failure for an ARM client test, at https://testing.whamcloud.com/test_sets/e8f119a9-57ca-45e3-b8f3-246009fe1b75, in the suite_log we see many ‘illegal instructions’ errors like

./file_exec.sh: line 16: 25000 Illegal instruction     (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
racer cleanup
sleeping 5 sec ...
racer cleanup
sleeping 5 sec ...
Waited 5, rc=3 sleeping 10 sec ...
Waited 5, rc=3 sleeping 10 sec ...
Waited 20, rc=3 sleeping 20 sec ...
Waited 20, rc=3 sleeping 20 sec ...
Waited 50, rc=3 sleeping 40 sec ...
Waited 50, rc=3 sleeping 40 sec ...
Waited 110, rc=3 sleeping 80 sec ...
Waited 110, rc=3 sleeping 80 sec ...
Waited 230, rc=3 sleeping 160 sec ...
Waited 230, rc=3 sleeping 160 sec ...
Waited 470, rc=3 sleeping 320 sec ...
Waited 470, rc=3 sleeping 320 sec ...
Waited 950, rc=3 USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem             1K-blocks   Used Available Use% Mounted on
10.2.8.103@tcp:/lustre  15466208 221764  14268332   2% /mnt/lustre
Waited 950, rc=2 USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem             1K-blocks   Used Available Use% Mounted on
10.2.8.103@tcp:/lustre  15466208 221764  14268332   2% /mnt/lustre
Running /usr/lib64/lustre/tests/racer/racer.sh for 900 seconds. CTRL-C to exit
Running /usr/lib64/lustre/tests/racer/racer.sh for 900 seconds. CTRL-C to exit
layout: raid0 raid0 pfl pfl pfl dom dom dom flr flr flr
layout: raid0 raid0 pfl pfl pfl dom dom dom flr flr flr
layout: raid0 raid0 pfl pfl pfl dom dom dom flr flr flr
layout: raid0 raid0 pfl pfl pfl dom dom dom flr flr flr
layout: raid0 raid0 pfl pfl pfl dom dom dom flr flr flr
layout: raid0 raid0 pfl pfl pfl dom dom dom flr flr flr
./file_exec.sh: line 16: 29938 Illegal instruction     (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
racer cleanup
./file_exec.sh: line 16: 26172 Illegal instruction     (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
sleeping 5 sec ...
racer cleanup
…
./file_exec.sh: line 16: 22980 Illegal instruction     (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
./file_exec.sh: line 16:  7946 Terminated              $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
sleeping 5 sec ...
racer cleanup
sleeping 5 sec ...
Waited 5, rc=3 sleeping 10 sec ...
Waited 5, rc=3 sleeping 10 sec ...
Waited 20, rc=3 sleeping 20 sec ...
Waited 20, rc=3 sleeping 20 sec ...
Waited 50, rc=3 sleeping 40 sec ...
Waited 50, rc=3 sleeping 40 sec ...
Waited 110, rc=3 sleeping 80 sec ...
Waited 110, rc=3 sleeping 80 sec ...
Waited 230, rc=3 sleeping 160 sec ...
Waited 230, rc=3 sleeping 160 sec ...
Waited 470, rc=3 sleeping 320 sec ...
Waited 470, rc=3 sleeping 320 sec ...
Waited 950, rc=2 USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem             1K-blocks   Used Available Use% Mounted on
10.2.8.103@tcp:/lustre  15466208 221764  14268332   2% /mnt/lustre2
Waited 950, rc=3 USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Filesystem             1K-blocks   Used Available Use% Mounted on
10.2.8.103@tcp:/lustre  15466208 221764  14268332   2% /mnt/lustre2
pid=14004 rc=1
pid=14006 rc=1
pid=14007 rc=1
 racer test_1: @@@@@@ FAIL: test_1 failed with 8 

We’ve seen these errors in at least three test sessions in the past four months starting with the Lustre version 2.12.4.61 only on the b2_12 branch starting, all during ARM client testing:
https://testing.whamcloud.com/test_sets/849d0827-b6ca-46ec-b0be-4727c4bda504
https://testing.whamcloud.com/test_sets/58789911-1a3e-4c3c-ba7f-d1e7216f0ead

There is at least one earlier failure like this for 2.13.51.37 on 27 JAN 2020:
https://testing.whamcloud.com/test_sets/a554c12e-41ca-11ea-9847-52540065bddc



 Comments   
Comment by Andreas Dilger [ 05/Jun/20 ]

AFAIK, this is fairly normal for racer, as it is doing crazy things to the files. The main goal is not to crash, and to be able to clean up afterward.

We should see if syzbot from the upstream kernel is publicly available and consider running that as well for a whole new level of painful race conditions, if racer isn't failing enough anymore.

Comment by John Hammond [ 05/Jun/20 ]

Two things here:

  1. In file_exec.sh we use egrep to filter out shell error messages matching "Segmentation fault" and "Bus error". We should also filter "Illegal instruction".
  2. I do not think that the illegal instruction messages are necessarily connected to the test failures. Instead I suspect that the test failures are because racer/racer.sh cannot clean up all of its children.
Comment by Gerrit Updater [ 10/Sep/20 ]

Vitaly Fertman (vitaly.fertman@hpe.com) uploaded a new patch: https://review.whamcloud.com/39878
Subject: LU-13634 ldlm: extra checks for DOM locks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 404f823c22d0c23ddc8d740bda69a0a8e9505d83

Generated at Sat Feb 10 03:02:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.