[LU-13232] sanity test 160j fails with 'read changelog failed'
Created: 10/Feb/20  Updated: 20/Feb/20  Resolved: 20/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug
Priority: Minor
Reporter: James Nunez (Inactive)
Assignee: James Nunez (Inactive)
Resolution: Fixed
Votes: 0
Labels: ppc
Environment:

PPC clients


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_160j fails with 'read changelog failed' in 100% of PPC client test runs.

Looking at a recent failure at https://testing.whamcloud.com/test_sets/d3720002-4a27-11ea-b69a-52540065bddc, the actual error is cat failing to read its input:

Registered 1 changelog users: 'cl3'
total: 2 create in 0.00 seconds: 1052.66 ops/second
cat: -: Invalid argument
 sanity test_160j: @@@@@@ FAIL: read changelog failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6121:error()
  = /usr/lib64/lustre/tests/sanity.sh:14350:test_160j()

The code that is failing in sanity test 160j is:

14341         # read changelog
14342         cat <&4 >/dev/null || error "read changelog failed"
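
The failing read can be sketched in isolation as follows. This is a minimal illustration, not the test's actual setup: it assumes fd 4 was opened earlier with exec, and it uses a temporary file (a hypothetical stand-in) where the real test reads a changelog. Because cat reads from redirected stdin, it names its input "-" in any error it prints, which is why the log shows "cat: -: Invalid argument".

```shell
#!/bin/bash
# Stand-in data file (assumption); the real test reads a changelog on fd 4.
tmpfile=$(mktemp)
printf 'record 1\nrecord 2\n' > "$tmpfile"

exec 4< "$tmpfile"      # open fd 4 for reading

# Same shape as the failing line: read everything from fd 4, discard it,
# and record whether the read succeeded.
if cat <&4 > /dev/null; then
    read_status="ok"
else
    read_status="failed"
fi

exec 4<&-               # close fd 4
rm -f "$tmpfile"
echo "read changelog: $read_status"
```

With an ordinary file the read succeeds; on the PPC clients the underlying read fails, so cat exits non-zero and the error() branch is taken.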

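Looking at the client1 (vm12) console log, we see changelog llog processing fail with rc = -22 (-EINVAL), which matches the 'Invalid argument' reported by cat: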

[ 5314.374481] Lustre: DEBUG MARKER: == sanity test 160j: client can be umounted while its chanangelog is being used ===================== 01:24:59 (1581125099)
[ 5314.494530] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre2
[ 5314.506580] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock trevis-10vm12@tcp:/lustre /mnt/lustre2
[ 5314.555637] Lustre: Mounted lustre-client
[ 5315.555507] Lustre: 10940:0:(llog_cat.c:808:llog_cat_process_common()) lustre-MDT0000-mdc-c0000000b5687800: invalid record in catalog [0x5:0x0:0xa]:0: rc = -22
[ 5315.555690] LustreError: 10940:0:(mdc_changelog.c:295:chlg_load()) lustre-MDT0000-mdc-c0000000b5687800: fail to process llog: rc = -22
[ 5315.600825] Lustre: Unmounted lustre-client
[ 5315.777197] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_160j: @@@@@@ FAIL: read changelog failed 

sanity test 160j has failed for PPC clients ever since it landed on 27 Sep 2019.

Logs for more PPC client sanity test 160j failures are at:
https://testing.whamcloud.com/test_sets/717d4832-1dba-11ea-80b4-52540065bddc
https://testing.whamcloud.com/test_sets/5e7bd63a-f7af-11e9-b62b-52540065bddc



 Comments   
Comment by Andreas Dilger [ 12/Feb/20 ]

This looks like it may be the root cause of many later failures. This test unmounts the client, then fails (likely because of unexpected output), and never remounts the client. All of the later failures happen because no Lustre client is mounted.

== sanity test 160j: client can be umounted  while its chanangelog is being used
CMD: trevis-77vm7.trevis.whamcloud.com mount -t lustre -o user_xattr,flock trevis-10vm12@tcp:/lustre /mnt/lustre2
:
cat: -: Invalid argument
sanity test_160j: @@@@@@ FAIL: read changelog failed

Comment by Andreas Dilger [ 12/Feb/20 ]

Looking at the test itself, this is pretty clear:

         # umount the first lustre mount
         umount $MOUNT

The test should register stack_trap calls to undo the various changes it makes (remount the client, unmount client2, close the file descriptors, etc.) rather than doing this manually at the end of the test. I suspect that a simple patch to clean up after this failure will make many of the following failures go away as well.
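
The cleanup pattern suggested above can be sketched as follows. This is only an illustration: demo_stack_trap is a hypothetical, simplified stand-in for the framework's stack_trap (whose real implementation may differ in details), and the echo commands stand in for the actual mount, umount, and fd-closing commands.

```shell
#!/bin/bash
# Hypothetical stand-in for stack_trap: each call prepends a command to the
# EXIT trap, so cleanups run in reverse order of registration.
CLEANUP_STACK=""

demo_stack_trap() {
    CLEANUP_STACK="$1${CLEANUP_STACK:+; $CLEANUP_STACK}"
    trap "$CLEANUP_STACK" EXIT
}

# Simulated test body, run in a subshell so its EXIT trap fires when the
# "test" ends, whether it passes or fails.
demo_test() {
    (
        demo_stack_trap "echo 'remount first client'"
        demo_stack_trap "echo 'unmount second client'"
        demo_stack_trap "echo 'close fd 4'"
        echo "test body fails here"
        exit 1      # simulated failure before any manual cleanup
    )
}

output=$(demo_test)
echo "$output"
```

Even though the simulated test exits early, all three cleanup commands still run, most recently registered first, so later tests would start with the client mounted again.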

Comment by Gerrit Updater [ 12/Feb/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37550
Subject: LU-13232 tests: add stack_trap to clean up snaity 160j
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9bfbb05901029b637a9b12262eeff181b8d70348

Comment by Gerrit Updater [ 20/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37550/
Subject: LU-13232 tests: add stack_trap to clean up sanity 160j
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4891c873184ad2fc3e90abc769456166998cace3

Comment by Peter Jones [ 20/Feb/20 ]

Landed for 2.14

Generated at Sat Feb 10 02:59:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.