[LU-12087] sanity-scrub test 10a fails with “Fail to cleanup the env!” Created: 19/Mar/19  Updated: 01/May/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.10.6, Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: ppc
Environment:

ppc64 clients


Severity: 3

Description

sanity-scrub test_10a fails for ppc64 with “Fail to cleanup the env!”

Looking at a recent failure, https://testing.whamcloud.com/test_sets/2b0db972-4859-11e9-b98a-52540065bddc, we see that the cleanup step cannot remove a directory on the Lustre file system left over from a previous sanity-scrub test. From the suite_log, we see:

rm: cannot remove '/mnt/lustre/d9.sanity-scrub/mds1': Directory not empty
 sanity-scrub test_10a: @@@@@@ FAIL: Fail to cleanup the env! 
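
Note that 'rm -r' typically only prints "Directory not empty" when readdir() reported the directory as empty (or rm has already unlinked everything readdir returned) and the final rmdir() still fails with ENOTEMPTY, i.e. the client's view of the directory is missing entries that the MDS still has. A minimal diagnostic sketch, assuming the mount point and directory name from the log above, would be:

# Count what readdir returns on the client (excluding . and ..):
ls -f /mnt/lustre/d9.sanity-scrub/mds1 | grep -cv '^\.\.\?$'
# If the count is 0 but this still fails with "Directory not empty",
# the client-side readdir is dropping entries the MDS knows about:
rmdir /mnt/lustre/d9.sanity-scrub/mds1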

Looking at the OSS (vm1) console log, we see a Lustre error during test 9:

[ 1095.831214] Lustre: DEBUG MARKER: trevis-26vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 1104.386561] LustreError: 11824:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-lwp-OST0000: namespace resource [0x200000006:0x1020000:0x0].0x0 (ffff8a8765dfe600) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 1104.388585] LustreError: 11824:0:(ldlm_resource.c:1146:ldlm_resource_complain()) Skipped 1 previous similar message
[ 1131.369609] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null

On the console log for client 2 (vm9), we see these messages:

[ 1168.406302] Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
[ 1174.948324] Lustre: 3119:0:(mdc_request.c:1504:mdc_read_page()) Page-wide hash collision: 0xfeffffffffffffff
[ 1174.948439] Lustre: 3119:0:(mdc_request.c:1504:mdc_read_page()) Skipped 54 previous similar messages
[ 1176.178907] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-scrub test_10a: @@@@@@ FAIL: Fail to cleanup the env! 
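
One observation on the hash value itself (an assumption worth verifying, not something stated in the logs): if MDS_DIR_END_OFF is still defined as 0xfffffffffffffffeULL in lustre_idl.h, then the colliding hash 0xfeffffffffffffff reported by mdc_read_page() above is exactly that sentinel with its bytes swapped, which would point at a byte-order problem in the readdir path on big-endian ppc64 clients. The swap is easy to check from a shell on any host:

# Does bswap64(MDS_DIR_END_OFF) match the hash in the log? (value assumed)
python3 -c 'import struct
v = 0xfffffffffffffffe
print(hex(struct.unpack("<Q", struct.pack(">Q", v))[0]))'
# prints 0xfeffffffffffffff, matching the mdc_read_page() message above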

We see this issue only for ppc64 client testing. Note: although this test has failed with the same message for non-ppc64 clients, in those cases several or most of the tests prior to 10a also fail to clean up the environment, not just test 10a.

In some cases, we don't see any of the above error messages. For example, in a recent 2.10.7 RC1 failure at https://testing.whamcloud.com/test_sets/3f1ccaa6-4332-11e9-92fe-52540065bddc, none of these messages appear in test 9 or test 10.

Other failures for sanity-scrub test 10a are at:
https://testing.whamcloud.com/test_sets/4e833ba2-b72c-11e8-a7de-52540065bddc
https://testing.whamcloud.com/test_sets/d22cb2d0-e288-11e8-bfe1-52540065bddc
https://testing.whamcloud.com/test_sets/660c8ec2-2734-11e9-b97f-52540065bddc



Comments
Comment by James Nunez (Inactive) [ 30/Apr/19 ]

We see sanityn test 37 fail with

== sanityn test 37: check i_size is not updated for directory on close (bug 18695) =================== 04:35:07 (1555994107)
multiop /mnt/lustre/d37.sanityn vD_c
TMPPIPE=/tmp/multiop_open_wait_pipe.9642
total: 10000 create in 6.51 seconds: 1536.03 ops/second
 sanityn test_37: @@@@@@ FAIL: 3523 != 10000 truncated directory? 
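
The mismatch means readdir on the client returned far fewer entries than were created. Roughly, the failing check amounts to the following (a sketch with assumed paths, not the exact test code; createmany is the lustre-tests utility whose "total: 10000 create" output appears above):

mkdir -p /mnt/lustre/d37.sanityn
createmany -o /mnt/lustre/d37.sanityn/f 10000   # creates f0 .. f9999
# Count what readdir returns, excluding . and ..:
count=$(ls -f /mnt/lustre/d37.sanityn | grep -cv '^\.\.\?$')
[ "$count" -eq 10000 ] || echo "FAIL: $count != 10000 truncated directory?"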

We see the 'hash collision' message on the client 1 console log:

[11634.540885] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 37: check i_size is not updated for directory on close \(bug 18695\) =================== 04:35:07 \(1555994107\)
[11634.743158] Lustre: DEBUG MARKER: == sanityn test 37: check i_size is not updated for directory on close (bug 18695) =================== 04:35:07 (1555994107)
[11641.370854] Lustre: 31253:0:(mdc_request.c:1519:mdc_read_page()) Page-wide hash collision: 0xfeffffffffffffff
[11641.370970] Lustre: 31253:0:(mdc_request.c:1519:mdc_read_page()) Skipped 1 previous similar message
[11641.536138] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_37: @@@@@@ FAIL: 3523 != 10000 truncated directory? 
[11641.718318] Lustre: DEBUG MARKER: sanityn test_37: @@@@@@ FAIL: 3523 != 10000 truncated directory?

See the following for logs:
https://testing.whamcloud.com/test_sets/7d68fa34-668f-11e9-8bb1-52540065bddc
