[LU-7344] sanity test_154g test30 fail on cleanup: FAIL: test_154g failed with 1 Created: 27/Oct/15  Updated: 11/Apr/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

autotest


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test 154g subtest 30 fails while removing the links that the test created. Logs are at https://testing.hpdd.intel.com/test_sets/9608c94e-7c22-11e5-9851-5254006e85c2

From the test_log:

Finishing test test30 at 1445869186
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0330': Input/output error
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0329': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0254': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0678': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0986': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0309': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0608': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0286': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0479': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0231': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0798': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0824': Cannot send after transport endpoint shutdown
llapi_fid_test: llapi_fid_test.c:98: cleanup: assertion 'WEXITSTATUS(rc) == 0' failed: rm command returned 1
 sanity test_154g: @@@@@@ FAIL: test_154g failed with 1 
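
For reference, the assertion quoted above is the test's cleanup check on the exit status of the rm it spawns. The sketch below only illustrates that pattern (rm run via system() and WEXITSTATUS() compared to 0); it is not the actual llapi_fid_test.c source, which uses its own ASSERTF-style macro, and the main() here just reuses the directory path from the log:

#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Minimal sketch of the cleanup pattern implied by the assertion message;
 * NOT the real llapi_fid_test.c code. */
static void cleanup(const char *testdir)
{
	char cmd[PATH_MAX + 16];
	int rc;

	snprintf(cmd, sizeof(cmd), "rm -rf '%s'", testdir);
	rc = system(cmd);
	/* This is the check that fails: rm exits with status 1 because some
	 * unlinks return EIO/ESHUTDOWN after the client has been evicted. */
	assert(WEXITSTATUS(rc) == 0);
}

int main(void)
{
	cleanup("/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766");
	return 0;
}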

The client console log shows the client losing its connection to the MDT and being evicted:

14:19:55:LustreError: 11-0: lustre-MDT0000-mdc-ffff880077e11c00: operation ldlm_enqueue to node 10.1.5.239@tcp failed: rc = -107
14:19:55:Lustre: lustre-MDT0000-mdc-ffff880077e11c00: Connection to lustre-MDT0000 (at 10.1.5.239@tcp) was lost; in progress operations using this service will wait for recovery to complete
14:19:55:LustreError: 167-0: lustre-MDT0000-mdc-ffff880077e11c00: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
14:19:55:LustreError: 23082:0:(mdc_locks.c:1176:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -5
14:19:55:LustreError: 23082:0:(mdc_locks.c:1176:mdc_intent_getattr_async_interpret()) Skipped 4 previous similar messages
14:19:55:Lustre: lustre-MDT0000-mdc-ffff880077e11c00: Connection restored to 10.1.5.239@tcp (at 10.1.5.239@tcp)
14:19:55:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_154g: @@@@@@ FAIL: test_154g failed with 1 
14:19:55:Lustre: DEBUG MARKER: sanity test_154g: @@@@@@ FAIL: test_154g failed with 1
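
The return codes in the console log line up with standard Linux errno values: rc = -107 is ENOTCONN, rc = -5 is EIO ("Input/output error" in the rm output above), and "Cannot send after transport endpoint shutdown" is strerror(ESHUTDOWN). A trivial way to confirm the mapping, for illustration only:

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Print the strerror() text for the errno values seen in the logs. */
int main(void)
{
	int codes[] = { EIO, ENOTCONN, ESHUTDOWN };	/* 5, 107, 108 */
	for (size_t i = 0; i < sizeof(codes) / sizeof(codes[0]); i++)
		printf("%d: %s\n", codes[i], strerror(codes[i]));
	return 0;
}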

We’ve seen this failure a couple of times this month. Logs are at https://testing.hpdd.intel.com/test_sets/8b07cd46-70a2-11e5-9bcc-5254006e85c2 and
https://testing.hpdd.intel.com/test_sets/957630d2-75a8-11e5-bac5-5254006e85c2. In the client console log for the last instance, there is an additional error message about a nonzero refcount:

09:23:05:LustreError: 11-0: lustre-MDT0000-mdc-ffff88007daeb800: operation ldlm_enqueue to node 10.1.4.105@tcp failed: rc = -107
09:23:05:Lustre: lustre-MDT0000-mdc-ffff88007daeb800: Connection to lustre-MDT0000 (at 10.1.4.105@tcp) was lost; in progress operations using this service will wait for recovery to complete
09:23:05:LustreError: 167-0: lustre-MDT0000-mdc-ffff88007daeb800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
09:23:05:LustreError: 23311:0:(mdc_locks.c:1176:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -5
09:23:05:LustreError: 12432:0:(ldlm_resource.c:887:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff88007daeb800: namespace resource [0x200004282:0x82c:0x0].0x0 (ffff88007c0a72c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
09:23:05:LustreError: 12432:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x200004282:0x82c:0x0].0x0 (ffff88007c0a72c0) refcount = 2
09:23:05:Lustre: lustre-MDT0000-mdc-ffff88007daeb800: Connection restored to 10.1.4.105@tcp (at 10.1.4.105@tcp)
09:23:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_154g: @@@@@@ FAIL: test_154g failed with 1 


 Comments   
Comment by Saurabh Tandan (Inactive) [ 24/Dec/15 ]

Another instance was found with the following configuration:
Server: 2.7.1, b2_7_fe/34
Client: master, build #3276, RHEL 6.7
https://testing.hpdd.intel.com/test_sets/610eff92-a602-11e5-a14c-5254006e85c2
