Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.8.0, Lustre 2.10.0
-
None
-
autotest
-
3
Description
Subtest 30 of sanity test 154g fails while removing the links that the test created. Logs are at https://testing.hpdd.intel.com/test_sets/9608c94e-7c22-11e5-9851-5254006e85c2
From the test_log:
Finishing test test30 at 1445869186
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0330': Input/output error
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0329': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0254': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0678': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0986': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0309': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0608': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0286': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0479': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0231': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0798': Cannot send after transport endpoint shutdown
rm: cannot remove `/mnt/lustre/d154g.sanity/llapi_fid_test_name_9585766/link0824': Cannot send after transport endpoint shutdown
llapi_fid_test: llapi_fid_test.c:98: cleanup: assertion 'WEXITSTATUS(rc) == 0' failed: rm command returned 1
sanity test_154g: @@@@@@ FAIL: test_154g failed with 1
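To see at a glance which errors rm reported, the failures in a test_log excerpt like the one above can be tallied by message. The following is only a sketch, assuming the log text is already available as a string. The function name `tally_rm_errors` and the regular expression are illustrative and are not part of the Lustre test framework.

```python
import re
from collections import Counter

# Matches GNU rm failures of the form:  rm: cannot remove `PATH': MESSAGE
# The message ends at the end of the line, or just before the next
# "rm: " / "llapi_fid_test: " token when the log has been flattened
# onto a single line.
RM_FAILURE = re.compile(
    r"rm: cannot remove [`'](?P<path>[^']+)': "
    r"(?P<msg>.*?)(?=\s+rm: |\s+llapi_fid_test: |$)",
    re.MULTILINE,
)


def tally_rm_errors(log_text):
    """Return a Counter mapping each rm error message to its count."""
    return Counter(m.group("msg") for m in RM_FAILURE.finditer(log_text))


if __name__ == "__main__":
    # Shortened sample in the same shape as the excerpt above.
    sample = (
        "rm: cannot remove `/mnt/lustre/d/link0330': Input/output error\n"
        "rm: cannot remove `/mnt/lustre/d/link0329': "
        "Cannot send after transport endpoint shutdown\n"
    )
    for message, count in tally_rm_errors(sample).most_common():
        print(count, message)
```

The lookahead lets the same pattern handle both one-failure-per-line logs and logs where the line breaks were lost.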
From the client console logs, the client is having connection problems:
14:19:55:LustreError: 11-0: lustre-MDT0000-mdc-ffff880077e11c00: operation ldlm_enqueue to node 10.1.5.239@tcp failed: rc = -107
14:19:55:Lustre: lustre-MDT0000-mdc-ffff880077e11c00: Connection to lustre-MDT0000 (at 10.1.5.239@tcp) was lost; in progress operations using this service will wait for recovery to complete
14:19:55:LustreError: 167-0: lustre-MDT0000-mdc-ffff880077e11c00: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
14:19:55:LustreError: 23082:0:(mdc_locks.c:1176:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -5
14:19:55:LustreError: 23082:0:(mdc_locks.c:1176:mdc_intent_getattr_async_interpret()) Skipped 4 previous similar messages
14:19:55:Lustre: lustre-MDT0000-mdc-ffff880077e11c00: Connection restored to 10.1.5.239@tcp (at 10.1.5.239@tcp)
14:19:55:Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_154g: @@@@@@ FAIL: test_154g failed with 1
14:19:55:Lustre: DEBUG MARKER: sanity test_154g: @@@@@@ FAIL: test_154g failed with 1
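The negative return codes in the console log are negated Linux errno values: -107 is ENOTCONN and -5 is EIO, while the "Cannot send after transport endpoint shutdown" text that rm printed corresponds to ESHUTDOWN (108). A small sketch for decoding them follows, assuming the standard Linux errno numbering. The table is hard-coded so the result does not depend on the host platform, and the helper name `decode_rc` is hypothetical.

```python
import re

# Linux errno values seen in (or implied by) the excerpts above.
# Hard-coded on purpose: errno numbering differs between platforms.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
}

# A negative integer at the very end of a console log line,
# e.g. "... failed: rc = -107" or "ldlm_cli_enqueue_fini: -5".
TRAILING_RC = re.compile(r"-(\d+)\s*$")


def decode_rc(line):
    """Return (code, name, text) for a trailing negative rc, else None."""
    match = TRAILING_RC.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", ""))
    return code, name, text


if __name__ == "__main__":
    print(decode_rc("operation ldlm_enqueue to node 10.1.5.239@tcp failed: rc = -107"))
```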
We’ve seen this failure a couple of times this month. Logs are at https://testing.hpdd.intel.com/test_sets/8b07cd46-70a2-11e5-9bcc-5254006e85c2 and https://testing.hpdd.intel.com/test_sets/957630d2-75a8-11e5-bac5-5254006e85c2. In the last client console log, we see an additional error message about a nonzero refcount:
09:23:05:LustreError: 11-0: lustre-MDT0000-mdc-ffff88007daeb800: operation ldlm_enqueue to node 10.1.4.105@tcp failed: rc = -107
09:23:05:Lustre: lustre-MDT0000-mdc-ffff88007daeb800: Connection to lustre-MDT0000 (at 10.1.4.105@tcp) was lost; in progress operations using this service will wait for recovery to complete
09:23:05:LustreError: 167-0: lustre-MDT0000-mdc-ffff88007daeb800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
09:23:05:LustreError: 23311:0:(mdc_locks.c:1176:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -5
09:23:05:LustreError: 12432:0:(ldlm_resource.c:887:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff88007daeb800: namespace resource [0x200004282:0x82c:0x0].0x0 (ffff88007c0a72c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
09:23:05:LustreError: 12432:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x200004282:0x82c:0x0].0x0 (ffff88007c0a72c0) refcount = 2
09:23:05:Lustre: lustre-MDT0000-mdc-ffff88007daeb800: Connection restored to 10.1.4.105@tcp (at 10.1.4.105@tcp)
09:23:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_154g: @@@@@@ FAIL: test_154g failed with 1