[LU-10326] sanity test 60a times out on 'umount -d /mnt/lustre-mds1' Created: 04/Dec/17  Updated: 19/Mar/19  Resolved: 04/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.2, Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: ubuntu
Environment:

Ubuntu Lustre clients


Issue Links:
Duplicate
is duplicated by LU-10320 sanity test 17g fails with ‘FAIL: <ma... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test 60a hangs on unmount of the MDS for Ubuntu clients only. The last messages seen in the client test log are:

NOW reload debugging syms..
CMD: trevis-18vm4 /usr/sbin/lctl dk
CMD: trevis-18vm4 which llog_reader 2> /dev/null
CMD: trevis-18vm4 grep -c /mnt/lustre-mds1' ' /proc/mounts
Stopping /mnt/lustre-mds1 (opts:) on trevis-18vm4
CMD: trevis-18vm4 umount -d /mnt/lustre-mds1
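
For reference, a rough sketch of what could be gathered on the MDS node while the umount is hung; the node name and lctl path follow the log above, but the rest (sysrq, /proc/<pid>/stack) is an assumption that depends on the kernel configuration:

# hedged sketch: collect state on the MDS while 'umount -d /mnt/lustre-mds1' is stuck
dmesg | tail -n 200                        # recent console messages (ldlm/obd errors)
/usr/sbin/lctl dk /tmp/lustre-debug.log    # dump the Lustre kernel debug buffer to a file
pid=$(pidof umount)                        # PID of the hung umount, if it is still running
[ -n "$pid" ] && cat /proc/"$pid"/stack    # kernel stack of the blocked unmount thread
echo w > /proc/sysrq-trigger               # log all uninterruptible (D-state) tasks to dmesg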

Looking at the dmesg log on the MDS (vm4), all of the llog_test.c tests ran and completed, but the MDS then hit errors while cleaning up and unmounting:

[ 3162.697626] Lustre: DEBUG MARKER: ls -d /sbin/llog_reader
[ 3163.062999] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts
[ 3163.339465] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 3163.485523] LustreError: 25441:0:(ldlm_resource.c:1094:ldlm_resource_complain()) lustre-MDT0000-lwp-MDT0000: namespace resource [0x200000006:0x1010000:0x0].0x0 (ffff88005b72b6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 3163.489978] LustreError: 25441:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0x0 (ffff88005b72b6c0) refcount = 2
[ 3163.493845] LustreError: 25441:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order):
[ 3163.496226] LustreError: 25441:0:(ldlm_resource.c:1682:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff88005b6a6d80/0xa7b5899cf22cc13d lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xa7b5899cf22cc1bb expref: -99 pid: 12740 timeout: 0 lvb_type: 2
[ 3163.502995] LustreError: 25441:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000006:0x10000:0x0].0x0 (ffff880000047e40) refcount = 2
[ 3163.507253] LustreError: 25441:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order):
[ 3163.509922] Lustre: Failing over lustre-MDT0000
[ 3165.457763] Lustre: lustre-MDT0000: Not available for connect from 10.9.4.212@tcp (stopping)
[ 3165.572163] LustreError: 25441:0:(genops.c:436:class_free_dev()) Cleanup lustre-QMT0000 returned -95
[ 3165.574521] LustreError: 25441:0:(genops.c:436:class_free_dev()) Skipped 1 previous similar message
[ 3167.767456] Lustre: lustre-MDT0000: Not available for connect from 10.9.4.213@tcp (stopping)
[ 3167.770166] Lustre: Skipped 6 previous similar messages
[ 3170.455396] Lustre: lustre-MDT0000: Not available for connect from 10.9.4.212@tcp (stopping)
[ 3172.756491] Lustre: lustre-MDT0000: Not available for connect from 10.9.4.213@tcp (stopping)
[ 3172.759177] Lustre: Skipped 7 previous similar messages
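
For anyone trying to reproduce this outside of autotest, a minimal sketch using the in-tree test scripts; the install path and invocation below are assumptions and vary by packaging:

cd /usr/lib64/lustre/tests     # install location varies by distro/packaging
sh llmount.sh                  # format and mount a local test filesystem
ONLY=60a sh sanity.sh          # 60a runs the kernel llog tests, then stops the MDS (where it hangs here)
# on an affected Ubuntu configuration the test stalls at 'umount -d /mnt/lustre-mds1'
sh llmountcleanup.sh           # tear the test filesystem back down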

This failure started on October 27, 2017 with 2.10.54 on the master branch and on November 27, 2017 with 2.10.2 RC1 on b2_10.

Logs for this failure are at:
master branch
https://testing.hpdd.intel.com/test_sets/7f763f4e-d76b-11e7-a066-52540065bddc
https://testing.hpdd.intel.com/test_sets/6860bbbe-d021-11e7-a066-52540065bddc (interop with 2.9.0)
https://testing.hpdd.intel.com/test_sets/591a4292-ca59-11e7-9840-52540065bddc
https://testing.hpdd.intel.com/test_sets/54b95766-bb6c-11e7-84a9-52540065bddc

b2_10
https://testing.hpdd.intel.com/test_sets/bbeaa6be-d459-11e7-9c63-52540065bddc



 Comments   
Comment by John Hammond [ 05/Dec/17 ]

Likely due to the same cause as LU-10320.

Comment by Sarah Liu [ 17/May/18 ]

+1 on b2_10 https://testing.hpdd.intel.com/test_sets/051774a8-5956-11e8-abc3-52540065bddc

Comment by Sarah Liu [ 30/May/18 ]

In tag 2.11.52 SLES12sp3 server/client testing, sanity test 60a failed for a similar reason; sanity test_17g passed in the same session:

https://testing.hpdd.intel.com/test_sets/652db46e-5a74-11e8-abc3-52540065bddc

Comment by Sarah Liu [ 19/Mar/19 ]

A similar issue was hit in interop testing of 2.10.7:
https://testing.whamcloud.com/test_sets/5db78bc0-432a-11e9-92fe-52540065bddc
