[LU-8792] Interop - master<->2.8 :sanity-hsm test_107: hung while umount MDT Created: 02/Nov/16  Updated: 16/Jan/19  Resolved: 16/Jan/19

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Yang Sheng
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Interop - 2.8.0 EL6.7 Server / EL7.2 Client
Server - b2_8_fe, build# 12, RHEL 6.7
Client - master, build# 3468, RHEL 7.2


Issue Links:
Related
is related to LU-2067 ldlm_resource_complain()) Namespace M... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5d28dbcc-a076-11e6-8761-5254006e85c2.

The sub-test test_107 failed with the following error:

test failed to respond and timed out

OST console:

23:51:00:Lustre: DEBUG MARKER: == sanity-hsm test 107: Copytool re-register after MDS restart ======================================= 22:50:03 (1477954203)
23:51:00:LustreError: 11-0: lustre-MDT0000-lwp-OST0001: operation obd_ping to node 10.9.5.239@tcp failed: rc = -107
23:51:00:Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.5.239@tcp) was lost; in progress operations using this service will wait for recovery to complete

23:51:01:automount     D 0000000000000000     0  3827      1 0x00000080
23:51:01: ffff88003e827be8 0000000000000082 00000000ffffffff 000023fcaf1a2ab5
23:51:01: ffff880037f78070 ffff8800466a4bb0 000000000081ce38 ffffffffa7e85d92
23:51:01: 000000001cc8adaa 0000000100d02944 ffff88005a82bad8 ffff88003e827fd8
23:51:01:Call Trace:
23:51:01: [<ffffffff811279c0>] ? sync_page_killable+0x0/0x40
23:51:01: [<ffffffff815399b3>] io_schedule+0x73/0xc0
23:51:01: [<ffffffff811279ad>] sync_page+0x3d/0x50
23:51:01: [<ffffffff811279ce>] sync_page_killable+0xe/0x40
23:51:01: [<ffffffff8153a24a>] __wait_on_bit_lock+0x5a/0xc0
23:51:01: [<ffffffff811278d7>] __lock_page_killable+0x67/0x70
23:51:01: [<ffffffff810a14e0>] ? wake_bit_function+0x0/0x50
23:51:01: [<ffffffff8112777e>] ? find_get_page+0x1e/0xa0
23:51:01: [<ffffffff81129604>] generic_file_aio_read+0x4b4/0x700
23:51:01: [<ffffffff81191e4a>] do_sync_read+0xfa/0x140
23:51:01: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
23:51:01: [<ffffffff81159b85>] ? do_mmap_pgoff+0x335/0x380
23:51:01: [<ffffffff81232036>] ? security_file_permission+0x16/0x20
23:51:01: [<ffffffff81192745>] vfs_read+0xb5/0x1a0
23:51:01: [<ffffffff8119351f>] ? fget_light_pos+0x3f/0x50
23:51:01: [<ffffffff81192a91>] sys_read+0x51/0xb0
23:51:01: [<ffffffff810e8c7e>] ? __audit_syscall_exit+0x25e/0x290
23:51:01: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

MDS console:

23:17:51:LustreError: 17770:0:(client.c:1133:ptlrpc_import_delay_req()) Skipped 9 previous similar messages
23:17:51:Lustre: lustre-MDT0000: Not available for connect from 10.9.5.241@tcp (stopping)
23:17:51:Lustre: Skipped 1130 previous similar messages
23:17:51:LustreError: 0-0: Forced cleanup waiting for mdt-lustre-MDT0000_UUID namespace with 1 resources in use, (rc=-110)
23:17:51:LustreError: Skipped 102 previous similar messages
23:17:51:LustreError: 17786:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
23:17:51:LustreError: 17786:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 9 previous similar messages
23:17:51:LustreError: 17822:0:(client.c:1133:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8800444f73c0 x1549736680098240/t0(0) o101->lustre-MDT0000-lwp-MDT0000@0@lo:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
23:17:51:LustreError: 17822:0:(client.c:1133:ptlrpc_import_delay_req()) Skipped 17 previous similar messages
23:17:51:LustreError: 17836:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
23:17:51:LustreError: 17836:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 17 previous similar messages
23:17:51:Lustre: lustre-MDT0000: Not available for connect from 10.9.5.242@tcp (stopping)
23:17:51:Lustre: Skipped 1321 previous similar messages
23:17:51:LustreError: 0-0: Forced cleanup waiting for mdt-lustre-MDT0000_UUID namespace with 1 resources in use, (rc=-110)
23:50:37:********** Timeout by autotest system **********

Client console:

23:50:57:[10301.531931] Lustre: DEBUG MARKER: == sanity-hsm test 107: Copytool re-register after MDS restart ======================================= 22:50:03 (1477954203)
23:50:57:[10311.388026] LustreError: 11-0: lustre-MDT0000-mdc-ffff880007dc1800: operation obd_ping to node 10.9.5.239@tcp failed: rc = -107
23:50:57:[10311.396253] Lustre: lustre-MDT0000-mdc-ffff880007dc1800: Connection to lustre-MDT0000 (at 10.9.5.239@tcp) was lost; in progress operations using this service will wait for recovery to complete

23:50:57:[13944.654013] auditd          D ffff88007bb8fd90     0   490      1 0x00000000
23:50:57:[13944.654013]  ffff88007bb8fc30 0000000000000086 ffff880036707300 ffff88007bb8ffd8
23:50:57:[13944.654013]  ffff88007bb8ffd8 ffff88007bb8ffd8 ffff880036707300 ffff88007fc167c0
23:50:57:[13944.654013]  0000000000000000 7fffffffffffffff ffffffff81168d10 ffff88007bb8fd90
23:50:57:[13944.654013] Call Trace:
23:50:57:[13944.654013]  [<ffffffff81168d10>] ? wait_on_page_read+0x60/0x60
23:50:57:[13944.654013]  [<ffffffff8163bb39>] schedule+0x29/0x70
23:50:57:[13944.654013]  [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
23:50:57:[13944.654013]  [<ffffffff81058aaf>] ? kvm_clock_get_cycles+0x1f/0x30
23:50:57:[13944.654013]  [<ffffffff81168d10>] ? wait_on_page_read+0x60/0x60
23:50:57:[13944.654013]  [<ffffffff8163b16e>] io_schedule_timeout+0xae/0x130
23:50:57:[13944.654013]  [<ffffffff8163b208>] io_schedule+0x18/0x20
23:50:57:[13944.654013]  [<ffffffff81168d1e>] sleep_on_page+0xe/0x20
23:50:57:[13944.654013]  [<ffffffff816399b0>] __wait_on_bit+0x60/0x90
23:50:57:[13944.654013]  [<ffffffff81168aa6>] wait_on_page_bit+0x86/0xb0
23:50:57:[13944.654013]  [<ffffffff810a6c00>] ? wake_atomic_t_function+0x40/0x40
23:50:57:[13944.654013]  [<ffffffff81168be1>] filemap_fdatawait_range+0x111/0x1b0
23:50:57:[13944.654013]  [<ffffffff81175d3e>] ? do_writepages+0x1e/0x40
23:50:57:[13944.654013]  [<ffffffff8116aad5>] ? __filemap_fdatawrite_range+0x65/0x80
23:50:57:[13944.654013]  [<ffffffff8116abff>] filemap_write_and_wait_range+0x3f/0x70
23:50:57:[13944.654013]  [<ffffffffa01807aa>] ext4_sync_file+0xba/0x320 [ext4]
23:50:57:[13944.654013]  [<ffffffff81210185>] do_fsync+0x65/0xa0
23:50:57:[13944.654013]  [<ffffffff81210450>] SyS_fsync+0x10/0x20
23:50:57:[13944.654013]  [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Peter Jones [ 04/Nov/16 ]

Yang Sheng

Could you please advise on this one?

Peter

Comment by Yang Sheng [ 09/Nov/16 ]

The umount hung on:

umount        S 0000000000000000     0 17653  17652 0x00000080
 ffff88004f20f9a8 0000000000000082 ffffffffa0853748 0000000000000000
 ffff88004f20fa08 000000004f20c000 ffffffffa08b62b0 ffff880046461b18
 000000564f20f948 0000000100d03478 ffff88004f3bdad8 ffff88004f20ffd8
Call Trace:
 [<ffffffffa0483e1e>] ? cfs_hash_spin_lock+0xe/0x10 [libcfs]
 [<ffffffff8153a042>] schedule_timeout+0x192/0x2e0
 [<ffffffff81089be0>] ? process_timeout+0x0/0x10
 [<ffffffffa075f1e0>] __ldlm_namespace_free+0x1c0/0x560 [ptlrpc]
 [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
 [<ffffffffa075f5ef>] ldlm_namespace_free_prior+0x6f/0x220 [ptlrpc]
 [<ffffffffa0e045d2>] mdt_device_fini+0x6a2/0x12e0 [mdt]
 [<ffffffffa055eb06>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
 [<ffffffffa0577332>] class_cleanup+0x572/0xd20 [obdclass]
 [<ffffffffa055a386>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa0579646>] class_process_config+0x1b66/0x24c0 [obdclass]
 [<ffffffffa047dcf1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81178a2c>] ? __kmalloc+0x21c/0x230
 [<ffffffffa057a45f>] class_manual_cleanup+0x4bf/0xc90 [obdclass]
 [<ffffffffa055a386>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa05aba1c>] server_put_super+0x8bc/0xcd0 [obdclass]
 [<ffffffff811944bb>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff811945a6>] kill_anon_super+0x16/0x60
 [<ffffffffa057d646>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff81194d47>] deactivate_super+0x57/0x80
 [<ffffffff811b4d3f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811b588b>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

and

LustreError: 0-0: Forced cleanup waiting for mdt-lustre-MDT0000_UUID namespace with 1 resources in use, (rc=-110)

So it looks like some lock was still not released, but I cannot find any clue to prove it.
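
For reference, a minimal userspace sketch of the wait-and-retry pattern that produces the repeated "Forced cleanup waiting ... (rc=-110)" message seen above. This is not the actual Lustre code; struct and variable names below are illustrative stand-ins:

/*
 * Sketch only: the cleanup path waits for the namespace's resource count
 * to drop to zero, and after each timed-out wait it logs the complaint
 * and keeps forcing cleanup. rc=-110 is -ETIMEDOUT.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

struct namespace_state {
        const char *name;
        int resources_in_use;   /* still-referenced ldlm resources */
};

/* Wait up to 'secs' seconds for the namespace to drain;
 * return -ETIMEDOUT (-110) otherwise, as printed in the console log. */
static int wait_for_empty(struct namespace_state *ns, unsigned int secs)
{
        unsigned int i;

        for (i = 0; i < secs; i++) {
                if (ns->resources_in_use == 0)
                        return 0;
                sleep(1);
        }
        return -ETIMEDOUT;
}

int main(void)
{
        struct namespace_state ns = { "mdt-lustre-MDT0000_UUID", 1 };
        int retries = 3;   /* demo bound; the real cleanup keeps retrying, hence the hang */
        int rc;

        while ((rc = wait_for_empty(&ns, 2)) != 0) {
                fprintf(stderr, "Forced cleanup waiting for %s namespace "
                        "with %d resources in use, (rc=%d)\n",
                        ns.name, ns.resources_in_use, rc);
                if (--retries == 0)
                        break;  /* umount stays stuck until the leaked reference drops */
        }
        return 0;
}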

Then, from the MDS dmesg log:

Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-hsm test 107: Copytool re-register after MDS restart ======================================= 22:50:03 \(1477954203\)
Lustre: DEBUG MARKER: == sanity-hsm test 107: Copytool re-register after MDS restart ======================================= 22:50:03 (1477954203)
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x21a:0x0'.*action='ARCHIVE'/ {print $13}' | cut -f2 -d=
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x21a:0x0'.*action='ARCHIVE'/ {print $13}' | cut -f2 -d=
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts
Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
Lustre: Failing over lustre-MDT0000
LustreError: 17653:0:(ldlm_resource.c:887:ldlm_resource_complain()) mdt-lustre-MDT0000_UUID: namespace resource [0x200000404:0xb24e:0x0].0x0 (ffff8800453f3080) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 17653:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x200000404:0xb24e:0x0].0x0 (ffff8800453f3080) refcount = 2
Lustre: lustre-MDT0000: Not available for connect from 10.9.5.240@tcp (stopping)

The refcount of the resource is nonzero, but there is no lock dump. Looks like the same issue as LU-2067?
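
For illustration only (again, not the actual Lustre code), the ldlm_resource_complain() line in the dmesg above corresponds to a check of roughly this shape: if a resource is still referenced after all of its locks have been cleaned up, complain and force the cleanup. The struct and field names are simplified stand-ins; the values come from the log above:

#include <stdio.h>

struct resource_state {
        unsigned long long f_seq, f_oid, f_ver;  /* FID of the resource */
        int refcount;           /* references left after lock cleanup */
};

/* Complain if a resource is still referenced once its locks are gone. */
static void resource_complain(const struct resource_state *res,
                              const char *ns_name)
{
        if (res->refcount != 0)
                fprintf(stderr, "%s: namespace resource "
                        "[0x%llx:0x%llx:0x%llx] refcount nonzero (%d) "
                        "after lock cleanup; forcing cleanup.\n",
                        ns_name, res->f_seq, res->f_oid, res->f_ver,
                        res->refcount);
}

int main(void)
{
        /* values taken from the MDS console log above */
        struct resource_state res = { 0x200000404ULL, 0xb24eULL, 0x0ULL, 1 };

        resource_complain(&res, "mdt-lustre-MDT0000_UUID");
        return 0;
}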

Comment by James Casper [ 24/May/17 ]

2.9.57, b3575:
https://testing.hpdd.intel.com/test_sessions/2209839b-42d4-4fe6-91f1-96f9ce3c5a69

Comment by Yang Sheng [ 16/Jan/19 ]

Closing as cannot reproduce. Please feel free to reopen it.
