[LU-8792] Interop - master<->2.8: sanity-hsm test_107: hung while umount MDT | Created: 02/Nov/16 | Updated: 16/Jan/19 | Resolved: 16/Jan/19 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0, Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Yang Sheng |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: | Interop - 2.8.0 EL6.7 Server / EL7.2 Client |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5d28dbcc-a076-11e6-8761-5254006e85c2.

The sub-test test_107 failed with the following error:

test failed to respond and timed out

OST console:

23:51:00:Lustre: DEBUG MARKER: == sanity-hsm test 107: Copytool re-register after MDS restart ======================================= 22:50:03 (1477954203)
23:51:00:LustreError: 11-0: lustre-MDT0000-lwp-OST0001: operation obd_ping to node 10.9.5.239@tcp failed: rc = -107
23:51:00:Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.5.239@tcp) was lost; in progress operations using this service will wait for recovery to complete
23:51:01:automount D 0000000000000000 0 3827 1 0x00000080
23:51:01: ffff88003e827be8 0000000000000082 00000000ffffffff 000023fcaf1a2ab5
23:51:01: ffff880037f78070 ffff8800466a4bb0 000000000081ce38 ffffffffa7e85d92
23:51:01: 000000001cc8adaa 0000000100d02944 ffff88005a82bad8 ffff88003e827fd8
23:51:01:Call Trace:
23:51:01: [<ffffffff811279c0>] ? sync_page_killable+0x0/0x40
23:51:01: [<ffffffff815399b3>] io_schedule+0x73/0xc0
23:51:01: [<ffffffff811279ad>] sync_page+0x3d/0x50
23:51:01: [<ffffffff811279ce>] sync_page_killable+0xe/0x40
23:51:01: [<ffffffff8153a24a>] __wait_on_bit_lock+0x5a/0xc0
23:51:01: [<ffffffff811278d7>] __lock_page_killable+0x67/0x70
23:51:01: [<ffffffff810a14e0>] ? wake_bit_function+0x0/0x50
23:51:01: [<ffffffff8112777e>] ? find_get_page+0x1e/0xa0
23:51:01: [<ffffffff81129604>] generic_file_aio_read+0x4b4/0x700
23:51:01: [<ffffffff81191e4a>] do_sync_read+0xfa/0x140
23:51:01: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
23:51:01: [<ffffffff81159b85>] ? do_mmap_pgoff+0x335/0x380
23:51:01: [<ffffffff81232036>] ? security_file_permission+0x16/0x20
23:51:01: [<ffffffff81192745>] vfs_read+0xb5/0x1a0
23:51:01: [<ffffffff8119351f>] ? fget_light_pos+0x3f/0x50
23:51:01: [<ffffffff81192a91>] sys_read+0x51/0xb0
23:51:01: [<ffffffff810e8c7e>] ? __audit_syscall_exit+0x25e/0x290
23:51:01: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

MDS console:

23:17:51:LustreError: 17770:0:(client.c:1133:ptlrpc_import_delay_req()) Skipped 9 previous similar messages
23:17:51:Lustre: lustre-MDT0000: Not available for connect from 10.9.5.241@tcp (stopping)
23:17:51:Lustre: Skipped 1130 previous similar messages
23:17:51:LustreError: 0-0: Forced cleanup waiting for mdt-lustre-MDT0000_UUID namespace with 1 resources in use, (rc=-110)
23:17:51:LustreError: Skipped 102 previous similar messages
23:17:51:LustreError: 17786:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
23:17:51:LustreError: 17786:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 9 previous similar messages
23:17:51:LustreError: 17822:0:(client.c:1133:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8800444f73c0 x1549736680098240/t0(0) o101->lustre-MDT0000-lwp-MDT0000@0@lo:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
23:17:51:LustreError: 17822:0:(client.c:1133:ptlrpc_import_delay_req()) Skipped 17 previous similar messages
23:17:51:LustreError: 17836:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
23:17:51:LustreError: 17836:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 17 previous similar messages
23:17:51:Lustre: lustre-MDT0000: Not available for connect from 10.9.5.242@tcp (stopping)
23:17:51:Lustre: Skipped 1321 previous similar messages
23:17:51:LustreError: 0-0: Forced cleanup waiting for mdt-lustre-MDT0000_UUID namespace with 1 resources in use, (rc=-110)
23:50:37:********** Timeout by autotest system **********

Client console:

23:50:57:[10301.531931] Lustre: DEBUG MARKER: == sanity-hsm test 107: Copytool re-register after MDS restart ======================================= 22:50:03 (1477954203)
23:50:57:[10311.388026] LustreError: 11-0: lustre-MDT0000-mdc-ffff880007dc1800: operation obd_ping to node 10.9.5.239@tcp failed: rc = -107
23:50:57:[10311.396253] Lustre: lustre-MDT0000-mdc-ffff880007dc1800: Connection to lustre-MDT0000 (at 10.9.5.239@tcp) was lost; in progress operations using this service will wait for recovery to complete
23:50:57:[13944.654013] auditd D ffff88007bb8fd90 0 490 1 0x00000000
23:50:57:[13944.654013] ffff88007bb8fc30 0000000000000086 ffff880036707300 ffff88007bb8ffd8
23:50:57:[13944.654013] ffff88007bb8ffd8 ffff88007bb8ffd8 ffff880036707300 ffff88007fc167c0
23:50:57:[13944.654013] 0000000000000000 7fffffffffffffff ffffffff81168d10 ffff88007bb8fd90
23:50:57:[13944.654013] Call Trace:
23:50:57:[13944.654013] [<ffffffff81168d10>] ? wait_on_page_read+0x60/0x60
23:50:57:[13944.654013] [<ffffffff8163bb39>] schedule+0x29/0x70
23:50:57:[13944.654013] [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
23:50:57:[13944.654013] [<ffffffff81058aaf>] ? kvm_clock_get_cycles+0x1f/0x30
23:50:57:[13944.654013] [<ffffffff81168d10>] ? wait_on_page_read+0x60/0x60
23:50:57:[13944.654013] [<ffffffff8163b16e>] io_schedule_timeout+0xae/0x130
23:50:57:[13944.654013] [<ffffffff8163b208>] io_schedule+0x18/0x20
23:50:57:[13944.654013] [<ffffffff81168d1e>] sleep_on_page+0xe/0x20
23:50:57:[13944.654013] [<ffffffff816399b0>] __wait_on_bit+0x60/0x90
23:50:57:[13944.654013] [<ffffffff81168aa6>] wait_on_page_bit+0x86/0xb0
23:50:57:[13944.654013] [<ffffffff810a6c00>] ? wake_atomic_t_function+0x40/0x40
23:50:57:[13944.654013] [<ffffffff81168be1>] filemap_fdatawait_range+0x111/0x1b0
23:50:57:[13944.654013] [<ffffffff81175d3e>] ? do_writepages+0x1e/0x40
23:50:57:[13944.654013] [<ffffffff8116aad5>] ? __filemap_fdatawrite_range+0x65/0x80
23:50:57:[13944.654013] [<ffffffff8116abff>] filemap_write_and_wait_range+0x3f/0x70
23:50:57:[13944.654013] [<ffffffffa01807aa>] ext4_sync_file+0xba/0x320 [ext4]
23:50:57:[13944.654013] [<ffffffff81210185>] do_fsync+0x65/0xa0
23:50:57:[13944.654013] [<ffffffff81210450>] SyS_fsync+0x10/0x20
23:50:57:[13944.654013] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b |
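For reference, a minimal sketch of re-running just this sub-test outside autotest, assuming a test cluster configured for the standard lustre/tests framework with an HSM copytool agent available; MDS_HOST is a placeholder for the MDS node, and the commands after the test are taken from the markers logged above:

# Re-run only sanity-hsm sub-test 107 (copytool re-register after MDS restart)
cd lustre/tests
ONLY=107 sh sanity-hsm.sh

# If the umount hangs again, check the MDT state on the MDS by hand,
# mirroring what the test framework itself runs:
ssh MDS_HOST "grep -c /mnt/lustre-mds1' ' /proc/mounts"
ssh MDS_HOST "lctl get_param -n mdt.lustre-MDT0000.hsm.actions"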
| Comments |
| Comment by Peter Jones [ 04/Nov/16 ] |
|
Yang Sheng, could you please advise on this one? Peter |
| Comment by Yang Sheng [ 09/Nov/16 ] |
|
The umount hung on:

umount S 0000000000000000 0 17653 17652 0x00000080
 ffff88004f20f9a8 0000000000000082 ffffffffa0853748 0000000000000000
 ffff88004f20fa08 000000004f20c000 ffffffffa08b62b0 ffff880046461b18
 000000564f20f948 0000000100d03478 ffff88004f3bdad8 ffff88004f20ffd8
Call Trace:
 [<ffffffffa0483e1e>] ? cfs_hash_spin_lock+0xe/0x10 [libcfs]
 [<ffffffff8153a042>] schedule_timeout+0x192/0x2e0
 [<ffffffff81089be0>] ? process_timeout+0x0/0x10
 [<ffffffffa075f1e0>] __ldlm_namespace_free+0x1c0/0x560 [ptlrpc]
 [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
 [<ffffffffa075f5ef>] ldlm_namespace_free_prior+0x6f/0x220 [ptlrpc]
 [<ffffffffa0e045d2>] mdt_device_fini+0x6a2/0x12e0 [mdt]
 [<ffffffffa055eb06>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
 [<ffffffffa0577332>] class_cleanup+0x572/0xd20 [obdclass]
 [<ffffffffa055a386>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa0579646>] class_process_config+0x1b66/0x24c0 [obdclass]
 [<ffffffffa047dcf1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81178a2c>] ? __kmalloc+0x21c/0x230
 [<ffffffffa057a45f>] class_manual_cleanup+0x4bf/0xc90 [obdclass]
 [<ffffffffa055a386>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa05aba1c>] server_put_super+0x8bc/0xcd0 [obdclass]
 [<ffffffff811944bb>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff811945a6>] kill_anon_super+0x16/0x60
 [<ffffffffa057d646>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff81194d47>] deactivate_super+0x57/0x80
 [<ffffffff811b4d3f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811b588b>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

and

LustreError: 0-0: Forced cleanup waiting for mdt-lustre-MDT0000_UUID namespace with 1 resources in use, (rc=-110)

So it looks like some lock is still not released, but I cannot find any clue to prove that. Then from the MDS dmesg log:

Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-hsm test 107: Copytool re-register after MDS restart ======================================= 22:50:03 \(1477954203\)
Lustre: DEBUG MARKER: == sanity-hsm test 107: Copytool re-register after MDS restart ======================================= 22:50:03 (1477954203)
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x21a:0x0'.*action='ARCHIVE'/ {print $13}' | cut -f2 -d=
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x21a:0x0'.*action='ARCHIVE'/ {print $13}' | cut -f2 -d=
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts
Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
Lustre: Failing over lustre-MDT0000
LustreError: 17653:0:(ldlm_resource.c:887:ldlm_resource_complain()) mdt-lustre-MDT0000_UUID: namespace resource [0x200000404:0xb24e:0x0].0x0 (ffff8800453f3080) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 17653:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x200000404:0xb24e:0x0].0x0 (ffff8800453f3080) refcount = 2
Lustre: lustre-MDT0000: Not available for connect from 10.9.5.240@tcp (stopping)
The refcount of the resource is nonzero, but no locks were dumped. Looks like the same issue as |
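If this is hit again, it may help to capture the LDLM namespace state on the MDS just before/while the umount runs, to see whether the leftover resource still has a lock attached or only a stray reference. A minimal sketch, assuming the usual ldlm proc parameters (resource_count, lock_count, dump_namespaces) are present in this 2.8/2.9 interop setup and using the mdt-lustre-MDT0000_UUID namespace name from the messages above:

# Outstanding resources and locks in the MDT namespace:
lctl get_param ldlm.namespaces.mdt-lustre-MDT0000_UUID.resource_count
lctl get_param ldlm.namespaces.mdt-lustre-MDT0000_UUID.lock_count

# Dump all namespaces into the Lustre debug log and save it for inspection:
lctl set_param debug=+dlmtrace
lctl set_param ldlm.dump_namespaces=1
lctl dk /tmp/ldlm_namespaces.log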
| Comment by James Casper [ 24/May/17 ] |
|
2.9.57, b3575: |
| Comment by Yang Sheng [ 16/Jan/19 ] |
|
Please feel free to reopen it. |