[LU-706] 1.8<->2.1 interop: recovery-small test 51 hung Created: 22/Sep/11  Updated: 28/May/17  Resolved: 28/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Old Lustre Versions: 1.8.5 and 1.8.6-wc1
Lustre 1.8.6-wc1 Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/

New Lustre Version: 2.1.0 RC2
Lustre 2.1.0 RC2 Build: http://newbuild.whamcloud.com/job/lustre-master/283/
Network: TCP (1GigE)

Clean upgrading (Lustre servers and clients were upgraded all at once) from Lustre 1.8.5 and 1.8.6-wc1 to Lustre 2.1.0 RC2 under the following configuration:
OSS1: RHEL5/x86_64 upgrade from 1.8.6-wc1 to 2.1.0
OSS2: RHEL5/x86_64 upgrade from 1.8.5 to 2.1.0
MDS: RHEL5/x86_64 upgrade from 1.8.6-wc1 to 2.1.0
Client1: RHEL6/x86_64 upgrade from 1.8.6-wc1 to 2.1.0
Client2: RHEL5/x86_64 upgrade from 1.8.5 to 2.1.0

Test nodes:
OSS1: fat-amd-2
OSS2: fat-amd-3
MDS: fat-amd-1
Client1: client-12
Client2: client-13


Severity: 3
Rank (Obsolete): 10720

 Description   

After the clean upgrading, running recovery-small test 51 hung, dmesg on the client node client-12 showed that:

INFO: task writemany:32634 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
writemany     D 0000000000000002     0 32634  32632 0x00000080
 ffff8802dc4f9d38 0000000000000082 0000000000000000 02001bcd48000001
 ffff880202001bcd ffff8802d7a84238 ffff8802000041ed 00000001002e2106
 ffff880326b8b038 ffff8802dc4f9fd8 000000000000f598 ffff880326b8b038
Call Trace:
 [<ffffffff814dc13e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8115b102>] ? kmem_cache_alloc+0x182/0x190
 [<ffffffff814dbfdb>] mutex_lock+0x2b/0x50
 [<ffffffff81182650>] do_filp_open+0x2c0/0xd90
 [<ffffffff811af00e>] ? __fsnotify_inode_delete+0xe/0x10
 [<ffffffff8118c6aa>] ? __destroy_inode+0x2a/0xa0
 [<ffffffff8117f005>] ? putname+0x35/0x50
 [<ffffffff8118f0d2>] ? alloc_fd+0x92/0x160
 [<ffffffff8116f879>] do_sys_open+0x69/0x140
 [<ffffffff8100c635>] ? math_state_restore+0x45/0x60
 [<ffffffff8116f990>] sys_open+0x20/0x30
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
INFO: task writemany:32635 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
writemany     D 0000000000000001     0 32635  32632 0x00000080
 ffff8802d6e67e08 0000000000000086 0000000000000000 ffffffff812061ff
 ffff8802d6e67de8 ffffffff8118087d 0000000000000000 00000001002e211b
 ffff8802dd459ab8 ffff8802d6e67fd8 000000000000f598 ffff8802dd459ab8
Call Trace:
 [<ffffffff812061ff>] ? security_inode_permission+0x1f/0x30
 [<ffffffff8118087d>] ? __link_path_walk+0x11d/0x820
 [<ffffffff814dc13e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff814dbfdb>] mutex_lock+0x2b/0x50
 [<ffffffff81183456>] do_unlinkat+0x96/0x1c0
 [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81183596>] sys_unlink+0x16/0x20
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
INFO: task writemany:32637 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
writemany     D 0000000000000002     0 32637  32632 0x00000080
 ffff8802dc6e3e08 0000000000000086 0000000000000000 ffffffff812061ff
 ffff8802dc6e3de8 ffffffff8118087d 0000000000000000 00000001002e20c1
 ffff8802d0799af8 ffff8802dc6e3fd8 000000000000f598 ffff8802d0799af8
Call Trace:
 [<ffffffff812061ff>] ? security_inode_permission+0x1f/0x30
 [<ffffffff8118087d>] ? __link_path_walk+0x11d/0x820
 [<ffffffff814dc13e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff814dbfdb>] mutex_lock+0x2b/0x50
 [<ffffffff81183456>] do_unlinkat+0x96/0x1c0
 [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81183596>] sys_unlink+0x16/0x20
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
LustreError: 32638:0:(client.c:1899:ptlrpc_interrupted_set()) INTERRUPTED SET ffff8802e30e2b40
LustreError: 32638:0:(file.c:158:ll_close_inode_openhandle()) inode 144146779976237058 mdc close failed: rc = -4
INFO: task ptlrpcd-rcv:6723 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ptlrpcd-rcv   D 0000000000000001     0  6723      2 0x00000080
 ffff8803222cd7f0 0000000000000046 0000000000000000 0000000000000000
 ffff8802cdaf92b8 ffff88031b326400 ffff8802cdaf9180 00000001002f14f2
 ffff8803222c9078 ffff8803222cdfd8 000000000000f598 ffff8803222c9078
Call Trace:
 [<ffffffffa02f96fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
 [<ffffffffa065d140>] seq_client_flush+0x80/0x110 [fid]
 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
 [<ffffffffa066bb7c>] mdc_fid_delete+0x1c/0x20 [mdc]
 [<ffffffffa07b1c4f>] ll_delete_inode+0x12f/0x650 [lustre]
 [<ffffffffa07b1b20>] ? ll_delete_inode+0x0/0x650 [lustre]
 [<ffffffff8118d15e>] generic_delete_inode+0xde/0x1d0
 [<ffffffff8118d2b5>] generic_drop_inode+0x65/0x80
 [<ffffffff8118c122>] iput+0x62/0x70
 [<ffffffffa07ce640>] ll_md_blocking_ast+0x270/0x600 [lustre]
 [<ffffffffa04b5ba3>] ldlm_cancel_callback+0x63/0xf0 [ptlrpc]
 [<ffffffffa04b5c89>] ldlm_lock_cancel+0x59/0x190 [ptlrpc]
 [<ffffffffa04d1e91>] ? ldlm_prepare_lru_list+0x281/0x460 [ptlrpc]
 [<ffffffffa04d2df8>] ldlm_cli_cancel_list_local+0x78/0x230 [ptlrpc]
 [<ffffffffa0303691>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
 [<ffffffffa04cf9b0>] ? ldlm_cancel_no_wait_policy+0x0/0xf0 [ptlrpc]
 [<ffffffffa04d3265>] ldlm_cancel_lru_local+0x35/0x40 [ptlrpc]
 [<ffffffffa04d3395>] ldlm_replay_locks+0x125/0x7d0 [ptlrpc]
 [<ffffffffa0520f5a>] ptlrpc_import_recovery_state_machine+0x8ca/0xd20 [ptlrpc]
 [<ffffffffa04fe19c>] ? lustre_msg_get_last_committed+0x7c/0xe0 [ptlrpc]
 [<ffffffffa0521ba8>] ptlrpc_connect_interpret+0x5d8/0x1fe0 [ptlrpc]
 [<ffffffffa04f17ea>] ptlrpc_check_set+0x47a/0x1b20 [ptlrpc]
 [<ffffffff8107a11b>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffff8107a1a2>] ? del_timer_sync+0x22/0x30
 [<ffffffffa0525428>] ptlrpcd_check+0x1c8/0x250 [ptlrpc]
 [<ffffffffa05257cb>] ptlrpcd+0x31b/0x390 [ptlrpc]
 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
 [<ffffffffa05254b0>] ? ptlrpcd+0x0/0x390 [ptlrpc]
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffffa05254b0>] ? ptlrpcd+0x0/0x390 [ptlrpc]
 [<ffffffffa05254b0>] ? ptlrpcd+0x0/0x390 [ptlrpc]
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/d10f8468-e515-11e0-9909-52540025f9af

This test passed while running it after clean upgrading from Lustre Lustre 1.8.5 and 1.8.6-wc1 to Lustre 2.1.0 RC1:
https://maloo.whamcloud.com/sub_tests/d9c48674-d82e-11e0-8d02-52540025f9af



 Comments   
Comment by Andreas Dilger [ 28/May/17 ]

Close old issue.

Generated at Sat Feb 10 01:09:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.