Details
-
Bug
-
Resolution: Low Priority
-
Minor
-
None
-
Lustre 2.1.0
-
None
-
Servers and physical clients
----------------------------------------:
Kernel: 2.6.18-238
Lustre: 2.0.62
VM clients:
---------------
kernel-2.6.18-238.12.1.el5xen
Lustre: 2.0.62/63
-
3
-
10123
Description
We've been shifting between all 4 different kerberos flavors just for purposes for benchmarking..
In the end, we'll just stick with 1 kerberos flavor.
During instantaneous shifts from 1 flavor to another, we observe the there needs to be a syncing that needs to happen among all systems in the realm- clients especially as the fs is inaccessbile (hangs) until the sync is completed.
Syncs can take as long at 5minutes up to 15-20minutes.
Below are all the errors that show up on the clients during the syncing process.
1. Lustre_msg_buf
INFO: task bash:9333 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash D 0000000000000200 0 9333 1 9382 (NOTLB)
ffff880018f559c8 0000000000000286 0000000000000000 ffffffff886c2ac0
000000000000000a ffff88002152d0c0 ffff88003071f7e0 00000000000393ca
ffff88002152d2a8 ffff880022118200
Call Trace:
[<ffffffff886c2ac0>] :ptlrpc:lustre_msg_buf+0x0/0x80
[<ffffffff886c3760>] :ptlrpc:lustre_swab_ldlm_request+0x0/0x20
[<ffffffff8026481a>] __down+0xc3/0xd8
[<ffffffff8028a044>] default_wake_function+0x0/0xe
[<ffffffff802644d8>] __down_failed+0x35/0x3a
[<ffffffff88811dac>] :mdc:.text.lock.mdc_locks+0x5/0x21
[<ffffffff80264ea7>] __kprobes_text_start+0x34f/0x438
[<ffffffff88a07587>] :lmv:lmv_object_find+0x77/0xf0
[<ffffffff889f77a7>] :lmv:lmv_enqueue+0x1097/0x1c00
[<ffffffff802d2cb7>] __kmalloc+0x8f/0x9f
[<ffffffff8897562c>] :lustre:ll_i2suppgid+0xc/0x20
[<ffffffff802899c9>] enqueue_task+0x41/0x56
[<ffffffff889344dd>] :lustre:ll_get_dir_page+0x84d/0x19c0
[<ffffffff88968be4>] :lustre:ll_readahead_init+0x14/0x30
[<ffffffff889472ab>] :lustre:ll_local_open+0x25b/0x310
[<ffffffff8020b398>] kmem_cache_alloc+0x62/0x6d
[<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
[<ffffffff8869d320>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
[<ffffffff80226132>] filldir+0x0/0xb7
[<ffffffff8893d80e>] :lustre:ll_readdir+0x22e/0x618
[<ffffffff88935a32>] :lustre:ll_dir_open+0x72/0xf0
[<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14
[<ffffffff80222c1d>] __up_read+0x19/0x7f
2. ll_dir_open
Jul 19 13:29:09 client1 kernel: INFO: task ls:9428 blocked for more than 120 seconds.
Jul 19 13:29:09 client1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 19 13:29:09 client1 kernel: ls D ffff880001004460 0 9428 1 9382 8801 (NOTLB)
Jul 19 13:29:09 client1 kernel: ffff880023b13e88 0000000000000286 0000000100000000 ffff88003910e100
Jul 19 13:29:09 client1 kernel: 0000000000000008 ffff88001ab677e0 ffffffff804fcb00 00000000002334df
Jul 19 13:29:09 client1 kernel: ffff88001ab679c8 00000000ffffff9c
Jul 19 13:29:09 client1 kernel: Call Trace:
Jul 19 13:29:09 client1 kernel: [<ffffffff889359c0>] :lustre:ll_dir_open+0x0/0xf0
Jul 19 13:29:09 client1 kernel: [<ffffffff88935a32>] :lustre:ll_dir_open+0x72/0xf0
Jul 19 13:29:09 client1 kernel: [<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14
Jul 19 13:29:09 client1 kernel: [<ffffffff80222c1d>] __up_read+0x19/0x7f
Jul 19 13:29:09 client1 kernel: [<ffffffff80226132>] filldir+0x0/0xb7
Jul 19 13:29:09 client1 kernel: [<ffffffff80263a5e>] __mutex_lock_slowpath+0x60/0x9b
Jul 19 13:29:09 client1 kernel: [<ffffffff80263aa8>] .text.lock.mutex+0xf/0x14
Jul 19 13:29:09 client1 kernel: [<ffffffff80236a07>] vfs_readdir+0x5c/0xa9
Jul 19 13:29:09 client1 kernel: [<ffffffff8023a239>] sys_getdents+0x75/0xbd
Jul 19 13:29:09 client1 kernel: [<ffffffff80260295>] tracesys+0x47/0xb6
Jul 19 13:29:09 client1 kernel: [<ffffffff802602f9>] tracesys+0xab/0xb6
3. cfs_hash_bd_from_key
Jul 14 12:08:25 client1 kernel: INFO: task less:2759 blocked for more than 120 seconds.
Jul 14 12:08:25 client1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 14 12:08:25 client1 kernel: less D ffff880001004460 0 2759 2363 (NOTLB)
Jul 14 12:08:25 client1 kernel: ffff88001f441698 0000000000000282 0000000000000000 ffff8800388da828
Jul 14 12:08:25 client1 kernel: 0000000000000007 ffff8800280cd820 ffffffff804fcb00 00000000000317d9
Jul 14 12:08:25 client1 kernel: ffff8800280cda08 ffffffff884d64dc
Jul 14 12:08:25 client1 kernel: Call Trace:
Jul 14 12:08:25 client1 kernel: [<ffffffff884d64dc>] :libcfs:cfs_hash_bd_from_key+0x3c/0xc0
Jul 14 12:08:25 client1 kernel: [<ffffffff886c4430>] :ptlrpc:lustre_msg_string+0x0/0x280
Jul 14 12:08:25 client1 kernel: [<ffffffff886c4430>] :ptlrpc:lustre_msg_string+0x0/0x280
Jul 14 12:08:25 client1 kernel: [<ffffffff886ec957>] :ptlrpc:__req_capsule_get+0x2d7/0x5f0
Jul 14 12:08:25 client1 kernel: [<ffffffff8026481a>] __down+0xc3/0xd8
Jul 14 12:08:25 client1 kernel: [<ffffffff8028a044>] default_wake_function+0x0/0xe
Jul 14 12:08:25 client1 kernel: [<ffffffff802644d8>] __down_failed+0x35/0x3a
Jul 14 12:08:25 client1 kernel: [<ffffffff88811dac>] :mdc:.text.lock.mdc_locks+0x5/0x21
Jul 14 12:08:25 client1 kernel: [<ffffffff88683b1b>] :ptlrpc:ldlm_res_hop_fid_hash+0x1cb/0x1f0
Jul 14 12:08:25 client1 kernel: [<ffffffff8881154c>] :mdc:mdc_intent_lock+0x46c/0x610
Jul 14 12:08:25 client1 kernel: [<ffffffff886835e9>] :ptlrpc:ldlm_lock_match+0x729/0x870
Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
Jul 14 12:08:25 client1 kernel: [<ffffffff8869d320>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
Jul 14 12:08:25 client1 kernel: [<ffffffff88a07587>] :lmv:lmv_object_find+0x77/0xf0
Jul 14 12:08:25 client1 kernel: [<ffffffff88a063c6>] :lmv:lmv_intent_open+0x9c6/0x12c0
Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
Jul 14 12:08:25 client1 kernel: [<ffffffff88605871>] :obdclass:cl_lock_counters+0x21/0x70
Jul 14 12:08:25 client1 kernel: [<ffffffff889f16af>] :lmv:lmv_lock_match+0x44f/0x5e0
Jul 14 12:08:25 client1 kernel: [<ffffffff88a06f84>] :lmv:lmv_intent_lock+0x2c4/0x390
Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
Jul 14 12:08:25 client1 kernel: [<ffffffff889533c5>] :lustre:ll_prep_md_op_data+0x385/0x460
Jul 14 12:08:25 client1 kernel: [<ffffffff88683b1b>] :ptlrpc:ldlm_res_hop_fid_hash+0x1cb/0x1f0
Jul 14 12:08:25 client1 kernel: [<ffffffff8892f38a>] :lustre:ll_revalidate_it+0xb6a/0x19c0
Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
Jul 14 12:08:25 client1 kernel: [<ffffffff884d64dc>] :libcfs:cfs_hash_bd_from_key+0x3c/0xc0
Jul 14 12:08:25 client1 kernel: [<ffffffff8868487e>] :ptlrpc:ldlm_resource_putref+0x12e/0x240
Jul 14 12:08:25 client1 kernel: [<ffffffff802d2cb7>] __kmalloc+0x8f/0x9f
Jul 14 12:08:25 client1 kernel: [<ffffffff88973cf1>] :lustre:ll_convert_intent+0x121/0x1f0
Jul 14 12:08:25 client1 kernel: [<ffffffff889312f4>] :lustre:ll_revalidate_nd+0x194/0x39c
Jul 14 12:08:25 client1 kernel: [<ffffffff80209f27>] __d_lookup+0x98/0xff
Jul 14 12:08:25 client1 kernel: [<ffffffff8020d79d>] do_lookup+0x18f/0x1e6
Jul 14 12:08:25 client1 kernel: [<ffffffff8020a9b8>] __link_path_walk+0xa2a/0xfb9
Jul 14 12:10:25 client1 kernel: [<ffffffff8020d48e>] do_path_lookup+0x275/0x2f1
Jul 14 12:10:25 client1 kernel: [<ffffffff8020b398>] kmem_cache_alloc+0x62/0x6d
Jul 14 12:10:25 client1 kernel: [<ffffffff80224048>] __path_lookup_intent_open+0x56/0x97
Jul 14 12:10:25 client1 kernel: [<ffffffff8021b66c>] open_namei+0x73/0x715
Jul 14 12:10:25 client1 kernel: [<ffffffff80227dc6>] do_filp_open+0x1c/0x38
Jul 14 12:10:25 client1 kernel: [<ffffffff8021a525>] do_sys_open+0x44/0xbe
Jul 14 12:10:25 client1 kernel: [<ffffffff802602f9>] tracesys+0xab/0xb6
Jul 14 12:10:25 client1 kernel:
Feedback/suggestions would be appreciated.
Thanks,
josephine