[LU-510] Errors during shifts of Lustre Kerberos flavors Created: 19/Jul/11  Updated: 11/Oct/17  Resolved: 11/Oct/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Josephine Palencia Assignee: WC Triage
Resolution: Low Priority Votes: 0
Labels: None
Environment:

Servers and physical clients
----------------------------------------:
Kernel: 2.6.18-238
Lustre: 2.0.62

VM clients:
---------------
kernel-2.6.18-238.12.1.el5xen
Lustre: 2.0.62/63


Severity: 3
Rank (Obsolete): 10123

 Description   

We've been shifting between all 4 Kerberos flavors purely for benchmarking purposes.
In the end, we'll settle on a single Kerberos flavor.
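For context, flavor switching on Lustre is normally done by changing the `srpc.flavor` tunable on the MGS; the sketch below cycles through the four Kerberos flavors the way a benchmarking run might. The filesystem name `testfs` is hypothetical, and the script only prints the `lctl` commands (a dry run) rather than executing them against a live MGS:

```shell
#!/bin/sh
# Dry-run sketch of cycling through the four Lustre Kerberos flavors.
# "testfs" is a hypothetical filesystem name; on a real cluster these
# lctl commands would be run on the MGS node, followed by a benchmark
# pass (e.g. iozone) on the clients for each flavor.
FSNAME=testfs
for flavor in krb5n krb5a krb5i krb5p; do
    # Print the command instead of running it, so this is safe anywhere.
    echo "lctl conf_param ${FSNAME}.srpc.flavor.default=${flavor}"
    # ...benchmark clients here, then move on to the next flavor...
done
```

After each such change, clients and servers must renegotiate their security contexts, which is where the sync delay described below is observed.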

During instantaneous shifts from one flavor to another, we observe that a sync needs to happen among all systems in the realm, clients especially, as the filesystem is inaccessible (hangs) until the sync completes.

Syncs can take anywhere from 5 minutes up to 15-20 minutes.

Below are all the errors that show up on the clients during the syncing process.

1. Lustre_msg_buf

INFO: task bash:9333 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash D 0000000000000200 0 9333 1 9382 (NOTLB)
ffff880018f559c8 0000000000000286 0000000000000000 ffffffff886c2ac0
000000000000000a ffff88002152d0c0 ffff88003071f7e0 00000000000393ca
ffff88002152d2a8 ffff880022118200
Call Trace:
[<ffffffff886c2ac0>] :ptlrpc:lustre_msg_buf+0x0/0x80
[<ffffffff886c3760>] :ptlrpc:lustre_swab_ldlm_request+0x0/0x20
[<ffffffff8026481a>] __down+0xc3/0xd8
[<ffffffff8028a044>] default_wake_function+0x0/0xe
[<ffffffff802644d8>] __down_failed+0x35/0x3a
[<ffffffff88811dac>] :mdc:.text.lock.mdc_locks+0x5/0x21
[<ffffffff80264ea7>] __kprobes_text_start+0x34f/0x438
[<ffffffff88a07587>] :lmv:lmv_object_find+0x77/0xf0
[<ffffffff889f77a7>] :lmv:lmv_enqueue+0x1097/0x1c00
[<ffffffff802d2cb7>] __kmalloc+0x8f/0x9f
[<ffffffff8897562c>] :lustre:ll_i2suppgid+0xc/0x20
[<ffffffff802899c9>] enqueue_task+0x41/0x56
[<ffffffff889344dd>] :lustre:ll_get_dir_page+0x84d/0x19c0
[<ffffffff88968be4>] :lustre:ll_readahead_init+0x14/0x30
[<ffffffff889472ab>] :lustre:ll_local_open+0x25b/0x310
[<ffffffff8020b398>] kmem_cache_alloc+0x62/0x6d
[<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
[<ffffffff8869d320>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
[<ffffffff80226132>] filldir+0x0/0xb7
[<ffffffff8893d80e>] :lustre:ll_readdir+0x22e/0x618
[<ffffffff88935a32>] :lustre:ll_dir_open+0x72/0xf0
[<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14
[<ffffffff80222c1d>] __up_read+0x19/0x7f

2. ll_dir_open

Jul 19 13:29:09 client1 kernel: INFO: task ls:9428 blocked for more than 120 seconds.
Jul 19 13:29:09 client1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 19 13:29:09 client1 kernel: ls D ffff880001004460 0 9428 1 9382 8801 (NOTLB)
Jul 19 13:29:09 client1 kernel: ffff880023b13e88 0000000000000286 0000000100000000 ffff88003910e100
Jul 19 13:29:09 client1 kernel: 0000000000000008 ffff88001ab677e0 ffffffff804fcb00 00000000002334df
Jul 19 13:29:09 client1 kernel: ffff88001ab679c8 00000000ffffff9c
Jul 19 13:29:09 client1 kernel: Call Trace:
Jul 19 13:29:09 client1 kernel: [<ffffffff889359c0>] :lustre:ll_dir_open+0x0/0xf0
Jul 19 13:29:09 client1 kernel: [<ffffffff88935a32>] :lustre:ll_dir_open+0x72/0xf0
Jul 19 13:29:09 client1 kernel: [<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14
Jul 19 13:29:09 client1 kernel: [<ffffffff80222c1d>] __up_read+0x19/0x7f
Jul 19 13:29:09 client1 kernel: [<ffffffff80226132>] filldir+0x0/0xb7
Jul 19 13:29:09 client1 kernel: [<ffffffff80263a5e>] __mutex_lock_slowpath+0x60/0x9b
Jul 19 13:29:09 client1 kernel: [<ffffffff80263aa8>] .text.lock.mutex+0xf/0x14
Jul 19 13:29:09 client1 kernel: [<ffffffff80236a07>] vfs_readdir+0x5c/0xa9
Jul 19 13:29:09 client1 kernel: [<ffffffff8023a239>] sys_getdents+0x75/0xbd
Jul 19 13:29:09 client1 kernel: [<ffffffff80260295>] tracesys+0x47/0xb6
Jul 19 13:29:09 client1 kernel: [<ffffffff802602f9>] tracesys+0xab/0xb6

3. cfs_hash_bd_from_key

Jul 14 12:08:25 client1 kernel: INFO: task less:2759 blocked for more than 120 seconds.
Jul 14 12:08:25 client1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 14 12:08:25 client1 kernel: less D ffff880001004460 0 2759 2363 (NOTLB)
Jul 14 12:08:25 client1 kernel: ffff88001f441698 0000000000000282 0000000000000000 ffff8800388da828
Jul 14 12:08:25 client1 kernel: 0000000000000007 ffff8800280cd820 ffffffff804fcb00 00000000000317d9
Jul 14 12:08:25 client1 kernel: ffff8800280cda08 ffffffff884d64dc
Jul 14 12:08:25 client1 kernel: Call Trace:
Jul 14 12:08:25 client1 kernel: [<ffffffff884d64dc>] :libcfs:cfs_hash_bd_from_key+0x3c/0xc0
Jul 14 12:08:25 client1 kernel: [<ffffffff886c4430>] :ptlrpc:lustre_msg_string+0x0/0x280
Jul 14 12:08:25 client1 kernel: [<ffffffff886c4430>] :ptlrpc:lustre_msg_string+0x0/0x280
Jul 14 12:08:25 client1 kernel: [<ffffffff886ec957>] :ptlrpc:__req_capsule_get+0x2d7/0x5f0
Jul 14 12:08:25 client1 kernel: [<ffffffff8026481a>] __down+0xc3/0xd8
Jul 14 12:08:25 client1 kernel: [<ffffffff8028a044>] default_wake_function+0x0/0xe
Jul 14 12:08:25 client1 kernel: [<ffffffff802644d8>] __down_failed+0x35/0x3a
Jul 14 12:08:25 client1 kernel: [<ffffffff88811dac>] :mdc:.text.lock.mdc_locks+0x5/0x21
Jul 14 12:08:25 client1 kernel: [<ffffffff88683b1b>] :ptlrpc:ldlm_res_hop_fid_hash+0x1cb/0x1f0
Jul 14 12:08:25 client1 kernel: [<ffffffff8881154c>] :mdc:mdc_intent_lock+0x46c/0x610
Jul 14 12:08:25 client1 kernel: [<ffffffff886835e9>] :ptlrpc:ldlm_lock_match+0x729/0x870
Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
Jul 14 12:08:25 client1 kernel: [<ffffffff8869d320>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
Jul 14 12:08:25 client1 kernel: [<ffffffff88a07587>] :lmv:lmv_object_find+0x77/0xf0
Jul 14 12:08:25 client1 kernel: [<ffffffff88a063c6>] :lmv:lmv_intent_open+0x9c6/0x12c0
Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
Jul 14 12:08:25 client1 kernel: [<ffffffff88605871>] :obdclass:cl_lock_counters+0x21/0x70
Jul 14 12:08:25 client1 kernel: [<ffffffff889f16af>] :lmv:lmv_lock_match+0x44f/0x5e0
Jul 14 12:08:25 client1 kernel: [<ffffffff88a06f84>] :lmv:lmv_intent_lock+0x2c4/0x390
Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
Jul 14 12:08:25 client1 kernel: [<ffffffff889533c5>] :lustre:ll_prep_md_op_data+0x385/0x460
Jul 14 12:08:25 client1 kernel: [<ffffffff88683b1b>] :ptlrpc:ldlm_res_hop_fid_hash+0x1cb/0x1f0
Jul 14 12:08:25 client1 kernel: [<ffffffff8892f38a>] :lustre:ll_revalidate_it+0xb6a/0x19c0
Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
Jul 14 12:08:25 client1 kernel: [<ffffffff884d64dc>] :libcfs:cfs_hash_bd_from_key+0x3c/0xc0
Jul 14 12:08:25 client1 kernel: [<ffffffff8868487e>] :ptlrpc:ldlm_resource_putref+0x12e/0x240
Jul 14 12:08:25 client1 kernel: [<ffffffff802d2cb7>] __kmalloc+0x8f/0x9f
Jul 14 12:08:25 client1 kernel: [<ffffffff88973cf1>] :lustre:ll_convert_intent+0x121/0x1f0
Jul 14 12:08:25 client1 kernel: [<ffffffff889312f4>] :lustre:ll_revalidate_nd+0x194/0x39c
Jul 14 12:08:25 client1 kernel: [<ffffffff80209f27>] __d_lookup+0x98/0xff
Jul 14 12:08:25 client1 kernel: [<ffffffff8020d79d>] do_lookup+0x18f/0x1e6
Jul 14 12:08:25 client1 kernel: [<ffffffff8020a9b8>] __link_path_walk+0xa2a/0xfb9
Jul 14 12:10:25 client1 kernel: [<ffffffff8020d48e>] do_path_lookup+0x275/0x2f1
Jul 14 12:10:25 client1 kernel: [<ffffffff8020b398>] kmem_cache_alloc+0x62/0x6d
Jul 14 12:10:25 client1 kernel: [<ffffffff80224048>] __path_lookup_intent_open+0x56/0x97
Jul 14 12:10:25 client1 kernel: [<ffffffff8021b66c>] open_namei+0x73/0x715
Jul 14 12:10:25 client1 kernel: [<ffffffff80227dc6>] do_filp_open+0x1c/0x38
Jul 14 12:10:25 client1 kernel: [<ffffffff8021a525>] do_sys_open+0x44/0xbe
Jul 14 12:10:25 client1 kernel: [<ffffffff802602f9>] tracesys+0xab/0xb6
Jul 14 12:10:25 client1 kernel:

Feedback/suggestions would be appreciated.

Thanks,
josephine



 Comments   
Comment by Josephine Palencia [ 19/Jul/11 ]

We changed the Kerberos flavor around 11:30am this morning. Right after the krb flavor switch, we were still able to run some iozone on PSC client1. But the access-to-extenci-denied problem came at 1:30pm on PSC client1 and on vmuf at 3:00pm. The resync problem seems to be more complicated than expected.

Comment by Josephine Palencia [ 19/Jul/11 ]

VMUF is at UF (Univ. of Florida).

Comment by Andreas Dilger [ 27/Jan/12 ]

I didn't mean to mark this fixed.

Comment by Andreas Dilger [ 27/Jan/12 ]

Josephine, thank you for submitting this bug.

Unfortunately, we do not currently have the resources or funding to test and fix Kerberos.

Fortunately, this issue should not impact your system during normal operation, and I assume you have finished your Kerberos flavour benchmarking in the meantime.

We could keep this issue open for tracking purposes, so that if funding for Kerberos maintenance becomes available in the future, someone can look into this problem. It is still useful to have you using Kerberos and reporting issues here, so that other users interested in trying it out can see the state of this feature.

Generated at Sat Feb 10 01:07:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.