Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-510

Errors during shifts of Lustre Kerberos flavors

    XMLWordPrintable

Details

    • Bug
    • Resolution: Low Priority
    • Minor
    • None
    • Lustre 2.1.0
    • None
    • Servers and physical clients
      ----------------------------------------:
      Kernel: 2.6.18-238
      Lustre: 2.0.62

      VM clients:
      ---------------
      kernel-2.6.18-238.12.1.el5xen
      Lustre: 2.0.62/63
    • 3
    • 10123

    Description

      We've been shifting between all 4 different kerberos flavors just for purposes for benchmarking..
      In the end, we'll just stick with 1 kerberos flavor.

      During instantaneous shifts from 1 flavor to another, we observe the there needs to be a syncing that needs to happen among all systems in the realm- clients especially as the fs is inaccessbile (hangs) until the sync is completed.

      Syncs can take as long at 5minutes up to 15-20minutes.

      Below are all the errors that show up on the clients during the syncing process.

      1. Lustre_msg_buf

      INFO: task bash:9333 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      bash D 0000000000000200 0 9333 1 9382 (NOTLB)
      ffff880018f559c8 0000000000000286 0000000000000000 ffffffff886c2ac0
      000000000000000a ffff88002152d0c0 ffff88003071f7e0 00000000000393ca
      ffff88002152d2a8 ffff880022118200
      Call Trace:
      [<ffffffff886c2ac0>] :ptlrpc:lustre_msg_buf+0x0/0x80
      [<ffffffff886c3760>] :ptlrpc:lustre_swab_ldlm_request+0x0/0x20
      [<ffffffff8026481a>] __down+0xc3/0xd8
      [<ffffffff8028a044>] default_wake_function+0x0/0xe
      [<ffffffff802644d8>] __down_failed+0x35/0x3a
      [<ffffffff88811dac>] :mdc:.text.lock.mdc_locks+0x5/0x21
      [<ffffffff80264ea7>] __kprobes_text_start+0x34f/0x438
      [<ffffffff88a07587>] :lmv:lmv_object_find+0x77/0xf0
      [<ffffffff889f77a7>] :lmv:lmv_enqueue+0x1097/0x1c00
      [<ffffffff802d2cb7>] __kmalloc+0x8f/0x9f
      [<ffffffff8897562c>] :lustre:ll_i2suppgid+0xc/0x20
      [<ffffffff802899c9>] enqueue_task+0x41/0x56
      [<ffffffff889344dd>] :lustre:ll_get_dir_page+0x84d/0x19c0
      [<ffffffff88968be4>] :lustre:ll_readahead_init+0x14/0x30
      [<ffffffff889472ab>] :lustre:ll_local_open+0x25b/0x310
      [<ffffffff8020b398>] kmem_cache_alloc+0x62/0x6d
      [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
      [<ffffffff8869d320>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
      [<ffffffff80226132>] filldir+0x0/0xb7
      [<ffffffff8893d80e>] :lustre:ll_readdir+0x22e/0x618
      [<ffffffff88935a32>] :lustre:ll_dir_open+0x72/0xf0
      [<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14
      [<ffffffff80222c1d>] __up_read+0x19/0x7f

      2. ll_dir_open

      Jul 19 13:29:09 client1 kernel: INFO: task ls:9428 blocked for more than 120 seconds.
      Jul 19 13:29:09 client1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Jul 19 13:29:09 client1 kernel: ls D ffff880001004460 0 9428 1 9382 8801 (NOTLB)
      Jul 19 13:29:09 client1 kernel: ffff880023b13e88 0000000000000286 0000000100000000 ffff88003910e100
      Jul 19 13:29:09 client1 kernel: 0000000000000008 ffff88001ab677e0 ffffffff804fcb00 00000000002334df
      Jul 19 13:29:09 client1 kernel: ffff88001ab679c8 00000000ffffff9c
      Jul 19 13:29:09 client1 kernel: Call Trace:
      Jul 19 13:29:09 client1 kernel: [<ffffffff889359c0>] :lustre:ll_dir_open+0x0/0xf0
      Jul 19 13:29:09 client1 kernel: [<ffffffff88935a32>] :lustre:ll_dir_open+0x72/0xf0
      Jul 19 13:29:09 client1 kernel: [<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14
      Jul 19 13:29:09 client1 kernel: [<ffffffff80222c1d>] __up_read+0x19/0x7f
      Jul 19 13:29:09 client1 kernel: [<ffffffff80226132>] filldir+0x0/0xb7
      Jul 19 13:29:09 client1 kernel: [<ffffffff80263a5e>] __mutex_lock_slowpath+0x60/0x9b
      Jul 19 13:29:09 client1 kernel: [<ffffffff80263aa8>] .text.lock.mutex+0xf/0x14
      Jul 19 13:29:09 client1 kernel: [<ffffffff80236a07>] vfs_readdir+0x5c/0xa9
      Jul 19 13:29:09 client1 kernel: [<ffffffff8023a239>] sys_getdents+0x75/0xbd
      Jul 19 13:29:09 client1 kernel: [<ffffffff80260295>] tracesys+0x47/0xb6
      Jul 19 13:29:09 client1 kernel: [<ffffffff802602f9>] tracesys+0xab/0xb6

      3. cfs_hash_bd_from_key

      Jul 14 12:08:25 client1 kernel: INFO: task less:2759 blocked for more than 120 seconds.
      Jul 14 12:08:25 client1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Jul 14 12:08:25 client1 kernel: less D ffff880001004460 0 2759 2363 (NOTLB)
      Jul 14 12:08:25 client1 kernel: ffff88001f441698 0000000000000282 0000000000000000 ffff8800388da828
      Jul 14 12:08:25 client1 kernel: 0000000000000007 ffff8800280cd820 ffffffff804fcb00 00000000000317d9
      Jul 14 12:08:25 client1 kernel: ffff8800280cda08 ffffffff884d64dc
      Jul 14 12:08:25 client1 kernel: Call Trace:
      Jul 14 12:08:25 client1 kernel: [<ffffffff884d64dc>] :libcfs:cfs_hash_bd_from_key+0x3c/0xc0
      Jul 14 12:08:25 client1 kernel: [<ffffffff886c4430>] :ptlrpc:lustre_msg_string+0x0/0x280
      Jul 14 12:08:25 client1 kernel: [<ffffffff886c4430>] :ptlrpc:lustre_msg_string+0x0/0x280
      Jul 14 12:08:25 client1 kernel: [<ffffffff886ec957>] :ptlrpc:__req_capsule_get+0x2d7/0x5f0
      Jul 14 12:08:25 client1 kernel: [<ffffffff8026481a>] __down+0xc3/0xd8
      Jul 14 12:08:25 client1 kernel: [<ffffffff8028a044>] default_wake_function+0x0/0xe
      Jul 14 12:08:25 client1 kernel: [<ffffffff802644d8>] __down_failed+0x35/0x3a
      Jul 14 12:08:25 client1 kernel: [<ffffffff88811dac>] :mdc:.text.lock.mdc_locks+0x5/0x21
      Jul 14 12:08:25 client1 kernel: [<ffffffff88683b1b>] :ptlrpc:ldlm_res_hop_fid_hash+0x1cb/0x1f0
      Jul 14 12:08:25 client1 kernel: [<ffffffff8881154c>] :mdc:mdc_intent_lock+0x46c/0x610
      Jul 14 12:08:25 client1 kernel: [<ffffffff886835e9>] :ptlrpc:ldlm_lock_match+0x729/0x870
      Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
      Jul 14 12:08:25 client1 kernel: [<ffffffff8869d320>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
      Jul 14 12:08:25 client1 kernel: [<ffffffff88a07587>] :lmv:lmv_object_find+0x77/0xf0
      Jul 14 12:08:25 client1 kernel: [<ffffffff88a063c6>] :lmv:lmv_intent_open+0x9c6/0x12c0
      Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
      Jul 14 12:08:25 client1 kernel: [<ffffffff88605871>] :obdclass:cl_lock_counters+0x21/0x70
      Jul 14 12:08:25 client1 kernel: [<ffffffff889f16af>] :lmv:lmv_lock_match+0x44f/0x5e0
      Jul 14 12:08:25 client1 kernel: [<ffffffff88a06f84>] :lmv:lmv_intent_lock+0x2c4/0x390
      Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
      Jul 14 12:08:25 client1 kernel: [<ffffffff889533c5>] :lustre:ll_prep_md_op_data+0x385/0x460
      Jul 14 12:08:25 client1 kernel: [<ffffffff88683b1b>] :ptlrpc:ldlm_res_hop_fid_hash+0x1cb/0x1f0
      Jul 14 12:08:25 client1 kernel: [<ffffffff8892f38a>] :lustre:ll_revalidate_it+0xb6a/0x19c0
      Jul 14 12:08:25 client1 kernel: [<ffffffff88975a00>] :lustre:ll_md_blocking_ast+0x0/0x580
      Jul 14 12:08:25 client1 kernel: [<ffffffff884d64dc>] :libcfs:cfs_hash_bd_from_key+0x3c/0xc0
      Jul 14 12:08:25 client1 kernel: [<ffffffff8868487e>] :ptlrpc:ldlm_resource_putref+0x12e/0x240
      Jul 14 12:08:25 client1 kernel: [<ffffffff802d2cb7>] __kmalloc+0x8f/0x9f
      Jul 14 12:08:25 client1 kernel: [<ffffffff88973cf1>] :lustre:ll_convert_intent+0x121/0x1f0
      Jul 14 12:08:25 client1 kernel: [<ffffffff889312f4>] :lustre:ll_revalidate_nd+0x194/0x39c
      Jul 14 12:08:25 client1 kernel: [<ffffffff80209f27>] __d_lookup+0x98/0xff
      Jul 14 12:08:25 client1 kernel: [<ffffffff8020d79d>] do_lookup+0x18f/0x1e6
      Jul 14 12:08:25 client1 kernel: [<ffffffff8020a9b8>] __link_path_walk+0xa2a/0xfb9
      Jul 14 12:10:25 client1 kernel: [<ffffffff8020d48e>] do_path_lookup+0x275/0x2f1
      Jul 14 12:10:25 client1 kernel: [<ffffffff8020b398>] kmem_cache_alloc+0x62/0x6d
      Jul 14 12:10:25 client1 kernel: [<ffffffff80224048>] __path_lookup_intent_open+0x56/0x97
      Jul 14 12:10:25 client1 kernel: [<ffffffff8021b66c>] open_namei+0x73/0x715
      Jul 14 12:10:25 client1 kernel: [<ffffffff80227dc6>] do_filp_open+0x1c/0x38
      Jul 14 12:10:25 client1 kernel: [<ffffffff8021a525>] do_sys_open+0x44/0xbe
      Jul 14 12:10:25 client1 kernel: [<ffffffff802602f9>] tracesys+0xab/0xb6
      Jul 14 12:10:25 client1 kernel:

      Feedback/suggestions would be appreciated.

      Thanks,
      josephine

      Attachments

        Activity

          People

            wc-triage WC Triage
            josephin Josephine Palencia
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: