Lustre / LU-607

port bz24419 (ldlm namespace lock contention during oom)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
    • Labels: None
    • Severity: 3
    • Bugzilla ID: 24,419
    • Rank: 9726

    Description

      While running a regression test:

      [2011-01-20 19:14:30][c0-0c0s5n0]Kernel panic - not syncing: oom_kill_process killing invalid app rcad_svcs.
      [2011-01-20 19:14:30][c0-0c0s5n0]Pid: 5529, comm: stressapptest Tainted: P 2.6.32.24-0.2.1_1.0000.5704-cray_gem_c #1
      [2011-01-20 19:14:30][c0-0c0s5n0]Call Trace:
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff810072b9>] try_stack_unwind+0x149/0x190
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff81005d90>] dump_trace+0x90/0x2f0
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff81006eb7>] show_trace_log_lvl+0x57/0x70
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff81006ee0>] show_trace+0x10/0x20
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff8125c69c>] dump_stack+0x72/0x7b
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff8125c71a>] panic+0x75/0x13b
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff810968e6>] __oom_kill_task+0xa6/0x190
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff81096db5>] oom_kill_process+0x245/0x2e0
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff81097296>] __out_of_memory+0x176/0x1e0
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff810977d9>] out_of_memory+0x4d9/0x560
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff8109ab72>] __alloc_pages_nodemask+0x662/0x680
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff810c1ba0>] alloc_page_vma+0x70/0x100
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff810b069f>] handle_mm_fault+0xbff/0xd00
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff8101f857>] do_page_fault+0x147/0x2c0
      [2011-01-20 19:14:30][c0-0c0s5n0] [<ffffffff8125f7af>] page_fault+0x1f/0x30
      [2011-01-20 19:14:30][c0-0c0s5n0] [<000000000041ae8a>] 0x41ae8a

      crash> gdb list *(ldlm_pools_shrink+0x93)
      0xffffffffa026a853 is in ldlm_pools_shrink (/usr/src/packages/BUILD/cray-lustre-1.8.4/lustre/ptlrpc/../../lustre/ldlm/ldlm_pool.c:1086).
      1076         for (nr_ns = atomic_read(ldlm_namespace_nr(client));
      1077              nr_ns > 0; nr_ns--)
      1078         {
      1079                 mutex_down(ldlm_namespace_lock(client));
      1080                 if (list_empty(ldlm_namespace_list(client))) {
      1081                         mutex_up(ldlm_namespace_lock(client));
      1082                         return 0;
      1083                 }
      1084                 ns = ldlm_namespace_first_locked(client);
      1085                 ldlm_namespace_get(ns);
      1086                 ldlm_namespace_move_locked(ns, client);
      1087                 mutex_up(ldlm_namespace_lock(client));
      1088                 total += ldlm_pool_shrink(&ns->ns_pool, 0, gfp_mask);
      1089                 ldlm_namespace_put(ns, 1);

      Fix:

      ldlm_namespace_free() removes the namespace from the list and frees its memory without checking the
      namespace's refcount, while ldlm_pools_shrink() may concurrently take that same namespace from the list
      and call ldlm_pool_shrink() on it.
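
      The ordering problem described above can be illustrated with a minimal, single-threaded sketch in
      plain C. This is not the Lustre patch: every name here (demo_ns, demo_ns_get, demo_ns_put,
      demo_ns_free, demo_replay) is hypothetical, and the locking or atomics that real kernel code needs
      are left out. It only shows the general idea of unlinking first and releasing memory when the last
      reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a namespace; not the Lustre struct. */
struct demo_ns {
    int   refcount;  /* references held by the owner and by users such as a shrinker */
    bool  on_list;   /* still reachable from the namespace list */
    bool *freed;     /* demo-only flag, set when the memory is released */
};

static struct demo_ns *demo_ns_new(bool *freed_flag)
{
    struct demo_ns *ns = malloc(sizeof(*ns));

    if (ns == NULL)
        return NULL;
    ns->refcount = 1;       /* the owner's reference */
    ns->on_list = true;
    ns->freed = freed_flag;
    *freed_flag = false;
    return ns;
}

/* Take a reference; refused once the namespace has been unlinked. */
static bool demo_ns_get(struct demo_ns *ns)
{
    if (!ns->on_list)
        return false;
    ns->refcount++;
    return true;
}

/* Drop a reference; only the last put releases the memory. */
static void demo_ns_put(struct demo_ns *ns)
{
    assert(ns->refcount > 0);
    if (--ns->refcount == 0) {
        *ns->freed = true;
        free(ns);
    }
}

/* Teardown: unlink so no new user can find the namespace, then drop
 * the owner's reference instead of freeing unconditionally. */
static void demo_ns_free(struct demo_ns *ns)
{
    ns->on_list = false;
    demo_ns_put(ns);
}

/* Replays the interleaving from the report in one thread.
 * Returns 0 when the memory outlives the shrinker's use of it. */
static int demo_replay(void)
{
    bool freed = false;
    struct demo_ns *ns = demo_ns_new(&freed);

    if (ns == NULL)
        return -1;
    if (!demo_ns_get(ns)) {     /* shrinker picks the namespace from the list */
        demo_ns_put(ns);
        return -1;
    }
    demo_ns_free(ns);           /* teardown runs while the shrinker holds a reference */
    if (freed)
        return 1;               /* this would be the use-after-free case */
    /* the shrinker would shrink the pool here; ns is still valid */
    demo_ns_put(ns);            /* shrinker is done; the last reference frees the memory */
    return freed ? 0 : 2;
}
```

      Calling demo_replay() walks through get, teardown, and put in that order. In this sketch the
      memory is released only by the final demo_ns_put(), which is the property the unconditional free
      in the report lacks.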

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: bobijam Zhenyu Xu