
MDS threads hang for over 725s, causing fail over

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Labels: None
    • Fix Version: Lustre 1.8.7
    • Affects Version: Lustre 1.8.6.80, jenkins-g9d9d86f-PRISTINE-2.6.18-238.12.1.el5_lustre.gd70e443
      CentOS 5.5
    • 2
    • 4055

    Description

      At NOAA, there are two filesystems that were installed at the same time, lfs1 and lfs2. Recently lfs2 has been having MDS lockups, which cause a failover to the second MDS. It seems to run OK for a couple of days, and then whichever MDS is currently running will lock up and fail over to the other one. lfs1, however, is not affected, though it runs an identical setup as far as hardware goes.

      We have the stack traces that get logged, but not the Lustre debug logs, as they were on tmpfs. We've changed the debug_file location, so hopefully we'll get the next batch. I'll put a sampling of the interesting call traces here and attach the rest.

      Here is the root cause of the failover. The health check times out and prints NOT HEALTHY, which causes HA to fail over:
      Jul 17 17:23:30 lfs-mds-2-2 kernel: LustreError: 16021:0:(service.c:2124:ptlrpc_service_health_check()) mds: unhealthy - request has been waiting 725s
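
      For context on the mechanism: the message fires when a queued request's age exceeds the service timeout. A minimal, self-contained sketch of that kind of age check (illustrative names only, not the actual ptlrpc code):

      ```c
      #include <assert.h>
      #include <time.h>

      /* Hypothetical sketch of the kind of age check behind the "unhealthy -
       * request has been waiting 725s" message: a service is flagged unhealthy
       * when its oldest queued request has waited longer than the allowed
       * timeout. Struct and function names are illustrative, not the real
       * Lustre ptlrpc API. */
      struct svc_request { time_t queued_at; };

      static int request_is_unhealthy(const struct svc_request *req,
                                      time_t now, time_t timeout)
      {
          return (now - req->queued_at) > timeout;
      }

      int main(void)
      {
          struct svc_request req = { .queued_at = 1000 };
          assert(request_is_unhealthy(&req, 1725, 300));  /* waited 725s > 300s */
          assert(!request_is_unhealthy(&req, 1100, 300)); /* waited 100s, still fine */
          return 0;
      }
      ```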

      This one makes it look like it might be quota related:
      Jul 17 17:14:04 lfs-mds-2-2 kernel: Call Trace:
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff887f9220>] :lnet:LNetPut+0x730/0x840
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff800649fb>] __down+0xc3/0xd8
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
      Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff88a29490>] :lquota:dqacq_handler+0x0/0xc20
      ...

      This one looks a little like LU-1395 or LU-1269:
      Jul 4 17:58:29 lfs-mds-2-2 kernel: Call Trace:
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888ceb51>] ldlm_resource_add_lock+0xb1/0x180 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e2a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8006388b>] schedule_timeout+0x8a/0xad
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8009987d>] process_timeout+0x0/0x5
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e4555>] ldlm_completion_ast+0x4d5/0x880 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c9709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c4b6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e30bb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff88c611a7>] enqueue_ordered_locks+0x387/0x4d0 [mds]
      Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e09a0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
      ...

      Attachments

        1. 09sep.tar.bz2
          8.40 MB
        2. call_traces
          392 kB
        3. kern.log.2013-02-23.gz
          104 kB
        4. kern.log-20120721
          191 kB
        5. ll-1181-decoded.txt.gz
          0.2 kB
        6. log1.bz2
          438 kB
        7. lustre-log.txt.bz2
          4.55 MB
        8. mds1.log
          8.37 MB
        9. mds2.log
          3.45 MB

        Issue Links

          Activity


            ihara Shuichi Ihara (Inactive) added a comment -

            Hi Peter, yes, but we are awaiting inspection. Once the code review is done, we will apply this to the kernel for the servers.
            pjones Peter Jones added a comment - ok Ihara. Patch is at http://review.whamcloud.com/#/c/6147/

            ihara Shuichi Ihara (Inactive) added a comment -

            Hi, would you please review this patch sooner? We are hitting multiple server crashes due to this issue.
            green Oleg Drokin added a comment -

            Yes, this patch looks like it would do the right thing.

            kitwestneat Kit Westneat (Inactive) added a comment -

            The OpenVZ link was just supposed to be a link to the patch; the kernel we are running is a stock CentOS kernel. The backtrace and the core come from the same crash. There is another core, but it was placed on the wrong disk and we need to wait for a downtime to get it (next Tuesday). If it is a different backtrace, I'll update the ticket then.

            As for the fix for the GFP_NOFS bug, does that patch work? I found the patch on git.kernel.org as well:
            http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=335e92e8a515420bd47a6b0f01cb9a206c0ed6e4

            Should I backport that to b1_8/RHEL5?

            Thanks.
            green Oleg Drokin added a comment -

            OK, I got the vmcore, but the vmlinux from OpenVZ does not match it.

            In any case, I looked inside the core, and it looks like it is for this latest crash, where we already know what happened; that is a different bug altogether.

            What I am looking for is a vmcore from a situation where the jbd2 commit thread locks up.

            kitwestneat Kit Westneat (Inactive) added a comment -

            I pushed the patch we are using for the transaction wraparound in case it would be useful:
            http://review.whamcloud.com/6147
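
            For background, wraparound-safe transaction-id comparison is typically done jbd2-style (as in jbd2's tid_gt()): cast the unsigned difference to a signed int so the ordering stays correct when the 32-bit tid counter wraps past zero. A small self-contained sketch of the technique, not the patch itself:

            ```c
            #include <assert.h>

            /* Wraparound-safe tid comparison in the style of jbd2's tid_gt():
             * the signed difference gives the right answer even when the
             * counter has wrapped past zero. Illustrative sketch only. */
            typedef unsigned int tid_t;

            static int tid_gt(tid_t a, tid_t b)
            {
                return (int)(a - b) > 0;
            }

            int main(void)
            {
                assert(tid_gt(5, 3));
                /* Near wraparound: 2 is "after" 0xfffffffe even though it is
                 * numerically smaller. */
                assert(tid_gt(2u, 0xfffffffeu));
                assert(!tid_gt(3, 5));
                return 0;
            }
            ```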
            kitwestneat Kit Westneat (Inactive) added a comment -

            Does this patch look correct? http://git.openvz.org/?p=linux-2.6.32-openvz;a=commitdiff;h=335e92e8a515420bd47a6b0f01cb9a206c0ed6e4

            We are just using the default 1.8.9 kernel: http://downloads.whamcloud.com/public/lustre/latest-maintenance-release/el5/server/RPMS/x86_64/kernel-debuginfo-2.6.18-348.1.1.el5_lustre.x86_64.rpm

            Dennis is uploading the coredump now and will update the ticket when it is finished.
            green Oleg Drokin added a comment -

            As for the vmcore, could you put it somewhere where we can get it? At the very least we would also need the kernel-debuginfo you used; the Lustre modules might be useful too, just in case.
            green Oleg Drokin added a comment -

            Yes, this latest trace is a different issue, similar to LU-3071, but in your trace Lustre is not in the picture at all.
            It sounds like there's a bug in the mbcache code in your ext3 that allocates the mb_cache entry with the wrong flag; it should be GFP_NOFS.

            I checked in rhel5 kernel I have around and the bug is there in fs/mbcache.c:

            struct mb_cache_entry *
            mb_cache_entry_alloc(struct mb_cache *cache)
            {
                    struct mb_cache_entry *ce;
            
                    atomic_inc(&cache->c_entry_count);
                    ce = kmem_cache_alloc(cache->c_entry_cache, GFP_KERNEL);   <===== BUG
            

            In RHEL6 I see they actually pass a gfp flags argument into this function for the allocation to use, which avoids this; you might be able to hunt down the patch that fixes it.

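
            For illustration, the shape of that fix can be sketched with stand-in types: the caller supplies the gfp mask, so filesystem paths can pass GFP_NOFS and keep memory reclaim from re-entering the filesystem (which is what deadlocks the journal here). All constants and the kmem_cache stub below are stand-ins, not the real kernel API:

            ```c
            #include <assert.h>
            #include <stdlib.h>

            /* Stand-in gfp masks; in the kernel GFP_NOFS forbids filesystem
             * re-entry during reclaim, which is the whole point of the fix. */
            #define GFP_KERNEL 0x01
            #define GFP_NOFS   0x02

            struct mb_cache_entry { int alloc_flags; };

            /* Stub standing in for kmem_cache_alloc(cache, gfp_flags); it just
             * records which mask the caller passed. */
            static struct mb_cache_entry *kmem_cache_alloc_stub(int gfp_flags)
            {
                struct mb_cache_entry *ce = malloc(sizeof(*ce));
                if (ce)
                    ce->alloc_flags = gfp_flags;
                return ce;
            }

            /* Fixed shape: gfp_flags is a parameter instead of a hard-coded
             * GFP_KERNEL inside the function. */
            static struct mb_cache_entry *mb_cache_entry_alloc(int gfp_flags)
            {
                return kmem_cache_alloc_stub(gfp_flags);
            }

            int main(void)
            {
                /* A filesystem caller can now request a reclaim-safe allocation. */
                struct mb_cache_entry *ce = mb_cache_entry_alloc(GFP_NOFS);
                assert(ce && ce->alloc_flags == GFP_NOFS);
                free(ce);
                return 0;
            }
            ```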

            kitwestneat Kit Westneat (Inactive) added a comment -

            Backtrace:
            PID: 23917 TASK: ffff810abe18c7e0 CPU: 2 COMMAND: "vgck"
            #0 [ffff810b35bd75a0] crash_kexec at ffffffff800b1192
            #1 [ffff810b35bd7660] __die at ffffffff80065137
            #2 [ffff810b35bd76a0] die at ffffffff8006c735
            #3 [ffff810b35bd76d0] do_invalid_op at ffffffff8006ccf5
            #4 [ffff810b35bd7790] error_exit at ffffffff8005ddf9
            [exception RIP: jbd2_journal_start+58]
            RIP: ffffffff8b32ae31 RSP: ffff810b35bd7848 RFLAGS: 00010206
            RAX: ffff810c1aef19c0 RBX: ffff81061476dd38 RCX: ffff810910fdcc00
            RDX: ffffffffffffffe2 RSI: 0000000000000012 RDI: ffff81061476dd38
            RBP: ffff810919769800 R8: ffff81000005b600 R9: ffff8104cd371a80
            R10: 0000000000000000 R11: ffffffff8b36cb60 R12: 0000000000000012
            R13: ffff810b35bd78c8 R14: 0000000000000080 R15: 0000000000000100
            ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
            #5 [ffff810b35bd7860] ldiskfs_dquot_drop at ffffffff8b36fdbe
            #6 [ffff810b35bd7880] clear_inode at ffffffff800234e7
            #7 [ffff810b35bd7890] dispose_list at ffffffff800352d6
            #8 [ffff810b35bd78c0] shrink_icache_memory at ffffffff8002dcfb
            #9 [ffff810b35bd7900] shrink_slab at ffffffff8003f7cd
            #10 [ffff810b35bd7940] zone_reclaim at ffffffff800cf9ae
            #11 [ffff810b35bd79f0] get_page_from_freelist at ffffffff8000a8a7
            #12 [ffff810b35bd7a60] __alloc_pages at ffffffff8000f48a
            #13 [ffff810b35bd7ad0] cache_grow at ffffffff80017a52
            #14 [ffff810b35bd7b20] cache_alloc_refill at ffffffff8005c3ee
            #15 [ffff810b35bd7b60] kmem_cache_alloc at ffffffff8000ac96
            #16 [ffff810b35bd7b80] mb_cache_entry_alloc at ffffffff801063df
            #17 [ffff810b35bd7ba0] ext3_xattr_cache_insert at ffffffff8805bb1e
            #18 [ffff810b35bd7bd0] ext3_xattr_get at ffffffff8805cd3a
            #19 [ffff810b35bd7c50] ext3_get_acl at ffffffff8805d4e2
            #20 [ffff810b35bd7ca0] ext3_init_acl at ffffffff8805d8de
            #21 [ffff810b35bd7ce0] ext3_new_inode at ffffffff8804e21f
            #22 [ffff810b35bd7db0] ext3_create at ffffffff88054b6b
            #23 [ffff810b35bd7e00] vfs_create at ffffffff8003aa1b
            #24 [ffff810b35bd7e30] open_namei at ffffffff8001b749
            #25 [ffff810b35bd7ea0] do_filp_open at ffffffff80027ba1
            #26 [ffff810b35bd7f50] do_sys_open at ffffffff80019fa5
            #27 [ffff810b35bd7f80] tracesys at ffffffff8005d29e (via system_call)
            RIP: 0000003d598c5b40 RSP: 00007fff8f6ae3b8 RFLAGS: 00000246
            RAX: ffffffffffffffda RBX: ffffffff8005d29e RCX: ffffffffffffffff
            RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007fff8f6ae490
            RBP: 00007fff8f6ae4e0 R8: 0000000000000004 R9: 0000000000000000
            R10: 0000000000000241 R11: 0000000000000246 R12: 000000000e3a5e90
            R13: 0000000000484f87 R14: 0000000000000004 R15: 000000000e3a5e90
            ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b

            Dennis pointed out that it looks like LU-3071. Reading that bug, it looks like it should affect MDTs and 1.8 as well, is that correct? The assertion hit is at the same place:
            J_ASSERT(handle->h_transaction->t_journal == journal);

            People

              Assignee: green Oleg Drokin
              Reporter: kitwestneat Kit Westneat (Inactive)
              Votes: 0
              Watchers: 7
