LU-5503

MDS (2.4.2) is getting "Service thread ... inactive" and the file system times out

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.2
    • Environment: Server: CentOS 6.3/Lustre 2.4.2
      Client: CentOS 6.3/Lustre 2.5.2
    • Severity: 3
    • 15353

    Description

      We are seeing an increasing number of MDS "Service thread ... inactive" warnings like the one below. Whenever this happens, clients get evicted and the file system hangs.

      LNet: Service thread pid 3766 was inactive for 334.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 3766, comm: mdt_rdpg00_003

      Call Trace:
      [<ffffffff81133281>] ? shrink_inactive_list+0x191/0x830
      [<ffffffff81133017>] ? shrink_active_list+0x297/0x370
      [<ffffffff81133cce>] ? shrink_mem_cgroup_zone+0x3ae/0x610
      [<ffffffff8117289d>] ? mem_cgroup_iter+0xfd/0x280
      [<ffffffff81133f93>] ? shrink_zone+0x63/0xb0
      [<ffffffff81134b99>] ? zone_reclaim+0x349/0x400
      [<ffffffff8112a6ac>] ? get_page_from_freelist+0x69c/0x830
      [<ffffffff811b5a07>] ? __find_get_block+0x97/0xe0
      [<ffffffff8112bc43>] ? __alloc_pages_nodemask+0x113/0x8d0
      [<ffffffffa0b2d46e>] ? ldiskfs_ind_get_blocks+0x35e/0x610 [ldiskfs]
      [<ffffffff81170b10>] ? mem_cgroup_get_reclaim_stat_from_page+0x20/0x70
      [<ffffffff8112f18d>] ? update_page_reclaim_stat+0x2d/0x70
      [<ffffffff81166dc2>] ? kmem_getpages+0x62/0x170
      [<ffffffff8116742f>] ? cache_grow+0x2cf/0x320
      [<ffffffff8116774a>] ? ____cache_alloc_node+0x8a/0x160
      [<ffffffff811686db>] ? kmem_cache_alloc+0x11b/0x190
      [<ffffffffa0b526b0>] ? ldiskfs_alloc_inode+0x20/0x150 [ldiskfs]
      [<ffffffff8119d257>] ? alloc_inode+0x27/0xa0
      [<ffffffff8119e168>] ? iget_locked+0x78/0x170
      [<ffffffffa0b2f6e7>] ? ldiskfs_iget+0x37/0x800 [ldiskfs]
      [<ffffffffa0791ad2>] ? fld_server_lookup+0x72/0x3d0 [fld]
      [<ffffffff8119cbbe>] ? generic_detach_inode+0x18e/0x1f0
      [<ffffffffa0b9e4ae>] ? osd_iget+0x2e/0x2c0 [osd_ldiskfs]
      [<ffffffffa0ba9546>] ? osd_ea_fid_get+0x176/0x2c0 [osd_ldiskfs]
      [<ffffffffa0b9ff8a>] ? osd_remote_fid+0x9a/0x280 [osd_ldiskfs]
      [<ffffffffa0bb00f5>] ? osd_it_ea_rec+0xb45/0x1470 [osd_ldiskfs]
      [<ffffffffa0b1f3a5>] ? call_filldir+0xb5/0x150 [ldiskfs]
      [<ffffffffa0bad170>] ? osd_ldiskfs_filldir+0x0/0x480 [osd_ldiskfs]
      [<ffffffffa0b1fd09>] ? ldiskfs_readdir+0x5a9/0x730 [ldiskfs]
      [<ffffffffa0bad170>] ? osd_ldiskfs_filldir+0x0/0x480 [osd_ldiskfs]
      [<ffffffffa0b5f1fd>] ? htree_unlock+0x3d/0x2c6 [ldiskfs]
      [<ffffffffa0d70341>] ? lod_it_rec+0x21/0x90 [lod]
      [<ffffffffa0a241dc>] ? mdd_dir_page_build+0xfc/0x210 [mdd]
      [<ffffffffa0435b22>] ? dt_index_walk+0x162/0x3d0 [obdclass]
      [<ffffffff815101e6>] ? down_read+0x16/0x30
      [<ffffffffa0a240e0>] ? mdd_dir_page_build+0x0/0x210 [mdd]
      [<ffffffffa0a25e2b>] ? mdd_readpage+0x38b/0x5a0 [mdd]
      [<ffffffffa0cb727f>] ? mdt_readpage+0x47f/0x960 [mdt]
      [<ffffffffa0ca7a97>] ? mdt_handle_common+0x647/0x16d0 [mdt]
      [<ffffffffa0ce1685>] ? mds_readpage_handle+0x15/0x20 [mdt]
      [<ffffffffa05cf3a8>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      [<ffffffffa02e95de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [<ffffffffa02fad9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      [<ffffffffa05c6709>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
      [<ffffffffa05d073e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
      [<ffffffffa05cfc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      [<ffffffffa05cfc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      [<ffffffffa05cfc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

      LustreError: dumping log to /tmp/lustre-log.1408400803.3766
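
      The binary dump referenced above can be converted to readable text with lctl's debug_file command; a minimal sketch (the output path is arbitrary):

      # decode the binary Lustre debug dump written when the watchdog fired
      lctl df /tmp/lustre-log.1408400803.3766 /tmp/lustre-log.1408400803.3766.txt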


          Activity


             haisong Haisong Cai (Inactive) added a comment -

             Hi Zhenyu,

             Our client has the debug flags set to:

            lctl > get_param debug
            debug=ioctl neterror warning error emerg ha config console

             If the client messages log attached to this ticket doesn't contain it, how can I get it?
            We have the problem occurring continuously, so it won't be too hard for me to get it.

            thanks,
            Haisong
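
             One way to capture this, as a sketch (the extra debug flags and the output file name below are only suggestions, assuming the standard lctl interface):

             # on an affected client: add lock and RPC tracing to the current debug mask
             lctl set_param debug=+dlmtrace
             lctl set_param debug=+rpctrace
             # enlarge the in-kernel debug buffer so the trace is not overwritten (size in MB)
             lctl set_param debug_mb=512
             # ...reproduce the hang/eviction, then dump the debug buffer to a file
             lctl dk /tmp/client-debug.$(date +%s).log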

            bobijam Zhenyu Xu added a comment -

             Since it has the LU-1128 patch, the server should have adjusted the SLV and bailed out of the loop, and the clients should honor the SLV change and cancel locks so that the MDS can release some of its locks. Do you have any client debug logs I can check to see whether the clients honor the SLV change and cancel locks voluntarily? Thanks.
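
             As a rough check, the SLV and lock counts involved can also be sampled directly with lctl; a sketch (the namespace wildcards would need to match the actual MDT/MDC names):

             # on the MDS: current server lock volume and granted locks per namespace
             lctl get_param ldlm.namespaces.*.pool.slv ldlm.namespaces.*.pool.granted
             # on a client: the SLV last received from the server, and the lock LRU size
             lctl get_param ldlm.namespaces.*mdc*.pool.slv ldlm.namespaces.*mdc*.lru_size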


             haisong Haisong Cai (Inactive) added a comment -

             Hi Zhenyu,

             We are running the binary RPM from Intel.

            [root@dolphin-mds-9-2 ~]# uname -a
            Linux dolphin-mds-9-2.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux

            bobijam Zhenyu Xu added a comment -

             update on what I have found so far (I haven't figured out the root cause yet)

             In lustre-log.1408400803.3766, at time 1408400290.481213, MDT thread 3766 is serving a readpage request and runs short of memory. It enters the memory-reclaim path and goes down into ldlm_pools_shrink(). The server-side pools shrinker only adjusts its SLV (Server Lock Volume) and does not release memory itself, so the LU-1128 patch makes ldlm_pools_shrink() return -1 so that the kernel will not keep calling shrink_slab() in a loop.

             The 2.4.2 code should have this patch, but the log shows that since 1408400290.481213 thread 3766 has done nothing but loop in ldlm_pools_shrink(), which is unexpected.

             Haisong, what is the kernel version you are using?
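
             To see whether the MDS is looping in reclaim under sustained memory pressure, the allocator and LDLM pool state can be sampled on the MDS when the watchdog fires; a sketch (parameter names as commonly exposed; adjust to what this version actually provides):

             # overall memory and reclaimable slab on the MDS
             grep -E 'MemFree|Slab|SReclaimable' /proc/meminfo
             # whether NUMA zone reclaim is enabled (zone_reclaim appears in the stack trace above)
             cat /proc/sys/vm/zone_reclaim_mode
             # per-namespace LDLM pool state, including SLV, limit and granted locks
             lctl get_param ldlm.namespaces.*.pool.state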


             haisong Haisong Cai (Inactive) added a comment -

             Clients (2.5.2) are still reporting errors that LU-3487 claimed should have been fixed:

             Aug 18 11:32:12 tscc-login1 kernel: LustreError: 55183:0:(vvp_io.c:1229:vvp_io_init()) dolphin: refresh file layout [0x20000518e:0x2f:0x0] error -5.

             Aug 18 14:13:17 tscc-login1 kernel: LustreError: 56399:0:(vvp_io.c:1229:vvp_io_init()) dolphin: refresh file layout [0x200005186:0xc:0x0] error -108.
             Aug 18 14:13:17 tscc-login1 kernel: LustreError: 56399:0:(vvp_io.c:1229:vvp_io_init()) Skipped 1 previous similar message

            pjones Peter Jones added a comment -

            Bobijam

            Could you please advise?

            Thanks

            Peter


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: haisong Haisong Cai (Inactive)
              Votes: 0
              Watchers: 5
