LU-5503

MDS (2.4.2) is getting "Service thread ... inactive" and the file system times out

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.2
    • Environment: Server: CentOS 6.3/Lustre 2.4.2
      Client: CentOS 6.3/Lustre 2.5.2
    • Severity: 3
    • 15353

    Description

      We are seeing an increasing number of MDS "Service thread ... inactive" warnings like the one below. Whenever this happens, clients get evicted and the file system hangs.

      LNet: Service thread pid 3766 was inactive for 334.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 3766, comm: mdt_rdpg00_003

      Call Trace:
      [<ffffffff81133281>] ? shrink_inactive_list+0x191/0x830
      [<ffffffff81133017>] ? shrink_active_list+0x297/0x370
      [<ffffffff81133cce>] ? shrink_mem_cgroup_zone+0x3ae/0x610
      [<ffffffff8117289d>] ? mem_cgroup_iter+0xfd/0x280
      [<ffffffff81133f93>] ? shrink_zone+0x63/0xb0
      [<ffffffff81134b99>] ? zone_reclaim+0x349/0x400
      [<ffffffff8112a6ac>] ? get_page_from_freelist+0x69c/0x830
      [<ffffffff811b5a07>] ? __find_get_block+0x97/0xe0
      [<ffffffff8112bc43>] ? __alloc_pages_nodemask+0x113/0x8d0
      [<ffffffffa0b2d46e>] ? ldiskfs_ind_get_blocks+0x35e/0x610 [ldiskfs]
      [<ffffffff81170b10>] ? mem_cgroup_get_reclaim_stat_from_page+0x20/0x70
      [<ffffffff8112f18d>] ? update_page_reclaim_stat+0x2d/0x70
      [<ffffffff81166dc2>] ? kmem_getpages+0x62/0x170
      [<ffffffff8116742f>] ? cache_grow+0x2cf/0x320
      [<ffffffff8116774a>] ? ____cache_alloc_node+0x8a/0x160
      [<ffffffff811686db>] ? kmem_cache_alloc+0x11b/0x190
      [<ffffffffa0b526b0>] ? ldiskfs_alloc_inode+0x20/0x150 [ldiskfs]
      [<ffffffff8119d257>] ? alloc_inode+0x27/0xa0
      [<ffffffff8119e168>] ? iget_locked+0x78/0x170
      [<ffffffffa0b2f6e7>] ? ldiskfs_iget+0x37/0x800 [ldiskfs]
      [<ffffffffa0791ad2>] ? fld_server_lookup+0x72/0x3d0 [fld]
      [<ffffffff8119cbbe>] ? generic_detach_inode+0x18e/0x1f0
      [<ffffffffa0b9e4ae>] ? osd_iget+0x2e/0x2c0 [osd_ldiskfs]
      [<ffffffffa0ba9546>] ? osd_ea_fid_get+0x176/0x2c0 [osd_ldiskfs]
      [<ffffffffa0b9ff8a>] ? osd_remote_fid+0x9a/0x280 [osd_ldiskfs]
      [<ffffffffa0bb00f5>] ? osd_it_ea_rec+0xb45/0x1470 [osd_ldiskfs]
      [<ffffffffa0b1f3a5>] ? call_filldir+0xb5/0x150 [ldiskfs]
      [<ffffffffa0bad170>] ? osd_ldiskfs_filldir+0x0/0x480 [osd_ldiskfs]
      [<ffffffffa0b1fd09>] ? ldiskfs_readdir+0x5a9/0x730 [ldiskfs]
      [<ffffffffa0bad170>] ? osd_ldiskfs_filldir+0x0/0x480 [osd_ldiskfs]
      [<ffffffffa0b5f1fd>] ? htree_unlock+0x3d/0x2c6 [ldiskfs]
      [<ffffffffa0d70341>] ? lod_it_rec+0x21/0x90 [lod]
      [<ffffffffa0a241dc>] ? mdd_dir_page_build+0xfc/0x210 [mdd]
      [<ffffffffa0435b22>] ? dt_index_walk+0x162/0x3d0 [obdclass]
      [<ffffffff815101e6>] ? down_read+0x16/0x30
      [<ffffffffa0a240e0>] ? mdd_dir_page_build+0x0/0x210 [mdd]
      [<ffffffffa0a25e2b>] ? mdd_readpage+0x38b/0x5a0 [mdd]
      [<ffffffffa0cb727f>] ? mdt_readpage+0x47f/0x960 [mdt]
      [<ffffffffa0ca7a97>] ? mdt_handle_common+0x647/0x16d0 [mdt]
      [<ffffffffa0ce1685>] ? mds_readpage_handle+0x15/0x20 [mdt]
      [<ffffffffa05cf3a8>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      [<ffffffffa02e95de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [<ffffffffa02fad9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      [<ffffffffa05c6709>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
      [<ffffffffa05d073e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
      [<ffffffffa05cfc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      [<ffffffffa05cfc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      [<ffffffffa05cfc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

      LustreError: dumping log to /tmp/lustre-log.1408400803.3766
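
      The binary dump referenced above can be converted to readable text with lctl's debug_file command; a minimal sketch (the output path is arbitrary):

      # decode the binary Lustre debug dump written when the watchdog fired
      lctl df /tmp/lustre-log.1408400803.3766 /tmp/lustre-log.1408400803.3766.txt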


          Activity


             haisong Haisong Cai (Inactive) added a comment -

             Hi Zhenyu,

             Our client has the debug flags set to:

            lctl > get_param debug
            debug=ioctl neterror warning error emerg ha config console

             If the client messages log attached to this ticket doesn't contain it, how can I get it?
            We have the problem occurring continuously, so it won't be too hard for me to get it.

            thanks,
            Haisong
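
             One way to capture this, as a sketch (the extra debug flags and the output file name below are only suggestions, assuming the standard lctl interface):

             # on an affected client: add lock and RPC tracing to the current debug mask
             lctl set_param debug=+dlmtrace
             lctl set_param debug=+rpctrace
             # enlarge the in-kernel debug buffer so the trace is not overwritten (size in MB)
             lctl set_param debug_mb=512
             # ...reproduce the hang/eviction, then dump the debug buffer to a file
             lctl dk /tmp/client-debug.$(date +%s).log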

            bobijam Zhenyu Xu added a comment -

             Since it has the LU-1128 patch, the server should have adjusted the SLV and bailed out of the loop, and the clients should honor the SLV change and cancel locks so that the MDS can release some of its locks. Do you have any client debug logs I can check to see whether the clients honor the SLV change and cancel locks voluntarily? Thanks.
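
             As a rough check, the SLV and lock counts involved can also be sampled directly with lctl; a sketch (the namespace wildcards would need to match the actual MDT/MDC names):

             # on the MDS: current server lock volume and granted locks per namespace
             lctl get_param ldlm.namespaces.*.pool.slv ldlm.namespaces.*.pool.granted
             # on a client: the SLV last received from the server, and the lock LRU size
             lctl get_param ldlm.namespaces.*mdc*.pool.slv ldlm.namespaces.*mdc*.lru_size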


             haisong Haisong Cai (Inactive) added a comment -

             Hi Zhenyu,

             We are running the binary RPM from Intel.

            [root@dolphin-mds-9-2 ~]# uname -a
            Linux dolphin-mds-9-2.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux

            bobijam Zhenyu Xu added a comment -

             update on what I have found so far (I haven't figured out the root cause yet)

             In lustre-log.1408400803.3766, at time 1408400290.481213, MDT thread 3766 is serving a readpage request and runs short of memory. It enters the memory-reclaim path and goes down into ldlm_pools_shrink(). The server-side pools shrinker only adjusts its SLV (Server Lock Volume) and does not release memory itself, so the LU-1128 patch makes ldlm_pools_shrink() return -1 so that the kernel will not keep calling shrink_slab() in a loop.

             The 2.4.2 code should have this patch, but the log shows that since 1408400290.481213 thread 3766 has done nothing but loop in ldlm_pools_shrink(), which is unexpected.

             Haisong, what is the kernel version you are using?
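
             To see whether the MDS is looping in reclaim under sustained memory pressure, the allocator and LDLM pool state can be sampled on the MDS when the watchdog fires; a sketch (parameter names as commonly exposed; adjust to what this version actually provides):

             # overall memory and reclaimable slab on the MDS
             grep -E 'MemFree|Slab|SReclaimable' /proc/meminfo
             # whether NUMA zone reclaim is enabled (zone_reclaim appears in the stack trace above)
             cat /proc/sys/vm/zone_reclaim_mode
             # per-namespace LDLM pool state, including SLV, limit and granted locks
             lctl get_param ldlm.namespaces.*.pool.state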


             haisong Haisong Cai (Inactive) added a comment -

             Clients (2.5.2) are still reporting errors that LU-3487 claimed should have been fixed:

             Aug 18 11:32:12 tscc-login1 kernel: LustreError: 55183:0:(vvp_io.c:1229:vvp_io_init()) dolphin: refresh file layout [0x20000518e:0x2f:0x0] error -5.

             Aug 18 14:13:17 tscc-login1 kernel: LustreError: 56399:0:(vvp_io.c:1229:vvp_io_init()) dolphin: refresh file layout [0x200005186:0xc:0x0] error -108.
             Aug 18 14:13:17 tscc-login1 kernel: LustreError: 56399:0:(vvp_io.c:1229:vvp_io_init()) Skipped 1 previous similar message

            pjones Peter Jones added a comment -

            Bobijam

            Could you please advise?

            Thanks

            Peter


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: haisong Haisong Cai (Inactive)
              Votes: 0
              Watchers: 5
