[LU-6829] runtests test_1: MDS hung Created: 09/Jul/15  Updated: 29/Sep/15  Resolved: 26/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Di Wang
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

client and server: lustre-master build #3094, RHEL7, DNE


Issue Links:
Related: is related to LU-6578, "inodebit locks for remote entries" (Resolved)
Severity: 3

Description

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/24729d7e-2640-11e5-8b33-5254006e85c2.

The sub-test test_1 failed with the following error:

test failed to respond and timed out

MDS dmesg

[ 1244.736431] Lustre: DEBUG MARKER: /usr/sbin/lctl mark copying 880 files from \/etc \/bin to \/mnt\/lustre\/d1.runtests\/etc \/bin at Wed Jul  8 20:30:42 UTC 2015
[ 1244.857091] Lustre: DEBUG MARKER: copying 880 files from /etc /bin to /mnt/lustre/d1.runtests/etc /bin at Wed Jul 8 20:30:42 UTC 2015
[ 1252.974179] Lustre: 2786:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387443/real 1436387443]  req@ffff880069257300 x1506161297721592/t0(0) o400->lustre-MDT0002-osp-MDT0000@10.1.5.252@tcp:24/4 lens 224/224 e 0 to 1 dl 1436387450 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1252.979486] Lustre: lustre-MDT0002-osp-MDT0000: Connection to lustre-MDT0002 (at 10.1.5.252@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1257.239137] Lustre: 2785:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387448/real 1436387448]  req@ffff880068c86700 x1506161297721644/t0(0) o400->lustre-MDT0002-osp-MDT0000@10.1.5.252@tcp:24/4 lens 224/224 e 0 to 1 dl 1436387455 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1257.243885] Lustre: 2785:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[ 1258.983125] Lustre: 2784:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387450/real 1436387450]  req@ffff88006912db00 x1506161297721660/t0(0) o38->lustre-MDT0003-osp-MDT0000@10.1.5.252@tcp:24/4 lens 520/544 e 0 to 1 dl 1436387456 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1258.988595] Lustre: 2784:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[ 1273.982152] Lustre: 2784:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387460/real 1436387460]  req@ffff880069d87300 x1506161297721744/t0(0) o38->lustre-MDT0001-osp-MDT0000@10.1.5.252@tcp:24/4 lens 520/544 e 0 to 1 dl 1436387471 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1273.987550] Lustre: 2784:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[ 1285.839237] Lustre: 2815:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387476/real 1436387476]  req@ffff880069d87300 x1506161297721892/t0(0) o104->lustre-MDT0000@10.1.5.252@tcp:15/16 lens 296/224 e 0 to 1 dl 1436387483 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[ 1285.844932] Lustre: 2815:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[ 1292.205644] Lustre: lustre-MDT0000: haven't heard from client lustre-MDT0000-lwp-MDT0001_UUID (at 10.1.5.252@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff88007bc28400, cur 1436387490 expire 1436387460 last 1436387441
[ 1293.982194] Lustre: 2784:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387475/real 1436387475]  req@ffff880069d85b00 x1506161297721880/t0(0) o38->lustre-MDT0001-osp-MDT0000@10.1.5.252@tcp:24/4 lens 520/544 e 0 to 1 dl 1436387491 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1318.879183] LNet: Service thread pid 2815 was inactive for 40.04s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 1318.882589] Pid: 2815, comm: mdt00_001
[ 1318.883358] 
Call Trace:
[ 1318.884177]  [<ffffffff8160a409>] schedule+0x29/0x70
[ 1318.884703]  [<ffffffff816082b5>] schedule_timeout+0x175/0x2d0
[ 1318.885134]  [<ffffffff8107ee80>] ? process_timeout+0x0/0x10
[ 1318.885707]  [<ffffffffa09a8320>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
[ 1318.886241]  [<ffffffffa09b2978>] ptlrpc_set_wait+0x4b8/0x9e0 [ptlrpc]
[ 1318.886787]  [<ffffffff810a9650>] ? default_wake_function+0x0/0x20
[ 1318.887253]  [<ffffffffa09b2f1d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[ 1318.887764]  [<ffffffffa0f433c5>] osp_remote_sync+0xd5/0x1b0 [osp]
[ 1318.888210]  [<ffffffffa0f2a25d>] osp_attr_get+0x41d/0x690 [osp]
[ 1318.888728]  [<ffffffffa0f25ab4>] osp_object_init+0x114/0x280 [osp]
[ 1318.889220]  [<ffffffffa076058f>] lu_object_alloc+0xdf/0x310 [obdclass]
[ 1318.889785]  [<ffffffffa076099c>] lu_object_find_try+0x17c/0x2c0 [obdclass]
[ 1318.890295]  [<ffffffffa0760b8c>] lu_object_find_at+0xac/0xe0 [obdclass]
[ 1318.890940]  [<ffffffffa0e6ebb5>] ? lod_index_lookup+0x25/0x30 [lod]
[ 1318.891407]  [<ffffffffa0ecc307>] ? __mdd_lookup.isra.17+0x317/0x440 [mdd]
[ 1318.891993]  [<ffffffffa0760bd6>] lu_object_find+0x16/0x20 [obdclass]
[ 1318.892575]  [<ffffffffa0d8a73b>] mdt_object_find+0x4b/0x170 [mdt]
[ 1318.893146]  [<ffffffffa0db0726>] mdt_reint_open+0x2776/0x2d50 [mdt]
[ 1318.893624]  [<ffffffffa07795f9>] ? upcall_cache_get_entry+0x3e9/0x8e0 [obdclass]
[ 1318.894169]  [<ffffffff812ddd72>] ? strlcpy+0x42/0x60
[ 1318.894571]  [<ffffffffa0da46e0>] mdt_reint_rec+0x80/0x210 [mdt]
[ 1318.894985]  [<ffffffffa0d884e9>] mdt_reint_internal+0x5d9/0xb20 [mdt]
[ 1318.895456]  [<ffffffffa0d88b92>] mdt_intent_reint+0x162/0x410 [mdt]
[ 1318.896056]  [<ffffffffa0d921ea>] mdt_intent_policy+0x57a/0xb30 [mdt]
[ 1318.896619]  [<ffffffffa0970223>] ldlm_lock_enqueue+0x353/0x930 [ptlrpc]
[ 1318.897117]  [<ffffffffa0998da2>] ldlm_handle_enqueue0+0x4f2/0x16e0 [ptlrpc]
[ 1318.897725]  [<ffffffffa09c0360>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
[ 1318.898274]  [<ffffffffa0a1eed2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 1318.898762]  [<ffffffffa0a2329b>] tgt_request_handle+0x88b/0x1100 [ptlrpc]
[ 1318.899278]  [<ffffffffa09cafbb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[ 1318.899926]  [<ffffffffa09c8078>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 1318.900416]  [<ffffffff810a9662>] ? default_wake_function+0x12/0x20
[ 1318.900956]  [<ffffffff810a0898>] ? __wake_up_common+0x58/0x90
[ 1318.901397]  [<ffffffffa09ce900>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
[ 1318.901953]  [<ffffffffa09cdd00>] ? ptlrpc_main+0x0/0x1f60 [ptlrpc]
[ 1318.902407]  [<ffffffff8109739f>] kthread+0xcf/0xe0
[ 1318.902851]  [<ffffffff810972d0>] ? kthread+0x0/0xe0
[ 1318.903215]  [<ffffffff81614f7c>] ret_from_fork+0x7c/0xb0
[ 1318.903640]  [<ffffffff810972d0>] ? kthread+0x0/0xe0
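
The trace reduces to a simple pattern: the mdt00_001 service thread, while handling an intent open, resolves an object that lives on another MDT, and osp_attr_get()/osp_remote_sync() issue a synchronous getattr RPC via ptlrpc_queue_wait(). With MDT0002 unreachable (see the o400/o38 timeouts above), the thread sleeps with no abort path until the LNet watchdog dumps its stack at 40s. Below is a minimal userspace sketch of that pattern, not Lustre source; fake_queue_wait(), service_thread(), and watchdog_thread() are hypothetical stand-ins, and the 5-second watchdog threshold stands in for the real 40-second one.

/* Minimal userspace model (NOT Lustre source) of the hang shown in the
 * trace above: a service thread blocks synchronously on a reply that
 * never arrives, and a watchdog reports it as inactive.
 * Build with: cc -pthread model.c */
#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  reply_arrived = PTHREAD_COND_INITIALIZER;
static bool have_reply = false;   /* never set: the remote MDT stays silent */
static time_t last_active;        /* heartbeat the watchdog polls */

/* Hypothetical stand-in for ptlrpc_queue_wait(): block until a reply
 * arrives.  As in the trace, there is no abort path here, so the caller
 * sleeps for as long as the peer is down. */
static void fake_queue_wait(void)
{
        pthread_mutex_lock(&lock);
        while (!have_reply)
                pthread_cond_wait(&reply_arrived, &lock);
        pthread_mutex_unlock(&lock);
}

/* Stand-in for mdt00_001: an intent open hits an object owned by another
 * MDT, which forces a synchronous getattr RPC to that MDT. */
static void *service_thread(void *arg)
{
        (void)arg;
        printf("mdt00_001: remote object lookup, waiting for getattr reply\n");
        fake_queue_wait();        /* hangs: the reply never comes */
        return NULL;
}

/* Stand-in for the LNet watchdog that printed the stack dump above
 * (threshold shortened from 40s to 5s so the demo finishes quickly). */
static void *watchdog_thread(void *arg)
{
        (void)arg;
        for (;;) {
                sleep(1);
                double idle = difftime(time(NULL), last_active);
                if (idle >= 5.0) {
                        printf("watchdog: service thread inactive for %.0fs, "
                               "dumping stack\n", idle);
                        return NULL;
                }
        }
}

int main(void)
{
        pthread_t svc, wd;

        last_active = time(NULL); /* thread goes idle from this point on */
        pthread_create(&svc, NULL, service_thread, NULL);
        pthread_create(&wd, NULL, watchdog_thread, NULL);
        pthread_join(wd, NULL);   /* svc never finishes; exit regardless */
        return 0;
}

The point of the model is that the blocked thread is both the consumer of the reply and a finite server resource: with enough service threads parked this way, the MDS stops answering anything, which is what "test failed to respond and timed out" looks like from the test harness.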



Comments
Comment by Sebastien Buisson (Inactive) [ 04/Aug/15 ]

A new instance of the problem:
https://testing.hpdd.intel.com/test_sets/ca6cf5aa-3a02-11e5-8f15-5254006e85c2

Comment by Di Wang [ 26/Aug/15 ]
https://testing.hpdd.intel.com/test_sets/ca6cf5aa-3a02-11e5-8f15-5254006e85c2

This is actually LU-6845.

Comment by Di Wang [ 26/Aug/15 ]

I checked most of the failures from the last few months. All of them are caused by other problems: LU-7047, LU-7048, or LU-6578. So I will close this for now, since it has not happened for more than a month.
