[LU-6444] Hard Failover recovery-mds-scale test_failover_ost: test_failover_ost returned 3 Created: 08/Apr/15  Updated: 02/Jul/15  Resolved: 12/Jun/15

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

client and server: lustre-master build #2967 zfs


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/83ac0620-d67e-11e4-8a24-5254006e85c2.

The sub-test test_failover_ost failed with the following error:

test_failover_ost returned 3

MDS console

Lustre: lustre-MDT0000: Recovery over after 0:09, of 2 clients 2 recovered and 0 were evicted.
Lustre: 2817:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1427510311/real 1427510311]  req@ffff88006b6859c0 x1496853024604548/t0(0) o8->lustre-OST0002-osc-MDT0000@10.2.4.226@tcp:28/4 lens 400/544 e 0 to 1 dl 1427510316 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 2817:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1427510321/real 1427510321]  req@ffff8800722326c0 x1496853024604620/t0(0) o8->lustre-OST0006-osc-MDT0000@10.2.4.226@tcp:28/4 lens 400/544 e 0 to 1 dl 1427510331 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 2817:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
LNet: Service thread pid 3181 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 3181, comm: mdt00_003

Call Trace:
 [<ffffffff8152b162>] schedule_timeout+0x192/0x2e0
 [<ffffffff810874f0>] ? process_timeout+0x0/0x10
 [<ffffffffa1325951>] osp_precreate_reserve+0x3d1/0x810 [osp]
 [<ffffffffa131b740>] ? osp_object_free+0x2d0/0x4a0 [osp]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa131fbb6>] osp_declare_object_create+0x1a6/0x650 [osp]
 [<ffffffffa125d81a>] lod_qos_declare_object_on+0x12a/0x4f0 [lod]
 [<ffffffffa12603f3>] lod_alloc_rr.clone.2+0xbb3/0x1020 [lod]
 [<ffffffffa079dfb1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa1261442>] lod_qos_prep_create+0xbe2/0x19e0 [lod]
 [<ffffffffa1254d42>] lod_declare_striped_object+0x162/0x9b0 [lod]
 [<ffffffffa125c547>] lod_declare_object_create+0x2c7/0x460 [lod]
 [<ffffffffa12c4506>] mdd_declare_object_create_internal+0x116/0x340 [mdd]
 [<ffffffffa12bd79e>] mdd_create+0x68e/0x1730 [mdd]
 [<ffffffffa118f7b8>] mdo_create+0x18/0x50 [mdt]
 [<ffffffffa1196bbf>] mdt_reint_open+0x1f8f/0x2c70 [mdt]
 [<ffffffffa0906cbc>] ? upcall_cache_get_entry+0x29c/0x880 [obdclass]
 [<ffffffffa1180cad>] mdt_reint_rec+0x5d/0x200 [mdt]
 [<ffffffffa116511b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
 [<ffffffffa11655e6>] mdt_intent_reint+0x1f6/0x430 [mdt]
 [<ffffffffa1163bd4>] mdt_intent_policy+0x494/0xce0 [mdt]
 [<ffffffffa0abf4f9>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
 [<ffffffffa0aeb52b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
 [<ffffffffa0b6bbb1>] tgt_enqueue+0x61/0x230 [ptlrpc]
 [<ffffffffa0b6c7fe>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa0b1c661>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffffa0b1b820>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1427510345.3181
LNet: Service thread pid 3181 completed after 80.04s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Duration:               86400
Server failover period: 1200 seconds
Exited after:           37 seconds
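
For context on the summary above: recovery-mds-scale keeps a client workload running while periodically failing over a server, and test_failover_ost returns a non-zero status as soon as either a client load dies or an OST fails to recover. "Duration" and "Server failover period" are the configured values (86400 s and 1200 s), while "Exited after: 37 seconds" means the run aborted almost immediately. A minimal sketch of that loop, using hypothetical helper names rather than the real lustre/tests code, looks roughly like this:

#!/bin/bash
# Rough sketch of the failover-under-load pattern; every helper below
# (start_client_loads, check_client_loads, fail_random_ost, wait_recovery_complete)
# is a hypothetical stand-in, not the real test-framework code.
DURATION=${DURATION:-86400}                              # total run time, seconds
SERVER_FAILOVER_PERIOD=${SERVER_FAILOVER_PERIOD:-1200}   # seconds between failovers

start_client_loads                 # e.g. dd/tar workloads on every client
last_failover=$SECONDS
rc=0
while (( SECONDS < DURATION )); do
    # Abort as soon as any client load dies; this is how a run can report
    # "Exited after: 37 seconds" even with a 1200 s failover period.
    check_client_loads || { rc=3; break; }
    if (( SECONDS - last_failover >= SERVER_FAILOVER_PERIOD )); then
        fail_random_ost || { rc=3; break; }              # power-cycle and remount one OST
        wait_recovery_complete || { rc=3; break; }
        last_failover=$SECONDS
    fi
    sleep 10
done
echo "Duration:               $DURATION"
echo "Server failover period: $SERVER_FAILOVER_PERIOD seconds"
echo "Exited after:           $SECONDS seconds"
exit $rc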


 Comments   
Comment by Oleg Drokin [ 09/Apr/15 ]

So I see that the OST also has timeouts waiting for a transaction to commit.
Unfortunately there are no MDS console logs to correlate the two records, but could a slow ZFS OST be causing havoc here?

Lustre: Skipped 2 previous similar messages
LNet: Service thread pid 11335 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 11335, comm: ll_ost00_003

Call Trace:
 [<ffffffffa0fe3860>] ? osd_write+0x220/0x390 [osd_zfs]
 [<ffffffff8109ed4e>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa01455f5>] cv_wait_common+0x105/0x120 [spl]
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0145665>] __cv_wait+0x15/0x20 [spl]
 [<ffffffffa024824b>] txg_wait_synced+0x7b/0xb0 [zfs]
 [<ffffffffa0fd4f35>] osd_trans_stop+0x415/0x4d0 [osd_zfs]
 [<ffffffffa1120f5f>] ofd_trans_stop+0x1f/0x60 [ofd]
 [<ffffffffa11256c4>] ofd_attr_set+0x304/0x7f0 [ofd]
 [<ffffffffa1112944>] ofd_setattr_hdl+0x1b4/0x9d0 [ofd]
 [<ffffffffa0b6c7fe>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa0b1c661>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffffa0b1b820>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
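
The OST-side trace above is stuck in txg_wait_synced(), i.e. waiting for a ZFS transaction group to commit, while the MDS-side trace in the description is stuck in osp_precreate_reserve(), i.e. waiting for that OST to supply precreated objects. If the slow-ZFS theory is right, it should show up in the pool's txg timing and in the MDS's precreate pool. A few commands that could be used to check this on a reproducer (the pool name lustre-ost1 is a placeholder, and zfs_txg_history must be non-zero for the txgs kstat to record anything):

# On the OSS: per-txg timing (otime/qtime/wtime/stime are nanoseconds spent in
# the open/quiesce/wait/sync stages of each transaction group).
echo 100 > /sys/module/zfs/parameters/zfs_txg_history
cat /proc/spl/kstat/zfs/lustre-ost1/txgs
zpool iostat -v lustre-ost1 5        # per-vdev throughput while the load runs

# On the MDS: state of the osp precreate pool for each OST; a thread blocked in
# osp_precreate_reserve() is waiting for one of these pools to refill.
lctl get_param osp.*.prealloc_status
lctl get_param osp.*.prealloc_last_id osp.*.prealloc_next_id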
Comment by Sarah Liu [ 01/Jun/15 ]

also seen with ldiskfs

https://testing.hpdd.intel.com/test_sets/f8515fe4-007b-11e5-9650-5254006e85c2

Comment by Sarah Liu [ 12/Jun/15 ]

dup of LU-6669
