[LU-6444] Hard Failover recovery-mds-scale test_failover_ost: test_failover_ost returned 3 Created: 08/Apr/15 Updated: 02/Jul/15 Resolved: 12/Jun/15
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | client and server: lustre-master build #2967 zfs |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/83ac0620-d67e-11e4-8a24-5254006e85c2.

The sub-test test_failover_ost failed with the following error:

test_failover_ost returned 3

MDS console:

Lustre: lustre-MDT0000: Recovery over after 0:09, of 2 clients 2 recovered and 0 were evicted.
Lustre: 2817:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1427510311/real 1427510311] req@ffff88006b6859c0 x1496853024604548/t0(0) o8->lustre-OST0002-osc-MDT0000@10.2.4.226@tcp:28/4 lens 400/544 e 0 to 1 dl 1427510316 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 2817:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1427510321/real 1427510321] req@ffff8800722326c0 x1496853024604620/t0(0) o8->lustre-OST0006-osc-MDT0000@10.2.4.226@tcp:28/4 lens 400/544 e 0 to 1 dl 1427510331 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 2817:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
LNet: Service thread pid 3181 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 3181, comm: mdt00_003

Call Trace:
 [<ffffffff8152b162>] schedule_timeout+0x192/0x2e0
 [<ffffffff810874f0>] ? process_timeout+0x0/0x10
 [<ffffffffa1325951>] osp_precreate_reserve+0x3d1/0x810 [osp]
 [<ffffffffa131b740>] ? osp_object_free+0x2d0/0x4a0 [osp]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa131fbb6>] osp_declare_object_create+0x1a6/0x650 [osp]
 [<ffffffffa125d81a>] lod_qos_declare_object_on+0x12a/0x4f0 [lod]
 [<ffffffffa12603f3>] lod_alloc_rr.clone.2+0xbb3/0x1020 [lod]
 [<ffffffffa079dfb1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa1261442>] lod_qos_prep_create+0xbe2/0x19e0 [lod]
 [<ffffffffa1254d42>] lod_declare_striped_object+0x162/0x9b0 [lod]
 [<ffffffffa125c547>] lod_declare_object_create+0x2c7/0x460 [lod]
 [<ffffffffa12c4506>] mdd_declare_object_create_internal+0x116/0x340 [mdd]
 [<ffffffffa12bd79e>] mdd_create+0x68e/0x1730 [mdd]
 [<ffffffffa118f7b8>] mdo_create+0x18/0x50 [mdt]
 [<ffffffffa1196bbf>] mdt_reint_open+0x1f8f/0x2c70 [mdt]
 [<ffffffffa0906cbc>] ? upcall_cache_get_entry+0x29c/0x880 [obdclass]
 [<ffffffffa1180cad>] mdt_reint_rec+0x5d/0x200 [mdt]
 [<ffffffffa116511b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
 [<ffffffffa11655e6>] mdt_intent_reint+0x1f6/0x430 [mdt]
 [<ffffffffa1163bd4>] mdt_intent_policy+0x494/0xce0 [mdt]
 [<ffffffffa0abf4f9>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
 [<ffffffffa0aeb52b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
 [<ffffffffa0b6bbb1>] tgt_enqueue+0x61/0x230 [ptlrpc]
 [<ffffffffa0b6c7fe>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa0b1c661>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffffa0b1b820>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1427510345.3181
LNet: Service thread pid 3181 completed after 80.04s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Duration: 86400 Server failover period: 1200 seconds Exited after: 37 seconds
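For context on the trace above: the mdt00_003 service thread is parked in osp_precreate_reserve(), waiting for the OSP layer to obtain precreated objects from the OST that has just been failed over; the service-thread watchdog flags it after 40 seconds of inactivity, and the same thread completes at 80.04s when the wait finally ends. The snippet below is only a minimal userspace sketch of that wait-plus-watchdog pattern, not Lustre code; every name, thread, and delay in it is hypothetical.

```c
/* Illustrative only -- not Lustre code.  A "service thread" blocks waiting
 * for precreated objects that a failed-over "OST" should replenish; a
 * watchdog thread notices the wait has exceeded 40s, and the wait itself
 * only ends when the pool is refilled at ~80s.  All names are hypothetical. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int precreated;                  /* objects available for reservation */
static int blocked;                     /* is the service thread stuck? */

static void *watchdog(void *arg)        /* stands in for the service-thread watchdog */
{
    (void)arg;
    sleep(40);
    pthread_mutex_lock(&lock);
    if (blocked)
        puts("watchdog: service thread inactive for 40s, dumping stack");
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *ost_recovery(void *arg)    /* the "OST" coming back after failover */
{
    (void)arg;
    sleep(80);
    pthread_mutex_lock(&lock);
    precreated = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t w, o;
    pthread_create(&w, NULL, watchdog, NULL);
    pthread_create(&o, NULL, ost_recovery, NULL);

    pthread_mutex_lock(&lock);
    blocked = 1;
    while (precreated == 0)             /* an osp_precreate_reserve()-like wait */
        pthread_cond_wait(&cond, &lock);
    precreated--;
    blocked = 0;
    pthread_mutex_unlock(&lock);

    puts("service thread completed after ~80s");
    pthread_join(w, NULL);
    pthread_join(o, NULL);
    return 0;
}
```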
| Comments |
| Comment by Oleg Drokin [ 09/Apr/15 ] |
So I see that the OST also has timeouts waiting for the transaction to commit.

Lustre: Skipped 2 previous similar messages
LNet: Service thread pid 11335 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 11335, comm: ll_ost00_003

Call Trace:
 [<ffffffffa0fe3860>] ? osd_write+0x220/0x390 [osd_zfs]
 [<ffffffff8109ed4e>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa01455f5>] cv_wait_common+0x105/0x120 [spl]
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0145665>] __cv_wait+0x15/0x20 [spl]
 [<ffffffffa024824b>] txg_wait_synced+0x7b/0xb0 [zfs]
 [<ffffffffa0fd4f35>] osd_trans_stop+0x415/0x4d0 [osd_zfs]
 [<ffffffffa1120f5f>] ofd_trans_stop+0x1f/0x60 [ofd]
 [<ffffffffa11256c4>] ofd_attr_set+0x304/0x7f0 [ofd]
 [<ffffffffa1112944>] ofd_setattr_hdl+0x1b4/0x9d0 [ofd]
 [<ffffffffa0b6c7fe>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa0b1c661>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffffa0b1b820>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
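The osd_zfs frames make the OST side concrete: ofd_attr_set() cannot finish until osd_trans_stop() returns, and that path sits in txg_wait_synced(), waiting, with no timeout, for ZFS to commit the current transaction group. If the pool syncs slowly around the failover, that wait alone can exceed the 40-second watchdog threshold. The sketch below is only a userspace illustration of that unbounded-wait shape; it is not ZFS or Lustre code, and its names are hypothetical.

```c
/* Illustrative only -- not ZFS code.  A transaction-stop path blocks until a
 * sync thread has committed the caller's transaction group, roughly the shape
 * of the txg_wait_synced() wait in the trace above.  Names are hypothetical. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static uint64_t synced_txg;             /* last txg committed to disk */

/* Block until the given txg has been committed; called from transaction stop. */
static void wait_txg_synced(uint64_t txg)
{
    pthread_mutex_lock(&lock);
    while (synced_txg < txg)             /* unbounded wait: only the sync   */
        pthread_cond_wait(&cond, &lock); /* thread can release the caller   */
    pthread_mutex_unlock(&lock);
}

/* Called by the sync thread once a txg is safely on disk. */
static void txg_committed(uint64_t txg)
{
    pthread_mutex_lock(&lock);
    synced_txg = txg;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    txg_committed(1);                    /* pretend the sync thread finished */
    wait_txg_synced(1);                  /* the setattr-style caller returns */
    puts("transaction stopped");
    return 0;
}
```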
| Comment by Sarah Liu [ 01/Jun/15 ] |
Also seen with ldiskfs: https://testing.hpdd.intel.com/test_sets/f8515fe4-007b-11e5-9650-5254006e85c2
| Comment by Sarah Liu [ 12/Jun/15 ] |
dup of