[LU-7490] out_tx_write_exec()) LBUG Created: 27/Nov/15 Updated: 28/Jan/16 Resolved: 28/Jan/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
lola |
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
The error occurred during soak testing of master branch build '20151122' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151122). DNE is enabled. MDSes are configured in active-active failover configuration. Sequence of events:
Nov 26 11:44:54 lola-8 kernel: LustreError: 8491:0:(out_lib.c:692:out_tx_write_exec()) read record [0x240089779:0x1:0x0] tail_pos 173122472 rc -53 index 50635 size 172659608 Nov 26 11:44:54 lola-8 kernel: LustreError: 8491:0:(out_lib.c:693:out_tx_write_exec()) LBUG Nov 26 11:44:54 lola-8 kernel: Pid: 8491, comm: mdt_out03_004 Nov 26 11:44:54 lola-8 kernel: Nov 26 11:44:54 lola-8 kernel: Call Trace: Nov 26 11:44:54 lola-8 kernel: [<ffffffffa07fb875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] Nov 26 11:44:54 lola-8 kernel: [<ffffffffa07fbe77>] lbug_with_loc+0x47/0xb0 [libcfs] Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bb60a0>] out_tx_write_exec+0x500/0x7a0 [ptlrpc] Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bb934b>] ? out_tx_xattr_set_exec+0xeb/0x680 [ptlrpc] Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bae13a>] out_tx_end+0xda/0x5d0 [ptlrpc] Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bb3726>] out_handle+0xbd6/0x1890 [ptlrpc] Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0afa4e0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0baae1c>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc] Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0b52711>] ptlrpc_main+0xe41/0x1910 [ptlrpc] Nov 26 11:44:54 lola-8 kernel: [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0 Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0b518d0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc] Nov 26 11:44:54 lola-8 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0 Nov 26 11:44:54 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Nov 26 11:44:54 lola-8 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0 Nov 26 11:44:54 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Nov 26 11:44:54 lola-8 kernel: Nov 26 11:44:54 lola-8 kernel: LustreError: dumping log to /tmp/lustre-log.1448567093.8491 Most likely this event is related to Attached console and messages log file of MDS (lola-8), kernel debug log file mentioned in the LBUG error message and error messages extracted from Lustre client nodes messages files that showed up at the same time. |
| Comments |
| Comment by Di Wang [ 01/Dec/15 ] |
|
According to the debug log 00000020:00080000:29.0:1448567093.939286:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939290:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:11.0:1448567093.939292:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928) req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939297:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:11.0:1448567093.939298:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928) req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939303:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:11.0:1448567093.939305:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928) req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939310:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:11.0:1448567093.939316:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928) req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939318:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939325:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:11.0:1448567093.939325:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928) req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939332:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939336:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939340:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939344:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00020000:29.0:1448567093.940373:0:8491:0:(out_lib.c:692:out_tx_write_exec()) read record [0x240089779:0x1:0x0] tail_pos 173122472 rc -53 index 50635 size 172659608 00000020:00040000:29.0:1448567093.940379:0:8491:0:(out_lib.c:693:out_tx_write_exec()) LBUG It seems the request x1518924261650748 is not being invalidated for failure, which cause the corruption. sigh, there are no debug log on MDT0006, so I do not know what happens there. Hmm, it seems not all of resend requests are added to the delayed_list, /* retry indefinitely on EINPROGRESS */
if (lustre_msg_get_status(req->rq_repmsg) == -EINPROGRESS &&
ptlrpc_no_resend(req) == 0 && !req->rq_no_retry_einprogress) {
time_t now = cfs_time_current_sec();
DEBUG_REQ(D_RPCTRACE, req, "Resending request on EINPROGRESS");
spin_lock(&req->rq_lock);
req->rq_resend = 1;
spin_unlock(&req->rq_lock);
req->rq_nr_resend++;
/* Readjust the timeout for current conditions */
ptlrpc_at_set_req_timeout(req);
/* delay resend to give a chance to the server to get ready.
* The delay is increased by 1s on every resend and is capped to
* the current request timeout (i.e. obd_timeout if AT is off,
* or AT service time x 125% + 5s, see at_est2timeout) */
if (req->rq_nr_resend > req->rq_timeout)
req->rq_sent = now + req->rq_timeout;
else
req->rq_sent = now + req->rq_nr_resend;
RETURN(0);
}
I will update http://review.whamcloud.com/#/c/17199/ |
| Comment by Di Wang [ 01/Dec/15 ] |
|
Here is the patch http://review.whamcloud.com/#/c/17199/ |
| Comment by Gerrit Updater [ 28/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17199/ |
| Comment by Joseph Gmitter (Inactive) [ 28/Jan/16 ] |
|
Landed for 2.8 |