[LU-7490] out_tx_write_exec()) LBUG Created: 27/Nov/15  Updated: 28/Jan/16  Resolved: 28/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: soak
Environment:

lola
build: 2.7.63-4-gf84e06e, a7eface85ea2d2aa6198681264b082a0244855d4 + patches


Attachments: File console-lola-8.log.bz2     File lola-8-lbug-client-messages.txt.bz2     File lustre-log.1448567093.8491.bz2     File messages-lola-8.log.bz2    
Issue Links:
Related
is related to LU-7488 _req_capsule_get()) ASSERTION( fmt !=... Resolved
is related to LU-7488 _req_capsule_get()) ASSERTION( fmt !=... Resolved
is related to LU-7455 Tracking tickets to make DNE pass soa... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurred during soak testing of master branch build '20151122' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151122). DNE is enabled. MDSes are configured in active-active failover configuration.

Sequence of events:

  • 2015-11-26 10:32 Failover resources (mdt-0,1) lola-8 --> lola-9 started
  • 2015-11-26 11:40 Failback resources (mdt-0.1) lola-9 --> lola-8 completed successful
  • 2015-11-26 11:44 LBUG on lola-8. See the following message.
Nov 26 11:44:54 lola-8 kernel: LustreError: 8491:0:(out_lib.c:692:out_tx_write_exec()) read record [0x240089779:0x1:0x0] tail_pos 173122472 rc -53 index 50635 size 172659608
Nov 26 11:44:54 lola-8 kernel: LustreError: 8491:0:(out_lib.c:693:out_tx_write_exec()) LBUG
Nov 26 11:44:54 lola-8 kernel: Pid: 8491, comm: mdt_out03_004
Nov 26 11:44:54 lola-8 kernel: 
Nov 26 11:44:54 lola-8 kernel: Call Trace:
Nov 26 11:44:54 lola-8 kernel: [<ffffffffa07fb875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Nov 26 11:44:54 lola-8 kernel: [<ffffffffa07fbe77>] lbug_with_loc+0x47/0xb0 [libcfs]
Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bb60a0>] out_tx_write_exec+0x500/0x7a0 [ptlrpc]
Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bb934b>] ? out_tx_xattr_set_exec+0xeb/0x680 [ptlrpc]
Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bae13a>] out_tx_end+0xda/0x5d0 [ptlrpc]
Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bb3726>] out_handle+0xbd6/0x1890 [ptlrpc]
Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0afa4e0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0baae1c>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0b52711>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Nov 26 11:44:54 lola-8 kernel: [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0b518d0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Nov 26 11:44:54 lola-8 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Nov 26 11:44:54 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Nov 26 11:44:54 lola-8 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Nov 26 11:44:54 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Nov 26 11:44:54 lola-8 kernel: 
Nov 26 11:44:54 lola-8 kernel: LustreError: dumping log to /tmp/lustre-log.1448567093.8491

Most likely this event is related to LU-7488 which happened almost at the same time on the HA failover partner (lola-9)

Attached console and messages log file of MDS (lola-8), kernel debug log file mentioned in the LBUG error message and error messages extracted from Lustre client nodes messages files that showed up at the same time.



 Comments   
Comment by Di Wang [ 01/Dec/15 ]

According to the debug log

00000020:00080000:29.0:1448567093.939286:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:29.0:1448567093.939290:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:11.0:1448567093.939292:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928)  req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:29.0:1448567093.939297:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:11.0:1448567093.939298:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928)  req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:29.0:1448567093.939303:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:11.0:1448567093.939305:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928)  req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:29.0:1448567093.939310:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:11.0:1448567093.939316:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928)  req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:29.0:1448567093.939318:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:29.0:1448567093.939325:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:11.0:1448567093.939325:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928)  req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:29.0:1448567093.939332:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:29.0:1448567093.939336:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:29.0:1448567093.939340:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00080000:29.0:1448567093.939344:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00020000:29.0:1448567093.940373:0:8491:0:(out_lib.c:692:out_tx_write_exec()) read record [0x240089779:0x1:0x0] tail_pos 173122472 rc -53 index 50635 size 172659608
00000020:00040000:29.0:1448567093.940379:0:8491:0:(out_lib.c:693:out_tx_write_exec()) LBUG

It seems the request x1518924261650748 is not being invalidated for failure, which cause the corruption. sigh, there are no debug log on MDT0006, so I do not know what happens there.

Hmm, it seems not all of resend requests are added to the delayed_list,

      /* retry indefinitely on EINPROGRESS */
        if (lustre_msg_get_status(req->rq_repmsg) == -EINPROGRESS &&
            ptlrpc_no_resend(req) == 0 && !req->rq_no_retry_einprogress) {
                time_t  now = cfs_time_current_sec();

                DEBUG_REQ(D_RPCTRACE, req, "Resending request on EINPROGRESS");
                spin_lock(&req->rq_lock);
                req->rq_resend = 1;
                spin_unlock(&req->rq_lock);
                req->rq_nr_resend++;

                /* Readjust the timeout for current conditions */
                ptlrpc_at_set_req_timeout(req);
                /* delay resend to give a chance to the server to get ready.
                 * The delay is increased by 1s on every resend and is capped to
                 * the current request timeout (i.e. obd_timeout if AT is off,
                 * or AT service time x 125% + 5s, see at_est2timeout) */
                if (req->rq_nr_resend > req->rq_timeout)
                        req->rq_sent = now + req->rq_timeout;
                else
                        req->rq_sent = now + req->rq_nr_resend;

                RETURN(0);
        }

I will update http://review.whamcloud.com/#/c/17199/

Comment by Di Wang [ 01/Dec/15 ]

Here is the patch http://review.whamcloud.com/#/c/17199/

Comment by Gerrit Updater [ 28/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17199/
Subject: LU-7490 recovery: abort update recovery once fails
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 92890d8f555d12ad32dc9841a328e84c5d26e896

Comment by Joseph Gmitter (Inactive) [ 28/Jan/16 ]

Landed for 2.8

Generated at Sat Feb 10 02:09:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.