Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.8.0
    • Lustre 2.8.0
    • lola
      build: 2.7.63-4-gf84e06e, a7eface85ea2d2aa6198681264b082a0244855d4 + patches
    • 3
    • 9223372036854775807

    Description

      The error occurred during soak testing of master branch build '20151122' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151122). DNE is enabled. MDSes are configured in active-active failover configuration.

      Sequence of events:

      • 2015-11-26 10:32 Failover resources (mdt-0,1) lola-8 --> lola-9 started
      • 2015-11-26 11:40 Failback resources (mdt-0.1) lola-9 --> lola-8 completed successful
      • 2015-11-26 11:44 LBUG on lola-8. See the following message.
      Nov 26 11:44:54 lola-8 kernel: LustreError: 8491:0:(out_lib.c:692:out_tx_write_exec()) read record [0x240089779:0x1:0x0] tail_pos 173122472 rc -53 index 50635 size 172659608
      Nov 26 11:44:54 lola-8 kernel: LustreError: 8491:0:(out_lib.c:693:out_tx_write_exec()) LBUG
      Nov 26 11:44:54 lola-8 kernel: Pid: 8491, comm: mdt_out03_004
      Nov 26 11:44:54 lola-8 kernel: 
      Nov 26 11:44:54 lola-8 kernel: Call Trace:
      Nov 26 11:44:54 lola-8 kernel: [<ffffffffa07fb875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Nov 26 11:44:54 lola-8 kernel: [<ffffffffa07fbe77>] lbug_with_loc+0x47/0xb0 [libcfs]
      Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bb60a0>] out_tx_write_exec+0x500/0x7a0 [ptlrpc]
      Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bb934b>] ? out_tx_xattr_set_exec+0xeb/0x680 [ptlrpc]
      Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bae13a>] out_tx_end+0xda/0x5d0 [ptlrpc]
      Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0bb3726>] out_handle+0xbd6/0x1890 [ptlrpc]
      Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0afa4e0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
      Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0baae1c>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
      Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0b52711>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      Nov 26 11:44:54 lola-8 kernel: [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
      Nov 26 11:44:54 lola-8 kernel: [<ffffffffa0b518d0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      Nov 26 11:44:54 lola-8 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
      Nov 26 11:44:54 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov 26 11:44:54 lola-8 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
      Nov 26 11:44:54 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Nov 26 11:44:54 lola-8 kernel: 
      Nov 26 11:44:54 lola-8 kernel: LustreError: dumping log to /tmp/lustre-log.1448567093.8491
      

      Most likely this event is related to LU-7488 which happened almost at the same time on the HA failover partner (lola-9)

      Attached console and messages log file of MDS (lola-8), kernel debug log file mentioned in the LBUG error message and error messages extracted from Lustre client nodes messages files that showed up at the same time.

      Attachments

        Issue Links

          Activity

            [LU-7490] out_tx_write_exec()) LBUG

            Landed for 2.8

            jgmitter Joseph Gmitter (Inactive) added a comment - Landed for 2.8

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17199/
            Subject: LU-7490 recovery: abort update recovery once fails
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 92890d8f555d12ad32dc9841a328e84c5d26e896

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17199/ Subject: LU-7490 recovery: abort update recovery once fails Project: fs/lustre-release Branch: master Current Patch Set: Commit: 92890d8f555d12ad32dc9841a328e84c5d26e896
            di.wang Di Wang added a comment - Here is the patch http://review.whamcloud.com/#/c/17199/
            di.wang Di Wang added a comment - - edited

            According to the debug log

            00000020:00080000:29.0:1448567093.939286:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:29.0:1448567093.939290:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:11.0:1448567093.939292:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928)  req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:29.0:1448567093.939297:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:11.0:1448567093.939298:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928)  req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:29.0:1448567093.939303:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:11.0:1448567093.939305:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928)  req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:29.0:1448567093.939310:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:11.0:1448567093.939316:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928)  req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:29.0:1448567093.939318:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:29.0:1448567093.939325:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:11.0:1448567093.939325:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928)  req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:29.0:1448567093.939332:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:29.0:1448567093.939336:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:29.0:1448567093.939340:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00080000:29.0:1448567093.939344:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488)  req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0
            00000020:00020000:29.0:1448567093.940373:0:8491:0:(out_lib.c:692:out_tx_write_exec()) read record [0x240089779:0x1:0x0] tail_pos 173122472 rc -53 index 50635 size 172659608
            00000020:00040000:29.0:1448567093.940379:0:8491:0:(out_lib.c:693:out_tx_write_exec()) LBUG
            

            It seems the request x1518924261650748 is not being invalidated for failure, which cause the corruption. sigh, there are no debug log on MDT0006, so I do not know what happens there.

            Hmm, it seems not all of resend requests are added to the delayed_list,

                  /* retry indefinitely on EINPROGRESS */
                    if (lustre_msg_get_status(req->rq_repmsg) == -EINPROGRESS &&
                        ptlrpc_no_resend(req) == 0 && !req->rq_no_retry_einprogress) {
                            time_t  now = cfs_time_current_sec();
            
                            DEBUG_REQ(D_RPCTRACE, req, "Resending request on EINPROGRESS");
                            spin_lock(&req->rq_lock);
                            req->rq_resend = 1;
                            spin_unlock(&req->rq_lock);
                            req->rq_nr_resend++;
            
                            /* Readjust the timeout for current conditions */
                            ptlrpc_at_set_req_timeout(req);
                            /* delay resend to give a chance to the server to get ready.
                             * The delay is increased by 1s on every resend and is capped to
                             * the current request timeout (i.e. obd_timeout if AT is off,
                             * or AT service time x 125% + 5s, see at_est2timeout) */
                            if (req->rq_nr_resend > req->rq_timeout)
                                    req->rq_sent = now + req->rq_timeout;
                            else
                                    req->rq_sent = now + req->rq_nr_resend;
            
                            RETURN(0);
                    }
            

            I will update http://review.whamcloud.com/#/c/17199/

            di.wang Di Wang added a comment - - edited According to the debug log 00000020:00080000:29.0:1448567093.939286:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939290:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:11.0:1448567093.939292:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928) req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939297:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:11.0:1448567093.939298:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928) req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939303:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:11.0:1448567093.939305:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928) req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939310:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:11.0:1448567093.939316:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928) req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939318:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939325:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:11.0:1448567093.939325:0:8215:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518918717705928) req@ffff88082ea09050 x1518918717714200/t0(0) o1000->soaked-MDT0003-mdtlov_UUID@192.168.1.109@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939332:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939336:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939340:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00080000:29.0:1448567093.939344:0:8491:0:(out_handler.c:89:out_check_resent()) @@@ no reply for RESENT req (have 1518924261645488) req@ffff8808040a5450 x1518924261650748/t0(0) o1000->soaked-MDT0006-mdtlov_UUID@192.168.1.111@o2ib10:694/0 lens 320/4320 e 0 to 0 dl 1448567099 ref 1 fl Interpret:/2/0 rc 0/0 00000020:00020000:29.0:1448567093.940373:0:8491:0:(out_lib.c:692:out_tx_write_exec()) read record [0x240089779:0x1:0x0] tail_pos 173122472 rc -53 index 50635 size 172659608 00000020:00040000:29.0:1448567093.940379:0:8491:0:(out_lib.c:693:out_tx_write_exec()) LBUG It seems the request x1518924261650748 is not being invalidated for failure, which cause the corruption. sigh, there are no debug log on MDT0006, so I do not know what happens there. Hmm, it seems not all of resend requests are added to the delayed_list, /* retry indefinitely on EINPROGRESS */ if (lustre_msg_get_status(req->rq_repmsg) == -EINPROGRESS && ptlrpc_no_resend(req) == 0 && !req->rq_no_retry_einprogress) { time_t now = cfs_time_current_sec(); DEBUG_REQ(D_RPCTRACE, req, "Resending request on EINPROGRESS"); spin_lock(&req->rq_lock); req->rq_resend = 1; spin_unlock(&req->rq_lock); req->rq_nr_resend++; /* Readjust the timeout for current conditions */ ptlrpc_at_set_req_timeout(req); /* delay resend to give a chance to the server to get ready. * The delay is increased by 1s on every resend and is capped to * the current request timeout (i.e. obd_timeout if AT is off, * or AT service time x 125% + 5s, see at_est2timeout) */ if (req->rq_nr_resend > req->rq_timeout) req->rq_sent = now + req->rq_timeout; else req->rq_sent = now + req->rq_nr_resend; RETURN(0); } I will update http://review.whamcloud.com/#/c/17199/

            People

              di.wang Di Wang
              heckes Frank Heckes (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: