LU-7252: replay-single test_116b: process_req_last_xid() LBUG

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0
    • Component/s: None
    • Severity: 3

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f670098e-6a05-11e5-9d0a-5254006e85c2.

      The sub-test test_116b failed with the following error:

      10:13:07:Lustre: DEBUG MARKER: == replay-single test 116b: large update log slave MDT recovery == 10:12:46 (1443867166)
      10:13:07:Lustre: DEBUG MARKER: sync; sync; sync
      10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 notransno
      10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 readonly
      10:13:07:Turning device dm-1 (0xfd00001) read-only
      10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
      10:13:07:Lustre: DEBUG MARKER: mds2 REPLAY BARRIER on lustre-MDT0001
      10:13:07:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80001702
      10:13:07:Lustre: DEBUG MARKER: grep -c /mnt/mds2' ' /proc/mounts
      10:13:07:Lustre: DEBUG MARKER: umount -d /mnt/mds2
      10:13:07:Removing read-only on unknown block (0xfd00001)
      10:13:07:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      10:13:07:Lustre: DEBUG MARKER: hostname
      10:13:07:Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P2
      10:13:07:Lustre: DEBUG MARKER: mkdir -p /mnt/mds2; mount -t lustre   		                   /dev/lvm-Role_MDS/P2 /mnt/mds2
      10:13:07:LDISKFS-fs (dm-1): recovery complete
      10:13:07:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts: 
      10:13:07:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      10:13:07:Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      10:13:07:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P2 2>/dev/null
      10:13:07:LustreError: 30838:0:(tgt_handler.c:621:process_req_last_xid()) @@@ Unexpected xid 560f8bf556714 vs. last_xid 560f8bf55672b
      10:13:07:  req@ffff880059530c50 x1513996361754388/t0(158913790082) o1000->lustre-MDT0000-mdtlov_UUID@10.1.5.33@tcp:666/0 lens 4168/0 e 0 to 0 dl 1443867196 ref 1 fl Interpret:/4/ffffffff rc 0/-1
      10:13:07:LustreError: 30838:0:(tgt_handler.c:636:process_req_last_xid()) LBUG
      10:13:07:Pid: 30838, comm: mdt_out00_000
      10:13:07:
      10:13:07:Call Trace:
      10:13:07: [<ffffffffa049b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      10:13:07: [<ffffffffa049be77>] lbug_with_loc+0x47/0xb0 [libcfs]
      10:13:07: [<ffffffffa08859bd>] tgt_request_handle+0x105d/0x13f0 [ptlrpc]
      10:13:07: [<ffffffffa082cd61>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      10:13:07: [<ffffffffa082bf20>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      10:13:07: [<ffffffff810a101e>] kthread+0x9e/0xc0
      10:13:07: [<ffffffff8100c28a>] child_rip+0xa/0x20
      10:13:07: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
      10:13:07: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      10:13:07:
      10:13:07:Kernel panic - not syncing: LBUG
      10:13:07:Pid: 30838, comm: mdt_out00_000 Not tainted 2.6.32-573.3.1.el6_lustre.g00880a0.x86_64 #1
      10:13:07:Call Trace:
      10:13:07: [<ffffffff815384e4>] ? panic+0xa7/0x16f
      10:13:07: [<ffffffffa049becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      10:13:07: [<ffffffffa08859bd>] ? tgt_request_handle+0x105d/0x13f0 [ptlrpc]
      10:13:07: [<ffffffffa082cd61>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
      10:13:07: [<ffffffffa082bf20>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      10:13:07: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
      10:13:07: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      10:13:07: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
      10:13:07: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      10:13:07:Initializing cgroup subsys cpuset
      10:13:07:Initializing cgroup subsys cpu
      

      Please provide additional information about the failure here.

      Info required for matching: replay-single 116b
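
      As a rough illustration of the failing check, here is a toy user-space model (not the Lustre source; the struct and function names are invented, while last_xid/exp_last_xid and the xid values come from the log and the comments below). The assertion fires when a request's xid is not newer than the last xid already recorded for the export:

      /*
       * Toy model of the check reported above -- not Lustre code.  The server
       * remembers the highest xid seen on the export (the exp_last_xid the
       * comments below refer to), and the debug assertion fires when a request
       * arrives with an xid that is not newer -- exactly what the replayed OUT
       * request above hits.
       */
      #include <assert.h>
      #include <inttypes.h>
      #include <stdint.h>
      #include <stdio.h>

      struct export_model {
          uint64_t last_xid;                      /* stands in for exp_last_xid */
      };

      static void check_last_xid(struct export_model *exp, uint64_t req_xid)
      {
          if (req_xid <= exp->last_xid) {
              /* mirrors tgt_handler.c:621 "Unexpected xid ... vs. last_xid ..." */
              printf("Unexpected xid %" PRIx64 " vs. last_xid %" PRIx64 "\n",
                     req_xid, exp->last_xid);
              assert(0);                          /* stands in for the LBUG() at :636 */
          }
          exp->last_xid = req_xid;                /* normal case: remember the newer xid */
      }

      int main(void)
      {
          /* values taken from the console log above */
          struct export_model exp = { .last_xid = 0x560f8bf55672bULL };

          check_last_xid(&exp, 0x560f8bf556714ULL);   /* older xid -> assertion fires */
          return 0;
      }

      Compiled and run, this prints the same "Unexpected xid ... vs. last_xid ..." pair as the log and then aborts, the user-space analogue of the LBUG and kernel panic above.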

    Attachments

    Issue Links

    Activity

            niu Niu Yawei (Inactive) added a comment -

            Btw: even with that LBUG section commented out, it does not seem to resolve the problem. See https://testing.hpdd.intel.com/test_sets/15508dae-6bde-11e5-ae15-5254006e85c2

            Right, we have to make sure the replay request is handled properly in this case. Please see my new patch at: http://review.whamcloud.com/#/c/16759/

            di.wang Di Wang (Inactive) added a comment - - edited

            rq_allow_replay is only for read-only requests (no replay needed), which means such a request can be sent even before the import is FULL (i.e. during the recovery process). Hmm, did you see a replay request with rq_allow_replay set?

            Btw: even with that LBUG section commented out, it does not seem to resolve the problem. See https://testing.hpdd.intel.com/test_sets/15508dae-6bde-11e5-ae15-5254006e85c2
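
            To make the rq_allow_replay point concrete, here is a toy sketch (not Lustre code; the enum, struct and function names are invented, only rq_allow_replay and the FULL import state come from the comment) of a send gate where only requests flagged allow_replay may go out while the import is still recovering:

            /* Toy sketch, not Lustre code: while the import is recovering,
             * only requests flagged allow_replay (read-only, nothing to
             * replay) are allowed out; everything else waits for FULL. */
            #include <assert.h>
            #include <stdbool.h>

            enum import_state_model { IMPORT_RECOVERING, IMPORT_FULL };

            struct req_model {
                bool allow_replay;              /* stands in for rq_allow_replay */
            };

            static bool may_send(enum import_state_model state, const struct req_model *req)
            {
                return state == IMPORT_FULL || req->allow_replay;
            }

            int main(void)
            {
                struct req_model readonly_req = { .allow_replay = true };
                struct req_model normal_req   = { .allow_replay = false };

                assert(may_send(IMPORT_RECOVERING, &readonly_req));  /* may be sent early */
                assert(!may_send(IMPORT_RECOVERING, &normal_req));   /* must wait */
                assert(may_send(IMPORT_FULL, &normal_req));
                return 0;
            }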

            niu Niu Yawei (Inactive) added a comment -

            hmm, I don't quite understand the case with OUT.. if it's replays, then why is the initiator sending any requests out of order? That would be bad, because it can consume a transno which should be reserved for replays.

            I'm not sure either. Di, any comments?

            bzzz Alex Zhuravlev added a comment - - edited

            hmm, I don't quite understand the case with OUT.. if it's replays, then why is the initiator sending any requests out of order? That would be bad, because it can consume a transno which should be reserved for replays.
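
            The transno concern can be pictured with an equally small toy model (not Lustre code; just a counter, under the assumption that replays reuse their original transaction numbers while new requests are assigned the next one):

            /* Toy model, not Lustre code: a non-replay request handled while
             * replays are still outstanding consumes the next transno, which
             * a pending replay may still need to reuse. */
            #include <stdint.h>
            #include <stdio.h>

            struct tgt_model {
                uint64_t last_transno;          /* highest transno handed out so far */
            };

            static uint64_t assign_transno(struct tgt_model *tgt)
            {
                return ++tgt->last_transno;     /* new, non-replay requests get the next number */
            }

            int main(void)
            {
                struct tgt_model tgt = { .last_transno = 100 };
                uint64_t pending_replay_transno = 101;      /* a replay expects to reuse 101 */

                uint64_t taken = assign_transno(&tgt);      /* stray request during recovery */
                if (taken == pending_replay_transno)
                    printf("transno %llu consumed before the replay could reuse it\n",
                           (unsigned long long)taken);
                return 0;
            }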

            niu Niu Yawei (Inactive) added a comment -

            I think it is because the OUT requests are "rq_allow_replay" (they can be sent during replay), so a normal OUT request can be sent to the server and bump exp_last_xid, and the LBUG is then triggered when a replayed OUT request is processed later.

            Removing the improper LBUG() looks like a viable approach to me.
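
            Putting those two pieces together, a compact sketch of the sequence described here (illustrative only; the xid values are the ones from the log above):

            /* Toy end-to-end sequence, not Lustre code: during recovery a
             * normal OUT request slips through (allow_replay) and bumps the
             * export's last_xid; the replayed OUT request then arrives with
             * its original, older xid and fails the last_xid check. */
            #include <assert.h>
            #include <stdint.h>

            struct export_model {
                uint64_t last_xid;
            };

            /* returns 0 on success, -1 where the server code hits the LBUG */
            static int check_last_xid(struct export_model *exp, uint64_t req_xid)
            {
                if (req_xid <= exp->last_xid)
                    return -1;
                exp->last_xid = req_xid;
                return 0;
            }

            int main(void)
            {
                struct export_model exp = { .last_xid = 0 };

                /* 1. normal OUT request allowed during recovery via allow_replay */
                assert(check_last_xid(&exp, 0x560f8bf55672bULL) == 0);

                /* 2. replayed OUT request carrying its original, older xid */
                assert(check_last_xid(&exp, 0x560f8bf556714ULL) == -1);
                return 0;
            }
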
            pjones Peter Jones added a comment -

            Reverting LU-5951 to avoid this regression

            jamesanunez James Nunez (Inactive) added a comment - - edited

            I'm seeing the same LBUG in replay-single test 70b in review-dne-part-2 test sessions. Logs are at https://testing.hpdd.intel.com/test_sets/431bb092-6b20-11e5-94a7-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/5d6b06a8-6a19-11e5-9fbf-5254006e85c2
            2015-10-05 18:52:03 - https://testing.hpdd.intel.com/test_sets/b118562c-6bd9-11e5-ae45-5254006e85c2

            LBUG reported in failure of replay-single test 70d in review-dne-part-2. Logs at https://testing.hpdd.intel.com/test_sets/8aac68c4-6a18-11e5-9d0a-5254006e85c2

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/16725
            Subject: LU-7252 ptlrpc: remove debug LBUG in process_req_last_xid()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fa5d9aca33a83016e836893790ad414711c613c5
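
            The subject line is about removing the debug LBUG; in terms of the toy model above, one plausible shape of such a change (purely hypothetical, not the actual content of http://review.whamcloud.com/16725) would be to downgrade the assertion to an error return and let the caller fail the request:

            /* Hypothetical sketch only -- not the actual patch.  The check
             * still detects the stale xid, but reports it and returns an
             * error instead of asserting. */
            #include <stdint.h>
            #include <stdio.h>

            struct export_model {
                uint64_t last_xid;
            };

            static int process_req_last_xid_model(struct export_model *exp, uint64_t req_xid)
            {
                if (req_xid <= exp->last_xid) {
                    fprintf(stderr, "unexpected xid %llx vs. last_xid %llx\n",
                            (unsigned long long)req_xid,
                            (unsigned long long)exp->last_xid);
                    return -1;                  /* previously: LBUG() */
                }
                exp->last_xid = req_xid;
                return 0;
            }

            int main(void)
            {
                struct export_model exp = { .last_xid = 0x560f8bf55672bULL };

                /* the stale replay is now rejected instead of panicking the MDT */
                return process_req_last_xid_model(&exp, 0x560f8bf556714ULL) == -1 ? 0 : 1;
            }

            As the comments above note, dropping the LBUG alone does not make the replayed request succeed; it only turns a panic into a failed request, which is why the follow-up patch 16759 handles the replay properly.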

            di.wang Di Wang (Inactive) added a comment -

            Hmm, this seems to be caused by patch http://review.whamcloud.com/15473. Niu, could you please have a look? Thanks.

    People

      Assignee: jhammond John Hammond
      Reporter: maloo Maloo
      Votes: 0
      Watchers: 9

    Dates

      Created:
      Updated:
      Resolved: