Details
- Type: Bug
- Resolution: Duplicate
- Priority: Blocker
- Version: Lustre 2.8.0
Description
This issue was created by maloo for wangdi <di.wang@intel.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f670098e-6a05-11e5-9d0a-5254006e85c2.
The sub-test test_116b failed with the following error:
10:13:07:Lustre: DEBUG MARKER: == replay-single test 116b: large update log slave MDT recovery == 10:12:46 (1443867166)
10:13:07:Lustre: DEBUG MARKER: sync; sync; sync
10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 notransno
10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 readonly
10:13:07:Turning device dm-1 (0xfd00001) read-only
10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
10:13:07:Lustre: DEBUG MARKER: mds2 REPLAY BARRIER on lustre-MDT0001
10:13:07:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80001702
10:13:07:Lustre: DEBUG MARKER: grep -c /mnt/mds2' ' /proc/mounts
10:13:07:Lustre: DEBUG MARKER: umount -d /mnt/mds2
10:13:07:Removing read-only on unknown block (0xfd00001)
10:13:07:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
10:13:07:Lustre: DEBUG MARKER: hostname
10:13:07:Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P2
10:13:07:Lustre: DEBUG MARKER: mkdir -p /mnt/mds2; mount -t lustre /dev/lvm-Role_MDS/P2 /mnt/mds2
10:13:07:LDISKFS-fs (dm-1): recovery complete
10:13:07:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts:
10:13:07:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
10:13:07:Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
10:13:07:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P2 2>/dev/null
10:13:07:LustreError: 30838:0:(tgt_handler.c:621:process_req_last_xid()) @@@ Unexpected xid 560f8bf556714 vs. last_xid 560f8bf55672b
10:13:07: req@ffff880059530c50 x1513996361754388/t0(158913790082) o1000->lustre-MDT0000-mdtlov_UUID@10.1.5.33@tcp:666/0 lens 4168/0 e 0 to 0 dl 1443867196 ref 1 fl Interpret:/4/ffffffff rc 0/-1
10:13:07:LustreError: 30838:0:(tgt_handler.c:636:process_req_last_xid()) LBUG
10:13:07:Pid: 30838, comm: mdt_out00_000
10:13:07:
10:13:07:Call Trace:
10:13:07: [<ffffffffa049b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
10:13:07: [<ffffffffa049be77>] lbug_with_loc+0x47/0xb0 [libcfs]
10:13:07: [<ffffffffa08859bd>] tgt_request_handle+0x105d/0x13f0 [ptlrpc]
10:13:07: [<ffffffffa082cd61>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
10:13:07: [<ffffffffa082bf20>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
10:13:07: [<ffffffff810a101e>] kthread+0x9e/0xc0
10:13:07: [<ffffffff8100c28a>] child_rip+0xa/0x20
10:13:07: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
10:13:07: [<ffffffff8100c280>] ? child_rip+0x0/0x20
10:13:07:
10:13:07:Kernel panic - not syncing: LBUG
10:13:07:Pid: 30838, comm: mdt_out00_000 Not tainted 2.6.32-573.3.1.el6_lustre.g00880a0.x86_64 #1
10:13:07:Call Trace:
10:13:07: [<ffffffff815384e4>] ? panic+0xa7/0x16f
10:13:07: [<ffffffffa049becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
10:13:07: [<ffffffffa08859bd>] ? tgt_request_handle+0x105d/0x13f0 [ptlrpc]
10:13:07: [<ffffffffa082cd61>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
10:13:07: [<ffffffffa082bf20>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
10:13:07: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
10:13:07: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
10:13:07: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
10:13:07: [<ffffffff8100c280>] ? child_rip+0x0/0x20
10:13:07:Initializing cgroup subsys cpuset
10:13:07:Initializing cgroup subsys cpu
Please provide additional information about the failure here.
Info required for matching: replay-single 116b
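For context, the LBUG above fires in an xid-ordering sanity check on the server side: the target remembers the highest xid it has already processed for a client export and refuses a new (non-resent, non-replay) request whose xid does not advance past it. Below is a minimal C sketch of that kind of check, using simplified, hypothetical names; it is not the actual tgt_handler.c implementation.

/*
 * Illustrative sketch only (hypothetical names, simplified types); not the
 * actual Lustre tgt_handler.c code. It shows the kind of last_xid ordering
 * check whose failure is reported above just before LBUG().
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct client_export {
	uint64_t last_xid;	/* highest xid already processed for this client */
};

static void check_req_last_xid(struct client_export *exp, uint64_t req_xid,
			       bool resent_or_replay)
{
	/* Resent/replayed requests legitimately reuse an old xid. */
	if (resent_or_replay)
		return;

	if (req_xid <= exp->last_xid) {
		fprintf(stderr, "Unexpected xid %#jx vs. last_xid %#jx\n",
			(uintmax_t)req_xid, (uintmax_t)exp->last_xid);
		abort();	/* stands in for LBUG() */
	}

	exp->last_xid = req_xid;
}

int main(void)
{
	struct client_export exp = { .last_xid = 0x560f8bf55672b };

	/* A request arriving with an xid below last_xid trips the check,
	 * as in the log above. */
	check_req_last_xid(&exp, 0x560f8bf556714, false);
	return 0;
}

In the failing run the rejected request appears to be an o1000 (OUT update) RPC carrying a transno, i.e. one being replayed during MDT0001 recovery, which is what the rq_allow_replay discussion in the comment below is about.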
Attachments
Issue Links
- is related to: LU-5951 sanity test_39k: mtime is lost on close (Resolved)
Comments
rq_allow_replay is only for read-only requests (which do not need to be replayed), which means such a request can be sent even before the import is FULL (i.e. during the recovery process). Hmm, did you see a replay request with rq_allow_replay set?
Btw: even with that LBUG section commented out, the problem does not seem to be resolved. See https://testing.hpdd.intel.com/test_sets/15508dae-6bde-11e5-ae15-5254006e85c2
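As a rough illustration of the rq_allow_replay behaviour described in the comment above (a read-only request being allowed out before the import reaches FULL), here is a small sketch; apart from the rq_allow_replay flag and the FULL state it uses hypothetical names, and it is not the real ptlrpc send path.

/*
 * Illustrative sketch only; not the real ptlrpc code. It models the idea
 * that a request flagged rq_allow_replay (a read-only request that never
 * needs to be replayed itself) may be sent while the import is still
 * recovering, whereas ordinary requests must wait for IMP_FULL.
 */
#include <stdbool.h>

enum imp_state {
	IMP_DISCON,
	IMP_REPLAY,		/* recovery in progress */
	IMP_FULL,		/* import fully connected/recovered */
};

struct sketch_request {
	bool rq_allow_replay;	/* read-only request, may bypass the wait */
};

static bool may_send_during_recovery(const struct sketch_request *req,
				     enum imp_state state)
{
	if (state == IMP_FULL)
		return true;			/* normal case */

	/* Only requests explicitly marked rq_allow_replay are let through
	 * before recovery completes. */
	return req->rq_allow_replay;
}

The question in the comment is whether an actual replay request (one carrying a transno, as in the log) ever shows up with this flag set, which per the comment it should not need.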