[LU-7252] replay-single test_116b: process_req_last_xid() LBUG Created: 04/Oct/15  Updated: 08/Oct/15  Resolved: 06/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: John Hammond
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5951 sanity test_39k: mtime is lost on close Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f670098e-6a05-11e5-9d0a-5254006e85c2.

The sub-test test_116b failed with the following error:

10:13:07:Lustre: DEBUG MARKER: == replay-single test 116b: large update log slave MDT recovery == 10:12:46 (1443867166)
10:13:07:Lustre: DEBUG MARKER: sync; sync; sync
10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 notransno
10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 readonly
10:13:07:Turning device dm-1 (0xfd00001) read-only
10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
10:13:07:Lustre: DEBUG MARKER: mds2 REPLAY BARRIER on lustre-MDT0001
10:13:07:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80001702
10:13:07:Lustre: DEBUG MARKER: grep -c /mnt/mds2' ' /proc/mounts
10:13:07:Lustre: DEBUG MARKER: umount -d /mnt/mds2
10:13:07:Removing read-only on unknown block (0xfd00001)
10:13:07:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
10:13:07:Lustre: DEBUG MARKER: hostname
10:13:07:Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P2
10:13:07:Lustre: DEBUG MARKER: mkdir -p /mnt/mds2; mount -t lustre   		                   /dev/lvm-Role_MDS/P2 /mnt/mds2
10:13:07:LDISKFS-fs (dm-1): recovery complete
10:13:07:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts: 
10:13:07:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
10:13:07:Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
10:13:07:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P2 2>/dev/null
10:13:07:LustreError: 30838:0:(tgt_handler.c:621:process_req_last_xid()) @@@ Unexpected xid 560f8bf556714 vs. last_xid 560f8bf55672b
10:13:07:  req@ffff880059530c50 x1513996361754388/t0(158913790082) o1000->lustre-MDT0000-mdtlov_UUID@10.1.5.33@tcp:666/0 lens 4168/0 e 0 to 0 dl 1443867196 ref 1 fl Interpret:/4/ffffffff rc 0/-1
10:13:07:LustreError: 30838:0:(tgt_handler.c:636:process_req_last_xid()) LBUG
10:13:07:Pid: 30838, comm: mdt_out00_000
10:13:07:
10:13:07:Call Trace:
10:13:07: [<ffffffffa049b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
10:13:07: [<ffffffffa049be77>] lbug_with_loc+0x47/0xb0 [libcfs]
10:13:07: [<ffffffffa08859bd>] tgt_request_handle+0x105d/0x13f0 [ptlrpc]
10:13:07: [<ffffffffa082cd61>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
10:13:07: [<ffffffffa082bf20>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
10:13:07: [<ffffffff810a101e>] kthread+0x9e/0xc0
10:13:07: [<ffffffff8100c28a>] child_rip+0xa/0x20
10:13:07: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
10:13:07: [<ffffffff8100c280>] ? child_rip+0x0/0x20
10:13:07:
10:13:07:Kernel panic - not syncing: LBUG
10:13:07:Pid: 30838, comm: mdt_out00_000 Not tainted 2.6.32-573.3.1.el6_lustre.g00880a0.x86_64 #1
10:13:07:Call Trace:
10:13:07: [<ffffffff815384e4>] ? panic+0xa7/0x16f
10:13:07: [<ffffffffa049becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
10:13:07: [<ffffffffa08859bd>] ? tgt_request_handle+0x105d/0x13f0 [ptlrpc]
10:13:07: [<ffffffffa082cd61>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
10:13:07: [<ffffffffa082bf20>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
10:13:07: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
10:13:07: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
10:13:07: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
10:13:07: [<ffffffff8100c280>] ? child_rip+0x0/0x20
10:13:07:Initializing cgroup subsys cpuset
10:13:07:Initializing cgroup subsys cpu

Please provide additional information about the failure here.

Info required for matching: replay-single 116b



 Comments   
Comment by Di Wang [ 05/Oct/15 ]

Hmm, this seems to be caused by the patch http://review.whamcloud.com/15473. Niu, could you please have a look? Thanks.

Comment by Gerrit Updater [ 05/Oct/15 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/16725
Subject: LU-7252 ptlrpc: remove debug LBUG in process_req_last_xid()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fa5d9aca33a83016e836893790ad414711c613c5

Comment by James Nunez (Inactive) [ 05/Oct/15 ]

I'm seeing the same LBUG in replay-single test 70b in review-dne-part-2 test sessions. Logs are at https://testing.hpdd.intel.com/test_sets/431bb092-6b20-11e5-94a7-5254006e85c2
https://testing.hpdd.intel.com/test_sets/5d6b06a8-6a19-11e5-9fbf-5254006e85c2
2015-10-05 18:52:03 - https://testing.hpdd.intel.com/test_sets/b118562c-6bd9-11e5-ae45-5254006e85c2

LBUG reported in failure of replay-single test 70d in review-dne-part-2. Logs at https://testing.hpdd.intel.com/test_sets/8aac68c4-6a18-11e5-9d0a-5254006e85c2

Comment by Peter Jones [ 06/Oct/15 ]

Reverting LU-5951 to avoid this regression

Comment by Niu Yawei (Inactive) [ 08/Oct/15 ]

I think it is because the OUT requests are marked "rq_allow_replay" (they can be sent during replay), so a normal OUT request can reach the server and bump exp_last_xid, and the LBUG is then triggered when a replay OUT request is processed later.

Removing the improper LBUG() looks like a viable approach to me.
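
For illustration only, here is a minimal standalone sketch of the xid ordering check described above; the names and fields are simplified approximations, not the actual tgt_handler.c code.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_export {
    uint64_t exp_last_xid;      /* highest xid seen from this client */
};

/* The server expects each new request xid to be above exp_last_xid. */
static void check_req_last_xid(struct fake_export *exp, uint64_t req_xid)
{
    if (req_xid <= exp->exp_last_xid) {
        /* A replay carrying an old xid lands here if a newer
         * rq_allow_replay request already bumped exp_last_xid,
         * which is the scenario described in the comment above. */
        fprintf(stderr, "Unexpected xid %llx vs. last_xid %llx\n",
                (unsigned long long)req_xid,
                (unsigned long long)exp->exp_last_xid);
        abort();                /* stands in for LBUG() */
    }
    exp->exp_last_xid = req_xid;
}

int main(void)
{
    struct fake_export exp = { .exp_last_xid = 0 };

    /* A normal OUT request sent during recovery bumps last_xid first, */
    check_req_last_xid(&exp, 0x560f8bf55672bULL);
    /* then an older replay OUT request arrives and trips the check.   */
    check_req_last_xid(&exp, 0x560f8bf556714ULL);
    return 0;
}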

Comment by Alex Zhuravlev [ 08/Oct/15 ]

Hmm, I don't quite understand the case with OUT. If these are replays, then why is the initiator sending any requests out of order? That would be bad, because it could consume a transno that should be reserved for replays.

Comment by Niu Yawei (Inactive) [ 08/Oct/15 ]

Hmm, I don't quite understand the case with OUT. If these are replays, then why is the initiator sending any requests out of order? That would be bad, because it could consume a transno that should be reserved for replays.

I'm not sure too. Di, any comments?

Comment by Di Wang [ 08/Oct/15 ]

rq_allow_replay is only for read-only requests (which do not need replay); it means the request can be sent even before the import is FULL (during the recovery process). Hmm, did you see a replay request set with rq_allow_replay?

Btw: even with that LBUG section commented out, it does not seem to resolve the problem. See https://testing.hpdd.intel.com/test_sets/15508dae-6bde-11e5-ae15-5254006e85c2
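
As an illustration of the rq_allow_replay behavior mentioned above, a rough sketch follows (simplified stand-in names, not the actual ptlrpc structures): only replays and requests marked rq_allow_replay are sent while the import is still recovering.

#include <stdio.h>

enum imp_state { IMP_REPLAY, IMP_RECOVER, IMP_FULL };

struct fake_request {
    int rq_allow_replay;   /* read-only request, safe to send early */
    int rq_replay;         /* queued replay of a committed operation */
};

/* Return 1 if the request must wait until the import reaches FULL. */
static int import_delay_req(enum imp_state state, const struct fake_request *req)
{
    if (state == IMP_FULL)
        return 0;               /* recovery done, send anything */
    if (req->rq_replay)
        return 0;               /* replays are sent during recovery */
    if (req->rq_allow_replay)
        return 0;               /* e.g. read-only OUT requests */
    return 1;                   /* everything else is delayed */
}

int main(void)
{
    struct fake_request out_ro  = { .rq_allow_replay = 1, .rq_replay = 0 };
    struct fake_request regular = { .rq_allow_replay = 0, .rq_replay = 0 };

    /* During REPLAY the read-only OUT request goes out, the regular one waits. */
    printf("out_ro delayed:  %d\n", import_delay_req(IMP_REPLAY, &out_ro));
    printf("regular delayed: %d\n", import_delay_req(IMP_REPLAY, &regular));
    return 0;
}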

Comment by Niu Yawei (Inactive) [ 08/Oct/15 ]

Btw: even with that LBUG section commented out, it does not seem to resolve the problem. See https://testing.hpdd.intel.com/test_sets/15508dae-6bde-11e5-ae15-5254006e85c2

Right, we have to make sure replay requests are handled properly in this case. Please see my new patch at: http://review.whamcloud.com/#/c/16759/
