[LU-7252] replay-single test_116b: process_req_last_xid() LBUG

| Created: | 04/Oct/15 | Updated: | 08/Oct/15 | Resolved: | 06/Oct/15 |
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | John Hammond |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f670098e-6a05-11e5-9d0a-5254006e85c2.

The sub-test test_116b failed with the following error:

10:13:07:Lustre: DEBUG MARKER: == replay-single test 116b: large update log slave MDT recovery == 10:12:46 (1443867166)
10:13:07:Lustre: DEBUG MARKER: sync; sync; sync
10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 notransno
10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 readonly
10:13:07:Turning device dm-1 (0xfd00001) read-only
10:13:07:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
10:13:07:Lustre: DEBUG MARKER: mds2 REPLAY BARRIER on lustre-MDT0001
10:13:07:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80001702
10:13:07:Lustre: DEBUG MARKER: grep -c /mnt/mds2' ' /proc/mounts
10:13:07:Lustre: DEBUG MARKER: umount -d /mnt/mds2
10:13:07:Removing read-only on unknown block (0xfd00001)
10:13:07:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
10:13:07:Lustre: DEBUG MARKER: hostname
10:13:07:Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P2
10:13:07:Lustre: DEBUG MARKER: mkdir -p /mnt/mds2; mount -t lustre /dev/lvm-Role_MDS/P2 /mnt/mds2
10:13:07:LDISKFS-fs (dm-1): recovery complete
10:13:07:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts:
10:13:07:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
10:13:07:Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
10:13:07:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P2 2>/dev/null
10:13:07:LustreError: 30838:0:(tgt_handler.c:621:process_req_last_xid()) @@@ Unexpected xid 560f8bf556714 vs. last_xid 560f8bf55672b
10:13:07: req@ffff880059530c50 x1513996361754388/t0(158913790082) o1000->lustre-MDT0000-mdtlov_UUID@10.1.5.33@tcp:666/0 lens 4168/0 e 0 to 0 dl 1443867196 ref 1 fl Interpret:/4/ffffffff rc 0/-1
10:13:07:LustreError: 30838:0:(tgt_handler.c:636:process_req_last_xid()) LBUG
10:13:07:Pid: 30838, comm: mdt_out00_000
10:13:07:
10:13:07:Call Trace:
10:13:07: [<ffffffffa049b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
10:13:07: [<ffffffffa049be77>] lbug_with_loc+0x47/0xb0 [libcfs]
10:13:07: [<ffffffffa08859bd>] tgt_request_handle+0x105d/0x13f0 [ptlrpc]
10:13:07: [<ffffffffa082cd61>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
10:13:07: [<ffffffffa082bf20>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
10:13:07: [<ffffffff810a101e>] kthread+0x9e/0xc0
10:13:07: [<ffffffff8100c28a>] child_rip+0xa/0x20
10:13:07: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
10:13:07: [<ffffffff8100c280>] ? child_rip+0x0/0x20
10:13:07:
10:13:07:Kernel panic - not syncing: LBUG
10:13:07:Pid: 30838, comm: mdt_out00_000 Not tainted 2.6.32-573.3.1.el6_lustre.g00880a0.x86_64 #1
10:13:07:Call Trace:
10:13:07: [<ffffffff815384e4>] ? panic+0xa7/0x16f
10:13:07: [<ffffffffa049becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
10:13:07: [<ffffffffa08859bd>] ? tgt_request_handle+0x105d/0x13f0 [ptlrpc]
10:13:07: [<ffffffffa082cd61>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
10:13:07: [<ffffffffa082bf20>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
10:13:07: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
10:13:07: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
10:13:07: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
10:13:07: [<ffffffff8100c280>] ? child_rip+0x0/0x20
10:13:07:Initializing cgroup subsys cpuset
10:13:07:Initializing cgroup subsys cpu

Please provide additional information about the failure here.

Info required for matching: replay-single 116b |
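For context, here is a minimal standalone sketch of the kind of xid sanity check that fires in tgt_handler.c:process_req_last_xid(). This is illustrative only, not the actual Lustre source: the struct and function names are assumptions, and abort() stands in for LBUG(); only the error message format and the xid values are taken from the log above.

{code:c}
/* Standalone sketch of the last_xid sanity check that trips in this
 * report.  The real check lives in tgt_handler.c and calls LBUG()
 * (panicking the node); abort() stands in for it here.  All names
 * are assumptions for illustration only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct export_sim {
	uint64_t exp_last_xid;	/* highest xid the server has seen */
};

static void process_req_last_xid_sim(struct export_sim *exp, uint64_t req_xid)
{
	/* A request whose xid does not advance past the export's
	 * last_xid is considered inconsistent. */
	if (req_xid <= exp->exp_last_xid) {
		fprintf(stderr, "Unexpected xid %llx vs. last_xid %llx\n",
			(unsigned long long)req_xid,
			(unsigned long long)exp->exp_last_xid);
		abort();	/* stands in for LBUG() */
	}
	exp->exp_last_xid = req_xid;
}

int main(void)
{
	/* xids from the log: the incoming request's xid is below the
	 * export's last_xid, so the check fires. */
	struct export_sim exp = { .exp_last_xid = 0x560f8bf55672bULL };

	process_req_last_xid_sim(&exp, 0x560f8bf556714ULL);
	return 0;
}
{code}

In the failure above the check fires on the mdt_out00_000 service thread, and LBUG() panics the node, which is why the console log ends with "Kernel panic - not syncing: LBUG".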
| Comments |
| Comment by Di Wang [ 05/Oct/15 ] |
|
Hmm, this failure seems to be caused by the patch http://review.whamcloud.com/15473. Niu, could you please have a look? Thanks. |
| Comment by Gerrit Updater [ 05/Oct/15 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/16725 |
| Comment by James Nunez (Inactive) [ 05/Oct/15 ] |
|
I'm seeing the same LBUG in replay-single test 70b in review-dne-part-2 test sessions. Logs are at https://testing.hpdd.intel.com/test_sets/431bb092-6b20-11e5-94a7-5254006e85c2

The same LBUG was reported in a failure of replay-single test 70d in review-dne-part-2. Logs are at https://testing.hpdd.intel.com/test_sets/8aac68c4-6a18-11e5-9d0a-5254006e85c2 |
| Comment by Peter Jones [ 06/Oct/15 ] |
|
Reverting |
| Comment by Niu Yawei (Inactive) [ 08/Oct/15 ] |
|
I think it is because OUT requests are marked "rq_allow_replay" (they can be sent during replay), so a normal OUT request can reach the server and bump exp_last_xid, and the LBUG is then triggered when some replayed OUT request is processed later. Removing the improper LBUG() looks like a viable approach to me. |
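To illustrate the sequence described in this comment, here is a small userspace simulation (all names and values are assumed for illustration; this is not Lustre code): a normal OUT request allowed out during recovery bumps the export's last_xid first, so the replayed OUT request that follows carries an older xid and takes the LBUG path.

{code:c}
/* Userspace simulation (assumed names, not Lustre code) of the race
 * described above: an rq_allow_replay OUT request slips in during
 * recovery and advances the export's last_xid; the replayed OUT
 * request that follows has an older xid and fails the check. */
#include <stdint.h>
#include <stdio.h>

struct export_sim {
	uint64_t exp_last_xid;	/* highest xid the server has seen */
};

/* Returns 0 if the request passes the xid check, -1 if the LBUG
 * path would have been taken. */
static int handle_out_req(struct export_sim *exp, uint64_t xid,
			  const char *kind)
{
	if (xid <= exp->exp_last_xid) {
		printf("%s req xid %#llx <= last_xid %#llx -> LBUG path\n",
		       kind, (unsigned long long)xid,
		       (unsigned long long)exp->exp_last_xid);
		return -1;
	}
	exp->exp_last_xid = xid;
	printf("%s req xid %#llx accepted, last_xid now %#llx\n",
	       kind, (unsigned long long)xid,
	       (unsigned long long)exp->exp_last_xid);
	return 0;
}

int main(void)
{
	struct export_sim exp = { .exp_last_xid = 0x100 };

	/* Normal (read-only) OUT request allowed during recovery by
	 * rq_allow_replay: it bumps last_xid past the xids of the
	 * still-outstanding replay requests. */
	handle_out_req(&exp, 0x200, "normal");

	/* Replayed OUT request with an older xid now fails the check. */
	handle_out_req(&exp, 0x150, "replay");
	return 0;
}
{code}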
| Comment by Alex Zhuravlev [ 08/Oct/15 ] |
|
Hmm, I don't quite understand the case with OUT. If it's replay, then why is the initiator sending any requests out of order? That would be bad, because they could consume transnos which should be reserved for replays. |
| Comment by Niu Yawei (Inactive) [ 08/Oct/15 ] |
I'm not sure either. Di, any comments? |
| Comment by Di Wang [ 08/Oct/15 ] |
|
rq_allow_replay is only for read-only requests (which need no replay), which means such a request can be sent even before the import is FULL (i.e. during the recovery process). Hmm, did you see a replay request with rq_allow_replay set? Btw: even with that LBUG section commented out, the problem does not seem to be resolved. See https://testing.hpdd.intel.com/test_sets/15508dae-6bde-11e5-ae15-5254006e85c2 |
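For illustration, a simplified sketch of the client-side gating described in this comment (assumed names and logic, loosely modeled on ptlrpc's import-state handling rather than copied from it): only a request with rq_allow_replay set may be sent while the import has not yet reached FULL.

{code:c}
/* Simplified sketch (assumed names and logic, not the actual ptlrpc
 * code) of the client-side gating described above: a request flagged
 * rq_allow_replay, i.e. a read-only request that needs no replay, may
 * be sent before the import has recovered to the FULL state. */
#include <stdio.h>

enum imp_state { IMP_REPLAY, IMP_RECOVER, IMP_FULL };

struct req_sim {
	int rq_allow_replay;	/* read-only, no replay needed */
};

/* Returns 1 if the request must be delayed until recovery finishes. */
static int import_delay_req(enum imp_state state, const struct req_sim *req)
{
	if (state == IMP_FULL)
		return 0;			/* fully recovered: send */
	return !req->rq_allow_replay;		/* otherwise gate on flag */
}

int main(void)
{
	const struct req_sim ro = { .rq_allow_replay = 1 };
	const struct req_sim rw = { .rq_allow_replay = 0 };

	/* A read-only request goes out during recovery; a normal
	 * request must wait for the import to become FULL. */
	printf("read-only delayed during recovery: %d\n",
	       import_delay_req(IMP_RECOVER, &ro));
	printf("normal delayed during recovery:    %d\n",
	       import_delay_req(IMP_RECOVER, &rw));
	return 0;
}
{code}

Under this gating, a read-only OUT request can legitimately reach the server during recovery, which is consistent with Niu's observation above about exp_last_xid being bumped before the replays complete.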
| Comment by Niu Yawei (Inactive) [ 08/Oct/15 ] |
Right, we have to make sure replay requests are handled properly in this case. Please see my new patch at: http://review.whamcloud.com/#/c/16759/ |