[LU-1235] timeout in sanity subtest 103, unable to handle kernel paging request Created: 19/Mar/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server: 2.2-RC1-RHEL6 |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4536 |
| Description |
|
hit this issue again when doing interop test between 2.2-RC1 server and 2.1.1 RHEL6 client: |
| Comments |
| Comment by Peter Jones [ 19/Mar/12 ] |
|
Bobi, could you please comment on this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 19/Mar/12 ] |
|
MDS panic on a bad RIP in osd_trans_commit_cb(); I think a bad journal callback function address caused the panic.

20:01:48:BUG: unable to handle kernel paging request at 0000000400000002

osd_trans_commit_cb():

static void osd_trans_commit_cb(struct journal_callback *jcb, int error)
{
        struct osd_thandle *oh = container_of0(jcb, struct osd_thandle, ot_jcb);
        struct thandle *th = &oh->ot_super;
        struct lu_device *lud = &th->th_dev->dd_lu_dev;
        struct dt_txn_commit_cb *dcb, *tmp;

        LASSERT(oh->ot_handle == NULL);

        if (error)
                CERROR("transaction @0x%p commit error: %d\n", th, error);

        dt_txn_hook_commit(th);

        /* call per-transaction callbacks if any */
        cfs_list_for_each_entry_safe(dcb, tmp, &oh->ot_dcb_list, dcb_linkage)
                dcb->dcb_func(NULL, th, dcb, error); // ===========> BAD RIP <==============

        lu_ref_del_at(&lud->ld_reference, oh->ot_dev_link, "osd-tx", th);
        lu_device_put(lud);
        th->th_dev = NULL;
        lu_context_exit(&th->th_ctx);
        lu_context_fini(&th->th_ctx);
        OBD_FREE_PTR(oh);
} |
| Comment by Sarah Liu [ 26/Mar/12 ] |
|
Got this error again in RC2 testing, server/client: RHEL6-ofed, https://maloo.whamcloud.com/test_sets/175f9a26-770f-11e1-a169-5254004bbbd3 |
| Comment by Zhenyu Xu [ 28/Mar/12 ] |
|
Sarah, would you mind loading this patch http://review.whamcloud.com/2394 and trying to hit the issue again? |
| Comment by Sarah Liu [ 28/Mar/12 ] |
|
Sure, will keep you updated. |
| Comment by Zhenyu Xu [ 10/Apr/12 ] |
|
crash> dis osd_trans_commit_cb+0x79
crash> dis osd_trans_commit_cb
0xffffffffa0b1dcd6 <osd_trans_commit_cb+118>: callq *0x10(%rax)

It looks like the list was corrupted. |
| Comment by Zhenyu Xu [ 11/Apr/12 ] |
0000000000000c60 <osd_trans_commit_cb>:
osd_trans_commit_cb():
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:546
c60: 55 push %rbp
c61: 48 89 e5 mov %rsp,%rbp
c64: 41 57 push %r15
c66: 41 56 push %r14
c68: 41 55 push %r13
c6a: 41 54 push %r12
c6c: 53 push %rbx
c6d: 48 83 ec 08 sub $0x8,%rsp
c71: e8 00 00 00 00 callq c76 <osd_trans_commit_cb+0x16>
__container_of():
BUILD/BUILD/lustre-2.2.50/libcfs/include/libcfs/libcfs.h:321
c76: 48 81 fe 00 f0 ff ff cmp $0xfffffffffffff000,%rsi
osd_trans_commit_cb():
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:546
c7d: 41 89 d4 mov %edx,%r12d
__container_of():
BUILD/BUILD/lustre-2.2.50/libcfs/include/libcfs/libcfs.h:321
c80: 0f 87 ac 01 00 00 ja e32 <osd_trans_commit_cb+0x1d2>
c86: 48 85 f6 test %rsi,%rsi
c89: 0f 84 a3 01 00 00 je e32 <osd_trans_commit_cb+0x1d2>
BUILD/BUILD/lustre-2.2.50/libcfs/include/libcfs/libcfs.h:324
c8f: 48 8d 5e a8 lea -0x58(%rsi),%rbx
osd_trans_commit_cb():
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:552
c93: 48 83 7b 50 00 cmpq $0x0,0x50(%rbx)
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:549
c98: 4c 8b 2b mov (%rbx),%r13
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:552
c9b: 0f 85 55 01 00 00 jne df6 <osd_trans_commit_cb+0x196>
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:554
ca1: 45 85 e4 test %r12d,%r12d
ca4: 0f 85 e6 00 00 00 jne d90 <osd_trans_commit_cb+0x130>
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:557
caa: 48 89 df mov %rbx,%rdi
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:560
cad: 4c 8d 7b 70 lea 0x70(%rbx),%r15
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:557
cb1: e8 00 00 00 00 callq cb6 <osd_trans_commit_cb+0x56>
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:560
cb6: 48 8b 43 70 mov 0x70(%rbx),%rax
cba: 4c 39 f8 cmp %r15,%rax
cbd: 4c 8b 30 mov (%rax),%r14
cc0: 75 09 jne ccb <osd_trans_commit_cb+0x6b>
cc2: eb 20 jmp ce4 <osd_trans_commit_cb+0x84>
cc4: 0f 1f 40 00 nopl 0x0(%rax)
cc8: 49 89 d6 mov %rdx,%r14
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:561
ccb: 48 89 c2 mov %rax,%rdx
cce: 31 ff xor %edi,%edi
cd0: 44 89 e1 mov %r12d,%ecx
cd3: 48 89 de mov %rbx,%rsi
cd6: ff 50 10 callq *0x10(%rax)
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:560
cd9: 4d 39 fe cmp %r15,%r14 =======> <osd_trans_commit_cb+0x79>
cdc: 49 8b 16 mov (%r14),%rdx
cdf: 4c 89 f0 mov %r14,%rax
ce2: 75 e4 jne cc8 <osd_trans_commit_cb+0x68>
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:567
ce4: 4c 8d 63 10 lea 0x10(%rbx),%r12
BUILD/BUILD/lustre-2.2.50/lustre/osd-ldiskfs/osd_handler.c:564
ce8: 4c 89 ef mov %r13,%rdi
ceb: e8 00 00 00 00 callq cf0 <osd_trans_commit_cb+0x90>
...
|
| Comment by Zhenyu Xu [ 11/Apr/12 ] |
|
Tappro, does it relate to |
| Comment by Mikhail Pershin [ 11/Apr/12 ] |
|
yes, this code was added with |
| Comment by Zhenyu Xu [ 11/Apr/12 ] |
|
Unfortunately the MDS panicked due to the bad memory access, and no logs have been collected since. |
| Comment by Niu Yawei (Inactive) [ 13/Apr/12 ] |
|
Hi, tappro. In the patch for

/* if can't add callback, do sync write */
txn->th_sync = !!lut_last_commit_cb_add(txn, &mdt->mdt_lut,
                                        mti->mti_exp,
                                        mti->mti_transno);

the plain '=' can overwrite a th_sync flag that was already set. I think we need to open a new ticket for this defect. |
| Comment by Mikhail Pershin [ 13/Apr/12 ] |
|
Yes, '|=' should be used there so we don't drop a sync that was already requested, but instead accumulate all possible sync cases into the flag. |
| Comment by Zhenyu Xu [ 10/May/12 ] |
|
The 'txn->th_sync |= !!lut_last_commit_cb_add' patch (http://review.whamcloud.com/2530) has been landed on master. |
| Comment by Peter Jones [ 10/May/12 ] |
|
ok then let's mark this as resolved and reopen if it is seen with code since that April 29th landing under |
| Comment by Peter Jones [ 11/May/12 ] |
|
Hmm. I just realized that sanity is still failing for the 2.2.52 tag which contains the fix you mentioned. Are we now experiencing a different failure? |
| Comment by Mikhail Pershin [ 11/May/12 ] |
|
Peter, the fix you mentioned is not for the root cause but for a side issue. The |
| Comment by Peter Jones [ 11/May/12 ] |
|
ok so what are the next steps for the central issue? |
| Comment by Andreas Dilger [ 31/May/12 ] |
|
This still failed 3 times in the last 2 weeks (about 7% of runs according to Maloo): https://maloo.whamcloud.com/sub_tests/bd350a56-a1fa-11e1-abdc-52540035b04c I've resubmitted the build of the original debugging patch submitted in March. |
| Comment by Sarah Liu [ 11/Jun/12 ] |
|
another failure: https://maloo.whamcloud.com/test_sets/353c939e-b1db-11e1-bb61-52540035b04c |
| Comment by Andreas Dilger [ 22/Jun/12 ] |
|
This is being hit in a reported 27% of test runs: https://maloo.whamcloud.com/test_sets/19a4c974-bbf1-11e1-95bf-52540035b04c |
| Comment by Andreas Dilger [ 28/Jun/12 ] |
|
Bobijam, any progress on this bug? |
| Comment by Zhenyu Xu [ 28/Jun/12 ] |
|
Not yet; another debugging patch is in the review phase (http://review.whamcloud.com/#change,2394). |
| Comment by Peter Jones [ 30/Jul/12 ] |
|
Latest diagnostic patch is landed for next tag. |
| Comment by Sarah Liu [ 07/Aug/12 ] |
|
In the latest tag 2.2.92, subtest 103 passed on both RHEL5 and RHEL6 client https://maloo.whamcloud.com/test_sets/64843e64-e0d3-11e1-a388-52540035b04c |
| Comment by Peter Jones [ 07/Aug/12 ] |
|
ok then let's drop this from being a blocker unless it reoccurs and we are able to gather the diagnostic information from the logs. |
| Comment by Jian Yu [ 13/Aug/12 ] |
|
Lustre Clients: v2_1_3_RC1 Lustre Servers: 2.2.0 The same issue occurred: https://maloo.whamcloud.com/test_sets/bc40a18e-e384-11e1-b6d3-52540035b04c |
| Comment by Zhenyu Xu [ 13/Aug/12 ] |
|
I'll port the debugging patch to b2_2 |
| Comment by Zhenyu Xu [ 13/Aug/12 ] |
|
b2_2 patch port tracking at http://review.whamcloud.com/3615 |
| Comment by Sarah Liu [ 27/Sep/12 ] |
|
server: 2.2.0 RHEL6 https://maloo.whamcloud.com/test_sets/4bddaaee-0806-11e2-b8a8-52540035b04c |
| Comment by Jian Yu [ 08/Oct/12 ] |
|
Lustre Client Build: http://build.whamcloud.com/job/lustre-b2_3/28 The same issue occurred: https://maloo.whamcloud.com/test_sets/e151ca0a-0e2e-11e2-91a3-52540035b04c As per Peter, we don't have any plans to land anything to b2_2 at this time. We can add a Lustre version check in the b2_3 and master test suites to skip the test, as we did in |
| Comment by Ann Koehler (Inactive) [ 24/Jul/14 ] |
|
Just in case this helps anyone else: we hit the MDS panic in jbd2/dm-0-8 reported above with b2_2. We tracked the root cause to |