[LU-2615] group of OSS crashed at umount Created: 14/Jan/13 Updated: 17/Dec/15 Resolved: 17/Dec/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexandre Louvet | Assignee: | Hongchao Zhang |
| Resolution: | Done | Votes: | 0 |
| Labels: | ptr | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6118 |
| Description |
|
We have 4 OSSes that crashed at the same time, at umount, with the following backtrace: PID: 18173 TASK: ffff8803376dc040 CPU: 4 COMMAND: "umount" This backtrace is identical to the one shown. The site is classified so I cannot upload the binary crash dump, but I can export the contents of some structures upon request. |
| Comments |
| Comment by Peter Jones [ 14/Jan/13 ] |
|
Hongchao, could you please look into this one? Thanks, Peter |
| Comment by Hongchao Zhang [ 20/Jan/13 ] |
|
Is the debug log from before the crash available? It will print the content of the remaining llcd and the address of the llog_ctxt, via CDEBUG(D_RPCTRACE, "Llcd (%p) at %s:%d:\n", llcd, func, line); Could you please print the contents of the two structures? Thanks! |
| Comment by Hongchao Zhang [ 21/Jan/13 ] |
|
Another possible cause of this issue is that the ptlrpc_request created in llog_send did not complete normally, so its rq_interpret_reply (llcd_interpret) was never run. Hi, what patches are you using with 2.1.3? |
| Comment by Alexandre Louvet [ 24/Jan/13 ] |
|
Here is the list of patches we have on our production machines.
For the CDEBUG log it will take a little more time. By default our production machines run with the debug filter set to 0 (none), so I do not have those traces in the dmesg log. I will have to spend some time extracting them from the crash dump. |
| Comment by Hongchao Zhang [ 24/Jan/13 ] |
|
What are the contents of the two structures "llog_ctxt" and "llog_commit_master" of the llcd? |
| Comment by Alexandre Louvet [ 11/Feb/13 ] |
|
Here is the content of llog_ctxt & llog_commit_master extracted from the crash dump :
crash> llog_ctxt 0xffff8802327f8900
struct llog_ctxt {
loc_idx = 3,
loc_gen = {
mnt_cnt = 0,
conn_cnt = 0
},
loc_obd = 0xffff88032adbc038,
loc_olg = 0xffff88032adbc2d0,
loc_exp = 0xffff8802ed751400,
loc_imp = 0x0,
loc_logops = 0xffffffffa0cb99e0,
loc_handle = 0x0,
loc_lcm = 0xffff88033614b400,
loc_llcd = 0x0,
loc_sem = {
lock = {
raw_lock = {
slock = 458759
}
},
count = 0,
wait_list = {
next = 0xffff8802327f8960,
prev = 0xffff8802327f8960
}
},
loc_refcount = {
counter = 2
},
llog_proc_cb = 0xffffffffa0c95220,
loc_flags = 2
}
crash> llog_commit_master 0xffff88033614b400
struct llog_commit_master {
lcm_flags = 4,
lcm_count = {
counter = 1
},
lcm_refcount = {
counter = 3
},
lcm_pc = {
pc_flags = 0,
pc_lock = {
raw_lock = {
slock = 65537
}
},
pc_starting = {
done = 0,
wait = {
lock = {
raw_lock = {
slock = 196611
}
},
task_list = {
next = 0xffff88033614b430,
prev = 0xffff88033614b430
}
}
},
pc_finishing = {
done = 0,
wait = {
lock = {
raw_lock = {
slock = 196611
}
},
task_list = {
next = 0xffff88033614b450,
prev = 0xffff88033614b450
}
}
},
pc_set = 0x0,
pc_name = "lcm_xxxxx1-OST0",
pc_env = {
le_ctx = {
lc_tags = 2415919112,
lc_thread = 0x0,
lc_value = 0x0,
lc_state = LCS_FINALIZED,
lc_remember = {
next = 0xffff88033614b498,
prev = 0xffff88033614b498
},
lc_version = 15,
lc_cookie = 0
},
le_ses = 0x0
},
pc_index = -1,
pc_npartners = 0,
pc_partners = 0x0,
pc_cursor = 0
},
lcm_lock = {
raw_lock = {
slock = 155453764
}
},
lcm_llcds = {
next = 0xffff88021b2c2050,
prev = 0xffff88021b2c2050
},
lcm_name = "lcm_xxxxx1-OST0002\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
}
|
| Comment by Hongchao Zhang [ 08/Mar/13 ] |
|
The lcm->lcm_llcds list is empty, but lcm->lcm_count is "1", which is very strange! |
| Comment by Sebastien Buisson (Inactive) [ 11/Mar/13 ] |
|
Hi, here are the source files requested by Hongchao. |
| Comment by Hongchao Zhang [ 12/Mar/13 ] |
|
The list "lcm_llcds" is corrupted: its "next" and "prev" values are wrong (they are not in the address region of "struct llog_commit_master"). Has the issue occurred again recently? |
| Comment by Alexandre Louvet [ 11/Apr/13 ] |
|
> Could it be memory corrupt? > does the issue occur again recently? |
| Comment by Hongchao Zhang [ 26/Apr/13 ] |
|
Hi, |
| Comment by Hongchao Zhang [ 02/May/13 ] |
|
The remaining "llcd" should already have been sent over a ptlrpc_request, since llog_ctxt->loc_llcd == NULL; the request could not finish, so "llcd_interpret" wasn't called. |
| Comment by Sebastien Buisson (Inactive) [ 16/May/13 ] |
|
Hi, it might be difficult to get the opportunity to install packages with those 2 patches reverted at the customer site. Thanks, |
| Comment by Hongchao Zhang [ 07/Jun/13 ] |
|
Hi, yes, it will disable the ptlrpcd thread pools (although it does not remove the patch completely), and it should still be a relevant test. Thanks |
| Comment by Hongchao Zhang [ 18/Jun/13 ] |
|
Hi, what is the output of the test? Thanks |
| Comment by Sebastien Buisson (Inactive) [ 19/Jun/13 ] |
|
Hi, I have asked people on site for the results of the tests. Cheers, |
| Comment by Li Xi (Inactive) [ 02/Aug/13 ] |
|
We hit the same problem on lustre-2.1.6 too. After reading through the code, I am wondering whether the following race could happen. Please correct me if I am wrong. filter_llog_finish Thanks! |
| Comment by Hongchao Zhang [ 02/Aug/13 ] |
|
What are the two threads involved in the race? Could you please attach some more info about this issue? Also, can it be reproduced at your site? |
| Comment by Li Xi (Inactive) [ 02/Aug/13 ] |
|
Hi Hongchao, sorry, maybe 'race' is not the right word to express my thought. At the time llcd_send() returns, the completion handler llcd_interpret() might not have been called yet, right? When the llcd is still in use by an RPC in flight, llog_recov_thread_stop() will hit an LBUG. I can't find any code in filter_llog_finish() which waits for the RPC to finish, so I guess it is possible that when llog_recov_thread_stop() is called, the RPC is still in flight. Am I right? Thanks |
| Comment by Hongchao Zhang [ 05/Aug/13 ] |
|
In ptlrpcd_stop, cfs_wait_for_completion(&pc->pc_finishing) is called to wait for the pending RPCs to complete! Can the issue be reproduced at your site? |
| Comment by Li Xi (Inactive) [ 05/Aug/13 ] |
|
Ah, I see. Thank you very much! |