[LU-1092] NULL pointer dereference in filter_export_stats_init() Created: 10/Feb/12 Updated: 12/Jun/12 Resolved: 29/Mar/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ned Bass | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | paj | ||
| Environment: | |||
| Severity: | 3 |
| Rank (Obsolete): | 4682 |
| Description |
|
We had three occurrences of this crash on our classified 2.1 Lustre cluster, all on OSS nodes.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
machine_kexec

The timeframe coincided with the ASSERT reported in

LustreError: 14210:0:(genops.c:1270:class_disconnect_stale_exports()) ls5-OST0349: disconnect stale client [UUID]@<unknown>

Oleg has suggested that the patch for |
| Comments |
| Comment by Peter Jones [ 10/Feb/12 ] |
|
Lai

Does this look like it would be related to

Peter |
| Comment by Bruno Faccini (Inactive) [ 02/Mar/12 ] |
|
I don't think so, since we just suffered this same crash/Oops with our Lustre version already including

BTW, first crash-dump analysis indicates a race over "((struct obd_export *)0xffff880b4b4df000)->exp_nid_stats", which was NULL'ed while the panic thread was referencing it when calling/inlining lprocfs_init_rw_stats() in filter_export_stats_init(). So there may be some more protective code to be added in lprocfs!

More debugging information to come as soon as power is back on our system. |
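For illustration, the kind of defensive check alluded to above could look like the stand-alone sketch below. The struct and function names mirror Lustre's (exp_nid_stats, filter_export_stats_init(), lprocfs_init_rw_stats()), but everything here is a simplified placeholder model, not the actual Lustre source.

```c
/* Minimal stand-alone sketch of a protective check in the stats-init path.
 * Names mirror Lustre's but the definitions are simplified placeholders. */
#include <stddef.h>
#include <stdio.h>

struct nid_stat {
    int nid_exp_ref_count;            /* placeholder field */
};

struct obd_export {
    struct nid_stat *exp_nid_stats;   /* cleared by the cleanup path */
};

/* Models lprocfs_init_rw_stats() as inlined into filter_export_stats_init().
 * The crash corresponds to dereferencing exp_nid_stats unconditionally after
 * a concurrent disconnect has already cleared it (an oops at a small offset
 * from address 0).  A defensive check bails out instead of dereferencing. */
static int stats_init_guarded(struct obd_export *exp)
{
    if (exp->exp_nid_stats == NULL)
        return -1;                    /* export already being torn down */
    return exp->exp_nid_stats->nid_exp_ref_count;
}

int main(void)
{
    struct obd_export exp = { .exp_nid_stats = NULL };   /* post-cleanup state */
    printf("guarded init returns: %d\n", stats_init_guarded(&exp));
    return 0;
}
```

Note that such a check only papers over the window; it does not remove the underlying race discussed in the following comments.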
| Comment by Bruno Faccini (Inactive) [ 15/Mar/12 ] |
|
I made some progress with the crash dumps (in fact we faced several occurrences of this crash on different OSSes). My assumption is that the race occurs when a reconnect request arrives during a huge number of client/export disconnects on multiple OSTs/targets at the same time, as suggested by the huge number of "LustreError: <tgt_recov_PID>:0:(genops.c:1270:class_disconnect_stale_exports()) <OST_name>: disconnect stale client <Client_UUID>@<unknown>" messages being printed on the console at the time of the crash.

I think the race can occur because the "stale" exports disconnect algorithm protects itself with obd_device->obd_dev_lock and then removes each obd_export from its "exp_obd_chain" list, while the [re-]connect algorithm finds the concerned obd_export via cfs_hash_lookup() using obd_device->obd_uuid_hash. This allows an obd_export reference to be taken in target_handle_connect() prior to calling filter_export_stats_init()/lprocfs_init_rw_stats(), while at the same time, on the other side/thread, the disconnect algorithm will finally, but too late, unlink the obd_export from the hash (obd_disconnect()/filter_disconnect()/server_disconnect_export()/class_disconnect()/class_unlink_export()) and also clear/NULL obd_export->exp_nid_stats via class_export_put()/lprocfs_exp_cleanup().

Does this sound like a reasonable scenario for the race? See the two-thread sketch below for one way to picture it. |
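The scenario described above can be modeled outside the kernel with two threads: a disconnect side that drops the last export reference (tearing down the stats) and a connect side that has already found the export but holds no reference of its own. The names echo Lustre's (exp_nid_stats, class_export_put(), target_handle_connect()), but the structs and functions below are simplified stand-ins under that assumption, not the real code paths.

```c
/* Stand-alone pthread model of the suspected connect/disconnect race.
 * Names echo Lustre's; the code is illustrative only. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct obd_export {
    pthread_mutex_t exp_lock;
    int             exp_refcount;
    int            *exp_nid_stats;    /* stands in for the stats structure */
};

/* Models class_export_put() -> lprocfs_exp_cleanup(): dropping the last
 * reference frees and NULLs exp_nid_stats. */
static void export_put(struct obd_export *exp)
{
    pthread_mutex_lock(&exp->exp_lock);
    if (--exp->exp_refcount == 0) {
        free(exp->exp_nid_stats);
        exp->exp_nid_stats = NULL;
    }
    pthread_mutex_unlock(&exp->exp_lock);
}

/* Disconnect side: class_disconnect_stale_exports() eventually drops the
 * reference it held, which may be the last one. */
static void *disconnect_thread(void *arg)
{
    export_put(arg);
    return NULL;
}

/* Connect side: models target_handle_connect() finding the export via the
 * UUID hash but taking no reference of its own before touching
 * exp_nid_stats in filter_export_stats_init(). */
static void *connect_thread(void *arg)
{
    struct obd_export *exp = arg;

    usleep(1000);                     /* let the disconnect win the race */
    if (exp->exp_nid_stats == NULL)
        printf("raced: stats already torn down under the connect path\n");
    else
        printf("stats value: %d\n", *exp->exp_nid_stats);
    return NULL;
}

int main(void)
{
    struct obd_export exp = {
        .exp_lock     = PTHREAD_MUTEX_INITIALIZER,
        .exp_refcount = 1,            /* only the disconnect side's reference */
    };
    pthread_t c, d;

    exp.exp_nid_stats = malloc(sizeof *exp.exp_nid_stats);
    *exp.exp_nid_stats = 42;

    pthread_create(&c, NULL, connect_thread, &exp);
    pthread_create(&d, NULL, disconnect_thread, &exp);
    pthread_join(c, NULL);
    pthread_join(d, NULL);
    return 0;
}
```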
| Comment by Lai Siyao [ 16/Mar/12 ] |
|
Bruno, your analysis looks reasonable: a disconnected export may still be found, which is wrong. IMHO, during connect, if the export is being disconnected it should not be reused; a new one should be created instead. |
| Comment by Lai Siyao [ 19/Mar/12 ] |
|
In my last comment I said connect/disconnect should be serialized; however, the real issue here is that 'connect' doesn't hold an export refcount during the process, and in the meantime another disconnect is called and puts the last refcount of the export. Though the approach I described should work, there is a simpler fix: take an export refcount in connect. I'll commit the fix later. |
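A minimal illustration of that fix, reusing the simplified two-thread model from the sketch in the earlier comment: the connect side pins the export with its own reference before touching exp_nid_stats and drops it when done, so a concurrent disconnect's final put can no longer free the stats underneath it. This is only a sketch of the idea; the actual change is the one under review in the next comment.

```c
/* Extends the earlier pthread model (same struct obd_export and export_put()).
 * export_get() stands in for class_export_get(); names are placeholders. */
static struct obd_export *export_get(struct obd_export *exp)
{
    pthread_mutex_lock(&exp->exp_lock);
    exp->exp_refcount++;              /* pin the export for our own use */
    pthread_mutex_unlock(&exp->exp_lock);
    return exp;
}

/* Fixed connect side: take a reference before using the export, so the
 * disconnect thread's put can never be the last one while we run. */
static void *connect_thread_fixed(void *arg)
{
    struct obd_export *exp = export_get(arg);   /* pin before use */

    usleep(1000);                               /* same race window as before */
    if (exp->exp_nid_stats != NULL)
        printf("stats value: %d\n", *exp->exp_nid_stats);

    export_put(exp);                            /* drop our pin when done */
    return NULL;
}
```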
| Comment by Lai Siyao [ 19/Mar/12 ] |
|
Review is on http://review.whamcloud.com/#change,2345. |
| Comment by Build Master (Inactive) [ 27/Mar/12 - 28/Mar/12 ] |
|
A series of automated build notifications, each reporting: Integrated in ... Result = SUCCESS
|
| Comment by Peter Jones [ 29/Mar/12 ] |
|
Landed for 2.3 |
| Comment by Christopher Morrone [ 19/Apr/12 ] |
|
Needed on 2.1. The ball was dropped here, it appears. |
| Comment by Build Master (Inactive) [ 20/Apr/12 - 02/May/12 ] |
|
A series of automated build notifications, each reporting: Integrated in ... Result = SUCCESS
|
| Comment by Jay Lan (Inactive) [ 11/Jun/12 ] |
|
Hi Siyao,

We here at NASA Ames had an LBUG on

That ASSERTION was mentioned in

LustreError: 6023:0:(genops.c:933:class_import_put()) ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed |
| Comment by Lai Siyao [ 12/Jun/12 ] |
|
No, IMO it's not the same issue. The backtrace shows a zombie import being destroyed again, but this LU is about an export race. At first sight, the export->exp_imp_reverse handling code in target_handle_connect() looks unsafe (not lock-protected). In fact, the code around export/import refcounting as a whole looks a bit messy; if such a panic still happens with all the patches mentioned by Oleg applied, we may need to tidy up this code as a whole. |
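As an aside, the kind of lock protection hinted at for exp_imp_reverse could look like the sketch below: swap the reverse-import pointer only while holding the export's lock, and let the caller release the old import outside the lock. This is a hypothetical stand-alone model (the struct layout, lock type, and helper name are placeholders), not a statement of how target_handle_connect() actually does or should do it.

```c
/* Simplified model of publishing exp_imp_reverse under the export's lock. */
#include <pthread.h>
#include <stddef.h>

struct obd_import;                    /* opaque for this sketch */

struct obd_export {
    pthread_mutex_t    exp_lock;
    struct obd_import *exp_imp_reverse;
};

/* Detach the old reverse import and publish the new one atomically with
 * respect to the export's lock; the old import is returned so the caller
 * can drop its reference outside the lock. */
struct obd_import *
export_swap_reverse_import(struct obd_export *exp, struct obd_import *new_imp)
{
    struct obd_import *old;

    pthread_mutex_lock(&exp->exp_lock);
    old = exp->exp_imp_reverse;
    exp->exp_imp_reverse = new_imp;
    pthread_mutex_unlock(&exp->exp_lock);

    return old;
}
```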