[LU-328] OSS pseudo-hang due to (struct filter_obd *)->fo_llog_list_lock deadlock upon OSTs warm restart/recovery Created: 16/May/11 Updated: 19/Nov/12 Resolved: 07/Jun/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0, Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexandre Louvet | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Bull PetaFlopV1.0AE1, kernel 2.6.32-71.14.1.el6, lustre 2.0.0.1 |
||
| Severity: | 3 |
| Rank (Obsolete): | 5005 |
| Description |
|
During a "warm" restart (shine stop/start) of several OSTs, the OSS became pseudo-hung (it still answered pings, but no login was possible). Analysis of a forced crash dump shows that all 32 CPUs/cores are running/spinning in 32 ll_ost_<xx> threads, all with the following stack trace:

while the thread owning the relevant (struct filter_obd *)->fo_llog_list_lock spin-lock is sleeping/waiting to be rescheduled, with the following stack trace:

So this definitely points to a bug in filter_find_create_olg(): when filter_find_olg_internal() does not find a matching struct obd_llog_group and a new one has to be kmem-allocated, the allocation can and must be done with (struct filter_obd *)->fo_llog_list_lock released, to avoid this race and deadlock scenario.

This means that the following source extract of the filter_find_create_olg() routine:

llog_group_init(olg, group);

should be replaced with:

llog_group_init(olg, group); |
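The fix described in the report follows a common kernel locking pattern: look up under the spin-lock, drop the lock for any allocation that may sleep, then retake the lock and re-check before inserting, because another thread may have added the same group in the meantime. Below is a minimal userspace sketch of that pattern under stated assumptions; it is not the code from change 556. The names struct olg, struct olg_table, spin_t, find_olg_locked() and find_create_olg() are hypothetical stand-ins for struct obd_llog_group, struct filter_obd, the kernel spinlock_t, filter_find_olg_internal() and filter_find_create_olg(), with a C11 atomic_flag busy-wait lock and malloc() standing in for the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel spinlock_t (fo_llog_list_lock).
 * Holders must never sleep while it is held. */
typedef struct {
    atomic_flag locked;
} spin_t;

static void spin_init(spin_t *s) { atomic_flag_clear(&s->locked); }

static void spin_lock(spin_t *s)
{
    /* Busy-wait, as the ll_ost_<xx> threads did in the crash dump. */
    while (atomic_flag_test_and_set_explicit(&s->locked, memory_order_acquire))
        ;
}

static void spin_unlock(spin_t *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}

/* Hypothetical stand-in for struct obd_llog_group. */
struct olg {
    int group;
    struct olg *next;
};

/* Hypothetical stand-in for the list kept in struct filter_obd. */
struct olg_table {
    spin_t lock;
    struct olg *head;
};

static void olg_table_init(struct olg_table *t)
{
    spin_init(&t->lock);
    t->head = NULL;
}

/* Caller must hold t->lock (analogue of filter_find_olg_internal()). */
static struct olg *find_olg_locked(struct olg_table *t, int group)
{
    for (struct olg *o = t->head; o != NULL; o = o->next)
        if (o->group == group)
            return o;
    return NULL;
}

/* Analogue of filter_find_create_olg(): never allocates under the lock. */
static struct olg *find_create_olg(struct olg_table *t, int group)
{
    struct olg *olg, *fresh;

    assert(t != NULL);
    spin_lock(&t->lock);
    olg = find_olg_locked(t, group);
    spin_unlock(&t->lock);
    if (olg != NULL)
        return olg;

    /* Allocate with the lock released; in the kernel this may sleep. */
    fresh = malloc(sizeof(*fresh));
    if (fresh == NULL)
        return NULL;
    fresh->group = group;

    spin_lock(&t->lock);
    /* Re-check: another thread may have inserted this group meanwhile. */
    olg = find_olg_locked(t, group);
    if (olg == NULL) {
        fresh->next = t->head;
        t->head = fresh;
        olg = fresh;
        fresh = NULL;
    }
    spin_unlock(&t->lock);

    free(fresh); /* non-NULL only if we lost the insertion race */
    return olg;
}

static void olg_table_destroy(struct olg_table *t)
{
    struct olg *o = t->head;
    while (o != NULL) {
        struct olg *next = o->next;
        free(o);
        o = next;
    }
    t->head = NULL;
}
```

The second lookup is what makes dropping the lock safe: without it, two threads racing on the same group could both insert an entry. The thread that loses the race simply frees its unused allocation.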
| Comments |
| Comment by Peter Jones [ 16/May/11 ] |
|
Hi HongChao, could you please look into this issue? Thanks, Peter |
| Comment by Hongchao Zhang [ 16/May/11 ] |
|
Hi Bruno, |
| Comment by Peter Jones [ 19/May/11 ] |
|
Please see http://review.whamcloud.com/#change,556 for details on the patch |
| Comment by Build Master (Inactive) [ 03/Jun/11 ] |
|
Integrated in Oleg Drokin : 799ed89ac8633b17f631cc13a3fb6fd90d8e958c
|
| Comment by Peter Jones [ 07/Jun/11 ] |
|
Landed for 2.1 |