[LU-369] ASSERTION(oti && oti->oti_thread && oti->oti_thread->t_watchdog) failed in quota_chk_acq_common() Created: 27/May/11 Updated: 14/Jul/11 Resolved: 14/Jul/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Patrick Valentin (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4960 |
| Description |
|
ASSERTION(oti && oti->oti_thread && oti->oti_thread->t_watchdog) failed in quota_chk_acq_common()

The following system crash was reported by the TGCC Bull customer after a controller problem occurred on a RAID array.

On April 2, at 11:00:35, the syslog reports a "link down" lasting about 6 minutes, during which a lot of I/O errors and Lustre errors on the OSTs are recorded. When the link came back up, the 15 LUNs were no longer available; the system continued to record the same errors and finally crashed at 11:25:42 (more than 40000 I/O errors). For this first crash, we do not have a dump.

The system then restarted, and the syslog shows that one of the LUNs was not detected (only 14 LUNs available). Lustre recorded the following error (19=ENODEV) for the LUN that was not detected:

Lustre then started the recovery on all the OSTs (tgt_recov thread), and logged the following error several times for the LUN that was not detected:

After approximately 1 minute the system crashed on the following assertion in quota_interface.c:451:quota_chk_acq_common():

This crash occurred 6 times, for as long as one of the OSTs was not detected. We have 3 dumps available, which all show the same stack:

PID: 19634 TASK: ffff88045728d5a0 CPU: 2 COMMAND: "tgt_recov"

In these 3 dumps, "oti" and "oti->oti_thread" are valid pointers, but "oti->oti_thread->t_watchdog" is a null pointer. The only place where this pointer is reset is in "ptlrpc_main()", when the service is stopped.

A possible fix could be to remove the LASSERT and modify the "if" statement, as is done 50 lines above, by replacing these lines:

LASSERT(oti && oti->oti_thread &&

But I'm wondering whether this is enough, or whether an additional test must be added to exit the while loop. |
| Comments |
| Comment by Peter Jones [ 27/May/11 ] |
|
Niu, could you please look into this customer issue? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 29/May/11 ] |
|
The LASSERT(oti->oti_thread->t_watchdog) here isn't correct, because quota_chk_acq_common() might be called from the recovery thread, which doesn't have any watchdog attached. What confuses me is: given that the mount failed, how did the OSS start recovery? Maybe the RAID recovered after a while, then the OSS mounted successfully and started recovery, but that part is missing from the log? Anyway, I'll post a patch to fix the LASSERT problem first. |
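For illustration only, the idea of tolerating a thread without a watchdog, rather than asserting on it, can be sketched as below. This is not the patch from the review system and not the real Lustre code: the struct definitions, the `wd_touched` counter, and the helper name `touch_watchdog_if_present()` are hypothetical stand-ins that only mirror the field names quoted in this ticket (`oti_thread`, `t_watchdog`).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the structures named in the ticket.
 * The real Lustre definitions differ and carry many more fields. */
struct watchdog {
    int wd_touched;              /* counts how often the watchdog was touched */
};

struct thread_info {
    struct watchdog *t_watchdog; /* NULL for threads without a watchdog */
};

struct trans_info {
    struct thread_info *oti_thread;
};

/* Touch the watchdog only when the calling thread actually has one.
 * Returns true if a watchdog was touched, false if the step was skipped
 * (for example, a recovery thread with no watchdog attached). */
static bool touch_watchdog_if_present(struct trans_info *oti)
{
    if (oti == NULL || oti->oti_thread == NULL ||
        oti->oti_thread->t_watchdog == NULL)
        return false;

    oti->oti_thread->t_watchdog->wd_touched++;
    return true;
}
```

With a guard of this shape, a caller such as a recovery thread simply skips the watchdog step instead of crashing the node; whether the surrounding wait loop also needs an extra exit condition is a separate question that this sketch does not address.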
| Comment by Niu Yawei (Inactive) [ 30/May/11 ] |
|
The patch is at http://review.whamcloud.com/870 |
| Comment by Build Master (Inactive) [ 14/Jul/11 ] |
|
Integrated in Oleg Drokin : d57911ad26ee6ae39738d7ed36898a915290a51f
|
| Comment by Niu Yawei (Inactive) [ 14/Jul/11 ] |
|
landed for 2.1 |