[LU-3295] ASSERTION(list_empty(&service->srv_active_rqbds)) failed Created: 08/May/13 Updated: 09/May/14 Resolved: 09/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL5: 2.6.18-348.1.1.el5_lustre |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 8161 |
| Description |
|
Recently we have been hitting an LBUG during Lustre startup. It appears to have happened before 1.8.9 as well, but it seems to be much more frequent since the upgrade. During startup our HA software attempts to mount all the resources in parallel. It looks like occasionally the OST attempts to contact the MGS before it is mounted, and we get messages like: May 7 17:36:35 lfs-oss-2-16 kernel: LustreError: 8596:0:(obd_config.c:372:class_setup()) setup OSS failed (-22) The MGS then starts, and the other OSTs on the OSS begin starting up. Then we hit the ASSERTION(list_empty(&service->srv_active_rqbds)) LBUG. My guess is that it is some kind of race between mount and umount. There is the obvious workaround of making sure that the MGS is started first, and we will do that in the future, but the HA isn't perfect and sometimes it just happens that the MGS isn't there. |
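For context on what the assertion checks, below is a minimal, self-contained C sketch of the invariant being violated. It is not Lustre source: struct my_service, my_unregister_service() and the tiny list helpers are invented for illustration, standing in for struct ptlrpc_service and the teardown path where the real LASSERT lives. The point is simply that tearing a service down while a request buffer is still active trips exactly this kind of assertion.
/* Minimal sketch (not Lustre source): my_service, my_unregister_service and
 * busy_rqbd are hypothetical names.  It only illustrates the invariant behind
 * ASSERTION(list_empty(&service->srv_active_rqbds)). */
#include <assert.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }
static void list_add(struct list_head *n, struct list_head *h)
{
        n->next = h->next; n->prev = h;
        h->next->prev = n; h->next = n;
}

struct my_service {
        struct list_head srv_active_rqbds;  /* request buffers currently in use */
};

/* Teardown path: the service may only be unregistered once every request
 * buffer has been returned; otherwise the assertion fires (the LBUG above). */
static void my_unregister_service(struct my_service *svc)
{
        assert(list_empty(&svc->srv_active_rqbds));
        printf("service torn down cleanly\n");
}

int main(void)
{
        struct my_service svc;
        struct list_head busy_rqbd;

        list_init(&svc.srv_active_rqbds);
        list_init(&busy_rqbd);

        /* Simulate a request still being processed (e.g. a replayed request
         * during recovery) while the service is being unregistered early. */
        list_add(&busy_rqbd, &svc.srv_active_rqbds);

        my_unregister_service(&svc);  /* assertion failure, analogous to the LBUG */
        return 0;
}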
| Comments |
| Comment by Peter Jones [ 09/May/13 ] |
|
Hongchao, could you please look into this one? Thanks, Peter |
| Comment by Hongchao Zhang [ 10/May/13 ] |
|
Hi, could you please attach the whole logs of this issue, especially the logs around the ASSERTION? Thanks! |
| Comment by Kit Westneat (Inactive) [ 10/May/13 ] |
|
I need to raise this to a Sev1; we are unable to start the filesystem due to these LBUGs. |
| Comment by Kit Westneat (Inactive) [ 10/May/13 ] |
|
Some more info: we were able to manually mount the OSTs and get everything working. While mounting manually, we ran into the "not registered" error on some of the OSTs. It always seemed to be the first OST to be mounted, and it seemed as if I had to lustre_rmmod in order to get it to work. We were unable to get more debugging information, but I think this helps to verify the timeline: 1) Linux HA attempts to do a parallel mount of several OSTs. Let me know if there is any more information we can gather. |
| Comment by Oleg Drokin [ 10/May/13 ] |
|
It looks like ost threads failed to start, or there were enough started already. We probably should add a bit more debug to see exactly why it failed to start those threads. --- a/lustre/ost/ost_handler.c
+++ b/lustre/ost/ost_handler.c
@@ -2109,8 +2109,10 @@ static int ost_setup(struct obd_device *obd, obd_count len, void *buf)
}
rc = ptlrpc_start_threads(obd, ost->ost_service);
- if (rc)
+ if (rc) {
+ CERROR("Failed to start ost threads %d\n", rc);
GOTO(out_service, rc = -EINVAL);
+ }
if (oss_num_create_threads) {
if (oss_num_create_threads > OSS_MAX_CREATE_THREADS)
@@ -2139,8 +2141,10 @@ static int ost_setup(struct obd_device *obd, obd_count len, void *buf)
}
rc = ptlrpc_start_threads(obd, ost->ost_create_service);
- if (rc)
+ if (rc) {
+ CERROR("Failed to start ost create threads %d\n", rc);
GOTO(out_create, rc = -EINVAL);
+ }
ost->ost_io_service =
ptlrpc_init_svc(OST_NBUFS, OST_BUFSIZE, OST_MAXREQSIZE,
@@ -2159,8 +2163,10 @@ static int ost_setup(struct obd_device *obd, obd_count len, void *buf)
ost->ost_io_service->srv_done = ost_thread_done;
ost->ost_io_service->srv_cpu_affinity = 1;
rc = ptlrpc_start_threads(obd, ost->ost_io_service);
- if (rc)
+ if (rc) {
+ CERROR("Failed to start ost io threads %d\n", rc);
GOTO(out_io, rc = -EINVAL);
+ }
ping_evictor_start();
Something like that as a first step. |
| Comment by Kit Westneat (Inactive) [ 10/May/13 ] |
|
Here are the decoded Lustre logs. There are also some from Tuesday's downtime (May 07): |
| Comment by Hongchao Zhang [ 13/May/13 ] |
|
This issue is caused by the conflict between recovery and ptlrpc_unregister_service; by launching a kernel thread to manually stop the OST service after the last replay reply is queued, the issue can be reproduced. There is one more question: what is the cause of the ost_setup failure? There is a CERROR in each possible failure place, ptlrpc_init_svc and ... Hi Kit, do you specify ost_num_threads, oss_num_threads or oss_num_create_threads when you load ost.ko (which could trigger -EMFILE)? |
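As a rough illustration of the failure chain being asked about here, the sketch below (plain userspace C, not code from the Lustre tree; OSS_MAX_THREADS, start_threads_sketch() and ost_setup_sketch() are invented names, and 512/1024 are made-up values) shows how an oversized thread-count module parameter could make thread startup return -EMFILE, causing the setup path to bail out with -EINVAL and the half-started service to be torn down again:
/* Sketch of the assumed failure chain: oversized thread-count parameter ->
 * thread startup fails -> setup returns -EINVAL -> early service teardown.
 * All names and limits here are hypothetical. */
#include <errno.h>
#include <stdio.h>

#define OSS_MAX_THREADS 512            /* assumed cap, for illustration only */

static int start_threads_sketch(int requested)
{
        if (requested > OSS_MAX_THREADS)
                return -EMFILE;        /* more threads than we can create */
        return 0;
}

static int ost_setup_sketch(int oss_num_threads)
{
        int rc = start_threads_sketch(oss_num_threads);

        if (rc) {
                fprintf(stderr, "Failed to start ost threads %d\n", rc);
                return -EINVAL;        /* same shape as GOTO(out_service, -EINVAL)
                                        * in the patch above */
        }
        return 0;
}

int main(void)
{
        /* e.g. a modprobe option requesting 1024 threads (made-up value) */
        printf("rc = %d\n", ost_setup_sketch(1024));
        return 0;
}
If one of the thread-count options really does request more threads than can be started, this is the kind of path that would lead to the early ptlrpc_unregister_service call while recovery is still in flight.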
| Comment by Kit Westneat (Inactive) [ 13/May/13 ] |
|
Hi Hongchao, It looks like we are specifying: Do you think removing this line will work around the issue? Thanks. |
| Comment by Hongchao Zhang [ 15/May/13 ] |
|
Hi Kit, sorry for the delayed response! |
| Comment by Kit Westneat (Inactive) [ 15/May/13 ] |
|
Hi Hongchao, Were there any changes between 1.8.8 (or perhaps 1.8.7) and 1.8.9 that would cause this bug to happen more often? We never used to see it until we upgraded to 1.8.9. Thanks. |
| Comment by Hongchao Zhang [ 17/May/13 ] |
|
I have checked the related code in git and found no trace of this issue. Do you use any additional patches in your 1.8.9 (especially those ported from master)? |
| Comment by Kit Westneat (Inactive) [ 17/May/13 ] |
|
We are using WC-built Lustre, no patches. The only other module option we use is options ko2iblnd peer_timeout=0. I will try to see if I can reproduce the issue on their test system. |
| Comment by Kit Westneat (Inactive) [ 21/May/13 ] |
|
Also, are there plans to write a patch to fix the conflict between the recovery and ptlrpc_unregister_service? Thanks. |
| Comment by Hongchao Zhang [ 24/May/13 ] |
|
Okay, I'll try to create a patch to fix the conflict first, but the patch may not be enough if the reason that ptlrpc_unregister_service gets triggered isn't also found and fixed. |
| Comment by Hongchao Zhang [ 30/May/13 ] |
|
status update: |
| Comment by Hongchao Zhang [ 31/May/13 ] |
|
The patch is tracked at http://review.whamcloud.com/#change,6512