Details
- Bug
- Resolution: Fixed
- Blocker
- Lustre 2.4.0
- 3
- Orion
- 3014
Description
We are seeing a large number (>30,000) of the following errors after recovery on our OSTs. This is only observed when the OSTs are restarted while under load. I believe this is the same as issue LU-463, but I thought I'd open it as a separate issue until someone else can confirm. If these errors actually percolate all the way back to the client and are returned to the application, then this is a serious issue for us.
Lustre: Lustre: Build Version: orion-2_2_49_57_1-44chaos-44chaos--PRISTINE-2.6.32-220.13.1.2chaos.ch5.x86_64
LustreError: 11-0: MGC10.1.1.211@o2ib9: Communicating with 10.1.1.211@o2ib9, operation llog_origin_handle_create failed with -2
LustreError: 5497:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
Lustre: lcz-OST0000: Mounted
Lustre: lcz-OST0000: Will be in recovery for at least 5:00, or until 60 clients reconnect.
Lustre: lcz-OST0000: Recovery over after 1:12, of 60 clients 60 recovered and 0 were evicted.
LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25345259: rc -2
LustreError: 5663:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25352185: rc -2
LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 2 previous similar messages
LustreError: 5670:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25353836: rc -2
LustreError: 5670:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 168 previous similar messages
LustreError: 5665:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25418903: rc -2
LustreError: 5665:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 561 previous similar messages
LustreError: 5659:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25627504: rc -2
LustreError: 5659:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 2043 previous similar messages
LustreError: 5673:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25772020: rc -2
LustreError: 5673:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 4424 previous similar messages
LustreError: 5662:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 26456381: rc -2
LustreError: 5662:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 10386 previous similar messages
LustreError: 5672:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 28049681: rc -2
LustreError: 5672:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 23629 previous similar messages
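For anyone wanting to check the error volume on their own servers, the total can be estimated from a saved console log. The following is only a rough sketch: the function name and the sample text are illustrative, and it assumes that each "Skipped N previous similar messages" count is incremental (not cumulative) and refers to the same lvbo_init failure.

```python
import re

# Matches the rate-limit summary lines, e.g. "Skipped 168 previous similar messages".
SKIPPED_RE = re.compile(r"Skipped (\d+) previous similar messages?")
# Matches the lvbo_init failures that were actually printed; \s+ tolerates
# a line wrap between "for" and "resource".
LVBO_RE = re.compile(r"lvbo_init failed for\s+resource (\d+): rc (-?\d+)")

def count_lvbo_failures(log_text):
    """Return (printed, skipped) counts of lvbo_init failures in log_text.

    Assumption: every "Skipped N" line refers to suppressed lvbo_init
    messages and N counts only messages since the previous printed one.
    """
    printed = len(LVBO_RE.findall(log_text))
    skipped = sum(int(n) for n in SKIPPED_RE.findall(log_text))
    return printed, skipped

# Illustrative sample in the same shape as the log above.
sample = """\
LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25345259: rc -2
LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 2 previous similar messages
LustreError: 5670:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25353836: rc -2
LustreError: 5670:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 168 previous similar messages
"""

printed, skipped = count_lvbo_failures(sample)
print(f"printed={printed} skipped={skipped} total={printed + skipped}")
```

Run against the full excerpt above, this approach would count the 8 printed failures plus the sum of the skipped counts, which is consistent with the >30,000 figure.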
> I'm not sure if it is practical, but I'm wondering if it would make sense to have test pass/fail also depend on "expect" output for both the stdout and console error message?
I think that's an excellent idea. I currently do something similar with the ZFS builds by automatically flagging with a warning any test that generates console messages. It's helped me catch several issues immediately which might otherwise have slipped through. For Lustre you'll probably need something more sophisticated to filter out the expected messages, or better yet, to go on a crusade to silence those warnings.
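As a rough illustration of what such filtering might look like (this is not the actual ZFS build tooling; the pattern list, function names, and the PASS/WARN/FAIL labels are all hypothetical), a test harness could compare captured console lines against an allowlist of expected messages and downgrade a passing test to a warning when anything else shows up:

```python
import re

# Hypothetical allowlist of console messages considered expected during a test.
EXPECTED_PATTERNS = [
    re.compile(r"^Lustre: .*: Mounted$"),
    re.compile(r"^Lustre: .*: Will be in recovery for at least"),
    re.compile(r"^Lustre: .*: Recovery over after"),
]

def unexpected_console_lines(console_lines, expected=EXPECTED_PATTERNS):
    """Return the non-empty console lines that match no expected pattern."""
    flagged = []
    for line in console_lines:
        line = line.strip()
        if not line:
            continue
        if any(p.search(line) for p in expected):
            continue
        flagged.append(line)
    return flagged

def classify_result(passed, console_lines):
    """Combine the test's own pass/fail status with the console check."""
    if not passed:
        return "FAIL"
    return "WARN" if unexpected_console_lines(console_lines) else "PASS"

console = [
    "Lustre: lcz-OST0000: Mounted",
    "LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) "
    "lvbo_init failed for resource 25345259: rc -2",
]
print(classify_result(True, console))
```

Keeping the allowlist explicit makes it easy to shrink over time as noisy messages are silenced at the source.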