Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.3.63-6chaos (https://github.com/chaos/lustre/tree/2.3.63-6chaos)
-
3
-
8472
Description
We had a server running Lustre 2.3.63-6chaos over ZFS that hit the following assertion. The clients were mostly BG/Q, with some 2.1 x86_64 clients as well.
<ConMan> Console [vesta56] log at 2013-05-24 17:00:00 PDT.
2013-05-24 17:51:58 LustreError: 33230:0:(sec_null.c:320:null_alloc_rs()) vmalloc of 'rs' (-1073741304 bytes) failed
2013-05-24 17:51:58 LustreError: 33230:0:(sec_null.c:320:null_alloc_rs()) 369821362 total bytes allocated by Lustre, 593003541 by LNET
2013-05-24 17:51:58 LustreError: 33230:0:(pack_generic.c:428:lustre_msg_buf_v2()) msg ffff880da3718108 buffer[1] size -1073741792 too small (required 0, opc=0)
2013-05-24 17:51:58 LNetError: 33230:0:(o2iblnd_cb.c:1601:kiblnd_send()) ASSERTION( __builtin_offsetof(kib_msg_t,ibm_u.immediate.ibim_payload[payload_nob]) <= (4<<10) ) failed:
2013-05-24 17:51:58 LNetError: 33230:0:(o2iblnd_cb.c:1601:kiblnd_send()) LBUG
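The negative byte counts in the log are easier to read as unsigned 32-bit values. The sketch below assumes (this is not confirmed anywhere in the ticket) that the sizes were held in, or printed as, signed 32-bit integers; the helper name `as_u32` is made up for illustration.

```python
def as_u32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned (two's complement)."""
    return value & 0xFFFFFFFF


# Sizes copied verbatim from the console log; the labels are informal.
logged_sizes = {
    "null_alloc_rs vmalloc size": -1073741304,
    "lustre_msg_buf_v2 buffer[1] size": -1073741792,
}

for label, size in logged_sizes.items():
    print(f"{label}: {size} -> 0x{as_u32(size):08X}")

# Gap between the allocation size and the buffer size, in bytes.
print("difference:", logged_sizes["null_alloc_rs vmalloc size"]
      - logged_sizes["lustre_msg_buf_v2 buffer[1] size"])
```

Under that assumption the two values come out as 0xC0000208 and 0xC0000020, 488 bytes apart. Both have the top two bits set, which may point at a corrupt or uninitialized length rather than a genuinely large request, but the log alone does not establish that.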
When the failover partner started up, it hit the same assertion after recovery completed:
2013-05-24 17:56:42 Lustre: fsv-OST0037: Client e985e75f-abf3-b986-3da7-3cd1c4f29af0 (at 172.20.17.95@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-24 17:56:42 Lustre: Skipped 128 previous similar messages
2013-05-24 17:57:25 Lustre: fsv-OST0037: Recovery over after 3:45, of 405 clients 405 recovered and 0 were evicted.
2013-05-24 17:57:25 LustreError: 11574:0:(sec_null.c:320:null_alloc_rs()) vmalloc of 'rs' (-1073741304 bytes) failed
2013-05-24 17:57:25 LustreError: 11574:0:(sec_null.c:320:null_alloc_rs()) 408246602 total bytes allocated by Lustre, 1140997813 by LNET
2013-05-24 17:57:25 LustreError: 11574:0:(pack_generic.c:428:lustre_msg_buf_v2()) msg ffff880e344b8108 buffer[1] size -1073741792 too small (required 0, opc=0)
2013-05-24 17:57:25 LNetError: 11574:0:(o2iblnd_cb.c:1601:kiblnd_send()) ASSERTION( __builtin_offsetof(kib_msg_t,ibm_u.immediate.ibim_payload[payload_nob]) <= (4<<10) ) failed:
This happened over and over again for some time until a sysadmin intervened and aborted recovery manually. That seemed to allow everything to start up normally.
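Both servers logged the same failed allocation size, which would fit the same request being processed on each node after failover, although the ticket does not confirm that. A minimal sketch for pulling those sizes out of saved console logs follows; the regular expression is written against the two lines quoted in this report only, and the names `VMALLOC_FAIL` and `failed_vmalloc_sizes` are hypothetical.

```python
import re

# Matches lines such as:
#   LustreError: 33230:0:(sec_null.c:320:null_alloc_rs()) vmalloc of 'rs' (-1073741304 bytes) failed
# The pattern is derived from this report's log lines, not from a Lustre spec.
VMALLOC_FAIL = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<src>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"vmalloc of '(?P<name>\w+)' \((?P<size>-?\d+) bytes\) failed"
)


def failed_vmalloc_sizes(log_text):
    """Return (pid, function, size) for every failed vmalloc line found."""
    return [
        (int(m["pid"]), m["func"], int(m["size"]))
        for m in VMALLOC_FAIL.finditer(log_text)
    ]


# The two vmalloc failure lines from this report (first server, then failover partner).
SAMPLE = (
    "2013-05-24 17:51:58 LustreError: 33230:0:(sec_null.c:320:null_alloc_rs()) "
    "vmalloc of 'rs' (-1073741304 bytes) failed\n"
    "2013-05-24 17:57:25 LustreError: 11574:0:(sec_null.c:320:null_alloc_rs()) "
    "vmalloc of 'rs' (-1073741304 bytes) failed\n"
)

hits = failed_vmalloc_sizes(SAMPLE)
for pid, func, size in hits:
    print(pid, func, size)

# A single distinct size across both nodes suggests one recurring request.
print("distinct sizes:", {size for _, _, size in hits})
```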