[LU-3417] ASSERTION( __builtin_offsetof(kib_msg_t,ibm_u.immediate.ibim_payload[payload_nob]) <= (4<<10) ) failed: Created: 30/May/13  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.4.0

Type: Bug
Priority: Major
Reporter: Christopher Morrone
Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce
Votes: 0
Labels: llnl
Environment:

Lustre 2.3.63-6chaos (https://github.com/chaos/lustre/tree/2.3.63-6chaos)


Severity: 3
Rank (Obsolete): 8472

 Description   

We had a server running Lustre 2.3.63-6chaos over ZFS that hit the following assertion. Most clients were BG/Q clients, along with some 2.1 x86_64 clients.

<ConMan> Console [vesta56] log at 2013-05-24 17:00:00 PDT.
2013-05-24 17:51:58 LustreError: 33230:0:(sec_null.c:320:null_alloc_rs()) vmalloc of 'rs' (-1073741304 bytes) failed
2013-05-24 17:51:58 LustreError: 33230:0:(sec_null.c:320:null_alloc_rs()) 369821362 total bytes allocated by Lustre, 593003541 by LNET
2013-05-24 17:51:58 LustreError: 33230:0:(pack_generic.c:428:lustre_msg_buf_v2()) msg ffff880da3718108 buffer[1] size -1073741792 too small (required 0, opc=0)
2013-05-24 17:51:58 LNetError: 33230:0:(o2iblnd_cb.c:1601:kiblnd_send()) ASSERTION( __builtin_offsetof(kib_msg_t,ibm_u.immediate.ibim_payload[payload_nob]) <= (4<<10) ) failed:
2013-05-24 17:51:58 LNetError: 33230:0:(o2iblnd_cb.c:1601:kiblnd_send()) LBUG
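
The negative sizes in the messages above are easier to read when reinterpreted as unsigned. The following is a minimal sketch, assuming the sizes were printed from a signed 32-bit integer (an assumption; the ticket does not confirm the underlying type). The helper name `as_u32` is hypothetical and exists only for this illustration.

```python
def as_u32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned (two's complement)."""
    return value & 0xFFFFFFFF


# Sizes copied verbatim from the console log above.
rs_alloc_size = -1073741304   # "vmalloc of 'rs'" in null_alloc_rs()
msg_buf_size = -1073741792    # "buffer[1] size" in lustre_msg_buf_v2()

# ASSUMPTION: both values are 32-bit signed ints as printed by the kernel.
for label, size in (("rs", rs_alloc_size), ("buffer[1]", msg_buf_size)):
    print(f"{label}: {size} -> {as_u32(size)} (0x{as_u32(size):08X})")
```

Under that assumption, both values decode to 0xC0000000 plus a small offset (0x208 and 0x20), that is, roughly 3 GiB. This may point to a corrupted or overflowed length field rather than a genuine request size, but nothing in this ticket confirms that reading.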

When the failover partner started up, it hit the same assertion after recovery completed:

2013-05-24 17:56:42 Lustre: fsv-OST0037: Client e985e75f-abf3-b986-3da7-3cd1c4f29af0 (at 172.20.17.95@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-24 17:56:42 Lustre: Skipped 128 previous similar messages
2013-05-24 17:57:25 Lustre: fsv-OST0037: Recovery over after 3:45, of 405 clients 405 recovered and 0 were evicted.
2013-05-24 17:57:25 LustreError: 11574:0:(sec_null.c:320:null_alloc_rs()) vmalloc of 'rs' (-1073741304 bytes) failed
2013-05-24 17:57:25 LustreError: 11574:0:(sec_null.c:320:null_alloc_rs()) 408246602 total bytes allocated by Lustre, 1140997813 by LNET
2013-05-24 17:57:25 LustreError: 11574:0:(pack_generic.c:428:lustre_msg_buf_v2()) msg ffff880e344b8108 buffer[1] size -1073741792 too small (required 0, opc=0)
2013-05-24 17:57:25 LNetError: 11574:0:(o2iblnd_cb.c:1601:kiblnd_send()) ASSERTION( __builtin_offsetof(kib_msg_t,ibm_u.immediate.ibim_payload[payload_nob]) <= (4<<10) ) failed: 

This happened over and over again for some time until a sysadmin intervened and aborted recovery manually. That seemed to allow everything to start up normally.
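
For reference, the failed assertion requires that an immediate message (the fixed header plus `payload_nob` bytes of payload) fit within `4<<10` (4096) bytes. The following rough sketch models that bound arithmetically. It assumes that `ibim_payload` is a byte array, that `payload_nob` is an unsigned byte count, and that the payload begins 120 bytes into `kib_msg_t`; that offset is a hypothetical figure for illustration, not one taken from this build. The names `MSG_SIZE_LIMIT`, `PAYLOAD_OFFSET`, and `immediate_msg_fits` are likewise invented here.

```python
MSG_SIZE_LIMIT = 4 << 10   # 4096 bytes, the bound shown in the assertion text
PAYLOAD_OFFSET = 120       # ASSUMPTION: hypothetical offset of ibim_payload[0]


def immediate_msg_fits(payload_nob: int,
                       payload_offset: int = PAYLOAD_OFFSET) -> bool:
    """Model offsetof(kib_msg_t, ...ibim_payload[payload_nob]) <= (4<<10).

    ASSUMPTION: ibim_payload is a byte array, so element payload_nob
    sits exactly payload_nob bytes past the start of the payload.
    """
    if payload_nob < 0:
        raise ValueError("payload_nob is modelled as an unsigned byte count")
    return payload_offset + payload_nob <= MSG_SIZE_LIMIT


# A small payload satisfies the bound.
print(immediate_msg_fits(1024))
# An oversized value (here the unsigned reading of the logged buffer size,
# used purely as an illustration) violates it.
print(immediate_msg_fits(3221225504))
```

Whether the logged buffer size is in fact the value that reached `kiblnd_send()` as `payload_nob` is not established in this ticket; the sketch only shows that any length of that magnitude would trip the assertion.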



 Comments   
Comment by Peter Jones [ 31/May/13 ]

Bruno is looking into this one

Comment by Bruno Faccini (Inactive) [ 31/May/13 ]

Hmm, the negative sizes in the error messages look strange but do not seem significant. Was a crash dump taken at the time of one of these looping LBUGs?

Comment by Bruno Faccini (Inactive) [ 05/Jun/13 ]

Hello Christopher,
Did you read my previous update and question?
Since I am not able to explain what happened with these error messages, a crash dump from one occurrence would greatly help.

Comment by Christopher Morrone [ 05/Jun/13 ]

I believe we have a crash dump somewhere, not from the initial assertion, but from one of the failover assertions. Could be a while before we look into it. So far this only hit once, and we have a lot of higher priority issues at the moment.

Generated at Sat Feb 10 01:33:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.