Details
Type: Bug
Resolution: Fixed
Priority: Major
Description
The Lustre client hits an LBUG in osc_page_init() when job processes are killed because their memory cgroup has run out of memory. The LBUG occurred on 14 nodes during a recent relrun.
> 2017-05-22T16:41:07.320315-05:00 c0-0c1s6n0 LustreError: 15485:0:(osc_page.c:433:osc_page_init()) ASSERTION( result == 0 ) failed:
> 2017-05-22T16:41:07.320393-05:00 c0-0c1s6n0 Killed process 15246 (namu.exe.6GB_pe) apid 471027 total-vm:8968944kB, anon-rss:5203772kB, file-rss:12kB, shmem-rss:1828kB
> 2017-05-22T16:41:07.320398-05:00 c0-0c1s6n0 Memory cgroup out of memory: Killed 15 processes sharing cpu group with pid 15246.
> 2017-05-22T16:41:07.320404-05:00 c0-0c1s6n0 LustreError: 15485:0:(osc_page.c:433:osc_page_init()) LBUG
> 2017-05-22T16:41:07.320409-05:00 c0-0c1s6n0 Pid: 15485, comm: namu.exe.6GB_pe
>
> PID: 15485 TASK: ffff8816cf566980 CPU: 55 COMMAND: "namu.exe.6GB_pe"
> #0 [ffff8816cf56b908] panic at ffffffff8114670e
> #1 [ffff8816cf56b980] lbug_with_loc at ffffffffa026aead [libcfs]
> #2 [ffff8816cf56b9a0] osc_page_init at ffffffffa09f9e12 [osc]
> #3 [ffff8816cf56b9e0] lov_page_init_raid0 at ffffffffa084199b [lov]
> #4 [ffff8816cf56ba38] lov_page_init at ffffffffa083a34c [lov]
> #5 [ffff8816cf56ba48] cl_page_alloc at ffffffffa0559bf2 [obdclass]
> #6 [ffff8816cf56ba88] cl_page_find at ffffffffa0559e1f [obdclass]
> #7 [ffff8816cf56bad8] ll_readpage at ffffffffa09031c9 [lustre]
> #8 [ffff8816cf56bbe8] filemap_fault at ffffffff8114b5db
> #9 [ffff8816cf56bc58] vvp_io_fault_start at ffffffffa093258e [lustre]
> #10 [ffff8816cf56bcc8] cl_io_start at ffffffffa055cfae [obdclass]
> #11 [ffff8816cf56bcf0] cl_io_loop at ffffffffa056036e [obdclass]
> #12 [ffff8816cf56bd20] ll_fault at ffffffffa09137e4 [lustre]
> #13 [ffff8816cf56bd98] __do_fault at ffffffff81175abe
> #14 [ffff8816cf56be00] handle_mm_fault at ffffffff81179528
> #15 [ffff8816cf56bee0] __do_page_fault at ffffffff81048de9
> #16 [ffff8816cf56bf40] do_page_fault at ffffffff8104904c
> #17 [ffff8816cf56bf50] page_fault at ffffffff81506a62
> RIP: 0000000000415702 RSP: 00002aab20a00480 RFLAGS: 00010202
> RAX: 0000000000000280 RBX: 000000000000104b RCX: 0000000000005008
> RDX: 0000000000000f02 RSI: 000000000517f258 RDI: 0000000000000280
> RBP: 00002aab20a00670 R8: 0000000106ec9118 R9: 0000000000000781
> R10: 0000000000000280 R11: 0000000101d49ec0 R12: 000000000cdabab8
> R13: 0000000000000280 R14: 00000000000013c2 R15: 0000000007c2c860
> ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
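
Because the same LBUG was seen on 14 nodes, it can help to reduce each dump to its call chain before comparing them. The following is a minimal sketch, not part of the original report: it assumes the backtrace uses the frame layout shown above (`#N [stack] function at address [module]`), and the names `FRAME_RE` and `parse_frames` are hypothetical helpers introduced only for illustration.

```python
import re

# One backtrace frame as it appears in the dump above, e.g.
#   #2 [ffff8816cf56b9a0] osc_page_init at ffffffffa09f9e12 [osc]
# The trailing [module] is absent for core-kernel frames.
FRAME_RE = re.compile(
    r"#(?P<index>\d+) \[(?P<sp>[0-9a-f]+)\] (?P<func>\w+) "
    r"at (?P<addr>[0-9a-f]+)(?: \[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return (index, function, module) per frame; module is None for kernel frames."""
    return [
        (int(m["index"]), m["func"], m["module"])
        for m in FRAME_RE.finditer(text)
    ]


# A few frames copied from the report; the parser works on either the
# flattened or the one-frame-per-line form of the log.
sample = (
    "> #0 [ffff8816cf56b908] panic at ffffffff8114670e "
    "> #1 [ffff8816cf56b980] lbug_with_loc at ffffffffa026aead [libcfs] "
    "> #2 [ffff8816cf56b9a0] osc_page_init at ffffffffa09f9e12 [osc] "
    "> #7 [ffff8816cf56bad8] ll_readpage at ffffffffa09031c9 [lustre] "
    "> #8 [ffff8816cf56bbe8] filemap_fault at ffffffff8114b5db"
)

frames = parse_frames(sample)
for index, func, module in frames:
    print(f"#{index:<2} {func} ({module or 'kernel'})")
```

Comparing the resulting function lists across nodes is a quick way to confirm that every node failed on the same path (page fault, then ll_readpage, then osc_page_init) rather than on a different assertion.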