LU-14169

parallel-scale-nfsv3 test compilebench crashes with OOM


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Labels: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      We’ve seen parallel-scale-nfsv3 test_compilebench crash with OOM only once. In this case, https://testing.whamcloud.com/test_sets/a96851ea-9fc2-4442-b901-b19d0d5c0fc1, we are testing DNE with ZFS on RHEL 7.9 clients and servers. The last thing seen in the suite_log is

      == parallel-scale-nfsv3 test compilebench: compilebench ============================================== 00:10:01 (1606003801)
      OPTIONS:
      cbench_DIR=/usr/bin
      cbench_IDIRS=2
      cbench_RUNS=2
      trevis-46vm1.trevis.whamcloud.com
      trevis-46vm2
      free space = 15279104 KB
      ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.29720 -i 2         -r 2 --makej
      using working directory /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.29720, 2 intial dirs 2 runs
      native unpatched native-0 222MB in 196.06 seconds (1.13 MB/s)
      native patched native-0 109MB in 56.43 seconds (1.94 MB/s)
      native patched compiled native-0 691MB in 145.28 seconds (4.76 MB/s)
      create dir kernel-0 222MB in 184.62 seconds (1.20 MB/s)
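
      For reference, the failing step can be re-run by hand with the same compilebench options that the test framework printed above. Below is a minimal sketch assuming compilebench is installed in /usr/bin and the NFS-exported Lustre filesystem is mounted at /mnt/lustre, as in the log; the target directory name is only illustrative.

      #!/bin/bash
      # Manual re-run of the failing step, mirroring the OPTIONS and command
      # line from the suite_log above; paths are assumptions taken from that
      # log, not a verified reproducer.
      cbench_DIR=/usr/bin      # location of the compilebench script
      cbench_IDIRS=2           # -i: number of initial kernel trees to create
      cbench_RUNS=2            # -r: number of runs

      testdir=/mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.$$   # illustrative name
      mkdir -p "$testdir"

      cd "$cbench_DIR" || exit 1
      ./compilebench -D "$testdir" -i "$cbench_IDIRS" -r "$cbench_RUNS" --makej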
      

      On the MDS2/4 (vm5) console log, we see communication errors

      [126471.115681] Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.compilebench.29720 -i 2         -r 2 --makej
      [126471.388277] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.29720 -i 2 -r 2 --makej
      [132040.380810] Lustre: 22680:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606009364/real 1606009364]  req@ffff8fdc15e2d680 x1683944906779264/t0(0) o13->lustre-OST0002-osc-MDT0001@10.9.5.186@tcp:7/4 lens 224/368 e 0 to 1 dl 1606009371 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [132040.385562] Lustre: lustre-OST0002-osc-MDT0001: Connection to lustre-OST0002 (at 10.9.5.186@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [132040.388239] Lustre: Skipped 4 previous similar messages
      [132078.146048] Lustre: lustre-MDT0003: haven't heard from client lustre-MDT0003-lwp-OST0000_UUID (at 10.9.5.186@tcp) in 51 seconds. I think it's dead, and I am evicting it. exp ffff8fdc3f68f000, cur 1606009409 expire 1606009379 last 1606009358
      [132078.149715] Lustre: Skipped 1 previous similar message
      
      <ConMan> Console [trevis-46vm5] disconnected from <trevis-46:6004> at 11-22 01:44.
      

      On the MDS1/3 (vm4) console log, we see similar issues, but with many more errors such as

      [126470.851038] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test compilebench: compilebench ============================================== 00:10:01 (1606003801)
      [126471.153265] Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.compilebench.29720 -i 2         -r 2 --makej
      [126471.438784] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.29720 -i 2 -r 2 --makej
      [126514.413660] LustreError: 2619:0:(qmt_handler.c:514:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0000_UUID release:262147 granted:0, total:0 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:16204339 soft:15432704 granted:0 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:no
      [126514.418116] LustreError: 2619:0:(qmt_handler.c:514:qmt_dqacq0()) Skipped 8 previous similar messages
      [127114.385591] LustreError: 2619:0:(qmt_handler.c:514:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0000_UUID release:262147 granted:0, total:0 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:16204339 soft:15432704 granted:0 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:no
      [127114.390114] LustreError: 2619:0:(qmt_handler.c:514:qmt_dqacq0()) Skipped 9 previous similar messages
      [127774.353234] LustreError: 6643:0:(qmt_handler.c:514:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0000_UUID release:262147 granted:0, total:0 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:16204339 soft:15432704 granted:0 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:no
      [127774.357978] LustreError: 6643:0:(qmt_handler.c:514:qmt_dqacq0()) Skipped 10 previous similar messages
      [127931.015494] Lustre: 19775:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606004660/real 1606004660]  req@ffff9369daabba80 x1683953590123520/t0(0) o101->lustre-OST0004-osc-ffff9369fa533800@10.9.5.186@tcp:28/4 lens 328/400 e 24 to 1 dl 1606005261 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [127931.020478] Lustre: lustre-OST0004-osc-ffff9369fa533800: Connection to lustre-OST0004 (at 10.9.5.186@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [127931.023258] Lustre: Skipped 2 previous similar messages
      [127931.029602] Lustre: lustre-OST0002-osc-ffff9369fa533800: Connection restored to 10.9.5.186@tcp (at 10.9.5.186@tcp)
      [127931.031642] Lustre: Skipped 36 previous similar messages
      [128263.190427] Lustre: 19776:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606005261/real 1606005261]  req@ffff936a18bf4900 x1683953590431360/t0(0) o103->lustre-OST0002-osc-ffff9369fa533800@10.9.5.186@tcp:17/18 lens 888/224 e 1 to 1 dl 1606005593 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      [128263.195381] Lustre: 19776:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 300 previous similar messages
      [128263.197072] Lustre: lustre-OST0002-osc-ffff9369fa533800: Connection to lustre-OST0002 (at 10.9.5.186@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [128263.199874] Lustre: Skipped 7 previous similar messages
      …
      [128733.982640] LustreError: 18293:0:(osc_cache.c:959:osc_extent_wait()) extent ffff9369fe27d000@{[0 -> 255/255], [3|0|+|rpc|wihuY|ffff936a01033dc0], [1703936|13|+|-|ffff9369c4e1cb40|256|ffff936a1a76d280]} lustre-OST0003-osc-ffff9369fa533800: wait ext to 0 timedout, recovery in progress?
      [128733.987044] LustreError: 18293:0:(osc_cache.c:959:osc_extent_wait()) ### extent: ffff9369fe27d000 ns: lustre-OST0003-osc-ffff9369fa533800 lock: ffff9369c4e1cb40/0x457300162b236ab9 lrc: 3/0,0 mode: PW/PW res: [0x4c9f7:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->53247) flags: 0x29400000000 nid: local remote: 0x345f4682719796fe expref: -99 pid: 14029 timeout: 0 lvb_type: 1
      [128774.136661] LustreError: 2606:0:(osc_cache.c:959:osc_extent_wait()) extent ffff9369cee80b00@{[0 -> 255/255], [3|0|+|rpc|wihuY|ffff936a1535f8c0], [1703936|1|+|-|ffff936a34d27680|256|ffff936a38fb1080]} lustre-OST0002-osc-ffff9369fa533800: wait ext to 0 timedout, recovery in progress?
      [128774.140963] LustreError: 2606:0:(osc_cache.c:959:osc_extent_wait()) ### extent: ffff9369cee80b00 ns: lustre-OST0002-osc-ffff9369fa533800 lock: ffff936a34d27680/0x457300162b2371dc lrc: 3/0,0 mode: PW/PW res: [0x4cc1e:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x29400000000 nid: local remote: 0x345f468271979886 expref: -99 pid: 14029 timeout: 0 lvb_type: 1
      …
      [131959.075762] Lustre: 19776:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606008982/real 1606008982]  req@ffff936a19bc0000 x1683953590680128/t0(0) o103->lustre-OST0000-osc-ffff9369fa533800@10.9.5.186@tcp:17/18 lens 496/224 e 12 to 1 dl 1606009289 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      [131959.080744] Lustre: 19776:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
      [131959.091626] Lustre: lustre-OST0000-osc-ffff9369fa533800: Connection restored to 10.9.5.186@tcp (at 10.9.5.186@tcp)
      [131959.093410] Lustre: Skipped 15 previous similar messages
      [132076.505428] Lustre: MGS: haven't heard from client 2e6a78ed-898c-e05c-81c2-d8a60a1bee04 (at 10.9.5.186@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff936a0d00ac00, cur 1606009407 expire 1606009377 last 1606009358
      [132076.511277] Lustre: Skipped 1 previous similar message
      
      <ConMan> Console [trevis-46vm4] disconnected from <trevis-46:6003> at 11-22 01:44.
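
      The lost connections and evictions above all involve the OSS NID 10.9.5.186@tcp, so before suspecting the network it is worth checking basic LNet reachability and device state from the affected MDS. The commands below are only a diagnostic sketch and are not part of the test run.

      # Can the MDS still reach the OSS NID named in the timeout and eviction messages?
      lctl ping 10.9.5.186@tcp

      # List the configured Lustre devices and their states on this node.
      lctl dl

      # Dump the in-kernel Lustre debug log to a file for later inspection.
      lctl dk /tmp/lustre-debug.log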
      

      On the OSS (vm3) console log, we see the OOM, hung tasks, and stack traces

      [126470.900307] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test compilebench: compilebench ============================================== 00:10:01 (1606003801)
      [126471.259208] Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.compilebench.29720 -i 2         -r 2 --makej
      [126471.542072] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.29720 -i 2 -r 2 --makej
      [126514.484905] LustreError: 1434:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0000 qtype:grp id:500 enforced:1 granted: 262147 pending:0 waiting:0 req:1 usage: 0 qunit:0 qtune:0 edquot:0 default:no
      [126514.488658] LustreError: 1434:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 8 previous similar messages
      [126936.722580] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [126936.724348] LNetError: Skipped 5 previous similar messages
      [126936.725334] LNetError: 1426:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [126936.727033] LNetError: 1426:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 5 previous similar messages
      [126936.728812] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [126936.731431] LNetError: Skipped 5 previous similar messages
      [127114.455871] LustreError: 1434:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0000 qtype:grp id:500 enforced:1 granted: 262147 pending:0 waiting:0 req:1 usage: 0 qunit:0 qtune:0 edquot:0 default:no
      [127114.459890] LustreError: 1434:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 9 previous similar messages
      [127369.681908] LNet: Service thread pid 32312 was inactive for 40.14s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [127369.684957] Pid: 32312, comm: ll_ost_io00_015 3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Fri Nov 13 23:48:56 UTC 2020
      [127369.686695] Call Trace:
      [127369.687305]  [<ffffffff943976d8>] call_rwsem_down_read_failed+0x18/0x30
      [127369.688534]  [<ffffffffc0d5819a>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
      [127369.690062]  [<ffffffff941d17be>] shrink_slab+0xae/0x340
      [127369.691044]  [<ffffffff941d4b4a>] do_try_to_free_pages+0x3ca/0x520
      [127369.692183]  [<ffffffff941d4d9c>] try_to_free_pages+0xfc/0x180
      [127369.693254]  [<ffffffff9477c679>] __alloc_pages_slowpath+0x457/0x729
      [127369.694433]  [<ffffffff941c8d76>] __alloc_pages_nodemask+0x436/0x450
      [127369.695584]  [<ffffffff942189d8>] alloc_pages_current+0x98/0x110
      [127369.696758]  [<ffffffff942270c3>] new_slab+0x393/0x4e0
      [127369.697730]  [<ffffffff942275dc>] ___slab_alloc+0x3cc/0x520
      [127369.698762]  [<ffffffff9477dcc1>] __slab_alloc+0x40/0x5c
      [127369.699752]  [<ffffffff9422879b>] kmem_cache_alloc+0x19b/0x1f0
      [127369.700844]  [<ffffffffc06443c9>] spl_kmem_cache_alloc+0x99/0x150 [spl]
      [127369.702077]  [<ffffffffc0811697>] zio_data_buf_alloc+0x57/0x60 [zfs]
      [127369.704042]  [<ffffffffc07454cc>] abd_alloc_linear+0xac/0xf0 [zfs]
      [127369.705299]  [<ffffffffc074580c>] abd_alloc+0x2fc/0x460 [zfs]
      [127369.706404]  [<ffffffffc0749dfe>] arc_hdr_alloc_pabd+0x6e/0x110 [zfs]
      [127369.707597]  [<ffffffffc0749f5c>] arc_hdr_alloc+0xbc/0x120 [zfs]
      [127369.708759]  [<ffffffffc074ebbd>] arc_alloc_buf+0x3d/0xb0 [zfs]
      [127369.709914]  [<ffffffffc074ec53>] arc_loan_buf+0x23/0x30 [zfs]
      [127369.710979]  [<ffffffffc075f02a>] dmu_request_arcbuf+0x1a/0x20 [zfs]
      [127369.712178]  [<ffffffffc1295a61>] osd_bufs_get+0x3d1/0xf80 [osd_zfs]
      [127369.713412]  [<ffffffffc1406ba6>] ofd_preprw_write.isra.30+0x476/0xea0 [ofd]
      [127369.714748]  [<ffffffffc14079ef>] ofd_preprw+0x41f/0x1240 [ofd]
      [127369.715903]  [<ffffffffc109ce99>] tgt_brw_write+0xcc9/0x1ad0 [ptlrpc]
      [127369.717515]  [<ffffffffc109ef1a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [127369.718879]  [<ffffffffc104388b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [127369.720383]  [<ffffffffc10471f4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [127369.721582]  [<ffffffff940c5c21>] kthread+0xd1/0xe0
      [127369.722565]  [<ffffffff94793df7>] ret_from_fork_nospec_end+0x0/0x39
      [127369.723730]  [<ffffffffffffffff>] 0xffffffffffffffff
      [127369.724861] LustreError: dumping log to /tmp/lustre-log.1606004700.32312
      [127536.712930] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [127536.714600] LNetError: Skipped 5 previous similar messages
      [127536.715594] LNetError: 1425:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [127536.717408] LNetError: 1425:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 5 previous similar messages
      [127536.719080] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [127536.721754] LNetError: Skipped 5 previous similar messages
      [127554.606912] INFO: task kthreadd:2 blocked for more than 120 seconds.
      [127554.608205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [127554.609530] kthreadd        D ffff8cb1bfd1acc0     0     2      0 0x00000000
      [127554.610898] Call Trace:
      [127554.611379]  [<ffffffff947871e9>] schedule+0x29/0x70
      [127554.612254]  [<ffffffff94788bd5>] rwsem_down_read_failed+0x105/0x1c0
      [127554.613484]  [<ffffffff943976d8>] call_rwsem_down_read_failed+0x18/0x30
      [127554.614656]  [<ffffffff947864d0>] down_read+0x20/0x40
      [127554.615631]  [<ffffffffc0d5819a>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
      [127554.616818]  [<ffffffff941d17be>] shrink_slab+0xae/0x340
      [127554.617801]  [<ffffffff942428c1>] ? vmpressure+0x21/0x90
      [127554.618769]  [<ffffffff941d4b4a>] do_try_to_free_pages+0x3ca/0x520
      [127554.619883]  [<ffffffff941d4d9c>] try_to_free_pages+0xfc/0x180
      [127554.620949]  [<ffffffff9477c679>] __alloc_pages_slowpath+0x457/0x729
      [127554.622090]  [<ffffffff941c8d76>] __alloc_pages_nodemask+0x436/0x450
      [127554.623236]  [<ffffffff940c5b50>] ? insert_kthread_work+0x40/0x40
      [127554.624353]  [<ffffffff940987dd>] copy_process+0x1dd/0x1a80
      [127554.625359]  [<ffffffff9409a231>] do_fork+0x91/0x330
      [127554.626273]  [<ffffffff9409a4f6>] kernel_thread+0x26/0x30
      [127554.627261]  [<ffffffff940c6601>] kthreadd+0x2e1/0x320
      [127554.628212]  [<ffffffff940c6320>] ? kthread_create_on_cpu+0x60/0x60
      [127554.629355]  [<ffffffff94793df7>] ret_from_fork_nospec_begin+0x21/0x21
      [127554.630518]  [<ffffffff940c6320>] ? kthread_create_on_cpu+0x60/0x60
      [127554.631649] INFO: task kswapd0:35 blocked for more than 120 seconds.
      …
      [127554.651560] INFO: task spl_dynamic_tas:639 blocked for more than 120 seconds.
      [127554.652778] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [127554.654181] spl_dynamic_tas D ffff8cb1b80ac200     0   639      2 0x00000000
      [127554.655493] Call Trace:
      [127554.655959]  [<ffffffff947871e9>] schedule+0x29/0x70
      [127554.656899]  [<ffffffff94784cd1>] schedule_timeout+0x221/0x2d0
      [127554.657966]  [<ffffffff940d7202>] ? check_preempt_curr+0x92/0xa0
      [127554.659047]  [<ffffffff940d7229>] ? ttwu_do_wakeup+0x19/0xe0
      [127554.660074]  [<ffffffff9478759d>] wait_for_completion+0xfd/0x140
      [127554.661162]  [<ffffffff940daaf0>] ? wake_up_state+0x20/0x20
      [127554.662175]  [<ffffffffc0645ad0>] ? taskq_thread_spawn+0x60/0x60 [spl]
      [127554.663344]  [<ffffffff940c5a7a>] kthread_create_on_node+0xaa/0x140
      [127554.664466]  [<ffffffff94391d3b>] ? string.isra.7+0x3b/0xf0
      [127554.665480]  [<ffffffffc0645ad0>] ? taskq_thread_spawn+0x60/0x60 [spl]
      [127554.666732]  [<ffffffffc0645ad0>] ? taskq_thread_spawn+0x60/0x60 [spl]
      [127554.667890]  [<ffffffffc064510c>] spl_kthread_create+0x9c/0xf0 [spl]
      [127554.669043]  [<ffffffffc064692b>] taskq_thread_create+0x6b/0x110 [spl]
      [127554.670216]  [<ffffffffc06469e2>] taskq_thread_spawn_task+0x12/0x40 [spl]
      [127554.671431]  [<ffffffffc0645d7c>] taskq_thread+0x2ac/0x4f0 [spl]
      [127554.672501]  [<ffffffff940daaf0>] ? wake_up_state+0x20/0x20
      [127554.673508]  [<ffffffffc0645ad0>] ? taskq_thread_spawn+0x60/0x60 [spl]
      [127554.674667]  [<ffffffff940c5c21>] kthread+0xd1/0xe0
      [127554.675565]  [<ffffffff940c5b50>] ? insert_kthread_work+0x40/0x40
      [127554.676668]  [<ffffffff94793df7>] ret_from_fork_nospec_begin+0x21/0x21
      [127554.677834]  [<ffffffff940c5b50>] ? insert_kthread_work+0x40/0x40
      [127554.678922] INFO: task dbuf_evict:645 blocked for more than 120 seconds.
      [127554.680124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [127554.681521] dbuf_evict      D ffff8cb1bfd1acc0     0   645      2 0x00000000
      [127554.682806] Call Trace:
      [127554.683277]  [<ffffffff94788109>] schedule_preempt_disabled+0x29/0x70
      [127554.684492]  [<ffffffff94786077>] __mutex_lock_slowpath+0xc7/0x1d0
      [127554.685602]  [<ffffffff9478544f>] mutex_lock+0x1f/0x2f
      [127554.686565]  [<ffffffffc074ee36>] arc_buf_destroy+0x96/0x130 [zfs]
      [127554.687699]  [<ffffffffc07563e1>] dbuf_destroy+0x31/0x3b0 [zfs]
      [127554.688790]  [<ffffffffc07b8176>] ? spa_get_random+0x36/0x60 [zfs]
      [127554.689957]  [<ffffffffc0756800>] dbuf_evict_one+0xa0/0xc0 [zfs]
      [127554.691058]  [<ffffffffc07597ff>] dbuf_evict_thread+0x10f/0x1d0 [zfs]
      [127554.692222]  [<ffffffffc07596f0>] ? dbuf_write_override_done+0x1e0/0x1e0 [zfs]
      [127554.693510]  [<ffffffffc0645063>] thread_generic_wrapper+0x73/0x80 [spl]
      [127554.694696]  [<ffffffffc0644ff0>] ? __thread_exit+0x20/0x20 [spl]
      [127554.695786]  [<ffffffff940c5c21>] kthread+0xd1/0xe0
      [127554.696691]  [<ffffffff940c5b50>] ? insert_kthread_work+0x40/0x40
      [127554.697790]  [<ffffffff94793df7>] ret_from_fork_nospec_begin+0x21/0x21
      [127554.698951]  [<ffffffff940c5b50>] ? insert_kthread_work+0x40/0x40
      …
      [132011.533944] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [132011.535685] LNetError: Skipped 7 previous similar messages
      [132031.291351] pickup invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
      [132031.292841] pickup cpuset=/ mems_allowed=0
      [132031.293561] CPU: 1 PID: 23709 Comm: pickup Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.2.1.el7_lustre.x86_64 #1
      [132031.295542] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [132031.296523] Call Trace:
      [132031.297056]  [<ffffffff947813c0>] dump_stack+0x19/0x1b
      [132031.297945]  [<ffffffff9477bce0>] dump_header+0x90/0x229
      [132031.298881]  [<ffffffff9478eb6f>] ? notifier_call_chain+0x4f/0x70
      [132031.300010]  [<ffffffff940cbfe8>] ? __blocking_notifier_call_chain+0x58/0x70
      [132031.301234]  [<ffffffff941c227e>] check_panic_on_oom+0x2e/0x60
      [132031.302243]  [<ffffffff941c25f4>] out_of_memory+0x194/0x500
      [132031.303196]  [<ffffffff9477c7fd>] __alloc_pages_slowpath+0x5db/0x729
      [132031.304277]  [<ffffffff941c8d76>] __alloc_pages_nodemask+0x436/0x450
      [132031.305371]  [<ffffffff9421c269>] alloc_pages_vma+0xa9/0x200
      [132031.306343]  [<ffffffff94209805>] __read_swap_cache_async+0x115/0x190
      [132031.307437]  [<ffffffff941bd45e>] ? __find_get_page+0x1e/0xa0
      [132031.308426]  [<ffffffff942098a6>] read_swap_cache_async+0x26/0x60
      [132031.309470]  [<ffffffff94209a8b>] swapin_readahead+0x1ab/0x210
      [132031.310493]  [<ffffffff9438d222>] ? radix_tree_lookup_slot+0x22/0x50
      [132031.311577]  [<ffffffff941bd45e>] ? __find_get_page+0x1e/0xa0
      [132031.312560]  [<ffffffff941f1f1f>] do_swap_page+0x23f/0x7c0
      [132031.313503]  [<ffffffff941bdf6b>] ? unlock_page+0x2b/0x30
      [132031.314431]  [<ffffffff941ee4d9>] ? do_read_fault.isra.63+0x139/0x1b0
      [132031.315527]  [<ffffffff941f5cb7>] handle_mm_fault+0xaa7/0xfb0
      [132031.316532]  [<ffffffff9478e653>] __do_page_fault+0x213/0x500
      [132031.317512]  [<ffffffff9478ea26>] trace_do_page_fault+0x56/0x150
      [132031.318541]  [<ffffffff9478dfa2>] do_async_page_fault+0x22/0xf0
      [132031.319681]  [<ffffffff9478a7a8>] async_page_fault+0x28/0x30
      [132031.320657] Mem-Info:
      [132031.321091] active_anon:3089 inactive_anon:3139 isolated_anon:0
      [132031.321091]  active_file:343 inactive_file:2107 isolated_file:0
      [132031.321091]  unevictable:0 dirty:10 writeback:0 unstable:0
      [132031.321091]  slab_reclaimable:4834 slab_unreclaimable:137783
      [132031.321091]  mapped:1028 shmem:4549 pagetables:1120 bounce:0
      [132031.321091]  free:12862 free_pcp:30 free_cma:0
      [132031.326525] Node 0 DMA free:6916kB min:416kB low:520kB high:624kB active_anon:68kB inactive_anon:100kB active_file:0kB inactive_file:24kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:132kB slab_reclaimable:152kB slab_unreclaimable:1516kB kernel_stack:192kB pagetables:104kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:297 all_unreclaimable? yes
      [132031.333546] lowmem_reserve[]: 0 1660 1660 1660
      [132031.334501] Node 0 DMA32 free:44532kB min:44636kB low:55792kB high:66952kB active_anon:12288kB inactive_anon:12456kB active_file:1372kB inactive_file:8404kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1703244kB mlocked:0kB dirty:40kB writeback:0kB mapped:4112kB shmem:18064kB slab_reclaimable:19184kB slab_unreclaimable:549616kB kernel_stack:12144kB pagetables:4376kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:51805 all_unreclaimable? yes
      [132031.342090] lowmem_reserve[]: 0 0 0 0
      [132031.342912] Node 0 DMA: 41*4kB (UEM) 26*8kB (UEM) 37*16kB (UEM) 20*32kB (UEM) 13*64kB (UEM) 7*128kB (UEM) 2*256kB (E) 2*512kB (UE) 2*1024kB (U) 0*2048kB 0*4096kB = 6916kB
      [132031.346179] Node 0 DMA32: 675*4kB (UEM) 2708*8kB (UEM) 1138*16kB (UM) 58*32kB (UM) 2*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44556kB
      [132031.349119] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      [132031.350593] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [132031.352019] 7419 total pagecache pages
      [132031.352692] 404 pages in swap cache
      [132031.353308] Swap cache stats: add 29767, delete 29363, find 2416/3651
      [132031.354391] Free swap  = 3558652kB
      [132031.355004] Total swap = 3671036kB
      [132031.355606] 524184 pages RAM
      [132031.356125] 0 pages HighMem/MovableOnly
      [132031.356798] 94396 pages reserved
      [132031.357361] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
      [132031.358680] [  352]     0   352     9863      827      24       58             0 systemd-journal
      [132031.360168] [  393]     0   393    47594        0      28      123             0 lvmetad
      [132031.361531] [  400]     0   400    11421        2      25      173         -1000 systemd-udevd
      [132031.362986] [  666]     0   666    13883        1      27      111         -1000 auditd
      [132031.364325] [  689]   999   689   153058        0      61     1385             0 polkitd
      [132031.365712] [  691]     0   691     5385       18      16       42             0 irqbalance
      [132031.367113] [  693]    81   693    14594       88      34      120          -900 dbus-daemon
      [132031.368531] [  695]    32   695    17314       14      37      126             0 rpcbind
      [132031.369896] [  698]     0   698     6637        1      17      117             0 systemd-logind
      [132031.371348] [  701]     0   701   155458       91      89      453             0 NetworkManager
      [132031.372827] [  702]     0   702    51907      117      35       69             0 zed
      [132031.374140] [  714]     0   714    48801        0      34      129             0 gssproxy
      [132031.375517] [  721]    38   721    11826       38      28      145             0 ntpd
      [132031.376860] [  747]     0   747    25735        7      54      508             0 dhclient
      [132031.378240] [ 1004]     0  1004    28234        3      57      256         -1000 sshd
      [132031.379551] [ 1005]     0  1005   143570      145      99     2728             0 tuned
      [132031.380886] [ 1013]     0  1013    74581      703      73      830             0 rsyslogd
      [132031.382262] [ 1015]     0  1015     6792        0      18       63             0 xinetd
      [132031.383609] [ 1018]   997  1018    56473       46      22      131             0 munged
      [132031.384971] [ 1019]    29  1019    10610        0      25      208             0 rpc.statd
      [132031.386353] [ 1083]     0  1083   172961      551      80      518             0 automount
      [132031.387744] [ 1086]     0  1086    31595       25      19      129             0 crond
      [132031.389075] [ 1089]     0  1089     6477        0      18       52             0 atd
      [132031.390383] [ 1099]     0  1099    27551        1      10       32             0 agetty
      [132031.391736] [ 1100]     0  1100    27551        1      13       32             0 agetty
      [132031.393078] [ 1438]     0  1438    22436       20      42      241             0 master
      [132031.394416] [ 1449]    89  1449    22479       15      46      238             0 qmgr
      [132031.395739] [23709]    89 23709    22462      103      44      194             0 pickup
      [132031.397438] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
      [132031.397438] 
      [132031.399072] CPU: 1 PID: 23709 Comm: pickup Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.2.1.el7_lustre.x86_64 #1
      [132031.401036] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [132031.401994] Call Trace:
      [132031.402435]  [<ffffffff947813c0>] dump_stack+0x19/0x1b
      [132031.403303]  [<ffffffff9477a918>] panic+0xe8/0x21f
      [132031.404121]  [<ffffffff941c22a5>] check_panic_on_oom+0x55/0x60
      [132031.405099]  [<ffffffff941c25f4>] out_of_memory+0x194/0x500
      [132031.406041]  [<ffffffff9477c7fd>] __alloc_pages_slowpath+0x5db/0x729
      [132031.407118]  [<ffffffff941c8d76>] __alloc_pages_nodemask+0x436/0x450
      [132031.408184]  [<ffffffff9421c269>] alloc_pages_vma+0xa9/0x200
      [132031.409141]  [<ffffffff94209805>] __read_swap_cache_async+0x115/0x190
      [132031.410223]  [<ffffffff941bd45e>] ? __find_get_page+0x1e/0xa0
      [132031.411192]  [<ffffffff942098a6>] read_swap_cache_async+0x26/0x60
      [132031.412219]  [<ffffffff94209a8b>] swapin_readahead+0x1ab/0x210
      [132031.413202]  [<ffffffff9438d222>] ? radix_tree_lookup_slot+0x22/0x50
      [132031.414267]  [<ffffffff941bd45e>] ? __find_get_page+0x1e/0xa0
      [132031.415235]  [<ffffffff941f1f1f>] do_swap_page+0x23f/0x7c0
      [132031.416172]  [<ffffffff941bdf6b>] ? unlock_page+0x2b/0x30
      [132031.417082]  [<ffffffff941ee4d9>] ? do_read_fault.isra.63+0x139/0x1b0
      [132031.418165]  [<ffffffff941f5cb7>] handle_mm_fault+0xaa7/0xfb0
      [132031.419136]  [<ffffffff9478e653>] __do_page_fault+0x213/0x500
      [132031.420101]  [<ffffffff9478ea26>] trace_do_page_fault+0x56/0x150
      [132031.421117]  [<ffffffff9478dfa2>] do_async_page_fault+0x22/0xf0
      [132031.422111]  [<ffffffff9478a7a8>] async_page_fault+0x28/0x30
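
      The panic itself is expected once the OOM killer is invoked, since the report states "system-wide panic_on_oom is enabled". The Mem-Info dump also shows roughly 540 MB of unreclaimable slab (slab_unreclaimable:137783 pages) out of about 1.7 GB of managed memory, which lines up with the ZFS ARC allocations in the blocked service-thread traces. A few commands one might run on the OSS to confirm these conditions are sketched below; the ZFS parameter path assumes a standard ZFS-on-Linux installation.

      # panic_on_oom=1 turns any OOM event into a kernel panic instead of
      # letting the OOM killer pick a single victim process.
      sysctl vm.panic_on_oom

      # Compare total and free memory against unreclaimable slab (SUnreclaim).
      grep -E 'MemTotal|MemFree|SUnreclaim' /proc/meminfo

      # On a ZFS-backed OSS, check the configured ARC size limit
      # (0 means the module default, typically about half of RAM).
      cat /sys/module/zfs/parameters/zfs_arc_max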
      

      This crash may be related to LU-11929 and/or LU-13068.

            People

              Assignee: WC Triage
              Reporter: James Nunez (Inactive)
