Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.13.0
    • lustre-master-ib #286
    • 3

    Description

      soak-

      [  540.771758] Lustre: soaked-OST000d: Will be in recovery for at least 2:30, or until 28 clients reconnect
      [  557.249150] Lustre: soaked-OST000d: Recovery over after 0:16, of 28 clients 28 recovered and 0 were evicted.
      [  557.377573] Lustre: soaked-OST000d: deleting orphan objects from 0x680000400:163752940 to 0x680000400:163755413
      [  557.449951] Lustre: soaked-OST000d: deleting orphan objects from 0x680000402:140748232 to 0x680000402:140758690
      [  557.508161] Lustre: soaked-OST000d: deleting orphan objects from 0x680000401:211295588 to 0x680000401:211298918
      [  557.519664] Lustre: soaked-OST000d: deleting orphan objects from 0x0:215937270 to 0x0:215938124
      [  568.981469] Lustre: Failing over soaked-OST0009
      [  570.884990] Lustre: server umount soaked-OST0009 complete
      [  585.668199] in:imjournal invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
      [  585.677142] in:imjournal cpuset=/ mems_allowed=0-1
      [  585.682498] CPU: 24 PID: 24262 Comm: in:imjournal Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.21.3.el7_lustre.x86_64 #1
      [  585.696573] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [  585.709100] Call Trace:
      [  585.711837]  [<ffffffff83363107>] dump_stack+0x19/0x1b
      [  585.717576]  [<ffffffff8335db2a>] dump_header+0x90/0x229
      [  585.723510]  [<ffffffff82d01292>] ? ktime_get_ts64+0x52/0xf0
      [  585.729836]  [<ffffffff82d584df>] ? delayacct_end+0x8f/0xb0
      [  585.736060]  [<ffffffff82dba834>] oom_kill_process+0x254/0x3d0
      [  585.742576]  [<ffffffff82dba2dd>] ? oom_unkillable_task+0xcd/0x120
      [  585.749478]  [<ffffffff82dba386>] ? find_lock_task_mm+0x56/0xc0
      [  585.756107]  [<ffffffff82dbb076>] out_of_memory+0x4b6/0x4f0
      [  585.762335]  [<ffffffff8335e62e>] __alloc_pages_slowpath+0x5d6/0x724
      [  585.769458]  [<ffffffff82dc1454>] __alloc_pages_nodemask+0x404/0x420
      [  585.776594]  [<ffffffff82e11795>] alloc_pages_vma+0xb5/0x200
      [  585.782921]  [<ffffffff82dff9e5>] __read_swap_cache_async+0x115/0x190
      [  585.790133]  [<ffffffff82dffa86>] read_swap_cache_async+0x26/0x60
      [  585.796946]  [<ffffffff82dffb6c>] swapin_readahead+0xac/0x110
      [  585.803365]  [<ffffffff82de9c62>] handle_pte_fault+0x812/0xd10
      [  585.809881]  [<ffffffff82ce035c>] ? update_curr+0x14c/0x1e0
      [  585.816106]  [<ffffffff82cdccbe>] ? account_entity_dequeue+0xae/0xd0
      [  585.823203]  [<ffffffff82ce084c>] ? dequeue_entity+0x11c/0x5e0
      [  585.829715]  [<ffffffff82dec27d>] handle_mm_fault+0x39d/0x9b0
      [  585.836131]  [<ffffffff82ce112e>] ? dequeue_task_fair+0x41e/0x660
      [  585.842928]  [<ffffffff83370603>] __do_page_fault+0x203/0x4f0
      [  585.849344]  [<ffffffff83370925>] do_page_fault+0x35/0x90
      [  585.855374]  [<ffffffff833680ce>] ? schedule_hrtimeout_range_clock+0xbe/0x150
      [  585.863348]  [<ffffffff8336c768>] page_fault+0x28/0x30
      [  585.869093]  [<ffffffff82e58e0e>] ? do_sys_poll+0x4fe/0x590
      [  585.875320]  [<ffffffff82e58de6>] ? do_sys_poll+0x4d6/0x590
      [  585.881546]  [<ffffffff82dd1a5f>] ? shmem_fault+0xdf/0x1f0
      [  585.887673]  [<ffffffff82e57530>] ? __pollwait+0xf0/0xf0
      [  585.893610]  [<ffffffff82df755c>] ? page_add_file_rmap+0x8c/0xc0
      [  585.900311]  [<ffffffff82db6abb>] ? unlock_page+0x2b/0x30
      [  585.906341]  [<ffffffff82de4e89>] ? do_read_fault.isra.61+0x139/0x1b0
      [  585.913539]  [<ffffffff82de9744>] ? handle_pte_fault+0x2f4/0xd10
      [  585.920248]  [<ffffffff82e54492>] ? user_path_at_empty+0x72/0xc0
      [  585.926957]  [<ffffffff82e3e82a>] ? __check_object_size+0x1ca/0x250
      [  585.933958]  [<ffffffff82f9572d>] ? list_del+0xd/0x30
      [  585.939600]  [<ffffffff82cc2a61>] ? remove_wait_queue+0x31/0x40
      [  585.946211]  [<ffffffff82e8c22f>] ? inotify_read+0x2ef/0x420
      [  585.952532]  [<ffffffff82d01292>] ? ktime_get_ts64+0x52/0xf0
      [  585.958854]  [<ffffffff82e59213>] SyS_ppoll+0x1d3/0x1f0
      [  585.964688]  [<ffffffff83375d15>] ? system_call_after_swapgs+0xa2/0x146
      [  585.972074]  [<ffffffff83375d21>] ? system_call_after_swapgs+0xae/0x146
      [  585.979462]  [<ffffffff83375ddb>] system_call_fastpath+0x22/0x27
      [  585.986171]  [<ffffffff83375d21>] ? system_call_after_swapgs+0xae/0x146
      [  585.993556] Mem-Info:
      [  585.996086] active_anon:271 inactive_anon:404 isolated_anon:0
      [  585.996086]  active_file:149 inactive_file:0 isolated_file:0
      [  585.996086]  unevictable:6763 dirty:0 writeback:0 unstable:0
      [  585.996086]  slab_reclaimable:10724 slab_unreclaimable:174704
      [  585.996086]  mapped:1588 shmem:22 pagetables:1820 bounce:0
      [  585.996086]  free:34101 free_pcp:0 free_cma:0
      [  586.032467] Node 0 DMA free:15324kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15920kB managed:15836kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      [  586.078658] lowmem_reserve[]: 0 2754 15791 15791
      [  586.083859] Node 0 DMA32 free:59904kB min:7780kB low:9724kB high:11668kB active_anon:824kB inactive_anon:1244kB active_file:0kB inactive_file:0kB unevictable:4076kB isolated(anon):0kB isolated(file):0kB present:3051628kB managed:2820172kB mlocked:4076kB dirty:0kB writeback:0kB mapped:408kB shmem:84kB slab_reclaimable:2692kB slab_unreclaimable:71512kB kernel_stack:1792kB pagetables:764kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:49514 all_unreclaimable? yes
      [  586.134218] lowmem_reserve[]: 0 0 13037 13037
      [  586.139149] Node 0 Normal free:27284kB min:36828kB low:46032kB high:55240kB active_anon:0kB inactive_anon:4kB active_file:172kB inactive_file:0kB unevictable:21484kB isolated(anon):0kB isolated(file):128kB present:13631488kB managed:13350636kB mlocked:21484kB dirty:0kB writeback:0kB mapped:4456kB shmem:0kB slab_reclaimable:19596kB slab_unreclaimable:342544kB kernel_stack:22144kB pagetables:3948kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:20401 all_unreclaimable? yes
      [  586.190481] lowmem_reserve[]: 0 0 0 0
      [  586.194625] Node 1 Normal free:34292kB min:45456kB low:56820kB high:68184kB active_anon:0kB inactive_anon:0kB active_file:384kB inactive_file:0kB unevictable:1492kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16480320kB mlocked:1492kB dirty:0kB writeback:0kB mapped:1488kB shmem:0kB slab_reclaimable:20608kB slab_unreclaimable:284740kB kernel_stack:15616kB pagetables:2568kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:13718 all_unreclaimable? no
      [  586.245470] lowmem_reserve[]: 0 0 0 0
      [  586.249615] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 1*2048kB (M) 3*4096kB (M) = 15324kB
      [  586.265227] Node 0 DMA32: 696*4kB (UEM) 639*8kB (UEM) 370*16kB (UEM) 269*32kB (UEM) 90*64kB (UEM) 31*128kB (UM) 24*256kB (UM) 15*512kB (UM) 2*1024kB (UM) 2*2048kB (UM) 2*4096kB (U) = 60312kB
      [  586.284458] Node 0 Normal: 2214*4kB (UEM) 1463*8kB (UEM) 313*16kB (UM) 12*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 25952kB
      [  586.299720] Node 1 Normal: 931*4kB (UEM) 1127*8kB (UEM) 611*16kB (UEM) 100*32kB (UEM) 36*64kB (UM) 8*128kB (UM) 12*256kB (UM) 3*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 33652kB
      [  586.317475] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      [  586.327186] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [  586.336607] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      [  586.346321] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [  586.355742] 1829 total pagecache pages
      [  586.359930] 0 pages in swap cache
      [  586.363631] Swap cache stats: add 16693, delete 16693, find 46/62
      [  586.370434] Free swap  = 16183796kB
      [  586.374330] Total swap = 16253948kB
      [  586.378225] 8369063 pages RAM
      [  586.381536] 0 pages HighMem/MovableOnly
      [  586.385818] 202322 pages reserved
      [  586.389518] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
      [  586.398294] [ 7134]     0  7134     9769      222      25       83             0 systemd-journal
      [  586.408106] [ 7164]     0  7164    29157      231      27       80             0 lvmetad
      [  586.417142] [ 7195]     0  7195    11230      232      24      212         -1000 systemd-udevd
      [  586.426759] [ 7206]     0  7206  1572958     3672     133        0         -1000 multipathd
      [  586.436129] [23295]     0 23295    15511       88      31      155         -1000 auditd
      [  586.445065] [23324]     0 23324     5475      190      16      144             0 irqbalance
      [  586.454381] [23325]   999 23325   156119      270      64     1900             0 polkitd
      [  586.463418] [23327]    32 23327    18412      164      40      190             0 rpcbind
      [  586.472453] [23330]     0 23330    64337      330      79      333             0 sssd
      [  586.481198] [23335]    81 23335    17628      274      36      171          -900 dbus-daemon
      [  586.490621] [23345]     0 23345    69399      176      48      214             0 gssproxy
      [  586.499753] [23373]     0 23373   136908      304      87     1138             0 NetworkManager
      [  586.509466] [23374]     0 23374    41019      263      44      207             0 zed
      [  586.518113] [23382]     0 23382     1781       30       8       38             0 mcelog
      [  586.527053] [23383]   997 23383    29446      248      30      113             0 chronyd
      [  586.536089] [23385]     0 23385    32230      209      33      271             0 rpc.gssd
      [  586.545220] [23402]     0 23402    98257      333     135      642             0 sssd_be
      [  586.554262] [23438]     0 23438    66241      296      85      235             0 sssd_nss
      [  586.563394] [23439]     0 23439    61158      288      74      229             0 sssd_pam
      [  586.572525] [23440]     0 23440    58985      273      71      213             0 sssd_ssh
      [  586.581660] [23441]     0 23441    69110      279      87      318             0 sssd_pac
      [  586.590794] [23455]     0 23455     6594      230      19       83             0 systemd-logind
      [  586.600507] [23590]     0 23590    26839      264      55      501             0 dhclient
      [  586.609639] [24250]     0 24250   143518      314      98     2832             0 tuned
      [  586.618480] [24253]     0 24253    54103      275      40      617             0 rsyslogd
      [  586.627616] [24254]     0 24254    28189      287      56      258         -1000 sshd
      [  586.636358] [24256]   998 24256    24222      182      22      129             0 munged
      [  586.645289] [24257]    29 24257    12760      174      28      256             0 rpc.statd
      [  586.654517] [24271]     0 24271     6791      150      18       64             0 xinetd
      [  586.663459] [24554]     0 24554    22907      175      44      262             0 master
      [  586.672399] [24560]    89 24560    25474      215      45      255             0 pickup
      [  586.681337] [24561]    89 24561    25491      211      45      256             0 qmgr
      [  586.690082] [24590]     0 24590   157973      273      81      424             0 automount
      [  586.699311] [24593]     0 24593     6476      168      18       52             0 atd
      [  586.707960] [24596]     0 24596    31571      205      20      154             0 crond
      [  586.716804] [24653]     0 24653    27523      167      11       32             0 agetty
      [  586.725743] [24654]     0 24654    27523      161      12       32             0 agetty
      [  586.734851] Out of memory: Kill process 24250 (tuned) score 0 or sacrifice child
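
A rough way to read the per-zone lines in the dump above is to compare each zone's free value with its min watermark. The sketch below is an illustration only: the regular expression and the helper name zones_below_min are assumptions made for this example, not part of any Lustre or kernel tooling. The check is also simplified, since it ignores lowmem_reserve (which further protects the DMA and DMA32 zones) and the allocation flags.

```python
import re

# Matches the start of a per-zone line such as
# "Node 0 Normal free:27284kB min:36828kB ..."
# (hypothetical pattern written for this log format).
ZONE_RE = re.compile(r"Node (\d+) (\w+) free:(\d+)kB min:(\d+)kB")


def zones_below_min(log_text):
    """Return (node, zone, free_kb, min_kb) for zones whose free < min."""
    below = []
    for node, zone, free_kb, min_kb in ZONE_RE.findall(log_text):
        if int(free_kb) < int(min_kb):
            below.append((int(node), zone, int(free_kb), int(min_kb)))
    return below


# Values copied from the four zone lines in the dump above (truncated).
sample = """\
Node 0 DMA free:15324kB min:40kB low:48kB high:60kB
Node 0 DMA32 free:59904kB min:7780kB low:9724kB high:11668kB
Node 0 Normal free:27284kB min:36828kB low:46032kB high:55240kB
Node 1 Normal free:34292kB min:45456kB low:56820kB high:68184kB
"""

for node, zone, free_kb, min_kb in zones_below_min(sample):
    print(f"Node {node} {zone}: free {free_kb} kB < min {min_kb} kB")
```

Run against the four zone lines above, this flags both Normal zones: free 27284 kB against min 36828 kB on node 0, and free 34292 kB against min 45456 kB on node 1.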
      

        Activity

          [LU-12727] OSS OOM during failover
          sarah Sarah Liu added a comment -

          Since MOFED hasn't been released for el7.7, we don't have an IB el7.7 build. Will move soak to tip of master on el7.6.

          pfarrell Patrick Farrell (Inactive) added a comment -

          Discussed this with jamesanunez & green on the triage call today. This is a bit of a weird failure: it's trying to allocate a single page (order 0), and there's a lot of free memory, seemingly in every node and zone, per the info dumped. The GFP flags (gfp_mask=0x200da) look relatively permissive. If I unpacked them correctly, they are (___GFP_*):
          memalloc
          highmem
          movable
          wait
          high
          FS

          Which I think should be easy to satisfy?

          Oleg and I don't see any clear link from this to Lustre, so we're just going to recommend upgrading SOAK to 7.7 and going ahead with testing, to see if this happens again.
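
For reference, the mask can also be unpacked mechanically. The sketch below assumes the ___GFP_* bit values from include/linux/gfp.h in mainline 3.10; the RHEL 7 kernel in this report (3.10.0-957.21.3.el7_lustre) may differ, so the table should be checked against the actual source. decode_gfp is a hypothetical helper name used only for this example.

```python
# Bit values ASSUMED from include/linux/gfp.h in mainline Linux 3.10.
# Vendor kernels can differ; verify against the running kernel's source.
GFP_BITS = {
    0x01: "DMA", 0x02: "HIGHMEM", 0x04: "DMA32", 0x08: "MOVABLE",
    0x10: "WAIT", 0x20: "HIGH", 0x40: "IO", 0x80: "FS",
    0x100: "COLD", 0x200: "NOWARN", 0x400: "REPEAT", 0x800: "NOFAIL",
    0x1000: "NORETRY", 0x2000: "MEMALLOC", 0x4000: "COMP", 0x8000: "ZERO",
    0x10000: "NOMEMALLOC", 0x20000: "HARDWALL", 0x40000: "THISNODE",
    0x80000: "RECLAIMABLE",
}


def decode_gfp(mask):
    """Return the names of the flags set in mask, lowest bit first."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mask & bit]
    # Report any bits that the table above does not cover as raw hex.
    unknown = mask & ~sum(GFP_BITS)
    if unknown:
        names.append(hex(unknown))
    return names


print(decode_gfp(0x200DA))
```

Under that assumed table, 0x200da decodes to HIGHMEM, MOVABLE, WAIT, IO, FS and HARDWALL, which differs slightly from the list above (IO and HARDWALL in place of HIGH and MEMALLOC).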

          People

            wc-triage WC Triage
            sarah Sarah Liu