Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12752

osc_page.c:osc_page_delete() ASSERTION( 0 ) failed

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.15.0
    • Lustre 2.12.0
    • None
    • CentOS 7.6 2.12.0+patch including async_discard for DoM. Servers: 2.12.2_116+3 patches
    • 3
    • 9223372036854775807

    Description

      A specific type of jobs (which seems to be Spark+Python) keeps triggering the following LBUG on different clients:

      /var/log/nodes/sh-110-01.log:Sep  9 12:09:32 sh-110-01 kernel: LustreError: 123514:0:(osc_page.c:193:osc_page_delete()) LBUG
      /var/log/nodes/sh-110-04.log:Sep  9 08:53:09 sh-110-04 kernel: LustreError: 89537:0:(osc_page.c:193:osc_page_delete()) LBUG
      /var/log/nodes/sh-110-04.log:Sep  9 12:47:29 sh-110-04 kernel: LustreError: 53455:0:(osc_page.c:193:osc_page_delete()) LBUG
      /var/log/nodes/sh-110-04.log:Sep  9 12:47:29 sh-110-04 kernel: Kernel panic - not syncing: LBUG
      /var/log/nodes/sh-110-04.log:Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) LBUG
      /var/log/nodes/sh-110-04.log:Sep 10 11:48:43 sh-110-04 kernel: LustreError: 104908:0:(osc_page.c:193:osc_page_delete()) LBUG
      /var/log/nodes/sh-110-04.log:Sep 10 12:29:58 sh-110-04 kernel: LustreError: 19176:0:(osc_page.c:193:osc_page_delete()) LBUG
      /var/log/nodes/sh-110-04.log:Sep 10 12:45:09 sh-110-04 kernel: LustreError: 19196:0:(osc_page.c:193:osc_page_delete()) LBUG
      /var/log/nodes/sh-110-04.log:Sep 10 13:58:56 sh-110-04 kernel: LustreError: 14313:0:(osc_page.c:193:osc_page_delete()) LBUG
      /var/log/nodes/sh-110-04.log:Sep 10 15:01:22 sh-110-04 kernel: LustreError: 11772:0:(osc_page.c:193:osc_page_delete()) LBUG
      /var/log/nodes/sh-110-04.log:Sep 10 16:59:34 sh-110-04 kernel: LustreError: 10094:0:(osc_page.c:193:osc_page_delete()) LBUG
      /var/log/nodes/sh-110-04.log:Sep 10 18:29:22 sh-110-04 kernel: LustreError: 25649:0:(osc_page.c:193:osc_page_delete()) LBUG
      
      Sep  9 14:08:38 sh-110-04 kernel: perf: interrupt took too long (4932 > 4930), lowering kernel.perf_event_max_sample_rate to 40000
      Sep  9 14:10:01 sh-110-04 CROND[24303]: (root) CMD (/usr/lib64/sa/sa1 1 1)      
      Sep  9 14:20:01 sh-110-04 CROND[25059]: (root) CMD (/usr/lib64/sa/sa1 1 1)      
      Sep  9 14:30:01 sh-110-04 CROND[25808]: (root) CMD (/usr/lib64/sa/sa1 1 1)      
      Sep  9 14:32:01 sh-110-04 kernel: LNetError: 8091:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
      Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_cache.c:2575:osc_teardown_async_page()) extent ffff94ab962a3c30@{[32 -> 102/4095], [2|0|-|cache|wi|ffff9498052c8f00], [315392|71|+|-|ffff94c319791b00|4096|          (null)]} trunc at 32.
      Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_cache.c:2575:osc_teardown_async_page()) ### extent: ffff94ab962a3c30 ns: fir-OST0028-osc-ffff9496d9c3e000 lock: ffff94c319791b00/0x6233ab8351d87f8e lrc: 2/0,0 mode: PW/PW res: [0xa00000401:0x2db265b:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 131072->16777215) flags: 0x20000000000 nid: local remote: 0xcd88a3c067e066be expref: -99 pid: 26040 timeout: 0 lvb_type: 1
      Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) page@ffff94b6232a7600[2 ffff949c67e4db30 4 1           (null)]
      Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) vvp-page@ffff94b6232a7650(0:0) vm@ffffe46433a31e00 6fffff0000087d 3:0 ffff94b6232a7600 32 lru
      Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) lov-page@ffff94b6232a7690, comp index: 10000, gen: 7
      Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) osc-page@ffff94b6232a76c8 32: 1< 0x845fed 2 0 + - > 2< 131072 0 4096 0x0 0x420 |           (null) ffff94ad27592540 ffff9498052c8f00 > 3< 0 0 0 > 4< 0 0 8 210399232 - | - - + - > 5< - - + - | 0 - | 71 - ->
      Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) end page@ffff94b6232a7600
      Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) Trying to teardown failed: -16
      Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) ASSERTION( 0 ) failed: 
      Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) LBUG
      Sep  9 14:36:54 sh-110-04 kernel: Pid: 26083, comm: migratefs 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
      Sep  9 14:36:54 sh-110-04 kernel: Call Trace:                                   
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc0b9c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc0b9c87c>] lbug_with_loc+0x4c/0xa0 [libcfs] 
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc0dd08bf>] osc_page_delete+0x48f/0x500 [osc]
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc0eea770>] cl_page_delete0+0x80/0x220 [obdclass]
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc0eea943>] cl_page_delete+0x33/0x110 [obdclass]
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc120ff3f>] ll_invalidatepage+0x7f/0x170 [lustre]
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c7ddd>] do_invalidatepage_range+0x7d/0x90
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c7e87>] truncate_inode_page+0x77/0x80 
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c80ba>] truncate_inode_pages_range+0x1ea/0x700
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c863f>] truncate_inode_pages_final+0x4f/0x60
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc11f656f>] ll_delete_inode+0x4f/0x230 [lustre]
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd45ff24>] evict+0xb4/0x180         
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd46082c>] iput+0xfc/0x190          
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd4548de>] do_unlinkat+0x1ae/0x2d0  
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd45595b>] SyS_unlinkat+0x1b/0x40   
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd976ddb>] system_call_fastpath+0x22/0x27
      Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffffffffff>] 0xffffffffffffffff       
      Sep  9 14:40:11 sh-110-04 kernel: Initializing cgroup subsys cpuset             
      

      We didn't find any other LBUG so we think this is related to a specific access pattern. This comes from several jobs but similar ones.

      The logs do mention fir so it's against our 2.12.2_116+3 scratch filesystem. We use DoM and PFL.

      We've found other similar tickets, some of them are old and marked as resolved:

      LU-2720, LU-2557, LU-8388
      or more recent LU-12525 but I checked and we don't have the patch that was reverted as it was only on 2.13+ I think.

      Any idea? Do you think it's worth trying to upgrade our clients from 2.12.0+patches (including async_discard) to 2.12.2_116+3 like our servers?

      Thanks!
      Stephane

      Attachments

        Issue Links

          Activity

            People

              vsaveliev Vladimir Saveliev
              sthiell Stephane Thiell
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: