  Lustre / LU-1723

osc_page_delete()) ASSERTION(0) failed

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.2
    • Labels: None
    • Environment: Client RHEL-6.2 2.6.32-220.23.1.el6.x86_64
    • Severity: 3
    • Rank (Obsolete): 4210

    Description

      LBUG on production cluster while node was running a user job.

      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b5c0[1 ffff8806726e8f48:1263 ^(null)_ffff88106917b500 4 0 1 (null) (null) 0x1]
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b500[1 ffff880770b2c4c8:1263 ^ffff88106917b5c0_(null) 4 0 1 (null) (null) 0x0]
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) vvp-page@ffff8809d88c6be0(1:0:0) vm@ffffea0034f6ce20 c0000000000001 3:0 0 1263 lru
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) lov-page@ffff8809f22e93a8
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) osc-page@ffff881067d54bc8: 1< 0x845fed 1 0 - - + > 2< 5173248 0 4096 0x0 0x8 | (null) ffff8806d7f24688 ffff88016958f700 ffffffffa07ff>
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) end page@ffff88106917b5c0
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) Trying to teardown failed: -16
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) ASSERTION(0) failed
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) LBUG
      Aug 8 16:00:02 sand-4-52 kernel: Pid: 64082, comm: calculate_propa
      Aug 8 16:00:02 sand-4-52 kernel:
      Aug 8 16:00:02 sand-4-52 kernel: Call Trace:
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446e95>] lbug_with_loc+0x75/0xe0 [libcfs]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0451d86>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa07f8df6>] osc_page_delete+0x236/0x240 [osc]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0553d1e>] cl_page_delete0+0xce/0x400 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055128e>] ? cl_env_get+0x19e/0x350 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0550bf6>] ? cl_env_peek+0x36/0x110 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055408d>] cl_page_delete+0x3d/0xf0 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0560bde>] ? cl_io_is_going+0xe/0x20 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa08e791b>] ll_releasepage+0x10b/0x150 [lustre]
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81168ff0>] ? mem_cgroup_uncharge_cache_page+0x10/0x20
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811100b0>] try_to_release_page+0x30/0x60
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a4f1>] shrink_page_list.clone.0+0x4f1/0x5c0
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a8bb>] shrink_inactive_list+0x2fb/0x740
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112b5cf>] shrink_zone+0x38f/0x520
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112c374>] zone_reclaim+0x354/0x410
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112cfc0>] ? isolate_pages_global+0x0/0x350
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81122874>] get_page_from_freelist+0x694/0x820
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81123af1>] __alloc_pages_nodemask+0x111/0x940
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8116a728>] ? __mem_cgroup_try_charge+0x78/0x420
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811586ca>] alloc_pages_vma+0x9a/0x150
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81172015>] do_huge_pmd_anonymous_page+0x145/0x370
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8113c79a>] handle_mm_fault+0x25a/0x2b0
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81042c29>] __do_page_fault+0x139/0x480
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814ed250>] ? thread_return+0x4e/0x76e
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f2c8e>] do_page_fault+0x3e/0xa0
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f0045>] page_fault+0x25/0x30
      Aug 8 16:00:03 sand-4-52 kernel:
      Aug 8 16:00:03 sand-4-52 kernel: LustreError: dumping log to /tmp/lustre-log.1344438003.64082

      Attachments

        Activity

          bkorb Bruce Korb (Inactive) added a comment - - edited

          In today's Xyratex/Cray conference call, it was decided to go forward with backing out the LU-1030 patches from our release branch, pending further work on them.


          bkorb Bruce Korb (Inactive) added a comment -

          More comments are on the Xyratex bug. Given the difficulty/complexity of the change, I would happily accept guidance here (i.e. wait for you-all, or suggestions on how to proceed).

          Iurii Golovach added a comment - 06/Dec/12 4:34 AM

          During the implementation of extents at the OSC level, two cases fell through the cracks:

          • evict;
          • invalidate page.

          As a result, the extent flags for these cases weren't set and we have a race.

          Since adding these cases is quite a complex task, we recommend the following:
          1) revert the changes from these patches:
          http://review.whamcloud.com/#change,2270
          http://review.whamcloud.com/#change,2009
          http://review.whamcloud.com/#change,2460
          2) wait until Jinshan finishes his work on LU-1723 and then use the changes with correct extent support.

          Please let us know if you need a patch against 2.3 to reverse these patches.

          Cory Spitz added a comment - 06/Dec/12 2:42 PM

          Hi, that's a pretty big change. How many bug fixes would we lose? I'm not sure that the tradeoff for this bug is worth it (yet). Can you guys help with LU-1723, or are you just waiting for him?


          bkorb Bruce Korb (Inactive) added a comment -

          With a crash dump with debug tracing turned on, we get a lot more useful information.

          Alexey Lyashkov added a comment - 30/Nov/12 12:27 AM

          In fact, this is a bug in the new OSC extent code landed in 2.3 :/
          Both code paths
          1) ldlm lock cancel -> page discard ....
          and
          2) mkwrite /race/ -> invalidate page

          forget to find an extent and set the correct state for it. It looks
          like the osc extent author forgot about forced page flushes from the
          page cache with the mkwrite race and client eviction.

          For the first case it is more or less clear - cl_lock_discard_pages
          forgets to find an extent and set the truncate state, because it uses
          an old API and directly kills the pages. That might be fixed by
          replacing cl_lock_discard_pages with osc_extent_truncate_start or
          something similar.

          But the second case is not clear to me. It looks like we need to
          completely change the code near mkwrite and page invalidation.

          Otherwise (maybe a dirty hack) - we change osc_page_delete to ignore
          the error from osc_teardown_async_page for some 'force' cases, or
          (better) correctly abort the IO request for that page if a page
          discard is needed (the page may be under readahead, as it was in
          790798-osc_page_delete-assert0-sp1-1120-fulldbg.tar.bz2).
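
          For illustration, a minimal sketch (in C) of that "dirty hack" option, under stated assumptions: osc_page_force_discarded() and osc_abort_page_io() are hypothetical helper names, not actual Lustre functions, and the signatures are simplified from the 2.3-era code.

          static void osc_page_delete(const struct lu_env *env,
                                      const struct cl_page_slice *slice)
          {
                  struct osc_page *opg = cl2osc_page(slice);
                  int rc;

                  rc = osc_teardown_async_page(env, opg);
                  if (rc == -EBUSY && osc_page_force_discarded(opg)) {
                          /* hypothetical: the page is being forcibly
                           * discarded (client eviction or the mkwrite
                           * race), so abort its in-flight IO instead of
                           * asserting */
                          osc_abort_page_io(env, opg);
                          rc = 0;
                  }
                  if (rc != 0) {
                          CL_PAGE_DEBUG(D_ERROR, env, slice->cpl_page,
                                        "Trying to teardown failed: %d\n", rc);
                          LASSERT(0);
                  }
          }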

          bkorb Bruce Korb (Inactive) added a comment -

          Here is an xz-compressed "dk" log from a crash dump that is an example of the issue. The two processes 617 and 619 are at issue. There is a timing hole of some sort. I will append Shadow's and my analysis shortly.

          bkorb Bruce Korb (Inactive) added a comment - - edited

          I don't see an obvious way to add attachments, so I'll x-clip and paste some of the log info.
          NOTE: I used a script to rewrite the timestamps into a format compatible with the console log.
          This is the "lctl dk" output.

          2012-11-14T16-33-50.279302:3591:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1352932423/real 0] req@ffff88041df49c00 x1418650257981646/t0(0) o400->lustre1-OST0003-osc-ffff88061d741400@53@gni:28/4 lens 224/224 e 0 to 1 dl 1352932430 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          2012-11-14T16-33-50.279320:3591:(import.c:167:ptlrpc_set_import_discon()) lustre1-OST0003-osc-ffff88061d741400: Connection to lustre1-OST0003 (at 53@gni) was lost; in progress operations using this service will wait for recovery to complete
          2012-11-14T16-33-50.279328:3591:(import.c:177:ptlrpc_set_import_discon()) ffff880815628000 lustre1-OST0003_UUID: changing import state from FULL to DISCONN
          2012-11-14T16-33-50.279331:3591:(import.c:415:ptlrpc_fail_import()) lustre1-OST0003_UUID: waking up pinger

          2012-11-14T16-34-06.389737:3567:(import.c:1137:ptlrpc_connect_interpret()) ffff880815628000 lustre1-OST0003_UUID: changing import state from CONNECTING to DISCONN
          2012-11-14T16-34-06.389739:3567:(import.c:1183:ptlrpc_connect_interpret()) recovery of lustre1-OST0003_UUID on 53@gni failed (-110 (ETIMEDOUT))
          2012-11-14T16-34-47.095297:3564:(gnilnd_conn.c:2152:kgnilnd_reaper_dgram_check()) GNILND_DGRAM_REQ datagram to 42@gni timed out @ 63s dgram 0xffff880814862288 state GNILND_DGRAM_POSTED conn 0xffff880815013400

          2012-11-14T16-37-35.377578:3567:(import.c:815:ptlrpc_connect_interpret()) lustre1-MDT0000-mdc-ffff88061d741400: connect to target with instance 4
          2012-11-14T16-37-35.377583:3567:(import.c:965:ptlrpc_connect_interpret()) @@@ lustre1-MDT0000-mdc-ffff88061d741400: evicting (reconnect/recover flags not set: 4) req@ffff88041dd58000 x1418650257981665/t0(0) o38->lustre1-MDT0000-mdc-ffff88061d741400@52@gni:12/10 lens 400/264 e 0 to 0 dl 1352932681 ref 1 fl Interpret:RN/0/0 rc 0/0
          2012-11-14T16-37-35.377589:3567:(import.c:968:ptlrpc_connect_interpret()) ffff8808153cb000 lustre1-MDT0000_UUID: changing import state from CONNECTING to EVICTED

          2012-11-14T16-37-42.798033:3567:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 120/128
          2012-11-14T16-37-42.798035:3567:(import.c:965:ptlrpc_connect_interpret()) @@@ lustre1-OST0003-osc-ffff88061d741400: evicting (reconnect/recover flags not set: 4) req@ffff88081df80c00 x1418650257981674/t0(0) o8->lustre1-OST0003-osc-ffff88061d741400@53@gni:28/4 lens 400/264 e 0 to 0 dl 1352932683 ref 1 fl Interpret:RN/0/0 rc 0/0
          2012-11-14T16-37-42.798040:3567:(import.c:968:ptlrpc_connect_interpret()) ffff880815628000 lustre1-OST0003_UUID: changing import state from CONNECTING to EVICTED
          2012-11-14T16-37-42.798043:3567:(import.c:1324:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by lustre1-OST0003; in progress operations using this service will fail.
          2012-11-14T16-37-42.798044:3567:(import.c:1328:ptlrpc_import_recovery_state_machine()) evicted from lustre1-OST0003_UUID@53@gni; invalidating

          The connection recovers, according to logs:

          2012-11-14T16-37-42.798209:3867:(import.c:1394:ptlrpc_import_recovery_state_machine()) ffff8808155b6000 lustre1-OST0001_UUID: changing import state from RECOVER to FULL
          2012-11-14T16-37-42.798211:3867:(lov_obd.c:431:lov_set_osc_active()) Marking OSC lustre1-OST0001_UUID active
          2012-11-14T16-37-42.798213:3867:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 144/128
          2012-11-14T16-37-42.798216:3867:(import.c:1402:ptlrpc_import_recovery_state_machine()) lustre1-OST0001-osc-ffff88061d741400: Connection restored to lustre1-OST0001 (at 53@gni)

          Then about 400,000 copies of this:

          2012-11-14T16-41-29.357283:3941:(mdc_request.c:836:mdc_close()) @@@ matched open req@ffff88081542e800 x1418650258369942/t17185315789(17185315789) o101->lustre1-MDT0000-mdc-ffff88061d741400@52@gni:12/10 lens 560/600 e 0 to 0 dl 1352932906 ref 2 fl Complete:RP/4/0 rc 301/301

          followed 0.2 seconds later by the LBUG:

          2012-11-14T17-22-00.230715:2496:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060ba76480@{[13 -> 13/255], [2|0|-|cache|wi|ffff8802054a9648], [4096|1|+|-|ffff8802172f9318|256|(null)]} trunc at 13.
          2012-11-14T17-22-00.247868:2496:(osc_page.c:411:osc_page_delete()) page@ffff880403fcbcc0[2 ffff8802164aa948:13 ^(null)_ffff880403fcb840 4 0 1 (null) (null) 0x1]
          2012-11-14T17-22-00.260922:2496:(osc_page.c:411:osc_page_delete()) page@ffff880403fcb840[1 ffff880207a317d8:13 ^ffff880403fcbcc0_(null) 4 0 1 (null) (null) 0x0]
          2012-11-14T17-22-00.273976:2496:(osc_page.c:411:osc_page_delete()) vvp-page@ffff880401be3320(1:1:0) vm@ffffea000e052f28 6000000000000eb 9:0 0 13 lru
          2012-11-14T17-22-00.285984:2496:(osc_page.c:411:osc_page_delete()) lov-page@ffff8803ea750798
          2012-11-14T17-22-00.293128:2496:(osc_page.c:411:osc_page_delete()) osc-page@ffff880415ce6660: 1< 0x845fed 258 0 + - > 2< 53248 0 4096 0x0 0x520 | (null) ffff8808155c65f8 ffff8802054a9648 > 3< + ffff8801f01cb040 0 0 0 > 4< 0 0 8 34373632 - | - - + - > 5< - - + - | 0 - | 16 - ->
          2012-11-14T17-22-00.316352:2496:(osc_page.c:411:osc_page_delete()) end page@ffff880403fcbcc0
          2012-11-14T17-22-00.323489:2496:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
          2012-11-14T17-22-00.331062:2496:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
          2012-11-14T17-22-00.338026:2496:(osc_page.c:412:osc_page_delete()) LBUG
          2012-11-14T17-22-00.412653:2496:(linux-debug.c:143:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall LBUG,/usr/src/packages/BUILD/cray-lustre-2.3/lustre/osc/osc_page.c,osc_page_delete,412


          bkorb Bruce Korb (Inactive) added a comment -

          OK. Thank you. It just looks a lot alike.
          I will keep reading code.

          2012-11-06T14:57:10.872333-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060eedfe58@{[23 -> 23/255], [2|0|-|cache|wi|ffff88020e18f8c8], [4096|1|+|-|ffff8801f9ed9c18|256|(null)]} trunc at 23.
          2012-11-06T14:57:10.872389-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) page@ffff88080e8c8bc0[2 ffff88020cefad08:23 ^ (null)_ffff88080e8c8b00 4 0 1 (null) (null) 0x0]
          2012-11-06T14:57:10.902445-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) page@ffff88080e8c8b00[1 ffff88020d8c1f58:23 ^ffff88080e8c8bc0_ (null) 4 0 1 (null) (null) 0x0]
          2012-11-06T14:57:10.902494-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) vvp-page@ffff88080e8cf5a0(0:0:0) vm@ffffea001c6a99b8 e00000000000063 7:0 0 23 lru
          2012-11-06T14:57:10.902508-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) lov-page@ffff8808050a7888
          2012-11-06T14:57:10.958005-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) osc-page@ffff88080d78ed20: 1< 0x845fed 258 0 + - > 2< 94208 0 4096 0x0 0x520 | (null) ffff880804350700 ffff88020e18f8c8 > 3< + ffff8801f8f577b0 0 0 0 > 4< 0 0 8 33824768 - | - - + - > 5< - - + - | 0 - | 1 - ->
          2012-11-06T14:57:10.958049-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) end page@ffff88080e8c8bc0
          2012-11-06T14:57:10.983504-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
          2012-11-06T14:57:10.983536-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
          2012-11-06T14:57:10.983549-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:412:osc_page_delete()) LBUG
          2012-11-06T14:57:10.983570-06:00 c0-0c1s6n0 Pid: 5270, comm: fsx-linux
          2012-11-06T14:57:10.983582-06:00 c0-0c1s6n0 Call Trace:
          2012-11-06T14:57:10.983605-06:00 c0-0c1s6n0 [<ffffffff810063b1>] try_stack_unwind+0x161/0x1a0
          2012-11-06T14:57:11.009114-06:00 c0-0c1s6n0 [<ffffffff81004bf9>] dump_trace+0x89/0x440
          2012-11-06T14:57:11.009138-06:00 c0-0c1s6n0 [<ffffffffa014e887>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
          2012-11-06T14:57:11.009161-06:00 c0-0c1s6n0 [<ffffffffa014ede7>] lbug_with_loc+0x47/0xc0 [libcfs]
          2012-11-06T14:57:11.009173-06:00 c0-0c1s6n0 [<ffffffffa0671d41>] osc_page_delete+0x2d1/0x2e0 [osc]
          2012-11-06T14:57:11.034700-06:00 c0-0c1s6n0 [<ffffffffa02b0095>] cl_page_delete0+0xd5/0x4e0 [obdclass]
          2012-11-06T14:57:11.034737-06:00 c0-0c1s6n0 [<ffffffffa02b04e2>] cl_page_delete+0x42/0x120 [obdclass]
          2012-11-06T14:57:11.034762-06:00 c0-0c1s6n0 [<ffffffffa07f2e2d>] ll_invalidatepage+0x8d/0x170 [lustre]
          2012-11-06T14:57:11.034774-06:00 c0-0c1s6n0 [<ffffffffa07ea290>] ll_page_mkwrite+0x7c0/0x840 [lustre]
          2012-11-06T14:57:11.034820-06:00 c0-0c1s6n0 [<ffffffff81107cb7>] __do_fault+0xe7/0x570
          2012-11-06T14:57:11.034833-06:00 c0-0c1s6n0 [<ffffffff811081e4>] handle_pte_fault+0xa4/0xcd0
          2012-11-06T14:57:11.060561-06:00 c0-0c1s6n0 [<ffffffff81108fbe>] handle_mm_fault+0x1ae/0x240
          2012-11-06T14:57:11.060588-06:00 c0-0c1s6n0 [<ffffffff81025471>] do_page_fault+0x191/0x410
          2012-11-06T14:57:11.060600-06:00 c0-0c1s6n0 [<ffffffff81301b5f>] page_fault+0x1f/0x30
          2012-11-06T14:57:11.060647-06:00 c0-0c1s6n0 [<00000000200422b3>] 0x200422b3
          2012-11-06T14:57:11.060660-06:00 c0-0c1s6n0 Kernel panic - not syncing: LBUG


          jay Jinshan Xiong (Inactive) added a comment -

          This is a different issue from what was described by Oleg.


          bkorb Bruce Korb (Inactive) added a comment -

          OK, well we (Cory at Cray) see the same symptom, even after the LU-2170 patch.
          So there must be another cause for this.

          > Denis Kondratenko added a comment - 06/Nov/12 4:31 AM
          > Wally,
          > was this fixed by http://jira.whamcloud.com/browse/LU-2170 ?
          > [...]
          > Wally Wang added a comment - 06/Nov/12 3:19 PM
          > No, attached is the SP2 version of the cdump and it is with the LU-2170 applied.

          The SP2 version of the cdump was, unfortunately, corrupt, so I've been futzing
          with the pre-SP2 version. That earlier dump may well have been caused by the
          issue fixed in LU-2170, but not the later one. (Presuming Wally is correct about
          the patch having been applied.)

          I can attach the SP2 console log, if you like. But not the "lctl dk" dump,
          pending another crash dump.


          jay Jinshan Xiong (Inactive) added a comment -

          In LU-2170, I discovered that a write lock can be matched and enqueued by a glimpse process, so that the write lock is marked with INTENT in the enqueue RPC (the write lock was wrongly converted to a glimpse lock). The lock is then denied by the OST, and the client calls cl_lock_delete() to delete it and grants the cl_lock so the glimpse can go through. After the glimpse process finishes using the lock, it is deleted without being canceled.

          Depending on when this lock is found by another process, different symptoms show up. For example, in osc_extent_merge(), I discovered that one extent's oe_lock pointed to that "glimpse write" lock, while another extent had the normal one. For the osc_page_delete() case, a dirty page had been added under the "glimpse write" lock; since there is no cancel for this lock, the page remained in the cache after the lock was deleted. Afterwards, when a read lock was enqueued and canceled, it discovered this page and hit the assert.
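
          For illustration, a hedged sketch of the invariant that scenario violates: lock_covers_cached_pages() and lock_retire() are hypothetical names; the real flow lives in cl_lock_cancel()/cl_lock_delete() in obdclass.

          /* A cl_lock should be canceled before it is deleted so that any
           * pages cached under it are discarded; the "glimpse write" lock
           * described above was deleted without a cancel, leaving a stale
           * dirty page behind. */
          static void lock_retire(const struct lu_env *env, struct cl_lock *lock)
          {
                  if (lock_covers_cached_pages(lock))     /* hypothetical */
                          cl_lock_cancel(env, lock);      /* discards pages */
                  cl_lock_delete(env, lock);
          }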


          bkorb Bruce Korb (Inactive) added a comment -

          I am a little unclear as to how this was diagnosed as a duplicate of LU-2170.
          The assertions are different. This "assert(0)" was triggered because
          osc_page_delete() was called while the opg->ops_oap field referenced an
          osc_async_page structure with a non-empty oap_rpc_item list. (That caused
          osc_teardown_async_page to return EBUSY (16), and osc_page_delete handles
          that by asserting that zero is not zero.)

          So, I don't see an obvious reason the osc_extent_merge assertion would be
          seen as a duplicate of this issue.
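
          A condensed sketch of where that EBUSY comes from, paraphrased from the 2.3-era lustre/osc sources (not a verbatim excerpt; the signature is simplified):

          static int osc_teardown_async_page(const struct lu_env *env,
                                             struct osc_page *opg)
          {
                  struct osc_async_page *oap = &opg->ops_oap;
                  int rc = 0;

                  /* the async page is still queued on an in-flight RPC, so
                   * it cannot be torn down yet; osc_page_delete() turns the
                   * resulting -EBUSY (-16) into the ASSERTION(0)/LBUG seen
                   * in the logs above */
                  if (!list_empty(&oap->oap_rpc_item))
                          rc = -EBUSY;
                  return rc;
          }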


          jay Jinshan Xiong (Inactive) added a comment -

          duplicate of LU-2170


          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: wjt27 Wojciech Turek
            Votes: 2
            Watchers: 13

            Dates

              Created:
              Updated:
              Resolved: