  Lustre / LU-1723

osc_page_delete()) ASSERTION(0) failed

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.2
    • Labels: None
    • Environment: Client RHEL-6.2 2.6.32-220.23.1.el6.x86_64
    • Severity: 3
    • Rank (Obsolete): 4210

    Description

      LBUG on production cluster while node was running a user job.

      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b5c0[1 ffff8806726e8f48:1263 ^(null)_ffff88106917b500 4 0 1 (null) (null) 0x1]
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b500[1 ffff880770b2c4c8:1263 ^ffff88106917b5c0_(null) 4 0 1 (null) (null) 0x0]
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) vvp-page@ffff8809d88c6be0(1:0:0) vm@ffffea0034f6ce20 c0000000000001 3:0 0 1263 lru
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) lov-page@ffff8809f22e93a8
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) osc-page@ffff881067d54bc8: 1< 0x845fed 1 0 - - + > 2< 5173248 0 4096 0x0 0x8 | (null) ffff8806d7f24688 ffff88016958f700 ffffffffa07ff>
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) end page@ffff88106917b5c0
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) Trying to teardown failed: -16
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) ASSERTION(0) failed
      Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) LBUG
      Aug 8 16:00:02 sand-4-52 kernel: Pid: 64082, comm: calculate_propa
      Aug 8 16:00:02 sand-4-52 kernel:
      Aug 8 16:00:02 sand-4-52 kernel: Call Trace:
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446e95>] lbug_with_loc+0x75/0xe0 [libcfs]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0451d86>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa07f8df6>] osc_page_delete+0x236/0x240 [osc]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0553d1e>] cl_page_delete0+0xce/0x400 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055128e>] ? cl_env_get+0x19e/0x350 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0550bf6>] ? cl_env_peek+0x36/0x110 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055408d>] cl_page_delete+0x3d/0xf0 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0560bde>] ? cl_io_is_going+0xe/0x20 [obdclass]
      Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa08e791b>] ll_releasepage+0x10b/0x150 [lustre]
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81168ff0>] ? mem_cgroup_uncharge_cache_page+0x10/0x20
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811100b0>] try_to_release_page+0x30/0x60
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a4f1>] shrink_page_list.clone.0+0x4f1/0x5c0
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a8bb>] shrink_inactive_list+0x2fb/0x740
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112b5cf>] shrink_zone+0x38f/0x520
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112c374>] zone_reclaim+0x354/0x410
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112cfc0>] ? isolate_pages_global+0x0/0x350
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81122874>] get_page_from_freelist+0x694/0x820
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81123af1>] __alloc_pages_nodemask+0x111/0x940
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8116a728>] ? __mem_cgroup_try_charge+0x78/0x420
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811586ca>] alloc_pages_vma+0x9a/0x150
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81172015>] do_huge_pmd_anonymous_page+0x145/0x370
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8113c79a>] handle_mm_fault+0x25a/0x2b0
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81042c29>] __do_page_fault+0x139/0x480
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814ed250>] ? thread_return+0x4e/0x76e
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f2c8e>] do_page_fault+0x3e/0xa0
      Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f0045>] page_fault+0x25/0x30
      Aug 8 16:00:03 sand-4-52 kernel:
      Aug 8 16:00:03 sand-4-52 kernel: LustreError: dumping log to /tmp/lustre-log.1344438003.64082

      Attachments

        Activity

          bkorb Bruce Korb (Inactive) added a comment - - edited

          In today's Xyratex/Cray conference call, it was decided to go forward with backing out the LU-1030 patches from our release branch, pending further work on them.


          bkorb Bruce Korb (Inactive) added a comment -

          More comments are on the Xyratex bug. Given the difficulty/complexity of the change, I would happily accept guidance here (i.e. wait for you-all, or suggestions on how to proceed).

          Iurii Golovach added a comment - 06/Dec/12 4:34 AM

          During the implementation of extents at the OSC level, two cases fell through the cracks:

          • evict;
          • invalidate page.

          As a result, the extent flags for these cases weren't set and we have a race.

          Since adding these cases is quite a complex task, we recommend the following:
          1) revert the changes from these patches:
          http://review.whamcloud.com/#change,2270
          http://review.whamcloud.com/#change,2009
          http://review.whamcloud.com/#change,2460
          2) wait until Jinshan finishes his work on LU-1723 and then use the changes with correct extent support.

          Please let us know if you need a patch against 2.3 to reverse these patches.

          Cory Spitz added a comment - 06/Dec/12 2:42 PM

          Hi, that's a pretty big change. How many bug fixes would we lose? I'm not sure that the tradeoff for this bug is worth it (yet). Can you guys help with LU-1723, or are you just waiting for him?


          bkorb Bruce Korb (Inactive) added a comment -

          With a crash dump with debug tracing turned on, we get a lot more useful information.

          Alexey Lyashkov added a comment - 30/Nov/12 12:27 AM

          In fact, this is a bug in the new OSC extent code landed in 2.3 :/
          Both code paths
          1) ldlm lock cancel -> page discard ....
          and
          2) mkwrite /race/ -> invalidate page

          forget to find an extent and set the correct state for it. It looks
          like the osc extent author forgot about forced page flushes from the
          page cache with the mkwrite race and client eviction.

          For the first case it is more or less clear - cl_lock_discard_pages
          forgets to find an extent and set the truncate state, because it uses
          an old API and directly kills the pages. That might be fixed by
          replacing cl_lock_discard_pages with osc_extent_truncate_start or
          something similar.

          But the second case is not clear to me. It looks like we need to
          completely change the code near mkwrite and page invalidation.

          Otherwise (maybe a dirty hack) - we change osc_page_delete to ignore
          the error from osc_teardown_async_page for some 'force' cases, or
          (better) correctly abort the IO request for that page if a page
          discard is needed (the page may be under readahead, as it was in
          790798-osc_page_delete-assert0-sp1-1120-fulldbg.tar.bz2).
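
          For illustration, a minimal sketch (in C) of that "dirty hack" option, under stated assumptions: osc_page_force_discarded() and osc_abort_page_io() are hypothetical helper names, not actual Lustre functions, and the signatures are simplified from the 2.3-era code.

          static void osc_page_delete(const struct lu_env *env,
                                      const struct cl_page_slice *slice)
          {
                  struct osc_page *opg = cl2osc_page(slice);
                  int rc;

                  rc = osc_teardown_async_page(env, opg);
                  if (rc == -EBUSY && osc_page_force_discarded(opg)) {
                          /* hypothetical: the page is being forcibly
                           * discarded (client eviction or the mkwrite
                           * race), so abort its in-flight IO instead of
                           * asserting */
                          osc_abort_page_io(env, opg);
                          rc = 0;
                  }
                  if (rc != 0) {
                          CL_PAGE_DEBUG(D_ERROR, env, slice->cpl_page,
                                        "Trying to teardown failed: %d\n", rc);
                          LASSERT(0);
                  }
          }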

          bkorb Bruce Korb (Inactive) added a comment -

          Here is an xz-compressed "dk" log from a crash dump that is an example of the issue. The two processes 617 and 619 are at issue. There is a timing hole of some sort. I will append Shadow's and my analysis shortly.

          bkorb Bruce Korb (Inactive) added a comment - - edited

          I don't see an obvious way to add attachments, so I'll x-clip and paste some of the log info.
          NOTE: I used a script to rewrite the timestamps into a format compatible with the console log.
          This is the "lctl dk" output.

          2012-11-14T16-33-50.279302:3591:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1352932423/real 0] req@ffff88041df49c00 x1418650257981646/t0(0) o400->lustre1-OST0003-osc-ffff88061d741400@53@gni:28/4 lens 224/224 e 0 to 1 dl 1352932430 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          2012-11-14T16-33-50.279320:3591:(import.c:167:ptlrpc_set_import_discon()) lustre1-OST0003-osc-ffff88061d741400: Connection to lustre1-OST0003 (at 53@gni) was lost; in progress operations using this service will wait for recovery to complete
          2012-11-14T16-33-50.279328:3591:(import.c:177:ptlrpc_set_import_discon()) ffff880815628000 lustre1-OST0003_UUID: changing import state from FULL to DISCONN
          2012-11-14T16-33-50.279331:3591:(import.c:415:ptlrpc_fail_import()) lustre1-OST0003_UUID: waking up pinger

          2012-11-14T16-34-06.389737:3567:(import.c:1137:ptlrpc_connect_interpret()) ffff880815628000 lustre1-OST0003_UUID: changing import state from CONNECTING to DISCONN
          2012-11-14T16-34-06.389739:3567:(import.c:1183:ptlrpc_connect_interpret()) recovery of lustre1-OST0003_UUID on 53@gni failed (-110 (ETIMEDOUT))
          2012-11-14T16-34-47.095297:3564:(gnilnd_conn.c:2152:kgnilnd_reaper_dgram_check()) GNILND_DGRAM_REQ datagram to 42@gni timed out @ 63s dgram 0xffff880814862288 state GNILND_DGRAM_POSTED conn 0xffff880815013400

          2012-11-14T16-37-35.377578:3567:(import.c:815:ptlrpc_connect_interpret()) lustre1-MDT0000-mdc-ffff88061d741400: connect to target with instance 4
          2012-11-14T16-37-35.377583:3567:(import.c:965:ptlrpc_connect_interpret()) @@@ lustre1-MDT0000-mdc-ffff88061d741400: evicting (reconnect/recover flags not set: 4) req@ffff88041dd58000 x1418650257981665/t0(0) o38->lustre1-MDT0000-mdc-ffff88061d741400@52@gni:12/10 lens 400/264 e 0 to 0 dl 1352932681 ref 1 fl Interpret:RN/0/0 rc 0/0
          2012-11-14T16-37-35.377589:3567:(import.c:968:ptlrpc_connect_interpret()) ffff8808153cb000 lustre1-MDT0000_UUID: changing import state from CONNECTING to EVICTED

          2012-11-14T16-37-42.798033:3567:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 120/128
          2012-11-14T16-37-42.798035:3567:(import.c:965:ptlrpc_connect_interpret()) @@@ lustre1-OST0003-osc-ffff88061d741400: evicting (reconnect/recover flags not set: 4) req@ffff88081df80c00 x1418650257981674/t0(0) o8->lustre1-OST0003-osc-ffff88061d741400@53@gni:28/4 lens 400/264 e 0 to 0 dl 1352932683 ref 1 fl Interpret:RN/0/0 rc 0/0
          2012-11-14T16-37-42.798040:3567:(import.c:968:ptlrpc_connect_interpret()) ffff880815628000 lustre1-OST0003_UUID: changing import state from CONNECTING to EVICTED
          2012-11-14T16-37-42.798043:3567:(import.c:1324:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by lustre1-OST0003; in progress operations using this service will fail.
          2012-11-14T16-37-42.798044:3567:(import.c:1328:ptlrpc_import_recovery_state_machine()) evicted from lustre1-OST0003_UUID@53@gni; invalidating

          The connection recovers, according to logs:

          2012-11-14T16-37-42.798209:3867:(import.c:1394:ptlrpc_import_recovery_state_machine()) ffff8808155b6000 lustre1-OST0001_UUID: changing import state from RECOVER to FULL
          2012-11-14T16-37-42.798211:3867:(lov_obd.c:431:lov_set_osc_active()) Marking OSC lustre1-OST0001_UUID active
          2012-11-14T16-37-42.798213:3867:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 144/128
          2012-11-14T16-37-42.798216:3867:(import.c:1402:ptlrpc_import_recovery_state_machine()) lustre1-OST0001-osc-ffff88061d741400: Connection restored to lustre1-OST0001 (at 53@gni)

          Then about 400,000 copies of this:

          2012-11-14T16-41-29.357283:3941:(mdc_request.c:836:mdc_close()) @@@ matched open req@ffff88081542e800 x1418650258369942/t17185315789(17185315789) o101->lustre1-MDT0000-mdc-ffff88061d741400@52@gni:12/10 lens 560/600 e 0 to 0 dl 1352932906 ref 2 fl Complete:RP/4/0 rc 301/301

          followed 0.2 seconds later by the LBUG:

          2012-11-14T17-22-00.230715:2496:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060ba76480@{[13 -> 13/255], [2|0|-|cache|wi|ffff8802054a9648], [4096|1|+|-|ffff8802172f9318|256|(null)]} trunc at 13.
          2012-11-14T17-22-00.247868:2496:(osc_page.c:411:osc_page_delete()) page@ffff880403fcbcc0[2 ffff8802164aa948:13 ^(null)_ffff880403fcb840 4 0 1 (null) (null) 0x1]
          2012-11-14T17-22-00.260922:2496:(osc_page.c:411:osc_page_delete()) page@ffff880403fcb840[1 ffff880207a317d8:13 ^ffff880403fcbcc0_(null) 4 0 1 (null) (null) 0x0]
          2012-11-14T17-22-00.273976:2496:(osc_page.c:411:osc_page_delete()) vvp-page@ffff880401be3320(1:1:0) vm@ffffea000e052f28 6000000000000eb 9:0 0 13 lru
          2012-11-14T17-22-00.285984:2496:(osc_page.c:411:osc_page_delete()) lov-page@ffff8803ea750798
          2012-11-14T17-22-00.293128:2496:(osc_page.c:411:osc_page_delete()) osc-page@ffff880415ce6660: 1< 0x845fed 258 0 + - > 2< 53248 0 4096 0x0 0x520 | (null) ffff8808155c65f8 ffff8802054a9648 > 3< + ffff8801f01cb040 0 0 0 > 4< 0 0 8 34373632 - | - - + - > 5< - - + - | 0 - | 16 - ->
          2012-11-14T17-22-00.316352:2496:(osc_page.c:411:osc_page_delete()) end page@ffff880403fcbcc0
          2012-11-14T17-22-00.323489:2496:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
          2012-11-14T17-22-00.331062:2496:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
          2012-11-14T17-22-00.338026:2496:(osc_page.c:412:osc_page_delete()) LBUG
          2012-11-14T17-22-00.412653:2496:(linux-debug.c:143:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall LBUG,/usr/src/packages/BUILD/cray-lustre-2.3/lustre/osc/osc_page.c,osc_page_delete,412


          bkorb Bruce Korb (Inactive) added a comment -

          OK. Thank you. It just looks a lot alike.
          I will keep reading code.

          2012-11-06T14:57:10.872333-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060eedfe58@{[23 -> 23/255], [2|0|-|cache|wi|ffff88020e18f8c8], [4096|1|+|-|ffff8801f9ed9c18|256|(null)]} trunc at 23.
          2012-11-06T14:57:10.872389-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) page@ffff88080e8c8bc0[2 ffff88020cefad08:23 ^ (null)_ffff88080e8c8b00 4 0 1 (null) (null) 0x0]
          2012-11-06T14:57:10.902445-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) page@ffff88080e8c8b00[1 ffff88020d8c1f58:23 ^ffff88080e8c8bc0_ (null) 4 0 1 (null) (null) 0x0]
          2012-11-06T14:57:10.902494-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) vvp-page@ffff88080e8cf5a0(0:0:0) vm@ffffea001c6a99b8 e00000000000063 7:0 0 23 lru
          2012-11-06T14:57:10.902508-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) lov-page@ffff8808050a7888
          2012-11-06T14:57:10.958005-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) osc-page@ffff88080d78ed20: 1< 0x845fed 258 0 + - > 2< 94208 0 4096 0x0 0x520 | (null) ffff880804350700 ffff88020e18f8c8 > 3< + ffff8801f8f577b0 0 0 0 > 4< 0 0 8 33824768 - | - - + - > 5< - - + - | 0 - | 1 - ->
          2012-11-06T14:57:10.958049-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) end page@ffff88080e8c8bc0
          2012-11-06T14:57:10.983504-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
          2012-11-06T14:57:10.983536-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
          2012-11-06T14:57:10.983549-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:412:osc_page_delete()) LBUG
          2012-11-06T14:57:10.983570-06:00 c0-0c1s6n0 Pid: 5270, comm: fsx-linux
          2012-11-06T14:57:10.983582-06:00 c0-0c1s6n0 Call Trace:
          2012-11-06T14:57:10.983605-06:00 c0-0c1s6n0 [<ffffffff810063b1>] try_stack_unwind+0x161/0x1a0
          2012-11-06T14:57:11.009114-06:00 c0-0c1s6n0 [<ffffffff81004bf9>] dump_trace+0x89/0x440
          2012-11-06T14:57:11.009138-06:00 c0-0c1s6n0 [<ffffffffa014e887>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
          2012-11-06T14:57:11.009161-06:00 c0-0c1s6n0 [<ffffffffa014ede7>] lbug_with_loc+0x47/0xc0 [libcfs]
          2012-11-06T14:57:11.009173-06:00 c0-0c1s6n0 [<ffffffffa0671d41>] osc_page_delete+0x2d1/0x2e0 [osc]
          2012-11-06T14:57:11.034700-06:00 c0-0c1s6n0 [<ffffffffa02b0095>] cl_page_delete0+0xd5/0x4e0 [obdclass]
          2012-11-06T14:57:11.034737-06:00 c0-0c1s6n0 [<ffffffffa02b04e2>] cl_page_delete+0x42/0x120 [obdclass]
          2012-11-06T14:57:11.034762-06:00 c0-0c1s6n0 [<ffffffffa07f2e2d>] ll_invalidatepage+0x8d/0x170 [lustre]
          2012-11-06T14:57:11.034774-06:00 c0-0c1s6n0 [<ffffffffa07ea290>] ll_page_mkwrite+0x7c0/0x840 [lustre]
          2012-11-06T14:57:11.034820-06:00 c0-0c1s6n0 [<ffffffff81107cb7>] __do_fault+0xe7/0x570
          2012-11-06T14:57:11.034833-06:00 c0-0c1s6n0 [<ffffffff811081e4>] handle_pte_fault+0xa4/0xcd0
          2012-11-06T14:57:11.060561-06:00 c0-0c1s6n0 [<ffffffff81108fbe>] handle_mm_fault+0x1ae/0x240
          2012-11-06T14:57:11.060588-06:00 c0-0c1s6n0 [<ffffffff81025471>] do_page_fault+0x191/0x410
          2012-11-06T14:57:11.060600-06:00 c0-0c1s6n0 [<ffffffff81301b5f>] page_fault+0x1f/0x30
          2012-11-06T14:57:11.060647-06:00 c0-0c1s6n0 [<00000000200422b3>] 0x200422b3
          2012-11-06T14:57:11.060660-06:00 c0-0c1s6n0 Kernel panic - not syncing: LBUG


          jay Jinshan Xiong (Inactive) added a comment -

          This is a different issue from what was described by Oleg.


          bkorb Bruce Korb (Inactive) added a comment -

          OK, well we (Cory at Cray) see the same symptom, even after the LU-2170 patch.
          So there must be another cause for this.

          > Denis Kondratenko added a comment - 06/Nov/12 4:31 AM
          > Wally,
          > was this fixed by http://jira.whamcloud.com/browse/LU-2170 ?
          > [...]
          > Wally Wang added a comment - 06/Nov/12 3:19 PM
          > No, attached is the SP2 version of the cdump and it is with the LU-2170 applied.

          The SP2 version of the cdump was, unfortunately, corrupt, so I've been futzing
          with the pre-SP2 version. That earlier dump may well have been caused by the
          issue fixed in LU-2170, but not the later one. (Presuming Wally is correct about
          the patch having been applied.)

          I can attach the SP2 console log, if you like. But not the "lctl dk" dump,
          pending another crash dump.


          jay Jinshan Xiong (Inactive) added a comment -

          In LU-2170, I discovered that a write lock can be matched and enqueued by a glimpse process, so that the write lock is marked with INTENT in the enqueue RPC (the write lock was wrongly converted to a glimpse lock). The lock is then denied by the OST, and the client calls cl_lock_delete() to delete it and grants the cl_lock so the glimpse can go through. After the glimpse process finishes using the lock, it is deleted without being canceled.

          Depending on when this lock is found by another process, different symptoms show up. For example, in osc_extent_merge(), I discovered that one extent's oe_lock pointed to that "glimpse write" lock, while another extent had the normal one. For the osc_page_delete() case, a dirty page had been added under the "glimpse write" lock; since there is no cancel for this lock, the page remained in the cache after the lock was deleted. Afterwards, when a read lock was enqueued and canceled, it discovered this page and hit the assert.
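
          For illustration, a hedged sketch of the invariant that scenario violates: lock_covers_cached_pages() and lock_retire() are hypothetical names; the real flow lives in cl_lock_cancel()/cl_lock_delete() in obdclass.

          /* A cl_lock should be canceled before it is deleted so that any
           * pages cached under it are discarded; the "glimpse write" lock
           * described above was deleted without a cancel, leaving a stale
           * dirty page behind. */
          static void lock_retire(const struct lu_env *env, struct cl_lock *lock)
          {
                  if (lock_covers_cached_pages(lock))     /* hypothetical */
                          cl_lock_cancel(env, lock);      /* discards pages */
                  cl_lock_delete(env, lock);
          }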


          bkorb Bruce Korb (Inactive) added a comment -

          I am a little unclear as to how this was diagnosed as a duplicate of LU-2170.
          The assertions are different. This "assert(0)" was triggered because
          osc_page_delete() was called while the opg->ops_oap field referenced an
          osc_async_page structure with a non-empty oap_rpc_item list. (That caused
          osc_teardown_async_page to return EBUSY (16), and osc_page_delete handles
          that by asserting that zero is not zero.)

          So, I don't see an obvious reason the osc_extent_merge assertion would be
          seen as a duplicate of this issue.
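
          A condensed sketch of where that EBUSY comes from, paraphrased from the 2.3-era lustre/osc sources (not a verbatim excerpt; the signature is simplified):

          static int osc_teardown_async_page(const struct lu_env *env,
                                             struct osc_page *opg)
          {
                  struct osc_async_page *oap = &opg->ops_oap;
                  int rc = 0;

                  /* the async page is still queued on an in-flight RPC, so
                   * it cannot be torn down yet; osc_page_delete() turns the
                   * resulting -EBUSY (-16) into the ASSERTION(0)/LBUG seen
                   * in the logs above */
                  if (!list_empty(&oap->oap_rpc_item))
                          rc = -EBUSY;
                  return rc;
          }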


          jay Jinshan Xiong (Inactive) added a comment -

          duplicate of LU-2170


          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: wjt27 Wojciech Turek
            Votes: 2
            Watchers: 13

            Dates

              Created:
              Updated:
              Resolved: