LU-8497: GPF after page allocation failure in osp_send_update_thread()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.2
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Severity: 3

    Description

      This is seen running racer with MDSCOUNT=4 this morning (v2_8_56_0-24-g272c136):

      [ 1181.303743] Lustre: DEBUG MARKER: == racer test 1: racer on clients: t DURATION=300 == 09:28:27 (1470925707)
      [ 1182.650010] LustreError: 27979:0:(llog_cat.c:712:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0002: fail to cancel 1 of 1 llog-records: rc = -116
      [ 1182.662231] LustreError: 27979:0:(llog_cat.c:712:llog_cat_cancel_records()) Skipped 2 previous similar messages
      [ 1183.479062] mdt_out01_000: page allocation failure. order:4, mode:0x50
      [ 1183.484771] Pid: 27608, comm: mdt_out01_000 Not tainted 2.6.32-431.29.2.el6.lustre.x86_64 #1
      [ 1183.492551] Call Trace:
      [ 1183.494672]  [<ffffffff81142941>] ? __alloc_pages_nodemask+0x791/0x950
      [ 1183.500338]  [<ffffffff81183e8e>] ? kmem_getpages+0x6e/0x170
      [ 1183.505207]  [<ffffffff81186819>] ? fallback_alloc+0x1c9/0x2b0
      [ 1183.510370]  [<ffffffff81186077>] ? cache_grow+0x4c7/0x530
      [ 1183.515511]  [<ffffffff811864fb>] ? ____cache_alloc_node+0xab/0x200
      [ 1183.521154]  [<ffffffff81186c53>] ? __kmalloc+0x273/0x310
      [ 1183.526091]  [<ffffffffa1666f84>] ? out_handle+0xa84/0x1870 [ptlrpc]
      [ 1183.531889]  [<ffffffffa1666f84>] ? out_handle+0xa84/0x1870 [ptlrpc]
      [ 1183.537747]  [<ffffffff81554a7a>] ? __mutex_lock_common+0x2ca/0x400
      [ 1183.543492]  [<ffffffffa1657650>] ? req_can_reconstruct+0x50/0x130 [ptlrpc]
      [ 1183.550049]  [<ffffffffa1657650>] ? req_can_reconstruct+0x50/0x130 [ptlrpc]
      [ 1183.556371]  [<ffffffffa15fa5bc>] ? lustre_msg_get_opc+0x8c/0x100 [ptlrpc]
      [ 1183.562476]  [<ffffffff815545ee>] ? mutex_unlock+0xe/0x10
      [ 1183.567858]  [<ffffffffa165766e>] ? req_can_reconstruct+0x6e/0x130 [ptlrpc]
      [ 1183.574021]  [<ffffffffa165e8ff>] ? tgt_request_handle+0x90f/0x1470 [ptlrpc]
      [ 1183.580210]  [<ffffffffa160c752>] ? ptlrpc_main+0xd02/0x1800 [ptlrpc]
      [ 1183.586378]  [<ffffffffa160ba50>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      [ 1183.592153]  [<ffffffff8109e856>] ? kthread+0x96/0xa0
      [ 1183.597098]  [<ffffffff8100c30a>] ? child_rip+0xa/0x20
      [ 1183.602225]  [<ffffffff815562e0>] ? _spin_unlock_irq+0x30/0x40
      [ 1183.607772]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
      [ 1183.612791]  [<ffffffff8109e7c0>] ? kthread+0x0/0xa0
      [ 1183.617476]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
      [ 1183.622146] Mem-Info:
      [ 1183.624184] Node 0 DMA per-cpu:
      [ 1183.626999] CPU    0: hi:    0, btch:   1 usd:   0
      [ 1183.631320] CPU    1: hi:    0, btch:   1 usd:   0
      [ 1183.635664] CPU    2: hi:    0, btch:   1 usd:   0
      [ 1183.639960] CPU    3: hi:    0, btch:   1 usd:   0
      [ 1183.644777] Node 0 DMA32 per-cpu:
      [ 1183.647943] CPU    0: hi:  186, btch:  31 usd:  29
      [ 1183.652328] CPU    1: hi:  186, btch:  31 usd:  47
      [ 1183.656776] CPU    2: hi:  186, btch:  31 usd:  46
      [ 1183.661341] CPU    3: hi:  186, btch:  31 usd:   0
      [ 1183.665760] active_anon:52142 inactive_anon:6 isolated_anon:0
      [ 1183.665761]  active_file:56776 inactive_file:58980 isolated_file:0
      [ 1183.665762]  unevictable:0 dirty:248 writeback:0 unstable:0
      [ 1183.665763]  free:63417 slab_reclaimable:18898 slab_unreclaimable:98982
      [ 1183.665764]  mapped:2060 shmem:97 pagetables:20216 bounce:0
      [ 1183.693292] Node 0 DMA free:8340kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15364kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7204kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      [ 1183.727158] lowmem_reserve[]: 0 2004 2004 2004
      [ 1183.731294] Node 0 DMA32 free:288564kB min:44720kB low:55900kB high:67080kB active_anon:208856kB inactive_anon:24kB active_file:208264kB inactive_file:208424kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052312kB mlocked:0kB dirty:1088kB writeback:0kB mapped:8164kB shmem:388kB slab_reclaimable:77228kB slab_unreclaimable:389060kB kernel_stack:13232kB pagetables:81316kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      [ 1183.768744] lowmem_reserve[]: 0 0 0 0
      [ 1183.772294] Node 0 DMA: 1*4kB 0*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8340kB
      [ 1183.783726] Node 0 DMA32: 30239*4kB 17514*8kB 3395*16kB 863*32kB 215*64kB 101*128kB 28*256kB 1*512kB 4*1024kB 2*2048kB 3*4096kB = 397852kB
      [ 1183.798187] 70451 total pagecache pages
      [ 1183.802374] 0 pages in swap cache
      [ 1183.806010] Swap cache stats: add 0, delete 0, find 0/0
      [ 1183.811691] Free swap  = 0kB
      [ 1183.814789] Total swap = 0kB
      [ 1183.824529] 524285 pages RAM
      [ 1183.827700] 49009 pages reserved
      [ 1183.831149] 291298 pages shared
      [ 1183.834550] 215587 pages non-shared
      [ 1183.838902] LustreError: 11-0: lustre-MDT0001-osp-MDT0002: operation out_update to node 0@lo failed: rc = -12
      [ 1183.849680] LustreError: 27973:0:(layout.c:2074:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)
      [ 1183.849682]   req@ffff88006df62088 x1542376881533200/t0(0) o1000->lustre-MDT0001-osp-MDT0002@0@lo:24/4 lens 424/192 e 0 to 0 dl 1470925716 ref 2 fl Interpret:ReM/0/0 rc -12/-12
      [ 1183.883591] general protection fault: 0000 [#1] SMP 
      [ 1183.884386] last sysfs file: /sys/devices/system/cpu/possible
      [ 1183.884386] CPU 3 
      [ 1183.884386] Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) lfsck(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc_gss(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) libcfs(U) exportfs sha512_generic sha256_generic autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipv6 zfs(U) zcommon(U) znvpair(U) spl(U) zlib_deflate zavl(U) zunicode(U) microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      [ 1183.884386] 
      [ 1183.884386] Pid: 27973, comm: osp_up1-2 Not tainted 2.6.32-431.29.2.el6.lustre.x86_64 #1 Bochs Bochs
      [ 1183.884386] RIP: 0010:[<ffffffffa0e4793f>]  [<ffffffffa0e4793f>] osp_send_update_thread+0x1bf/0x584 [osp]
      [ 1183.884386] RSP: 0000:ffff880020b39dd0  EFLAGS: 00010282
      [ 1183.884386] RAX: ffff880065e6c3e0 RBX: ffff880020a18208 RCX: 0000000000000000
      [ 1183.884386] RDX: 000000000000000d RSI: 6b6b6b6b6b6b6b6b RDI: 0000000000000282
      [ 1183.884386] RBP: ffff880020b39eb0 R08: 0000000000000000 R09: ffff8800729b4508
      [ 1183.884386] R10: 09f911029d74e35b R11: 0000000000000000 R12: ffff8800197e7ed8
      [ 1183.884386] R13: ffff8800197e7ee8 R14: ffff880020b39e10 R15: ffff880020b39e78
      [ 1183.884386] FS:  0000000000000000(0000) GS:ffff88000d800000(0000) knlGS:0000000000000000
      [ 1183.884386] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      [ 1183.884386] CR2: 0000003d046acc6d CR3: 0000000001a85000 CR4: 00000000000006e0
      [ 1183.884386] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1183.884386] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 1183.884386] Process osp_up1-2 (pid: 27973, threadinfo ffff880020b38000, task ffff880016aa6480)
      [ 1183.884386] Stack:
      [ 1183.884386]  0000000000000000 ffff880016aa6480 ffff880016aa6480 ffff880020a18460
      [ 1183.884386] <d> ffff8800197e7ee8 ffff8800197e7f28 ffff880016aa6480 00000000fffffff4
      [ 1183.884386] <d> 0000000210000003 0000000000000000 ffff880018e162f0 ffff880020b39e28
      [ 1183.884386] Call Trace:
      [ 1183.884386]  [<ffffffff81061d90>] ? default_wake_function+0x0/0x20
      [ 1183.884386]  [<ffffffffa0e47780>] ? osp_send_update_thread+0x0/0x584 [osp]
      [ 1183.884386]  [<ffffffff8109e856>] kthread+0x96/0xa0
      [ 1183.884386]  [<ffffffff8100c30a>] child_rip+0xa/0x20
      [ 1183.884386]  [<ffffffff815562e0>] ? _spin_unlock_irq+0x30/0x40
      [ 1183.884386]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
      [ 1183.884386]  [<ffffffff8109e7c0>] ? kthread+0x0/0xa0
      

      Attachments

        Activity

          [LU-8497] GPF after page allocation failure in osp_send_update_thread()

          gerrit Gerrit Updater added a comment -

          John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30354/
          Subject: LU-8497 osp: handle remote -ENOMEM from osp_send_update_req()
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set:
          Commit: 2f7cf70d70db2e3a801644f6d7eaeb09311e322b

          gerrit Gerrit Updater added a comment -

          Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30354
          Subject: LU-8497 osp: handle remote -ENOMEM from osp_send_update_req()
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set: 1
          Commit: c72defc7b06499380e89c88b904ccf10d2041245

          pjones Peter Jones added a comment -

          Landed for 2.11


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30083/
          Subject: LU-8497 osp: handle remote -ENOMEM from osp_send_update_req()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: cef8983c8b5bf51b58df23a779769cc4b8ca8db5

          adilger Andreas Dilger added a comment -

          I think this was hit again during mds-survey testing: https://testing.hpdd.intel.com/test_sets/245a0ee6-d0a5-11e7-8027-52540065bddc

          {noformat}

          [33205.282827] mdt_out00_000: page allocation failure: order:4, mode:0x10c050
          [33205.286444] CPU: 1 PID: 15315 Comm: mdt_out00_000 Tainted: G OE ------------ 3.10.0-693.5.2.el7_lustre.x86_64 #1
          [33205.289410] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
          [33205.299931] Call Trace:
          [33205.302118] [<ffffffff816a3e2d>] dump_stack+0x19/0x1b
          [33205.304492] [<ffffffff81188820>] warn_alloc_failed+0x110/0x180
          [33205.306852] [<ffffffff8169fe2a>] __alloc_pages_slowpath+0x6b6/0x724
          [33205.309234] [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420
          [33205.311580] [<ffffffff811d1078>] alloc_pages_current+0x98/0x110
          [33205.313840] [<ffffffff8118761e>] __get_free_pages+0xe/0x40
          [33205.316047] [<ffffffff811dca2e>] kmalloc_order_trace+0x2e/0xa0
          [33205.318248] [<ffffffff811e05c1>] __kmalloc+0x211/0x230
          [33205.322857] [<ffffffffc0b4fc5d>] out_handle+0xa5d/0x1920 [ptlrpc]
          [33205.329424] [<ffffffffc0b4a9b5>] tgt_request_handle+0x925/0x13b0 [ptlrpc]
          [33205.331567] [<ffffffffc0aeec3e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
          [33205.335695] [<ffffffffc0af23e2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
          [33205.343358] [<ffffffff810b099f>] kthread+0xcf/0xe0
          [33205.413346] LustreError: 11-0: lustre-MDT0002-osp-MDT0000: operation out_update to node 0@lo failed: rc = -12
          [33205.417858] LustreError: 15331:0:(layout.c:2087:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)
          [33205.417858] req@ffff88006fdc1500 x1584861919186832/t0(0) o1000->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 376/192 e 0 to 0 dl 1511475148 ref 2 fl Interpret:ReM/0/0 rc -12/-12
          [33205.427203] LustreError: 20691:0:(echo_client.c:2048:echo_md_destroy_internal()) Can not unlink child 1073743783: rc = -12
          [33205.430025] LustreError: 20691:0:(echo_client.c:2109:echo_destroy_object()) Can not unlink child (null): rc = -12
          [33205.433216] general protection fault: 0000 1 SMP
          [33205.444034] CPU: 1 PID: 15331 Comm: osp_up2-0 Tainted: G OE ------------ 3.10.0-693.5.2.el7_lustre.x86_64 #1
          [33205.444034] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
          [33205.444034] RIP: 0010:[<ffffffffc10938bd>] [<ffffffffc10938bd>] osp_send_update_thread+0x1dd/0x600 [osp]

          
          
          adilger Andreas Dilger added a comment - I think this was hit again during mds-survey testing: https://testing.hpdd.intel.com/test_sets/245a0ee6-d0a5-11e7-8027-52540065bddc {nofromat} [33205.282827] mdt_out00_000: page allocation failure: order:4, mode:0x10c050 [33205.286444] CPU: 1 PID: 15315 Comm: mdt_out00_000 Tainted: G OE ------------ 3.10.0-693.5.2.el7_lustre.x86_64 #1 [33205.289410] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [33205.299931] Call Trace: [33205.302118] [<ffffffff816a3e2d>] dump_stack+0x19/0x1b [33205.304492] [<ffffffff81188820>] warn_alloc_failed+0x110/0x180 [33205.306852] [<ffffffff8169fe2a>] __alloc_pages_slowpath+0x6b6/0x724 [33205.309234] [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420 [33205.311580] [<ffffffff811d1078>] alloc_pages_current+0x98/0x110 [33205.313840] [<ffffffff8118761e>] __get_free_pages+0xe/0x40 [33205.316047] [<ffffffff811dca2e>] kmalloc_order_trace+0x2e/0xa0 [33205.318248] [<ffffffff811e05c1>] __kmalloc+0x211/0x230 [33205.322857] [<ffffffffc0b4fc5d>] out_handle+0xa5d/0x1920 [ptlrpc] [33205.329424] [<ffffffffc0b4a9b5>] tgt_request_handle+0x925/0x13b0 [ptlrpc] [33205.331567] [<ffffffffc0aeec3e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc] [33205.335695] [<ffffffffc0af23e2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] [33205.343358] [<ffffffff810b099f>] kthread+0xcf/0xe0 [33205.413346] LustreError: 11-0: lustre-MDT0002-osp-MDT0000: operation out_update to node 0@lo failed: rc = -12 [33205.417858] LustreError: 15331:0:(layout.c:2087:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server) [33205.417858] req@ffff88006fdc1500 x1584861919186832/t0(0) o1000->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 376/192 e 0 to 0 dl 1511475148 ref 2 fl Interpret:ReM/0/0 rc -12/-12 [33205.427203] LustreError: 20691:0:(echo_client.c:2048:echo_md_destroy_internal()) Can not unlink child 1073743783: rc = -12 [33205.430025] LustreError: 20691:0:(echo_client.c:2109:echo_destroy_object()) Can not unlink child (null): rc = -12 [33205.433216] general protection fault: 0000 1 SMP [33205.444034] CPU: 1 PID: 15331 Comm: osp_up2-0 Tainted: G OE ------------ 3.10.0-693.5.2.el7_lustre.x86_64 #1 [33205.444034] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [33205.444034] RIP: 0010: [<ffffffffc10938bd>] [<ffffffffc10938bd>] osp_send_update_thread+0x1dd/0x600 [osp]

          gerrit Gerrit Updater added a comment -

          John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30083
          Subject: LU-8497 osp: handle remote -ENOMEM from osp_send_update_req()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: df1bbd79d276df84717a905b1a34be7e2fe43081

          jhammond John Hammond added a comment - - edited

          The rc == -ENOMEM test is confusing local -ENOMEM (from ptlrpc_queue_wait() meaning that the request was never sent) with remote -ENOMEM (meaning that the request was sent and the interpreter was called). Note that the req->rq_set == NULL condition is always true after we leave ptlrpc_queue_wait().
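          To make the two cases concrete, the sketch below is a toy userspace analogue rather than Lustre code: toy_queue_wait(), struct toy_req and its fields are invented for illustration. A call can return -ENOMEM either because it failed locally before the request was ever sent (so no completion callback will run), or because the request was sent and the remote side replied -ENOMEM (so the callback path already owns part of the cleanup); the return code alone cannot tell the caller which cleanup it still owes.

          #include <errno.h>
          #include <stdio.h>

          /* Toy stand-ins for the RPC request and its completion callback. */
          struct toy_req {
                  int rq_status;      /* status carried back from the "server" */
                  int callback_armed; /* completion callback ("interpreter") will run */
          };

          /* Returns -ENOMEM in two very different situations:
           *  - local failure: allocation failed before the request was sent, so no
           *    callback will ever run and the caller still owns all cleanup;
           *  - remote failure: the request was sent, the server replied -ENOMEM, and
           *    the callback path already owns part of the cleanup. */
          static int toy_queue_wait(struct toy_req *req, int fail_locally)
          {
                  if (fail_locally)
                          return -ENOMEM;          /* never sent, callback not armed */

                  req->callback_armed = 1;         /* interpreter will (or did) run */
                  req->rq_status = -ENOMEM;        /* server-side allocation failure */
                  return req->rq_status;
          }

          int main(void)
          {
                  struct toy_req req = { 0, 0 };
                  int rc = toy_queue_wait(&req, 0 /* fail_locally */);

                  /* Checking rc alone, as the rc == -ENOMEM test does, conflates the
                   * two cases; whether the callback ran is what decides who must drop
                   * the last reference. */
                  if (rc == -ENOMEM && !req.callback_armed)
                          printf("local -ENOMEM: caller must release its reference\n");
                  else if (rc == -ENOMEM)
                          printf("remote -ENOMEM: callback owns the release; an extra put here is the bug\n");
                  return 0;
          }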

          jhammond John Hammond added a comment -

          Seeing this again while testing 2.10.1-RC1 (I doubt it ever went away). This is due to a use after free of our in osp_send_update_thread() following an extra thandle put in osp_send_update_req() when the request status is -ENOMEM:

          osp_send_update_req():
                          if ((rc == -ENOMEM && req->rq_set == NULL) ||
                              (req->rq_transno == 0 && !req->rq_committed)) {
                                  if (args->oaua_update != NULL) {
                                          /* If osp_update_interpret is not being called,           
                                           * release the osp_thandle */
                                          args->oaua_update = NULL;
                                          osp_thandle_put(env, oth);
                                  }
          
                                  req->rq_cb_data = NULL;
                                  rc = rc == 0 ? req->rq_status : rc;
                                  osp_trans_callback(env, oth, rc);
                                  osp_thandle_put(env, oth); /* HERE */
                                  GOTO(out, rc);
                          }
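          Below is a minimal userspace refcounting sketch of that failure mode, using invented names (struct toy_handle, toy_put()) rather than the real osp structures: one reference is held by the sender thread and one by the completion path, and the extra put from the error branch drops the count to zero while the thread still holds its pointer, which is the use after free behind the GPF.

          #include <stdio.h>
          #include <stdlib.h>

          /* Minimal refcounted object standing in for the osp thandle/update request. */
          struct toy_handle {
                  int refcount;
          };

          static void toy_put(struct toy_handle *h)
          {
                  if (--h->refcount == 0) {
                          printf("handle freed\n");
                          free(h);
                  }
          }

          int main(void)
          {
                  struct toy_handle *h = calloc(1, sizeof(*h));

                  if (h == NULL)
                          return 1;

                  /* One reference held by the update thread, one held by the
                   * request's callback data. */
                  h->refcount = 2;

                  toy_put(h);  /* completion/interpret path drops the reference it owns */
                  toy_put(h);  /* extra put in the error branch: the count hits zero and
                                * the object is freed while the update thread still holds
                                * a pointer to it */

                  /* Any later toy_put(h) or h->refcount access by the "thread" would be
                   * the use after free (left commented out here on purpose). */
                  /* toy_put(h); */
                  return 0;
          }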
          
          pjones Peter Jones added a comment -

          As per John this has not reproduced

          yong.fan nasf (Inactive) added a comment - - edited

          Any further logs? Thanks!


          yong.fan nasf (Inactive) added a comment -

          Possibly this was introduced by the recent landing of LU-8399.

          The LU-8399 patch http://review.whamcloud.com/21330/ makes the cross-MDT RPC sender hold a reference on the object so that the failure callback does not have to search the object table. Because the number of in-flight RPCs is limited (it will not exceed the maximum RPC service thread count), the objects cached for this purpose should not be numerous enough to exhaust RAM. Another possibility is that the logic misses releasing the object reference in some overlooked corner case, so that the accumulated objects exhaust RAM. But if such a case really existed, the most likely symptom would be that the system could not be cleaned up because of even a single leaked object reference, and we should have seen many module-unload failures on Maloo. We have not hit that so far, so the LU-8399 patch should not have caused an object reference leak.

          I have run racer with 4 MDTs in my VM environment many times but cannot reproduce the problem, so it would be helpful if you have more logs.

          People

            Assignee: jhammond John Hammond
            Reporter: jhammond John Hammond
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated:
              Resolved: