Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.13.0
    • Labels: None
    • Severity: 3

    Description

      I added support for consistent memory-leak detection to my scripts, and sure enough, conf-sanity test 59 leaks memory in current master:

      [ 5916.569153] Lustre: DEBUG MARKER: == conf-sanity test 59: writeconf mount option ======================================================= 15:41:23 (1551472883)
      [ 5918.077007] Lustre: Setting parameter lustre-MDT0000.mdt.identity_upcall in log lustre-MDT0000
      [ 5918.118690] Lustre: ctl-lustre-MDT0000: No data found on store. Initialize space
      [ 5918.119237] Lustre: Skipped 2 previous similar messages
      [ 5918.132181] Lustre: lustre-MDT0000: new disk, initializing
      [ 5918.132596] Lustre: Skipped 1 previous similar message
      [ 5918.372703] Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
      [ 5919.051560] Lustre: DEBUG MARKER: oleg15-server.localnet: executing set_default_debug -1 all 16
      [ 5922.520329] Lustre: lustre-MDT0000: Connection restored to 192.168.10.168@tcp (at 0@lo)
      [ 5922.521053] Lustre: Skipped 16 previous similar messages
      [ 5922.540923] Lustre: lustre-OST0000: Imperative Recovery not enabled, recovery window 60-180
      [ 5922.541528] Lustre: Skipped 3 previous similar messages
      [ 5923.201728] Lustre: DEBUG MARKER: oleg15-server.localnet: executing set_default_debug -1 all 16
      [ 5938.511155] Lustre: 8603:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1551472898/real 1551472898]  req@ffff8800bab07b40 x1626836967632352/t0(0) o251->MGC192.168.10.168@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1551472904 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 5938.513054] Lustre: 8603:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      [ 5938.544098] Lustre: server umount lustre-OST0000 complete
      [ 5938.544517] Lustre: Skipped 3 previous similar messages
      [ 5940.849292] Lustre: MGS: Logs for fs lustre were removed by user request.  All servers must be restarted in order to regenerate the logs: rc = 0
      [ 5940.858390] Lustre: Setting parameter lustre-MDT0000.mdt.identity_upcall in log lustre-MDT0000
      [ 5941.576567] Lustre: DEBUG MARKER: oleg15-server.localnet: executing set_default_debug -1 all 16
      [ 5943.500363] LustreError: 13b-9: lustre-OST0000 claims to have registered, but this MGS does not know about it, preventing registration.
      [ 5943.508686] LustreError: 13a-8: Failed to get MGS log lustre-OST0000 and no local copy.
      [ 5943.509328] LustreError: 15c-8: MGC192.168.10.168@tcp: The configuration from log 'lustre-OST0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      [ 5943.510952] LustreError: 9571:0:(obd_mount_server.c:1386:server_start_targets()) failed to start server lustre-OST0000: -2
      [ 5943.529731] LustreError: 9571:0:(obd_mount_server.c:1939:server_fill_super()) Unable to start targets: -2
      [ 5943.530468] LustreError: 9571:0:(obd_mount_server.c:1589:server_put_super()) no obd lustre-OST0000
      [ 5943.576890] LustreError: 9571:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-2)
      [ 5944.443255] Lustre: MGS: Regenerating lustre-OST0000 log by user request: rc = 0
      [ 5945.111938] Lustre: DEBUG MARKER: oleg15-server.localnet: executing set_default_debug -1 all 16
      [ 5947.346104] Lustre: MGS: Regenerating lustre-OST0001 log by user request: rc = 0
      [ 5948.013805] Lustre: DEBUG MARKER: oleg15-server.localnet: executing set_default_debug -1 all 16
      [ 5952.788610] Lustre: lustre-OST0001-osc-MDT0000: Connection to lustre-OST0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [ 5952.790296] Lustre: lustre-OST0001: Not available for connect from 0@lo (stopping)
      [ 5957.780675] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_statfs to node 0@lo failed: rc = -107
      [ 5957.797623] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 5962.804735] Lustre: lustre-OST0000: Not available for connect from 0@lo (stopping)
      [ 5962.805177] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 5962.807882] Lustre: Skipped 1 previous similar message
      [ 5973.778356] Lustre: DEBUG MARKER: oleg15-server.localnet: executing /home/green/git/lustre-release/lustre/scripts/lustre_rmmod ldiskfs
      [ 5974.249337] Key type lgssc unregistered
      [ 5975.903115] LNet: Removed LNI 192.168.10.168@tcp
      [ 5975.975610] LustreError: 11726:0:(class_obd.c:824:obdclass_exit()) obd_memory max: 366353522, leaked: 80
      

      Full report: http://testing.linuxhacker.ru:3333/lustre-reports/122/testresults/conf-sanity-zfs-centos7_x86_64-centos7_x86_64/

          Activity

            [LU-12038] conf-sanity test 59 memory leaks

            adilger Andreas Dilger added a comment -

            Have not seen this since initial reports.
            fan.yong Fan Yong added a comment -

            If lustre_index_restore() does not handle/free the @libu, it is kept on the osd_device::od_index_backup_list until the device is unmounted, at which point osd_index_backup() cleans it up. So there should be no memleak unless some @libu gets registered during or after the umount via a race. But as I remember, the lustre_index_register() logic avoids inserting a new @libu if the device is in-umount or has already been unmounted. So if such a bad case did happen, it is a bug and should be fixed.
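
            For illustration, here is a minimal userspace C sketch of the lifecycle described above; every name in it (index_backup_unit, od_index_backup_list, the register/restore/umount helpers) is a simplified stand-in, not the actual Lustre declaration or signature:

            /* Simplified model of the @libu lifecycle: register adds to the
             * per-device list (refused once umount starts), restore consumes
             * and frees one entry, and umount drains whatever is left. */
            #include <stdio.h>
            #include <stdlib.h>

            struct index_backup_unit {              /* stand-in for @libu */
                    struct index_backup_unit *next;
                    char name[32];
            };

            struct osd_device {                     /* stand-in for osd_device */
                    struct index_backup_unit *od_index_backup_list;
                    int od_stopping;                /* set once umount begins */
            };

            /* lustre_index_register() analogue: refuse new entries once the
             * device is stopping, so nothing slips in during/after umount. */
            static int index_register(struct osd_device *osd, const char *name)
            {
                    struct index_backup_unit *libu;

                    if (osd->od_stopping)
                            return -1;              /* device in-umount */

                    libu = calloc(1, sizeof(*libu));
                    if (!libu)
                            return -1;
                    snprintf(libu->name, sizeof(libu->name), "%s", name);
                    libu->next = osd->od_index_backup_list;
                    osd->od_index_backup_list = libu;
                    return 0;
            }

            /* lustre_index_restore() analogue: consume and free one entry. */
            static void index_restore(struct osd_device *osd)
            {
                    struct index_backup_unit *libu = osd->od_index_backup_list;

                    if (libu) {
                            osd->od_index_backup_list = libu->next;
                            printf("restored %s\n", libu->name);
                            free(libu);
                    }
            }

            /* osd_index_backup()-at-umount analogue: drain whatever restore
             * never consumed, so nothing stays allocated past umount. */
            static void index_umount(struct osd_device *osd)
            {
                    struct index_backup_unit *libu;

                    osd->od_stopping = 1;
                    while ((libu = osd->od_index_backup_list) != NULL) {
                            osd->od_index_backup_list = libu->next;
                            free(libu);
                    }
            }

            int main(void)
            {
                    struct osd_device osd = { NULL, 0 };

                    index_register(&osd, "oi.16.0");
                    index_register(&osd, "oi.16.1");
                    index_restore(&osd);    /* one entry consumed normally */
                    index_umount(&osd);     /* the other drained at umount */
                    return 0;
            }

            Under this model nothing can leak: whatever restore does not consume is drained when umount runs, and registration is refused once the device is stopping.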

            green Oleg Drokin added a comment -

            OK, it looks like the easiest way to see it is with ZFS in conf-sanity test 57a.

            The kmemleak checker seems to point at this path:

            unreferenced object 0xffff8800299f7500 (size 64):
              comm "mount.lustre", pid 9866, jiffies 4295066912 (age 532.277s)
              hex dump (first 32 bytes):
                a0 18 03 4e 00 88 ff ff a0 18 03 4e 00 88 ff ff  ...N.......N....
                03 00 00 00 02 00 00 00 03 00 00 00 00 00 00 00  ................
              backtrace:
                [<ffffffff81211684>] kmem_cache_alloc_trace+0x134/0x620
                [<ffffffffa09fbf9b>] lustre_index_register+0x12b/0x5b0 [obdclass]
                [<ffffffffa0d0b157>] osd_ios_scan_one+0x1027/0x10d0 [osd_zfs]
                [<ffffffffa0d0bfdc>] osd_initial_OI_scrub+0x65c/0xce0 [osd_zfs]
                [<ffffffffa0d0ef78>] osd_scrub_setup+0x9b8/0xb50 [osd_zfs]
                [<ffffffffa0ce750a>] osd_mount+0xe7a/0x1310 [osd_zfs]
                [<ffffffffa0ce7efc>] osd_device_alloc+0x25c/0x3a0 [osd_zfs]
                [<ffffffffa09beb93>] obd_setup+0x113/0x280 [obdclass]
                [<ffffffffa09bf826>] class_setup+0x2a6/0x840 [obdclass]
                [<ffffffffa09c307a>] class_process_config+0x190a/0x2810 [obdclass]
                [<ffffffffa09c6eb8>] do_lcfg+0x258/0x4b0 [obdclass]
                [<ffffffffa09caf48>] lustre_start_simple+0x88/0x210 [obdclass]
                [<ffffffffa09f7e85>] server_fill_super+0xf55/0x1890 [obdclass]
                [<ffffffffa09ce898>] lustre_fill_super+0x3d8/0x8c0 [obdclass]
                [<ffffffff8123a47d>] mount_nodev+0x4d/0xb0
                [<ffffffffa09c6be8>] lustre_mount+0x38/0x60 [obdclass]
            

            The libu object allocated in that path is never freed. The size reported as leaked is 40 bytes, and that is exactly the size of that struct, so I think it all matches.

            The freeing is supposed to happen via a call to lustre_index_restore() that does not appear to be coming.

            It looks like the problem is in osd_index_restore(): whereas in ldiskfs we always call lustre_index_restore() unconditionally, in zfs it is conditional on the call to osd_idc_find_and_init_with_oid(). But no matter what that call does, the item is removed from the list, and it is therefore leaked if lustre_index_restore() was never called?
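
            Schematically, the suspected pattern reduces to something like the following userspace C sketch; this is an illustrative stand-in, not the actual osd_index_restore() code, and find_and_init() is a hypothetical placeholder for osd_idc_find_and_init_with_oid():

            #include <stdlib.h>

            struct libu { struct libu *next; };

            /* Placeholder for osd_idc_find_and_init_with_oid(); pretend the
             * lookup fails so the error path is exercised. */
            static int find_and_init(struct libu *libu)
            {
                    (void)libu;
                    return -1;
            }

            /* Suspected buggy shape: the entry is unlinked on every path but
             * freed only on the path where the restore helper runs. */
            static void restore_buggy(struct libu **head)
            {
                    struct libu *libu = *head;

                    if (libu == NULL)
                            return;
                    *head = libu->next;             /* removed from the list... */
                    if (find_and_init(libu) == 0)
                            free(libu);             /* ...but freed only here */
                    /* on failure 'libu' is unreachable: kmemleak reports it */
            }

            /* Fixed shape (what ldiskfs effectively does): the free no longer
             * depends on the lookup result. */
            static void restore_fixed(struct libu **head)
            {
                    struct libu *libu = *head;

                    if (libu == NULL)
                            return;
                    *head = libu->next;
                    find_and_init(libu);
                    free(libu);                     /* freed unconditionally */
            }

            int main(void)
            {
                    struct libu *head;

                    head = calloc(1, sizeof(*head));
                    restore_buggy(&head);           /* leaks when lookup fails */

                    head = calloc(1, sizeof(*head));
                    restore_fixed(&head);           /* freed on either outcome */
                    return 0;
            }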

            This code was introduced with https://review.whamcloud.com/30910 for LU-10193.


            People

              Assignee: wc-triage WC Triage
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 4
