[LU-14590] add output aggregation for "lctl get_param"

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.17.0

    Description

      When running "lctl get_param" on a system with hundreds of OSTs and MDTs, and thousands of clients, there are often many lines of output that are identical for all devices. For example:

      # lctl get_param ldlm.namespaces.*.*
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.contended_locks=32
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.contention_seconds=2
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.ctime_age_limit=10
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.dirty_age_limit=10
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.early_lock_cancel=0
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.lock_count=0
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.lock_timeouts=0
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.lock_unused_count=0
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.lru_max_age=3900000
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.lru_size=0
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.max_nolock_bytes=0
      ldlm.namespaces.testfs-OST0000-osc-MDT0000.max_parallel_ast=1024
      ldlm.namespaces.testfs-OST0000-osc-MDT0001.contended_locks=32
      ldlm.namespaces.testfs-OST0000-osc-MDT0001.contention_seconds=2
      
      :
      :
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.contended_locks=32
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.contention_seconds=2
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.ctime_age_limit=10
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.dirty_age_limit=10
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.early_lock_cancel=0
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.lock_count=0
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.lock_timeouts=0
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.lock_unused_count=0
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.lru_max_age=3900000
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.lru_size=0
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.max_nolock_bytes=0
      ldlm.namespaces.testfs-OST0102-osc-MDT0013.max_parallel_ast=1024
      

      but it would be much more convenient if the output were aggregated with wildcards, in a manner similar to how multiple parameters are specified with "lctl set_param":

      ldlm.namespaces.*.contended_locks=32
      ldlm.namespaces.*.contention_seconds=2
      ldlm.namespaces.*.ctime_age_limit=10
      ldlm.namespaces.*.dirty_age_limit=10
      ldlm.namespaces.*.early_lock_cancel=0
      ldlm.namespaces.*.lock_count=0
      ldlm.namespaces.*.lock_timeouts=0
      ldlm.namespaces.*.lock_unused_count=0
      ldlm.namespaces.testfs-OST00[00-14]-osc-MDT00[00-13].lru_max_age=3900000
      ldlm.namespaces.testfs-OST00[15-63]-osc-MDT00[00-11,13].lru_max_age=3900000
      ldlm.namespaces.testfs-OST00[15-63]-osc-MDT0012.lru_max_age=100000
      ldlm.namespaces.*.lru_size=0
      ldlm.namespaces.*.max_nolock_bytes=0
      ldlm.namespaces.*.max_parallel_ast=1024
      

      or something similar (it would be implementation dependent how disjoint regions of identifiers would be shown). This would not only allow reducing the amount of output, but also make it much more obvious in cases where there are differences in the settings (e.g. lru_max_age above).
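
The aggregation proposed above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code that was merged for this ticket: the function names `compress_indices` and `aggregate` are hypothetical, the device name is assumed to be the third dot-separated component (as in `ldlm.namespaces.<device>.<param>`), and target indices are assumed to be hexadecimal. The sketch collapses a parameter to `*` only when every device reports the same value; grouping differing values into bracketed ranges across two index fields (OST and MDT) is left out.

```python
from collections import defaultdict


def compress_indices(indices, width=2):
    """Render a non-empty list of indices as a bracket expression.

    Consecutive indices are folded into "lo-hi" runs and printed as
    zero-padded hex, e.g. [00-11,13].
    """
    ordered = sorted(set(indices))
    runs = []
    start = prev = ordered[0]
    for n in ordered[1:]:
        if n != prev + 1:          # gap found: close the current run
            runs.append((start, prev))
            start = n
        prev = n
    runs.append((start, prev))
    parts = [
        f"{lo:0{width}x}" if lo == hi else f"{lo:0{width}x}-{hi:0{width}x}"
        for lo, hi in runs
    ]
    return "[" + ",".join(parts) + "]"


def aggregate(lines, device_pos=2):
    """Collapse the device component to '*' when all devices agree.

    Lines without '=' (for example multi-line values) are skipped in
    this sketch. Parameters whose values differ are emitted unchanged,
    one line per device, so the differences stay visible.
    """
    groups = defaultdict(dict)     # (prefix, leaf) -> {device: value}
    for line in lines:
        name, sep, value = line.strip().partition("=")
        parts = name.split(".")
        if not sep or len(parts) <= device_pos:
            continue
        key = (tuple(parts[:device_pos]), tuple(parts[device_pos + 1:]))
        groups[key][parts[device_pos]] = value

    out = []
    for (prefix, leaf), per_device in groups.items():
        def render(device, value):
            return ".".join(prefix + (device,) + leaf) + "=" + value

        if len(set(per_device.values())) == 1:
            out.append(render("*", next(iter(per_device.values()))))
        else:
            out.extend(render(dev, val) for dev, val in per_device.items())
    return out
```

A real implementation would combine the two helpers, using something like `compress_indices` to describe each group of devices that share a value instead of listing them one by one.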

      Something like "lctl merge_param" would take the output of "lctl get_param" as input and merge the lines. This could work in two ways: "lctl get_param" with a new '-m' option could fork a process and pipe its own output into it, or "lctl merge_param" could be run on previously-captured "get_param" output (possibly from dsh or clush running on multiple nodes at once).
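
For the second case (aggregating previously-captured output from several nodes), a minimal sketch follows. It assumes each captured line carries a `node: ` prefix, which is how clush and pdsh/dsh-style tools usually label output; the function name `merge_node_output` is hypothetical and is not part of lctl.

```python
def merge_node_output(lines):
    """Group identical 'param=value' lines reported by several nodes.

    Each input line is expected to look like 'node: param=value'.
    Lines without the 'node: ' prefix are ignored in this sketch.
    Returns one line per distinct payload, prefixed with the
    comma-separated list of nodes that reported it.
    """
    merged = {}                    # payload -> [nodes], insertion-ordered
    for line in lines:
        node, sep, payload = line.rstrip("\n").partition(": ")
        if not sep:
            continue
        merged.setdefault(payload, []).append(node)
    return [f"{','.join(nodes)}: {payload}" for payload, nodes in merged.items()]
```

The merged lines could then be fed through the same per-device aggregation as single-node output, after stripping the node list.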

      There would have to be some implementation-specific smarts in the aggregation, for example to understand that client instance identifiers can be aggregated.
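
One example of such smarts is sketched below. It assumes that client-side device names end in a 16-digit lowercase hex instance identifier (for example `testfs-OST0000-osc-ffff9b5b84991400`); that format, the regular expression, and the name `mask_instance` are assumptions made for illustration and are not taken from the merged patch.

```python
import re

# Assumed format: "-<16 hex digits>" at the end of a name component.
INSTANCE_RE = re.compile(r"-[0-9a-f]{16}(?=\.|$)")


def mask_instance(param_name):
    """Replace a trailing client instance identifier with '*'.

    Names without such a suffix (e.g. server-side '...-osc-MDT0000')
    are returned unchanged.
    """
    return INSTANCE_RE.sub("-*", param_name)
```

Masking the instance before grouping lets parameters from different client mounts fall into the same group, so they can be compared and merged like any other device.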

          Activity


            bzzz Alex Zhuravlev added a comment -

            "Alex, it isn't clear how this patch is causing the kernel crashes? It is not touching the kernel or recovery-small.sh?"

            I checked the original patch right after bisection and noticed that it doesn't change any kernel code, but it does (slightly) change how lctl get_param forms requests, so I guess this change has increased the likelihood of the crash?
            I will test with the patch James mentioned above.

            adilger Andreas Dilger added a comment -

            Alex, it isn't clear how this patch is causing the kernel crashes? It is not touching the kernel or recovery-small.sh?

            simmonsja James A Simmons added a comment -

            Alex, patch https://review.whamcloud.com/c/fs/lustre-release/+/57948 attempts to address your issue. Please try it out.

            bzzz Alex Zhuravlev added a comment -

            # MDSCOUNT=2 ONLY=57 ONLY_REPEAT=20  bash recovery-small.sh
            ...
            [  162.131670] Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash (repeat 8/20 iter, 1/ min) ========================================================== 10:12:28 (1746526348)
            [  163.541999] systemd[1]: mnt-lustre.mount: Succeeded.
            [  163.565883] LustreError: 12317:0:(obd_class.h:479:obd_check_dev()) Device 34 not setup
            [  163.574210] LustreError: 12317:0:(obd_class.h:479:obd_check_dev()) Skipped 31 previous similar messages
            [  163.574380] LustreError: 12318:0:(obd_class.h:479:obd_check_dev()) Device 34 not setup
            [  163.575759] BUG: unable to handle kernel NULL pointer dereference at 0000000000000118
            [  163.575821] PGD 14585a067 P4D 14585a067 PUD 153768067 PMD 0
            [  163.575851] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
            [  163.575872] CPU: 0 PID: 12318 Comm: lctl Tainted: G        W  O     --------- -  - 4.18.0 #12
            [  163.575907] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
            [  163.575947] RIP: 0010:do_raw_read_lock+0x1/0x50
            [  163.576013] Code: 85 92 5b 00 85 c0 74 8f 48 c7 c6 ef 61 e2 a9 48 89 df e8 c2 fd ff ff e9 7b ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 53 <81> 7f 08 ed 1e af de 48 89 fb 75 12 b8 00 02 00 00 f0 0f c1 03 a9
            [  163.576076] RSP: 0018:ffff9b5bc457fde0 EFLAGS: 00010296
            [  163.576138] RAX: ffffffffc08b0754 RBX: 0000000000000000 RCX: 0000000000000000
            [  163.576317] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000110
            [  163.576343] RBP: 0000000000000110 R08: 0000000000000001 R09: 0000000000000000
            [  163.576369] R10: ffff9b5bc457fe40 R11: 0000000000000000 R12: 00000000ffffffff
            [  163.576395] R13: 0000000000000000 R14: ffff9b5b84991400 R15: ffff9b5b9778dc30
            [  163.576422] FS:  00007f801f4f3dc0(0000) GS:ffff9b5c0b000000(0000) knlGS:0000000000000000
            [  163.576452] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [  163.576491] CR2: 0000000000000118 CR3: 0000000145de6000 CR4: 0000000000350eb0
            [  163.576517] Call Trace:
            [  163.576527]  sptlrpc_import_sec_ref+0x14/0x40 [ptlrpc]
            [  163.576666]  srpc_sptlrpc_sepol_seq_show+0x31/0x230 [ptlrpc]
            [  163.576787]  seq_read+0x14e/0x3e0
            [  163.576813]  full_proxy_read+0x4b/0x70
            [  163.576921]  vfs_read+0xa1/0x150
            [  163.576941]  ksys_read+0x3d/0xa0
            [  163.577013]  do_syscall_64+0x4b/0x1b0

            bisection results:

            COMMIT          TESTED  PASSED  FAILED          COMMIT DESCRIPTION
            e6172ff4af      3       2       1       BAD     LU-18904 tests: skip sanity/230k for older servers
            fd2f0d3960      2       2       0               LU-18876 osp: rename conflicting ping_show
            69aa1bf646      3       3       0               LU-18687 doc: move man pages to Documentation [1]
            8837fce6d3      3       2       1       BAD     LU-18863 nodemap: warning for inconsistencies with offset
            6f03f976c1      3       3       0               LU-18814 tests: run extra prog, execute NLOOPS mpi workloads
            f251864767      3       3       0               LU-18862 ldiskfs: update for RHEL 9.6
            8dbf049479      3       3       0               LU-17427 mdt: reduce hold time for BFL rename lock
            9fe3951c44      3       2       1       BAD     LU-16350 ldiskfs: update for kernel 6.12
            df2b5d99ad      11      10      1       BAD     LU-17933 target: do not break grants on RPC failure
            a2e3a2f5a3      12      10      2       BAD     LU-14590 utils: merge similar list_param params
            33e6c43e4c      20      20      0       GOOD    LU-12885 llite: add enum ll_file_flags for clarity
            d33e96cdd4      20      20      0       GOOD    LU-18515 build: fix configure checks for ZFS 2.2.3
            2db2c9dceb      20      20      0       GOOD    LU-13814 clio: add coo_dio_pages_init
            8c0d073c17      20      20      0       GOOD    LU-13814 clio: add cl_dio_pages_init
            a9099de5d5      20      20      0       GOOD    LU-15935 tests: ignore replay-dual/33 cleanup error

            pjones Peter Jones added a comment -

            Merged for 2.17


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55724/
            Subject: LU-14590 utils: merge similar list_param params
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a2e3a2f5a3a891fc3fed391023b3cdb65af2d427

            gerrit Gerrit Updater added a comment - - edited

            "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55735
            Subject: LU-14590 utils: add dshbak to collapse list_param
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 27a9a9cd5c611b0b43ac96c8dc0357ff0a1daa06

            gerrit Gerrit Updater added a comment - - edited

            "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55726
            Subject: LU-14590 utils: add dshbak to collapse list_param
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 63f9660dd6cd8ffa96c7ed1c7c5c5c04c4756f40

            gerrit Gerrit Updater added a comment - - edited

            "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55725
            Subject: LU-14590 utils: duplicate param_dirs are collapsed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cf8677ab6f8d9f2ced556ea67c9adbaed289a1a8

            gerrit Gerrit Updater added a comment - - edited

            "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55724
            Subject: LU-14590 utils: merge similar list_param params
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 10558fca4af38720a61c3ea28eb57333ef5faa28


            People

              fdilger Fred Dilger
              adilger Andreas Dilger