Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-18170

conf-sanity test_123ad: 'llog print only showed 92/93 records'

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.17.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for emoly <emoly@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/46914488-02f0-47cd-89b0-ab0907e5fe34

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/106986 - 5.14.0-362.24.1.el9_3.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/106986 - 5.14.0-362.24.1_lustre.el9.x86_64

      This issue is very easy to reproduce by setting ONLY_REPEAT to n(n>=10).

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_123ad - llog print only showed 92/93 records

      Attachments

        Issue Links

          Activity

            [LU-18170] conf-sanity test_123ad: 'llog print only showed 92/93 records'
            adilger Andreas Dilger added a comment -

            Patch https://review.whamcloud.com/58366 ("LU-18764 tests: fix conf-sanity 123H for interop testing") is fixing the test that was added in this ticket.

            adilger Andreas Dilger added a comment - Patch https://review.whamcloud.com/58366 (" LU-18764 tests: fix conf-sanity 123H for interop testing ") is fixing the test that was added in this ticket.
            green Oleg Drokin added a comment -

            I think we need to add an interop check to test 123H

            (first hit Feb 18) we get this hit regularly now with 2.16.1 servers

            [27373.539546] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == conf-sanity test 123H: check concurent accesses with \'lctl llog_print ========================================================== 22:07:19 \(1740780439\)
            [27373.780325] Lustre: DEBUG MARKER: == conf-sanity test 123H: check concurent accesses with 'lctl llog_print ========================================================== 22:07:19 (1740780439)
            [27438.099364] Lustre: DEBUG MARKER: /usr/sbin/lctl llog_print -r params
            [27438.542994] Lustre: DEBUG MARKER: seq 1 20 |  xargs -P20 -I{} bash -c '/usr/sbin/lctl llog_print -r params | wc -l' |  sort | uniq -c
            [27438.824080] ------------[ cut here ]------------
            [27438.824088] WARNING: CPU: 0 PID: 969973 at lib/vsprintf.c:2726 vsnprintf+0x4bf/0x570
            [27438.825864] Modules linked in: xfs libcrc32c ofd(OE) ost(OE) loop osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey tls dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc intel_rapl_msr intel_rapl_common i2c_piix4 virtio_balloon pcspkr joydev drm fuse ext4 mbcache jbd2 ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel ata_piix libata virtio_net ghash_clmulni_intel virtio_blk net_failover failover serio_raw [last unloaded: libcfs]
            [27438.832027] CPU: 0 PID: 969973 Comm: llog_process_th Kdump: loaded Tainted: G           OE     -------  ---  5.14.0-427.31.1_lustre.el9.x86_64 #1
            [27438.833428] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [27438.834084] RIP: 0010:vsnprintf+0x4bf/0x570
            [27438.834603] Code: 44 24 10 8b 03 4d 8d 7c 0f 01 31 c9 83 f8 2f 0f 87 50 ff ff ff e9 08 fc ff ff 89 d0 83 c2 08 48 03 43 10 89 13 e9 98 fe ff ff <0f> 0b e9 cc fc ff ff 41 0f b6 04 24 4d 89 e6 e9 c2 fd ff ff 83 fa
            [27438.836542] RSP: 0000:ffffa1ccc2d33c70 EFLAGS: 00010292
            [27438.837140] RAX: 0000000000000000 RBX: ffff92c10181b824 RCX: ffffa1ccc2d33cc8
            [27438.837954] RDX: ffffffffc0b75150 RSI: fffffffffffffffd RDI: ffff92c10181b869
            [27438.838762] RBP: ffffa1ccc2d33d18 R08: 000000000000000a R09: ffff92c20181b861
            [27438.839566] R10: ffffffffffffffff R11: 000000000000000f R12: 0000000000000007
            [27438.840386] R13: ffff92c10181b869 R14: ffff92c1014736bf R15: 0000000000000510
            [27438.841176] FS:  0000000000000000(0000) GS:ffff92c17fc00000(0000) knlGS:0000000000000000
            [27438.842055] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [27438.842718] CR2: 0000560664894008 CR3: 0000000033b90001 CR4: 00000000000606f0
            [27438.843521] Call Trace:
            [27438.843847]  <TASK>
            [27438.844136]  ? show_trace_log_lvl+0x1c4/0x2df
            [27438.844708]  ? show_trace_log_lvl+0x1c4/0x2df
            [27438.845241]  ? snprintf+0x49/0x70
            [27438.845662]  ? vsnprintf+0x4bf/0x570
            [27438.846095]  ? __warn+0x81/0x110
            [27438.846558]  ? vsnprintf+0x4bf/0x570
            [27438.846992]  ? report_bug+0x10a/0x140
            [27438.847474]  ? handle_bug+0x3c/0x70
            [27438.847939]  ? exc_invalid_op+0x14/0x70
            [27438.848422]  ? asm_exc_invalid_op+0x16/0x20
            [27438.848945]  ? vsnprintf+0x4bf/0x570
            [27438.849413]  snprintf+0x49/0x70
            [27438.849814]  class_config_yaml_output+0x3b6/0x600 [obdclass]
            [27438.850874]  llog_print_cb+0x5e2/0x670 [obdclass]
            [27438.851516]  llog_process_thread+0xd97/0x1920 [obdclass]
            [27438.852185]  ? __pfx_llog_process_thread_daemonize+0x10/0x10 [obdclass]
            [27438.852977]  llog_process_thread_daemonize+0x99/0xe0 [obdclass]
            [27438.853715]  kthread+0xe0/0x100
            [27438.854127]  ? __pfx_kthread+0x10/0x10
            [27438.854601]  ret_from_fork+0x2c/0x50
            [27438.855082]  </TASK>
            [27438.855421] ---[ end trace 708d17dd91c9a88b ]---
            [27438.988112] LustreError: 970099:0:(llog_ioctl.c:296:llog_print_cb()) not enough space for print log records
            [27439.098846] LustreError: 970208:0:(llog.c:525:llog_process_thread()) ASSERTION( is_power_of_2(chunk_size) ) failed:
            [27439.098852] LustreError: 970208:0:(llog.c:525:llog_process_thread()) LBUG
            [27439.098855] CPU: 0 PID: 970208 Comm: llog_process_th Kdump: loaded Tainted: G        W  OE     -------  ---  5.14.0-427.31.1_lustre.el9.x86_64 #1
            [27439.098858] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [27439.098859] Call Trace:
            [27439.098883]  <TASK>
            [27439.098886]  dump_stack_lvl+0x34/0x48
            [27439.098914]  lbug_with_loc.cold+0x5/0x58 [libcfs]
            [27439.098990]  llog_process_thread+0xa9e/0x1920 [obdclass]
            [27439.099065]  ? __pfx_llog_process_thread_daemonize+0x10/0x10 [obdclass]
            [27439.099120]  ? mgs_key_init+0xa8/0x130 [mgs]
            [27439.099200]  ? __pfx_llog_process_thread_daemonize+0x10/0x10 [obdclass]
            [27439.099248]  llog_process_thread_daemonize+0x99/0xe0 [obdclass]
            [27439.099297]  kthread+0xe0/0x100
            [27439.099301]  ? __pfx_kthread+0x10/0x10
            [27439.099304]  ret_from_fork+0x2c/0x50
            [27439.099309]  </TASK>
            [27445.568983] Kernel panic - not syncing: LBUG
            [27445.568988] CPU: 0 PID: 970208 Comm: llog_process_th Kdump: loaded Tainted: G        W  OE     -------  ---  5.14.0-427.31.1_lustre.el9.x86_64 #1
            [27445.571030] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [27445.571676] Call Trace:
            [27445.571994]  <TASK>
            [27445.572295]  dump_stack_lvl+0x34/0x48
            [27445.572739]  panic+0x107/0x2f7
            [27445.573157]  lbug_with_loc.cold+0x2f/0x58 [libcfs]
            [27445.573725]  llog_process_thread+0xa9e/0x1920 [obdclass]
            [27445.574409]  ? __pfx_llog_process_thread_daemonize+0x10/0x10 [obdclass]
            [27445.575186]  ? mgs_key_init+0xa8/0x130 [mgs]
            [27445.575707]  ? __pfx_llog_process_thread_daemonize+0x10/0x10 [obdclass]
            [27445.576486]  llog_process_thread_daemonize+0x99/0xe0 [obdclass]
            [27445.577216]  kthread+0xe0/0x100
            [27445.577615]  ? __pfx_kthread+0x10/0x10
            [27445.578080]  ret_from_fork+0x2c/0x50
            [27445.578532]  </TASK>

            First hit: https://testing.whamcloud.com/test_sets/3f66b52d-b4d7-4d30-884e-ba82f8340698

            more hits:

            https://testing.whamcloud.com/test_sets/3a04df19-c183-4e94-85d8-c3f00d71aaf0

            https://testing.whamcloud.com/test_sets/db2be02b-8282-4de1-8d1f-a5e5c367695a

            https://testing.whamcloud.com/test_sets/2de30589-9fa9-4047-9763-a19616f19601

            https://testing.whamcloud.com/test_sets/c32f7d41-35cb-4df1-b64f-dd98d009a41a

            green Oleg Drokin added a comment - I think we need to add an interop check to test 123H (first hit Feb 18) we get this hit regularly now with 2.16.1 servers [27373.539546] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == conf-sanity test 123H: check concurent accesses with \'lctl llog_print ========================================================== 22:07:19 \(1740780439\) [27373.780325] Lustre: DEBUG MARKER: == conf-sanity test 123H: check concurent accesses with 'lctl llog_print ========================================================== 22:07:19 (1740780439) [27438.099364] Lustre: DEBUG MARKER: /usr/sbin/lctl llog_print -r params [27438.542994] Lustre: DEBUG MARKER: seq 1 20 |  xargs -P20 -I{} bash -c '/usr/sbin/lctl llog_print -r params | wc -l' |  sort | uniq -c [27438.824080] ------------[ cut here ]------------ [27438.824088] WARNING: CPU: 0 PID: 969973 at lib/vsprintf.c:2726 vsnprintf+0x4bf/0x570 [27438.825864] Modules linked in: xfs libcrc32c ofd(OE) ost(OE) loop osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey tls dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc intel_rapl_msr intel_rapl_common i2c_piix4 virtio_balloon pcspkr joydev drm fuse ext4 mbcache jbd2 ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel ata_piix libata virtio_net ghash_clmulni_intel virtio_blk net_failover failover serio_raw [last unloaded: libcfs] [27438.832027] CPU: 0 PID: 969973 Comm: llog_process_th Kdump: loaded Tainted: G           OE     -------  ---  5.14.0-427.31.1_lustre.el9.x86_64 #1 [27438.833428] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [27438.834084] RIP: 0010:vsnprintf+0x4bf/0x570 [27438.834603] Code: 44 24 10 8b 03 4d 8d 7c 0f 01 31 c9 83 f8 2f 0f 87 50 ff ff ff e9 08 fc ff ff 89 d0 83 c2 08 48 03 43 10 89 13 e9 98 fe ff ff <0f> 0b e9 cc fc ff ff 41 0f b6 04 24 4d 89 e6 e9 c2 fd ff ff 83 fa [27438.836542] RSP: 0000:ffffa1ccc2d33c70 EFLAGS: 00010292 [27438.837140] RAX: 0000000000000000 RBX: ffff92c10181b824 RCX: ffffa1ccc2d33cc8 [27438.837954] RDX: ffffffffc0b75150 RSI: fffffffffffffffd RDI: ffff92c10181b869 [27438.838762] RBP: ffffa1ccc2d33d18 R08: 000000000000000a R09: ffff92c20181b861 [27438.839566] R10: ffffffffffffffff R11: 000000000000000f R12: 0000000000000007 [27438.840386] R13: ffff92c10181b869 R14: ffff92c1014736bf R15: 0000000000000510 [27438.841176] FS:  0000000000000000(0000) GS:ffff92c17fc00000(0000) knlGS:0000000000000000 [27438.842055] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27438.842718] CR2: 0000560664894008 CR3: 0000000033b90001 CR4: 00000000000606f0 [27438.843521] Call Trace: [27438.843847]  <TASK> [27438.844136]  ? show_trace_log_lvl+0x1c4/0x2df [27438.844708]  ? show_trace_log_lvl+0x1c4/0x2df [27438.845241]  ? snprintf+0x49/0x70 [27438.845662]  ? vsnprintf+0x4bf/0x570 [27438.846095]  ? __warn+0x81/0x110 [27438.846558]  ? vsnprintf+0x4bf/0x570 [27438.846992]  ? report_bug+0x10a/0x140 [27438.847474]  ? handle_bug+0x3c/0x70 [27438.847939]  ? exc_invalid_op+0x14/0x70 [27438.848422]  ? asm_exc_invalid_op+0x16/0x20 [27438.848945]  ? vsnprintf+0x4bf/0x570 [27438.849413]  snprintf+0x49/0x70 [27438.849814]  class_config_yaml_output+0x3b6/0x600 [obdclass] [27438.850874]  llog_print_cb+0x5e2/0x670 [obdclass] [27438.851516]  llog_process_thread+0xd97/0x1920 [obdclass] [27438.852185]  ? __pfx_llog_process_thread_daemonize+0x10/0x10 [obdclass] [27438.852977]  llog_process_thread_daemonize+0x99/0xe0 [obdclass] [27438.853715]  kthread+0xe0/0x100 [27438.854127]  ? __pfx_kthread+0x10/0x10 [27438.854601]  ret_from_fork+0x2c/0x50 [27438.855082]  </TASK> [27438.855421] ---[ end trace 708d17dd91c9a88b ]--- [27438.988112] LustreError: 970099:0:(llog_ioctl.c:296:llog_print_cb()) not enough space for print log records [27439.098846] LustreError: 970208:0:(llog.c:525:llog_process_thread()) ASSERTION( is_power_of_2(chunk_size) ) failed: [27439.098852] LustreError: 970208:0:(llog.c:525:llog_process_thread()) LBUG [27439.098855] CPU: 0 PID: 970208 Comm: llog_process_th Kdump: loaded Tainted: G        W  OE     -------  ---  5.14.0-427.31.1_lustre.el9.x86_64 #1 [27439.098858] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [27439.098859] Call Trace: [27439.098883]  <TASK> [27439.098886]  dump_stack_lvl+0x34/0x48 [27439.098914]  lbug_with_loc.cold+0x5/0x58 [libcfs] [27439.098990]  llog_process_thread+0xa9e/0x1920 [obdclass] [27439.099065]  ? __pfx_llog_process_thread_daemonize+0x10/0x10 [obdclass] [27439.099120]  ? mgs_key_init+0xa8/0x130 [mgs] [27439.099200]  ? __pfx_llog_process_thread_daemonize+0x10/0x10 [obdclass] [27439.099248]  llog_process_thread_daemonize+0x99/0xe0 [obdclass] [27439.099297]  kthread+0xe0/0x100 [27439.099301]  ? __pfx_kthread+0x10/0x10 [27439.099304]  ret_from_fork+0x2c/0x50 [27439.099309]  </TASK> [27445.568983] Kernel panic - not syncing: LBUG [27445.568988] CPU: 0 PID: 970208 Comm: llog_process_th Kdump: loaded Tainted: G        W  OE     -------  ---  5.14.0-427.31.1_lustre.el9.x86_64 #1 [27445.571030] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [27445.571676] Call Trace: [27445.571994]  <TASK> [27445.572295]  dump_stack_lvl+0x34/0x48 [27445.572739]  panic+0x107/0x2f7 [27445.573157]  lbug_with_loc.cold+0x2f/0x58 [libcfs] [27445.573725]  llog_process_thread+0xa9e/0x1920 [obdclass] [27445.574409]  ? __pfx_llog_process_thread_daemonize+0x10/0x10 [obdclass] [27445.575186]  ? mgs_key_init+0xa8/0x130 [mgs] [27445.575707]  ? __pfx_llog_process_thread_daemonize+0x10/0x10 [obdclass] [27445.576486]  llog_process_thread_daemonize+0x99/0xe0 [obdclass] [27445.577216]  kthread+0xe0/0x100 [27445.577615]  ? __pfx_kthread+0x10/0x10 [27445.578080]  ret_from_fork+0x2c/0x50 [27445.578532]  </TASK> First hit: https://testing.whamcloud.com/test_sets/3f66b52d-b4d7-4d30-884e-ba82f8340698 more hits: https://testing.whamcloud.com/test_sets/3a04df19-c183-4e94-85d8-c3f00d71aaf0 https://testing.whamcloud.com/test_sets/db2be02b-8282-4de1-8d1f-a5e5c367695a https://testing.whamcloud.com/test_sets/2de30589-9fa9-4047-9763-a19616f19601 https://testing.whamcloud.com/test_sets/c32f7d41-35cb-4df1-b64f-dd98d009a41a

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/57354/
            Subject: LU-18170 obdclass: replace round_up(_inllenX) with ALIGN
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4a7db8f29b7f3e9962b5260b3bd51a4f3685dcbb

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/57354/ Subject: LU-18170 obdclass: replace round_up( _inllenX ) with ALIGN Project: fs/lustre-release Branch: master Current Patch Set: Commit: 4a7db8f29b7f3e9962b5260b3bd51a4f3685dcbb

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57354
            Subject: LU-18170 obdclass: replace round_up(_inllenX) with ALIGN
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3331fd8408b11d04e1bb828bcdb613c6bc451a46

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57354 Subject: LU-18170 obdclass: replace round_up( _inllenX ) with ALIGN Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 3331fd8408b11d04e1bb828bcdb613c6bc451a46
            pjones Peter Jones added a comment -

            Merged for 2.17

            pjones Peter Jones added a comment - Merged for 2.17

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56256/
            Subject: LU-18170 obdclass: fix llog_print_cb()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5ad117fc6f28389cb800953094b5acb0532f702b

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56256/ Subject: LU-18170 obdclass: fix llog_print_cb() Project: fs/lustre-release Branch: master Current Patch Set: Commit: 5ad117fc6f28389cb800953094b5acb0532f702b

            Etienne, thanks for looking into this. I've also seen this failure a few times.

            Config log records are not size fixed, so one record can take several llog indexes (see llog_osd_pad()).

            It is possible to have llogs with either fixed or variable-sized records. The llog_osd_pad() function is adding an empty record at the end of the block if it is too small to fit another record. This is done so that the next record will start at the beginning of a new disk block (no partial records crossing block boundaries). The llog file must have a whole record at the start of each block to allow backward and forward seeking.

            In this failure case, it looks like there may be a padding record written to fill the block, but this does not get printed by llog_print?

            One possible solution on this case is to change the test to add another config record and confirm that the new record is printed properly. This test was to catch a bug when the config llog was truncated incorrectly, but if the new record is printed then that is not the case.

            adilger Andreas Dilger added a comment - Etienne, thanks for looking into this. I've also seen this failure a few times. Config log records are not size fixed, so one record can take several llog indexes (see llog_osd_pad()). It is possible to have llogs with either fixed or variable-sized records. The llog_osd_pad() function is adding an empty record at the end of the block if it is too small to fit another record. This is done so that the next record will start at the beginning of a new disk block (no partial records crossing block boundaries). The llog file must have a whole record at the start of each block to allow backward and forward seeking. In this failure case, it looks like there may be a padding record written to fill the block, but this does not get printed by llog_print? One possible solution on this case is to change the test to add another config record and confirm that the new record is printed properly. This test was to catch a bug when the config llog was truncated incorrectly, but if the new record is printed then that is not the case.

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56256
            Subject: LU-18170 tests: fix test conf-sanity 123ad
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6468c523ad699904e030388914885b1a4a433e0a

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56256 Subject: LU-18170 tests: fix test conf-sanity 123ad Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 6468c523ad699904e030388914885b1a4a433e0a

            I hit this one in: https://review.whamcloud.com/56218
            The test is not valid. Config log records are not size fixed, so one record can take several llog indexes (see llog_osd_pad()).
            I will fix the test.

            eaujames Etienne Aujames added a comment - I hit this one in: https://review.whamcloud.com/56218 The test is not valid. Config log records are not size fixed, so one record can take several llog indexes (see llog_osd_pad()). I will fix the test.

            Emily, have you tried changing your patch to update some other subtest?

            adilger Andreas Dilger added a comment - Emily, have you tried changing your patch to update some other subtest?

            People

              eaujames Etienne Aujames
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: