[LU-15059] Setting several tbf rules at the same time causes crashes

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0

    Description

      I was trying to reproduce another TBF bug (LU-15056) when I triggered this one.

      reproducer

      start rule_name1 uid={500} rate=100 & \
      start rule_name2 uid={1000} rate=2000 &
      

      crash

      [ 5940.060061] BUG: unable to handle kernel paging request at 00000000deadbeef
      [ 5940.060112] IP: [<ffffffffa978f100>] strlen+0x0/0x30
      [ 5940.060157] PGD 80000000d084d067 PUD 0
      [ 5940.060188] Oops: 0000 [#1] SMP 
      [ 5940.060198] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) mbcache jbd2 libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel snd_intel8x0 snd_ac97_codec ac97_bus snd_seq aesni_intel snd_seq_device snd_pcm lrw gf128mul glue_helper ablk_helper cryptd snd_timer snd pcspkr sg soundcore parport_pc vboxguest(OE) i2c_piix4 parport video ip_tables xfs libcrc32c sr_mod sd_mod cdrom crc_t10dif crct10dif_generic ata_generic pata_acpi vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm libahci crct10dif_pclmul crct10dif_common
      [ 5940.060571]  drm ata_piix serio_raw crc32c_intel libata e1000 drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
      [ 5940.060641] CPU: 3 PID: 8460 Comm: lctl Kdump: loaded Tainted: G           OE  ------------   3.10.0-1127.8.2.el7_lustre.x86_64 #1
      [ 5940.060682] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [ 5940.060722] task: ffff8bac04d9d230 ti: ffff8bac9a8f8000 task.ti: ffff8bac9a8f8000
      [ 5940.060768] RIP: 0010:[<ffffffffa978f100>]  [<ffffffffa978f100>] strlen+0x0/0x30
      [ 5940.060786] RSP: 0018:ffff8bac9a8fbe20  EFLAGS: 00010246
      [ 5940.060798] RAX: 0000000000000001 RBX: ffff8baccc8dc000 RCX: 0000000000000000
      [ 5940.060813] RDX: 0000000000000713 RSI: 0000000000000000 RDI: 00000000deadbeef
      [ 5940.060828] RBP: ffff8bac9a8fbe38 R08: 000000000000076b R09: 0000000000000000
      [ 5940.060843] R10: 0000000000000000 R11: 000000000000000f R12: 00000000deadbeef
      [ 5940.060858] R13: 0000000000000000 R14: 0000000000000003 R15: ffff8bac741ea240
      [ 5940.060874] FS:  00007f56cf31c740(0000) GS:ffff8bacdfd80000(0000) knlGS:0000000000000000
      [ 5940.060891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5940.060904] CR2: 00000000deadbeef CR3: 0000000009fb6000 CR4: 00000000000606e0
      [ 5940.060921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 5940.060960] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 5940.060997] Call Trace:
      [ 5940.061054]  [<ffffffffc0e867f3>] ? nrs_tbf_generic_cmd_fini+0x53/0x100 [ptlrpc]
      [ 5940.061118]  [<ffffffffc0e86905>] nrs_tbf_cmd_fini.part.30+0x65/0xd0 [ptlrpc]
      [ 5940.061607]  [<ffffffffc0e8a91d>] ptlrpc_lprocfs_nrs_tbf_rule_seq_write+0xbdd/0x1030 [ptlrpc]
      [ 5940.062062]  [<ffffffffa964d1b0>] vfs_write+0xc0/0x1f0
      [ 5940.062579]  [<ffffffffa9b92e15>] ? system_call_after_swapgs+0xa2/0x13a
      [ 5940.063031]  [<ffffffffa964df7f>] SyS_write+0x7f/0xf0
      [ 5940.063540]  [<ffffffffa9b92e15>] ? system_call_after_swapgs+0xa2/0x13a
      [ 5940.063979]  [<ffffffffa9b92ed2>] system_call_fastpath+0x25/0x2a
      [ 5940.064482]  [<ffffffffa9b92e15>] ? system_call_after_swapgs+0xa2/0x13a
      [ 5940.064923] Code: 89 f8 48 89 e5 f6 82 20 ad c5 a9 20 74 15 0f 1f 44 00 00 48 83 c0 01 0f b6 10 f6 82 20 ad c5 a9 20 75 f0 5d c3 66 0f 1f 44 00 00 <80> 3f 00 55 48 89 e5 74 15 48 89 f8 0f 1f 40 00 48 83 c0 01 80
      [ 5940.066420] RIP  [<ffffffffa978f100>] strlen+0x0/0x30
      [ 5940.066856]  RSP <ffff8bac9a8fbe20>
      [ 5940.067317] CR2: 00000000deadbeef
      

      Analysis
      After some investigation, it seems that the crash is caused by a double kfree of the "cmd->u.tc_start.ts_conds_str" pointer in nrs_tbf_generic_cmd_fini():

      00000100:00000010:3.0:1633132460.865076:0:8460:0:(nrs_tbf.c:1770:nrs_tbf_expression_free()) kfreed 'expr': 48 at ffff8bac99516340.
      00000100:00000010:1.0:1633132460.865077:0:8459:0:(nrs_tbf.c:1808:nrs_tbf_generic_cmd_fini()) kfreed 'cmd->u.tc_start.ts_conds_str': 35 at ffff8bac99516f40.  <--------------------------
      00000100:00000010:3.0:1633132460.865077:0:8460:0:(nrs_tbf.c:1786:nrs_tbf_conjunction_free()) kfreed 'conjunction': 32 at ffff8bac90937840.
      00000100:00000010:1.0:1633132460.865078:0:8459:0:(nrs_tbf.c:3641:ptlrpc_lprocfs_nrs_tbf_rule_seq_write()) kfreed 'cmd': 144 at ffff8baccc8dc000.
      00000100:00000010:1.0:1633132460.865078:0:8459:0:(nrs_tbf.c:3643:ptlrpc_lprocfs_nrs_tbf_rule_seq_write()) kfreed 'kernbuf': 4096 at ffff8bacc96eb000.
      00000100:00000010:3.0:1633132460.865078:0:8460:0:(nrs_tbf.c:1808:nrs_tbf_generic_cmd_fini()) kfreed 'cmd->u.tc_start.ts_conds_str': 35 at ffff8bac99516f40.  <--------------------------
      

      This is possible because of the following code:

      static ssize_t
      ptlrpc_lprocfs_nrs_tbf_rule_seq_write(struct file *file,
                                            const char __user *buffer,
                                            size_t count, loff_t *off)
      {
              struct seq_file           *m = file->private_data;
              struct ptlrpc_service     *svc = m->private;
              char                      *kernbuf;
              char                      *val;
              int                        rc;
              static struct nrs_tbf_cmd *cmd;                                         <----------------
              enum ptlrpc_nrs_queue_type queue = PTLRPC_NRS_QUEUE_BOTH;
      ...
      
              cmd = nrs_tbf_parse_cmd(val, length, nrs_tbf_type_flag(svc, queue));    <----------------
              if (IS_ERR(cmd))
                      GOTO(out_free_kernbuff, rc = PTR_ERR(cmd));
      

      The static "cmd" pointer is overwritten by the second thread.
      I don't see why this pointer should be static: it is allocated and freed within the ptlrpc_lprocfs_nrs_tbf_rule_seq_write() function.
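
      The reasoning above can be illustrated with a minimal user-space sketch. It is not the Lustre code or the landed patch: every demo_* name is hypothetical, malloc()/free() stand in for the kernel allocator, and two pthreads stand in for two concurrent lctl writers. It only shows that when "cmd" is an ordinary automatic variable, each writer parses, uses and frees its own command, so no other writer can overwrite the pointer between allocation and free.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct nrs_tbf_cmd: only a conditions string. */
struct demo_cmd {
	char *conds_str;
};

/* Hypothetical stand-in for nrs_tbf_parse_cmd(): allocates a new command. */
static struct demo_cmd *demo_parse_cmd(const char *rule)
{
	size_t len = strlen(rule) + 1;
	struct demo_cmd *cmd = malloc(sizeof(*cmd));

	if (cmd == NULL)
		return NULL;
	cmd->conds_str = malloc(len);
	if (cmd->conds_str == NULL) {
		free(cmd);
		return NULL;
	}
	memcpy(cmd->conds_str, rule, len);
	return cmd;
}

/* Hypothetical stand-in for the cmd_fini + free path. */
static void demo_cmd_fini(struct demo_cmd *cmd)
{
	assert(cmd != NULL);
	free(cmd->conds_str);
	free(cmd);
}

/* Input and result of one simulated write to the rule file. */
struct demo_writer {
	const char *rule;
	int ok;		/* 1 if the writer still saw its own rule text */
};

static void *demo_rule_write(void *arg)
{
	struct demo_writer *w = arg;
	struct demo_cmd *cmd;	/* automatic, NOT static: one pointer per call */

	cmd = demo_parse_cmd(w->rule);
	if (cmd == NULL)
		return NULL;
	/* With a shared static pointer, another thread could have replaced
	 * cmd by now; with a local one this comparison always holds. */
	w->ok = (strcmp(cmd->conds_str, w->rule) == 0);
	demo_cmd_fini(cmd);	/* frees only this call's allocation */
	return NULL;
}

/* Runs two writers concurrently; returns 1 if both handled their own rule. */
int demo_run_two_writers(void)
{
	struct demo_writer w[2] = {
		{ "uid={500}", 0 },
		{ "uid={1000}", 0 },
	};
	pthread_t t[2];
	int started = 0;
	int i;

	for (i = 0; i < 2; i++) {
		if (pthread_create(&t[i], NULL, demo_rule_write, &w[i]) != 0)
			break;
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(t[i], NULL);

	return started == 2 && w[0].ok && w[1].ok;
}
```

      Calling demo_run_two_writers() from a test program (built with -pthread) should return 1. Declaring "cmd" as static in demo_rule_write() would reintroduce the shared pointer described above, so both threads could end up freeing the same allocation.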

          Activity


            eaujames Etienne Aujames added a comment -

            I pushed a fix in LU-17452.

            eaujames Etienne Aujames added a comment -

            Yes, I will push a fix.

            adilger Andreas Dilger added a comment -

            It looks like test_77r is also failing sanityn interop, and needs to be excluded in one way or another.

            adilger Andreas Dilger added a comment -

            Etienne, I was running an interop test between master (2.15.60) and 2.15.4 running sanityn, and test_77q failed:
            https://testing.whamcloud.com/test_sessions/41cf8a97-7a9b-4423-ab3a-9654579f02e7

            == sanityn test 77q: Parallel TBF rule definitions should not panic ==== 19:09:35 (1705691375)
            CMD: trevis-101vm6 /usr/sbin/lctl set_param mds.MDS.mdt.nrs_policies=tbf
            mds.MDS.mdt.nrs_policies=tbf
            CMD: trevis-101vm6 /usr/sbin/lctl set_param mds.MDS.mdt.nrs_tbf_rule='start rule77q_1 uid={ 500  11 3}&gid={500 10 33   100 } rate=100'
            CMD: trevis-101vm6 /usr/sbin/lctl set_param mds.MDS.mdt.nrs_tbf_rule='start rule77q_2 uid={1000}&gid={1000} rate=100'
            trevis-101vm6: error: set_param: setting /sys/kernel/debug/lustre/mds/MDS/mdt/nrs_tbf_rule=start rule77q_1 uid={ 500  11 3}&gid={500 10 33   100 } rate=100: Invalid argument
            mds.MDS.mdt.nrs_tbf_rule=start rule77q_2 uid={1000}&gid={1000} rate=100
            pdsh@trevis-101vm1: trevis-101vm6: ssh exited with exit code 22
             sanityn test_77q: @@@@@@ FAIL: 1: Fail to start TBF rule 'rule77q_1'

            Could you please look into this. Does the version check need to be updated? The patch was landed in commit v2_14_55-153-gebef4989e3 and had a version check for this.

            gerrit Gerrit Updater added a comment -

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46002
            Subject: LU-15059 nrs: do not overwrite "cmd" in nrs_tbf_rule
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: c1b8794782bd33bf7d6cfc7fabf04c0702f8f49c

            pjones Peter Jones added a comment -

            Landed for 2.15

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45142/
            Subject: LU-15059 nrs: do not overwrite "cmd" in nrs_tbf_rule
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ebef4989e39ef8cae29edcf26fa2ee16b6106ad6

            gerrit Gerrit Updater added a comment -

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45142
            Subject: LU-15059 nrs: do not overwrite "cmd" in nrs_tbf_rule
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d883ed5ab0ef385ce1fbd6391c4e4bff48ed5616

            People

              eaujames Etienne Aujames
              eaujames Etienne Aujames