[LU-15059] Setting several tbf rules at the same time causes crashes Created: 05/Oct/21 Updated: 22/Jan/24 Resolved: 30/Nov/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Etienne Aujames | Assignee: | Etienne Aujames |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
I was looking to reproduce another TBF bug (
Reproducer
start rule_name1 uid={500} rate=100 & \
start rule_name2 uid={1000} rate=2000 &
Crash
[ 5940.060061] BUG: unable to handle kernel paging request at 00000000deadbeef
[ 5940.060112] IP: [<ffffffffa978f100>] strlen+0x0/0x30
[ 5940.060157] PGD 80000000d084d067 PUD 0
[ 5940.060188] Oops: 0000 [#1] SMP
[ 5940.060198] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) mbcache jbd2 libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel snd_intel8x0 snd_ac97_codec ac97_bus snd_seq aesni_intel snd_seq_device snd_pcm lrw gf128mul glue_helper ablk_helper cryptd snd_timer snd pcspkr sg soundcore parport_pc vboxguest(OE) i2c_piix4 parport video ip_tables xfs libcrc32c sr_mod sd_mod cdrom crc_t10dif crct10dif_generic ata_generic pata_acpi vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm libahci crct10dif_pclmul crct10dif_common
[ 5940.060571] drm ata_piix serio_raw crc32c_intel libata e1000 drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
[ 5940.060641] CPU: 3 PID: 8460 Comm: lctl Kdump: loaded Tainted: G OE ------------ 3.10.0-1127.8.2.el7_lustre.x86_64 #1
[ 5940.060682] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 5940.060722] task: ffff8bac04d9d230 ti: ffff8bac9a8f8000 task.ti: ffff8bac9a8f8000
[ 5940.060768] RIP: 0010:[<ffffffffa978f100>] [<ffffffffa978f100>] strlen+0x0/0x30
[ 5940.060786] RSP: 0018:ffff8bac9a8fbe20 EFLAGS: 00010246
[ 5940.060798] RAX: 0000000000000001 RBX: ffff8baccc8dc000 RCX: 0000000000000000
[ 5940.060813] RDX: 0000000000000713 RSI: 0000000000000000 RDI: 00000000deadbeef
[ 5940.060828] RBP: ffff8bac9a8fbe38 R08: 000000000000076b R09: 0000000000000000
[ 5940.060843] R10: 0000000000000000 R11: 000000000000000f R12: 00000000deadbeef
[ 5940.060858] R13: 0000000000000000 R14: 0000000000000003 R15: ffff8bac741ea240
[ 5940.060874] FS: 00007f56cf31c740(0000) GS:ffff8bacdfd80000(0000) knlGS:0000000000000000
[ 5940.060891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5940.060904] CR2: 00000000deadbeef CR3: 0000000009fb6000 CR4: 00000000000606e0
[ 5940.060921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5940.060960] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5940.060997] Call Trace:
[ 5940.061054] [<ffffffffc0e867f3>] ? nrs_tbf_generic_cmd_fini+0x53/0x100 [ptlrpc]
[ 5940.061118] [<ffffffffc0e86905>] nrs_tbf_cmd_fini.part.30+0x65/0xd0 [ptlrpc]
[ 5940.061607] [<ffffffffc0e8a91d>] ptlrpc_lprocfs_nrs_tbf_rule_seq_write+0xbdd/0x1030 [ptlrpc]
[ 5940.062062] [<ffffffffa964d1b0>] vfs_write+0xc0/0x1f0
[ 5940.062579] [<ffffffffa9b92e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 5940.063031] [<ffffffffa964df7f>] SyS_write+0x7f/0xf0
[ 5940.063540] [<ffffffffa9b92e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 5940.063979] [<ffffffffa9b92ed2>] system_call_fastpath+0x25/0x2a
[ 5940.064482] [<ffffffffa9b92e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 5940.064923] Code: 89 f8 48 89 e5 f6 82 20 ad c5 a9 20 74 15 0f 1f 44 00 00 48 83 c0 01 0f b6 10 f6 82 20 ad c5 a9 20 75 f0 5d c3 66 0f 1f 44 00 00 <80> 3f 00 55 48 89 e5 74 15 48 89 f8 0f 1f 40 00 48 83 c0 01 80
[ 5940.066420] RIP [<ffffffffa978f100>] strlen+0x0/0x30
[ 5940.066856] RSP <ffff8bac9a8fbe20>
[ 5940.067317] CR2: 00000000deadbeef

Analysis
00000100:00000010:3.0:1633132460.865076:0:8460:0:(nrs_tbf.c:1770:nrs_tbf_expression_free()) kfreed 'expr': 48 at ffff8bac99516340.
00000100:00000010:1.0:1633132460.865077:0:8459:0:(nrs_tbf.c:1808:nrs_tbf_generic_cmd_fini()) kfreed 'cmd->u.tc_start.ts_conds_str': 35 at ffff8bac99516f40. <--------------------------
00000100:00000010:3.0:1633132460.865077:0:8460:0:(nrs_tbf.c:1786:nrs_tbf_conjunction_free()) kfreed 'conjunction': 32 at ffff8bac90937840.
00000100:00000010:1.0:1633132460.865078:0:8459:0:(nrs_tbf.c:3641:ptlrpc_lprocfs_nrs_tbf_rule_seq_write()) kfreed 'cmd': 144 at ffff8baccc8dc000.
00000100:00000010:1.0:1633132460.865078:0:8459:0:(nrs_tbf.c:3643:ptlrpc_lprocfs_nrs_tbf_rule_seq_write()) kfreed 'kernbuf': 4096 at ffff8bacc96eb000.
00000100:00000010:3.0:1633132460.865078:0:8460:0:(nrs_tbf.c:1808:nrs_tbf_generic_cmd_fini()) kfreed 'cmd->u.tc_start.ts_conds_str': 35 at ffff8bac99516f40. <--------------------------

This is possible because of the following code:
static ssize_t
ptlrpc_lprocfs_nrs_tbf_rule_seq_write(struct file *file,
				      const char __user *buffer,
				      size_t count, loff_t *off)
{
	struct seq_file *m = file->private_data;
	struct ptlrpc_service *svc = m->private;
	char *kernbuf;
	char *val;
	int rc;
	static struct nrs_tbf_cmd *cmd; <----------------
	enum ptlrpc_nrs_queue_type queue = PTLRPC_NRS_QUEUE_BOTH;
	...
	cmd = nrs_tbf_parse_cmd(val, length, nrs_tbf_type_flag(svc, queue)); <----------------
	if (IS_ERR(cmd))
		GOTO(out_free_kernbuff, rc = PTR_ERR(cmd));

The "cmd" static pointer is overwritten by the second thread, so both threads (PIDs 8459 and 8460 in the debug log above) finalize the same command and free the same 'ts_conds_str' at ffff8bac99516f40. |
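The effect of the shared "cmd" pointer can be illustrated with a small user-space sketch. It is not Lustre code and is not taken from the patch: `demo_cmd`, `demo_parse_cmd`, `demo_cmd_fini` and `demo_replay` are hypothetical names, "freeing" is only counted rather than performed, and the interleaving (both writers parse before either one cleans up) is an assumed ordering that is consistent with the analysis above. With the pointer shared, one command is finalized twice and the other is never finalized; with a per-call pointer, each command is finalized exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nrs_tbf_cmd (not the Lustre type). */
struct demo_cmd {
	const char *rule;	/* rule text given by the writer */
	int free_count;		/* how many times the command was finalized */
};

static struct demo_cmd demo_pool[2];
static int demo_pool_used;

/* Hypothetical parser: hands out a fresh command from a fixed pool. */
static struct demo_cmd *demo_parse_cmd(const char *rule)
{
	struct demo_cmd *cmd = &demo_pool[demo_pool_used++];

	cmd->rule = rule;
	cmd->free_count = 0;
	return cmd;
}

/* Hypothetical cleanup: counts the "free" instead of releasing memory,
 * so that a double free can be observed safely. */
static void demo_cmd_fini(struct demo_cmd *cmd)
{
	cmd->free_count++;
}

/*
 * Replay two concurrent writers, A and B, in an assumed order:
 * both parse first, then both clean up.
 * shared != 0 models a 'static' cmd pointer (one slot for all callers);
 * shared == 0 models an automatic (per-call) cmd pointer.
 * Returns the number of commands that were not finalized exactly once.
 */
static int demo_replay(int shared)
{
	struct demo_cmd *slot_a = NULL;
	struct demo_cmd *slot_b = NULL;
	struct demo_cmd **a = &slot_a;
	struct demo_cmd **b = shared ? &slot_a : &slot_b;
	int bad = 0;
	int i;

	demo_pool_used = 0;
	*a = demo_parse_cmd("start rule_name1 uid={500} rate=100");
	/* When shared, this overwrites writer A's pointer. */
	*b = demo_parse_cmd("start rule_name2 uid={1000} rate=2000");
	demo_cmd_fini(*a);	/* writer A cleans up what it sees in its slot */
	demo_cmd_fini(*b);	/* writer B cleans up what it sees in its slot */

	for (i = 0; i < demo_pool_used; i++)
		if (demo_pool[i].free_count != 1)
			bad++;
	return bad;
}
```

In this sketch `demo_replay(1)` returns 2 (one command finalized twice, one leaked) and `demo_replay(0)` returns 0. Whether the landed patch takes exactly this approach is not stated in this ticket; see the review linked in the comments.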
| Comments |
| Comment by Gerrit Updater [ 06/Oct/21 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45142 |
| Comment by Gerrit Updater [ 30/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45142/ |
| Comment by Peter Jones [ 30/Nov/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 07/Jan/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46002 |
| Comment by Andreas Dilger [ 20/Jan/24 ] |
|
Etienne, I was running an interop test between master (2.15.60) and 2.15.4 running sanityn, and test_77q failed:
== sanityn test 77q: Parallel TBF rule definitions should not panic ==== 19:09:35 (1705691375)
CMD: trevis-101vm6 /usr/sbin/lctl set_param mds.MDS.mdt.nrs_policies=tbf
mds.MDS.mdt.nrs_policies=tbf
CMD: trevis-101vm6 /usr/sbin/lctl set_param mds.MDS.mdt.nrs_tbf_rule='start rule77q_1 uid={ 500 11 3}&gid={500 10 33 100 } rate=100'
CMD: trevis-101vm6 /usr/sbin/lctl set_param mds.MDS.mdt.nrs_tbf_rule='start rule77q_2 uid={1000}&gid={1000} rate=100'
trevis-101vm6: error: set_param: setting /sys/kernel/debug/lustre/mds/MDS/mdt/nrs_tbf_rule=start rule77q_1 uid={ 500 11 3}&gid={500 10 33 100 } rate=100: Invalid argument
mds.MDS.mdt.nrs_tbf_rule=start rule77q_2 uid={1000}&gid={1000} rate=100
pdsh@trevis-101vm1: trevis-101vm6: ssh exited with exit code 22
sanityn test_77q: @@@@@@ FAIL: 1: Fail to start TBF rule 'rule77q_1'
Could you please look into this? Does the version check need to be updated? The patch was landed in commit v2_14_55-153-gebef4989e3 and had a version check for this. |
| Comment by Andreas Dilger [ 20/Jan/24 ] |
|
It looks like test_77r is also failing sanityn interop, and needs to be excluded in one way or another. |
| Comment by Etienne Aujames [ 22/Jan/24 ] |
|
Yes, I will push a fix. |
| Comment by Etienne Aujames [ 22/Jan/24 ] |
|
I pushed a fix in LU-17452. |