[LU-8951] lctl conf_param not retaining *_cache_enable settings Created: 16/Dec/16 Updated: 28/Apr/20 Resolved: 28/Apr/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Nathan Dauchy (Inactive) | Assignee: | Emoly Liu |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | None | ||
| Environment: | CentOS-6.8 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The read_cache_enable=1 and writethrough_cache_enable=1 settings don't appear to be retained through an unmount/remount of the OSTs. However, the readcache_max_filesize=4194304 setting DOES get retained. Is there a step I'm missing? Are there additional debugging procedures I should follow to track down the source of the problem?

Example commands showing that the settings do get propagated after "lctl conf_param", but then go away:

nbp7-mds2 ~ # for i in $(lctl list_param osc.fscache*OST* | sed 's/osc\.//; s/-osc-.*//'); do lctl conf_param $i.ost.read_cache_enable=1; lctl conf_param $i.ost.writethrough_cache_enable=1; lctl conf_param $i.ost.readcache_max_filesize=4M; done

(wait ~10 seconds to propagate)

Nothing unusual in the OSS logs. In the logs on the MGS:

Dec 16 14:28:26 nbp7-mds2 kernel: Lustre: Modifying parameter fscache-OST0005.ost.read_cache_enable in log fscache-OST0005
Dec 16 14:28:26 nbp7-mds2 kernel: Lustre: Skipped 17 previous similar messages

nbp1-oss6 ~ # lctl get_param obdfilter.fscache*.{*_cache_enable,readcache_max_filesize}
obdfilter.fscache-OST0005.read_cache_enable=1
obdfilter.fscache-OST0005.writethrough_cache_enable=1
obdfilter.fscache-OST000b.read_cache_enable=1
obdfilter.fscache-OST000b.writethrough_cache_enable=1
obdfilter.fscache-OST0011.read_cache_enable=1
obdfilter.fscache-OST0011.writethrough_cache_enable=1
obdfilter.fscache-OST0017.read_cache_enable=1
obdfilter.fscache-OST0017.writethrough_cache_enable=1
obdfilter.fscache-OST001d.read_cache_enable=1
obdfilter.fscache-OST001d.writethrough_cache_enable=1
obdfilter.fscache-OST0023.read_cache_enable=1
obdfilter.fscache-OST0023.writethrough_cache_enable=1
obdfilter.fscache-OST0005.readcache_max_filesize=4194304
obdfilter.fscache-OST000b.readcache_max_filesize=4194304
obdfilter.fscache-OST0011.readcache_max_filesize=4194304
obdfilter.fscache-OST0017.readcache_max_filesize=4194304
obdfilter.fscache-OST001d.readcache_max_filesize=4194304
obdfilter.fscache-OST0023.readcache_max_filesize=4194304
# lmount -u -v -f fscache.csv --host service636
umounting OSTs...
ssh -n service636 umount /mnt/lustre/OST29
ssh -n service636 umount /mnt/lustre/OST5
ssh -n service636 umount /mnt/lustre/OST35
ssh -n service636 umount /mnt/lustre/OST17
ssh -n service636 umount /mnt/lustre/OST11
ssh -n service636 umount /mnt/lustre/OST23

nbp1-oss6 ~ # lctl get_param obdfilter.fscache*.{*_cache_enable,readcache_max_filesize}
error: get_param: obdfilter/fscache*/*_cache_enable: Found no match
error: get_param: obdfilter/fscache*/readcache_max_filesize: Found no match
# lmount -m -v -f fscache.csv --host service636
mounting OSTs...
ssh -n service636 'mkdir -p /mnt/lustre/OST29 ; mount -t lustre $(journal-dev-of.sh /dev/intelcas1-29) -o errors=panic,extents,mballoc /dev/intelcas1-29 /mnt/lustre/OST29'
ssh -n service636 'mkdir -p /mnt/lustre/OST5 ; mount -t lustre $(journal-dev-of.sh /dev/intelcas1-5) -o errors=panic,extents,mballoc /dev/intelcas1-5 /mnt/lustre/OST5'
ssh -n service636 'mkdir -p /mnt/lustre/OST35 ; mount -t lustre $(journal-dev-of.sh /dev/intelcas2-35) -o errors=panic,extents,mballoc /dev/intelcas2-35 /mnt/lustre/OST35'
ssh -n service636 'mkdir -p /mnt/lustre/OST17 ; mount -t lustre $(journal-dev-of.sh /dev/intelcas1-17) -o errors=panic,extents,mballoc /dev/intelcas1-17 /mnt/lustre/OST17'
ssh -n service636 'mkdir -p /mnt/lustre/OST11 ; mount -t lustre $(journal-dev-of.sh /dev/intelcas2-11) -o errors=panic,extents,mballoc /dev/intelcas2-11 /mnt/lustre/OST11'
ssh -n service636 'mkdir -p /mnt/lustre/OST23 ; mount -t lustre $(journal-dev-of.sh /dev/intelcas2-23) -o errors=panic,extents,mballoc /dev/intelcas2-23 /mnt/lustre/OST23'

# lctl get_param obdfilter.fscache*.{*_cache_enable,readcache_max_filesize}
obdfilter.fscache-OST0005.read_cache_enable=0
obdfilter.fscache-OST0005.writethrough_cache_enable=0
obdfilter.fscache-OST000b.read_cache_enable=0
obdfilter.fscache-OST000b.writethrough_cache_enable=0
obdfilter.fscache-OST0011.read_cache_enable=0
obdfilter.fscache-OST0011.writethrough_cache_enable=0
obdfilter.fscache-OST0017.read_cache_enable=0
obdfilter.fscache-OST0017.writethrough_cache_enable=0
obdfilter.fscache-OST001d.read_cache_enable=0
obdfilter.fscache-OST001d.writethrough_cache_enable=0
obdfilter.fscache-OST0023.read_cache_enable=0
obdfilter.fscache-OST0023.writethrough_cache_enable=0
obdfilter.fscache-OST0005.readcache_max_filesize=4194304
obdfilter.fscache-OST000b.readcache_max_filesize=4194304
obdfilter.fscache-OST0011.readcache_max_filesize=4194304
obdfilter.fscache-OST0017.readcache_max_filesize=4194304
obdfilter.fscache-OST001d.readcache_max_filesize=4194304
obdfilter.fscache-OST0023.readcache_max_filesize=4194304
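One way to narrow this down is to dump the per-target configuration log directly on the MGS after the remount and check whether the conf_param records for the cache settings are still there; the log name is the one shown in the "Modifying parameter ... in log fscache-OST0005" console messages above. A sketch (run on the MGS; output format varies by version):

nbp7-mds2 ~ # lctl --device MGS llog_print fscache-OST0005 | grep -E 'cache_enable|readcache_max_filesize'
# If the read_cache_enable/writethrough_cache_enable records are missing while the
# readcache_max_filesize record is still present, the records were lost on the MGS side;
# if all three are present, the OST is not re-applying them at mount time.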
I should note that the MDS does have a single MGS for two file systems, in case that is relevant to reproducing the problem...

nbp7-mds2 ~ # lctl dl
0 UP osd-ldiskfs MGS-osd MGS-osd_UUID 5
1 UP mgs MGS MGS 9
2 UP mgc MGC10.151.27.39@o2ib 2eb7f880-6a2e-ea5e-3631-922183627327 5
3 UP osd-ldiskfs nocache-MDT0000-osd nocache-MDT0000-osd_UUID 13
4 UP mds MDS MDS_uuid 3
5 UP lod nocache-MDT0000-mdtlov nocache-MDT0000-mdtlov_UUID 4
6 UP mdt nocache-MDT0000 nocache-MDT0000_UUID 19
7 UP mdd nocache-MDD0000 nocache-MDD0000_UUID 4
8 UP qmt nocache-QMT0000 nocache-QMT0000_UUID 4
9 UP osp nocache-OST0029-osc-MDT0000 nocache-MDT0000-mdtlov_UUID 5
10 UP osp nocache-OST002f-osc-MDT0000 nocache-MDT0000-mdtlov_UUID 5
11 UP osp nocache-OST003b-osc-MDT0000 nocache-MDT0000-mdtlov_UUID 5
12 UP osp nocache-OST0035-osc-MDT0000 nocache-MDT0000-mdtlov_UUID 5
13 UP osp nocache-OST0041-osc-MDT0000 nocache-MDT0000-mdtlov_UUID 5
14 UP osp nocache-OST0047-osc-MDT0000 nocache-MDT0000-mdtlov_UUID 5
15 UP lwp nocache-MDT0000-lwp-MDT0000 nocache-MDT0000-lwp-MDT0000_UUID 5
16 UP osd-ldiskfs fscache-MDT0000-osd fscache-MDT0000-osd_UUID 13
17 UP lod fscache-MDT0000-mdtlov fscache-MDT0000-mdtlov_UUID 4
18 UP mdt fscache-MDT0000 fscache-MDT0000_UUID 17
19 UP mdd fscache-MDD0000 fscache-MDD0000_UUID 4
20 UP qmt fscache-QMT0000 fscache-QMT0000_UUID 4
21 UP osp fscache-OST0023-osc-MDT0000 fscache-MDT0000-mdtlov_UUID 5
22 UP osp fscache-OST0011-osc-MDT0000 fscache-MDT0000-mdtlov_UUID 5
23 UP osp fscache-OST000b-osc-MDT0000 fscache-MDT0000-mdtlov_UUID 5
24 UP osp fscache-OST001d-osc-MDT0000 fscache-MDT0000-mdtlov_UUID 5
25 UP osp fscache-OST0017-osc-MDT0000 fscache-MDT0000-mdtlov_UUID 5
26 UP osp fscache-OST0005-osc-MDT0000 fscache-MDT0000-mdtlov_UUID 5
27 UP lwp fscache-MDT0000-lwp-MDT0000 fscache-MDT0000-lwp-MDT0000_UUID 5 |
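Since a single MGS serves both filesystems here, it may also be worth confirming which configuration logs the MGS is holding for each filesystem. A possible check, again as a sketch run on the MGS:

nbp7-mds2 ~ # lctl --device MGS llog_catlist
# Expect a client log plus one log per MDT/OST for each of the "nocache" and "fscache"
# filesystems; a missing or stale fscache-OST* log would point at the MGS side of the problem.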
| Comments |
| Comment by Nathan Dauchy (Inactive) [ 16/Dec/16 ] |
|
To try to rule out the multi-FS MGS as the source of the problem, I completely stopped the "nocache" file system, then stopped and restarted all targets (MGS, MDT, OSTs) of the "fscache" file system, and symptoms did not change. |
| Comment by Nathan Dauchy (Inactive) [ 17/Dec/16 ] |
|
Update... Mahmoud clued me in to the "set_param -P" option, and this seems to work!

nbp7-mds2 ~ # lctl set_param -P obdfilter.fscache*.read_cache_enable=1
nbp7-mds2 ~ # lctl set_param -P obdfilter.fscache*.writethrough_cache_enable=1
nbp7-mds2 ~ # lctl set_param -P obdfilter.fscache*.readcache_max_filesize=4M

So, the main problem is solved. The only questions remaining on this issue, then, are...
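For completeness, a quick way to confirm that the -P settings really do survive a restart is to unmount/remount one OST with the same lmount/mount commands shown in the description and then re-check on the OSS, for example:

nbp1-oss6 ~ # lctl get_param obdfilter.fscache-OST0005.{read_cache_enable,writethrough_cache_enable,readcache_max_filesize}
# Expected after the remount: 1, 1 and 4194304 if the persistent parameters are
# being re-applied when the target mounts.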
|
| Comment by Peter Jones [ 19/Dec/16 ] |
|
Emoly, could you please advise on this one? Thanks. Peter |
| Comment by Emoly Liu [ 21/Dec/16 ] |
|
Nathan,
I ran this test several times, on both a single node and multiple nodes, with one MGS for two filesystems, but still failed to reproduce it. conf_param works well for me. I'm using b2_7_fe (new tag 2.7.2-RC1) + el6.7 (2.6.32-573.26.1.el6_lustre.x86_64). Could you please provide more details about how to reproduce it, and any logs?
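For reference, the sequence being exercised boils down to the following (names and device paths are taken from the description; the external-journal mount options are omitted, so treat this as a sketch rather than the exact commands):

# on the MGS/MDS node
lctl conf_param fscache-OST0005.ost.read_cache_enable=1
lctl conf_param fscache-OST0005.ost.writethrough_cache_enable=1
lctl conf_param fscache-OST0005.ost.readcache_max_filesize=4M

# on the OSS: confirm the values were applied, then unmount and remount the OST
lctl get_param obdfilter.fscache-OST0005.{*_cache_enable,readcache_max_filesize}
umount /mnt/lustre/OST5
mount -t lustre /dev/intelcas1-5 /mnt/lustre/OST5

# on the OSS: the *_cache_enable values come back as 0, while readcache_max_filesize stays at 4194304
lctl get_param obdfilter.fscache-OST0005.{*_cache_enable,readcache_max_filesize}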
The "lctl set_param -P" functionality was landed via |
| Comment by Nathan Dauchy (Inactive) [ 21/Dec/16 ] |
|
Emoly, thanks for the clarification on the history and goals for "set_param -P". To help get through the "transition period", I highly recommend that 1) the manual be updated to clarify that set_param -P is preferred and that conf_param will be removed, and 2) conf_param report a warning or error for any tunable for which set_param -P should be used instead (the two commands also take differently formed parameter names; see the comparison after the log excerpt below).

I can easily duplicate the problem using the procedure posted in the original description. I did this on the other file system for this MDS/OSS pair (to rule out any effect from the CAS cache testing) and it had the same symptoms. Not much useful in the logs, but here you go...

Dec 21 10:11:09 nbp7-mds2 kernel: Lustre: Modifying parameter nocache-OST0029.ost.read_cache_enable in log nocache-OST0029
Dec 21 10:11:20 nbp7-mds2 kernel: Lustre: 7277:0:(client.c:1941:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1482343725/real 1482343725] req@ffff8806f06ba680 x1553901315768044/t0(0) o8->nocache-OST0047-osc-MDT0000@10.151.26.123@o2ib:28/4 lens 400/544 e 0 to 1 dl 1482343880 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 21 10:12:22 nbp7-mds2 kernel: Lustre: Modifying parameter nocache-OST0029.ost.read_cache_enable in log nocache-OST0029
Dec 21 10:12:22 nbp7-mds2 kernel: Lustre: Skipped 17 previous similar messages
Dec 21 10:13:37 nbp7-mds2 kernel: Lustre: nocache-OST0029-osc-MDT0000: Connection to nocache-OST0029 (at 10.151.26.123@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Dec 21 10:13:37 nbp7-mds2 kernel: Lustre: Skipped 5 previous similar messages
Dec 21 10:13:40 nbp7-mds2 kernel: Lustre: nocache-OST0047-osc-MDT0000: Connection restored to nocache-OST0047 (at 10.151.26.123@o2ib)
Dec 21 10:13:40 nbp7-mds2 kernel: Lustre: Skipped 5 previous similar messages
Dec 21 10:13:07 nbp1-oss6 kernel: Lustre: Failing over nocache-OST0029
Dec 21 10:13:07 nbp1-oss6 kernel: Lustre: Skipped 2 previous similar messages
Dec 21 10:13:08 nbp1-oss6 kernel: Lustre: server umount nocache-OST0035 complete
Dec 21 10:13:10 nbp1-oss6 kernel: LNet: 12436:0:(lib-move.c:1485:lnet_parse_put()) Dropping PUT from 12345-10.151.27.39@o2ib portal 7 match 1553901315769072 offset 0 length 224: 4
Dec 21 10:13:10 nbp1-oss6 kernel: LNet: 12436:0:(lib-move.c:1485:lnet_parse_put()) Skipped 5 previous similar messages
Dec 21 10:13:11 nbp1-oss6 kernel: perl[45541]: segfault at 0 ip 00007fffebca713e sp 00007fffffffddd0 error 4 in libpcp_pmda.so.3[7fffebca3000+11000]
Dec 21 10:13:24 nbp1-oss6 kernel: LDISKFS-fs (dm-27): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 21 10:13:24 nbp1-oss6 kernel: LDISKFS-fs (dm-29): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 21 10:13:24 nbp1-oss6 kernel: LDISKFS-fs (dm-28): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 21 10:13:24 nbp1-oss6 kernel: LDISKFS-fs (dm-33):
Dec 21 10:13:24 nbp1-oss6 kernel: LDISKFS-fs (dm-26): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 21 10:13:24 nbp1-oss6 kernel: mounted filesystem with ordered data mode. quota=on. Opts:
Dec 21 10:13:24 nbp1-oss6 kernel: LDISKFS-fs (dm-47): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 21 10:13:30 nbp1-oss6 kernel: LustreError: 137-5: nocache-OST002f_UUID: not available for connect from 10.151.27.19@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 21 10:13:30 nbp1-oss6 kernel: LustreError: Skipped 3 previous similar messages
Dec 21 10:13:37 nbp1-oss6 kernel: Lustre: nocache-OST0029: Will be in recovery for at least 5:00, or until 2 clients reconnect
Dec 21 10:13:37 nbp1-oss6 kernel: Lustre: Skipped 5 previous similar messages
Dec 21 10:13:40 nbp1-oss6 kernel: Lustre: nocache-OST0047: Recovery over after 0:03, of 2 clients 2 recovered and 0 were evicted.
Dec 21 10:13:40 nbp1-oss6 kernel: Lustre: nocache-OST0047: deleting orphan objects from 0x0:2678325 to 0x0:2678417
Dec 21 10:13:40 nbp1-oss6 kernel: Lustre: nocache-OST0029: deleting orphan objects from 0x0:2797909 to 0x0:2798001
Dec 21 10:13:40 nbp1-oss6 kernel: Lustre: nocache-OST002f: deleting orphan objects from 0x0:2711637 to 0x0:2711729
Dec 21 10:13:40 nbp1-oss6 kernel: Lustre: Skipped 6 previous similar messages
Dec 21 10:13:40 nbp1-oss6 kernel: Lustre: nocache-OST003b: deleting orphan objects from 0x0:2714774 to 0x0:2714865
Dec 21 10:13:40 nbp1-oss6 kernel: Lustre: nocache-OST0041: deleting orphan objects from 0x0:2714453 to 0x0:2714545
Dec 21 10:13:40 nbp1-oss6 kernel: Lustre: nocache-OST0035: deleting orphan objects from 0x0:2749318 to 0x0:2749409
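One likely source of confusion during the transition is that the two commands take differently formed names for the same tunable. Based on the commands already used in this ticket, the mapping is roughly:

# old style: permanent setting recorded in the per-target configuration log
lctl conf_param fscache-OST0005.ost.read_cache_enable=1

# new style: persistent setting recorded by set_param -P, using the runtime parameter path
lctl set_param -P obdfilter.fscache-OST0005.read_cache_enable=1

# the runtime path is also what get_param reports on the OSS
lctl get_param obdfilter.fscache-OST0005.read_cache_enable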
|
| Comment by Nathan Dauchy (Inactive) [ 21/Dec/16 ] |
|
As a baseline, I tried to duplicate the problem on a completely different system (running lustre-2.5.42.8.ddn4) and was not able to get the same (bad) symptom. So, I will continue to try to figure out what is different about the test system where this is happening. One possibility is that we are using external journal devices. Please let me know if you have suggestions of where else to look for differences. |
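If it helps with the difference hunt, one more thing to compare between the two systems is the configuration stored on the targets themselves, for example (read-only dump; exact output depends on the Lustre version):

nbp1-oss6 ~ # tunefs.lustre --dryrun /dev/intelcas1-5
# Shows the target name, index, flags and any parameters written at format/tunefs time,
# which can be diffed against the equivalent OST on the system where conf_param persists.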
| Comment by Andreas Dilger [ 28/Apr/20 ] |
|
Close old issue that cannot be reproduced. |