Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • None
    • 3
    • 13953

    Description

      In tracking down the root cause of LU-5043, I went through the CPU partitioning code in Lustre and found some of it rather difficult to follow. A number of not-necessarily-safe assumptions seem to be made, with no code comments explaining what was assumed and why. I think we have some technical debt to deal with here.

          Activity

            [LU-5050] cpu partitioning oddities
            pjones Peter Jones added a comment -

            Landed for 2.9

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22377/
            Subject: LU-5050 libcfs: default CPT matches NUMA topology
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4fb513d0a635ce749ddb2173e9841814622ba4a2

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/22507
            Subject: LU-5050 libcfs: default CPT matches NUMA topology
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cf67b83eb74bc84e90d0a02ea306405a0fc76b43

            gerrit Gerrit Updater added a comment -

            Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/22377
            Subject: LU-5050 libcfs: default CPT matches NUMA topology
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1916e280b1f0ac579d17a6cb9134923f1481176c

            adilger Andreas Dilger added a comment -

            Dmitry, could you please work on an updated version of the http://review.whamcloud.com/17824 patch? We'd like to get it included in 2.9.0.
            simmonsja James A Simmons added a comment - - edited

            That is one old bug. I have a question. How did it ever pass testing on RHEL7 in the first place?

            olaf Olaf Weber (Inactive) added a comment -

            A quick guess as to the cause: cfs_trimwhite() modifies its input string, while the string literal "N" is stored in read-only memory. The easiest solution is likely for cfs_cpt_table_create_pattern() to make a copy of the pattern string.
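
            The copy-before-trim idea suggested in this comment can be sketched in userspace C. This is an illustration only, not the patch that landed: trimwhite() is a stand-in modeled on the cfs_trimwhite() listing quoted further down this page (its leading-whitespace loop is an assumption), and trim_copy() is a hypothetical helper name. Kernel code would presumably use a kernel allocator rather than malloc().

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace stand-in for cfs_trimwhite(): skip leading whitespace, then
 * terminate the string after the last non-whitespace character.
 * Note that the final store runs even when there is nothing to trim,
 * so the buffer must always be writable.
 */
static char *trimwhite(char *str)
{
	char *end;

	while (isspace((unsigned char)*str))
		str++;

	end = str + strlen(str);
	while (end > str) {
		if (!isspace((unsigned char)end[-1]))
			break;
		end--;
	}

	*end = 0;	/* unconditional write into the caller's buffer */
	return str;
}

/*
 * Hypothetical helper: trim a private, writable copy of the pattern so
 * that a read-only string literal such as "N" is never modified.
 * On success *copy holds the allocation (the caller frees it) and the
 * return value points at the trimmed text inside that allocation.
 */
static char *trim_copy(const char *pattern, char **copy)
{
	size_t len;

	assert(pattern != NULL && copy != NULL);

	len = strlen(pattern) + 1;
	*copy = malloc(len);
	if (*copy == NULL)
		return NULL;

	memcpy(*copy, pattern, len);
	return trimwhite(*copy);
}
```

            A caller would pass the literal as the pattern, use the returned pointer for parsing, and free the copy afterwards. The original string is never written to, which is the property the comment above asks for.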

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20376/
            Subject: Revert "LU-5050 libcfs: default CPT matches NUMA topology"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ae6fc0156d11ae730fbb284085a2050006b570c7
            green Oleg Drokin added a comment -

            I just reverted this change. It causes 100% failure on start in RHEL7 (now the default):

            [  303.759884] libcfs: module verification failed: signature and/or required key missing - tainting kernel
            [  303.765035] BUG: unable to handle kernel paging request at ffffffffa01dafcc
            [  303.765374] IP: [<ffffffffa01cc21e>] cfs_trimwhite+0x5e/0x70 [libcfs]
            [  303.765624] PGD 1c11067 PUD 1c12063 PMD a95c9067 PTE 800000009abee161
            [  303.766105] Oops: 0003 [#1] SMP DEBUG_PAGEALLOC
            [  303.766445] Modules linked in: libcfs(OE+) rpcsec_gss_krb5 syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm ata_generic pata_acpi i2c_piix4 ata_piix serio_raw pcspkr virtio_balloon virtio_console libata i2c_core virtio_blk floppy nfsd ip_tables
            [  303.768469] CPU: 4 PID: 10077 Comm: insmod Tainted: G           OE  ------------   3.10.0-debug #1
            [  303.768685] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
            [  303.768851] task: ffff8800931f0700 ti: ffff8800af7cc000 task.ti: ffff8800af7cc000
            [  303.769041] RIP: 0010:[<ffffffffa01cc21e>]  [<ffffffffa01cc21e>] cfs_trimwhite+0x5e/0x70 [libcfs]
            [  303.769330] RSP: 0018:ffff8800af7cfcd0  EFLAGS: 00010246
            [  303.769484] RAX: ffffffffa01dafcc RBX: ffffffffa01dafcb RCX: 0000000000000001
            [  303.769669] RDX: 000000000000004e RSI: 0000000000000000 RDI: ffffffffa01dafcb
            [  303.769861] RBP: ffff8800af7cfcd8 R08: 0000000000000001 R09: 0000000000000000
            [  303.770489] R10: 0000000000000000 R11: ffff8800931f0fd8 R12: ffff8800ab5ca480
            [  303.771130] R13: ffffffffa0222000 R14: ffffffffa01dafcb R15: 0000000000000000
            [  303.771753] FS:  00007f9b47a79740(0000) GS:ffff8800bc700000(0000) knlGS:0000000000000000
            [  303.772834] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [  303.773434] CR2: ffffffffa01dafcc CR3: 00000000af2c6000 CR4: 00000000000006e0
            [  303.774045] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [  303.774682] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            [  303.775314] Stack:
            [  303.775840]  0000000000000000 ffff8800af7cfd40 ffffffffa01c25d8 ffffffffa01e3170
            [  303.777178]  ffff8800af7cfd10 00000000810aa9ee 0000000000000000 ffff8800ab5ca480
            [  303.778514]  00000000b6b465f8 0000000000000000 ffff8800ab5ca480 ffffffffa0222000
            [  303.779836] Call Trace:
            [  303.780395]  [<ffffffffa01c25d8>] cfs_cpu_init+0x1f8/0xcd0 [libcfs]
            [  303.780993]  [<ffffffffa0222000>] ? 0xffffffffa0221fff
            [  303.781588]  [<ffffffffa022202f>] libcfs_init+0x2f/0x1000 [libcfs]
            [  303.782210]  [<ffffffff810020e8>] do_one_initcall+0xb8/0x230
            [  303.782810]  [<ffffffff810f3e6e>] load_module+0x138e/0x1bc0
            [  303.783414]  [<ffffffff8139de20>] ? ddebug_proc_write+0xf0/0xf0
            [  303.784007]  [<ffffffff810eff23>] ? copy_module_from_fd.isra.40+0x53/0x150
            [  303.784639]  [<ffffffff810f4876>] SyS_finit_module+0xa6/0xd0
            [  303.785239]  [<ffffffff81711809>] system_call_fastpath+0x16/0x1b
            [  303.785853] Code: e8 b8 45 1b e1 48 01 d8 48 39 d8 77 11 eb 1c 66 0f 1f 44 00 00 48 83 e8 01 48 39 d8 74 0d 0f b6 50 ff f6 82 60 de 87 81 20 75 ea <c6> 00 00 48 89 d8 5b 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 
            [  303.791791] RIP  [<ffffffffa01cc21e>] cfs_trimwhite+0x5e/0x70 [libcfs]
            

            This corresponds to:

            0xf21e is in cfs_trimwhite (libcfs/libcfs/libcfs_string.c:185).
            180			if (!isspace(end[-1]))
            181				break;
            182			end--;
            183		}
            184	
            185		*end = 0;
            186		return str;
            187	}
            188	EXPORT_SYMBOL(cfs_trimwhite);
            189	
            
            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/20376
            Subject: Revert "LU-5050 libcfs: default CPT matches NUMA topology"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7b3e52e9e6c8b47a7d2005860d2893a02b77a05c

            People

              dmiter Dmitry Eremin (Inactive)
              morrone Christopher Morrone (Inactive)