[LU-9073] SSK: lgss_sk generates keys with invalid HMAC and Crypto algorithms Created: 01/Feb/17 Updated: 27/Jun/17 Resolved: 27/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Nathan Lavender (Inactive) | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||
| Description |
|
With the landing of commit c6f5e8121366be05765dabe0008165166d3f431c for == Master HEAD at c6f5e8121366be05765dabe0008165166d3f431c ==
==
The problem manifests itself by logging the following when secure contexts are being instantiated: kernel: LustreError: 2559:0:(gss_sk_mech.c:172:sk_fill_context()) Invalid hmac type: 65541 |
| Comments |
| Comment by James A Simmons [ 01/Feb/17 ] |
|
I see the print out error that I missed but the mapping seemed pretty straight forward. Is the bug in the tool or in the kernel code? |
| Comment by Gerrit Updater [ 01/Feb/17 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/25199 |
| Comment by James A Simmons [ 01/Feb/17 ] |
|
Can you test the patch I just pushed to see if it fixed things. Now I know who to include in reviews that can test my patches. |
| Comment by Gerrit Updater [ 02/Feb/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/25201 |
| Comment by Gerrit Updater [ 07/Feb/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25201/ |
| Comment by James A Simmons [ 08/Feb/17 ] |
|
Can you give my patch another try? |
| Comment by Nathan Lavender (Inactive) [ 08/Feb/17 ] |
|
I'm no longer seeing 'sk_fill_context()) Invalid hmac type: 65541' errors with the latest patch. However, I'm now seeing the following errors logged by clients: {{[ 560.612216] LustreError: 11271:0:(lmv_obd.c:1395:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff880138e9a000), error -4 As a result of these errors all cli2mdt and cli2ost connections are in a disconnected state; they never recover. I'll dig deeper into this tomorrow. |
| Comment by Jeremy Filizetti [ 09/Feb/17 ] |
|
I've posted comments in the patch on what I think is the problem. |
| Comment by Nathan Lavender (Inactive) [ 17/Feb/17 ] |
|
Patchset 4 crashes when built as follows: {{git checkout -b shared-key c6f5e8121366be05765dabe0008165166d3f431c Reproduce crash with: export SHARED_KEY=true; /usr/lib64/lustre/tests/llmount.sh Call trace: {{[ 416.036450] Lustre: DEBUG MARKER: Using TIMEOUT=20 |
| Comment by James A Simmons [ 27/Feb/17 ] |
|
Have you tested Kerboses as well? Does that work or do I need to test that as well. |
| Comment by Nathan Lavender (Inactive) [ 27/Feb/17 ] |
|
Kerberos(krb5p) crashes with the following on my multiple node Kerberos test environment:
[ 543.160207] Lustre: DEBUG MARKER: Setting sptlrpc rule: lustre.srpc.flavor.default=krb5p |
| Comment by James A Simmons [ 27/Feb/17 ] |
|
Let me guess. struct sk_ctx is used by both kerberos and HMAC. I have a test system working so I'm going to track this down. |
| Comment by Jeremy Filizetti [ 30/Mar/17 ] |
|
I believe the panic is related to the changes from |
| Comment by Peter Jones [ 12/May/17 ] |
|
Should we revert the culprit while we are working out the proper solution? |
| Comment by James A Simmons [ 13/May/17 ] |
|
Well the reason I haven't fixed this yet is due to the breakage from |
| Comment by Chris Hanna [ 15/May/17 ] |
|
James, I think your ordering is fine. Right now, the SSK testing patch is blocking on both of these getting fixed. |
| Comment by Sebastien Buisson (Inactive) [ 14/Jun/17 ] |
|
Hi, With set 13 of the patch at https://review.whamcloud.com/25199, I experience a NULL pointer dereference in cfs_crypto_hash_final and a crash of the client node when I unmount and remove kernel modules: [ 2164.649264] Lustre: 8077:0:(linux-crypto.c:83:cfs_crypto_hash_alloc()) Unsupported hash algorithm id = 255, max id is 10
[ 2164.649858] LustreError: 8077:0:(gss_krb5_mech.c:455:krb5_make_checksum()) failed to alloc hash hmac(sha1) : rc = -22
[ 2164.650417] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
[ 2164.650865] IP: [<ffffffffa021c253>] cfs_crypto_hash_final+0x13/0x90 [libcfs] [ 2164.651295] PGD 0
[ 2164.651403] Oops: 0000 [#1] SMP
[ 2164.651616] LustreError: 8144:0:(obd_config.c:1257:class_process_config()) no device for: seb-OST0000-osc-ffff880406457000
[ 2164.651617] LustreError: 8144:0:(obd_config.c:1985:class_manual_cleanup()) cleanup failed -22: seb-OST0000-osc-ffff880406457000
[ 2164.652617] Lustre: Unmounted seb-client
[ 2164.651403] Modules linked in: osc(OF) mgc(OF) lustre(OF) lmv(OF) mdc(OF) lov(OF) fid(OF) fld(OF) ptlrpc_gss(OF) sunrpc ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) libcfs(OF) i2c_piix4 i2c_core sg ppdev virtio_balloon parport_pc parport pcspkr ip_tables xfs libcrc32c sd_mod crct10dif_generic crc_t10dif crct10dif_common ata_generic pata_acpi virtio_scsi ata_piix 8139too virtio_pci libata virtio_ring 8139cp serio_raw virtio mii floppy
[ 2164.653282] CPU: 2 PID: 8077 Comm: sptlrpc_gc Tainted: GF O-------------- 3.10.0-229.20.1.el7.x86_64 #1
[ 2164.653282] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 2164.653282] task: ffff880404eb4fa0 ti: ffff880406e40000 task.ti: ffff880406e40000
[ 2164.653282] RIP: 0010:[<ffffffffa021c253>] [<ffffffffa021c253>] cfs_crypto_hash_final+0x13/0x90 [libcfs]
[ 2164.653282] RSP: 0018:ffff880406e43a58 EFLAGS: 00010246
[ 2164.653282] RAX: 0000000000000000 RBX: ffff880406e43b28 RCX: 0000000000000006
[ 2164.653282] RDX: ffff880406e43b28 RSI: 0000000000000000 RDI: ffffffffffffffea
[ 2164.653282] RBP: ffff880406e43a78 R08: 0000000000000096 R09: 000000000000028c
[ 2164.653282] R10: 0720072007200720 R11: 0720072007200720 R12: ffffffffffffffea
[ 2164.653282] R13: 00000000ffffffea R14: ffff880408dd3a20 R15: 000000000000000a
[ 2164.653282] FS: 0000000000000000(0000) GS:ffff88041fc80000(0000) knlGS:0000000000000000
[ 2164.653282] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2164.653282] CR2: 000000000000000a CR3: 0000000405908000 CR4: 00000000000006e0
[ 2164.653282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2164.653282] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2164.653282] Stack:
[ 2164.653282] ffff880406e43b28 ffffffffffffffea 00000000ffffffea ffff880408dd3a20
[ 2164.653282] ffff880406e43af8 ffffffffa0765332 ffff88041fc8f0a0 00000000000002d0
[ 2164.653282] 0000000000000000 ffff880406e43bb0 0000000200000000 ffff8800368fef40
[ 2164.653282] Call Trace:
[ 2164.653282] [<ffffffffa0765332>] krb5_make_checksum+0x352/0x630 [ptlrpc_gss]
[ 2164.653282] [<ffffffffa07663ea>] gss_get_mic_kerberos+0x9a/0x1c0 [ptlrpc_gss]
[ 2164.653282] [<ffffffffa0762c0e>] lgss_get_mic+0x2e/0x100 [ptlrpc_gss]
[ 2164.653282] [<ffffffffa074e696>] gss_sign_msg+0x166/0x300 [ptlrpc_gss]
[ 2164.653282] [<ffffffffa075387a>] gss_cli_ctx_sign+0x1ba/0x330 [ptlrpc_gss]
[ 2164.653282] [<ffffffffa05b3f3c>] sptlrpc_cli_wrap_request+0xcc/0x360 [ptlrpc]
[ 2164.653282] [<ffffffffa05841d2>] ptl_send_rpc+0x1f2/0xe40 [ptlrpc]
[ 2164.653282] [<ffffffffa0578a77>] ? ptlrpc_add_unreplied+0xc7/0x100 [ptlrpc]
[ 2164.653282] [<ffffffffa057efbc>] ? ptlrpc_request_bufs_pack+0x1ec/0x4d0 [ptlrpc]
[ 2164.653282] [<ffffffffa075cf60>] gss_do_ctx_fini_rpc+0x320/0x500 [ptlrpc_gss]
[ 2164.653282] [<ffffffffa0755cec>] gss_cli_ctx_fini_common+0x5c/0x2d0 [ptlrpc_gss]
[ 2164.653282] [<ffffffffa076d878>] ctx_destroy_kr+0x88/0x3d0 [ptlrpc_gss]
[ 2164.653282] [<ffffffffa076e01d>] gss_sec_release_ctx_kr+0x2d/0xa0 [ptlrpc_gss]
[ 2164.653282] [<ffffffffa05b2062>] sptlrpc_cli_ctx_put+0x42/0xb0 [ptlrpc]
[ 2164.653282] [<ffffffffa05bd34d>] sec_process_ctx_list+0xdd/0x180 [ptlrpc]
[ 2164.653282] [<ffffffffa05bd59c>] sec_gc_main+0x7c/0x420 [ptlrpc]
[ 2164.653282] [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
[ 2164.653282] [<ffffffffa05bd520>] ? sptlrpc_gc_add_ctx+0x130/0x130 [ptlrpc]
[ 2164.653282] [<ffffffff8109727f>] kthread+0xcf/0xe0
[ 2164.653282] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[ 2164.653282] [<ffffffff81614358>] ret_from_fork+0x58/0x90
[ 2164.653282] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[ 2164.653282] Code: 09 48 83 c4 28 5b 41 5c 5d c3 e8 59 1e e5 e0 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 85 f6 48 89 e5 41 56 41 55 41 54 53 <48> 8b 4f 20 48 89 fb 48 8b 41 50 44 8b 70 f8 74 5c 48 85 d2 49
[ 2164.653282] RIP [<ffffffffa021c253>] cfs_crypto_hash_final+0x13/0x90 [libcfs]
[ 2164.653282] RSP <ffff880406e43a58>
[ 2164.653282] CR2: 000000000000000a
This is on a CentOS 7 node. |
| Comment by James A Simmons [ 19/Jun/17 ] |
|
Try it now Sebastien |
| Comment by Sebastien Buisson (Inactive) [ 20/Jun/17 ] |
|
To document completely the case, with patchset 14 I was not able to reproduce the crash anymore, but I was seeing the following traces when unmounting the client (on CentOS 7): [ 4915.909072] Lustre: 1386:0:(linux-crypto.c:83:cfs_crypto_hash_alloc()) Unsupported hash algorithm id = 255, max id is 10 [ 4915.909701] LustreError: 1386:0:(gss_krb5_mech.c:455:krb5_make_checksum()) failed to alloc hash hmac(sha1) : rc = -22 [ 4915.910254] LustreError: 1386:0:(sec_gss.c:204:gss_sign_msg()) fail to generate MIC: 000d0000 Now with patchset 15 on a CentOS 7 client, I have the following crash when unmounting the client: [438940.599108] BUG: unable to handle kernel paging request at 00000000deadbeef
[438940.599483] IP: [<ffffffff8129b01b>] sha1_final+0x7b/0x170
[438940.599783] PGD 0
[438940.599903] Oops: 0002 [#1] SMP
[438940.600075] Modules linked in: osc(OF) mgc(OF) lustre(OF) lmv(OF) mdc(OF) lov(OF) fid(OF) fld(OF) ptlrpc_gss(OF) ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) libcfs(OF) sha512_generic sunrpc sg virtio_balloon ppdev parport_pc parport i2c_piix4 i2c_core pcspkr ip_tables xfs libcrc32c sd_mod crct10dif_generic crc_t10dif crct10dif_common ata_generic pata_acpi virtio_scsi ata_piix 8139too libata virtio_pci serio_raw virtio_ring 8139cp virtio mii floppy [last unloaded: libcfs]
[438940.600075] CPU: 6 PID: 21534 Comm: sptlrpc_gc Tainted: GF O-------------- 3.10.0-229.20.1.el7.x86_64 #1
[438940.603186] LustreError: 21611:0:(obd_config.c:1294:class_process_config()) no device for: seb-OST0000-osc-ffff880405907800
[438940.603187] LustreError: 21611:0:(obd_config.c:2024:class_manual_cleanup()) cleanup failed -22: seb-OST0000-osc-ffff880405907800 [438940.600075] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[438940.600075] task: ffff880405c3e660 ti: ffff880406410000 task.ti: ffff880406410000
[438940.600075] RIP: 0010:[<ffffffff8129b01b>] [<ffffffff8129b01b>] sha1_final+0x7b/0x170
[438940.605512] Lustre: Unmounted seb-client
[438940.605562] RSP: 0018:ffff880406413928 EFLAGS: 00010246
[438940.605562] RAX: 0000000000000000 RBX: ffff8803d1c2dc70 RCX: 0000000000000000
[438940.605562] RDX: 000000002d49b7f0 RSI: ffff880406413930 RDI: ffff8803d1c2dc8c
[438940.605562] RBP: ffff880406413950 R08: 00000000b7bd30b2 R09: 00000000f128d062
[438940.605562] R10: 00000000ec79780e R11: 000000001dd8f2f0 R12: 00000000deadbeef
[438940.605562] R13: ffff8803d1c2dc60 R14: 0000000000000000 R15: ffff880408e1be00
[438940.605562] FS: 0000000000000000(0000) GS:ffff88041fd80000(0000) knlGS:0000000000000000
[438940.605562] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[438940.605562] CR2: 00000000deadbeef CR3: 00000004072f1000 CR4: 00000000000006e0
[438940.605562] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[438940.605562] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[438940.605562] Stack:
[438940.605562] e00a000000000000 000000001a7a96b2 00000000deadbeef 00000000deadbeef
[438940.605562] ffff8803d1c2dc50 ffff880406413988 ffffffff8129522c ffff8803d1c2dc60
[438940.605562] 00000000deadbeef ffff8803d1c2dc50 0000000000000000 ffff880408e1be00
[438940.605562] Call Trace:
[438940.605562] [<ffffffff8129522c>] crypto_shash_final+0x2c/0xb0
[438940.605562] [<ffffffff8129a70a>] hmac_final+0x4a/0xb0
[438940.605562] [<ffffffff812953b0>] ? crypto_shash_digest+0x40/0x40
[438940.605562] [<ffffffff8129522c>] crypto_shash_final+0x2c/0xb0
[438940.605562] [<ffffffff812953b0>] ? crypto_shash_digest+0x40/0x40
[438940.605562] [<ffffffff812953c6>] shash_async_final+0x16/0x20
[438940.605562] [<ffffffff81294c9a>] crypto_ahash_op+0x2a/0xc0
[438940.605562] [<ffffffff81294d46>] crypto_ahash_final+0x16/0x20
[438940.605562] [<ffffffffa026a09f>] cfs_crypto_hash_final+0x4f/0x90 [libcfs]
[438940.605562] [<ffffffffa0797337>] krb5_make_checksum+0x347/0x630 [ptlrpc_gss]
[438940.605562] [<ffffffffa07983fa>] gss_get_mic_kerberos+0x9a/0x1c0 [ptlrpc_gss]
[438940.605562] [<ffffffffa0794c1e>] lgss_get_mic+0x2e/0x100 [ptlrpc_gss]
[438940.605562] [<ffffffffa0780696>] gss_sign_msg+0x166/0x300 [ptlrpc_gss]
[438940.605562] [<ffffffffa078587a>] gss_cli_ctx_sign+0x1ba/0x330 [ptlrpc_gss]
[438940.605562] [<ffffffffa05e112c>] sptlrpc_cli_wrap_request+0xcc/0x360 [ptlrpc]
[438940.605562] [<ffffffffa05b0ec2>] ptl_send_rpc+0x1f2/0xe60 [ptlrpc]
[438940.605562] [<ffffffffa05a57e7>] ? ptlrpc_add_unreplied+0xc7/0x100 [ptlrpc]
[438940.605562] [<ffffffffa05abcac>] ? ptlrpc_request_bufs_pack+0x1ec/0x4d0 [ptlrpc]
[438940.605562] [<ffffffffa078ef60>] gss_do_ctx_fini_rpc+0x320/0x500 [ptlrpc_gss]
[438940.605562] [<ffffffffa0787cec>] gss_cli_ctx_fini_common+0x5c/0x2d0 [ptlrpc_gss]
[438940.605562] [<ffffffffa079f7d8>] ctx_destroy_kr+0x88/0x3d0 [ptlrpc_gss]
[438940.605562] [<ffffffffa079ff7d>] gss_sec_release_ctx_kr+0x2d/0xa0 [ptlrpc_gss]
[438940.605562] [<ffffffffa05df252>] sptlrpc_cli_ctx_put+0x42/0xb0 [ptlrpc]
[438940.605562] [<ffffffffa05ea52d>] sec_process_ctx_list+0xdd/0x180 [ptlrpc]
[438940.605562] [<ffffffffa05ea77c>] sec_gc_main+0x7c/0x420 [ptlrpc]
[438940.605562] [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
[438940.605562] [<ffffffffa05ea700>] ? sptlrpc_gc_add_ctx+0x130/0x130 [ptlrpc]
[438940.605562] [<ffffffff8109727f>] kthread+0xcf/0xe0
[438940.605562] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[438940.605562] [<ffffffff81614358>] ret_from_fork+0x58/0x90
[438940.605562] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[438940.605562] Code: 00 29 c2 48 c7 c6 80 fc 66 81 4c 89 ef e8 8e fe ff ff 48 8d 75 d8 ba 08 00 00 00 4c 89 ef e8 7d fe ff ff 31 c0 8b 54 03 08 0f ca <41> 89 14 04 48 83 c0 04 48 83 f8 14 75 ec f6 c3 01 48 89 df be
[438940.605562] RIP [<ffffffff8129b01b>] sha1_final+0x7b/0x170
[438940.605562] RSP <ffff880406413928>
[438940.605562] CR2: 00000000deadbeef
|
| Comment by Sebastien Buisson (Inactive) [ 20/Jun/17 ] |
|
With patchset 16: [20763.922197] BUG: unable to handle kernel paging request at 00000000deadbef7
[20763.922606] IP: [<ffffffff812e2a85>] memcpy+0xb5/0x110
[20763.922927] PGD 0
[20763.923044] Oops: 0000 [#1] SMP
[20763.925271] LustreError: 1226:0:(obd_config.c:1294:class_process_config()) no device for: seb-OST0000-osc-ffff8800d2c1c800
[20763.925273] LustreError: 1226:0:(obd_config.c:2024:class_manual_cleanup()) cleanup failed -22: seb-OST0000-osc-ffff8800d2c1c800
[20763.926581] Lustre: Unmounted seb-client
[20763.923161] Modules linked in: osc(OF) mgc(OF) lustre(OF) lmv(OF) mdc(OF) lov(OF) fid(OF) fld(OF) ptlrpc_gss(OF) ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) libcfs(OF) sunrpc i2c_piix4 i2c_core sg virtio_balloon ppdev parport_pc parport pcspkr ip_tables xfs libcrc32c sd_mod crct10dif_generic crc_t10dif crct10dif_common ata_generic pata_acpi virtio_scsi 8139too virtio_pci virtio_ring serio_raw 8139cp virtio ata_piix mii libata floppy [last unloaded: libcfs] [20763.927207] CPU: 5 PID: 1172 Comm: sptlrpc_gc Tainted: GF O-------------- 3.10.0-229.20.1.el7.x86_64 #1
[20763.927207] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[20763.927207] task: ffff8804088871c0 ti: ffff880406c1c000 task.ti: ffff880406c1c000
[20763.927207] RIP: 0010:[<ffffffff812e2a85>] [<ffffffff812e2a85>] memcpy+0xb5/0x110
[20763.927207] RSP: 0018:ffff880406c1fb00 EFLAGS: 00010202
[20763.927207] RAX: ffff880404ff2b50 RBX: ffff880404ff2b40 RCX: 0000000000000028
[20763.927207] RDX: 000000000000000c RSI: 00000000deadbef7 RDI: ffff880404ff2b50
[20763.927207] RBP: ffff880406c1fb68 R08: 00000000000163c0 R09: ffff88041fd563c0
[20763.927207] R10: ffffea000f427640 R11: ffffffffa04583a4 R12: ffff880406c1fb28
[20763.927207] R13: 00000000000002d0 R14: ffff880406c1fba0 R15: 0000000000000012
[20763.927207] FS: 0000000000000000(0000) GS:ffff88041fd40000(0000) knlGS:0000000000000000
[20763.927207] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[20763.927207] CR2: 00000000deadbef7 CR3: 000000000190a000 CR4: 00000000000006e0
[20763.927207] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[20763.927207] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[20763.927207] Stack:
[20763.927207] ffffffffa0459452 0000000000000000 ffff880406c1fb28 0000000000000000
[20763.927207] 0000000000000012 ffff880300000014 00000000deadbeef 000000004c1a9725
[20763.927207] 0000000000000002 ffff880404ff2a08 0000000000000002 ffff880406c1fbd0
[20763.927207] Call Trace:
[20763.927207] [<ffffffffa0459452>] ? gss_get_mic_kerberos+0x102/0x1c0 [ptlrpc_gss]
[20763.927207] [<ffffffffa0455c1e>] lgss_get_mic+0x2e/0x100 [ptlrpc_gss]
[20763.927207] [<ffffffffa0441696>] gss_sign_msg+0x166/0x300 [ptlrpc_gss]
[20763.927207] [<ffffffffa044687a>] gss_cli_ctx_sign+0x1ba/0x330 [ptlrpc_gss]
[20763.927207] [<ffffffffa07ba12c>] sptlrpc_cli_wrap_request+0xcc/0x360 [ptlrpc]
[20763.927207] [<ffffffffa0789ec2>] ptl_send_rpc+0x1f2/0xe60 [ptlrpc]
[20763.927207] [<ffffffffa077e7e7>] ? ptlrpc_add_unreplied+0xc7/0x100 [ptlrpc]
[20763.927207] [<ffffffffa0784cac>] ? ptlrpc_request_bufs_pack+0x1ec/0x4d0 [ptlrpc]
[20763.927207] [<ffffffffa044ff60>] gss_do_ctx_fini_rpc+0x320/0x500 [ptlrpc_gss]
[20763.927207] [<ffffffffa0448cec>] gss_cli_ctx_fini_common+0x5c/0x2d0 [ptlrpc_gss]
[20763.927207] [<ffffffffa04607c8>] ctx_destroy_kr+0x88/0x3d0 [ptlrpc_gss]
[20763.927207] [<ffffffffa0460f6d>] gss_sec_release_ctx_kr+0x2d/0xa0 [ptlrpc_gss]
[20763.927207] [<ffffffffa07b8252>] sptlrpc_cli_ctx_put+0x42/0xb0 [ptlrpc]
[20763.927207] [<ffffffffa07c352d>] sec_process_ctx_list+0xdd/0x180 [ptlrpc]
[20763.927207] [<ffffffffa07c377c>] sec_gc_main+0x7c/0x420 [ptlrpc]
[20763.927207] [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
[20763.927207] [<ffffffffa07c3700>] ? sptlrpc_gc_add_ctx+0x130/0x130 [ptlrpc]
[20763.927207] [<ffffffff8109727f>] kthread+0xcf/0xe0
[20763.927207] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[20763.927207] [<ffffffff81614358>] ret_from_fork+0x58/0x90
[20763.927207] [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[20763.927207] Code: 72 24 4c 8b 06 4c 8b 4e 08 4c 8b 54 16 f0 4c 8b 5c 16 f8 4c 89 07 4c 89 4f 08 4c 89 54 17 f0 4c 89 5c 17 f8 c3 90 83 fa 08 72 1b <4c> 8b 06 4c 8b 4c 16 f8 4c 89 07 4c 89 4c 17 f8 c3 66 2e 0f 1f
[20763.927207] RIP [<ffffffff812e2a85>] memcpy+0xb5/0x110
[20763.927207] RSP <ffff880406c1fb00>
[20763.927207] CR2: 00000000deadbef7
|
| Comment by James A Simmons [ 20/Jun/17 ] |
|
So the original code if you look at it leaks rawobj_t like crazy so I plugged the leaks. I was a bit over zealous and in the krb5 case we still need the rawobj_t around after running krb5_make_checksum(). Thanks for testing this. I don't have a KRB5 setup. |
| Comment by Sebastien Buisson (Inactive) [ 21/Jun/17 ] |
|
Hi James, I understand. juin 22 00:05:20 ltest-vm4 kernel: LustreError: 26884:0:(gss_krb5_mech.c:633:gss_verify_mic_kerberos()) checksum mismatch juin 22 00:05:20 ltest-vm4 kernel: LustreError: 26884:0:(sec_gss.c:243:gss_verify_msg()) mic verify error: 00060000 juin 22 00:05:20 ltest-vm4 kernel: LustreError: 26884:0:(sec_gss.c:2122:gss_svc_verify_request()) failed to verify request: 60000 juin 22 00:05:26 ltest-vm4 kernel: Lustre: 26630:0:(sec_gss.c:1222:gss_cli_ctx_fini_common()) reverse sec ffff880409776900: destroy ctx ffff88040748b240 |
| Comment by James A Simmons [ 21/Jun/17 ] |
|
Any good at doing dynamic tracing with perf or do you want me to push a patch with CERROR added. The current info is not enough for me to figure out what is going on. |
| Comment by Peter Jones [ 21/Jun/17 ] |
|
James Given that Sebastien is in France and it is very late there already I would suggest pushing the patch with CERROR proactively would make sense - we already know that the existing version is not going to be sutiable for landing... Peter |
| Comment by Sebastien Buisson (Inactive) [ 22/Jun/17 ] |
|
Hi James, Here the corresponding stack. Does it help, or do you need anything else? ll_ost00_002 28897 [003] 686311.879332: probe:gss_verify_mic_kerberos: (ffffffffa0d12614)
18685 gss_verify_mic_kerberos ([unknown])
14d8e lgss_verify_mic ([unknown])
97a gss_verify_msg ([unknown])
5014 gss_svc_verify_request ([unknown])
a172 gss_svc_accept ([unknown])
1efd8 gss_svc_accept_kr ([unknown])
84ebe sptlrpc_svc_unwrap_request ([unknown])
65f94 ptlrpc_main ([unknown])
7fff8109927f kthread ([kernel.kallsyms])
7fff81616358 ret_from_fork ([kernel.kallsyms])
|
| Comment by James A Simmons [ 22/Jun/17 ] |
|
While looking to add various debug message I really thought about what the code was doing and I think the reason for your error was found. The function cfs_crypto_hash_final() has to be called before gss_crypt_generic() was called. It was being called after so the checksum final value wasn't being cached which lead to your failure. I fixed it up in the latest patch. Give it a try. |
| Comment by Sebastien Buisson (Inactive) [ 22/Jun/17 ] |
|
Same behavior with patchset 18 juin 23 01:01:20 ltest-vm4 kernel: LustreError: 16857:0:(gss_krb5_mech.c:636:gss_verify_mic_kerberos()) checksum mismatch juin 23 01:01:20 ltest-vm4 kernel: LustreError: 16857:0:(sec_gss.c:243:gss_verify_msg()) mic verify error: 00060000 juin 23 01:01:20 ltest-vm4 kernel: LustreError: 16857:0:(sec_gss.c:2122:gss_svc_verify_request()) failed to verify request: 60000 juin 23 01:01:25 ltest-vm4 kernel: Lustre: 16810:0:(sec_gss.c:1222:gss_cli_ctx_fini_common()) reverse sec ffff8800366ea500: destroy ctx ffff8803f5b7f0c0 ll_ost00_002 16857 [006] 715524.594882: probe:gss_verify_mic_kerberos: (ffffffffa0d12624)
18695 gss_verify_mic_kerberos ([unknown])
14d8e lgss_verify_mic ([unknown])
97a gss_verify_msg ([unknown])
5014 gss_svc_verify_request ([unknown])
a172 gss_svc_accept ([unknown])
1f098 gss_svc_accept_kr ([unknown])
84ebe sptlrpc_svc_unwrap_request ([unknown])
65f94 ptlrpc_main ([unknown])
7fff8109927f kthread ([kernel.kallsyms])
7fff81616358 ret_from_fork ([kernel.kallsyms])
|
| Comment by James A Simmons [ 22/Jun/17 ] |
|
I have two theories why this might be happening. I pushed a debug patch. Please post the results of the run. |
| Comment by Sebastien Buisson (Inactive) [ 22/Jun/17 ] |
|
With patchset 20, on server side when client is unmounted: juin 23 03:27:19 ltest-vm4 kernel: LustreError: 3437:0:(gss_krb5_mech.c:470:krb5_make_checksum()) cksum->len 20, ke_hash_size 12 for ke_hash_name sha1
juin 23 03:27:19 ltest-vm4 kernel: LustreError: 3437:0:(gss_krb5_mech.c:643:gss_verify_mic_kerberos()) checksum mismatch
juin 23 03:27:19 ltest-vm4 kernel: LustreError: 3437:0:(sec_gss.c:243:gss_verify_msg()) mic verify error: 00060000
juin 23 03:27:19 ltest-vm4 kernel: LustreError: 3437:0:(sec_gss.c:2122:gss_svc_verify_request()) failed to verify request: 60000
juin 23 03:27:25 ltest-vm4 kernel: Lustre: 3390:0:(sec_gss.c:1222:gss_cli_ctx_fini_common()) reverse sec ffff8804058cf300: destroy ctx ffff8804062433c0
|
| Comment by James A Simmons [ 23/Jun/17 ] |
|
Found the reason for your error. The bug was in the user land utility. The bug was located in sk_session_kdf() and sk_compute_keys(). Before we had an entry for "ctr(aes)" in the libcfs crypto abstraction but I removed it. That was causing the mentioned functions to generate the incorrect size keys for the keyring. Currently I just hard coded the key size since we only use "ctr(aes)". I just discovered /proc/crypto has all the info we need. Will create a latter patch to handle that. Please give it a try. Thanks for you patience. It was a crash course in krb5 the last few days. |
| Comment by Sebastien Buisson (Inactive) [ 26/Jun/17 ] |
|
Still 'checksum mismatch' errors on server side at client unmount when testing patchset 21 with krb5n flavor. juin 26 16:39:58 ltest-vm4 kernel: LustreError: 6544:0:(gss_krb5_mech.c:640:gss_verify_mic_kerberos()) checksum mismatch juin 26 16:39:58 ltest-vm4 kernel: LustreError: 6544:0:(sec_gss.c:243:gss_verify_msg()) mic verify error: 00060000 juin 26 16:39:58 ltest-vm4 kernel: LustreError: 6544:0:(sec_gss.c:2122:gss_svc_verify_request()) failed to verify request: 60000 juin 26 16:40:04 ltest-vm4 kernel: Lustre: 6494:0:(sec_gss.c:1222:gss_cli_ctx_fini_common()) reverse sec ffff8803f9680d00: destroy ctx It seems patchset 21 only fixes sk_utils.c, so not related to krb5. |
| Comment by James A Simmons [ 26/Jun/17 ] |
|
I have to ask does it work in 2.9 for you? I found with hmac the 2.9 clients doesn't work for me so I'm wondering it the GSS code works at all for anyone? |
| Comment by Andreas Dilger [ 26/Jun/17 ] |
|
What is the current state of affairs with and without James' patch for Kerberos and SSK? Sebastien, you report problems at unmount, but does Kerberos work for normal usage before that time with James' patch applied? Is it "more" broken without the patch, or is it working without the patch and the patch breaks Kerberos? Chris Hanna or Nathan Lavender, does James' patch fix SSK again? I'm trying to figure out if we should land James' patch to fix the problem(s) found so far and then get a later patch to fix the remaining problem(s)? It seems like there are multiple issues here, and I'd like to move forward with getting fixes landed, since this is the last problem holding up the 2.10 release. |
| Comment by Gerrit Updater [ 26/Jun/17 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/27823 |
| Comment by James A Simmons [ 26/Jun/17 ] |
|
I just pushed a revert in the hopes we are back to the state of lustre 2.9. For the revert I made it so GSS is disabled with newer kernels instead so this will not hinder newer kernel support. This revert is just a very poor band aid. IMHO based on my testing I don't think GSS is ready for production systems. Even with the 2.9 client I found I couldn't get hmac support going with multiple nodes. Sebastien reported for 2.9 that gss null is unstable and it tends to kernel panic. Also I just could never get GSS working without nodemap. Is that suppose to be the case? I found issues like a missing lgssc.conf file will cause phantom keys in the kernel which caused problems. |
| Comment by Chris Hanna [ 26/Jun/17 ] |
|
James, Nodemap is needed for SSK (see Lustre manual 24.5) in a multinode setup. It does not necessarily need to be an activated feature, however. The lgssc.conf file should be created by the test-framework.sh script. If the nodemap is not set correctly set up, you definitely will see strange issues. SSK did function in a client-to-server mode prior to the initial GSS patches. In the most recent version of James' patch it appeared to work in sanity-sec up to test_15, until it ran into some issues related to fileop in test_16 and did not recover. I can't speak to the most recent reversion James just uploaded. I don't expect this feature needs to get in the way of releasing the many other improvements in 2.10. |
| Comment by Peter Jones [ 26/Jun/17 ] |
|
hannac could you please need out the behaviour on RHEL7.x with James's latest reversion patch? We don't want to release 2.10.0 with a drop in functionality compared to 2.9. |
| Comment by Sebastien Buisson (Inactive) [ 26/Jun/17 ] |
|
With the patch ' juin 27 05:30:51 ltest-vm4 kernel: Lustre: 11812:0:(sec_gss.c:2323:gss_svc_handle_destroy()) destroy svc ctx ffff8803fbd6bc40 idx 0x31cb7c0f1b298198 (0->10.128.11.159@tcp) juin 27 05:30:56 ltest-vm4 kernel: Lustre: 11764:0:(sec_gss.c:1222:gss_cli_ctx_fini_common()) reverse sec ffff8800364a4500: destroy ctx ffff880036603780 That being said, the problem mentioned earlier is not blocking, and could be tackled in another Jira. I would support the idea of having patch at https://review.whamcloud.com/25199 landed, in order not to block support for GSS with newer kernels, knowing that it gives 'checksum mismatch' errors on server side when clients are unmounted. What do you think? |
| Comment by Chris Hanna [ 26/Jun/17 ] |
|
After spending some time with the latest reversion from James, it appears to have fixed the issue. We were able to run sanity-sec and sanity for some time. There were some errors, but SSK was engaged any many tests did pass. I would proceed with the reversion if it is holding you back, and the errors can be diagnosed when time permits. |
| Comment by Andreas Dilger [ 27/Jun/17 ] |
|
Sebastien, Chris, could you please mark +1 on the patch if it is working for you, so that we can land it. |
| Comment by Sebastien Buisson (Inactive) [ 27/Jun/17 ] |
|
Chris, did you +1 on the right patch? |
| Comment by Gerrit Updater [ 27/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27823/ |
| Comment by Peter Jones [ 27/Jun/17 ] |
|
After some discussion we elected to land the reversion option for 2.10.0 and then aim to sort out GSS support for new kernel versions post-2.10.0. The rationale is that this keeps consistent behaviour with 2.9 for the officially supported distros. |