[LU-9073] SSK: lgss_sk generates keys with invalid HMAC and Crypto algorithms Created: 01/Feb/17  Updated: 27/Jun/17  Resolved: 27/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Blocker
Reporter: Nathan Lavender (Inactive) Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9567 sptlrpc rules are not being updated Resolved
is related to LU-8602 Support GSS crypto code with linux 4.... Resolved
is related to LU-9582 gssnull instability Open
is related to LU-9086 obd_config.c:1258:class_process_confi... Resolved
is related to LU-9430 logic errors in lgss_sk code Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With the landing of commit c6f5e8121366be05765dabe0008165166d3f431c for LU-8602, lgss_sk now generates keys with invalid HMAC and Crypto algorithms. The HMAC and Crypto algorithms are being swapped.

== Master HEAD at c6f5e8121366be05765dabe0008165166d3f431c ==

  1. lgss_sk -t server -f testfs -w testfs_test_with_LU-8602.key -d /dev/urandom
    Reading random data for shared key from '/dev/urandom'
  1. lgss_sk -r testfs_test_with_LU-8602.key
    warning: secret key 'testfs_test_with_LU-8602.key' has insecure file mode 0100400
    Version: 1
    Type: server
    HMAC alg: AES-256-CTR
    Crypto alg: sha256
    Ctx Expiration: 604800 seconds
    Shared keylen: 256 bits
    Prime length: 2048 bits
    File system: testfs
    MGS NIDs:
    Nodemap name: default

== LU-8602 reverted ==

  1. lgss_sk -t server -f testfs -w testfs_test_without_LU-8602.key -d /dev/urandom
    Reading random data for shared key from '/dev/urandom'
  1. lgss_sk -r testfs_test_without_LU-8602.key
    warning: secret key 'testfs_test_without_LU-8602.key' has insecure file mode 0100400
    Version: 1
    Type: server
    HMAC alg: SHA256
    Crypto alg: AES-256-CTR
    Ctx Expiration: 604800 seconds
    Shared keylen: 256 bits
    Prime length: 2048 bits
    File system: testfs
    MGS NIDs:
    Nodemap name: default
  1. lgss_sk -r testfs_test_with_LU-8602.key
    warning: secret key 'testfs_test_with_LU-8602.key' has insecure file mode 0100400
    Invalid HMAC algorithm
    error: key configuration failed validation

The problem manifests itself by logging the following when secure contexts are being instantiated:

kernel: LustreError: 2559:0:(gss_sk_mech.c:172:sk_fill_context()) Invalid hmac type: 65541
kernel: LustreError: 2559:0:(gss_sk_mech.c:172:sk_fill_context()) Skipped 1 previous similar message
kernel: LustreError: 2559:0:(gss_svc_upcall.c:668:rsc_parse()) parse rsc error -22
kernel: LustreError: 2559:0:(gss_svc_upcall.c:668:rsc_parse()) Skipped 1 previous similar message
kernel: LustreError: 2450:0:(gss_svc_upcall.c:1018:gss_svc_upcall_handle_init()) authentication failed



 Comments   
Comment by James A Simmons [ 01/Feb/17 ]

I see the print out error that I missed but the mapping seemed pretty straight forward. Is the bug in the tool or in the kernel code?

Comment by Gerrit Updater [ 01/Feb/17 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/25199
Subject: LU-9073 gss: Fix mixup between standard crypto and krb5 handling
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 741c567849282d12ed3d45d51d32b286393b092f

Comment by James A Simmons [ 01/Feb/17 ]

Can you test the patch I just pushed to see if it fixed things. Now I know who to include in reviews that can test my patches.

Comment by Gerrit Updater [ 02/Feb/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/25201
Subject: LU-9073 gss: quiet insecure key file warning
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 169a55899cd1346fb6ecae5cc396e928c7c3ea27

Comment by Gerrit Updater [ 07/Feb/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25201/
Subject: LU-9073 gss: quiet insecure key file warning
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b5c4f0305323a44c38357420e1ee84dd343f4875

Comment by James A Simmons [ 08/Feb/17 ]

Can you give my patch another try?

Comment by Nathan Lavender (Inactive) [ 08/Feb/17 ]

I'm no longer seeing 'sk_fill_context()) Invalid hmac type: 65541' errors with the latest patch. However, I'm now seeing the following errors logged by clients:

{{[ 560.612216] LustreError: 11271:0:(lmv_obd.c:1395:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff880138e9a000), error -4
[ 560.612380] LustreError: 11271:0:(llite_lib.c:1737:ll_statfs_internal()) md_statfs fails: rc = -4
[ 565.763600] LustreError: 3033:0:(sec_gss.c:204:gss_sign_msg()) fail to generate MIC: 000d0000}}

As a result of these errors all cli2mdt and cli2ost connections are in a disconnected state; they never recover. I'll dig deeper into this tomorrow.

Comment by Jeremy Filizetti [ 09/Feb/17 ]

I've posted comments in the patch on what I think is the problem.

Comment by Nathan Lavender (Inactive) [ 17/Feb/17 ]

Patchset 4 crashes when built as follows:

{{git checkout -b shared-key c6f5e8121366be05765dabe0008165166d3f431c
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/80/20780/66 && git cherry-pick FETCH_HEAD
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/99/25199/4 && git cherry-pick FETCH_HEAD
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/36/24236/2 && git cherry-pick FETCH_HEAD}}

Reproduce crash with:

export SHARED_KEY=true; /usr/lib64/lustre/tests/llmount.sh

Call trace:

{{[ 416.036450] Lustre: DEBUG MARKER: Using TIMEOUT=20
[ 426.124318] Lustre: DEBUG MARKER: Setting sptlrpc rule: lustre.srpc.flavor.default.cli2mdt=ski
[ 426.140373] Lustre: DEBUG MARKER: Setting sptlrpc rule: lustre.srpc.flavor.default.cli2ost=ski
[ 445.048297] Lustre: 4666:0:(sec_gss.c:2063:gss_svc_handle_init()) create svc ctx ffff8800b196e440: user from 0@lo authenticated as root
[ 445.048311] Lustre: 4666:0:(sec_gss.c:371:gss_cli_ctx_uptodate()) server installed reverse ctx ffff880130545a80 idx 0xe3d2a3daf9dd43cb, expiry 1487965290(+604800s)
[ 445.067518] Lustre: 4669:0:(sec_gss.c:378:gss_cli_ctx_uptodate()) client refreshed ctx ffff8800a8fef3c0 idx 0xe3d2a3daf9dd43ce (0->lustre-MDT0000_UUID), expiry 1487965280(+604790s)
[ 445.067534] Lustre: 4669:0:(gss_svc_upcall.c:863:gss_svc_upcall_install_rvs_ctx()) create reverse svc ctx ffff8800b196e040 to lustre-MDT0000_UUID: idx 0xe3d2a3daf9dd43cb
[ 445.067590] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 445.067674] IP: [<ffffffff81076e6d>] sha256_ssse3_finup+0xcd/0x2f0
[ 445.067748] PGD 130c5c067 PUD 130536067 PMD 0
[ 445.067807] Oops: 0002 1 SMP
[ 445.067867] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) mbcache jbd2 loop zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate lustre(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device iosf_mbi snd_pcm snd_timer snd crc32_pclmul ghash_clmulni_intel soundcore aesni_intel lrw gf128mul glue_helper ablk_helper ppdev cryptd parport_pc parport sg i2c_piix4 pcspkr virtio_balloon nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
[ 445.068479] xfs libcrc32c sr_mod cdrom ata_generic pata_acpi virtio_blk virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul ttm crct10dif_common crc32c_intel serio_raw ata_piix virtio_pci virtio_ring virtio drm i2c_core libata floppy dm_mirror dm_region_hash dm_log dm_mod
[ 445.068724] CPU: 1 PID: 2536 Comm: ptlrpcd_00_00 Tainted: P OE ------------ 3.10.0-514.6.1.el7_lustre.x86_64 #1
[ 445.068800] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 445.068856] task: ffff8801363d0fb0 ti: ffff8800b3efc000 task.ti: ffff8800b3efc000
[ 445.068921] RIP: 0010:[<ffffffff81076e6d>] [<ffffffff81076e6d>] sha256_ssse3_finup+0xcd/0x2f0
[ 445.069012] RSP: 0018:ffff8800b3eff860 EFLAGS: 00010202
[ 445.069066] RAX: 0000000000000000 RBX: ffff8800a4622970 RCX: 0000000000000020
[ 445.069128] RDX: 00000000decd3590 RSI: ffff8800a46229d8 RDI: ffff880036964680
[ 445.069188] RBP: ffff8800b3eff8a8 R08: 00000000dab6bc82 R09: 00000000d496d560
[ 445.069250] R10: 00000000ed72a76e R11: 00000000410aff32 R12: 0000000000000000
[ 445.069310] R13: ffffffff81075c00 R14: ffff8800a4622960 R15: 0000000000000000
[ 445.069370] FS: 0000000000000000(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
[ 445.069436] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 445.069491] CR2: 0000000000000000 CR3: 000000013057f000 CR4: 00000000001406e0
[ 445.069553] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 445.069616] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 445.069677] Stack:
[ 445.069730] ffffffff81075c00 0000001400000000 ffffffff00000003 ffff8800a4622998
[ 445.069797] 0000000000000000 0000000000000000 ffff8800a4622950 0000000000000000
[ 445.069863] ffff8800b196e800 ffff8800b3eff8b8 ffffffff810770a5 ffff8800b3eff8f0
[ 445.069944] Call Trace:
[ 445.070019] [<ffffffff81075c00>] ? done_hash+0x1c/0x1c
[ 445.070087] [<ffffffff810770a5>] sha256_ssse3_final+0x15/0x20
[ 445.070153] [<ffffffff812d4dec>] crypto_shash_final+0x2c/0xb0
[ 445.070211] [<ffffffff812daa4a>] hmac_final+0x4a/0xb0
[ 445.070270] [<ffffffff812d4f70>] ? crypto_shash_digest+0x40/0x40
[ 445.070331] [<ffffffff812d4dec>] crypto_shash_final+0x2c/0xb0
[ 445.070395] [<ffffffff812d4f70>] ? crypto_shash_digest+0x40/0x40
[ 445.070455] [<ffffffff812d4f86>] shash_async_final+0x16/0x20
[ 445.070513] [<ffffffff812d485a>] crypto_ahash_op+0x2a/0xc0
[ 445.070571] [<ffffffff812d4906>] crypto_ahash_final+0x16/0x20
[ 445.070640] [<ffffffffa0bcd9e6>] gss_digest_hmac+0x226/0x250 [ptlrpc_gss]
[ 445.071762] [<ffffffffa0bce456>] sk_make_hmac+0x76/0xe0 [ptlrpc_gss]
[ 445.072720] [<ffffffffa0bce51b>] gss_get_mic_sk+0x5b/0x60 [ptlrpc_gss]
[ 445.073663] [<ffffffffa0bc6c0e>] lgss_get_mic+0x2e/0x100 [ptlrpc_gss]
[ 445.074594] [<ffffffffa0bb2696>] gss_sign_msg+0x166/0x300 [ptlrpc_gss]
[ 445.075510] [<ffffffffa0bb787a>] gss_cli_ctx_sign+0x1ba/0x330 [ptlrpc_gss]
[ 445.076445] [<ffffffffa09a597c>] sptlrpc_cli_wrap_request+0xcc/0x360 [ptlrpc]
[ 445.077428] [<ffffffffa0976202>] ptl_send_rpc+0x1f2/0xe40 [ptlrpc]
[ 445.078364] [<ffffffffa09a9dad>] ? sptlrpc_req_refresh_ctx+0x4dd/0x900 [ptlrpc]
[ 445.079284] [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
[ 445.080174] [<ffffffffa096f589>] ptlrpc_check_set.part.23+0x1899/0x1dd0 [ptlrpc]
[ 445.081076] [<ffffffffa096fb1b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 445.081918] [<ffffffffa099bbeb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 445.082781] [<ffffffffa099bf9b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[ 445.083587] [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
[ 445.084373] [<ffffffffa099bce0>] ? ptlrpcd_check+0x5d0/0x5d0 [ptlrpc]
[ 445.085129] [<ffffffff810b064f>] kthread+0xcf/0xe0
[ 445.085846] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[ 445.086548] [<ffffffff81696958>] ret_from_fork+0x58/0x90
[ 445.087221] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[ 445.087934] Code: d0 49 89 46 70 48 89 df 41 ff d5 e8 ee e5 fb ff 49 8b 06 48 8b 40 58 8b 48 f8 31 c0 85 c9 74 17 66 0f 1f 44 00 00 8b 14 03 0f ca <41> 89 14 04 48 83 c0 04 39 c1 75 ef 31 c0 48 89 df b9 0d 00 00
[ 445.089398] RIP [<ffffffff81076e6d>] sha256_ssse3_finup+0xcd/0x2f0
[ 445.090089] RSP <ffff8800b3eff860>
[ 445.090772] CR2: 0000000000000000}}

Comment by James A Simmons [ 27/Feb/17 ]

Have you tested Kerboses as well? Does that work or do I need to test that as well.

Comment by Nathan Lavender (Inactive) [ 27/Feb/17 ]

Kerberos(krb5p) crashes with the following on my multiple node Kerberos test environment:

 

[ 543.160207] Lustre: DEBUG MARKER: Setting sptlrpc rule: lustre.srpc.flavor.default=krb5p
[ 557.399008] Lustre: 5147:0:(gss_mech_switch.c:72:lgss_mech_register()) Register gssnull mechanism
[ 557.399019] Key type lgssc registered
[ 557.402774] Lustre: 4241:0:(sec_gss.c:2063:gss_svc_handle_init()) create svc ctx ffff880208518040: user from 172.18.0.21@tcp authenticated as oss
[ 557.403797] Lustre: 4242:0:(sec_gss.c:371:gss_cli_ctx_uptodate()) server installed reverse ctx ffff880216a5dd80 idx 0x4b5feef5ea638827, expiry 1488313593(+86400s)
[ 559.347684] Lustre: 5163:0:(sec_gss.c:378:gss_cli_ctx_uptodate()) client refreshed ctx ffff8802091c6300 idx 0x905e6da7e2efc596 (0->lustre-OST0002_UUID), expiry 1488313585(+86390s)
[ 559.347696] Lustre: 5163:0:(gss_svc_upcall.c:863:gss_svc_upcall_install_rvs_ctx()) create reverse svc ctx ffff8800d285c840 to lustre-OST0002_UUID: idx 0x430434e1928b1936
[ 559.347756] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 559.347812] IP: [<ffffffff81071ebd>] sha1_ssse3_finup+0xbd/0x2e0
[ 559.347853] PGD 0
[ 559.347877] Oops: 0002 1 SMP
[ 559.347902] Modules linked in: ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) mbcache jbd2 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache iosf_mbi snd_hda_codec_generic crc32_pclmul snd_hda_intel snd_hda_codec ghash_clmulni_intel snd_hda_core aesni_intel snd_hwdep snd_seq snd_seq_device snd_pcm lrw gf128mul glue_helper ablk_helper cryptd snd_timer snd ppdev sg soundcore pcspkr parport_pc parport i2c_piix4 virtio_balloon nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi virtio_blk virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul
[ 559.348447] crct10dif_common ata_piix ttm crc32c_intel drm serio_raw libata i2c_core virtio_pci virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod
[ 559.348555] CPU: 0 PID: 2651 Comm: ptlrpcd_00_03 Tainted: G OE ------------ 3.10.0-514.6.1.el7_lustre.x86_64 #1
[ 559.348615] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 559.348648] task: ffff88020e6d4e70 ti: ffff880215da0000 task.ti: ffff880215da0000
[ 559.348688] RIP: 0010:[<ffffffff81071ebd>] [<ffffffff81071ebd>] sha1_ssse3_finup+0xbd/0x2e0
[ 559.348738] RSP: 0018:ffff880215da37f0 EFLAGS: 00010246
[ 559.348769] RAX: 0000000000000000 RBX: ffff8800d54c0b70 RCX: 0000000000000000
[ 559.348808] RDX: 00000000e1020503 RSI: 0000000020b16494 RDI: ffff88020e6dd040
[ 559.348847] RBP: ffff880215da3838 R08: ffffffff816a7280 R09: ffff8800d54c0b70
[ 559.348891] R10: ffffffff816a7280 R11: ffff8800d54c0bd0 R12: 0000000000000000
[ 559.348930] R13: ffffffff81071dd0 R14: ffff8800d54c0b60 R15: 0000000000000000
[ 559.348970] FS: 0000000000000000(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
[ 559.349014] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 559.349046] CR2: 0000000000000000 CR3: 00000000019ba000 CR4: 00000000001406f0
[ 559.349087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 559.349127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 559.349166] Stack:
[ 559.349180] ffffffff81071dd0 00000004085b0e40 ffffffff00000004 ffff8800d54c0b90
[ 559.349227] 0000000000000000 0000000000000000 ffff8800d54c0b50 0000000000000000
[ 559.349275] ffff880208519000 ffff880215da3848 ffffffff810720f5 ffff880215da3880
[ 559.349323] Call Trace:
[ 559.349341] [<ffffffff81071dd0>] ? sha1_base_init+0x40/0x40
[ 559.349375] [<ffffffff810720f5>] sha1_ssse3_final+0x15/0x20
[ 559.349409] [<ffffffff812d4dec>] crypto_shash_final+0x2c/0xb0
[ 559.349443] [<ffffffff812daa4a>] hmac_final+0x4a/0xb0
[ 559.351126] [<ffffffff812d4f70>] ? crypto_shash_digest+0x40/0x40
[ 559.352781] [<ffffffff812d4dec>] crypto_shash_final+0x2c/0xb0
[ 559.354435] [<ffffffff812d4f70>] ? crypto_shash_digest+0x40/0x40
[ 559.356084] [<ffffffff812d4f86>] shash_async_final+0x16/0x20
[ 559.357693] [<ffffffff812d485a>] crypto_ahash_op+0x2a/0xc0
[ 559.359265] [<ffffffff812d4906>] crypto_ahash_final+0x16/0x20
[ 559.360804] [<ffffffffa0fc59e6>] gss_digest_hmac+0x226/0x250 [ptlrpc_gss]
[ 559.362532] [<ffffffffa0fc10e7>] krb5_make_checksum+0x107/0x3d0 [ptlrpc_gss]
[ 559.364014] [<ffffffffa0fc2fe9>] gss_wrap_kerberos+0x1c9/0x780 [ptlrpc_gss]
[ 559.365458] [<ffffffffa0fbee0e>] lgss_wrap+0x2e/0x100 [ptlrpc_gss]
[ 559.366864] [<ffffffffa0fb0652>] gss_cli_ctx_seal+0x1f2/0xa70 [ptlrpc_gss]
[ 559.368248] [<ffffffff810d23c2>] ? load_balance+0x1e2/0x990
[ 559.369644] [<ffffffffa0a4597c>] sptlrpc_cli_wrap_request+0xcc/0x360 [ptlrpc]
[ 559.371027] [<ffffffffa0a16202>] ptl_send_rpc+0x1f2/0xe40 [ptlrpc]
[ 559.372405] [<ffffffffa0a49dad>] ? sptlrpc_req_refresh_ctx+0x4dd/0x900 [ptlrpc]
[ 559.373777] [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
[ 559.375158] [<ffffffffa0a0f589>] ptlrpc_check_set.part.23+0x1899/0x1dd0 [ptlrpc]
[ 559.376500] [<ffffffffa0a0fb1b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 559.377802] [<ffffffffa0a3bbeb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 559.379070] [<ffffffffa0a3bf9b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[ 559.380271] [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
[ 559.381460] [<ffffffffa0a3bce0>] ? ptlrpcd_check+0x5d0/0x5d0 [ptlrpc]
[ 559.382597] [<ffffffff810b064f>] kthread+0xcf/0xe0
[ 559.383691] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[ 559.384762] [<ffffffff81696958>] ret_from_fork+0x58/0x90
[ 559.385803] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[ 559.386849] Code: 28 48 c1 e0 03 48 0f c8 ba 01 00 00 00 48 8b 75 d0 49 89 46 68 48 89 df 41 ff d5 e8 8e 35 fc ff 31 c0 0f 1f 40 00 8b 14 03 0f ca <41> 89 14 04 48 83 c0 04 48 83 f8 14 75 ed 30 c0 48 89 df b9 0c
[ 559.389118] RIP [<ffffffff81071ebd>] sha1_ssse3_finup+0xbd/0x2e0
[ 559.390185] RSP <ffff880215da37f0>
[ 559.391239] CR2: 0000000000000000

Comment by James A Simmons [ 27/Feb/17 ]

Let me guess. struct sk_ctx is used by both kerberos and HMAC. I have a test system working so I'm going to track this down.

Comment by Jeremy Filizetti [ 30/Mar/17 ]

I believe the panic is related to the changes from LU-8602. I've posted comments in the original patch regarding what I think is wrong.

Comment by Peter Jones [ 12/May/17 ]

Should we revert the culprit while we are working out the proper solution?

Comment by James A Simmons [ 13/May/17 ]

Well the reason I haven't fixed this yet is due to the breakage from LU-9034 which I have been debugging to figure out why you can't see flavors on the client. I was thinking of doing this right after. Now I could finish this first and submit it but GSS still will not work from the LU-9034 merger. What would people like seen done first?

Comment by Chris Hanna [ 15/May/17 ]

James, I think your ordering is fine. Right now, the SSK testing patch is blocking on both of these getting fixed.

Comment by Sebastien Buisson (Inactive) [ 14/Jun/17 ]

Hi,

With set 13 of the patch at https://review.whamcloud.com/25199, I experience a NULL pointer dereference in cfs_crypto_hash_final and a crash of the client node when I unmount and remove kernel modules:

[ 2164.649264] Lustre: 8077:0:(linux-crypto.c:83:cfs_crypto_hash_alloc()) Unsupported hash algorithm id = 255, max id is 10
[ 2164.649858] LustreError: 8077:0:(gss_krb5_mech.c:455:krb5_make_checksum()) failed to alloc hash hmac(sha1) : rc = -22
[ 2164.650417] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
[ 2164.650865] IP: [<ffffffffa021c253>] cfs_crypto_hash_final+0x13/0x90 [libcfs]                                                       [ 2164.651295] PGD 0
[ 2164.651403] Oops: 0000 [#1] SMP
[ 2164.651616] LustreError: 8144:0:(obd_config.c:1257:class_process_config()) no device for: seb-OST0000-osc-ffff880406457000
[ 2164.651617] LustreError: 8144:0:(obd_config.c:1985:class_manual_cleanup()) cleanup failed -22: seb-OST0000-osc-ffff880406457000
[ 2164.652617] Lustre: Unmounted seb-client
[ 2164.651403] Modules linked in: osc(OF) mgc(OF) lustre(OF) lmv(OF) mdc(OF) lov(OF) fid(OF) fld(OF) ptlrpc_gss(OF) sunrpc ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) libcfs(OF) i2c_piix4 i2c_core sg ppdev virtio_balloon parport_pc parport pcspkr ip_tables xfs libcrc32c sd_mod crct10dif_generic crc_t10dif crct10dif_common ata_generic pata_acpi virtio_scsi ata_piix 8139too virtio_pci libata virtio_ring 8139cp serio_raw virtio mii floppy
[ 2164.653282] CPU: 2 PID: 8077 Comm: sptlrpc_gc Tainted: GF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
[ 2164.653282] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 2164.653282] task: ffff880404eb4fa0 ti: ffff880406e40000 task.ti: ffff880406e40000
[ 2164.653282] RIP: 0010:[<ffffffffa021c253>]  [<ffffffffa021c253>] cfs_crypto_hash_final+0x13/0x90 [libcfs]
[ 2164.653282] RSP: 0018:ffff880406e43a58  EFLAGS: 00010246
[ 2164.653282] RAX: 0000000000000000 RBX: ffff880406e43b28 RCX: 0000000000000006
[ 2164.653282] RDX: ffff880406e43b28 RSI: 0000000000000000 RDI: ffffffffffffffea
[ 2164.653282] RBP: ffff880406e43a78 R08: 0000000000000096 R09: 000000000000028c
[ 2164.653282] R10: 0720072007200720 R11: 0720072007200720 R12: ffffffffffffffea
[ 2164.653282] R13: 00000000ffffffea R14: ffff880408dd3a20 R15: 000000000000000a
[ 2164.653282] FS:  0000000000000000(0000) GS:ffff88041fc80000(0000) knlGS:0000000000000000
[ 2164.653282] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2164.653282] CR2: 000000000000000a CR3: 0000000405908000 CR4: 00000000000006e0
[ 2164.653282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2164.653282] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2164.653282] Stack:
[ 2164.653282]  ffff880406e43b28 ffffffffffffffea 00000000ffffffea ffff880408dd3a20
[ 2164.653282]  ffff880406e43af8 ffffffffa0765332 ffff88041fc8f0a0 00000000000002d0
[ 2164.653282]  0000000000000000 ffff880406e43bb0 0000000200000000 ffff8800368fef40
[ 2164.653282] Call Trace:
[ 2164.653282]  [<ffffffffa0765332>] krb5_make_checksum+0x352/0x630 [ptlrpc_gss]
[ 2164.653282]  [<ffffffffa07663ea>] gss_get_mic_kerberos+0x9a/0x1c0 [ptlrpc_gss]
[ 2164.653282]  [<ffffffffa0762c0e>] lgss_get_mic+0x2e/0x100 [ptlrpc_gss]
[ 2164.653282]  [<ffffffffa074e696>] gss_sign_msg+0x166/0x300 [ptlrpc_gss]
[ 2164.653282]  [<ffffffffa075387a>] gss_cli_ctx_sign+0x1ba/0x330 [ptlrpc_gss]
[ 2164.653282]  [<ffffffffa05b3f3c>] sptlrpc_cli_wrap_request+0xcc/0x360 [ptlrpc]
[ 2164.653282]  [<ffffffffa05841d2>] ptl_send_rpc+0x1f2/0xe40 [ptlrpc]
[ 2164.653282]  [<ffffffffa0578a77>] ? ptlrpc_add_unreplied+0xc7/0x100 [ptlrpc]
[ 2164.653282]  [<ffffffffa057efbc>] ? ptlrpc_request_bufs_pack+0x1ec/0x4d0 [ptlrpc]
[ 2164.653282]  [<ffffffffa075cf60>] gss_do_ctx_fini_rpc+0x320/0x500 [ptlrpc_gss]
[ 2164.653282]  [<ffffffffa0755cec>] gss_cli_ctx_fini_common+0x5c/0x2d0 [ptlrpc_gss]
[ 2164.653282]  [<ffffffffa076d878>] ctx_destroy_kr+0x88/0x3d0 [ptlrpc_gss]
[ 2164.653282]  [<ffffffffa076e01d>] gss_sec_release_ctx_kr+0x2d/0xa0 [ptlrpc_gss]
[ 2164.653282]  [<ffffffffa05b2062>] sptlrpc_cli_ctx_put+0x42/0xb0 [ptlrpc]
[ 2164.653282]  [<ffffffffa05bd34d>] sec_process_ctx_list+0xdd/0x180 [ptlrpc]
[ 2164.653282]  [<ffffffffa05bd59c>] sec_gc_main+0x7c/0x420 [ptlrpc]
[ 2164.653282]  [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
[ 2164.653282]  [<ffffffffa05bd520>] ? sptlrpc_gc_add_ctx+0x130/0x130 [ptlrpc]
[ 2164.653282]  [<ffffffff8109727f>] kthread+0xcf/0xe0
[ 2164.653282]  [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[ 2164.653282]  [<ffffffff81614358>] ret_from_fork+0x58/0x90
[ 2164.653282]  [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[ 2164.653282] Code: 09 48 83 c4 28 5b 41 5c 5d c3 e8 59 1e e5 e0 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 85 f6 48 89 e5 41 56 41 55 41 54 53 <48> 8b 4f 20 48 89 fb 48 8b 41 50 44 8b 70 f8 74 5c 48 85 d2 49
[ 2164.653282] RIP  [<ffffffffa021c253>] cfs_crypto_hash_final+0x13/0x90 [libcfs]
[ 2164.653282]  RSP <ffff880406e43a58>
[ 2164.653282] CR2: 000000000000000a

This is on a CentOS 7 node.

Comment by James A Simmons [ 19/Jun/17 ]

Try it now Sebastien

Comment by Sebastien Buisson (Inactive) [ 20/Jun/17 ]

To document completely the case, with patchset 14 I was not able to reproduce the crash anymore, but I was seeing the following traces when unmounting the client (on CentOS 7):

[ 4915.909072] Lustre: 1386:0:(linux-crypto.c:83:cfs_crypto_hash_alloc()) Unsupported hash algorithm id = 255, max id is 10
[ 4915.909701] LustreError: 1386:0:(gss_krb5_mech.c:455:krb5_make_checksum()) failed to alloc hash hmac(sha1) : rc = -22
[ 4915.910254] LustreError: 1386:0:(sec_gss.c:204:gss_sign_msg()) fail to generate MIC: 000d0000

Now with patchset 15 on a CentOS 7 client, I have the following crash when unmounting the client:

[438940.599108] BUG: unable to handle kernel paging request at 00000000deadbeef
[438940.599483] IP: [<ffffffff8129b01b>] sha1_final+0x7b/0x170
[438940.599783] PGD 0
[438940.599903] Oops: 0002 [#1] SMP
[438940.600075] Modules linked in: osc(OF) mgc(OF) lustre(OF) lmv(OF) mdc(OF) lov(OF) fid(OF) fld(OF) ptlrpc_gss(OF) ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) libcfs(OF) sha512_generic sunrpc sg virtio_balloon ppdev parport_pc parport i2c_piix4 i2c_core pcspkr ip_tables xfs libcrc32c sd_mod crct10dif_generic crc_t10dif crct10dif_common ata_generic pata_acpi virtio_scsi ata_piix 8139too libata virtio_pci serio_raw virtio_ring 8139cp virtio mii floppy [last unloaded: libcfs]
[438940.600075] CPU: 6 PID: 21534 Comm: sptlrpc_gc Tainted: GF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
[438940.603186] LustreError: 21611:0:(obd_config.c:1294:class_process_config()) no device for: seb-OST0000-osc-ffff880405907800
[438940.603187] LustreError: 21611:0:(obd_config.c:2024:class_manual_cleanup()) cleanup failed -22: seb-OST0000-osc-ffff880405907800   [438940.600075] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[438940.600075] task: ffff880405c3e660 ti: ffff880406410000 task.ti: ffff880406410000
[438940.600075] RIP: 0010:[<ffffffff8129b01b>]  [<ffffffff8129b01b>] sha1_final+0x7b/0x170
[438940.605512] Lustre: Unmounted seb-client
[438940.605562] RSP: 0018:ffff880406413928  EFLAGS: 00010246
[438940.605562] RAX: 0000000000000000 RBX: ffff8803d1c2dc70 RCX: 0000000000000000
[438940.605562] RDX: 000000002d49b7f0 RSI: ffff880406413930 RDI: ffff8803d1c2dc8c
[438940.605562] RBP: ffff880406413950 R08: 00000000b7bd30b2 R09: 00000000f128d062
[438940.605562] R10: 00000000ec79780e R11: 000000001dd8f2f0 R12: 00000000deadbeef
[438940.605562] R13: ffff8803d1c2dc60 R14: 0000000000000000 R15: ffff880408e1be00
[438940.605562] FS:  0000000000000000(0000) GS:ffff88041fd80000(0000) knlGS:0000000000000000
[438940.605562] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[438940.605562] CR2: 00000000deadbeef CR3: 00000004072f1000 CR4: 00000000000006e0
[438940.605562] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[438940.605562] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[438940.605562] Stack:
[438940.605562]  e00a000000000000 000000001a7a96b2 00000000deadbeef 00000000deadbeef
[438940.605562]  ffff8803d1c2dc50 ffff880406413988 ffffffff8129522c ffff8803d1c2dc60
[438940.605562]  00000000deadbeef ffff8803d1c2dc50 0000000000000000 ffff880408e1be00
[438940.605562] Call Trace:
[438940.605562]  [<ffffffff8129522c>] crypto_shash_final+0x2c/0xb0
[438940.605562]  [<ffffffff8129a70a>] hmac_final+0x4a/0xb0
[438940.605562]  [<ffffffff812953b0>] ? crypto_shash_digest+0x40/0x40
[438940.605562]  [<ffffffff8129522c>] crypto_shash_final+0x2c/0xb0
[438940.605562]  [<ffffffff812953b0>] ? crypto_shash_digest+0x40/0x40
[438940.605562]  [<ffffffff812953c6>] shash_async_final+0x16/0x20
[438940.605562]  [<ffffffff81294c9a>] crypto_ahash_op+0x2a/0xc0
[438940.605562]  [<ffffffff81294d46>] crypto_ahash_final+0x16/0x20
[438940.605562]  [<ffffffffa026a09f>] cfs_crypto_hash_final+0x4f/0x90 [libcfs]
[438940.605562]  [<ffffffffa0797337>] krb5_make_checksum+0x347/0x630 [ptlrpc_gss]
[438940.605562]  [<ffffffffa07983fa>] gss_get_mic_kerberos+0x9a/0x1c0 [ptlrpc_gss]
[438940.605562]  [<ffffffffa0794c1e>] lgss_get_mic+0x2e/0x100 [ptlrpc_gss]
[438940.605562]  [<ffffffffa0780696>] gss_sign_msg+0x166/0x300 [ptlrpc_gss]
[438940.605562]  [<ffffffffa078587a>] gss_cli_ctx_sign+0x1ba/0x330 [ptlrpc_gss]
[438940.605562]  [<ffffffffa05e112c>] sptlrpc_cli_wrap_request+0xcc/0x360 [ptlrpc]
[438940.605562]  [<ffffffffa05b0ec2>] ptl_send_rpc+0x1f2/0xe60 [ptlrpc]
[438940.605562]  [<ffffffffa05a57e7>] ? ptlrpc_add_unreplied+0xc7/0x100 [ptlrpc]
[438940.605562]  [<ffffffffa05abcac>] ? ptlrpc_request_bufs_pack+0x1ec/0x4d0 [ptlrpc]
[438940.605562]  [<ffffffffa078ef60>] gss_do_ctx_fini_rpc+0x320/0x500 [ptlrpc_gss]
[438940.605562]  [<ffffffffa0787cec>] gss_cli_ctx_fini_common+0x5c/0x2d0 [ptlrpc_gss]
[438940.605562]  [<ffffffffa079f7d8>] ctx_destroy_kr+0x88/0x3d0 [ptlrpc_gss]
[438940.605562]  [<ffffffffa079ff7d>] gss_sec_release_ctx_kr+0x2d/0xa0 [ptlrpc_gss]
[438940.605562]  [<ffffffffa05df252>] sptlrpc_cli_ctx_put+0x42/0xb0 [ptlrpc]
[438940.605562]  [<ffffffffa05ea52d>] sec_process_ctx_list+0xdd/0x180 [ptlrpc]
[438940.605562]  [<ffffffffa05ea77c>] sec_gc_main+0x7c/0x420 [ptlrpc]
[438940.605562]  [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
[438940.605562]  [<ffffffffa05ea700>] ? sptlrpc_gc_add_ctx+0x130/0x130 [ptlrpc]
[438940.605562]  [<ffffffff8109727f>] kthread+0xcf/0xe0
[438940.605562]  [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[438940.605562]  [<ffffffff81614358>] ret_from_fork+0x58/0x90
[438940.605562]  [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[438940.605562] Code: 00 29 c2 48 c7 c6 80 fc 66 81 4c 89 ef e8 8e fe ff ff 48 8d 75 d8 ba 08 00 00 00 4c 89 ef e8 7d fe ff ff 31 c0 8b 54 03 08 0f ca <41> 89 14 04 48 83 c0 04 48 83 f8 14 75 ec f6 c3 01 48 89 df be
[438940.605562] RIP  [<ffffffff8129b01b>] sha1_final+0x7b/0x170
[438940.605562]  RSP <ffff880406413928>
[438940.605562] CR2: 00000000deadbeef
Comment by Sebastien Buisson (Inactive) [ 20/Jun/17 ]

With patchset 16:

[20763.922197] BUG: unable to handle kernel paging request at 00000000deadbef7
[20763.922606] IP: [<ffffffff812e2a85>] memcpy+0xb5/0x110
[20763.922927] PGD 0
[20763.923044] Oops: 0000 [#1] SMP
[20763.925271] LustreError: 1226:0:(obd_config.c:1294:class_process_config()) no device for: seb-OST0000-osc-ffff8800d2c1c800
[20763.925273] LustreError: 1226:0:(obd_config.c:2024:class_manual_cleanup()) cleanup failed -22: seb-OST0000-osc-ffff8800d2c1c800
[20763.926581] Lustre: Unmounted seb-client
[20763.923161] Modules linked in: osc(OF) mgc(OF) lustre(OF) lmv(OF) mdc(OF) lov(OF) fid(OF) fld(OF) ptlrpc_gss(OF) ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) libcfs(OF) sunrpc i2c_piix4 i2c_core sg virtio_balloon ppdev parport_pc parport pcspkr ip_tables xfs libcrc32c sd_mod crct10dif_generic crc_t10dif crct10dif_common ata_generic pata_acpi virtio_scsi 8139too virtio_pci virtio_ring serio_raw 8139cp virtio ata_piix mii libata floppy [last unloaded: libcfs]                                                                           [20763.927207] CPU: 5 PID: 1172 Comm: sptlrpc_gc Tainted: GF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
[20763.927207] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[20763.927207] task: ffff8804088871c0 ti: ffff880406c1c000 task.ti: ffff880406c1c000
[20763.927207] RIP: 0010:[<ffffffff812e2a85>]  [<ffffffff812e2a85>] memcpy+0xb5/0x110
[20763.927207] RSP: 0018:ffff880406c1fb00  EFLAGS: 00010202
[20763.927207] RAX: ffff880404ff2b50 RBX: ffff880404ff2b40 RCX: 0000000000000028
[20763.927207] RDX: 000000000000000c RSI: 00000000deadbef7 RDI: ffff880404ff2b50
[20763.927207] RBP: ffff880406c1fb68 R08: 00000000000163c0 R09: ffff88041fd563c0
[20763.927207] R10: ffffea000f427640 R11: ffffffffa04583a4 R12: ffff880406c1fb28
[20763.927207] R13: 00000000000002d0 R14: ffff880406c1fba0 R15: 0000000000000012
[20763.927207] FS:  0000000000000000(0000) GS:ffff88041fd40000(0000) knlGS:0000000000000000
[20763.927207] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[20763.927207] CR2: 00000000deadbef7 CR3: 000000000190a000 CR4: 00000000000006e0
[20763.927207] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[20763.927207] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[20763.927207] Stack:
[20763.927207]  ffffffffa0459452 0000000000000000 ffff880406c1fb28 0000000000000000
[20763.927207]  0000000000000012 ffff880300000014 00000000deadbeef 000000004c1a9725
[20763.927207]  0000000000000002 ffff880404ff2a08 0000000000000002 ffff880406c1fbd0
[20763.927207] Call Trace:
[20763.927207]  [<ffffffffa0459452>] ? gss_get_mic_kerberos+0x102/0x1c0 [ptlrpc_gss]
[20763.927207]  [<ffffffffa0455c1e>] lgss_get_mic+0x2e/0x100 [ptlrpc_gss]
[20763.927207]  [<ffffffffa0441696>] gss_sign_msg+0x166/0x300 [ptlrpc_gss]
[20763.927207]  [<ffffffffa044687a>] gss_cli_ctx_sign+0x1ba/0x330 [ptlrpc_gss]
[20763.927207]  [<ffffffffa07ba12c>] sptlrpc_cli_wrap_request+0xcc/0x360 [ptlrpc]
[20763.927207]  [<ffffffffa0789ec2>] ptl_send_rpc+0x1f2/0xe60 [ptlrpc]
[20763.927207]  [<ffffffffa077e7e7>] ? ptlrpc_add_unreplied+0xc7/0x100 [ptlrpc]
[20763.927207]  [<ffffffffa0784cac>] ? ptlrpc_request_bufs_pack+0x1ec/0x4d0 [ptlrpc]
[20763.927207]  [<ffffffffa044ff60>] gss_do_ctx_fini_rpc+0x320/0x500 [ptlrpc_gss]
[20763.927207]  [<ffffffffa0448cec>] gss_cli_ctx_fini_common+0x5c/0x2d0 [ptlrpc_gss]
[20763.927207]  [<ffffffffa04607c8>] ctx_destroy_kr+0x88/0x3d0 [ptlrpc_gss]
[20763.927207]  [<ffffffffa0460f6d>] gss_sec_release_ctx_kr+0x2d/0xa0 [ptlrpc_gss]
[20763.927207]  [<ffffffffa07b8252>] sptlrpc_cli_ctx_put+0x42/0xb0 [ptlrpc]
[20763.927207]  [<ffffffffa07c352d>] sec_process_ctx_list+0xdd/0x180 [ptlrpc]
[20763.927207]  [<ffffffffa07c377c>] sec_gc_main+0x7c/0x420 [ptlrpc]
[20763.927207]  [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
[20763.927207]  [<ffffffffa07c3700>] ? sptlrpc_gc_add_ctx+0x130/0x130 [ptlrpc]
[20763.927207]  [<ffffffff8109727f>] kthread+0xcf/0xe0
[20763.927207]  [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[20763.927207]  [<ffffffff81614358>] ret_from_fork+0x58/0x90
[20763.927207]  [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[20763.927207] Code: 72 24 4c 8b 06 4c 8b 4e 08 4c 8b 54 16 f0 4c 8b 5c 16 f8 4c 89 07 4c 89 4f 08 4c 89 54 17 f0 4c 89 5c 17 f8 c3 90 83 fa 08 72 1b <4c> 8b 06 4c 8b 4c 16 f8 4c 89 07 4c 89 4c 17 f8 c3 66 2e 0f 1f
[20763.927207] RIP  [<ffffffff812e2a85>] memcpy+0xb5/0x110
[20763.927207]  RSP <ffff880406c1fb00>
[20763.927207] CR2: 00000000deadbef7
Comment by James A Simmons [ 20/Jun/17 ]

So the original code if you look at it leaks rawobj_t like crazy so I plugged the leaks. I was a bit over zealous and in the krb5 case we still need the rawobj_t around after running krb5_make_checksum(). Thanks for testing this. I don't have a KRB5 setup.

Comment by Sebastien Buisson (Inactive) [ 21/Jun/17 ]

Hi James,

I understand.
With patchset 17, no more crash on client side when unmounting Lustre
But, on server side (file system is configured to use krb5n flavor between clients and osts), I have those messages when the client is unmounted:

juin 22 00:05:20 ltest-vm4 kernel: LustreError: 26884:0:(gss_krb5_mech.c:633:gss_verify_mic_kerberos()) checksum mismatch
juin 22 00:05:20 ltest-vm4 kernel: LustreError: 26884:0:(sec_gss.c:243:gss_verify_msg()) mic verify error: 00060000
juin 22 00:05:20 ltest-vm4 kernel: LustreError: 26884:0:(sec_gss.c:2122:gss_svc_verify_request()) failed to verify request: 60000
juin 22 00:05:26 ltest-vm4 kernel: Lustre: 26630:0:(sec_gss.c:1222:gss_cli_ctx_fini_common()) reverse sec ffff880409776900: destroy ctx ffff88040748b240
Comment by James A Simmons [ 21/Jun/17 ]

Any good at doing dynamic tracing with perf or do you want me to push a patch with CERROR added. The current info is not enough for me to figure out what is going on.

Comment by Peter Jones [ 21/Jun/17 ]

James

Given that Sebastien is in France and it is very late there already I would suggest pushing the patch with CERROR proactively would make sense - we already know that the existing version is not going to be sutiable for landing...

Peter

Comment by Sebastien Buisson (Inactive) [ 22/Jun/17 ]

Hi James,

Here the corresponding stack. Does it help, or do you need anything else?

ll_ost00_002 28897 [003] 686311.879332:  probe:gss_verify_mic_kerberos: (ffffffffa0d12614)
                   18685 gss_verify_mic_kerberos ([unknown])
                   14d8e lgss_verify_mic ([unknown])
                     97a gss_verify_msg ([unknown])
                    5014 gss_svc_verify_request ([unknown])
                    a172 gss_svc_accept ([unknown])
                   1efd8 gss_svc_accept_kr ([unknown])
                   84ebe sptlrpc_svc_unwrap_request ([unknown])
                   65f94 ptlrpc_main ([unknown])
            7fff8109927f kthread ([kernel.kallsyms])
            7fff81616358 ret_from_fork ([kernel.kallsyms])

Comment by James A Simmons [ 22/Jun/17 ]

While looking to add various debug message I really thought about what the code was doing and I think the reason for your error was found. The function cfs_crypto_hash_final() has to be called before gss_crypt_generic() was called. It was being called after so the checksum final value wasn't being cached which lead to your failure. I fixed it up in the latest patch. Give it a try.

Comment by Sebastien Buisson (Inactive) [ 22/Jun/17 ]

Same behavior with patchset 18

juin 23 01:01:20 ltest-vm4 kernel: LustreError: 16857:0:(gss_krb5_mech.c:636:gss_verify_mic_kerberos()) checksum mismatch
juin 23 01:01:20 ltest-vm4 kernel: LustreError: 16857:0:(sec_gss.c:243:gss_verify_msg()) mic verify error: 00060000
juin 23 01:01:20 ltest-vm4 kernel: LustreError: 16857:0:(sec_gss.c:2122:gss_svc_verify_request()) failed to verify request: 60000
juin 23 01:01:25 ltest-vm4 kernel: Lustre: 16810:0:(sec_gss.c:1222:gss_cli_ctx_fini_common()) reverse sec ffff8800366ea500: destroy ctx ffff8803f5b7f0c0
ll_ost00_002 16857 [006] 715524.594882:  probe:gss_verify_mic_kerberos: (ffffffffa0d12624)
                   18695 gss_verify_mic_kerberos ([unknown])
                   14d8e lgss_verify_mic ([unknown])
                     97a gss_verify_msg ([unknown])
                    5014 gss_svc_verify_request ([unknown])
                    a172 gss_svc_accept ([unknown])
                   1f098 gss_svc_accept_kr ([unknown])
                   84ebe sptlrpc_svc_unwrap_request ([unknown])
                   65f94 ptlrpc_main ([unknown])
            7fff8109927f kthread ([kernel.kallsyms])
            7fff81616358 ret_from_fork ([kernel.kallsyms])
Comment by James A Simmons [ 22/Jun/17 ]

I have two theories why this might be happening. I pushed a debug patch. Please post the results of the run.

Comment by Sebastien Buisson (Inactive) [ 22/Jun/17 ]

With patchset 20, on server side when client is unmounted:

juin 23 03:27:19 ltest-vm4 kernel: LustreError: 3437:0:(gss_krb5_mech.c:470:krb5_make_checksum()) cksum->len 20, ke_hash_size 12 for ke_hash_name sha1
juin 23 03:27:19 ltest-vm4 kernel: LustreError: 3437:0:(gss_krb5_mech.c:643:gss_verify_mic_kerberos()) checksum mismatch
juin 23 03:27:19 ltest-vm4 kernel: LustreError: 3437:0:(sec_gss.c:243:gss_verify_msg()) mic verify error: 00060000
juin 23 03:27:19 ltest-vm4 kernel: LustreError: 3437:0:(sec_gss.c:2122:gss_svc_verify_request()) failed to verify request: 60000
juin 23 03:27:25 ltest-vm4 kernel: Lustre: 3390:0:(sec_gss.c:1222:gss_cli_ctx_fini_common()) reverse sec ffff8804058cf300: destroy ctx ffff8804062433c0
Comment by James A Simmons [ 23/Jun/17 ]

Found the reason for your error. The bug was in the user land utility. The bug was located in sk_session_kdf() and sk_compute_keys(). Before we had an entry for "ctr(aes)" in the libcfs crypto abstraction but I removed it. That was causing the mentioned functions to generate the incorrect size keys for the keyring. Currently I just hard coded the key size since we only use "ctr(aes)". I just discovered /proc/crypto has all the info we need. Will create a latter patch to handle that. Please give it a try. Thanks for you patience. It was a crash course in krb5 the last few days.

Comment by Sebastien Buisson (Inactive) [ 26/Jun/17 ]

Still 'checksum mismatch' errors on server side at client unmount when testing patchset 21 with krb5n flavor.

juin 26 16:39:58 ltest-vm4 kernel: LustreError: 6544:0:(gss_krb5_mech.c:640:gss_verify_mic_kerberos()) checksum mismatch
juin 26 16:39:58 ltest-vm4 kernel: LustreError: 6544:0:(sec_gss.c:243:gss_verify_msg()) mic verify error: 00060000
juin 26 16:39:58 ltest-vm4 kernel: LustreError: 6544:0:(sec_gss.c:2122:gss_svc_verify_request()) failed to verify request: 60000
juin 26 16:40:04 ltest-vm4 kernel: Lustre: 6494:0:(sec_gss.c:1222:gss_cli_ctx_fini_common()) reverse sec ffff8803f9680d00: destroy ctx

It seems patchset 21 only fixes sk_utils.c, so not related to krb5.

Comment by James A Simmons [ 26/Jun/17 ]

I have to ask does it work in 2.9 for you? I found with hmac the 2.9 clients doesn't work for me so I'm wondering it the GSS code works at all for anyone?

Comment by Andreas Dilger [ 26/Jun/17 ]

What is the current state of affairs with and without James' patch for Kerberos and SSK?

Sebastien, you report problems at unmount, but does Kerberos work for normal usage before that time with James' patch applied? Is it "more" broken without the patch, or is it working without the patch and the patch breaks Kerberos?

Chris Hanna or Nathan Lavender, does James' patch fix SSK again?

I'm trying to figure out if we should land James' patch to fix the problem(s) found so far and then get a later patch to fix the remaining problem(s)? It seems like there are multiple issues here, and I'd like to move forward with getting fixes landed, since this is the last problem holding up the 2.10 release.

Comment by Gerrit Updater [ 26/Jun/17 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/27823
Subject: LU-9073 gss: remove newer kernel support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9629cf20033982f49ab327203e5efa8578616872

Comment by James A Simmons [ 26/Jun/17 ]

I just pushed a revert in the hopes we are back to the state of lustre 2.9. For the revert I made it so GSS is disabled with newer kernels instead so this will not hinder newer kernel support. This revert is just a very poor band aid. IMHO based on my testing I don't think GSS is ready for production systems. Even with the 2.9 client I found I couldn't get hmac support going with multiple nodes. Sebastien reported for 2.9 that gss null is unstable and it tends to kernel panic. Also I just could never get GSS working without nodemap. Is that suppose to be the case? I found issues like a missing lgssc.conf file will cause phantom keys in the kernel which caused problems.
So the revert is far from a solution. A lot more work needs to be done for proper GSS support so it can be used in production environments.

Comment by Chris Hanna [ 26/Jun/17 ]

James, Nodemap is needed for SSK (see Lustre manual 24.5) in a multinode setup. It does not necessarily need to be an activated feature, however. The lgssc.conf file should be created by the test-framework.sh script. If the nodemap is not set correctly set up, you definitely will see strange issues.

SSK did function in a client-to-server mode prior to the initial GSS patches. In the most recent version of James' patch it appeared to work in sanity-sec up to test_15, until it ran into some issues related to fileop in test_16 and did not recover.

I can't speak to the most recent reversion James just uploaded. I don't expect this feature needs to get in the way of releasing the many other improvements in 2.10.

Comment by Peter Jones [ 26/Jun/17 ]

hannac could you please need out the behaviour on RHEL7.x with James's latest reversion patch? We don't want to release 2.10.0 with a drop in functionality compared to 2.9.

Comment by Sebastien Buisson (Inactive) [ 26/Jun/17 ]

With the patch 'LU-9073 gss: remove newer kernel support' at https://review.whamcloud.com/27823, I do not get any error message on server side when a client running krb5n flavor is unmounted:

juin 27 05:30:51 ltest-vm4 kernel: Lustre: 11812:0:(sec_gss.c:2323:gss_svc_handle_destroy()) destroy svc ctx ffff8803fbd6bc40 idx 0x31cb7c0f1b298198 (0->10.128.11.159@tcp)
juin 27 05:30:56 ltest-vm4 kernel: Lustre: 11764:0:(sec_gss.c:1222:gss_cli_ctx_fini_common()) reverse sec ffff8800364a4500: destroy ctx ffff880036603780

That being said, the problem mentioned earlier is not blocking, and could be tackled in another Jira. I would support the idea of having patch at https://review.whamcloud.com/25199 landed, in order not to block support for GSS with newer kernels, knowing that it gives 'checksum mismatch' errors on server side when clients are unmounted.

What do you think?

Comment by Chris Hanna [ 26/Jun/17 ]

After spending some time with the latest reversion from James, it appears to have fixed the issue. We were able to run sanity-sec and sanity for some time. There were some errors, but SSK was engaged any many tests did pass. I would proceed with the reversion if it is holding you back, and the errors can be diagnosed when time permits.

Comment by Andreas Dilger [ 27/Jun/17 ]

Sebastien, Chris, could you please mark +1 on the patch if it is working for you, so that we can land it.

Comment by Sebastien Buisson (Inactive) [ 27/Jun/17 ]

Chris, did you +1 on the right patch?

Comment by Gerrit Updater [ 27/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27823/
Subject: LU-9073 gss: remove newer kernel support
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2c27b194121665061cc0527e8bef35886ec7fea8

Comment by Peter Jones [ 27/Jun/17 ]

After some discussion we elected to land the reversion option for 2.10.0 and then aim to sort out GSS support for new kernel versions post-2.10.0. The rationale is that this keeps consistent behaviour with 2.9 for the officially supported distros.

Generated at Sat Feb 10 08:11:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.