[LU-14045] Fix O_DIRECT and encrypted files Created: 19/Oct/20  Updated: 07/Jan/21  Resolved: 07/Nov/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Sebastien Buisson Assignee: Sebastien Buisson
Resolution: Fixed Votes: 0
Labels: patch, sec

Issue Links:
Related
is related to LU-13745 tasks hang with copy_file_range: ll_f... Open
is related to LU-12275 Client-side file data encryption Resolved
is related to LU-14306 sanity-sec test_52: BUG: Bad rss-coun... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Because of patch https://review.whamcloud.com/38967, we can end up in a situation where osc_release_bounce_pages() mistakenly considers pages to be fscrypt bounce pages and tries to free them, as shown in the stack below.

2020-10-18 15:26:49 [ 4462.081809][T14012] Lustre: DEBUG MARKER: == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 15:26:49 (1603049209)
2020-10-18 15:26:52 [ 4464.514691][T30281] BUG: kernel NULL pointer dereference, address: 0000000000000048
2020-10-18 15:26:52 [ 4464.524282][T30281] #PF: supervisor read access in kernel mode
2020-10-18 15:26:52 [ 4464.532011][T30281] #PF: error_code(0x0000) - not-present page
2020-10-18 15:26:52 [ 4464.539709][T30281] PGD 80000007edcce067 P4D 80000007edcce067 PUD 7f1306067 PMD 0
2020-10-18 15:26:52 [ 4464.549144][T30281] Oops: 0000 [#1] PREEMPT SMP PTI
2020-10-18 15:26:52 [ 4464.555851][T30281] CPU: 0 PID: 30281 Comm: ptlrpcd_00_04 Tainted: G        W         5.7.0-rc7+ #1
2020-10-18 15:26:52 [ 4464.566720][T30281] Hardware name: Supermicro Super Server/To be filled by O.E.M., BIOS 2.0b 08/12/2016
2020-10-18 15:26:52 [ 4464.577932][T30281] RIP: 0010:mempool_free+0x12/0x80
2020-10-18 15:26:52 [ 4464.584690][T30281] Code: 60 e8 ff cc cc cc cc cc 0f 1f 44 00 00 e9 86 a3 08 00 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 85 ff 48 89 fd 53 74 1a 48 89 f3 <8b> 46 48 39 46 4c 7c 12 48 8b 73 58 48 8b 43 68 48 89 ef 5b 5d ff
2020-10-18 15:26:52 [ 4464.607734][T30281] RSP: 0018:ffffc9002414fcc0 EFLAGS: 00010282
2020-10-18 15:26:52 [ 4464.615423][T30281] RAX: ffff8887d44fb5e0 RBX: 0000000000000000 RCX: 0000000000000000
2020-10-18 15:26:52 [ 4464.625013][T30281] RDX: ffff888845abb780 RSI: 0000000000000000 RDI: ffffea001f553340
2020-10-18 15:26:52 [ 4464.634577][T30281] RBP: ffffea001f553340 R08: 0000000000000000 R09: 0000000000000000
2020-10-18 15:26:52 [ 4464.644109][T30281] R10: 0000000000000000 R11: 000000000000000f R12: 0000000000000000
2020-10-18 15:26:52 [ 4464.653614][T30281] R13: ffff8887d736c9f0 R14: 0000000000000010 R15: ffff888845abb780
2020-10-18 15:26:52 [ 4464.663095][T30281] FS:  0000000000000000(0000) GS:ffff88885e600000(0000) knlGS:0000000000000000
2020-10-18 15:26:52 [ 4464.673521][T30281] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-10-18 15:26:52 [ 4464.681579][T30281] CR2: 0000000000000048 CR3: 00000007cf9fa004 CR4: 00000000001606f0
2020-10-18 15:26:52 [ 4464.691015][T30281] Call Trace:
2020-10-18 15:26:52 [ 4464.695751][T30281]  brw_interpret+0xac/0xa60 [osc]
2020-10-18 15:26:52 [ 4464.702190][T30281]  ? _raw_spin_unlock+0x29/0x50
2020-10-18 15:26:52 [ 4464.708490][T30281]  ptlrpc_check_set+0x329/0x1790 [ptlrpc]
2020-10-18 15:26:52 [ 4464.715599][T30281]  ptlrpcd_check+0x411/0x460 [ptlrpc]
2020-10-18 15:26:52 [ 4464.722318][T30281]  ptlrpcd+0x278/0x300 [ptlrpc]
2020-10-18 15:26:52 [ 4464.728463][T30281]  ? remove_wait_queue+0x60/0x60
2020-10-18 15:26:52 [ 4464.734667][T30281]  kthread+0x12a/0x170
2020-10-18 15:26:52 [ 4464.739993][T30281]  ? ptlrpcd_check+0x460/0x460 [ptlrpc]
2020-10-18 15:26:52 [ 4464.746745][T30281]  ? kthread_bind+0x10/0x10
2020-10-18 15:26:52 [ 4464.752431][T30281]  ret_from_fork+0x24/0x30


 Comments   
Comment by Gerrit Updater [ 19/Oct/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40295
Subject: LU-14045 sec: fix O_DIRECT and encrypted files
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 15384abdb2cb0236bcd60e7f754dbc59e2a16f57

Comment by James A Simmons [ 19/Oct/20 ]

Here is an occurrence of the crash, hit without the fix https://review.whamcloud.com/40295

Test:
https://testing.whamcloud.com/test_sets/8383343d-2440-49c5-992b-861551c6b407
Crash dump:
https://testing.whamcloud.com/test_logs/1bf73796-3256-48a3-bab2-75f6d96c1180/show_text

Comment by Yang Sheng [ 20/Oct/20 ]

https://testing.whamcloud.com/test_sessions/8cfbafe5-ac90-4a05-831f-e9f636a229a5

Comment by Bruno Faccini (Inactive) [ 20/Oct/20 ]

+2 with recent master at https://testing.whamcloud.com/test_sets/24317b7d-ea90-4b01-ae0a-e01b5284c227 and https://testing.whamcloud.com/test_sets/fb2d522c-391e-4979-a709-c6c4d8a967a0

Comment by Andreas Dilger [ 21/Oct/20 ]

I may be conflating two issues, but AFAICS, sanity test_56w has only crashed a couple of times in the past 4 weeks:
https://testing.whamcloud.com/test_sets/5850ea8a-7bc0-40a1-b88b-5aabd945fe10
https://testing.whamcloud.com/test_sets/6dadd2cd-29c9-4965-9ecd-433452337956

and those were both on 2020-10-10 when testing patch https://review.whamcloud.com/38883 "LU-11621 utils: optimize migrate_copy_data() with copy_file_range()".

The only other crash started on aarch64 kernels 4.18+ on 2020-10-19, but the patch https://review.whamcloud.com/38967 "LU-12275 sec: O_DIRECT for encrypted file" was landed on master 6 weeks ago. This has been failing 100% of the time in sanity test_426 since patch https://review.whamcloud.com/39695 "LU-13745 test: add splice test for lustre" landed. That patch was submitted with "Test-Parameters: trivial", which only tests x86_64 on ldiskfs, but the test is crashing continuously on aarch64 and el8.2, both of which use 4.18 kernels.

If this is related to crypto, it appears the source of the funky pages is the splice IO from "splice". The two failed sanity test_56w runs are testing copy_file_range(), which also uses in-kernel data copying, similar to splice. Since the pages are generated in a source filesystem and sent to the target, it isn't clear whether we can play games with the mapping or not, so it might be better to use a page flag (e.g. PageChecked, maybe with a better wrapper like PageCrypto for Lustre)?
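
To illustrate the page-flag idea above, here is a minimal userspace sketch in C. It is not Lustre or kernel code. The struct mock_page type, its crypto_flag field and all function names are hypothetical, and the "no mapping means bounce page" heuristic is an assumption about how a mapping-based check could misfire, not something stated in this ticket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel page; not the real struct page layout. */
struct mock_page {
	void *mapping;     /* NULL when the page belongs to no address space */
	bool crypto_flag;  /* hypothetical marker, set only when a bounce page is allocated */
};

typedef bool (*bounce_check_t)(const struct mock_page *);

/* Assumed heuristic: a page without a mapping is treated as a bounce page. */
static bool is_bounce_by_mapping(const struct mock_page *pg)
{
	return pg->mapping == NULL;
}

/* Alternative: trust only an explicit marker, in the spirit of a page flag. */
static bool is_bounce_by_flag(const struct mock_page *pg)
{
	return pg->crypto_flag;
}

/* Count the pages a release path would hand back to the bounce-page pool. */
static size_t count_treated_as_bounce(const struct mock_page *pages, size_t n,
				      bounce_check_t check)
{
	size_t i, count = 0;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (check(&pages[i]))
			count++;
	return count;
}
```

Given one page with a mapping, one genuine bounce page (no mapping, flag set) and one unmapped page coming from splice or O_DIRECT (no mapping, flag clear), the mapping-based check counts two pages as bounce pages while the flag-based check counts only the genuine one. The extra page is the kind that would be wrongly passed to the pool's free routine.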

Comment by Andreas Dilger [ 21/Oct/20 ]

Stack trace from sanity.sh test_426:

[15000.400779] Lustre: DEBUG MARKER: == sanity test 426: splice test on Lustre ==== 20:58:26 (1603227506)
[15001.080742] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004
[15001.102937] user pgtable: 64k pages, 48-bit VAs, pgdp = 000000009f14b2d0
[15001.111120] Internal error: Oops: 96000005 [#1] SMP
[15001.149680] CPU: 1 PID: 11273 Comm: ptlrpcd_01_01  4.18.0-147.8.1.el8_1.aarch64 #1
[15001.164523] pc : mempool_free+0x24/0xe0
[15001.167022] lr : llcrypt_free_bounce_page.part.1+0x38/0x48 [libcfs]
[15001.223444] Process ptlrpcd_01_01 (pid: 11273, stack limit = 0x00000000f9135a93)
[15001.228185] Call trace:
[15001.229806]  mempool_free+0x24/0xe0
[15001.232143]  llcrypt_free_bounce_page.part.1+0x38/0x48 [libcfs]
[15001.236007]  llcrypt_free_bounce_page+0x24/0x30 [libcfs]
[15001.239541]  brw_interpret+0x124/0x10c8 [osc]
[15001.242729]  ptlrpc_check_set+0x688/0x3318 [ptlrpc]
[15001.246031]  ptlrpcd_check+0x470/0x820 [ptlrpc]
[15001.249060]  ptlrpcd+0x3d4/0x5c8 [ptlrpc]
[15001.251673]  kthread+0x130/0x138

Comment by Andreas Dilger [ 21/Oct/20 ]

I've pushed patch https://review.whamcloud.com/40326 "LU-13745 tests: skip sanity test_426 for 4.18+" to skip this test until the issue is resolved.

Comment by Gerrit Updater [ 07/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40295/
Subject: LU-14045 sec: fix O_DIRECT and encrypted files
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e07d0516dcde4b23375881077875b4cf96c90cd5

Comment by Peter Jones [ 07/Nov/20 ]

Landed for 2.14
