Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0

    Description

      Because of patch https://review.whamcloud.com/38967, osc_release_bounce_pages() can mistakenly treat ordinary pages as fscrypt bounce pages and try to free them, as shown in the stack trace below.

      2020-10-18 15:26:49 [ 4462.081809][T14012] Lustre: DEBUG MARKER: == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 15:26:49 (1603049209)
      2020-10-18 15:26:52 [ 4464.514691][T30281] BUG: kernel NULL pointer dereference, address: 0000000000000048
      2020-10-18 15:26:52 [ 4464.524282][T30281] #PF: supervisor read access in kernel mode
      2020-10-18 15:26:52 [ 4464.532011][T30281] #PF: error_code(0x0000) - not-present page
      2020-10-18 15:26:52 [ 4464.539709][T30281] PGD 80000007edcce067 P4D 80000007edcce067 PUD 7f1306067 PMD 0
      2020-10-18 15:26:52 [ 4464.549144][T30281] Oops: 0000 [#1] PREEMPT SMP PTI
      2020-10-18 15:26:52 [ 4464.555851][T30281] CPU: 0 PID: 30281 Comm: ptlrpcd_00_04 Tainted: G        W         5.7.0-rc7+ #1
      2020-10-18 15:26:52 [ 4464.566720][T30281] Hardware name: Supermicro Super Server/To be filled by O.E.M., BIOS 2.0b 08/12/2016
      2020-10-18 15:26:52 [ 4464.577932][T30281] RIP: 0010:mempool_free+0x12/0x80
      2020-10-18 15:26:52 [ 4464.584690][T30281] Code: 60 e8 ff cc cc cc cc cc 0f 1f 44 00 00 e9 86 a3 08 00 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 85 ff 48 89 fd 53 74 1a 48 89 f3 <8b> 46 48 39 46 4c 7c 12 48 8b 73 58 48 8b 43 68 48 89 ef 5b 5d ff
      2020-10-18 15:26:52 [ 4464.607734][T30281] RSP: 0018:ffffc9002414fcc0 EFLAGS: 00010282
      2020-10-18 15:26:52 [ 4464.615423][T30281] RAX: ffff8887d44fb5e0 RBX: 0000000000000000 RCX: 0000000000000000
      2020-10-18 15:26:52 [ 4464.625013][T30281] RDX: ffff888845abb780 RSI: 0000000000000000 RDI: ffffea001f553340
      2020-10-18 15:26:52 [ 4464.634577][T30281] RBP: ffffea001f553340 R08: 0000000000000000 R09: 0000000000000000
      2020-10-18 15:26:52 [ 4464.644109][T30281] R10: 0000000000000000 R11: 000000000000000f R12: 0000000000000000
      2020-10-18 15:26:52 [ 4464.653614][T30281] R13: ffff8887d736c9f0 R14: 0000000000000010 R15: ffff888845abb780
      2020-10-18 15:26:52 [ 4464.663095][T30281] FS:  0000000000000000(0000) GS:ffff88885e600000(0000) knlGS:0000000000000000
      2020-10-18 15:26:52 [ 4464.673521][T30281] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      2020-10-18 15:26:52 [ 4464.681579][T30281] CR2: 0000000000000048 CR3: 00000007cf9fa004 CR4: 00000000001606f0
      2020-10-18 15:26:52 [ 4464.691015][T30281] Call Trace:
      2020-10-18 15:26:52 [ 4464.695751][T30281]  brw_interpret+0xac/0xa60 [osc]
      2020-10-18 15:26:52 [ 4464.702190][T30281]  ? _raw_spin_unlock+0x29/0x50
      2020-10-18 15:26:52 [ 4464.708490][T30281]  ptlrpc_check_set+0x329/0x1790 [ptlrpc]
      2020-10-18 15:26:52 [ 4464.715599][T30281]  ptlrpcd_check+0x411/0x460 [ptlrpc]
      2020-10-18 15:26:52 [ 4464.722318][T30281]  ptlrpcd+0x278/0x300 [ptlrpc]
      2020-10-18 15:26:52 [ 4464.728463][T30281]  ? remove_wait_queue+0x60/0x60
      2020-10-18 15:26:52 [ 4464.734667][T30281]  kthread+0x12a/0x170
      2020-10-18 15:26:52 [ 4464.739993][T30281]  ? ptlrpcd_check+0x460/0x460 [ptlrpc]
      2020-10-18 15:26:52 [ 4464.746745][T30281]  ? kthread_bind+0x10/0x10
      2020-10-18 15:26:52 [ 4464.752431][T30281]  ret_from_fork+0x24/0x30
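
A note on reading the oops above: the marked opcode bytes `<8b> 46 48` decode to a 32-bit load from `0x48(%rsi)`, RSI is 0 in the register dump, and CR2 is 0x48. On x86_64, RSI carries the second argument, which for mempool_free() is the pool pointer, so the trace is consistent with a NULL pool being dereferenced at a member offset of 0x48. The sketch below reproduces only that arithmetic. `struct toy_pool` and `member_addr()` are hypothetical names, and the padding is an assumption chosen to match the address in this trace, not the kernel's actual mempool_t layout.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a pool descriptor; NOT the kernel's mempool_t.
 * The padding is an assumption, sized only so that the two counters land
 * at the offsets the faulting code reads (0x48 and 0x4c).
 */
struct toy_pool {
	unsigned char pad[0x48];	/* stands in for whatever precedes the counters */
	int min_nr;			/* offset 0x48 */
	int curr_nr;			/* offset 0x4c */
};

/* Address the CPU touches when reading a member through `base`. */
static uintptr_t member_addr(uintptr_t base, size_t member_offset)
{
	return base + member_offset;
}
```

With a NULL base, `member_addr(0, offsetof(struct toy_pool, min_nr))` evaluates to 0x48, the address reported in CR2.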
      

          Activity

            [LU-14045] Fix O_DIRECT and encrypted files
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-14306 [ LU-14306 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.14.0 [ 14490 ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40295/
            Subject: LU-14045 sec: fix O_DIRECT and encrypted files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e07d0516dcde4b23375881077875b4cf96c90cd5

            adilger Andreas Dilger added a comment -

            I've pushed patch https://review.whamcloud.com/40326 "LU-13745 tests: skip sanity test_426 for 4.18+" to skip this test until the issue is resolved.

            adilger Andreas Dilger added a comment -

            Stack trace from sanity.sh test_426:

            [15000.400779] Lustre: DEBUG MARKER: == sanity test 426: splice test on Lustre ==== 20:58:26 (1603227506)
            [15001.080742] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004
            [15001.102937] user pgtable: 64k pages, 48-bit VAs, pgdp = 000000009f14b2d0
            [15001.111120] Internal error: Oops: 96000005 [#1] SMP
            [15001.149680] CPU: 1 PID: 11273 Comm: ptlrpcd_01_01  4.18.0-147.8.1.el8_1.aarch64 #1
            [15001.164523] pc : mempool_free+0x24/0xe0
            [15001.167022] lr : llcrypt_free_bounce_page.part.1+0x38/0x48 [libcfs]
            [15001.223444] Process ptlrpcd_01_01 (pid: 11273, stack limit = 0x00000000f9135a93)
            [15001.228185] Call trace:
            [15001.229806]  mempool_free+0x24/0xe0
            [15001.232143]  llcrypt_free_bounce_page.part.1+0x38/0x48 [libcfs]
            [15001.236007]  llcrypt_free_bounce_page+0x24/0x30 [libcfs]
            [15001.239541]  brw_interpret+0x124/0x10c8 [osc]
            [15001.242729]  ptlrpc_check_set+0x688/0x3318 [ptlrpc]
            [15001.246031]  ptlrpcd_check+0x470/0x820 [ptlrpc]
            [15001.249060]  ptlrpcd+0x3d4/0x5c8 [ptlrpc]
            [15001.251673]  kthread+0x130/0x138

            adilger Andreas Dilger added a comment - - edited

            I may be conflating two issues, but AFAICS, sanity test_56w has only crashed a couple of times in the past 4 weeks:
            https://testing.whamcloud.com/test_sets/5850ea8a-7bc0-40a1-b88b-5aabd945fe10
            https://testing.whamcloud.com/test_sets/6dadd2cd-29c9-4965-9ecd-433452337956

            and those were both on 2020-10-10 when testing patch https://review.whamcloud.com/38883 "LU-11621 utils: optimize migrate_copy_data() with copy_file_range()".

            The only other crash started on aarch64 kernels 4.18+ on 2020-10-19, but the patch https://review.whamcloud.com/38967 "LU-12275 sec: O_DIRECT for encrypted file" was landed on master 6 weeks ago. This has been failing 100% of the time in sanity test_426 since patch https://review.whamcloud.com/39695 "LU-13745 test: add splice test for lustre" landed. That patch was submitted with "Test-Parameters: trivial", which only tests x86_64 on ldiskfs, but the test is crashing continuously on aarch64 and el8.2, both of which use 4.18 kernels.

            If this is related to crypto, it appears the source of the funky pages is the splice IO from "splice". The two failed sanity test_56w runs are testing copy_file_range(), which also uses in-kernel data copying, similar to splice. Since the pages are generated in a source filesystem and sent to the target, it isn't clear whether we can play games with the mapping or not, so it might be better to use a page flag (e.g. PageChecked, maybe with a better wrapper like PageCrypto for Lustre)?
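
The page-flag idea in the comment above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct toy_page`, `TOY_PG_CRYPTO`, and both predicates are hypothetical names, not Lustre or kernel code. It also assumes that a page arriving from splice or copy_file_range() may carry no mapping, which is the ambiguity the comment worries about.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; NOT the kernel definition. */
struct toy_page {
	void *mapping;		/* owning address space, NULL if none */
	unsigned long flags;	/* bit flags, see TOY_PG_CRYPTO */
};

/* Hypothetical marker set only when a bounce page is allocated. */
#define TOY_PG_CRYPTO (1UL << 0)

/* Heuristic: infer "bounce page" from the absence of a mapping. */
static bool is_bounce_by_mapping(const struct toy_page *pg)
{
	return pg->mapping == NULL;
}

/* Explicit: only pages tagged at allocation time count as bounce pages. */
static bool is_bounce_by_flag(const struct toy_page *pg)
{
	return (pg->flags & TOY_PG_CRYPTO) != 0;
}
```

In this model, a page with a NULL mapping and no flag is misclassified by the mapping heuristic but not by the flag test, which is the argument for tagging bounce pages explicitly.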

            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-13745 [ LU-13745 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-12275 [ LU-12275 ]
            bruno Bruno Faccini (Inactive) added a comment -

            +2 with recent master at https://testing.whamcloud.com/test_sets/24317b7d-ea90-4b01-ae0a-e01b5284c227 and https://testing.whamcloud.com/test_sets/fb2d522c-391e-4979-a709-c6c4d8a967a0

            People

              sebastien Sebastien Buisson
              sebastien Sebastien Buisson