[LU-11403] lov_io_init_empty() Page fault on a file without stripes Created: 19/Sep/18  Updated: 08/Jun/19  Resolved: 04/May/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File m.c    
Issue Links:
Related
is related to LU-12213 sanity test 61b uses wrong path Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

REFORMAT=yes RACER_ENABLE_DOM=false MOUNT_2=yes SLOW=yes sh racer.sh

 

LustreError: 20867:0:(lov_io.c:1481:lov_io_init_empty()) Page fault on a file without stripes: [0x200000402:0x116f5:0x0]
BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffffa0cc04e8>] ll_fault+0xb8/0x610 [lustre]
PGD 0 
Oops: 0002 [#1] SMP 
Modules linked in: zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) lustre(O) ofd(O) osp(O) lod(O) ost(O) mdt(O) mdd(O) mgs(O) osd_ldiskfs(O) ldiskfs(O) lquota(O) lfsck(O) obdecho(O) mgc(O) lov(O) mdc(O) osc(O) lmv(O) fid(O) fld(O) ptlrpc(O) obdclass(O) ksocklnd(O) lnet(O) libcfs(O)
CPU: 0 PID: 20867 Comm: systemd-coredum Tainted: P           O   ------------   3.10.0 #5
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800aebaa0c0 ti: ffff88006485c000 task.ti: ffff88006485c000
RIP: 0010:[<ffffffffa0cc04e8>]  [<ffffffffa0cc04e8>] ll_fault+0xb8/0x610 [lustre]
RSP: 0018:ffff88006485fc70  EFLAGS: 00010246
RAX: 0000000000000001 RBX: 0000000000000100 RCX: 00000000c0000100
RDX: ffffffff81ab94c0 RSI: ffff8800aebaa0c0 RDI: ffff88011fc15b00
RBP: ffff88006485fcd8 R08: ffff88006485c000 R09: 00000000000004c1
R10: 000000009837f050 R11: ffff88011fc15b70 R12: 0000000000000000
R13: ffff8801157a4b60 R14: ffff88006485fce8 R15: ffff8800aebaa0c0
FS:  00007f4c1846c940(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000689c0000 CR4: 00000000000007b0
Call Trace:
 [<ffffffff811649f9>] __do_fault.isra.12+0x79/0xd0
 [<ffffffff8116d563>] ? __vma_link_file+0x43/0x80
 [<ffffffff81164a88>] do_read_fault.isra.14+0x38/0x170
 [<ffffffff811703f7>] ? mmap_region+0x207/0x740
 [<ffffffff8116a7ae>] handle_mm_fault+0x77e/0x11a0
 [<ffffffff815b1078>] __do_page_fault+0x1b8/0x4b0
 [<ffffffff815b13e3>] trace_do_page_fault+0x43/0xd0
 [<ffffffff815b08ed>] do_async_page_fault+0x5d/0xa0
 


 Comments   
Comment by Patrick Farrell (Inactive) [ 13/Feb/19 ]

I'm not sure why we haven't been hitting this all along, but I'm fairly sure this is the bug.

When the file has no stripes, lov_io_init_empty() returns -EFAULT, which to_fault_error() translates to VM_FAULT_NOPAGE:

        case -EFAULT:
                result = VM_FAULT_NOPAGE;
                break;
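To make this error path concrete, here is a minimal user-space model of the translation. The M_FAULT_* names and values are hypothetical stand-ins for the kernel's VM_FAULT_* bits, and only the -EFAULT arm comes from the snippet above; the 0, -ENOMEM and default arms are assumptions added so the function covers every input.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel's VM_FAULT_* bits; the values
 * are arbitrary choices for this model. */
#define M_FAULT_OOM     0x01
#define M_FAULT_SIGBUS  0x02
#define M_FAULT_NOPAGE  0x04
#define M_FAULT_LOCKED  0x08

/* Model of lov_io_init_empty() handling a page fault: the file has no
 * stripes, so there is nothing to read and it fails with -EFAULT. */
static int m_init_empty_fault(void)
{
        return -EFAULT;
}

/* Model of to_fault_error(). Only the -EFAULT arm is taken from the
 * snippet in this ticket; the other arms are assumptions. */
static int m_to_fault_error(int result)
{
        switch (result) {
        case 0:
                return M_FAULT_LOCKED;  /* assumed: page read and locked */
        case -ENOMEM:
                return M_FAULT_OOM;     /* assumed */
        case -EFAULT:
                return M_FAULT_NOPAGE;  /* as quoted in the ticket */
        default:
                return M_FAULT_SIGBUS;  /* assumed catch-all */
        }
}
```

NOPAGE is not one of the bits ll_fault() tests for below (RETRY, ERROR, LOCKED), so this result falls through to the page-locking branch.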

 

Then, in ll_fault():

        if (!(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) {
                struct page *vmpage = vmf->page;

                /* check if this page has been truncated */
                lock_page(vmpage);

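But vmf->page is not set in this case: __do_fault() initializes it to NULL before calling the ->fault handler, so lock_page() is handed a NULL page.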

static int __do_fault(struct vm_area_struct *vma, unsigned long address,
                        pgoff_t pgoff, unsigned int flags,
                        struct page *cow_page, struct page **page,
                        void **entry, pmd_t *pmd, pte_t orig_pte)
{
        struct vm_fault vmf;
        int ret;

        vmf.virtual_address = (void __user *)(address & PAGE_MASK);
        vmf.pgoff = pgoff;
        vmf.flags = flags;
        vmf.page = NULL;                        /* <-- page starts out unset */
        vmf.gfp_mask = __get_fault_gfp_mask(vma);
        vmf.cow_page = cow_page;
        vmf.orig_pte = orig_pte;
        vmf.pmd = pmd;
        vmf.vma = vma;

        ret = vma->vm_ops->fault(vma, &vmf);    /* <-- calls ll_fault() */
        /* ... */

Various error paths can leave vmf->page unset.

The fix seems clear: check that the page is set before using it.
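A minimal sketch of the check proposed here, again as a user-space model: m_fault_tail() is a hypothetical stand-in for the end of ll_fault(), the M_FAULT_* values are arbitrary, and a plain bool flag stands in for lock_page(). It shows one way to apply the check, not the code that was eventually merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for VM_FAULT_* bits; the values are arbitrary. */
#define M_FAULT_OOM     0x01
#define M_FAULT_SIGBUS  0x02
#define M_FAULT_NOPAGE  0x04
#define M_FAULT_LOCKED  0x08
#define M_FAULT_RETRY   0x10
/* As with VM_FAULT_ERROR, the NOPAGE bit is not treated as an error. */
#define M_FAULT_ERROR   (M_FAULT_OOM | M_FAULT_SIGBUS)

struct m_page {
        bool locked;            /* stands in for the page lock */
};

/* Model of the tail of ll_fault(). 'page' plays the role of vmf->page,
 * which __do_fault() sets to NULL before calling the handler. */
static int m_fault_tail(int result, struct m_page *page)
{
        if (!(result & (M_FAULT_RETRY | M_FAULT_ERROR | M_FAULT_LOCKED))) {
                /* The proposed check: a NOPAGE result from a file with no
                 * stripes arrives here with no page, so do not lock it. */
                if (page == NULL)
                        return result;

                page->locked = true;    /* stands in for lock_page() */
                result |= M_FAULT_LOCKED;
        }
        return result;
}
```

Leaving the NOPAGE result unchanged, as this sketch does, avoids the NULL dereference but does not by itself make the fault either succeed or fail.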

Comment by Gerrit Updater [ 13/Feb/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34247
Subject: LU-11403 llite: Check if page is set
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b0522f8f147f71a2bb150570769693de196df763

Comment by Alexander Zarochentsev [ 11/Mar/19 ]

Attached m.c, a simpler reproducer for the ll_fault() crash (written by Andrew Perepechko).

With https://review.whamcloud.com/34247 applied, the reproducer hangs, repeatedly printing these messages to the kernel log:

[   87.575164] LustreError: 5061:0:(lov_io.c:1481:lov_io_init_empty()) Page fault on a file without stripes: [0x200000401:0x1:0x0]
[   87.575546] LustreError: 5061:0:(lov_io.c:1481:lov_io_init_empty()) Skipped 251880 previous similar messages
[   89.577930] LustreError: 5061:0:(lov_io.c:1481:lov_io_init_empty()) Page fault on a file without stripes: [0x200000401:0x1:0x0]
[   89.578283] LustreError: 5061:0:(lov_io.c:1481:lov_io_init_empty()) Skipped 493482 previous similar messages
[   93.579391] LustreError: 5061:0:(lov_io.c:1481:lov_io_init_empty()) Page fault on a file without stripes: [0x200000401:0x1:0x0]
[   93.579750] LustreError: 5061:0:(lov_io.c:1481:lov_io_init_empty()) Skipped 946319 previous similar messages
[  101.581054] LustreError: 5061:0:(lov_io.c:1481:lov_io_init_empty()) Page fault on a file without stripes: [0x200000401:0x1:0x0]
[  101.581550] LustreError: 5061:0:(lov_io.c:1481:lov_io_init_empty()) Skipped 1945088 previous similar messages
[  117.585786] LustreError: 5061:0:(lov_io.c:1481:lov_io_init_empty()) Page fault on a file without stripes: [0x200000401:0x1:0x0]
[  117.586855] LustreError: 5061:0:(lov_io.c:1481:lov_io_init_empty()) Skipped 4004012 previous similar messages

instead of crashing. However, the hang can be interrupted from the terminal.
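One plausible explanation for the hang, stated as an assumption rather than something established in this ticket: VM_FAULT_NOPAGE tells the kernel that the handler installed the page-table entry itself, so the faulting instruction is simply restarted. If nothing was mapped, the restart faults again on the same address, indefinitely. Each pass goes back through lov_io_init_empty(), which fits the rate-limited "Skipped N previous similar messages" lines, and each return to user space gives pending signals a chance to be delivered, which would explain why the hang can be interrupted. The toy model below caps the number of faults so it terminates; all names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible results of one fault in this toy model. */
enum m_outcome {
        M_MAPPED,       /* a page was mapped: the access completes */
        M_NOPAGE,       /* "handled", but no page-table entry was installed */
        M_SIGBUS,       /* an error: the process receives a signal */
};

/* Hypothetical handler for a file with no stripes: like the
 * -EFAULT -> VM_FAULT_NOPAGE path, it reports NOPAGE and maps nothing. */
static enum m_outcome m_empty_layout_fault(void)
{
        return M_NOPAGE;
}

/* Hypothetical handler that reports an error instead, for contrast. */
static enum m_outcome m_error_fault(void)
{
        return M_SIGBUS;
}

/* Simulates one user-space access that keeps faulting until it either
 * completes or is killed. Returns the number of faults taken, capped at
 * max_faults so the model terminates; *mapped reports success. */
static int m_access(enum m_outcome (*fault)(void), int max_faults,
                    bool *mapped)
{
        int n;

        *mapped = false;
        for (n = 1; n <= max_faults; n++) {
                switch (fault()) {
                case M_MAPPED:
                        *mapped = true;
                        return n;
                case M_SIGBUS:
                        return n;
                case M_NOPAGE:
                        /* Nothing mapped: the instruction is restarted
                         * and faults again on the same address. */
                        break;
                }
        }
        return max_faults;
}
```

Without the cap, m_access(m_empty_layout_fault, ...) would never return, which has the shape of the hang reported above, while a handler that reports an error ends the access after a single fault. Which outcome the final patch chose for this case is not stated in this ticket.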

Comment by Patrick Farrell (Inactive) [ 11/Mar/19 ]

Oh, cool! Thanks very much to Panda for the reproducer and Zam for the report. I will take a look at the hang. (Agreed, it's better than panicking, but we should be able to do better.)

Comment by Gerrit Updater [ 13/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34247/
Subject: LU-11403 llite: ll_fault fixes
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a8f4d1e5fd79e77f1347e983ec52f2ddc3e75ab9

Comment by Peter Jones [ 13/Apr/19 ]

Landed for 2.13

Comment by Andreas Dilger [ 17/Apr/19 ]

The sanity.sh test_61b added by this patch is failing regularly.

My thinking is that it makes sense to instantiate the layout for the full range of the mapping when mmap() is called, rather than waiting for the page fault.
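A sketch of the first step of that suggestion: working out which byte range of the file a mapping covers, so the layout could be instantiated for all of it at mmap() time. The VMA fields vm_start, vm_end and vm_pgoff are real kernel concepts, but m_mapping_extent() and the fixed 4 KiB page size are assumptions for this example, and the layout-instantiation call itself is deliberately left out because this ticket does not describe it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KiB pages for the example; the kernel uses PAGE_SHIFT. */
#define M_PAGE_SHIFT 12

/* Half-open byte range [start, end) of the file. */
struct m_extent {
        uint64_t start;
        uint64_t end;
};

/* Hypothetical helper: the file range covered by a mapping, derived from
 * the VMA's virtual bounds and its file offset in pages (vm_pgoff). */
static struct m_extent m_mapping_extent(uint64_t vm_start, uint64_t vm_end,
                                        uint64_t vm_pgoff)
{
        struct m_extent ext;

        assert(vm_end > vm_start);
        ext.start = vm_pgoff << M_PAGE_SHIFT;
        ext.end = ext.start + (vm_end - vm_start);
        return ext;
}
```

For a 16 KiB mapping that starts at page 2 of the file, this gives [8192, 24576). Instantiating the layout over that range up front is the idea in the comment above: a later fault inside the mapping would then find stripes rather than an empty layout.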

Comment by Gerrit Updater [ 17/Apr/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34698
Subject: LU-11403 tests: Fix $tfile usage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 37a5b05ac0fe80f551ec279f070726edb04d7b33

Comment by Gerrit Updater [ 04/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34698/
Subject: LU-11403 tests: Fix $tfile usage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4191e0cdd0d96b848c1235471179d25d37a889dc

Comment by Peter Jones [ 04/May/19 ]

Take two

Comment by Gerrit Updater [ 21/May/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34914
Subject: LU-11403 tests: Fix $tfile usage
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 58572b0ba1be608226af3695c4f7d50399e2bded

Comment by Gerrit Updater [ 21/May/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34935
Subject: LU-11403 llite: ll_fault fixes
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 9760d68dd6915492f4d2241d0009b86fba85e770

Comment by Gerrit Updater [ 08/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34935/
Subject: LU-11403 llite: ll_fault fixes
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: a1ffe58ab0c41009b77abe553850798a1030d653

Comment by Gerrit Updater [ 08/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34914/
Subject: LU-11403 tests: Fix $tfile usage
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 5866a8975fe95cbd07b370ee4cf838ba223c13da

Generated at Sat Feb 10 02:43:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.