[LU-8256] BUG: unable to handle kernel paging request Created: 09/Jun/16  Updated: 14/Jun/18  Resolved: 04/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: llnl, llnlfixready
Environment:

RHEL 7.2 derivative, TOSS 3
kernel 3.10.0-327.13.1.3chaos.ch6.x86_64
lustre-2.5.5-6chaos.4.ch6.x86_64
Servers are lustre 2.5 on TOSS 2/RHEL6.7


Attachments: File 0001-LU-8256-llite-use-copy_-t-from-_user-in-ll_fiemap.patch    
Issue Links:
Related
is related to LU-6201 remove duplicate fiemap code/defines Resolved
is related to LU-8292 unnecessary LustreError 17754:0:(file... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On one cluster, copying a file stored on the Lustre filesystem causes the node to crash with "BUG: unable to handle kernel paging request at 00007fffffffa650" reported in the console log:

2016-06-08 17:01:54 [ 604.544195] BUG: unable to handle kernel paging request at 00007fffffffa650
2016-06-08 17:01:54 [ 604.552365] IP: [<ffffffffa113b3d7>] ll_fiemap+0x1a7/0x5c0 [lustre]
2016-06-08 17:01:54 [ 604.559682] PGD 201eb17067 PUD 200b0af067 PMD 2023ba0067 PTE 8000000feafd7067
2016-06-08 17:01:54 [ 604.567958] Oops: 0001 [#1] SMP
2016-06-08 17:01:54 [ 604.571838] Modules linked in: lmv(OE) fld(OE) mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) fid(OE) ptlrpc(OE) obdclass(OE) rpcsec_gss_krb5 ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables nfsv3 nf_log_ipv4 nf_log_common xt_LOG xt_multiport xfs libcrc32c intel_powerclamp coretemp intel_rapl kvm iTCO_wdt iTCO_vendor_support ipmi_devintf hfi1(OE) sb_edac mei_me lpc_ich edac_core sg pcspkr mei mfd_core i2c_i801 shpchp ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi binfmt_misc ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr nfsd nfs_acl auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sd_mod crc_t10dif crct10dif_generic mxm_wmi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 ghash_clmulni_intel syscopyarea sysfillrect sysimgblt drm_kms_helper aesni_intel lrw igb ttm gf128mul ahci glue_helper dca ablk_helper libahci ptp drm cryptd pps_core libata i2c_algo_bit i2c_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod scsi_transport_iscsi zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate [last unloaded: ip_tables]
2016-06-08 17:01:55 [ 604.708455] CPU: 38 PID: 7679 Comm: cp Tainted: P OE ------------ 3.10.0-327.13.1.3chaos.ch6.x86_64 #1
2016-06-08 17:01:55 [ 604.720645] Hardware name: Penguin Computing Relion 2900e/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-06-08 17:01:55 [ 604.733342] task: ffff88101bb78b80 ti: ffff881019fbc000 task.ti: ffff881019fbc000
2016-06-08 17:01:55 [ 604.742269] RIP: 0010:[<ffffffffa113b3d7>] [<ffffffffa113b3d7>] ll_fiemap+0x1a7/0x5c0 [lustre]
2016-06-08 17:01:55 [ 604.752594] RSP: 0018:ffff881019fbfe78 EFLAGS: 00010206
2016-06-08 17:01:55 [ 604.759123] RAX: 00007fffffffa650 RBX: 0000000000000fe0 RCX: ffff881020fe1880
2016-06-08 17:01:55 [ 604.767703] RDX: 0000000000000002 RSI: ffff88101e6d2000 RDI: ffff8810235b1b48
2016-06-08 17:01:55 [ 604.776293] RBP: ffff881019fbfeb0 R08: 0000000000000000 R09: ffff88101e6d2000
2016-06-08 17:01:55 [ 604.784890] R10: ffffffffa113b27b R11: 0000000000000000 R12: ffff88101e6d2020
2016-06-08 17:01:55 [ 604.793493] R13: 7fffffffffffffff R14: ffff881019fbfec8 R15: ffff88101e6d2000
2016-06-08 17:01:55 [ 604.802102] FS: 00002aaaaab0a6c0(0000) GS:ffff88103f480000(0000) knlGS:0000000000000000
2016-06-08 17:01:55 [ 604.811797] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-06-08 17:01:55 [ 604.818875] CR2: 00007fffffffa650 CR3: 0000002022ce0000 CR4: 00000000003407e0
2016-06-08 17:01:55 [ 604.827521] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2016-06-08 17:01:55 [ 604.836175] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2016-06-08 17:01:55 [ 604.844830] Stack:
2016-06-08 17:01:55 [ 604.847761] ffff8810235b1b48 0000000000000000 0000000000000000 ffff8810235b1b48
2016-06-08 17:01:55 [ 604.856779] 00007fffffffa630 0000000000000003 0000000000000000 ffff881019fbff28
2016-06-08 17:01:55 [ 604.865801] ffffffff811fb409 7fffffffffffffff 0000000000000001 0000000000000048
2016-06-08 17:01:55 [ 604.874833] Call Trace:
2016-06-08 17:01:55 [ 604.878300] [<ffffffff811fb409>] do_vfs_ioctl+0x169/0x510
2016-06-08 17:01:55 [ 604.885174] [<ffffffff811fb851>] SyS_ioctl+0xa1/0xc0
2016-06-08 17:01:55 [ 604.891572] [<ffffffff8165cd49>] system_call_fastpath+0x16/0x1b
2016-06-08 17:01:55 [ 604.899040] Code: a0 d1 08 e0 44 89 e8 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 0f 1f 44 00 00 49 8b 46 10 48 8b 7d c8 4c 89 fe 4d 8d 67 20 <48> 8b 10 49 89 57 20 48 8b 50 08 49 89 57 28 48 8b 50 10 49 89
2016-06-08 17:01:55 [ 604.922361] RIP [<ffffffffa113b3d7>] ll_fiemap+0x1a7/0x5c0 [lustre]
2016-06-08 17:01:55 [ 604.930295] RSP <ffff881019fbfe78>
2016-06-08 17:01:55 [ 604.935013] CR2: 00007fffffffa650
2016-06-08 17:01:55 [ 605.513479] ---[ end trace d79d98174ba667ee ]---
2016-06-08 17:01:55 [ 605.521246] Kernel panic - not syncing: Fatal exception

Another cluster running a Lustre 2.5 client on TOSS 3, mounting a TOSS 2/Lustre 2.5 server, does not show the same problem.



 Comments   
Comment by Olaf Faaland [ 09/Jun/16 ]

This lustre tag is available where Chris normally pushes our branch. Peter knows where that is.

Code indicated by RIP in the error output is:

(gdb) l *( ll_fiemap+0x1a7)
0x173d7 is in ll_fiemap (/usr/src/debug/lustre-2.5.5/lustre/llite/file.c:3319).
3314            fiemap->fm_flags = fieinfo->fi_flags;
3315            fiemap->fm_extent_count = fieinfo->fi_extents_max;
3316            fiemap->fm_start = start;
3317            fiemap->fm_length = len;
3318            if (extent_count > 0)
3319                    memcpy(&fiemap->fm_extents[0], fieinfo->fi_extents_start,
3320                           sizeof(struct ll_fiemap_extent));
3321
3322            rc = ll_do_fiemap(inode, fiemap, num_bytes);
Comment by Olaf Faaland [ 09/Jun/16 ]

I'm not at all certain, but it looks to me like line 3319 should be using copy_from_user(), not memcpy(). ll_fiemap() is called from the kernel by ioctl_fiemap(), and from what I can see, fieinfo->fi_extents_start points to user memory (ufiemap->fm_extents). The relevant kernel code is quoted below (a sketch of the implied change follows it).

static int ioctl_fiemap(struct file *filp, unsigned long arg)
{
        struct fiemap fiemap;
        struct fiemap __user *ufiemap = (struct fiemap __user *) arg;
        struct fiemap_extent_info fieinfo = { 0, };
        struct inode *inode = file_inode(filp);
        struct super_block *sb = inode->i_sb;
        u64 len;
        int error;

        if (!inode->i_op->fiemap)
                return -EOPNOTSUPP;

        if (copy_from_user(&fiemap, ufiemap, sizeof(fiemap)))
                return -EFAULT;

        if (fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS)
                return -EINVAL;

        error = fiemap_check_ranges(sb, fiemap.fm_start, fiemap.fm_length,
                                    &len);
        if (error)
                return error;

        fieinfo.fi_flags = fiemap.fm_flags;
        fieinfo.fi_extents_max = fiemap.fm_extent_count;
        fieinfo.fi_extents_start = ufiemap->fm_extents;

        if (fiemap.fm_extent_count != 0 &&
            !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start,
                       fieinfo.fi_extents_max * sizeof(struct fiemap_extent)))
                return -EFAULT;

        if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC)
                filemap_write_and_wait(inode->i_mapping);

        error = inode->i_op->fiemap(inode, &fieinfo, fiemap.fm_start, len);
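
For reference, a minimal sketch of the kind of change being discussed (hedged; this is not the attached patch nor the LU-6201 patch verbatim): treat fieinfo->fi_extents_start as the __user pointer it is and use copy_from_user()/copy_to_user() in place of the memcpy() calls shown in the gdb listing above. Variable and field names follow the 2.5 listing; the "out" cleanup label and error path are assumptions for illustration.

        /* Sketch only: fill the kernel-side fiemap buffer from the user's
         * first extent with copy_from_user() rather than memcpy(). */
        if (extent_count > 0 &&
            copy_from_user(&fiemap->fm_extents[0],
                           fieinfo->fi_extents_start,
                           sizeof(struct ll_fiemap_extent))) {
                rc = -EFAULT;
                goto out;       /* assumed cleanup label that frees fiemap */
        }

        rc = ll_do_fiemap(inode, fiemap, num_bytes);
        if (rc)
                goto out;

        /* ...and copy the mapped extents back out to user space. */
        if (extent_count > 0 &&
            copy_to_user(fieinfo->fi_extents_start,
                         &fiemap->fm_extents[0],
                         fiemap->fm_mapped_extents *
                         sizeof(struct ll_fiemap_extent)))
                rc = -EFAULT;
out:
        /* free the kernel-side fiemap buffer here and return rc */
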
Comment by Andreas Dilger [ 10/Jun/16 ]

It looks like this was fixed in patch http://review.whamcloud.com/13608 "LU-6201 llite: remove duplicate fiemap defines". That was considered mostly a code cleanup, but included the copy_from_user() fix.

There is a patch for b2_7_fe at: http://review.whamcloud.com/16815

The patch also needs to be backported to b2_5_fe and b_ieel2_0.

Comment by Zhenyu Xu [ 11/Jun/16 ]

b2_5_fe backport at http://review.whamcloud.com/20731
b_ieel2_0 backport at http://review.whamcloud.com/20732

Comment by Olaf Faaland [ 14/Jun/16 ]

Thanks for working on the backport. However, it's a big patch. Is all of the change appropriate for a stable branch? It looks to me that most of it is pruning unnecessary code, as opposed to fixing code that is wrong.

Comment by Olaf Faaland [ 14/Jun/16 ]

I made a smaller change and tested it locally on the cluster that was hitting the issue. It worked there. I assume I won't be able to push to the b2_5_fe branch, so I've attached it as 0001-LU-8256-llite-use-copy_-t-from-_user-in-ll_fiemap.patch

Comment by Olaf Faaland [ 15/Jun/16 ]

Actually, I'm not sure about this patch. I ran my cp command under strace, and I see the FIEMAP ioctl returned EOPNOTSUPP. This does indeed allow the copy to succeed without crashing the node, but I don't understand why my patch had that effect; the patch just changes the memcpy() calls to copy_to_user()/copy_from_user().

Comment by Zhenyu Xu [ 15/Jun/16 ]

Yes, your patch can solve this issue as well.

As for the EOPNOTSUPP on the fiemap path, I can see one place that returns it:

        /* If the stripe_count > 1 and the application does not understand
         * DEVICE_ORDER flag, then it cannot interpret the extents correctly.
         */
        if (lsm->lsm_stripe_count > 1 &&
            !(fiemap->fm_flags & FIEMAP_FLAG_DEVICE_ORDER))
                GOTO(out, rc = -EOPNOTSUPP);

When the file data crosses multiple OST stripes, fiemap returns each OST index in fiemap_extent::fe_reserved[0], so an application that is not Lustre-aware cannot interpret the extents and use them correctly.
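
To illustrate the check quoted above, here is a hedged user-space sketch (not from this ticket) of a FIEMAP caller that sets the Lustre-specific DEVICE_ORDER flag, so files striped over more than one OST are not rejected with EOPNOTSUPP. FIEMAP_FLAG_DEVICE_ORDER normally comes from Lustre's lustre_user.h; the fallback value defined here is an assumption and should be checked against your headers.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#ifndef FIEMAP_FLAG_DEVICE_ORDER
#define FIEMAP_FLAG_DEVICE_ORDER 0x40000000 /* assumed; verify vs lustre_user.h */
#endif

int main(int argc, char **argv)
{
        unsigned int count = 32;
        size_t size = sizeof(struct fiemap) +
                      count * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, size);
        int fd;

        if (argc < 2 || fm == NULL)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;

        fm->fm_start = 0;
        fm->fm_length = ~0ULL;                  /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_DEVICE_ORDER; /* extents in device order */
        fm->fm_extent_count = count;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                perror("FS_IOC_FIEMAP");        /* EOPNOTSUPP without the flag
                                                 * on multi-stripe files */
                return 1;
        }
        printf("mapped extents: %u\n", fm->fm_mapped_extents);
        close(fd);
        free(fm);
        return 0;
}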

Comment by Olaf Faaland [ 15/Jun/16 ]

I see. Thank you for the explanation.

Comment by Olaf Faaland [ 15/Jun/16 ]

Zhenyu,
I applied your patch to our Lustre 2.5.5 stack and am running it on a couple of nodes. It turns out that in my test case the EOPNOTSUPP (-95) is coming from obd_get_info(), both with my patch from yesterday and with yours from today. I get the following console log messages:

LustreError: 11-0: fsrzb-OST0009-osc-ffff88200c9f7000: Communicating with 172.21.1.110@o2ib200, operation ost_get_info failed with -95.
LustreError: 170050:0:(file.c:1900:ll_do_fiemap()) obd_get_info failed: rc = -95
Comment by Olaf Faaland [ 15/Jun/16 ]

dbo_fiemap_get is not implemented in the ZFS OSD, because ZFS does not support fiemap (not that it could not; it is simply not implemented). We suspect that ofd_get_info() is calling dt_fiemap_get(), which returns -EOPNOTSUPP, and that error code is passed back to the client.
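
For illustration only, a hedged sketch of the dispatch pattern being described (made-up helper name, not the real Lustre source): an OSD that does not supply dbo_fiemap_get in its dt_body_operations leaves the server nothing to call, so the request comes back as -EOPNOTSUPP, which the client logs as rc = -95.

/* Sketch of the described behaviour; the real dt_fiemap_get()/ofd_get_info()
 * differ in detail. */
static int fiemap_get_sketch(const struct lu_env *env, struct dt_object *obj,
                             struct fiemap *fm)
{
        if (obj->do_body_ops == NULL ||
            obj->do_body_ops->dbo_fiemap_get == NULL)
                return -EOPNOTSUPP;     /* seen as rc = -95 on the client */

        return obj->do_body_ops->dbo_fiemap_get(env, obj, fm);
}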

Comment by Olaf Faaland [ 15/Jun/16 ]

I think I've confirmed that is what is happening:

On the client:

[root@rzgenie48:fiemap]# lfs getstripe test.1k.sparse.1extent
test.1k.sparse.1extent
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  9
        obdidx           objid           objid           group
             9       259446897      0xf76d871                0

[root@rzgenie48:fiemap]# strace cp --sparse=always test.1k.sparse.1extent test.1k.sparse.1extent.copy2 2>&1 | grep ioctl
ioctl(3, FS_IOC_FIEMAP, 0x7fffffffc760) = -1 EOPNOTSUPP (Operation not supported)
[root@rzgenie48:fiemap]# lctl list_nids
192.168.128.48@o2ib21

Logs on the OSS:

00000100:00100000:9.0:1466033830.426303:0:10320:0:(service.c:2086:ptlrpc_server_handle_request()) Handl
ing RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost02_006:68119517-017d-eeee-db17-abd63a7e2e89+5:125727:x1537240460206584:12345-192.168.128.48@o2ib21:7
00002000:00000001:9.0:1466033830.426308:0:10320:0:(ofd_obd.c:536:ofd_get_info()) Process entered
00002000:00000001:9.0:1466033830.426309:0:10320:0:(ofd_obd.c:599:ofd_get_info()) Process leaving via out (rc=0 : 0 : 0x0)
00002000:00000001:9.0:1466033830.426310:0:10320:0:(ofd_obd.c:669:ofd_get_info()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:9.0:1466033830.426311:0:10320:0:(pack_generic.c:336:lustre_pack_reply_v2()) Process entered
00000100:00000001:9.0:1466033830.426313:0:10320:0:(pack_generic.c:370:lustre_pack_reply_v2()) Process leaving (rc=0 : 0 : 0)
00002000:00000001:9.0:1466033830.426314:0:10320:0:(ofd_obd.c:536:ofd_get_info()) Process entered
00002000:00000001:9.0:1466033830.426315:0:10320:0:(ofd_objects.c:81:ofd_object_find()) Process entered
00002000:00000001:9.0:1466033830.426317:0:10320:0:(ofd_objects.c:89:ofd_object_find()) Process leaving (rc=18446612192165995760 : -131881543555856 : ffff880def71a4f0)
00002000:00000001:9.0:1466033830.426319:0:10320:0:(ofd_obd.c:669:ofd_get_info()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
00000100:00000001:9.0:1466033830.426321:0:10320:0:(niobuf.c:649:ptlrpc_send_error()) Process entered
00000100:00000001:9.0:1466033830.426323:0:10320:0:(connection.c:136:ptlrpc_connection_addref()) Process entered
00000100:00000001:9.0:1466033830.426323:0:10320:0:(connection.c:143:ptlrpc_connection_addref()) Process leaving (rc=18446612201684338752 : -131872025212864 : ffff881026c80840)
00000100:00000001:9.0:1466033830.426325:0:10320:0:(niobuf.c:60:ptl_send_buf()) Process entered
00000100:00000001:9.0:1466033830.426332:0:10320:0:(niobuf.c:101:ptl_send_buf()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:9.0:1466033830.426333:0:10320:0:(connection.c:99:ptlrpc_connection_put()) Process entered
00000100:00000001:9.0:1466033830.426334:0:10320:0:(connection.c:129:ptlrpc_connection_put()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:9.0:1466033830.426335:0:10320:0:(niobuf.c:666:ptlrpc_send_error()) Process leaving (rc=0 : 0 : 0)
00000100:00100000:9.0:1466033830.426338:0:10320:0:(service.c:2130:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost02_006:68119517-017d-eeee-db17-abd63a7e2e89+5:125727:x1537240460206584:12345-192.168.128.48@o2ib21:7 Request procesed in 38us (104us total) trans 0 rc -95/-95
Comment by Olaf Faaland [ 15/Jun/16 ]

So, I believe your patch is working as expected.

Comment by Zhenyu Xu [ 16/Jun/16 ]

Great, thank you for the investigation.

Comment by Scott Wood [ 09/Sep/16 ]

Not sure if I should open another issue, but I have seen this with 2.7.0. Clients and servers were both installed using the RPMs from the Intel community repos. The servers are running CentOS 6 with the 2.6.32-504.8.1.el6_lustre.x86_64 kernel (again from the Intel RPMs), and the CentOS 7 clients are running the CentOS-provided 3.10.0-123.20.1.el7.x86_64 kernel.

One user can reliably bring down a host, and processing the kdump vmcore produces the following:

[root@hpcnode045 crash]# crash 127.0.0.1-2016-09-09-11\:54\:04/vmcore /usr/lib/debug/lib/modules/3.10.0-123.20.1.el7.x86_64/vmlinux 

crash 7.1.2-3.el7_2.1
Copyright (C) 2002-2014  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-123.20.1.el7.x86_64/vmlinux
    DUMPFILE: 127.0.0.1-2016-09-09-11:54:04/vmcore  [PARTIAL DUMP]
        CPUS: 56
        DATE: Fri Sep  9 11:53:48 2016
      UPTIME: 00:08:59
LOAD AVERAGE: 10.24, 10.07, 5.45
       TASKS: 1137
    NODENAME: hpcnode045.adqimr.ad.lan
     RELEASE: 3.10.0-123.20.1.el7.x86_64
     VERSION: #1 SMP Thu Jan 29 18:05:33 UTC 2015
     MACHINE: x86_64  (2594 Mhz)
      MEMORY: 255.9 GB
       PANIC: "BUG: unable to handle kernel paging request at 00007fff6fe704b0"
         PID: 4366
     COMMAND: "cp"
        TASK: ffff883fcc340000  [THREAD_INFO: ffff883fb92c2000]
         CPU: 49
       STATE: TASK_RUNNING (PANIC)

crash>bt
PID: 4366   TASK: ffff883fcc340000  CPU: 49  COMMAND: "cp"
 #0 [ffff883fb92c3aa0] machine_kexec at ffffffff81041191
 #1 [ffff883fb92c3af8] crash_kexec at ffffffff810cf2b2
 #2 [ffff883fb92c3bc8] oops_end at ffffffff815eb648
 #3 [ffff883fb92c3bf0] no_context at ffffffff815db9e4
 #4 [ffff883fb92c3c40] __bad_area_nosemaphore at ffffffff815dba7a
 #5 [ffff883fb92c3c88] bad_area_nosemaphore at ffffffff815dbbe4
 #6 [ffff883fb92c3c98] __do_page_fault at ffffffff815ee46e
 #7 [ffff883fb92c3d98] do_page_fault at ffffffff815ee68a
 #8 [ffff883fb92c3dc0] page_fault at ffffffff815ea8c8
    [exception RIP: ll_fiemap+591]
    RIP: ffffffffa0ba3bbf  RSP: ffff883fb92c3e78  RFLAGS: 00010206
    RAX: 00007fff6fe704b0  RBX: 0000000000000fe0  RCX: ffff883fcf76a580
    RDX: 0000000000000002  RSI: ffff883fd1572000  RDI: ffff883fcf428108
    RBP: ffff883fb92c3eb0   R8: 0000000000000000   R9: ffff883fd1572000
    R10: ffff881ffec07300  R11: ffffffffa0814754  R12: ffff883fd1572020
    R13: 0000000000000000  R14: ffff883fb92c3ec8  R15: ffff883fd1572000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff883fb92c3e70] ll_fiemap at ffffffffa0ba3a06 [lustre]
#10 [ffff883fb92c3eb8] do_vfs_ioctl at ffffffff811c2b04
#11 [ffff883fb92c3f30] sys_ioctl at ffffffff811c2f01
#12 [ffff883fb92c3f80] system_call_fastpath at ffffffff815f3219
    RIP: 00007f9dd8e64d77  RSP: 00007fff6fe71340  RFLAGS: 00010202
    RAX: 0000000000000010  RBX: ffffffff815f3219  RCX: 00007f9dd8e685da
    RDX: 00007fff6fe70490  RSI: 00000000c020660b  RDI: 0000000000000003
    RBP: 7fffffffffffffff   R8: 00000000ffffffff   R9: 0000000000401007
    R10: 00007fff6fe70210  R11: 0000000000000206  R12: 0000000000000000
    R13: 0000000000000000  R14: 00007fff6fe71600  R15: 00007fff6fe70490
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
Comment by Peter Jones [ 04/Jan/17 ]

This was fixed on master as part of LU-6201.
