[LU-6827] sanity-scrub: Failed mount OST on ldiskfs Created: 09/Jul/15  Updated: 20/Oct/16  Resolved: 03/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

server and client: lustre-master build # 3094 RHEL7


Issue Links:
Duplicate
is duplicated by LU-6958 conf-sanity: (osd_interna... Resolved
Related
is related to LU-7118 sanity-scrub: No sub tests failed in ... Resolved
is related to LU-5645 Fail to start sanity-scrub Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

Please provide additional information about the failure here.

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f7c4952e-25eb-11e5-866a-5254006e85c2.

21:50:25:CMD: shadow-43vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all -lnet -lnd -pinger\" 4 
21:50:27:CMD: shadow-43vm3 e2label /dev/lvm-Role_MDS/P1 2>/dev/null
21:50:27:Started lustre-MDT0000
21:50:27:CMD: shadow-43vm4 mkdir -p /mnt/ost1
21:50:27:CMD: shadow-43vm4 test -b /dev/lvm-Role_OSS/P1
21:50:28:Starting ost1:   /dev/lvm-Role_OSS/P1 /mnt/ost1
21:50:28:CMD: shadow-43vm4 mkdir -p /mnt/ost1; mount -t lustre   		                   /dev/lvm-Role_OSS/P1 /mnt/ost1
21:50:28:shadow-43vm4: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P1 at /mnt/ost1 failed: Cannot send after transport endpoint shutdown
21:50:28:sanity-scrub returned 0
21:50:28:running: sanity-benchmark 
21:50:28:run_suite sanity-benchmark /usr/lib64/lustre/tests/sanity-benchmark.sh



 Comments   
Comment by Peter Jones [ 09/Jul/15 ]

Bobijam

Could you please look into this as a high priority? It is disrupting RHEL7 server testing.

Thanks

Peter

Comment by Zhenyu Xu [ 10/Jul/15 ]

It does not look like a Lustre issue, more like a test network environment issue. When mounting the OST, the OSS node cannot reach the MGS node at that moment.
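(For reference, MGS reachability from the OSS can be sanity-checked at the LNet level with something like "lctl ping 10.1.5.59@tcp", using the MGS NID that appears in the logs later in this ticket.)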

Comment by Peter Jones [ 10/Jul/15 ]

Thanks Bobi. Sarah, does this issue reproduce 100% of the time?

Comment by Sarah Liu [ 13/Jul/15 ]

No, this one does not reproduce 100% of the time, but it is not a one-time failure either. Here are other instances:

RHEL7.1 server/6.6 client:
https://testing.hpdd.intel.com/test_sessions/f47111c4-282d-11e5-b595-5254006e85c2

RHEL7.1 server/client DNE:
https://testing.hpdd.intel.com/test_sessions/c465a3ec-26fb-11e5-b3d7-5254006e85c2

Comment by Zhenyu Xu [ 14/Jul/15 ]

Can you grab a debug log from the node the next time it happens, or let me log into it to check what the problem might be?

Comment by Sarah Liu [ 17/Jul/15 ]

Hello Bobijam,

Could you please check the link here and see if there is any useful information? If it is still not enough, I will try to reproduce the failure manually:

https://testing.hpdd.intel.com/sub_tests/1517745c-25ec-11e5-866a-5254006e85c2

Comment by Zhenyu Xu [ 18/Jul/15 ]

From the link I only saw LU-6186 "sanity-hsm test_70: Failed to start copytool monitor" (in the OST stack_trace log), nothing related to this issue.

Comment by Sarah Liu [ 21/Jul/15 ]

OK, then I will try to reproduce this problem manually.

I hit TEI-3732 when trying to provision the RHEL7.1 server. I will push a test-only patch and see how it goes.

Comment by Sarah Liu [ 23/Jul/15 ]

Hello Bobijam,

I can reproduce this bug; here is what I got from the MDS console:

[10502.926564] Lustre: DEBUG MARKER: -----============= acceptance-small: sanity-scrub ============----- Thu Jul 23 04:51:48 UTC 2015
[10504.076263] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-scrub test complete, duration -o sec ======================================================= 04:51:49 \(1437627109\)
[10504.254124] Lustre: DEBUG MARKER: == sanity-scrub test complete, duration -o sec ======================================================= 04:51:49 (1437627109)
[10506.618798] Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
[10506.935315] Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
[10509.247280] LustreError: 2904:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880066f1cf00 x1507451119138540/t0(0) o13->lustre-OST0005-osc-MDT0000@10.1.5.50@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[10509.247286] LustreError: 2905:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88003d415800 x1507451119138536/t0(0) o13->lustre-OST0004-osc-MDT0000@10.1.5.50@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[10510.975199] LustreError: 2905:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88003d415800 x1507451119138556/t0(0) o13->lustre-OST0000-osc-MDT0000@10.1.5.50@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[10510.979720] LustreError: 2905:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
[10511.211582] Lustre: lustre-MDT0000: Not available for connect from 10.1.5.50@tcp (stopping)
[10511.211584] Lustre: lustre-MDT0000: Not available for connect from 10.1.5.50@tcp (stopping)
[10511.215851] Lustre: Skipped 1 previous similar message
[10516.205426] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.1.5.50@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[10516.209156] LustreError: Skipped 4 previous similar messages
[10519.155134] Lustre: 31667:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437627119/real 1437627119]  req@ffff88006a10f000 x1507451119138576/t0(0) o251->MGC10.1.5.59@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1437627125 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[10519.160540] Lustre: 31667:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[10519.252293] Lustre: server umount lustre-MDT0000 complete
[10519.415565] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[10524.606842] Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
[10524.850875] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[10525.212232] Lustre: DEBUG MARKER: mkfs.lustre --mgs --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=200000 --reformat /dev/lvm-Role_MDS/P1
[10525.598654] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro
[10533.175997] Lustre: DEBUG MARKER: running=$(grep -c /mnt/mds1' ' /proc/mounts);
[10533.175997] mpts=$(mount | grep -c /mnt/mds1' ');
[10533.175997] if [ $running -ne $mpts ]; then
[10533.175997]     echo $(hostname) env are INSANE!;
[10533.175997]     exit 1;
[10533.175997] fi
[10533.586791] Lustre: DEBUG MARKER: running=$(grep -c /mnt/mds1' ' /proc/mounts);
[10533.586791] mpts=$(mount | grep -c /mnt/mds1' ');
[10533.586791] if [ $running -ne $mpts ]; then
[10533.586791]     echo $(hostname) env are INSANE!;
[10533.586791]     exit 1;
[10533.586791] fi
[10534.926722] Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
[10535.227879] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
[10535.549478] Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/mds1
[10535.790731] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro
[10535.917518] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
[10535.963581] Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
[10535.965404] Lustre: Skipped 5 previous similar messages
[10536.128614] Lustre: 32326:0:(osd_internal.h:1085:osd_trans_exec_check()) op 3: used 8, used now 8, reserved 4
[10536.130569] Lustre: 32326:0:(osd_internal.h:1085:osd_trans_exec_check()) Skipped 2 previous similar messages
[10536.132675] Luse7
[10536.164238]  [<ffffffffa0634ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
[10536.164238]  [<ffffffffa0c0f81e>] osd_trans_exec_check.part.91+0x1a/0x1a [osd_ldiskfs]
[10536.164238]  [<ffffffffa0be88c7>] osd_object_ea_create+0x927/0xb80 [osd_ldiskfs]
[10536.164238]  [<ffffffffa077afa8>] dt_find_or_create+0x598/0x8d0 [obdclass]
[10536.164238]  [<ffffffffa0614e20>] fld_index_init+0x1a0/0xbb0 [fld]
[10536.164238]  [<ffffffffa0612565>] fld_server_init+0xa5/0x3e0 [fld]
[10536.164238]  [<ffffffffa0da7cf4>] mdt_init0+0x4f4/0x12b0 [mdt]
[10536.164238]  [<ffffffffa0da8b29>] mdt_device_alloc+0x79/0x110 [mdt]
[10536.164238]  [<ffffffffa075ba84>] obd_setup+0x114/0x2a0 [obdclass]
[10536.164238]  [<ffffffffa075cb77>] class_setup+0x2f7/0x8d0 [obdclass]
[10536.164238]  [<ffffffffa0764c96>] class_process_config+0x1c36/0x2db0 [obdclass]
[10536.164238]  [<ffffffff812de5d9>] ? simple_strtoul+0x9/0x10
[10536.164238]  [<ffffffffa0768fb0>] ? target_name2index+0x90/0xc0 [obdclass]
[10536.164238]  [<ffffffff811acbe3>] ? __kmalloc+0x1f3/0x230
[10536.164238]  [<ffffffffa075e6bb>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
[10536.164238]  [<ffffffffa07666bd>] class_config_llog_handler+0x8ad/0x1d20 [obdclass]
[10536.164238]  [<ffffffffa0727345>] llog_process_thread+0x5f5/0x1020 [obdclass]
[10536.164238]  [<ffffffffa07287c0>] ? llog_backup+0x500/0x500 [obdclass]
[10536.164238]  [<ffffffffa072880c>] llog_process_thread_daemonize+0x4c/0x80 [obdclass]
[10536.164238]  [<ffffffff8109739f>] kthread+0xcf/0xe0
[10536.164238]  [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
[10536.164238]  [<ffffffff81615018>] ret_from_fork+0x58/0x90
[10536.164238]  [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
[10536.164238] drm_kms_helper: panic occurred, switching back to text console
[10536.164238] ------------[ cut here ]------------
[10536.164238] kernel BUG at arch/x86/mm/pageattr.c:216!
[10536.164238] invalid opcode: 0000 [#1] SMP 
[10536.164238] Modules linked in: osp(OF) mdd(OF) lod(OF) mdt(OF) lfsck(OF) mgs(OF) mgc(OF) osd_ldiskfs(OF) lquota(OF) fid(OF) fld(OF) ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) sha512_generic libcfs(OF) ldiskfs(OF) dm_mod nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ppdev ib_mad parport_pc virtio_balloon serio_raw pcspkr i2c_piix4 parport ib_core ib_addr ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect virtio_blk sysimgblt drm_kms_helper ttm ata_piix 8139too virtio_pci virtio_ring virtio drm 8139cp i2c_core mii libata floppy [last unloaded: obdecho]
[10536.164238] CPU: 1 PID: 32326 Comm: llog_process_th Tainted: GF       W  O--------------   3.10.0-229.7.2.el7_lustre.gfd6f11c.x86_64 #1
[10536.164238] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[10536.164238] task: ffff8800366416c0 ti: ffff88007870c000 task.ti: ffff88007870c000
[10536.164238] RIP: 0010:[<ffffffff8105c2ef>]  [<ffffffff8105c2ef>] change_page_attr_set_clr+0x4ef/0x500
[10536.164238] RSP: 0018:ffff88007870f070  EFLAGS: 00010046
[10536.164238] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000010
[10536.164238] RDX: 0000000000002000 RSI: 0000000000000000 RDI: 0000000080000000
[10536.164238] RBP: ffff88007870f108 R08: 0000000000000004 R09: 000000000006b390
[10536.164238] R10: 0000000000003689 R11: 0000000000000004 R12: 0000000000000010
[10536.164238] R13: 0000000000000000 R14: 0000000000000200 R15: 0000000000000005
[10536.164238] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[10536.164238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10536.164238] CR2: 00007feffbdcf000 CR3: 000000000190e000 CR4: 00000000000006e0
[10536.164238] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10536.164238] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10536.164238] Stack:
[10536.164238]  000000045a5a5a5a ffff880000000000 0000000000000000 ffff88006bf4a000
[10536.164238]  ffff8800366416c0 0000000000000000 0000000000000000 0000000000000010
[10536.164238]  0000000000000000 0000000500000001 000000000006b390 0000020000000000
[10536.164238] Call Trace:
[10536.164238]  [<ffffffff8105c646>] _set_pages_array+0xe6/0x130
[10536.164238]  [<ffffffff8105c6c3>] set_pages_array_wc+0x13/0x20
[10536.164238]  [<ffffffffa01253af>] ttm_set_pages_caching+0x2f/0x70 [ttm]
[10536.164238]  [<ffffffffa01254f4>] ttm_alloc_new_pages.isra.7+0xb4/0x180 [ttm]
[10536.164238]  [<ffffffffa0125e50>] ttm_pool_populate+0x3e0/0x500 [ttm]
[10536.164238]  [<ffffffffa013c32e>] cirrus_ttm_tt_populate+0xe/0x10 [cirrus]
[10536.164238]  [<ffffffffa01226dd>] ttm_bo_move_memcpy+0x65d/0x6e0 [ttm]
[10536.164238]  [<ffffffff8118f73e>] ? map_vm_area+0x2e/0x40
[10536.164238]  [<ffffffffa011e2c9>] ? ttm_tt_init+0x69/0xb0 [ttm]
[10536.164238]  [<ffffffffa013c2d8>] cirrus_bo_move+0x18/0x20 [cirrus]
[10536.164238]  [<ffffffffa011fde5>] ttm_bo_handle_move_mem+0x265/0x5b0 [ttm]
[10536.164238]  [<ffffffff81601a64>] ? __slab_free+0x10e/0x277
[10536.164238]  [<ffffffffa012074a>] ? ttm_bo_mem_space+0x10a/0x310 [ttm]
[10536.164238]  [<ffffffffa0120e17>] ttm_bo_validate+0x247/0x260 [ttm]
[10536.164238]  [<ffffffff81059e69>] ? iounmap+0x79/0xa0
[10536.164238]  [<ffffffff81050000>] ? kgdb_arch_late+0x80/0x180
[10536.164238]  [<ffffffffa013cac2>] cirrus_bo_push_sysram+0x82/0xe0 [cirrus]
[10536.164238]  [<ffffffffa013ac84>] cirrus_crtc_do_set_base.isra.8.constprop.10+0x84/0x430 [cirrus]
[10536.164238]  [<ffffffffa013b479>] cirrus_crtc_mode_set+0x449/0x4d0 [cirrus]
[10536.164238]  [<ffffffffa00db939>] drm_crtc_helper_set_mode+0x2e9/0x520 [drm_kms_helper]
[10536.164238]  [<ffffffffa00dc6bf>] drm_crtc_helper_set_config+0x87f/0xaa0 [drm_kms_helper]
[10536.164238]  [<ffffffffa009c711>] drm_mode_set_config_internal+0x61/0xe0 [drm]
[10536.164238]  [<ffffffffa00e3e83>] restore_fbdev_mode+0xb3/0xe0 [drm_kms_helper]
[10536.164238]  [<ffffffffa00e4045>] drm_fb_helper_force_kernel_mode+0x75/0xb0 [drm_kms_helper]
[10536.164238]  [<ffffffffa00e4d59>] drm_fb_helper_panic+0x29/0x30 [drm_kms_helper]
[10536.164238]  [<ffffffff81610a6c>] notifier_call_chain+0x4c/0x70
[10536.164238]  [<ffffffff81610aca>] atomic_notifier_call_chain+0x1a/0x20
[10536.164238]  [<ffffffff815fea4c>] panic+0x106/0x1e7
[10536.164238]  [<ffffffffa0634ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
[10536.164238]  [<ffffffffa0c0f81e>] osd_trans_exec_check.part.91+0x1a/0x1a [osd_ldiskfs]
[10536.164238]  [<ffffffffa0be88c7>] osd_object_ea_create+0x927/0xb80 [osd_ldiskfs]
[10536.164238]  [<ffffffffa077afa8>] dt_find_or_create+0x598/0x8d0 [obdclass]
[10536.164238]  [<ffffffffa0614e20>] fld_index_init+0x1a0/0xbb0 [fld]
[10536.164238]  [<ffffffffa0612565>] fld_server_init+0xa5/0x3e0 [fld]
[10536.164238]  [<ffffffffa0da7cf4>] mdt_init0+0x4f4/0x12b0 [mdt]
[10536.164238]  [<ffffffffa0da8b29>] mdt_device_alloc+0x79/0x110 [mdt]
[10536.164238]  [<ffffffffa075ba84>] obd_setup+0x114/0x2a0 [obdclass]
[10536.164238]  [<ffffffffa075cb77>] class_setup+0x2f7/0x8d0 [obdclass]
[10536.164238]  [<ffffffffa0764c96>] class_process_config+0x1c36/0x2db0 [obdclass]
[10536.164238]  [<ffffffff812de5d9>] ? simple_strtoul+0x9/0x10
[10536.164238]  [<ffffffffa0768fb0>] ? target_name2index+0x90/0xc0 [obdclass]
[10536.164238]  [<ffffffff811acbe3>] ? __kmalloc+0x1f3/0x230
[10536.164238]  [<ffffffffa075e6bb>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
[10536.164238]  [<ffffffffa07666bd>] class_config_llog_handler+0x8ad/0x1d20 [obdclass]
[10536.164238]  [<ffffffffa0727345>] llog_process_thread+0x5f5/0x1020 [obdclass]
[10536.164238]  [<ffffffffa07287c0>] ? llog_backup+0x500/0x500 [obdclass]
[10536.164238]  [<ffffffffa072880c>] llog_process_thread_daemonize+0x4c/0x80 [obdclass]
[10536.164238]  [<ffffffff8109739f>] kthread+0xcf/0xe0
[10536.164238]  [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
[10536.164238]  [<ffffffff81615018>] ret_from_fork+0x58/0x90
[10536.164238]  [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
[10536.164238] Code: ba 00 00 00 48 c7 c7 78 9a 81 81 89 8d 78 ff ff ff 44 89 45 80 e8 a2 21 01 00 8b 8d 78 ff ff ff 44 8b 45 80 e9 99 fd ff ff 0f 0b <0f> 0b 0f 0b 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 
[10536.164238] RIP  [<ffffffff8105c2ef>] change_page_attr_set_clr+0x4ef/0x500
[10536.164238]  RSP <ffff88007870f070>
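For context on the crash above: the warning at osd_internal.h:1085 ("op 3: used 8, used now 8, reserved 4") means the create operation consumed more journal credits than the transaction had declared for it, and osd_trans_exec_check() then hits an LBUG, which panics the node. A minimal, self-contained sketch of that invariant, with hypothetical names (the real check lives in lustre/osd-ldiskfs/osd_internal.h and differs in detail):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-op credit accounting, modeled on the log message
 * "op 3: used 8, used now 8, reserved 4". Not the real Lustre code. */
struct op_credits {
        int reserved;   /* credits declared before the transaction started */
        int used;       /* credits actually consumed so far */
};

/* Abort (the kernel analogue is LBUG(), which panics the node) when an
 * operation consumes more journal credits than were declared for it. */
static void trans_exec_check(struct op_credits *oc, int op, int used_now)
{
        oc->used += used_now;
        if (oc->used > oc->reserved) {
                fprintf(stderr, "op %d: used %d, used now %d, reserved %d\n",
                        op, oc->used, used_now, oc->reserved);
                abort();
        }
}

int main(void)
{
        /* Reproduce the shape of this failure: 4 credits declared for
         * the create op, but the EA object create actually needed 8. */
        struct op_credits create = { .reserved = 4, .used = 0 };
        trans_exec_check(&create, 3, 8);
        return 0;
}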

Comment by Gerrit Updater [ 23/Jul/15 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15694
Subject: LU-6827 osd: trans credit insufficient for EA object create
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dbbc1abe98a28ae1bc84deca3a8d4d747e78e8c9
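Judging from the patch subject, the declare phase under-reserved journal credits when an object create also has to write an extended attribute (EA) that spills out of the inode. A hedged, illustrative sketch of the declare-side idea; names and constants here are hypothetical, not Lustre's (see the Gerrit change above for the actual fix):

/* Hypothetical declare-side sketch: also reserve credits for the
 * external xattr block that a large-EA create may need, instead of
 * only the inode/bitmap updates. Constants are illustrative. */
enum {
        CREDITS_INODE_CREATE = 4,   /* inode + bitmap + group descriptor */
        CREDITS_XATTR_BLOCK  = 4,   /* one external EA block + its update */
        IN_INODE_EA_SPACE    = 120, /* bytes of EA that fit in the inode */
};

static int declare_ea_object_create(int ea_size)
{
        int credits = CREDITS_INODE_CREATE;

        /* A large EA spills into a separate xattr block, which the
         * transaction must also be able to journal. */
        if (ea_size > IN_INODE_EA_SPACE)
                credits += CREDITS_XATTR_BLOCK;

        return credits; /* passed to the journal start as the reservation */
}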

Comment by Zhenyu Xu [ 23/Jul/15 ]

Would you mind trying this patch?

Comment by Sarah Liu [ 23/Jul/15 ]

Sure, I will update the ticket when I have results.

Comment by Sarah Liu [ 24/Jul/15 ]

I reran the tests twice, and both runs passed with the patch.

Comment by Gerrit Updater [ 03/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15694/
Subject: LU-6827 osd: trans credit insufficient for EA object create
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2aa5f604fba4f01a2c433f1d4cd77ae688e35e61

Comment by Peter Jones [ 03/Aug/15 ]

Landed for 2.8
