[LU-11809] conf-sanity test 28A hangs on file system mount Created: 18/Dec/18 Updated: 13/Jan/20 Resolved: 05/Feb/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ubuntu |
| Environment: | Ubuntu 18.04 clients/RHEL 7.6 servers |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
conf-sanity test_28A hangs due to a client problem mounting the file system. We are only seeing this issue when testing with Ubuntu 18.04 clients. Looking at the test suite hang at https://testing.whamcloud.com/test_sets/d8edfc48-fdd4-11e8-a97c-52540065bddc , the client test_log is empty. Looking at the Client 1 (vm1) console log, we see the issue:

[ 2626.155389] Lustre: DEBUG MARKER: == conf-sanity test 28A: permanent parameter setting ================================================= 02:34:15 (1544495655)
[ 2626.772885] Lustre: Lustre: Build Version: 2.12.0_RC2
[ 2626.836890] LNet: Added LNI 10.9.4.224@tcp [8/256/0/180]
[ 2626.842963] LNet: Accept all, port 7988
[ 2628.417559] Lustre: 3169:0:(gss_svc_upcall.c:1199:gss_init_svc_upcall()) Init channel is not opened by lsvcgssd, following request might be dropped until lsvcgssd is active
[ 2628.419207] Key type lgssc registered
[ 2628.519146] Lustre: Echo OBD driver; http://www.lustre.org/
[ 2639.110774] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
[ 2639.121362] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock trevis-19vm4@tcp:/lustre /mnt/lustre
[ 2650.553084] Lustre: Mounted lustre-client
[ 2651.652270] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre-*.max_read_ahead_whole_mb
[ 2651.665867] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre-*.max_read_ahead_whole_mb
[ 2652.000321] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre-*.max_read_ahead_whole_mb
[ 2652.011374] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre-*.max_read_ahead_whole_mb
[ 2653.024019] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre-*.max_read_ahead_whole_mb
[ 2653.764392] BUG: unable to handle kernel paging request at 0000000080d48269
[ 2653.765275] IP: class_process_config+0x1cf8/0x27b0 [obdclass]
[ 2653.765911] PGD 0 P4D 0
[ 2653.766218] Oops: 0000 [#1] SMP PTI
[ 2653.766619] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd input_leds joydev mac_hid serio_raw sch_fq_codel sunrpc ip_tables x_tables autofs4 psmouse virtio_blk floppy 8139too 8139cp mii pata_acpi i2c_piix4 [last unloaded: libcfs]
[ 2653.771057] CPU: 1 PID: 3729 Comm: llog_process_th Tainted: G W OE 4.15.0-32-generic #35-Ubuntu
[ 2653.772039] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 2653.772668] RIP: 0010:class_process_config+0x1cf8/0x27b0 [obdclass]
[ 2653.773327] RSP: 0018:ffffb4fd4247bc58 EFLAGS: 00010246
[ 2653.773892] RAX: 0000000080d47e61 RBX: ffff9d7f744c0880 RCX: 0000000000000000
[ 2653.774624] RDX: 0000000000000018 RSI: ffffffffc0768d04 RDI: ffff9d7f744c08c6
[ 2653.775363] RBP: ffffb4fd4247bd08 R08: 00000000ffffffff R09: 0000000000000024
[ 2653.776103] R10: ffffffffc07594b0 R11: f000000000000000 R12: ffffffffc0759680
[ 2653.776838] R13: ffffffffc0761b50 R14: 0000000000000000 R15: ffffb4fd4247bd40
[ 2653.777571] FS: 0000000000000000(0000) GS:ffff9d7fbfd00000(0000) knlGS:0000000000000000
[ 2653.778413] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2653.779026] CR2: 0000000080d48269 CR3: 0000000010a0a002 CR4: 00000000000606e0
[ 2653.779769] Call Trace:
[ 2653.780100] ? libcfs_debug_msg+0x50/0x70 [libcfs]
[ 2653.780624] ? libcfs_debug_msg+0x50/0x70 [libcfs]
[ 2653.781174] class_config_llog_handler+0x7cb/0x14c0 [obdclass]
[ 2653.781828] llog_process_thread+0x651/0x1580 [obdclass]
[ 2653.782413] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
[ 2653.783071] kthread+0x121/0x140
[ 2653.783461] ? llog_backup+0x4d0/0x4d0 [obdclass]
[ 2653.783981] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2653.784534] ret_from_fork+0x35/0x40
[ 2653.784951] Code: 8b 40 38 48 85 c0 0f 84 e5 08 00 00 48 89 da be 20 00 00 00 4c 89 ef e8 b7 41 ed c9 41 89 c4 e9 d9 f4 ff ff 48 8b 85 58 ff ff ff <48> 8b 80 08 04 00 00 48 8b 50 10 81 3a 03 bd ac bd 0f 85 87 06
[ 2653.786833] RIP: class_process_config+0x1cf8/0x27b0 [obdclass] RSP: ffffb4fd4247bc58
[ 2653.787628] CR2: 0000000080d48269
[ 0.439151] Kernel panic - not syncing: Out of memory and no killable processes...

Logs for other hangs are at |
| Comments |
| Comment by Peter Jones [ 18/Dec/18 ] |
|
James, not sure if this is related to the other tickets you've just picked up... Peter |
| Comment by Andreas Dilger [ 18/Dec/18 ] |
|
This looks to be related to |
| Comment by Gerrit Updater [ 20/Dec/18 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33900 |
| Comment by Li Xi [ 20/Dec/18 ] |
|
The instance name of a Lustre client, like "$fsname-ffff88002738bc00", doesn't feel like a meaningful name. The pointer (ffff88002738bc00) looks like a random number and changes on every remount, so there is no consistent name for a Lustre client. This is inconvenient when writing scripts. I am wondering whether the instance name can be changed to something like "$fsname-$mount_point". Maybe, in order to avoid problems, we need to replace "/" in the mount point with "-" or similar. But at least, for a given mount point, we could then generate the client name easily. |
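The mountpoint-based naming proposed above can be sketched roughly as follows. This is a hypothetical illustration of the proposal in this comment, not anything that exists in the Lustre tree; the function name and naming scheme are assumptions:

```python
def instance_name(fsname, mount_point):
    """Hypothetical mountpoint-based client instance name, per the
    proposal above: replace '/' in the mount point with '-' so the
    result is a single flat token."""
    # strip any trailing slash so "/mnt/lustre/" and "/mnt/lustre" agree
    mp = mount_point.rstrip("/") or "/"
    return "%s-%s" % (fsname, mp.replace("/", "-"))

print(instance_name("lustre", "/mnt/lustre"))  # lustre--mnt-lustre
```

Note the double dash: "/mnt/lustre" begins with "/", which itself becomes "-". That is one of the ambiguities the follow-up comments point at.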
| Comment by Andreas Dilger [ 20/Dec/18 ] |
|
The "ffff888..." part of the name is not meant to be meaningful, it is meant to be unique. If we use something like the mountpoint, what happens if, e.g., two containers both mount the same filesystem on the same internal mountpoint? What happens if the filesystem is moved to a new location with "mount --bind /mnt/lustre /mnt/testfs; umount /mnt/testfs"? I think it would be possible to use something other than the superblock pointer to be unique per client (e.g. a PRNG value stored in the superblock, the client mountpoint UUID, etc.), and something like ll_sb_uuid might make it easier in userspace to match the /sys/fs/lustre/llite entries to a specific mountpoint. Another option is to include the client UUID in the mount options shown in /proc/mounts, like: 10.0.2.15@tcp:/testfs /mnt/testfs2 lustre rw,seclabel,uuid=b893c52a-995a-d164-5dc6-46e03330e73e 0 though this might mean that we accept "uuid=" as a client mount option (which is maybe silently discarded, because we don't want users to be able to specify non-unique UUID values for the mounts; we would just print it to identify the mountpoint). |
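If a "uuid=" option were exported in /proc/mounts as suggested, matching a mountpoint to its UUID from userspace could look roughly like this. This is a sketch only: the uuid= option is a proposal in this comment, not a merged feature, and the sample line is the one quoted above:

```python
def mount_uuid(procmounts_text, mountpoint):
    """Return the uuid= mount option for a mountpoint, scanning text in
    /proc/mounts format (device mountpoint fstype options ...)."""
    for line in procmounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            for opt in fields[3].split(","):
                if opt.startswith("uuid="):
                    return opt[len("uuid="):]
    return None

sample = ("10.0.2.15@tcp:/testfs /mnt/testfs2 lustre "
          "rw,seclabel,uuid=b893c52a-995a-d164-5dc6-46e03330e73e 0 0")
print(mount_uuid(sample, "/mnt/testfs2"))
# b893c52a-995a-d164-5dc6-46e03330e73e
```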
| Comment by James A Simmons [ 20/Dec/18 ] |
|
I have been playing with a patch that uses the UUID. In my patch I removed the dashes so it prints as one number. Should we keep the dashes? |
| Comment by Andreas Dilger [ 20/Dec/18 ] |
|
I was thinking the same. Unfortunately, the dashes make the name hard to parse. In my current patch I change llapi_getname() from assuming a fixed 16-char instance number at the end to taking everything after the last dash. As an alternative, I could explicitly scan for " In any case, that kind of change isn't something to be done at the last minute. For 2.12 we need to stick with a single hex digit, and we can look at changing it in 2.13 when there is a much longer testing window. The only thing that might be changed in my current patch is to make llapi_getname() more tolerant of what we might introduce in the future. |
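The parsing change described here, splitting on the last dash rather than assuming a fixed-width instance suffix, can be sketched like this. The real change is in the C function llapi_getname(); this is an illustrative Python equivalent under that assumption:

```python
def split_instance(devname):
    """Split an instance name like 'lustre-ffff88002738bc00' into
    (fsname, instance). Taking everything after the LAST dash means
    fsnames that themselves contain dashes still parse correctly."""
    fsname, sep, instance = devname.rpartition("-")
    if not sep:
        raise ValueError("no instance suffix in %r" % devname)
    return fsname, instance

print(split_instance("lustre-ffff88002738bc00"))
# ('lustre', 'ffff88002738bc00')
print(split_instance("my-fs-ffff88002738bc00"))
# ('my-fs', 'ffff88002738bc00')
```

Splitting on the last dash also stays valid if a future instance token is shorter or longer than 16 hex characters, which is the tolerance the comment is after.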
| Comment by Gerrit Updater [ 21/Dec/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33900/ |