[LU-458] silence excess 1.8 error messages Created: 24/Jun/11  Updated: 25/May/12  Resolved: 25/May/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8, Lustre 1.8.6
Fix Version/s: Lustre 1.8.8

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Story Points: 2
Rank (Obsolete): 9739

 Description   

Some CERROR(...) or CWARN() messages clutter up the syslog and should be changed to CDEBUG(D_*, ...).

Ideally, it should be possible to mount then unmount Lustre under normal usage without getting a screen full of messages.

Simply turning off all of the messages is NOT an acceptable solution for all of them. Please try to make changes to both b1_8 and master in a similar manner where possible.

2011-06-23 14:54:31 LustreError: 9730:0:(mds_open.c:1693:mds_close()) @@@ no handle for file close ino 121636742: cookie 0x0 req@ffff812001bdf000 x1372347904386655/t0 o35->c339861d-0c0a-b9ee-a39d-9cfe591452c1@NET_0x50000c0a87223_UUID:0/0 lens 408/864 e 0 to 0 dl 1308866077 ref 1 fl Interpret:/0/0 rc 0/0

Happens many times for each evicted client, but nothing that can be done about it by administrator. This is already fixed on master to use CDEBUG(D_INFO, ...).

LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode 63486047

Should be quieted to CDEBUG(D_INODE,) for the -ENOENT case, since this can happen with racing "rm -r" vs. "rm -r" or "ls -l".

LustreError: 11-0: an error occurred while communicating with 172.16.x.x@tcp. The ost_write operation failed with -28

LustreError: 11-0: an error occurred while communicating with 172.23.68.8@tcp. The mds_getattr_lock operation failed with -13

The "-28" (-ENOSPC), "-13" (-EACCES), and "-2" (-EPERM) shouldn't print an error on the client console. However, in this case the problem isn't on the client, but rather because the server is returning an RPC with PTL_RPC_MSG_ERR set. The PTL_RPC_MSG_ERR flag should only be used for cases where there is an error in the RPC handling that prevented the server from even executing the RPC, and NOT for the case where the RPC was processed correctly but returned an error (e.g. -EPERM or -EACCESS or -ENOSPC). Fixing these requires looking into the MDS/OSS code and seeing where the server is returning rc != 0 to the handler, or calling ptlrpc_error() (except in the case where it is not possible to pack a reply message). Some of these may already be fixed on master.

Lustre: myth-OST0002: haven't heard from client c06c5b51-02f2-e84d-3aaa-8cc820badd80 (at 192.168.20.159@tcp) in 249 seconds. I think it's dead, and I am evicting it.

The exp_client_uuid should be kept in a static "last_uuid" string, and if the same UUID is evicted by another target on this node it doesn't need to be printed to the console again, only to debug logs. Use CDEBUG_LIMIT(D_CONSOLE | (mask), ...).

Just looking through https://maloo.whamcloud.com/test_logs/eb3fd5de-98f0-11e0-9a27-52540025f9af to see what messages are printed on every mount, and what can be removed:

Lustre: OBD class driver, http://www.lustre.org/

Lustre: Lustre Version: 1.8.6

Lustre: Build Version: jenkins-wc1--PRISTINE-2.6.32-131.2.1.el6.x86_64

This can just use the "#ifdef CRAY_XT3" version and print "Lustre: Build Version: "BUILD_VERSION"\n", and a separate project that Brian is working on will fix the build version string.

Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1

Please ask Liang how important this is. Maybe it shouldn't be printed if it is 0xffff...?

Lustre: Lustre Client File System; http://www.lustre.org/

Seems redundant with the message in obdclass.

LustreError: 152-6: Ignoring deprecated mount option 'acl'.

Why do our test scripts specify a mount option in local.sh that is no longer useful?

Lustre: Client lustre-client has started

Lustre: client ffff880410c8f000 umount complete

It would be good to make these messages consistent with each other, like:

Lustre: client lustre-client (fff880410c8f00) mount complete
:
:
Lustre: client lustre-client (fff880410c8f00) unmount complete



 Comments   
Comment by Peter Jones [ 04/May/12 ]

http://review.whamcloud.com/#change,2381

Comment by Peter Jones [ 04/May/12 ]

Yangsheng/Andreas

Is this same change needed for b2_1 and master also?

Thanks

Peter

Comment by Andreas Dilger [ 04/May/12 ]

I previously landed a patch which quieted a number of similar messages for 2.2 (2.1.54+) via LU-459 http://review.whamcloud.com/1876. It would probably be a good idea to land this for 2.1 also.

Comment by Jian Yu [ 15/May/12 ]

While testing Lustre 1.8.8-wc1 RC1, I found the following messages in the dmesg logs of client nodes:

Lustre: client ZZZZZZZZZZZZZZ�^�5(ffff88007d18cc00) umount complete
......
Lustre: client ZZZZZZZZZZZZZZ(ffff8800768b5c00) umount complete
......
Lustre: client ZZZZZZZZZZZZZZ(ffff880076ccfc00) umount complete

https://maloo.whamcloud.com/test_logs/e1da5dca-9d59-11e1-8587-52540035b04c/show_text

This is a regression introduced by change http://review.whamcloud.com/#change,2381.

Comment by Jian Yu [ 16/May/12 ]

Patch on b1_8 branch to fix the above issue is in http://review.whamcloud.com/2799.

Comment by Jian Yu [ 21/May/12 ]

Patch on b1_8 branch to fix the above issue is in http://review.whamcloud.com/2799.

Unfortunately, the above patch does not really fix the issue. What's more, it introduces another defect that the client profile is not deleted properly.

00000020:00000001:12:1337579915.876844:0:23445:0:(obd_mount.c:1735:lustre_common_put_super()) Process entered
<snip>
00000020:00000001:12:1337579915.876876:0:23445:0:(obd_mount.c:1204:lustre_free_lsi()) Process entered
00000020:01000004:12:1337579915.876878:0:23445:0:(obd_mount.c:1207:lustre_free_lsi()) Freeing lsi ffff810410c100c0
00000020:00000010:12:1337579915.876882:0:23445:0:(obd_mount.c:1218:lustre_free_lsi()) kfreed 'lsi->lsi_lmd->lmd_dev': 24 at ffff810413c7dc00.
00000020:00000010:12:1337579915.876886:0:23445:0:(obd_mount.c:1221:lustre_free_lsi()) kfreed 'lsi->lsi_lmd->lmd_profile': 14 at ffff810413c7dc20.
00000020:00000010:12:1337579915.876891:0:23445:0:(obd_mount.c:1224:lustre_free_lsi()) kfreed 'lsi->lsi_lmd->lmd_opts': 17 at ffff810413c7dbe0.
00000020:00000010:12:1337579915.876895:0:23445:0:(obd_mount.c:1229:lustre_free_lsi()) kfreed 'lsi->lsi_lmd': 56 at ffff810415e37680.
00000020:00000010:12:1337579915.876899:0:23445:0:(obd_mount.c:1234:lustre_free_lsi()) kfreed 'lsi': 112 at ffff810410c100c0.
00000020:00000001:12:1337579915.876903:0:23445:0:(obd_mount.c:1237:lustre_free_lsi()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:12:1337579915.876906:0:23445:0:(obd_mount.c:1252:lustre_put_lsi()) Process leaving (rc=1 : 1 : 1)
00000020:00000001:12:1337579915.876909:0:23445:0:(obd_mount.c:1752:lustre_common_put_super()) Process leaving (rc=18446744073709551600 : -16 : fffffffffffffff0)
00000080:02000400:12:1337579915.876914:0:23445:0:(llite_lib.c:1237:ll_put_super()) client ZZZZZZZZZZZZZZ(ffff810416aa3c00) umount complete
00000004:00000001:1:1337579915.878315:0:21724:0:(handler.c:1859:mds_update_server_data()) Process entered
00000004:00000004:1:1337579915.878319:0:21724:0:(handler.c:1862:mds_update_server_data()) MDS mount_count is 1, last_transno is 4294967296
00000020:00000001:12:1337579915.882613:0:23445:0:(obd_config.c:717:class_del_profile()) Process entered
00000020:01000000:12:1337579915.882616:0:23445:0:(obd_config.c:719:class_del_profile()) Del profile ZZZZZZZZZZZZZZ
00000020:00000001:12:1337579915.882618:0:23445:0:(obd_config.c:657:class_get_profile()) Process entered
00000020:00000001:12:1337579915.882621:0:23445:0:(obd_config.c:663:class_get_profile()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:12:1337579915.882624:0:23445:0:(obd_config.c:730:class_del_profile()) Process leaving
00000080:00000001:12:1337579915.882627:0:23445:0:(llite_lib.c:1244:ll_put_super()) Process leaving

In ll_put_super(), the memory space pointed to by profilenm is in fact freed inside lustre_common_put_super(sb), which is called before LCONSOLE_WARN().

The new patch on b1_8 branch is in http://review.whamcloud.com/2841.

Comment by Peter Jones [ 25/May/12 ]

Landed for 1.8.8-wc1

Generated at Sat Feb 10 01:07:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.