[LU-4485]  Some error message on lustre client Created: 14/Jan/14  Updated: 27/Feb/14  Resolved: 27/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: None

Type: Task
Priority: Major
Reporter: Supporto Lustre Jnet2000 (Inactive)
Assignee: Niu Yawei (Inactive)
Resolution: Not a Bug
Votes: 0
Labels: None
Environment:

operating system redhat 5.7
lustre 1.8.7


Attachments: File messages    
Rank (Obsolete): 12276

 Description   

On a client we initially had a quota problem: "kernel: LustreError: 11-0: an error occurred while communicating with 10.121.13.59@tcp. The ost_write operation failed with -122". After that we got some error messages whose meaning we do not understand. Do you have any suggestions?

Regards

Augusto Casciola



 Comments   
Comment by Peter Jones [ 14/Jan/14 ]

Niu

Could you please advise with this ticket?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 14/Jan/14 ]

The message means "write failed with EDQUOT (out of quota)"; it looks like some user is over quota.
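
The negative return codes in these Lustre messages are Linux errno values, so they can be cross-checked with a small lookup. This is a minimal sketch assuming the standard Linux numbering; the table covers only the two codes that appear in this ticket, and the helper name `decode_rc` is illustrative, not part of Lustre.

```python
# Linux errno values seen in this ticket (assumption: standard Linux numbering).
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable"),
    122: ("EDQUOT", "Disk quota exceeded"),
}


def decode_rc(rc):
    """Translate a negative return code from a Lustre message into an errno name.

    Hypothetical helper for illustration only.
    """
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this small table"))
    return f"{rc}: {name} ({text})"


print(decode_rc(-122))  # the ost_write failure
print(decode_rc(-11))   # the rc reported for the cancel RPC
```

On a Linux host, Python's standard `errno.errorcode` mapping could replace the hand-written table.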

Comment by Niu Yawei (Inactive) [ 14/Jan/14 ]

You can use "lfs quota -u uid/gid -v fsname" to check the quota limit and usage for the user. If the user is not over quota, could you upload the syslog from the OSS (10.121.13.59@tcp) so we can see why it returned EDQUOT?
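
The quota check could be scripted roughly as follows. This is only a sketch: it assumes the `lfs` utility is on the PATH of a Lustre client and that the file system argument is given as the client mount point. The user name `jdoe`, the mount point `/mnt/home`, and both helper names are placeholders, not values from this ticket.

```python
import subprocess


def quota_command(user, mount_point):
    """Build the argv for a verbose per-user quota report (hypothetical helper)."""
    return ["lfs", "quota", "-v", "-u", str(user), mount_point]


def show_quota(user, mount_point):
    """Run the quota command and return its raw output; needs a Lustre client."""
    result = subprocess.run(
        quota_command(user, mount_point),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the command here; call show_quota() on a real client.
    print(" ".join(quota_command("jdoe", "/mnt/home")))
```

For a group quota, `-g` with a group name or gid takes the place of `-u`.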

Comment by Supporto Lustre Jnet2000 (Inactive) [ 14/Jan/14 ]

In the attached file we have these errors after the quota error. Are they related to the quota?

"2014-01-07T17:11:40.337576+01:00 osiride-lp-041 kernel: Lustre: 23101:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1436582400258712 sent from home-OST0008-osc-ffff81063fc2f800 to NID 10.121.13.28@tcp 7s ago has timed out (7s prior to deadline).
2014-01-07T17:11:40.337583+01:00 osiride-lp-041 kernel: req@ffff810806324c00 x1436582400258712/t0 o103->home-OST0008_UUID@10.121.13.28@tcp:17/18 lens 312/384 e 0 to 1 dl 1389111100 ref 2 fl Rpc:N/0/0 rc 0/0
2014-01-07T17:11:40.337589+01:00 osiride-lp-041 kernel: Lustre: home-OST0008-osc-ffff81063fc2f800: Connection to service home-OST0008 via nid 10.121.13.28@tcp was lost; in progress operations using this service will wait for recovery to complete.
2014-01-07T17:11:40.337595+01:00 osiride-lp-041 kernel: LustreError: 23101:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
2014-01-07T17:11:40.337601+01:00 osiride-lp-041 kernel: LustreError: 23101:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
2014-01-07T17:11:40.866529+01:00 osiride-lp-041 kernel: Lustre: 7801:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1436582400258725 sent from home-OST0008-osc-ffff81063fc2f800 to NID 10.121.13.28@tcp 7s ago has timed out (7s prior to deadline).
2014-01-07T17:11:40.866542+01:00 osiride-lp-041 kernel: req@ffff8107ddac0800 x1436582400258725/t0 o13->home-OST0008_UUID@10.121.13.28@tcp:7/4 lens 192/528 e 0 to 1 dl 1389111100 ref 2 fl Rpc:/0/0 rc 0/0
2014-01-07T17:11:41.263604+01:00 osiride-lp-041 kernel: Lustre: home-OST000b-osc-ffff81063fc2f800: Connection to service home-OST000b via nid 10.121.13.28@tcp was lost; in progress operations using this service will wait for recovery to complete.
2014-01-07T17:11:41.263618+01:00 osiride-lp-041 kernel: failure to allocate a tage (491)
2014-01-07T17:11:41.263624+01:00 osiride-lp-041 kernel: LustreError: 7780:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
2014-01-07T17:11:41.263628+01:00 osiride-lp-041 kernel: LustreError: 7780:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
2014-01-07T17:11:41.566994+01:00 osiride-lp-041 kernel: Lustre: 23093:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1436582400258713 sent from home-OST0007-osc-ffff81063fc2f800 to NID 10.121.13.28@tcp 8s ago has timed out (8s prior to deadline).
2014-01-07T17:11:41.567020+01:00 osiride-lp-041 kernel: req@ffff8109d781c400 x1436582400258713/t0 o103->home-OST0007_UUID@10.121.13.28@tcp:17/18 lens 304/384 e 0 to 1 dl 1389111101 ref 2 fl Rpc:N/0/0 rc 0/0
2014-01-07T17:11:41.567025+01:00 osiride-lp-041 kernel: Lustre: 23093:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
2014-01-07T17:11:41.785397+01:00 osiride-lp-041 kernel: failure to allocate a tage (9)
2014-01-07T17:11:41.785412+01:00 osiride-lp-041 kernel: Lustre: home-OST0006-osc-ffff81063fc2f800: Connection to service home-OST0006 via nid 10.121.13.28@tcp was lost; in progress operations using this service will wait for recovery to complete.
2014-01-07T17:11:41.785416+01:00 osiride-lp-041 kernel: Lustre: Skipped 3 previous similar messages
2014-01-07T17:11:41.785421+01:00 osiride-lp-041 kernel: LustreError: 17973:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
2014-01-07T17:11:41.785426+01:00 osiride-lp-041 kernel: LustreError: 17973:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Skipped 7 previous similar messages
2014-01-07T17:11:41.785430+01:00 osiride-lp-041 kernel: LustreError: 17973:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
2014-01-07T17:11:41.785434+01:00 osiride-lp-041 kernel: LustreError: 17973:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) Skipped 7 previous similar messages
2014-01-07T17:11:41.814197+01:00 osiride-lp-041 kernel: failure to allocate a tage (18)"

Comment by Supporto Lustre Jnet2000 (Inactive) [ 14/Jan/14 ]

Sorry for the misunderstanding. We need to know why the client lost the connection to home-OST0008, home-OST0007 and home-OST0006, and what the "failure to allocate a tage" error means.
Thanks

Comment by Niu Yawei (Inactive) [ 15/Jan/14 ]

The "failure to allocate a tage" message means the Lustre logging system could not allocate a buffer to store a debug message, so some debug messages will be lost. It won't break the connection between the client and the OSTs.

So the client lost its connections to OST0006, OST0007 and OST0008, and you want to know why?
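
To see which targets the client actually reported as lost, the "Connection to service ... was lost" lines in the syslog can be grouped by server NID. The sketch below assumes the message format quoted in this ticket; the function name and the regular expression are illustrative, and the sample lines are copied from the log excerpt with the timestamp and host name removed.

```python
import re
from collections import defaultdict

# Matches the client-side "was lost" message format quoted in this ticket.
LOST_RE = re.compile(r"Connection to service (\S+) via nid (\S+) was lost")


def lost_connections(lines):
    """Group services reported as lost by the NID they were reached through."""
    by_nid = defaultdict(list)
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            service, nid = match.groups()
            if service not in by_nid[nid]:
                by_nid[nid].append(service)
    return dict(by_nid)


sample = [
    "kernel: Lustre: home-OST0008-osc-ffff81063fc2f800: Connection to service home-OST0008 via nid 10.121.13.28@tcp was lost; in progress operations using this service will wait for recovery to complete.",
    "kernel: failure to allocate a tage (491)",
    "kernel: Lustre: home-OST000b-osc-ffff81063fc2f800: Connection to service home-OST000b via nid 10.121.13.28@tcp was lost; in progress operations using this service will wait for recovery to complete.",
    "kernel: Lustre: home-OST0006-osc-ffff81063fc2f800: Connection to service home-OST0006 via nid 10.121.13.28@tcp was lost; in progress operations using this service will wait for recovery to complete.",
]

for nid, services in lost_connections(sample).items():
    print(nid, services)
```

In the quoted excerpt every lost service is reached through the same NID, 10.121.13.28@tcp. Because the kernel also logs "Skipped N previous similar messages", a list built this way may be incomplete.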

Comment by Supporto Lustre Jnet2000 (Inactive) [ 11/Feb/14 ]

Hi, we want to know why the client lost its connection to OST0006, OST0007 and OST0008.

Regards

Comment by Niu Yawei (Inactive) [ 14/Feb/14 ]

2014-01-07T17:11:57.274407+01:00 osiride-lp-041 kernel: Lustre: 7567:0:(import.c:517:import_select_connection()) home-OST000a-osc-ffff81063fc2f800: tried all connections, increasing latency to 3s

I suspect it's a network problem, not related to the write failures (-122, EDQUOT).

Comment by Gabriele Paciucci (Inactive) [ 27/Feb/14 ]

I have talked with the customer and we agreed that this is a network problem. We can close this issue. If similar errors occur again, we can enable the debug daemon to collect more information.

Comment by Peter Jones [ 27/Feb/14 ]

ok - thanks Gabriele

Generated at Sat Feb 10 01:43:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.