[LU-582] 1.8<->2.1 interop: sanity test_132: FAIL: some glimpse RPC is expected Created: 09/Aug/11  Updated: 09/Apr/13  Resolved: 20/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0, Lustre 2.1.2, Lustre 2.1.3, Lustre 1.8.8, Lustre 1.8.6
Fix Version/s: Lustre 2.4.0, Lustre 2.1.5

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

Old Lustre Version: 1.8.6-wc1
Lustre Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/

New Lustre Version: 2.0.66.0
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/228/

Clean upgrading (Lustre servers and clients were upgraded all at once) from Lustre 1.8.6-wc1 to Lustre 2.0.66.0 under the following configuration:

OSS1: RHEL5/x86_64
OSS2: RHEL5/x86_64
MDS: RHEL5/x86_64
Client1: RHEL6/x86_64
Client2: RHEL5/x86_64


Attachments: Text File sanity.test_132.debug_log.client-7.1344970176.log     Text File sanity.test_132.debug_log.fat-intel-2.1344970176.log    
Severity: 3
Bugzilla ID: 23339
Rank (Obsolete): 4234

 Description   

After the upgrade, sanity test 132 failed on Lustre 2.0.66.0 as follows:

== sanity test 132: som avoids glimpse rpc == 03:17:56 (1312885076)
====> SOM is disabled, 0 glimpse RPC occured
 sanity test_132: @@@@@@ FAIL: some glimpse RPC is expected 

Please refer to the Maloo report for more logs: https://maloo.whamcloud.com/test_sets/a570d34e-c278-11e0-8bdf-52540025f9af

This is a known issue: bug 23339.
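
For context, test_132 decides pass/fail by sampling the ldlm_glimpse_enqueue counter in the OSS stats before and after stat-ing the test file (see the get_ost_param helper quoted later in this ticket). The check can be reproduced by hand roughly as follows; this is a minimal sketch, assuming a single OSS and a client mount at the hypothetical path /mnt/lustre:

# on the OSS: sample the glimpse counter (second field is the request count)
before=$(lctl get_param -n ost.OSS.ost.stats | awk '/ldlm_glimpse_enqueue/ {print $2}')

# on the client: force a size lookup (hypothetical test file)
stat /mnt/lustre/f132

# on the OSS: sample again and compare
after=$(lctl get_param -n ost.OSS.ost.stats | awk '/ldlm_glimpse_enqueue/ {print $2}')
echo "glimpse RPCs during stat: $((after - before))"

With SOM disabled the test expects this delta to be greater than zero, hence the "some glimpse RPC is expected" failure message when it reads 0.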



 Comments   
Comment by James A Simmons [ 23/Aug/11 ]

This is also failing with Lustre 2.1 clients.

== sanity test 132: som avoids glimpse rpc =========================================================== 12:59:06 (1314118746)
====> SOM is disabled, 0 glimpse RPC occured
sanity test_132: @@@@@@ FAIL: some glimpse RPC is expected
Dumping lctl log to /tmp/test_logs//1314118691/sanity.test_132.*.1314118756.log

Comment by James A Simmons [ 23/Aug/11 ]

Uploaded logs for the failure from one of my clients to your FTP site. They are in uploads/LU-582.

Comment by James A Simmons [ 23/Jan/12 ]

Sanity test 132 no longer fails. You can close this ticket.

Comment by Peter Jones [ 23/Jan/12 ]

Thanks James!

Comment by James A Simmons [ 08/Feb/12 ]

Peter, can you reopen this ticket?

Comment by Peter Jones [ 08/Feb/12 ]

James

Can you please confirm that the 1.8.x client that you are testing with is at least version 1.8.7-wc1 (not the Oracle 1.8.7)?

Thanks

Peter

Comment by James A Simmons [ 08/Feb/12 ]

Originally this ticket reported an interop problem, but I noted that it also fails with 2.X clients. It stopped failing for a while, but now it fails again with Lustre 2.X clients. Perhaps this ticket should be retitled.

Comment by Peter Jones [ 08/Feb/12 ]

James

I think that a new ticket would be simpler/clearer.

Peter

Comment by Andreas Dilger [ 17/May/12 ]

This is failing 16% of the time currently, with 2.6.18 clients vs. 2.6.32 servers:

https://maloo.whamcloud.com/test_sets/35a3f450-9faa-11e1-b416-52540035b04c

Comment by Jian Yu [ 07/Jun/12 ]

Old Lustre Version: 1.8.8-wc1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/198/

New Lustre Version: 2.1.2
Lustre Tag: v2_1_2_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/86/
Network: TCP (1GigE)

After clean upgrading with the following configuration, sanity test 132 still failed:

OSS1: RHEL5/x86_64
OSS2: RHEL5/x86_64
MDS: RHEL5/x86_64
Client1: RHEL6/x86_64
Client2: RHEL6/x86_64
client-1: == sanity test 132: som avoids glimpse rpc == 04:43:30 (1339069410)
client-1: ====> SOM is disabled, 0 glimpse RPC occured
client-1:  sanity test_132: @@@@@@ FAIL: some glimpse RPC is expected
client-1: Dumping lctl log to /home/yujian/test_logs/2012-06-07/043300/sanity.test_132.*.1339069414.log
client-1: FAIL 132 (5s)

Maloo report: https://maloo.whamcloud.com/test_sets/c250c2d6-b0d3-11e1-99ce-52540035b04c

Comment by Jay Lan (Inactive) [ 20/Jun/12 ]

Is there a new ticket for 2.x client?
I ran into this problem yesterday. Both server and client are 2.1.2.

Comment by Jian Yu [ 14/Aug/12 ]

After clean upgrading from Lustre 1.8.8-wc1 to 2.1.3 RC1, the issue occurred again:
https://maloo.whamcloud.com/test_sets/57d3cf2c-e673-11e1-afac-52540035b04c

Comment by Hongchao Zhang [ 15/Aug/12 ]

There is indeed a glimpse request during the call to 'stat $DIR/$tfile'.

In client-7:

00000001:00010000:2.0:1344970176.071156:0:12030:0:(glimpse.c:120:cl_glimpse_lock()) Glimpsing inode [0x20b2735b0:0x1:0x0]
00000020:00010000:2.0:1344970176.071177:0:12030:0:(cl_lock.c:143:cl_lock_trace0()) enqueue lock: ffff8802e6a12938@(2 ffff8802decb8ae0 1 0 0 1 1 0)(ffff880331731d80/0/1) at cl_enqueue_try():1189
00000020:00010000:2.0:1344970176.071182:0:12030:0:(cl_lock.c:143:cl_lock_trace0()) enclosure lock: ffff8802e6a12af8@(1 (null) 0 0 0 1 1 0)(ffff880331731d80/1/1) at cl_lock_enclosure():1685
00000020:00010000:2.0:1344970176.071186:0:12030:0:(cl_lock.c:143:cl_lock_trace0()) enclosure lock: ffff8802e6a12938@(2 ffff8802decb8ae0 1 1 0 1 1 0)(ffff880331731d80/0/2) at cl_lock_enclosure():1685
00000020:00010000:2.0:1344970176.071190:0:12030:0:(cl_lock.c:143:cl_lock_trace0()) enqueue lock: ffff8802e6a12af8@(2 ffff8802decb8ae0 1 0 0 1 1 0)(ffff880331731aa0/1/0) at cl_enqueue_try():1189

In fat-intel-2:

00010000:00000001:13.0:1344970176.078486:0:7796:0:(ldlm_lockd.c:1065:ldlm_handle_enqueue0()) Process entered
00010000:00010000:13.0:1344970176.078487:0:7796:0:(ldlm_lockd.c:1067:ldlm_handle_enqueue0()) ### server-side enqueue handler START
00010000:00000001:13.0:1344970176.078489:0:7796:0:(ldlm_lockd.c:1419:ldlm_request_cancel()) Process entered
00010000:00000001:13.0:1344970176.078489:0:7796:0:(ldlm_lockd.c:1423:ldlm_request_cancel()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:13.0:1344970176.078491:0:7796:0:(ldlm_lock.c:1245:ldlm_lock_create()) Process entered
00010000:00000010:13.0:1344970176.078494:0:7796:0:(ldlm_resource.c:983:ldlm_resource_new()) slab-alloced '(res)': 320 at ffff8103058d1940.
00002000:00000001:13.0:1344970176.078496:0:7796:0:(filter_lvb.c:76:filter_lvbo_init()) Process entered
00002000:00000010:13.0:1344970176.078498:0:7796:0:(filter_lvb.c:84:filter_lvbo_init()) kmalloced 'lvb': 40 at ffff810320fdd880.
00002000:00000002:13.0:1344970176.078500:0:7796:0:(filter_lvb.c:96:filter_lvbo_init()) lustre-OST0000: filter_lvbo_init(o_seq=0, o_id=70152)
00002000:00000001:13.0:1344970176.078502:0:7796:0:(filter.c:1474:filter_fid2dentry()) Process entered
00002000:00000002:13.0:1344970176.078504:0:7796:0:(filter.c:1499:filter_fid2dentry()) looking up object O/d8/70152
00002000:00000002:13.0:1344970176.078507:0:7796:0:(filter.c:1518:filter_fid2dentry()) got child objid 70152: ffff810320f62660, count = 1
00002000:00000001:13.0:1344970176.078509:0:7796:0:(filter.c:1522:filter_fid2dentry()) Process leaving (rc=18446604449170728544 : -139624538823072 : ffff810320f62660)
00002000:00010000:13.0:1344970176.078511:0:7796:0:(filter_lvb.c:116:filter_lvbo_init()) res: 0x11208 initial lvb size: 0x200, mtime: 0x502a9dbf, blocks: 0x8
00002000:00000001:13.0:1344970176.078514:0:7796:0:(filter_lvb.c:120:filter_lvbo_init()) Process leaving
00002000:00000002:13.0:1344970176.078514:0:7796:0:(filter.c:221:f_dput()) putting 70152: ffff810320f62660, count = 0
00010000:00000001:13.0:1344970176.078516:0:7796:0:(ldlm_lock.c:413:ldlm_lock_new()) Process entered
00010000:00000010:13.0:1344970176.078518:0:7796:0:(ldlm_lock.c:418:ldlm_lock_new()) slab-alloced '(lock)': 560 at ffff810305f58240.
00000020:00000001:13.0:1344970176.078520:0:7796:0:(lustre_handles.c:88:class_handle_hash()) Process entered
00000020:00000040:13.0:1344970176.078521:0:7796:0:(lustre_handles.c:122:class_handle_hash()) added object ffff810305f58240 with handle 0xfa6adb28e6af43de to hash
00000020:00000001:13.0:1344970176.078524:0:7796:0:(lustre_handles.c:123:class_handle_hash()) Process leaving
00010000:00000001:13.0:1344970176.078525:0:7796:0:(ldlm_lock.c:455:ldlm_lock_new()) Process leaving (rc=18446604448717701696 : -139624991849920 : ffff810305f58240)
00010000:00000001:13.0:1344970176.078528:0:7796:0:(ldlm_extent.c:806:ldlm_interval_alloc()) Process entered
00010000:00000010:13.0:1344970176.078530:0:7796:0:(ldlm_extent.c:809:ldlm_interval_alloc()) slab-alloced '(node)': 72 at ffff81030d7ceac0.
00010000:00000001:13.0:1344970176.078532:0:7796:0:(ldlm_extent.c:815:ldlm_interval_alloc()) Process leaving (rc=18446604448844016320 : -139624865535296 : ffff81030d7ceac0)
00010000:00000001:13.0:1344970176.078536:0:7796:0:(ldlm_lock.c:1284:ldlm_lock_create()) Process leaving (rc=18446604448717701696 : -139624991849920 : ffff810305f58240)
00010000:00010000:13.0:1344970176.078539:0:7796:0:(ldlm_lockd.c:1152:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: filter-lustre-OST0000_UUID lock: ffff810305f58240/0xfa6adb28e6af43de lrc: 2/0,0 mode: -/PR res: 70152/0 rrc: 1 type: EXT [0->0] (req 0>0) flags: 0x0 remote: 0xeb07c50dcc271870 expref: -99 pid: 7796 timeout 0
00000020:00000040:13.0:1344970176.078546:0:7796:0:(genops.c:1064:__class_export_add_lock_ref()) lock = ffff810305f58240, export = ffff81030c5bbc00, refs = 1
00010000:00000040:13.0:1344970176.078548:0:7796:0:(ldlm_lockd.c:1164:ldlm_handle_enqueue0()) lock GETting export ffff81030c5bbc00 : new locks_count 16
00000020:00000040:13.0:1344970176.078550:0:7796:0:(genops.c:782:class_export_get()) GETting export ffff81030c5bbc00 : new refcount 22
00010000:00000001:13.0:1344970176.078553:0:7796:0:(ldlm_lock.c:1302:ldlm_lock_enqueue()) Process entered
00002000:00000001:13.0:1344970176.078554:0:7796:0:(filter.c:1696:filter_intent_policy()) Process entered

So filter_intent_policy() is called!
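
(For reference, debug traces like the ones above can be captured with the standard lctl debug facilities; a rough sketch, assuming the default debug daemon setup and a hypothetical file path:)

lctl set_param debug="+dlmtrace +rpctrace"   # enable DLM/RPC tracing on client and OSS
lctl clear                                   # reset the kernel debug buffer
stat /mnt/lustre/f132                        # run the operation under test (hypothetical path)
lctl dk > /tmp/lu582.debug.log               # dump the buffer for inspection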

Comment by Hongchao Zhang [ 15/Aug/12 ]

Attached the logs of the client and the OST.

Comment by Hongchao Zhang [ 16/Aug/12 ]

Yujian helped reproduce the bug after upgrading Lustre, and the ldlm_glimpse_enqueue counter does indeed increase.

The debug patch is tracked at http://review.whamcloud.com/#change,3692.

Hi Yujian, could you please test with the debug patch? Thanks!

Comment by Jian Yu [ 16/Aug/12 ]

the debug patch is tracked at http://review.whamcloud.com/#change,3692

The real patch is in http://review.whamcloud.com/#change,3693. Testing is ongoing.

Comment by Hongchao Zhang [ 17/Aug/12 ]

There is a bug in get_ost_param:

get_ost_param() {
    local token=$1
    local gl_sum=0
    for node in $(osts_nodes); do
        gl=$(do_node $node "$LCTL get_param -n ost.OSS.ost.stats" |
            awk '/'$token'/ {print $2}' | head -n 1)
        [ x$gl = x"" ] && gl=0
        gl_sum=$((gl_sum + gl))
    done
    echo $gl    <--- here should be "echo $gl_sum"!
}

Previously there was only one OSS in autotest, so the problem was hidden; in the upgrade test there are two OSSes, so the issue shows up. The updated patch will be attached soon!
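
For clarity, the corrected helper would look roughly like this (a sketch of the fix described above; the actual change is in http://review.whamcloud.com/#change,3693):

get_ost_param() {
    local token=$1
    local gl_sum=0
    local gl
    for node in $(osts_nodes); do
        # take the counter value (second field) from this OSS's stats
        gl=$(do_node $node "$LCTL get_param -n ost.OSS.ost.stats" |
            awk '/'$token'/ {print $2}' | head -n 1)
        [ x$gl = x"" ] && gl=0
        gl_sum=$((gl_sum + gl))
    done
    echo $gl_sum    # return the sum across all OSSes, not just the last value read
}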

Comment by Hongchao Zhang [ 17/Aug/12 ]

The patch has been updated.

Comment by Jay Lan (Inactive) [ 31/Dec/12 ]

I saw this problem between 2.1.3 server and 2.3.0 client. Let me know if you want the test_logs.

Comment by Stephen Champion [ 07/Feb/13 ]

I've been using http://review.whamcloud.com/#change,3693 for
several months, and recommend it for b2_1.
