[LU-1028] Bus error (core dumped) during fsx test Created: 24/Jan/12  Updated: 16/Apr/13  Resolved: 17/Feb/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: Lustre 2.2.0, Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Mikhail Pershin Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-2305 Test failure sanityn, test_16: fsx bu... Resolved
is related to LU-884 Client In-Memory Data Checksum Resolved
Severity: 3
Rank (Obsolete): 4251

 Description   

This bug appeared after commit e8ffe16619baf1ef7c5c6b117d338956372aa752, "LU-884 clio: client in memory checksum".
Unfortunately, our tests did not show the fsx failure in sanity-benchmark, and sanity-benchmark is not part of every autotest run.

The issue looks like the following:
/usr/lib64/lustre/tests/sanity-benchmark.sh: line 186: 16826 Bus error (core dumped) fsx -c 50 -p 1000 -S $FSX_SEED -P $TMP -l $FSX_SIZE -N $(($FSX_COUNT * 100)) $testfile

Example of report (master):
https://maloo.whamcloud.com/test_sets/02485a3a-45d0-11e1-8d6e-5254004bbbd3

sanity-benchmark is reported green, but fsx failed as shown above. Recently the same code was landed to orion and we started experiencing the same issue.



 Comments   
Comment by Mikhail Pershin [ 24/Jan/12 ]

The bug appeared after LU-884 was landed.

Comment by Andreas Dilger [ 27/Jan/12 ]

I've submitted http://review.whamcloud.com/2007 to re-enable fsx in sanityn.sh and also check the result of fsx in performance-sanity.sh.

Comment by Jinshan Xiong (Inactive) [ 28/Jan/12 ]

I pushed a patch for a quick fix at: http://review.whamcloud.com/2037

Comment by Peter Jones [ 02/Feb/12 ]

jinshan

Can you please rebase your patch to the tip of master to pick up the fix from LU-1048?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 08/Feb/12 ]

Hi Andreas,

Can you please take a look at this problem?

The current situation is: patch set 4 works well, but it can cause more grant to be allocated to clients. Lost grant used to be returned to the target, but with this patch lost_grant is returned to the client itself.

Di and Johann thought this was not good, so they suggested using lost_grant only when we actually need it. However, there is a race that makes this not work. Assume the application creates many dirty pages so that avail_grant is used up, then truncates them, and then an urgent read RPC arrives that carries lost_grant back to the server. While that read RPC is in flight, avail_grant, cl_dirty and cl_w_in_flight are all zero, so if a page mkwrite tries to allocate grant during that window it will certainly get -EDQUOT.

Checking cl_r_in_flight won't fix this problem either, because there is another race in osc_send_oap_rpc(): the grant information is packed and cleared in osc_build_rpc() and only then is cl_r_in_flight updated, so there is a window in which both lost_grant and cl_r_in_flight are zero.

Comment by Andreas Dilger [ 09/Feb/12 ]

Given that Jinshan is reworking the CLIO RPC engine for 2.3, I'm content to have a minimal fix for 2.2 that allows fsx to run without problems. I think that real world workloads will not hit the problems that fsx is seeing, so that should be enough for now.

Comment by Andreas Dilger [ 10/Feb/12 ]

I guess it wasn't clear from my previous comment that "minimal fix" means "use the lost_grant" and see where that gets us.

Comment by Jinshan Xiong (Inactive) [ 10/Feb/12 ]

There are two ways of using lost_grant:
1. if a dirty page is truncated, return its grant to the available grant. This causes more grant to be allocated to the client, but fsx can pass;
2. use lost_grant when there is no available grant in osc_enter_cache(). This seems cleaner because it doesn't change the grant protocol; however, fsx can still run into trouble because of the race I mentioned above.

From your previous comment I assumed you preferred the 1st solution because you want fsx to pass (this is what I did in the 6th patch set). It seems I understood it wrong and you actually want the 2nd solution; please confirm, thanks.

Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,server,el5,ofa #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,client,el5,ofa #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Peter Jones [ 17/Feb/12 ]

Landed for 2.2

Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #479
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = SUCCESS
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,ofa #480
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = FAILURE
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,ofa #480
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = FAILURE
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,client,el6,ofa #480
LU-1028 osc: fix grant checking on the osc side (Revision 0204171fd3e1b393c53bd374aff228e80080a55a)

Result = ABORTED
Oleg Drokin : 0204171fd3e1b393c53bd374aff228e80080a55a
Files :

  • lustre/osc/osc_request.c
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanity-benchmark.sh
  • lustre/tests/sanityn.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanity-benchmark.sh
  • lustre/tests/sanityn.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanityn.sh
  • lustre/tests/sanity-benchmark.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanityn.sh
  • lustre/tests/sanity-benchmark.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » i686,server,el5,ofa #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanityn.sh
  • lustre/tests/sanity-benchmark.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanityn.sh
  • lustre/tests/sanity-benchmark.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanityn.sh
  • lustre/tests/sanity-benchmark.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » i686,client,el5,ofa #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanity-benchmark.sh
  • lustre/tests/sanityn.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanityn.sh
  • lustre/tests/sanity-benchmark.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,ofa #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanityn.sh
  • lustre/tests/sanity-benchmark.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanity-benchmark.sh
  • lustre/tests/sanityn.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanityn.sh
  • lustre/tests/sanity-benchmark.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanity-benchmark.sh
  • lustre/tests/sanityn.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » i686,server,el6,ofa #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanity-benchmark.sh
  • lustre/tests/sanityn.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanityn.sh
  • lustre/tests/sanity-benchmark.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » i686,client,el6,ofa #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanity-benchmark.sh
  • lustre/tests/sanityn.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,ofa #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanity-benchmark.sh
  • lustre/tests/sanityn.sh
Comment by Build Master (Inactive) [ 23/Feb/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #487
LU-1028 tests: re-enable sanityn.sh test_16 (fsx) (Revision df764443d452c1db1db5e72f72c9ad6e0819f567)

Result = SUCCESS
Oleg Drokin : df764443d452c1db1db5e72f72c9ad6e0819f567
Files :

  • lustre/tests/sanity-benchmark.sh
  • lustre/tests/sanityn.sh
Comment by Andreas Dilger [ 04/Jun/12 ]

I'm still able to reproduce this problem in local testing if sanityn.sh test_15() and test_16() both run with smaller OSTs that do not cause test_15 to be skipped.

Comment by Jinshan Xiong (Inactive) [ 15/Aug/12 ]

This is also a grant issue: when it happens, all osc.cur_{dirty|grant|lost_grant}_bytes are zero.

139896832 bytes (140 MB) copied, 182.781 s, 765 kB/s
Success!
osc.lustre-OST0000-osc-ffff880105ae6000.cur_dirty_bytes=0
osc.lustre-OST0000-osc-ffff88020f12b400.cur_dirty_bytes=0
osc.lustre-OST0001-osc-ffff880105ae6000.cur_dirty_bytes=0
osc.lustre-OST0001-osc-ffff88020f12b400.cur_dirty_bytes=0
osc.lustre-OST0000-osc-ffff880105ae6000.cur_grant_bytes=0
osc.lustre-OST0000-osc-ffff88020f12b400.cur_grant_bytes=0
osc.lustre-OST0001-osc-ffff880105ae6000.cur_grant_bytes=0
osc.lustre-OST0001-osc-ffff88020f12b400.cur_grant_bytes=0
osc.lustre-OST0000-osc-ffff880105ae6000.cur_lost_grant_bytes=0
osc.lustre-OST0000-osc-ffff88020f12b400.cur_lost_grant_bytes=0
osc.lustre-OST0001-osc-ffff880105ae6000.cur_lost_grant_bytes=0
osc.lustre-OST0001-osc-ffff88020f12b400.cur_lost_grant_bytes=0
Resetting fail_loc on all nodes...done.
PASS 15 (198s)

== sanityn test 16: 2500 iterations of dual-mount fsx == 14:39:45 (1345066785)

So we hit the same issue; a temporary fix is to delete all the test files from test_15 and write some bytes in sync mode so that more grant can be allocated.

Comment by Bruno Faccini (Inactive) [ 14/Sep/12 ]

We also trigger the same situation (osc.cur_{dirty|grant|lost_grant}_bytes = 0) on a CEA test system running our/Bull build of Lustre v2.1.2. This build/version integrates LU-1299 (patch set 11) and the ORNL-22 patches, but not the one for this LU-1028.

The very bad news/consequence, which does not clearly appear in this JIRA's comments, is that this situation causes file corruption on the affected clients: applications/cmds are still allowed to write into the cache while the later asynchronous flushes never succeed (-EDQUOT), and this happens silently and out of context.

As a work-around, we are also able to recover grants (and the associated mechanism starts working again) by writing synchronous/O_DIRECT I/Os to the affected OSTs/OSCs. But this problem is a showstopper for the customer to migrate to v2.1.2, since there is always a timing window where corruption can occur.

Comment by Jinshan Xiong (Inactive) [ 14/Sep/12 ]

Hi Bruno, there is no ->page_mkwrite() method yet in v2.1.2, so yes, there is an issue: writing via mmap won't reserve grant, so the later flush will fail if there is not enough space on the OST. As a result, an error flag is set in the address_space and subsequent IO will see this error (this is a common problem for async writes, since the application may already have exited and never see the error at all). Have you ever seen the OST running out of space while an application writes via mmap?

BTW, if the OST is really out of space, the grants can't be recovered by issuing an O_DIRECT IO.

There is a report about data corruption which involves all series of clients - 1.8, 2.1 and 2.3. There were huge changes in the client code between 1.8 and 2.x, but the grant mechanism has never changed. I suspect there may be some problem with it.

Do you have a steady way to reproduce this problem?

Comment by Bruno Faccini (Inactive) [ 15/Sep/12 ]

Hello Jinshan, thanks for your quick answer!

Even if we don't know how we get into such a situation, I don't think we are hitting the mmap case you described. Now that we are aware of this problem, we frequently monitor "/proc/fs/lustre/osc/<OST-import>/cur_{dirty|grant|lost_grant}_bytes" on each client node, and at some point we find some of them with all 3 counters at zero while the OST is far from being full and other clients still have grants for it!

On these clients we are able to successfully create a file limited to only one of these OSTs/stripes (lfs setstripe) and then easily re-read it from the same client. But when we try to access it from another client, the file appears empty with size 0, and from then on it does from its creator/original client node as well!

If we enable full traces, we can see the -EDQUOT errno returned from osc_enter_cache(), which, I think, seems to happen during lock revocation and the subsequent page-flush mechanism ...
So, in my view, it rather looks like some scenario or I/O sequence is able to break the grant flow/mechanism.

And again, when the problem is present on a client, all 3 "cur_{dirty|grant|lost_grant}_bytes" stay at zero, and we can demonstrate the same failure multiple times. The only way back to normal that we know of is to run a small program doing O_DIRECT writes to each affected OST.

Will try to get you the related info/traces from the site soon.

Comment by Andreas Dilger [ 15/Sep/12 ]

Hi Bruno, could you please file a separate bug for your "cur_grant is zero" issue, and copy the last comments over? This bug is closed and was filed specifically for the fsx core dump problem; your issue may be different. It is easier to track and prioritize if you file a new bug.

In addition to Jinshan, Johann should be CC'd on the bug, since he was working on the grant code recently.

Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ]

If we enable full traces, we can see the -EDQUOT errno returned from osc_enter_cache(), which, I think, seems to happen during lock revocation and the subsequent page-flush mechanism ...

osc_enter_cache() is called to add a page into the OSC write cache. When this OSC runs out of grant, -EDQUOT is returned to the caller (ll_commit_write() in llite), and then a SYNC IO is issued, so more grant should be returned by the OST. Please take a look at the log again to verify that this process happened.

In addition to Jinshan, Johann should be CC'd on the bug, since he was working on the grant code recently.

Yes, I agree with Andreas that a new ticket should be created and I will add Johann in the cc list.

Comment by Bruno Faccini (Inactive) [ 16/Sep/12 ]

No problem, I understand. I will file a new Jira next week with all the related info.

Comment by Andreas Dilger [ 08/Nov/12 ]

I saw this recently in https://maloo.whamcloud.com/test_sets/235473d8-d8b7-11e1-9e3b-52540035b04c

Comment by Andreas Dilger [ 08/Nov/12 ]

Actually, this is happening a LOT for what appears to be a wide striping test at IU:
https://maloo.whamcloud.com/test_sets/7114751c-d0c6-11e1-8d8f-52540035b04c

I'm going to file a new bug.

Generated at Sat Feb 10 01:12:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.