Storage: Reject updates that would exceed database limits #2370

Open
advanceboy opened this issue Jan 29, 2025 · 5 comments

@advanceboy

advanceboy commented Jan 29, 2025

Description of the Problem

When roughly 1,000,000 XML files (total size: 38GiB) are added with the add command, a java.lang.RuntimeException ("Data Access out of bounds") is thrown.

Expected Behavior

  • Step (2) should complete successfully.
  • Step (4) should complete successfully.

Actual Behavior

  • In step (2), adding dir2 and dir3 results in the following error:

    Improper use? Potential bug? Your feedback is welcome:
    Contact: [email protected]
    Version: BaseX 11.6
    Java: Microsoft, 21.0.5
    OS: Windows 11, amd64
    Stack Trace:
    java.lang.RuntimeException: Data Access out of bounds:
    - pre value: 1098571872
    - table size: -2101426251
    - first/next pre value: -2101426432/-2101426251
    - #total/used pages: 8568520/8568520
    - accessed page: 8568519 (8568520 > 8568519]
            at org.basex.util.Util.notExpected(Util.java:68)
            at org.basex.io.random.TableDiskAccess.cursor(TableDiskAccess.java:477)
            at org.basex.io.random.TableDiskAccess.read4(TableDiskAccess.java:172)
            at org.basex.data.Data.id(Data.java:302)
            at org.basex.data.Data.insert(Data.java:838)
            at org.basex.query.up.atomic.Insert.apply(Insert.java:44)
            at org.basex.query.up.atomic.AtomicUpdateCache.applyUpdates(AtomicUpdateCache.java:291)
            at org.basex.query.up.atomic.AtomicUpdateCache.execute(AtomicUpdateCache.java:275)
            at org.basex.core.cmd.Add.lambda$run$0(Add.java:62)
            at org.basex.core.cmd.ACreate.update(ACreate.java:90)
            at org.basex.core.cmd.Add.run(Add.java:56)
            at org.basex.core.Command.run(Command.java:233)
            at org.basex.core.Command.execute(Command.java:93)
            at org.basex.api.client.LocalSession.execute(LocalSession.java:131)
            at org.basex.api.client.Session.execute(Session.java:36)
            at org.basex.core.CLI.execute(CLI.java:94)
            at org.basex.core.CLI.execute(CLI.java:78)
            at org.basex.core.CLI.execute(CLI.java:65)
            at org.basex.BaseX.<init>(BaseX.java:82)
            at org.basex.BaseX.main(BaseX.java:44)
    
    Improper use? Potential bug? Your feedback is welcome:
    Contact: [email protected]
    Version: BaseX 11.6
    Java: Microsoft, 21.0.5
    OS: Windows 11, amd64
    Stack Trace:
    java.lang.RuntimeException: Data Access out of bounds:
    - pre value: -2101432987
    - table size: -2101426251
    - first/next pre value: -256/0
    - #total/used pages: 8568520/8568520
    - accessed page: 2147483647 (0 > -2]
            at org.basex.util.Util.notExpected(Util.java:68)
            at org.basex.io.random.TableDiskAccess.cursor(TableDiskAccess.java:477)
            at org.basex.io.random.TableDiskAccess.read4(TableDiskAccess.java:172)
            at org.basex.data.Data.size(Data.java:356)
            at org.basex.data.NSNode.find(NSNode.java:128)
            at org.basex.data.Namespaces.uriIdForPrefix(Namespaces.java:170)
            at org.basex.data.Namespaces.root(Namespaces.java:245)
            at org.basex.data.NSScope.loop(NSScope.java:44)
            at org.basex.data.Data.insert(Data.java:782)
            at org.basex.query.up.atomic.Insert.apply(Insert.java:44)
            at org.basex.query.up.atomic.AtomicUpdateCache.applyUpdates(AtomicUpdateCache.java:291)
            at org.basex.query.up.atomic.AtomicUpdateCache.execute(AtomicUpdateCache.java:275)
            at org.basex.core.cmd.Add.lambda$run$0(Add.java:62)
            at org.basex.core.cmd.ACreate.update(ACreate.java:90)
            at org.basex.core.cmd.Add.run(Add.java:56)
            at org.basex.core.Command.run(Command.java:233)
            at org.basex.core.Command.execute(Command.java:93)
            at org.basex.api.client.LocalSession.execute(LocalSession.java:131)
            at org.basex.api.client.Session.execute(Session.java:36)
            at org.basex.core.CLI.execute(CLI.java:94)
            at org.basex.core.CLI.execute(CLI.java:78)
            at org.basex.core.CLI.execute(CLI.java:65)
            at org.basex.BaseX.<init>(BaseX.java:82)
            at org.basex.BaseX.main(BaseX.java:44)
    
  • In step (4), adding dir1 results in the following error:

    Improper use? Potential bug? Your feedback is welcome:
    Contact: [email protected]
    Version: BaseX 11.6
    Java: Microsoft, 21.0.5
    OS: Windows 11, amd64
    Stack Trace:
    java.lang.RuntimeException: Data Access out of bounds:
    - pre value: 1252171726
    - table size: -1943453255
    - first/next pre value: -1943453440/-1943453255
    - #total/used pages: 9185602/9185602
    - accessed page: 9185601 (9185602 > 9185601]
            at org.basex.util.Util.notExpected(Util.java:68)
            at org.basex.io.random.TableDiskAccess.cursor(TableDiskAccess.java:477)
            at org.basex.io.random.TableDiskAccess.read4(TableDiskAccess.java:172)
            at org.basex.data.Data.id(Data.java:302)
            at org.basex.data.Data.insert(Data.java:838)
            at org.basex.query.up.atomic.Insert.apply(Insert.java:44)
            at org.basex.query.up.atomic.AtomicUpdateCache.applyUpdates(AtomicUpdateCache.java:291)
            at org.basex.query.up.atomic.AtomicUpdateCache.execute(AtomicUpdateCache.java:275)
            at org.basex.core.cmd.Add.lambda$run$0(Add.java:62)
            at org.basex.core.cmd.ACreate.update(ACreate.java:90)
            at org.basex.core.cmd.Add.run(Add.java:56)
            at org.basex.core.Command.run(Command.java:233)
            at org.basex.core.Command.execute(Command.java:93)
            at org.basex.api.client.LocalSession.execute(LocalSession.java:131)
            at org.basex.api.client.Session.execute(Session.java:36)
            at org.basex.core.CLI.execute(CLI.java:94)
            at org.basex.core.CLI.execute(CLI.java:78)
            at org.basex.core.CLI.execute(CLI.java:65)
            at org.basex.BaseX.<init>(BaseX.java:82)
            at org.basex.BaseX.main(BaseX.java:44)
    

Steps to Reproduce the Behavior

  1. Create a database DB1 using the create-db command.
  2. Add the directories dir1, dir2, and dir3 in this order using the add command.
  3. Create a database DB2 using the create-db command.
  4. Add the directories dir2, dir3, and dir1 in this order using the add command.

Do you have an idea how to solve the issue?

No response

What is your configuration?

The XML files are divided into the following three directories:

  • dir1: 800k files, 17GiB
  • dir2: 200k files, 18GiB
  • dir3: 25k files, 3GiB

OS: Windows 11 22H2

> java --version
openjdk 21.0.5 2024-10-15 LTS
OpenJDK Runtime Environment Microsoft-10377968 (build 21.0.5+11-LTS)
OpenJDK 64-Bit Server VM Microsoft-10377968 (build 21.0.5+11-LTS, mixed mode, sharing)
@advanceboy
Author

I suspect this issue might be related to the BaseX node limit (#902). However, it is difficult to determine this from the error message alone.
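
One way to read the trace is that the negative table size is a signed 32-bit counter that has wrapped around. The Java sketch below is only an illustration of that arithmetic, not BaseX code: the class name is made up, and the interpretation of the values is an assumption. It reinterprets the numbers reported in the first stack trace as unsigned 32-bit values and compares them with the largest value a Java int can hold.

```java
public class TableSizeWraparound {
    /** Reads a (possibly negative) int as the unsigned 32-bit value it would represent after wrapping. */
    static long asUnsigned(int value) {
        return Integer.toUnsignedLong(value);
    }

    public static void main(String[] args) {
        // Values copied from the first stack trace above.
        int tableSize = -2101426251;
        int firstPre = -2101426432;
        long lastPageIndex = 8568519L;

        long nodes = asUnsigned(tableSize);
        System.out.println(nodes);                     // 2193541045
        System.out.println(nodes > Integer.MAX_VALUE); // true

        // Assumption: the first pre value of the last page is pageIndex * entriesPerPage.
        long first = asUnsigned(firstPre);
        System.out.println(first / lastPageIndex);     // 256
        System.out.println(first % lastPageIndex);     // 0
    }
}
```

Under that reading, the table would hold about 2.19 billion entries, which is more than a signed 32-bit value can index, and the numbers are consistent with 256 entries per page. Both points are inferences from the trace, not confirmed behavior.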

@ChristianGruen
Member

True; the amount of data exceeds the limits of a single database instance. A common approach is to distribute documents across multiple database instances (all of which can be addressed by a single query).
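
As a rough illustration of distributing inputs across several databases, the following Java sketch assigns inputs to databases with a simple first-fit rule. It is a sketch under assumptions, not part of BaseX: the class, the method, the node estimates, and the budget are all hypothetical, and the issue itself reports only file counts and sizes, not node counts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DatabasePartitioner {
    /**
     * First-fit assignment: each input goes into the first database that still
     * has room for it, or into a new database if none does.
     */
    static List<List<String>> partition(Map<String, Long> estimatedNodes, long budget) {
        List<List<String>> databases = new ArrayList<>();
        List<Long> used = new ArrayList<>();
        for (Map.Entry<String, Long> entry : estimatedNodes.entrySet()) {
            long nodes = entry.getValue();
            if (nodes < 0 || nodes > budget) {
                throw new IllegalArgumentException(entry.getKey() + " does not fit into one database");
            }
            // Look for the first existing database with enough remaining budget.
            int target = -1;
            for (int i = 0; i < databases.size(); i++) {
                if (used.get(i) <= budget - nodes) {
                    target = i;
                    break;
                }
            }
            // None found: open a new database.
            if (target < 0) {
                databases.add(new ArrayList<>());
                used.add(0L);
                target = databases.size() - 1;
            }
            databases.get(target).add(entry.getKey());
            used.set(target, used.get(target) + nodes);
        }
        return databases;
    }

    public static void main(String[] args) {
        // Hypothetical inputs and node estimates, chosen only to show the rule.
        Map<String, Long> estimates = new LinkedHashMap<>();
        estimates.put("a", 6L);
        estimates.put("b", 5L);
        estimates.put("c", 3L);
        System.out.println(partition(estimates, 10L)); // [[a, c], [b]]
    }
}
```

In practice the budget would be set well below the per-database limit, since the estimates are only approximations.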

However, we need to be more consistent in rejecting update operations when the database limits would be exceeded by that update.
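
A minimal sketch of what such a rejection could look like, written as standalone Java rather than actual BaseX code. The class, the method, and the bound are assumptions: the bound is taken to be the largest value a signed 32-bit int can hold, and the check runs before anything is written, so an oversized update fails with a clear message instead of an out-of-bounds access.

```java
public class UpdateLimitGuard {
    /** Assumed upper bound: the largest count a signed 32-bit value can hold. */
    static final long MAX_NODES = Integer.MAX_VALUE;

    /** Throws if inserting nodesToInsert nodes would push the node count past the bound. */
    static void checkInsert(long currentNodes, long nodesToInsert) {
        if (currentNodes < 0 || nodesToInsert < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        // Compare against the remaining room so the sum itself is never computed.
        if (nodesToInsert > MAX_NODES - currentNodes) {
            throw new IllegalStateException("Update rejected: inserting " + nodesToInsert
                + " nodes into a database with " + currentNodes
                + " nodes would exceed the limit of " + MAX_NODES);
        }
    }

    public static void main(String[] args) {
        checkInsert(2_000_000_000L, 100_000_000L); // fits, no exception
        try {
            checkInsert(2_000_000_000L, 200_000_000L); // would exceed the bound
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```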

@advanceboy
Author

The fundamental issue of this ticket—namely, that the error message does not clearly indicate the root cause—has not been resolved. Therefore, I will keep this issue open.


@ChristianGruen Thank you for your advice! Following your suggestion, I tried distributing documents across multiple databases. Cross-database statistics work as long as the paths of all XML documents remain unique. In addition, binding each collection function call to a let clause at the start of the query, and avoiding redundant db: functions, let me control the execution order and get the expected results.

Here is an example demonstrating this approach:

> basex
BaseX 11.6 [Standalone]
Try 'help' to get more information.
> CREATE DB  multidb-01
> OPEN multidb-01
> ADD TO 1.xml <root><item a1='hoo11'>bar1</item><item a1='hoo12' /></root>
> ADD TO 4.xml <root><item a1='hoo41'>bar4</item><item a1='hoo42' /></root>
> CREATE DB  multidb-02
> OPEN multidb-02
> ADD TO 2.xml <root><item a1='hoo21'>bar2</item><item a1='hoo22' /></root>
> ADD TO 5.xml <root><item a1='hoo51'>bar5</item><item a1='hoo52' /></root>
> CREATE DB  multidb-03
> OPEN multidb-03
> ADD TO 3.xml <root><item a1='hoo31'>bar3</item><item a1='hoo32' /></root>
> ADD TO 6.xml <root><item a1='hoo61'>bar6</item><item a1='hoo62' /></root>
> 
> # expected result
> XQUERY let $x1 := collection('multidb-01') let $x2 := collection('multidb-02') let $x3 := collection('multidb-03') for $x in ($x1, $x2, $x3) return db:path($x)
1.xml
4.xml
2.xml
5.xml
3.xml
6.xml
> 
> XQUERY let $x1 := collection('multidb-01') let $x2 := collection('multidb-02') let $x3 := collection('multidb-03') for $x in ($x1, $x2, $x3) return $x[db:path($x)!='']/root/item[ends-with(@a1, "1")]
<item a1="hoo11">bar1</item>
<item a1="hoo41">bar4</item>
<item a1="hoo21">bar2</item>
<item a1="hoo51">bar5</item>
<item a1="hoo31">bar3</item>
<item a1="hoo61">bar6</item>

While this approach is less efficient, it seems feasible for distributing documents across multiple databases.

However, I encountered unexpected results when attempting more intuitive queries. The execution engine appears to be processing them in a way that leads to unintended behavior, but I couldn't pinpoint the exact cause.

> # unexpected result (1)
> XQUERY for $x in (collection("multidb-01"), collection("multidb-02"), collection("multidb-03")) return db:path($x)
1.xml
4.xml
1.xml
4.xml
1.xml
4.xml
> 
> # unexpected result (2)
> XQUERY let $x1 := collection('multidb-01') let $x2 := collection('multidb-02') let $x3 := collection('multidb-03') for $x in ($x1, $x2, $x3) return $x/root/item[ends-with(@a1, "1")]
<item a1="hoo11">bar1</item>
<item a1="hoo41">bar4</item>
<item a1="hoo21">bar2</item>
<item a1="hoo51">bar5</item>
<item a1="hoo21">bar2</item>
<item a1="hoo51">bar5</item>
> 
> # unexpected result (3)
> XQUERY for $doc in ("multidb-01", "multidb-02", "multidb-03") let $elm := collection($doc)/root/item[ends-with(@a1, "1")] return $elm
<item a1="hoo11">bar1</item>
<item a1="hoo41">bar4</item>
<item a1="hoo11">bar1</item>
<item a1="hoo41">bar4</item>
<item a1="hoo11">bar1</item>
<item a1="hoo41">bar4</item>
> 

It appears that the issue stems from how the queries are being optimized or executed internally. If you have any insights into why this is happening or suggestions for improving efficiency, I would greatly appreciate it.

Thanks again for your help!

@ChristianGruen
Member

@advanceboy Thanks for your new observation. Feel free to create a new issue for it… Ideally, with an example that can be reproduced, but I imagine it may take a while to formulate it. The SET QUERYINFO true command may give you some insight into what the optimizer does (and what it possibly shouldn’t do).

I’ll keep the original issue open, with a slightly updated title.

@ChristianGruen changed the title from "Data Access out of bounds Exception in TableDiskAccess when Adding a Large Number of XML Files" to "Storage: Reject updates that would exceed database limits" on Jan 30, 2025
@advanceboy
Author

I've created #2373 for the behavior described in #issuecomment-2624910884.
