This repository has been archived by the owner on Oct 30, 2023. It is now read-only.

Fix bug in memory estimation #49

Closed
wants to merge 2 commits into from

Conversation

dlogothetis
Contributor

Method MemoryEstimatorOracle.calculateRegression() exits early if the number of valid columns available for the regression differs from the total number of columns. This is wrong: the regression can still run on only the valid columns. As a result, memory estimation is never used in practice, and out-of-core (OOC) starts spilling only when memory usage gets very high.
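
To illustrate the idea of running the regression on only the valid columns, here is a minimal sketch. It is not the actual Giraph code: the class name, both method names, and the choice to report a zero coefficient for every invalid column are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Giraph: shows how a sample matrix can be
// reduced to its valid columns instead of abandoning the regression.
class ValidColumnRegressionSketch {

  /** Returns a copy of samples (one row per sample) holding only the valid columns. */
  static double[][] selectValidColumns(double[][] samples, boolean[] valid) {
    int validCount = 0;
    for (boolean v : valid) {
      if (v) {
        validCount++;
      }
    }
    double[][] reduced = new double[samples.length][validCount];
    for (int row = 0; row < samples.length; row++) {
      int target = 0;
      for (int col = 0; col < valid.length; col++) {
        if (valid[col]) {
          reduced[row][target++] = samples[row][col];
        }
      }
    }
    return reduced;
  }

  /**
   * Maps coefficients fitted on the reduced matrix back to the full width.
   * Assumption: an invalid column simply gets a coefficient of zero.
   */
  static double[] expandCoefficients(double[] fitted, boolean[] valid) {
    double[] full = new double[valid.length];
    int source = 0;
    for (int col = 0; col < valid.length; col++) {
      full[col] = valid[col] ? fitted[source++] : 0.0;
    }
    return full;
  }

  public static void main(String[] args) {
    double[][] samples = {{1, 0, 3}, {2, 0, 5}};
    boolean[] valid = {true, false, true};
    // The regression itself would run on this reduced matrix.
    System.out.println(Arrays.deepToString(selectValidColumns(samples, valid)));
    // Coefficients from the reduced fit, widened back to three columns.
    System.out.println(Arrays.toString(expandCoefficients(new double[] {0.5, 2.0}, valid)));
  }
}
```

With this shape, a mismatch between the valid and total column counts only shrinks the problem handed to the regression; it is no longer a reason to return early.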

This is fixed in #34 too, but I want to make these changes one by one so that we can test them in isolation.

Tests:

  • mvn clean install
  • Snapshot tests, including a snapshot test that uses OOC.
  • Ran 3 production jobs and verified that the change reduces data spills and that the jobs finish faster. The max % spilled is reduced by more than 40%.

JIRA: https://issues.apache.org/jira/browse/GIRAPH-1160


@heslami heslami left a comment

LGTM!

@@ -792,7 +792,6 @@ private static boolean calculateRegression(double[] coefficient,
LOG.warn("There are " + coefficient.length +

We should make this a LOG.info. We could also remove this if block entirely, as it doesn't add much information to the log. If we want to keep the if, we should also remove the "but" from the log line :-)


@majakabiljo majakabiljo left a comment

+1

What does it mean for a column to be invalid?

@asfgit asfgit closed this in 22e8511 Sep 21, 2017
@dlogothetis
Contributor Author

These columns correspond to the different variables in the linear regression model and include the number of edges read so far, the number of vertices computed, etc. One case of an invalid column is when all samples have a value of zero for that column (e.g. there are no vertices computed yet). Another case is when there is a linear dependency between two columns, so you can't run the regression.
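
As a rough sketch of the two cases described here, the code below flags a column as invalid when every sample is zero for it, or when it is a scalar multiple of an earlier valid column. This is an assumption-laden illustration, not the check Giraph performs: the class name, method names, and the EPS tolerance are hypothetical, and the pairwise test does not catch dependencies that involve three or more columns (that would need a rank computation).

```java
import java.util.Arrays;

// Hypothetical helper, not part of Giraph: marks regression columns as valid or invalid.
class InvalidColumnSketch {
  // Assumed tolerance for treating a floating-point value as zero.
  static final double EPS = 1e-9;

  /** True if every sample has (approximately) zero in the given column. */
  static boolean isAllZero(double[][] samples, int col) {
    for (double[] row : samples) {
      if (Math.abs(row[col]) > EPS) {
        return false;
      }
    }
    return true;
  }

  /** True if column b equals column a times a constant (column a must not be all zero). */
  static boolean isScalarMultiple(double[][] samples, int a, int b) {
    int pivot = -1;
    for (int row = 0; row < samples.length; row++) {
      if (Math.abs(samples[row][a]) > EPS) {
        pivot = row;
        break;
      }
    }
    if (pivot < 0) {
      return false;
    }
    double ratio = samples[pivot][b] / samples[pivot][a];
    for (double[] row : samples) {
      if (Math.abs(row[b] - ratio * row[a]) > EPS) {
        return false;
      }
    }
    return true;
  }

  /** Builds the validity mask: all-zero columns and pairwise-dependent columns are invalid. */
  static boolean[] findValidColumns(double[][] samples, int numColumns) {
    boolean[] valid = new boolean[numColumns];
    for (int col = 0; col < numColumns; col++) {
      valid[col] = !isAllZero(samples, col);
      for (int prev = 0; valid[col] && prev < col; prev++) {
        if (valid[prev] && isScalarMultiple(samples, prev, col)) {
          valid[col] = false;
        }
      }
    }
    return valid;
  }

  public static void main(String[] args) {
    // Column 1 is all zero; column 2 is twice column 0; columns 0 and 3 are usable.
    double[][] samples = {{1, 0, 2, 5}, {2, 0, 4, 1}, {3, 0, 6, 7}};
    System.out.println(Arrays.toString(findValidColumns(samples, 4)));
  }
}
```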


3 participants