Throttling Batch Ingest on AWS


Reasons for Throttling

AWS Reasons

First, AWS Elastic Transcoder allows you to post 2 jobs per second, with support for bursts of up to 100 requests per second. If all of a large spreadsheet is queued at once, that can exceed the burst allowance, assuming the burst allowance is even available: if a spreadsheet that came in right before had used up the burst allowance, even a small spreadsheet could exceed it. So our default is to throttle at 2 jobs every second (except right now it's lower, due to the Modeshape reason below).

For more info on Elastic Transcoder limits, see: https://docs.aws.amazon.com/elastictranscoder/latest/developerguide/limits.html

The second issue is that right now Elastic Transcoder can handle 20 jobs at a time, and 20 jobs represent a total of ~100 Fedora calls: posting the newly made derivatives (currently 2 quality levels), updating the MasterFile and MediaObject these derivatives are attached to, and some other callbacks. So we need to throttle the rate at which completed transcodes are fed back into Fedora in the form of derivative creation. The material you are transcoding also impacts the need to throttle. Transcoding a thousand Red Hot Chili Peppers songs will result in transcodes finishing extremely close to one another and thousands of writes to Fedora in short order. A thousand feature-length films will space out the impact due to longer transcoding times.

Modeshape and S3 Reasons

If you are running Fedora 4.7.4 or earlier, there is a known bug in Modeshape that makes writing to S3 problematic. See here for more details. Long story short: more than one write operation to S3 at a time is bad and throws a 500 error. Even with extensive throttling, at the level of one job every 10 seconds, it is still difficult to totally eliminate simultaneous writes.

NUL has deployed our own bespoke Fedora 4.7.4 that has the needed fixes compiled into it. You can download the .war here. Using this has significantly reduced the errors, although it should be noted you can still overload Fedora pretty easily given the power of Elastic Transcoder.

How Do We Throttle

We throttle using ActiveJob::TrafficControl. We've installed it using the method in the README:

  1. Adding it to our Gemfile and bundle installing.
  2. Updating application.rb to let ActiveJob::TrafficControl get its hooks into the queue
  3. Updating the jobs we want to throttle; for us the jobs are IngestBatchEntry and ActiveEncodeJob::Update (a sketch follows this list)
  4. Adding environment variables to our EC2 instances
  5. Deploying the new code
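
For steps 2 and 3, here's a minimal sketch based on the gem's README. The Redis/ConnectionPool backend and the class and file names are assumptions for illustration; see our commit linked below for our exact changes.

# config/application.rb -- step 2: give TrafficControl a lock backend.
# Redis behind a connection pool is one of the options in the gem's README.
ActiveJob::TrafficControl.client = ConnectionPool.new(size: 5, timeout: 5) { Redis.new }

# Step 3: mix the Throttle module into a job and declare its limits.
class IngestBatchEntryJob < ActiveJob::Base
  include ActiveJob::TrafficControl::Throttle

  # Allow at most 2 of these jobs to start in any 1-second window.
  throttle threshold: 2, period: 1.second

  def perform(batch_entry)
    # ... create the MediaObject and MasterFiles, post to the transcoder ...
  end
end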

Our specific commit where we added the throttling is here.

The environment variables we have added are:

SETTINGS__BATCH_THROTTLING__INGEST_JOBS_THROTTLE_THRESHOLD = 2
SETTINGS__BATCH_THROTTLING__INGEST_JOBS_SPACING = 5
SETTINGS__BATCH_THROTTLING__UPDATE_JOBS_THROTTLE_THRESHOLD = 3 
SETTINGS__BATCH_THROTTLING__UPDATE_JOBS_SPACING = 10

To set these on AWS you need to go into your console, then into Elastic Beanstalk, and then (assuming you're following Avalon convention) add all four to both your webapp and your worker applications. After those environments have updated, you can deploy the code and the variables will be picked up.
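
Alternatively, if you manage your environments with the EB CLI rather than the web console, setting all four on an environment looks something like this (repeat for the webapp and worker environments):

eb setenv SETTINGS__BATCH_THROTTLING__INGEST_JOBS_THROTTLE_THRESHOLD=2 \
  SETTINGS__BATCH_THROTTLING__INGEST_JOBS_SPACING=5 \
  SETTINGS__BATCH_THROTTLING__UPDATE_JOBS_THROTTLE_THRESHOLD=3 \
  SETTINGS__BATCH_THROTTLING__UPDATE_JOBS_SPACING=10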

For folks not on AWS, you can add the above to your config/settings.yml file in the form of:

batch_throttling:
  ingest_jobs_throttle_threshold: 2
  ingest_jobs_spacing: 5
  update_jobs_throttle_threshold: 3
  update_jobs_spacing: 10

The ingest_jobs values control the initial creation of the MediaObject and MasterFiles, which is done when the batch is first run. A thousand-item batch will need to create 1,000 MediaObjects and at least 1,000 MasterFiles (more if MediaObjects in the batch have multiple parts). This also controls the rate at which jobs are sent to the transcoder. Decreasing the threshold or increasing the spacing will reduce the impact this part of the ingest has on Fedora. Note, though, that if you set these values such that creation takes a long time, you'll begin overlapping with the update jobs, since you'll still be creating MediaObjects even as items finish transcoding and need derivatives created.

The update_jobs values control how often derivatives are created after transcoding completes. Each job here needs to create the derivative in Fedora and run callbacks against the derivative's parents as part of the creation and attachment process. If you are not using AWS Elastic Transcoder, or are transcoding primarily long films and the like, you may not need to do much throttling here, since the long-running transcodes will naturally space these jobs out.
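
Tying the settings to the jobs, the throttle declarations can read these values directly. This is a sketch, assuming the Settings object from the config gem that Avalon uses; values arriving via environment variables are strings, hence the to_i. See our commit for the exact code.

# Sketch: drive each job's throttle from settings.yml / the
# SETTINGS__BATCH_THROTTLING__... environment variables.
class IngestBatchEntryJob < ActiveJob::Base
  include ActiveJob::TrafficControl::Throttle

  throttle threshold: Settings.batch_throttling.ingest_jobs_throttle_threshold.to_i,
           period: Settings.batch_throttling.ingest_jobs_spacing.to_i.seconds
end

module ActiveEncodeJob
  class Update < ActiveJob::Base
    include ActiveJob::TrafficControl::Throttle

    throttle threshold: Settings.batch_throttling.update_jobs_throttle_threshold.to_i,
             period: Settings.batch_throttling.update_jobs_spacing.to_i.seconds
  end
end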

How Does it Throttle In the App?

The above code allows X jobs to run in a Y-second window. So when the next job comes along (X+1) and you're still in the Y-second window, that job will be rejected. TrafficControl will have that job wait 1x to 5x the Y-second window and then try again. So overly conservative throttling, or really massive batches with lots of jobs that have to be throttled, is going to result in numerous jobs hitting your queue, bouncing, and trying again.

For example, if you let 1 job run every 10 seconds and you submit a batch of 3,500 items (a rough simulation follows this list):

  • In the first second, one item will run
  • All others, 3,499 jobs, will fail and wait 10 to 50 seconds before trying again
  • Over the next 50 seconds, 5 more jobs will run; the other 3,494 will be rejected when they try and will wait another 10 to 50 seconds
  • This vicious cycle will continue until all the jobs are processed or your queue blows up at you, hopefully the former
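
To make the arithmetic concrete, here's a toy Ruby model of that drain. It's purely illustrative: it assumes bounced jobs retry uniformly at random within the 1x-5x window and ignores real queue mechanics.

# Toy model: 3,500 jobs, 1 job allowed per 10-second window;
# a bounced job retries after a random 10-50 second delay.
pending = Array.new(3500) { 0 }  # earliest time each job may next attempt
clock = 0
bounces = 0

until pending.empty?
  ready = pending.count { |t| t <= clock }
  if ready > 0
    pending.sort!
    pending.shift                # one job runs in this window
    bounces += ready - 1         # everything else that attempted bounces
    pending.map! { |t| t <= clock ? clock + rand(10..50) : t }
  end
  clock += 10                    # advance to the next throttle window
end

puts "Drained in roughly #{(clock / 3600.0).round(1)} hours after #{bounces} bounced attempts."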

How Can You Tell If It Is Working?

If you haven't throttled enough for Fedora

STATUS: 500 org.modeshape.jcr.value.binary.BinaryStoreException: org.modeshape.jcr.value.binary.BinaryStoreException: Unable to find binary value with key \"c65b23722510dbce87a62224dea16081544cb5a5\" within binary store at \"/var/cache/tomcat8/temp/modeshape
...

If you're having issues with these, you'll likely see rows in the BatchEntries table ending up with error: true and an error_message containing the STATUS: 500 blurb above. You may also see it in the Fedora Tomcat logs if the job itself is running but later callbacks are failing. These errors mean you're writing to Fedora too aggressively and need to scale back your writes further.

If you haven't throttled enough for Elastic Transcoder

The batch will run, but individual items will have transcode errors in the form of:

You need to throttle your IngestBatchEntry to stay under the submission threshold for Elastic Transcoder.

Database Connection

The last thing to note is that whenever a job is throttled there is a cost in the form of a write to your database, since the job needs to be saved and requeued. If you have a large number of jobs that have to be throttled, you may overload your database connection pool. This will manifest itself in the form of:

ActiveRecord::ConnectionTimeoutError (could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)):

messages appearing in your logs. If this is the case you can increase your database pool size (non-AWS users, see config/database.yml) or add more workers. Note that AWS sets different max_connections values for a database instance based on its RDS instance class (db.t2.small, db.t2.medium, etc.), so review the AWS documentation for whatever instance type you are running to determine a safe value for your pool size.
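
For non-AWS deployments, raising the pool is a small change to config/database.yml. The adapter and numbers here are illustrative; size the pool to your worker thread count and your database's connection ceiling.

production:
  adapter: postgresql
  database: avalon
  pool: 25        # Rails defaults to 5; must cover all worker threads per process
  timeout: 5000   # ms; matches the 5.000 second wait in the error above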