Troubleshooting Spark
The following commands show some typical settings for running Spark pipelines.
See Spark evaluation scripts for more details about running Spark pipelines, and Spark Evaluation Results for performance numbers.
Reads pipeline on exome data running on a 10-node (n1-standard-16) Google Cloud Dataproc cluster
./gatk-launch ReadsPipelineSpark \
-I hdfs:///user/tom/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam \
-O hdfs://tw-cluster-2-m:8020/user/tom/exome_spark_eval/out/NA12878.ga2.exome.maq.raw.vcf \
-R hdfs:///user/tom/exome_spark_eval/Homo_sapiens_assembly18.2bit \
--knownSites hdfs://tw-cluster-2-m:8020/user/tom/exome_spark_eval/dbsnp_138.hg18.vcf \
-pairHMM AVX_LOGLESS_CACHING \
-maxReadsPerAlignmentStart 10 \
-apiKey /home/tom/.gcs/broad-gatk-collab-0853abf3a8f1.json \
-- \
--sparkRunner GCS --cluster tw-cluster-2 \
--num-executors 20 --executor-cores 7 --executor-memory 28g \
--driver-memory 4g \
--conf spark.dynamicAllocation.enabled=false
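As a sanity check, the executor settings above can be reconstructed from the machine type. The arithmetic below assumes the published n1-standard-16 specification (16 vCPUs, 60 GB of memory per node) and ignores YARN/OS overhead:

# n1-standard-16: 16 vCPUs and 60 GB of memory per node
# 2 executors per node x 7 cores each = 14 cores in use (2 left for the OS and YARN)
# 2 executors per node x 28g each     = 56g in use (~4g left for overhead)
# 10 nodes x 2 executors per node     = 20 executors, hence --num-executors 20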
Reads pipeline on WGS data running on a 20-node (n1-standard-16) Google Cloud Dataproc cluster
./gatk-launch ReadsPipelineSpark \
-I hdfs:///user/tom/q4_spark_eval/WGS-G94982-NA12878-no-NC_007605.bam \
-O hdfs://tw-cluster-2-m:8020/user/tom/q4_spark_eval/out/WGS-G94982-NA12878.vcf \
-R hdfs:///user/tom/q4_spark_eval/human_g1k_v37.2bit \
--knownSites hdfs://tw-cluster-2-m:8020/user/tom/q4_spark_eval/dbsnp_138.b37.vcf \
-pairHMM AVX_LOGLESS_CACHING \
-maxReadsPerAlignmentStart 10 \
-apiKey /home/tom/.gcs/broad-gatk-collab-0853abf3a8f1.json \
-- \
--sparkRunner GCS --cluster tw-cluster-2 \
--num-executors 20 --executor-cores 8 --executor-memory 46g \
--driver-memory 8g \
--conf spark.dynamicAllocation.enabled=false
If the job fails quickly with a java.lang.AbstractMethodError or java.lang.NoSuchMethodError, this probably means that you are using Spark 1.6 rather than Spark 2. When running on a CDH cluster you need to specify that Spark 2 is to be used by adding --sparkSubmitCommand spark2-submit to the Spark-specific arguments to gatk-launch (the ones after --).
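For example, the invocation would look something like this (a sketch only; the tool and remaining Spark-specific arguments are elided):

./gatk-launch ReadsPipelineSpark \
  ... tool arguments ... \
  -- \
  --sparkSubmitCommand spark2-submit \
  ... other Spark-specific arguments ...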
Spark in general is sensitive to memory settings, and they will need tuning for any non-trivial job. The settings in the illustrative commands above should provide a good starting point.
- Driver. Usually 4g or 8g for --driver-memory will suffice.
- Executors. In general, prefer a smaller number of larger executors over a larger number of smaller executors, since this allows more complex GATK tools like BQSR to share resources (e.g. known sites) in the same JVM. In the case of ReadsPipelineSpark on exome-sized data it's better to use one executor with 7 cores and 28g of memory than 7 executors each with one core and 4g of memory. The total number of executors should be determined by the cluster size. E.g. in the exome example above, 20 executors (each with 7 cores and 28g of memory) were chosen since that's the maximum that fits in the cluster. Alternatively, you might consider setting spark.dynamicAllocation.enabled to true to have Spark scale up the number of executors automatically, as sketched below.
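As an illustration, the Spark-specific arguments from the exome command above could be adjusted along these lines to let Spark choose the number of executors (a sketch; note that dynamic allocation on YARN generally also requires the external shuffle service, which Dataproc clusters typically enable by default):

-- \
--sparkRunner GCS --cluster tw-cluster-2 \
--executor-cores 7 --executor-memory 28g \
--driver-memory 4g \
--conf spark.dynamicAllocation.enabled=true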
Make sure that all file paths on HDFS are absolute paths. E.g. hdfs://tw-cluster-2-m:8020/user/tom/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam. The hostname and port must also be specified in some cases, so if in doubt include them in the path.
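To double-check the scheme, hostname and port to use in an absolute path, the standard HDFS command-line tools can help (the paths here are the ones from the exome example above):

# Print the default filesystem URI for the cluster, e.g. hdfs://tw-cluster-2-m:8020
hdfs getconf -confKey fs.defaultFS
# Confirm the input exists at the fully qualified path
hdfs dfs -ls hdfs://tw-cluster-2-m:8020/user/tom/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam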