# Useful Spark Configuration Options

## Job Submission

- The jar used by Spark jobs. With this option configured, the assembly jar no longer needs to be uploaded on every job submission, which shortens job startup time.

  ```shell
  spark.yarn.jar hdfs://****:****/path/spark/spark-assembly-***.jar
  ```
- Node blacklist when submitting jobs to YARN (see the hedged sketch after this list).
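
As a rough illustration of node blacklisting, the snippet below uses the `spark.blacklist.*` settings introduced in Spark 2.1; the option choices and values here are assumptions for illustration, not taken from these notes, so check them against your Spark version.

```shell
# Assumed illustration (Spark 2.1+): exclude flaky nodes/executors
# after repeated task failures instead of retrying on them forever.
spark.blacklist.enabled true
# Failures of one task on a single executor before that executor is blacklisted.
spark.blacklist.task.maxTaskAttemptsPerExecutor 1
# Failures of one task on a single node before the entire node is blacklisted.
spark.blacklist.task.maxTaskAttemptsPerNode 2
```
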
## Shuffle

- ```spark.shuffle.file.buffer```

  Size of the in-memory buffer for each shuffle file output stream. These buffers reduce the number of disk seeks and system calls made in creating intermediate shuffle files.
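
For example, a shuffle-heavy job can trade a little executor memory for fewer disk flushes by enlarging this buffer; the value below is an assumption for illustration (the default is 32k in recent Spark releases).

```shell
# Hedged example: enlarge the per-stream shuffle write buffer
# (default 32k) to reduce disk seeks; 1m is an assumed value.
spark.shuffle.file.buffer 1m
```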

## Memory

- ```spark.yarn.executor.memoryOverhead```

  The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, and other native overheads. It tends to grow with the executor size (typically 6-10%).
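
As a sketch, a 10 GB executor following the 6-10% rule of thumb might reserve roughly 1 GB of overhead; the numbers below are assumptions and workload-dependent.

```shell
# Hedged example: ~10% off-heap headroom for a 10g executor
# (both values are assumed, tune for your workload).
spark.executor.memory 10g
spark.yarn.executor.memoryOverhead 1024
```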


## Spark SQL

- ```spark.sql.shuffle.partitions```

  Before Spark SQL 2.0, the number of reduce tasks was controlled by the spark.default.parallelism and spark.sql.shuffle.partitions settings. If the value is too high, downstream stages produce many fragmented tasks, or many small files end up on HDFS; if it is too low, individual reduce tasks carry too heavy a computational load.
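
For instance, a large aggregation might raise the partition count well above the default of 200; the figure below is an assumed, workload-specific value.

```shell
# Hedged example: more shuffle partitions for a large aggregation
# (default is 200; 800 is an assumed value, not a recommendation).
spark.sql.shuffle.partitions 800
```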

- ```spark.sql.adaptive.enabled```

  Spark SQL 2.0+ can instead size the reduce stage adaptively by enabling spark.sql.adaptive.enabled.

- ```spark.sql.files.ignoreCorruptFiles```

  Whether to ignore corrupt files. If true, Spark jobs continue to run when encountering corrupted files, and the contents that have already been read are still returned.
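
Combining the two options above, a pipeline on Spark 2.0+ might run with adaptive execution on and corrupt inputs skipped; a minimal sketch, assuming both flags are supported by your build.

```shell
# Hedged example (Spark 2.0+): adaptive reducer sizing plus
# tolerance for unreadable input files.
spark.sql.adaptive.enabled true
spark.sql.files.ignoreCorruptFiles true
```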