Spark 2.0 DataSource API

How to implement a Spark External Datasource in Spark 1

Spark Data Source API: Extending Our Spark SQL Query Engine

Spark SQL External DataSource (Part 1): Examples

Spark SQL External DataSource (Part 2): Source Code Analysis

Reference implementation:

none

How to implement a Spark External Datasource in Spark 2

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-datasource-api.html

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-datasource.html

http://www.spark.tc/exploring-the-apache-spark-datasource-api/

New features:

  • Subquery support
  • Richer read/write API support, including:
    - RelationProvider


    [Key](http://wiki.baidu.com/pages/viewpage.action?pageId=213816907)
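A minimal sketch of a read-only data source built on `RelationProvider` for Spark 2.x. The package name `com.example.dummy`, the option key `greeting`, and the single-column schema are all illustrative assumptions, not part of any real library:

```scala
package com.example.dummy

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// RelationProvider builds a BaseRelation from the options passed to
// spark.read.format(...).option(...).load()
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    // "greeting" is a made-up option key for this sketch
    new DummyRelation(sqlContext, parameters.getOrElse("greeting", "hello"))
}

// TableScan is the simplest scan interface: return every row,
// with no column pruning or filter pushdown.
class DummyRelation(val sqlContext: SQLContext, greeting: String)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("value", StringType, nullable = false) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(greeting)))
}
```

With this in place, `spark.read.format("com.example.dummy").load()` would resolve the package's `DefaultSource` class and return a one-row DataFrame.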

    - DataSourceRegister

    Trait `DataSourceRegister` is an interface for registering data sources under their `shortName` aliases (so they can be looked up later).
    ``` scala
    package org.apache.spark.sql.sources

    trait DataSourceRegister {
      def shortName(): String
    }
    ```

It allows users to use the data source alias as the format type over the fully qualified class name.
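A sketch of combining `DataSourceRegister` with `RelationProvider` so the source can be addressed by an alias. The alias `"dummy"` and package name are illustrative; `createRelation` is left unimplemented here:

```scala
package com.example.dummy

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

class DefaultSource extends RelationProvider with DataSourceRegister {
  // The alias users can pass to spark.read.format(...)
  override def shortName(): String = "dummy"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    ??? // build and return your BaseRelation here
}
```

For the alias to be discovered, the class must also be listed in a `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` file on the classpath (Java's `ServiceLoader` mechanism); then `spark.read.format("dummy")` resolves to this source without the fully qualified class name.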