Spark Jobs: Submitting, Cancelling, and Tracking Progress

Posted on 2016-08-24 | In Spark |

When a single SparkContext is shared, jobs can be submitted and cancelled in groups:

// Submit a job
private SparkContext sc;
private SQLContext sqlc;
sc.setJobGroup(jobGroup, description, true);
sqlc.sql(st);
sc.clearJobGroup();

// Cancel a job
sc.cancelJobGroup(jobGroup);

Getting job progress information:

String jobGroup = getJobGroup(context);
SQLContext sqlc = getSparkInterpreter().getSQLContext();
SparkContext sc = sqlc.sparkContext();
int completedTasks = 0;
int totalTasks = 0;
JobProgressListener sparkListener = new JobProgressListener(context.getConf());


DAGScheduler scheduler = sc.dagScheduler();
HashSet<ActiveJob> jobs = scheduler.activeJobs();
Iterator<ActiveJob> it = jobs.iterator();
while (it.hasNext()) {
    ActiveJob job = it.next();
    String g = (String) job.properties().get("spark.jobGroup.id");
    if (jobGroup.equals(g)) {
        int[] progressInfo = getProgressFromStage_1_1x(sparkListener, job.finalStage());
        totalTasks += progressInfo[0];
        completedTasks += progressInfo[1];
    }
}

if (totalTasks == 0) {
   return 0;
}
return completedTasks * 100 / totalTasks;
}

private int[] getProgressFromStage_1_1x(JobProgressListener sparkListener, Stage stage) {
    int numTasks = stage.numTasks();
    int completedTasks = 0;

    try {
      Method stageIdToData = sparkListener.getClass().getMethod("stageIdToData");
      HashMap<Tuple2<Object, Object>, Object> stageIdData =
          (HashMap<Tuple2<Object, Object>, Object>) stageIdToData.invoke(sparkListener);
      Class<?> stageUIDataClass =
          Class.forName("org.apache.spark.ui.jobs.UIData$StageUIData");

      Method numCompletedTasks = stageUIDataClass.getMethod("numCompleteTasks");

      Set<Tuple2<Object, Object>> keys =
          JavaConverters.asJavaSetConverter(stageIdData.keySet()).asJava();
      for (Tuple2<Object, Object> k : keys) {
        if (stage.id() == (int) k._1()) {
          Object uiData = stageIdData.get(k).get();
          completedTasks += (int) numCompletedTasks.invoke(uiData);
        }
      }
    } catch (Exception e) {
      logger.error("Error on getting progress information", e);
    }

    List<Stage> parents = JavaConversions.asJavaList(stage.parents());
    if (parents != null) {
      for (Stage s : parents) {
        int[] p = getProgressFromStage_1_1x(sparkListener, s);
        numTasks += p[0];
        completedTasks += p[1];
      }
    }
    return new int[] {numTasks, completedTasks};
  }
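
The recursion above adds up task counts for a stage and all of its parent stages, then converts them to a percentage. The following is a minimal, Spark-free sketch of that same arithmetic; `StageNode` and `StageProgress` are hypothetical stand-ins introduced for illustration, not Spark classes.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a Spark stage: its own task counts plus its parent stages. */
final class StageNode {
    final int numTasks;
    final int completedTasks;
    final List<StageNode> parents;

    StageNode(int numTasks, int completedTasks, StageNode... parents) {
        this.numTasks = numTasks;
        this.completedTasks = completedTasks;
        this.parents = Arrays.asList(parents);
    }
}

final class StageProgress {
    /** Returns {totalTasks, completedTasks} for a stage and all of its ancestors. */
    static int[] collect(StageNode stage) {
        int total = stage.numTasks;
        int done = stage.completedTasks;
        for (StageNode parent : stage.parents) {
            int[] p = collect(parent);
            total += p[0];
            done += p[1];
        }
        return new int[] {total, done};
    }

    /** Integer percentage in [0, 100]; 0 when there are no tasks at all. */
    static int percent(StageNode finalStage) {
        int[] p = collect(finalStage);
        return p[0] == 0 ? 0 : p[1] * 100 / p[0];
    }

    public static void main(String[] args) {
        StageNode map = new StageNode(8, 8);         // finished parent stage
        StageNode reduce = new StageNode(2, 0, map); // final stage, not started
        System.out.println(percent(reduce));
    }
}
```

As in the original helper, a parent stage that is shared by two child stages is counted once per path, so the percentage is an approximation.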

Hive on Spark: How to Merge Small Files

Posted on 2016-08-23 | In Hive |

Hive has properties for merging small result files. When Spark is the Hive execution engine, small files can be merged as well:

1. Set the property hive.merge.sparkfiles=true
https://issues.apache.org/jira/browse/HIVE-8043
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
PS: unfortunately the Spark version at my company does not support it; setting it on the command line or in hive-site.xml had no effect.
2. Use distribute by to redistribute the data
For example, distribute on a timestamp modulo.
http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark
Demo:
select * from table1 where yymmdd='20160816' distribute by (column % 64)
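
To see why `column % 64` bounds the number of groups, the sketch below buckets some keys the same way. It is an illustration only: the class name and the sample epoch-second values are assumptions, and it assumes non-negative keys.

```java
import java.util.Map;
import java.util.TreeMap;

final class ModuloBuckets {
    /** Bucket index for a non-negative key, mirroring `column % 64` in the query. */
    static int bucket(long key, int buckets) {
        return (int) (key % buckets);
    }

    /** Counts how many keys fall into each bucket. */
    static Map<Integer, Integer> histogram(long[] keys, int buckets) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (long k : keys) {
            counts.merge(bucket(k, buckets), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] timestamps = new long[1000];
        for (int i = 0; i < timestamps.length; i++) {
            timestamps[i] = 1471305600L + i; // hypothetical epoch seconds
        }
        // However many rows there are, there are never more than 64 distinct buckets.
        System.out.println(histogram(timestamps, 64).size());
    }
}
```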

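Differences Between order by, sort by, distribute by, and cluster by in Hive

Posted on 2016-08-23 | In Hive |

Reposted from: http://blog.csdn.net/lzm1340458776/article/details/43306115

order by

order by sorts the input globally, so there is only one reducer (multiple reducers cannot guarantee a global order). With a single reducer, large inputs take a long time to compute. For details on order by, see the article "Hive Order by Operations".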

sort by

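sort by does not produce a global ordering; it sorts the data before it enters each reducer. If you sort with sort by and set mapred.reduce.tasks > 1, only the output of each reducer is sorted, not the overall result. Unlike order by, sort by is not affected by the hive.mapred.mode property. With sort by you can choose the number of reducers (set mapred.reduce.tasks=n) and then merge-sort the outputs to obtain a fully sorted result.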

distribute by

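distribute by controls how map output is split among the reducers. Hive distributes rows according to the columns after distribute by and the number of reducers, using a hash by default. sort by produces one sorted file per reducer. Sometimes you need to control which reducer a particular row goes to, usually for a subsequent aggregation; distribute by does exactly that, which is why it is often used together with sort by.

Note: typical situations for using distribute by and sort by

Map output files are unevenly sized.
Reduce output files are unevenly sized.
There are too many small files.
Files are too large.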

cluster by

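cluster by combines the functions of distribute by and sort by. However, the sort is always ascending; you cannot specify ASC or DESC.

Demo:
Sort weather data by year and temperature, ensuring that all rows with the same year end up in the same reduce partition.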

hive (hive)> select * from temperature distribute by year sort by year asc,tempra desc;  
//MapReduce...  
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2  
//MapReduce...  
OK  
year    tempra  
2008    35`C  
2008    32.5`C  
2008    31`C  
2008    31.5`C  
2008    30`C  
2015    41`C  
2015    39`C  
2015    37`C  
2015    36`C  
2015    35`C  
2015    33`C  
Time taken: 17.358 seconds
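
The behaviour in the demo (rows with the same year land in one reducer, and each reducer sorts only its own rows) can be mimicked in a few lines. This is a sketch under assumptions: the `Row` type and class name are hypothetical, the hash is a simple modulo rather than Hive's actual partitioner, and temperatures are numeric here whereas the demo table stores them as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class DistributeSortDemo {
    static final class Row {
        final int year;
        final double tempra;
        Row(int year, double tempra) { this.year = year; this.tempra = tempra; }
    }

    /** distribute by year: every row with the same year goes to the same reducer. */
    static List<List<Row>> distributeByYear(List<Row> rows, int reducers) {
        List<List<Row>> parts = new ArrayList<>();
        for (int i = 0; i < reducers; i++) parts.add(new ArrayList<>());
        for (Row r : rows) {
            parts.get(Math.floorMod(Integer.hashCode(r.year), reducers)).add(r);
        }
        return parts;
    }

    /** sort by year asc, tempra desc: applied independently inside each reducer. */
    static void sortWithin(List<List<Row>> parts) {
        Comparator<Row> order = Comparator.comparingInt((Row r) -> r.year)
                .thenComparing(Comparator.comparingDouble((Row r) -> r.tempra).reversed());
        for (List<Row> p : parts) p.sort(order);
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(2015, 33), new Row(2008, 30), new Row(2015, 41), new Row(2008, 35));
        List<List<Row>> parts = distributeByYear(rows, 2);
        sortWithin(parts);
        for (int i = 0; i < parts.size(); i++) {
            for (Row r : parts.get(i)) {
                System.out.println("reducer " + i + ": " + r.year + " " + r.tempra);
            }
        }
    }
}
```

Concatenating the reducers' outputs is not guaranteed to be globally sorted; that is the difference from order by.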

Dropwizard: An Excellent Java REST Server Stack

Posted on 2016-08-11 | In REST |

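What is Dropwizard?

Dropwizard is an open-source Java framework for developing ops-friendly, high-performance REST backends. It was developed by Yammer to power its JVM-based backends.

Dropwizard bundles best-of-breed Java libraries into a single embeddable application package. It consists of the following parts:

Embedded Jetty: every application is packaged as a jar (not a war) file and starts its own embedded Jetty container. There are no war files and no external servlet container.
JAX-RS: Jersey (the JAX-RS reference implementation) is used to write REST web services.
JSON: the REST services speak JSON, and the Jackson library handles all JSON processing.
Logging: done with Logback and SLF4J.
Hibernate Validator: Dropwizard uses the Hibernate Validator API for declarative validation.
Metrics: Dropwizard supports monitoring through the Metrics library, which gives unparalleled insight into how your code behaves.

Besides the libraries mentioned above, Dropwizard uses several others; the complete list is in the Dropwizard documentation.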

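Why Dropwizard?

I decided to learn Dropwizard for the following reasons:

Fast project bootstrapping: if you have used Spring or Java EE, you know how painful bootstrapping a project can be. With Dropwizard, you only need to add one dependency to your pom.xml.
Application metrics: Dropwizard ships with support for application metrics. It provides very useful information such as request/response times; just add the @Timed annotation to a method to capture its execution time.
Productivity: every Dropwizard application has a main program that starts the Jetty container. This means the application can be run and debugged from the IDE as an ordinary main program, with no need to rebuild or redeploy a war file.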

Demo

Dropwizard: An Excellent Java REST Server Stack

day13-dropwizard-mongodb-demo-app

Converting Between Mutable and Immutable Collections in Scala

Posted on 2016-08-10 | In Scala |

In general, to go from mutable to immutable, use the to* methods on mutable collections, such as the toList method of MutableList and ListBuffer.
To go from immutable to mutable, use the companion object's factory method: scala.collection.mutable.ListBuffer(immutableList: _*).

Note that ListBuffer's toList runs in constant time; other to* methods such as toMap generally copy the elements and take linear time.

Map

// from immutable to mutable
val immutableMap = scala.collection.immutable.Map(1 -> "1", 2 -> "2")
val mutableMap = scala.collection.mutable.Map(immutableMap.toSeq: _*)
// from mutable to immutable
val immutableMap1 = mutableMap.toMap // 2.8 and later
val immutableMap2 = collection.immutable.Map(mutableMap.toList: _*) // before 2.8

List

// from immutable to mutable
val immutableList = scala.collection.immutable.List(1, 2, 3)
val mutableListBuffer = scala.collection.mutable.ListBuffer(immutableList: _*)
// from mutable to immutable
val immutableList2 = mutableListBuffer.toList

References

Spark Performance Tuning

Posted on 2016-08-04 | In Spark |

References:
how-to-tune-your-apache-spark-jobs-part-1
how-to-tune-your-apache-spark-jobs-part-2
tuning_spark_streaming
Spark Streaming Performance Tuning in Detail

Spark Performance Tuning: Shuffle Tuning
Spark Performance Tuning: Data Skew Tuning

Spark Performance Tuning: Development Tuning
top-5-mistakes-when-writing-spark-applications (highly recommended)

1 Basics

job -> stage -> task
A job is divided into stages, and a stage into tasks; each task runs on one core.
executor -> core
The number of tasks in a stage is the same as the number of partitions in the last RDD in the stage.

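2 Tuning Resource Allocation

GC tuning for Spark applications (focuses on tuning the G1 garbage collector)
Spark Performance Tuning: Resource Tuning
Every Spark executor in an application has the same fixed number of cores and same fixed heap size.

--executor-cores / spark.executor.cores: sets the number of cores per executor at submit time, which determines how many tasks an executor runs in parallel
--executor-memory / spark.executor.memory: sets the executor JVM heap size
--num-executors / spark.executor.instances: sets the number of executors
spark.dynamicAllocation.enabled: enables dynamic resource allocation when set to true; do not set num-executors in that case
spark.yarn.executor.memoryOverhead: sets the amount of off-heap memory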

spark.dynamicAllocation.enabled

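Executors are removed once they have been idle longer than the timeout.
For Spark Streaming, data arrives in time intervals; to keep executors from being added and removed repeatedly, disable this feature.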

Memory layout

Notes:

The application master, which is a non-executor container with the special capability of requesting containers from YARN, takes up resources of its own that must be budgeted in. In yarn-client mode, it defaults to 1024MB and one vcore. In yarn-cluster mode, the application master runs the driver, so it's often useful to bolster its resources with the --driver-memory and --driver-cores properties.
Running executors with too much memory often results in excessive garbage collection delays. 64GB is a rough guess at a good upper limit for a single executor. In other words, cap executor memory at about 64GB to avoid excessive GC pressure.
I've noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it's good to keep the number of cores per executor below that number. In other words, at most about five concurrent tasks per executor can reach full HDFS write throughput.
Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM. For example, broadcast variables need to be replicated once on each executor, so many small executors will result in many more copies of the data.

Things to watch:

Reserve memory and cores for Hadoop, YARN, and other system daemons.
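
The sizing rules above can be turned into arithmetic. The sketch below uses a hypothetical node with 16 cores and 64 GB of memory, reserves 1 core and 1 GB for system daemons, uses 5 cores per executor, and assumes a 10% off-heap overhead. All of these numbers are assumptions for illustration; in particular, the real default for spark.yarn.executor.memoryOverhead depends on the Spark version.

```java
final class ExecutorSizing {
    /** Number of executors that fit on one node given the cores left for containers. */
    static int executorsPerNode(int usableCores, int coresPerExecutor) {
        return usableCores / coresPerExecutor;
    }

    /** Heap (MB) per executor after setting aside the assumed off-heap overhead fraction. */
    static int heapPerExecutorMb(int usableMemMb, int executorsPerNode, double overheadFraction) {
        int containerMb = usableMemMb / executorsPerNode;
        return (int) (containerMb / (1.0 + overheadFraction));
    }

    public static void main(String[] args) {
        int usableCores = 16 - 1;          // hypothetical node: 16 cores, 1 reserved
        int usableMemMb = (64 - 1) * 1024; // hypothetical node: 64 GB, 1 GB reserved
        int perNode = executorsPerNode(usableCores, 5);
        int heapMb = heapPerExecutorMb(usableMemMb, perNode, 0.10);
        System.out.println("--executor-cores 5 --executor-memory " + heapMb + "m"
                + " (executors per node: " + perNode + ")");
    }
}
```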

Slimming Down Your Data Structures

Use a custom serializer to reduce the storage footprint of serialized data:
spark.serializer=org.apache.spark.serializer.KryoSerializer

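3 Tuning Parallelism

With too few partitions there are too few tasks, and the machines' resources cannot be fully used.
Ways to increase the number of partitions: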

Use the repartition transformation, which will trigger a shuffle.
Configure your InputFormat to create more splits.
Write the input data out to HDFS with a smaller block size.

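3.1 The spark.default.parallelism parameter

Description: this parameter sets the default number of tasks per stage. It is very important; leaving it unset can directly hurt the performance of your Spark job.

Tuning advice: a default of roughly 500 to 1000 tasks is appropriate for a Spark job. A common mistake is not setting this parameter, in which case Spark derives the number of tasks from the number of underlying HDFS blocks, by default one task per HDFS block. The number Spark picks this way is usually too low (for example, a few dozen tasks), and with too few tasks the executor settings you configured earlier are wasted.

Reducing shuffles and the amount of shuffled data

The repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles.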
Avoid groupByKey when performing an associative reductive operation. For example, rdd.groupByKey().mapValues(_.sum) will produce the same results as rdd.reduceByKey(_ + _)
However, the former will transfer the entire dataset across the network, while the latter will compute local sums for each key in each partition and combine those local sums into larger sums after shuffling.
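
The difference in shuffled data can be illustrated without Spark. The sketch below treats each inner list as one partition of keys (each with an implicit value of 1) and counts how many records would cross the network with and without local pre-aggregation. The class and method names are hypothetical, and this is a simulation of the idea, not Spark's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapSideCombine {
    /** Sums the records of one partition per key, as a map-side combiner would. */
    static Map<String, Integer> localSums(List<String> partitionKeys) {
        Map<String, Integer> sums = new HashMap<>();
        for (String k : partitionKeys) sums.merge(k, 1, Integer::sum);
        return sums;
    }

    /** groupByKey-style: every record is shuffled unchanged. */
    static int recordsShuffledWithoutCombine(List<List<String>> partitions) {
        int n = 0;
        for (List<String> p : partitions) n += p.size();
        return n;
    }

    /** reduceByKey-style: each partition ships one partial sum per distinct key. */
    static int recordsShuffledWithCombine(List<List<String>> partitions) {
        int n = 0;
        for (List<String> p : partitions) n += localSums(p).size();
        return n;
    }

    /** Merges the per-partition partial sums into the final per-key totals. */
    static Map<String, Integer> totals(List<List<String>> partitions) {
        Map<String, Integer> out = new HashMap<>();
        for (List<String> p : partitions) {
            for (Map.Entry<String, Integer> e : localSums(p).entrySet()) {
                out.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "a", "b"),
                Arrays.asList("a", "b", "b", "b"));
        System.out.println("without combine: " + recordsShuffledWithoutCombine(partitions));
        System.out.println("with combine:    " + recordsShuffledWithCombine(partitions));
        System.out.println("totals:          " + totals(partitions));
    }
}
```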

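The following functions should be preferred over groupByKey:

combineByKey: combines the values for each key when the combined result type differs from the input value type.
foldByKey: merges the values for each key using an associative function and a neutral "zero value".

Avoid reduceByKey when the input and output value types are different.

rdd.map(kv => (kv._1, Set[String]() + kv._2))
    .reduceByKey(_ ++ _)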

This code results in tons of unnecessary object creation because a new set must be allocated for each record. It’s better to use aggregateByKey, which performs the map-side aggregation more efficiently:

val zero = collection.mutable.Set.empty[String]
rdd.aggregateByKey(zero)(
    (set, v) => set += v,
    (set1, set2) => set1 ++= set2)

Avoid the flatMap-join-groupBy pattern. When two datasets are already grouped by key and you want to join them and keep them grouped, you can just use ```cogroup```. That avoids all the overhead associated with unpacking and repacking the groups. In other words, when joining sources that are already grouped, use ```cogroup``` directly.


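## 4 When No Shuffle Occurs
- When two data sources are joined after each has already been grouped with the same partitioner, the join does not need a shuffle.
- When one dataset is small, broadcast it as a broadcast variable; no shuffle is needed.

## When More Shuffles are Better
When there are few partitions but a lot of data, a shuffle can increase the number of partitions and therefore the parallelism, which improves efficiency.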

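## 5 RDD
[Spark Performance Tuning: Development Tuning](http://www.iteblog.com/archives/1657)
- Principle 1: Avoid creating duplicate RDDs
- Principle 2: Reuse the same RDD wherever possible
- Principle 3: Persist RDDs that are used more than once (cache, persist)
- Principle 4: Avoid shuffle operators where possible (for example, broadcast large variables instead)
- Principle 5: Use shuffle operations with map-side pre-aggregation
- Principle 6: Use high-performance operators
	- **Use reduceByKey/aggregateByKey instead of groupByKey**
	- **Use mapPartitions instead of plain map** (with mapPartitions-style operators, one function call processes all the data of a partition rather than a single record, so performance is somewhat better)
	- **Use foreachPartition instead of foreach** (one function call processes all the data of a partition rather than a single record)
	- **Coalesce after filter** (after a filter removes a large share of an RDD's data, say 30% or more, use coalesce to manually reduce the number of partitions and pack the data into fewer partitions)
- Principle 7: Broadcast large variables
- Principle 8: Use Kryo to optimize serialization performance
- Principle 9: Optimize data structures
### 5.1 Do not send all elements of a large RDD to the driver
Use functions such as ```collect countByKey countByValue collectAsMap``` with care; use ```take or takeSample``` to cap the amount of data returned.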

## 6 Miscellaneous
### 6.1 Spark tuning: stop applications from uploading dependency JARs to HDFS
[Spark tuning: stop applications from uploading dependency JARs to HDFS](http://www.iteblog.com/archives/1173)
Edit spark-defaults.conf and add the following:
spark.yarn.jar=hdfs://my/home/iteblog/spark_lib/spark-assembly-1.1.0-hadoop2.2.0.jar
This points spark.yarn.jar at the Spark lib on our HDFS.
### 6.2 Configuration parameters
1. Always use the ```--verbose``` option on the 'spark-submit' command to run your workload; the log then prints the Spark parameters in effect.
Example command:
spark-submit --driver-memory 10g --verbose --master yarn --executor-memory 
2. Use ```--packages``` to include comma-separated list of Maven coordinates of JARs
This includes JARs on both driver and executor classpaths

3. Typical workloads that need large driver heap size
- Spark SQL
- Spark Streaming

4. GC policies Tuning options
- Spark default is -XX:+UseParallelGC
- Try overriding with -XX:+UseG1GC

5.  No space left on device
``` shell
Lost task 38.4 in stage 89.3 (TID 30100, rhel4.cisco.com): java.io.IOException: No space left on device
```
- Complains that '/tmp' is full
- Controlled by the 'spark.local.dir' parameter
	- Default is '/tmp'
	- Stores map output files and RDDs
- Two reasons '/tmp' is not an ideal place for Spark "scratch" space
	- '/tmp' usually is small and for OS
	- '/tmp' usually is a single disk, a potential IO bottleneck
- To fix, add the following line to the 'spark-defaults.conf' file:
spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp

Java Dynamic Techniques

Posted on 2016-07-31 | In Java |

Java programming dynamics, Part 4: Class transformation with Javassist

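Servlet Listeners: Using ServletContextListener

Posted on 2016-07-26 | In Web |

This article explains the ServletContextListener interface in Java EE and how to use it.

1. When do you need ServletContextListener?

A common requirement is to run some code before a web application starts. For example, we may need to create a database connection that the web application can use at any time, and close it when the web application shuts down.

2. How is this implemented?

The Java EE specification provides an interface called ServletContextListener that meets this need. ServletContextListener listens for the lifecycle events of the servlet context: it is notified when the web application it is associated with starts and shuts down. The javadoc describes the interface as follows: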

Implementations of this interface receive notifications about changes to the servlet context of the web application they are part of. To receive notification events, the implementation class must be configured in the deployment descriptor for the web application.

To be notified when the web application starts, implement the contextInitialized(ServletContextEvent event) method.

Notification that the web application initialization process is starting. All ServletContextListeners are notified of context initialization before any filter or servlet in the web application is initialized.

To be notified when the web application stops (shuts down), implement the contextDestroyed(ServletContextEvent event) method.

Notification that the servlet context is about to be shut down. All servlets and filters have been destroy()ed before any ServletContextListeners are notified of context destruction.

Create a listener class as follows:

package com.cruise;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

public class MyServletContextListener implements ServletContextListener {

    public void contextInitialized(ServletContextEvent event) {
        System.out.println("context initialized");
    }

    public void contextDestroyed(ServletContextEvent event) {
        System.out.println("context destroyed");
    }

}

Next, configure the listener in web.xml:

<web-app ...>
  <listener>
    <listener-class>com.cruise.MyServletContextListener</listener-class>
  </listener>
</web-app>

After configuration, deploy the application to Tomcat and start Tomcat; the log will show the following.

INFO: Starting service Catalina
Oct 24, 2015 10:52:04 AM org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.35
context initialized
Oct 24, 2015 10:52:04 AM org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-8080
Oct 24, 2015 10:52:04 AM org.apache.jk.common.ChannelSocket init

Extending Thread

public class ThreadListener extends Thread implements ServletContextListener {

    public void contextInitialized(ServletContextEvent event) {
        super.start();
    }

    public void contextDestroyed(ServletContextEvent event) {
        // Thread.stop() is deprecated and unsafe; interrupt the thread and let run() exit.
        super.interrupt();
    }

    @Override
    public void run() {

    }

}
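
A listener that owns a background thread needs a clean way to shut it down. The sketch below shows one common pattern using only the standard library: an interruptible loop whose start() and stop() would be called from contextInitialized and contextDestroyed respectively. `BackgroundWorker` is a hypothetical class, and the 100 ms sleep is a placeholder for real periodic work.

```java
final class BackgroundWorker implements Runnable {
    private final Thread thread = new Thread(this, "background-worker");

    /** Would be called from contextInitialized. */
    void start() {
        thread.start();
    }

    /** Would be called from contextDestroyed: interrupts the loop and waits for it to end. */
    void stop() {
        thread.interrupt();
        try {
            thread.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
    }

    boolean isAlive() {
        return thread.isAlive();
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(100); // placeholder for periodic work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag so the loop exits
            }
        }
    }

    public static void main(String[] args) {
        BackgroundWorker worker = new BackgroundWorker();
        worker.start();
        worker.stop();
        System.out.println("worker alive after stop: " + worker.isAlive());
    }
}
```

Composition (a listener holding a worker) is used here instead of extending Thread so that the listener class stays a plain lifecycle hook.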

Using WebHDFS and HttpFS

Posted on 2016-07-22 | In Hadoop |

WebHDFS

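Introduction

Provides a RESTful interface to HDFS, through which HDFS file operations can be performed.

Installation

The WebHDFS service is built into HDFS; no separate installation or startup is needed.

Configuration

The WebHDFS switch in hdfs-site.xml must be on; it is on by default.

<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>

Usage

Connect to port 50070 on the NameNode to perform file operations.

For example: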

"http://ctrl:50070/webhdfs/v1/?op


### More operations
Reference: [Official WebHDFS REST API](https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/WebHDFS.html)

## HttpFS(Hadoop HDFS over HTTP)

### Introduction

HttpFS is a server that provides a REST HTTP gateway supporting all HDFS File System operations (read and write). It is interoperable with the webhdfs REST HTTP API.

### Installation

Bundled with Hadoop; no separate installation is needed. The service is not started by default and must be started manually.

### Configuration

- httpfs-site.xml
HttpFS has its own configuration file, httpfs-site.xml; the defaults are usually fine and need no changes.

- hdfs-site.xml
Add the following configuration. The "root" in the two parameter names is the OS user that starts the service and should be replaced with the actual user name.
``` xml
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
```

Starting

sbin/httpfs.sh start
sbin/httpfs.sh stop

After startup it listens on port 14000 by default:

[root@ctrl sbin]# netstat -antp | grep 14000
tcp        0      0 :::14000   :::*       LISTEN      7415/java
[root@ctrl sbin]#

Usage

curl "http://ctrl:14000/webhdfs/v1/?op=liststatus&user.name=root" | python -mjson.tool

References

More operations:
Official WebHDFS REST API
HttpFS official documentation

How WebHDFS and HttpFS Relate

WebHDFS vs HttpFS. The major difference between WebHDFS and HttpFS: WebHDFS needs access to all nodes of the cluster, and when data is read it is transmitted directly from the node that holds it, whereas with HttpFS a single node acts as a "gateway" and is the single point of data transfer to the client. HttpFS could therefore be choked during a large file transfer, but the good thing is that it minimizes the footprint required to access HDFS.
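
Because both services expose the same REST path, a client only needs to switch host and port. The sketch below builds the list-status URL for either endpoint using the host name, ports, and query string that appear in this post; the class and method names are hypothetical, and using the same query for the NameNode endpoint is an assumption.

```java
final class WebHdfsUrls {
    static final int NAMENODE_PORT = 50070; // WebHDFS on the NameNode, as used in this post
    static final int HTTPFS_PORT = 14000;   // HttpFS default port, as used in this post

    /** Builds a list-status URL; `path` must start with "/". */
    static String listStatus(String host, int port, String path, String user) {
        return "http://" + host + ":" + port + "/webhdfs/v1" + path
                + "?op=liststatus&user.name=" + user;
    }

    public static void main(String[] args) {
        System.out.println(listStatus("ctrl", NAMENODE_PORT, "/", "root"));
        System.out.println(listStatus("ctrl", HTTPFS_PORT, "/", "root"));
    }
}
```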

Running Scheduled Tasks with Spring's @Scheduled Annotation

Posted on 2016-06-14 | In Spring |

Create the spring-task.xml file

<!-- Add: xmlns:task="http://www.springframework.org/schema/task"
   xsi:schemaLocation="http://www.springframework.org/schema/task
	http://www.springframework.org/schema/task/spring-task-3.1.xsd"
-->

<?xml version="1.0" encoding="UTF-8"?>  
<beans xmlns="http://www.springframework.org/schema/beans"  
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"   
    xmlns:tx="http://www.springframework.org/schema/tx"  
    xmlns:aop="http://www.springframework.org/schema/aop"  
    xmlns:context="http://www.springframework.org/schema/context"  
    xmlns:mvc="http://www.springframework.org/schema/mvc"
    xmlns:task="http://www.springframework.org/schema/task"  
    xsi:schemaLocation="http://www.springframework.org/schema/beans     
    http://www.springframework.org/schema/beans/spring-beans-3.2.xsd     
    http://www.springframework.org/schema/tx     
    http://www.springframework.org/schema/tx/spring-tx-3.2.xsd   
    http://www.springframework.org/schema/aop  
    http://www.springframework.org/schema/aop/spring-aop-3.2.xsd   
    http://www.springframework.org/schema/context    
    http://www.springframework.org/schema/context/spring-context-3.2.xsd    
    http://www.springframework.org/schema/mvc  
    http://www.springframework.org/schema/mvc/spring-mvc-3.2.xsd
    http://www.springframework.org/schema/task  
    http://www.springframework.org/schema/task/spring-task-3.2.xsd
   ">  

	<task:annotation-driven /> <!-- enables annotation-driven scheduling -->

    <context:annotation-config/>
	<!-- package to scan for components -->
    <context:component-scan base-package="com.spring.task" />
    <bean class="org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor"/>
</beans>

Define the interface and its implementation, adding the annotations:

public interface IMyTestService {  
       public void myTest();  
}  

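@Component  // import org.springframework.stereotype.Component;
public class MyTestServiceImpl implements IMyTestService {
      @Scheduled(cron = "0/5 * * * * ?")   // runs every 5 seconds
      @Override
      public void myTest() {
            System.out.println("entering test");
      }
}

Notes:

Spring's @Scheduled annotation must be placed on the implementing class.
A scheduled method must not return a value (if it does, Spring reports an error at initialization saying that a proxyTargetClass setting must be set to true).
The implementing class must carry the @Component annotation.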

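Cron expression fields (this table follows Quartz cron syntax; Spring's @Scheduled takes six fields and has no year field):

Field            Allowed values        Allowed special characters
Seconds          0-59                  , - * /
Minutes          0-59                  , - * /
Hours            0-23                  , - * /
Day of month     1-31                  , - * ? / L W C
Month            1-12 or JAN-DEC       , - * /
Day of week      1-7 or SUN-SAT        , - * ? / L C #
Year (optional)  empty, 1970-2099      , - * /

Expression examples:
"0 0 12 * * ?" fires at 12:00 noon every day
"0 15 10 ? * *" fires at 10:15 AM every day
"0 15 10 * * ?" fires at 10:15 AM every day
"0 15 10 * * ? *" fires at 10:15 AM every day
"0 15 10 * * ? 2005" fires at 10:15 AM every day during 2005
"0 * 14 * * ?" fires every minute from 2:00 PM to 2:59 PM every day
"0 0/5 14 * * ?" fires every 5 minutes from 2:00 PM to 2:55 PM every day
"0 0/5 14,18 * * ?" fires every 5 minutes from 2:00 PM to 2:55 PM and from 6:00 PM to 6:55 PM every day
"0 0-5 14 * * ?" fires every minute from 2:00 PM to 2:05 PM every day
"0 10,44 14 ? 3 WED" fires at 2:10 PM and 2:44 PM every Wednesday in March
"0 15 10 ? * MON-FRI" fires at 10:15 AM Monday through Friday
"0 15 10 15 * ?" fires at 10:15 AM on the 15th of every month
"0 15 10 L * ?" fires at 10:15 AM on the last day of every month
"0 15 10 ? * 6L" fires at 10:15 AM on the last Friday of every month
"0 15 10 ? * 6L 2002-2005" fires at 10:15 AM on the last Friday of every month from 2002 through 2005
"0 15 10 ? * 6#3" fires at 10:15 AM on the third Friday of every month

Unix crontab (five-field) examples:
Every day at 6 AM
0 6 * * *
Every two hours
0 */2 * * *
Every two hours between 11 PM and 8 AM, plus 8 AM
0 23-7/2,8 * * *
11 AM on the 4th of every month and on every Monday through Wednesday
0 11 4 * 1-3
4 AM on January 1
0 4 1 1 *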
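
The special characters `,`, `-`, `*`, and `/` can be understood by expanding a single numeric field into the values it matches. The sketch below does that for one field. It is an illustration under assumptions, not a cron parser: `CronField` is a hypothetical class, and it does not handle `?`, `L`, `W`, `C`, `#`, month or day names, or wrap-around ranges such as 23-7.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class CronField {
    /** Expands one numeric field such as "0/5", "10,44", "0-5" or "*" within [min, max]. */
    static List<Integer> expand(String field, int min, int max) {
        TreeSet<Integer> values = new TreeSet<>();
        for (String part : field.split(",")) {
            String[] rangeAndStep = part.split("/");
            boolean hasStep = rangeAndStep.length > 1;
            int step = hasStep ? Integer.parseInt(rangeAndStep[1]) : 1;
            String range = rangeAndStep[0];
            int lo;
            int hi;
            if (range.equals("*")) {
                lo = min;
                hi = max;
            } else if (range.contains("-")) {
                String[] bounds = range.split("-");
                lo = Integer.parseInt(bounds[0]);
                hi = Integer.parseInt(bounds[1]);
            } else {
                lo = Integer.parseInt(range);
                hi = hasStep ? max : lo; // "0/5" means: start at 0 and step up to max
            }
            for (int v = lo; v <= hi; v += step) {
                values.add(v);
            }
        }
        return new ArrayList<>(values);
    }

    public static void main(String[] args) {
        System.out.println(expand("0/5", 0, 59));   // minutes field of "0 0/5 14 * * ?"
        System.out.println(expand("14,18", 0, 23)); // hours field of "0 0/5 14,18 * * ?"
        System.out.println(expand("0-5", 0, 59));   // minutes field of "0 0-5 14 * * ?"
    }
}
```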
