It seems that Shark does not take "hive.merge.mapfiles" into account. When I execute "create table as select ..", the query gets a lot of input splits; each split corresponds to one task and also to one output file. Hive would launch a conditional task to check whether the average file size in the output directory is below some threshold and, if so, run a merge task. Shark ignores this and leaves HDFS with many small files (even empty ones).
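To make the missing step concrete, here is a minimal sketch of the kind of check Hive's conditional merge task performs. It is not Hive's actual code: the function name `should_merge` is hypothetical, and the default threshold is an assumption meant to mirror what I believe is the default of "hive.merge.smallfiles.avgsize" (16000000 bytes).

```python
def should_merge(file_sizes, avg_size_threshold=16_000_000):
    """Decide whether an output directory needs a merge pass.

    file_sizes: sizes in bytes of the files written by the query
                (empty files count as 0).
    avg_size_threshold: assumed stand-in for hive.merge.smallfiles.avgsize.
    """
    if not file_sizes:
        # Nothing was written, so there is nothing to merge.
        return False
    average = sum(file_sizes) / len(file_sizes)
    # Merge only when the average output file is smaller than the threshold.
    return average < avg_size_threshold


# One output file per split, several of them tiny or empty.
print(should_merge([0, 1024, 2048, 0, 512]))
# A few large files, no merge needed.
print(should_merge([300_000_000, 200_000_000]))
```

In Hive this decision is followed by an extra job that rewrites the small files into larger ones; the sketch only covers the decision itself.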
To make matters worse, Shark doesn't use CombineHiveInputFormat. If we then select from the created table, we get a huge number of tasks. Spark has CoalescedRDD to group RDD partitions and reduce the number of tasks, but Shark has not implemented a similar RDD. Under these circumstances, Shark is not suitable for ETL jobs.
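As a rough illustration of what combining inputs buys, the sketch below greedily packs many small files into fewer combined splits, so that one task reads several files. This is only an assumption-laden simplification: the real CombineHiveInputFormat and CoalescedRDD also take data locality into account, and the name `combine_splits` and its parameters are hypothetical.

```python
def combine_splits(file_sizes, max_split_size):
    """Greedily pack consecutive files into groups of at most max_split_size bytes.

    Returns a list of groups, each group being a list of file indices.
    A file larger than max_split_size still gets a group of its own.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(file_sizes):
        # Close the current group if adding this file would overflow it.
        if current and current_size + size > max_split_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


sizes = [10, 20, 0, 50, 100, 5]          # six small files, six tasks without combining
groups = combine_splits(sizes, max_split_size=64)
print(len(sizes), "files ->", len(groups), "tasks:", groups)
```

Without such a step, the number of tasks equals the number of files, which is what happens when reading a table that was written as many small files.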