Shark generates many small files when executing queries like "create table as select ..."

It seems that Shark doesn't take "hive.merge.mapfiles" into account. When I execute "create table as select ..", the query gets a lot of input splits; each split corresponds to one task and also to one output file. Hive would launch a conditional task that checks whether the average file size in the output directory is below some threshold and, if so, runs a merge task. But Shark ignores this and leaves HDFS with many small files (even empty files).
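
For reference, the check I'm describing on the Hive side can be sketched roughly as below. This is only an illustration of the "average file size below a threshold" idea, not Hive's actual code; the function name `should_merge` and its parameters are hypothetical, and the threshold is assumed to come from a setting such as "hive.merge.smallfiles.avgsize".

```python
def should_merge(file_sizes, avg_size_threshold, merge_enabled=True):
    """Decide whether a follow-up merge job looks worthwhile.

    file_sizes: sizes in bytes of the files written to the output directory.
    avg_size_threshold: hypothetical stand-in for a Hive setting such as
        hive.merge.smallfiles.avgsize.
    merge_enabled: hypothetical stand-in for hive.merge.mapfiles.
    """
    # Nothing to do if merging is switched off or no files were written.
    if not merge_enabled or not file_sizes:
        return False
    # Merge only when the average output file is smaller than the threshold.
    average = sum(file_sizes) / len(file_sizes)
    return average < avg_size_threshold
```

With many tiny or empty outputs, e.g. `should_merge([0, 1024, 2048], 16_000_000)`, the check says a merge is needed; this is the step Shark appears to skip.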
To make matters worse, Shark doesn't use CombineHiveInputFormat. If we then select from the created table, we get a huge number of tasks. Spark has CoalescedRDD to group RDD partitions and reduce the number of tasks, but Shark hasn't implemented a similar RDD. Under these circumstances, Shark is not suitable for ETL jobs.
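
To illustrate what I mean by coalescing, here is a minimal sketch of grouping many partitions into fewer tasks. It is not Spark's CoalescedRDD implementation (which, as far as I know, also takes data locality into account); the function name `coalesce_partitions` is hypothetical and it simply packs neighbouring partitions into contiguous groups.

```python
def coalesce_partitions(partitions, num_groups):
    """Pack a list of partitions into at most num_groups contiguous groups.

    Each returned group would be processed by a single task, so the task
    count drops from len(partitions) to at most num_groups.
    """
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    n = len(partitions)
    k = min(num_groups, n)  # never create more groups than partitions
    groups = []
    for i in range(k):
        # Integer arithmetic spreads the partitions as evenly as possible.
        start = i * n // k
        end = (i + 1) * n // k
        groups.append(partitions[start:end])
    return groups
```

For example, `coalesce_partitions(list(range(10)), 3)` turns ten small splits into three tasks instead of ten.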


shark will generate many small files when executing queries like "create table as select ..." #6
