shark will generate many small files when executing queries like "create table as select ..."  #6

@lalaguozhe

It seems that Shark does not take "hive.merge.mapfiles" into account. When I execute "create table as select ...", the query gets a lot of input splits; each split corresponds to one task and therefore to one output file. Hive launches a conditional task that checks whether the average file size in the output directory is below some threshold and, if so, runs a merge task. Shark ignores this step and leaves HDFS with many small files (even empty ones).
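To illustrate the check described above, here is a minimal sketch of the merge decision in plain Python. It is not Hive's actual code: the function names are hypothetical, and the default values are assumptions modelled on the Hive settings "hive.merge.smallfiles.avgsize" and "hive.merge.size.per.task", so check them against your Hive version.

```python
def should_merge(file_sizes, avg_size_threshold=16_000_000):
    """Return True when the average output file size is below the threshold.

    avg_size_threshold is an assumed stand-in for hive.merge.smallfiles.avgsize.
    """
    if not file_sizes:
        return False  # nothing was written, so there is nothing to merge
    return sum(file_sizes) / len(file_sizes) < avg_size_threshold


def plan_merge(file_sizes, target_size=256_000_000):
    """Greedily group files so that each merged file stays within target_size.

    target_size is an assumed stand-in for hive.merge.size.per.task.
    Returns a list of groups; each group lists the sizes merged together.
    """
    groups, current, current_bytes = [], [], 0
    for size in file_sizes:
        # Close the current group once adding this file would exceed the target.
        if current and current_bytes + size > target_size:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups
```

With one hundred 1 MB outputs, should_merge reports that a merge is needed and plan_merge puts all of them into a single group, which is roughly the behaviour I would expect Shark to honour after a "create table as select ..." query.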
To make matters worse, Shark doesn't use CombineHiveInputFormat, so if we then select from the created table, we get a huge number of tasks. Spark has CoalescedRDD to group RDD partitions and reduce the number of tasks, but Shark doesn't implement a similar RDD. Under these circumstances, Shark is not suitable for ETL jobs.
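The partition grouping I have in mind can be sketched as follows, again in plain Python. This is a simplified illustration of the idea, not Spark's CoalescedRDD implementation: coalesce_partitions is a hypothetical helper, and it ignores data locality entirely.

```python
def coalesce_partitions(num_partitions, max_partitions):
    """Map parent partition indices onto at most max_partitions groups.

    Each returned group would be processed by a single task, so the task
    count drops from num_partitions to len(result).
    """
    if max_partitions <= 0:
        raise ValueError("max_partitions must be positive")
    n = min(num_partitions, max_partitions)
    groups = [[] for _ in range(n)]
    for i in range(num_partitions):
        # Assign contiguous ranges of parent partitions to each group.
        groups[i * n // num_partitions].append(i)
    return groups
```

For example, ten small input splits coalesced to three groups would need only three tasks instead of ten. In Spark itself the equivalent operation is exposed as coalesce on an RDD.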
