a. Replacing MR with Tez:- Tez offers an API that can process petabytes of data across a cluster, and it runs a query as a single DAG of tasks instead of a chain of MapReduce jobs.
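b. ORC file format:- Store tables in the ORC file format for better performance; it is columnar and compressed, so a query reads only the columns it needs.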
c. Partitioning:- Partition tables according to the logical requirements of the data, typically on the columns that queries filter by.
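d. Bucketing:- The next level below partitioning, and more useful when many different requirements need to use the same data. Bucketing hashes rows on a column into a fixed number of files, so each user can read the data as their requirement dictates (for example, for joins or sampling on that column).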
e. Vectorization:- Enable vectorized execution so that scans, aggregations, filters and joins process rows in batches rather than one row at a time.
f. CBO:- Analyze resource usage before finalizing your parallelism. Check the cost of each candidate plan and compare them; the cost-based optimizer relies on table and column statistics to do this.
g. Indexing:- Make sure your tables are indexed.
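
The points above can be sketched in HiveQL roughly as follows. This is only an illustration under assumptions: the table `sales_orc`, its columns, the bucket count of 32 and the index name are hypothetical, and the right values depend on your data and Hive version. The `SET` properties are standard Hive settings; the index statements apply only to Hive releases before 3.0, where indexing was removed.

```sql
-- (a) Run queries on Tez instead of MapReduce.
SET hive.execution.engine=tez;

-- (e) Turn on vectorized execution for map-side and reduce-side operators.
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- (f) Turn on the cost-based optimizer and let it use stored statistics.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- (b), (c), (d) Hypothetical table: stored as ORC, partitioned by date,
-- bucketed by customer. Names and bucket count are assumptions.
CREATE TABLE sales_orc (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- (f) Collect the table and column statistics that the optimizer costs plans with.
ANALYZE TABLE sales_orc PARTITION (order_date) COMPUTE STATISTICS;
ANALYZE TABLE sales_orc PARTITION (order_date) COMPUTE STATISTICS FOR COLUMNS;

-- (f) Inspect the plan of a candidate query before settling on it.
EXPLAIN
SELECT customer_id, SUM(amount)
FROM sales_orc
WHERE order_date = '2020-01-01'
GROUP BY customer_id;

-- (g) Index on the lookup column (Hive versions before 3.0 only).
CREATE INDEX idx_sales_customer ON TABLE sales_orc (customer_id)
AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX idx_sales_customer ON sales_orc REBUILD;
```

Comparing the `EXPLAIN` output with and without a given setting is one simple way to check whether that setting actually changes the plan for your query.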