What is the SORT BY command in Hive?
ORDER BY always reduces the job to a single reducer, even if we ask for more reducers with a command such as
SET mapreduce.job.reduces=5; (5 reducers)
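Because ORDER BY guarantees a total order over the whole result, it always ends up on one reducer. To overcome this problem we can use the SORT BY command, which sorts the rows inside each reducer. However, if we run SORT BY with more than one reducer, the output is sorted only per reducer: the same key values can show up in the output of several reducers, so the overall result is not globally ordered.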
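The difference between the two commands can be sketched as below. The table `employees` and its columns `name`, `dept` and `salary` are hypothetical names used only for illustration; they are not from the original question.

```sql
-- Hypothetical table: employees(name STRING, dept STRING, salary INT)
SET mapreduce.job.reduces=5;

-- ORDER BY: one globally sorted result, produced by a single reducer
-- no matter what mapreduce.job.reduces is set to.
SELECT name, dept, salary
FROM employees
ORDER BY salary DESC;

-- SORT BY: can use all 5 reducers, but each reducer sorts only the rows
-- it received, so the combined output is not globally ordered.
SELECT name, dept, salary
FROM employees
SORT BY salary DESC;
```

Running the second query and looking at the output files one by one should show each file sorted on its own, with salary values overlapping between files.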
Why does SORT BY need to be used with DISTRIBUTE BY in Hive?
When we use SORT BY on its own, Hive sorts the rows inside each reducer, but we have no control over which reducer a given row is sent to. Each reducer treats the SORT BY column as its key, and rows with the same key value can land in any of them. If you use three reducers, the same key may appear in the output of all three, which looks like three overlapping copies of the key range. To overcome this problem, SORT BY needs to be used along with DISTRIBUTE BY, which sends all rows with the same DISTRIBUTE BY value to the same reducer.
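A minimal sketch of the combination, again assuming the hypothetical `employees` table with columns `name`, `dept` and `salary`:

```sql
-- Hypothetical table: employees(name STRING, dept STRING, salary INT)
SET mapreduce.job.reduces=3;

-- DISTRIBUTE BY decides which reducer each row goes to:
-- all rows with the same dept are sent to the same reducer.
-- SORT BY then orders the rows inside each reducer.
SELECT name, dept, salary
FROM employees
DISTRIBUTE BY dept
SORT BY dept, salary DESC;
```

With this query each department should appear in exactly one reducer's output, sorted by salary within that department. A single reducer may still handle more than one department, which is why `dept` is also listed first in SORT BY.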