排序——万亿数量级
数量级
- 接前面
- 小结
- 一万: O ( n 2 ) O(n^2) O(n2)
- 五万: n ∗ log 2 n n*\log_{2}{n} n∗log2n
- 五十万:中值排序
- 代码
- 五百万: s h e l l shell shell排序
- 一千万:快速排序
- 五千万:基数排序
- 一亿:点数排序
- 代码
- 打印
- 十亿:散列排序
- 桶排序(hash)
接前面
习题非递归快速排序讲了数据结构在排序算法中的重要作用。排序对象是随机自然数,从一万开始。
小结
书上对常见的排序算法做了小结。
一万: O ( n 2 ) O(n^2) O(n2)
里面最慢的是冒泡排序,八秒多,因为它要一个一个地比。
list_sort 0.002000093460083008
heapsort 0.013000011444091797
bubble减少交换 3.235999584197998
bub 8.67300009727478
insert 4.486999988555908
shellSort 0.04900026321411133
quick 0.05900001525878906
quick_partition 0.06200003623962402
MedianSortBFPRT 0.12800002098083496
merge 0.04200005531311035
五万: n ∗ log 2 n n*\log_{2}{n} n∗log2n
这里插入冒泡都到一两百,要淘汰了。
淘汰:
select 136.42199969291687
bubble减少交换 81.0329999923706
bub 214.32599997520447
insert 113.50599980354309
list_sort 0.0070002079010009766
heapSort done 0.372999906539917
shellSort 0.3559999465942383
quick 0.18700003623962402
quick_partition 0.1510000228881836
MedianSortBFPRT 0.6370000839233398
merge 0.25099992752075195
pivotsort 0.22099995613098145
五十万:中值排序
这里最慢的是MedianSortBFPRT中值排序,八秒多。先找到数组的中值,把数分两边;再递归两边的数据,直到排好序。
list_sort 0.10700011253356934
heapsort 0.7980000972747803
heapSort done 5.137999773025513
shellSort 6.007999897003174
quick 1.9620001316070557
quick_partition 2.17900013923645
MedianSortBFPRT 8.876000165939331
merge 3.1600000858306885
pivotsort 2.5890002250671387
代码
MedianSortBFPRT 是中值排序的改进版,从四个值里选中值。从C语言代码改过来的。应该比较适合指针操作的,说到指针还有表插入排序。
def cmp(a, b):if a < b:return -1elif a > b:return 1else:return 0
def medianOfFour(ar, left, gap):pos1, pos2, pos3, pos4 = left, left+gap, left+2*gap, left+3*gapa1, a2, a3, a4 = ar[pos1], ar[pos2], ar[pos3], ar[pos4]if cmp(a1, a2) <= 0:if cmp(a2, a3) <= 0:if cmp(a2, a4) <= 0:if cmp(a3, a4) > 0:ar[pos3], ar[pos4] = ar[pos4], ar[pos3]else:ar[pos2], ar[pos3] = ar[pos3], ar[pos2]else:if cmp(a1, a3) <= 0:if cmp(a3, a4) <= 0:if cmp(a2, a4) <= 0:ar[pos2], ar[pos3] = ar[pos3], ar[pos2]else:ar[pos3], ar[pos4] = ar[pos4], ar[pos3]else:if cmp(a1, a4) <= 0:if cmp(a2, a4) <= 0:ar[pos2], ar[pos3] = ar[pos3], ar[pos2]else:ar[pos3], ar[pos4] = ar[pos4], ar[pos3]else:ar[pos1], ar[pos3] = ar[pos3], ar[pos1]else:if cmp(a1, a3) <=0:if cmp(a1, a4) <=0:if cmp(a3, a4) > 0:ar[pos3], ar[pos4] = ar[pos4], ar[pos3]else:ar[pos3], ar[pos1] = ar[pos1], ar[pos3]else:if cmp(a2, a3) <=0:if cmp(a3, a4) <=0:if cmp(a1, a4) <= 0:ar[pos1], ar[pos3] = ar[pos3], ar[pos1]else:ar[pos3], ar[pos4] = ar[pos4], ar[pos3]else:if cmp(a2, a4) <=0:if cmp(a1, a4) <= 0:ar[pos1], ar[pos3] = ar[pos3], ar[pos1]else:ar[pos3], ar[pos4] = ar[pos4], ar[pos3] else:ar[pos2], ar[pos3] = ar[pos3], ar[pos2]def insertion(ar, low, right, gap):for loc in range(low+gap, right, gap):i = loc - gapvalue = ar[loc]while i>=low and cmp(ar[i],value)>0:ar[i+gap] = ar[i]i -= gapar[i+gap] = valuedef medianOfMedians(ar, left, right, gap):'''gap是1也会是最后调用的span'''span = 4*gapnum = (right-left+1)//spanif (0==num):insertion(ar, left, right, gap)num = (right-left+1)//gapreturn int(left + gap*(num-1)//2)for s in range(left, right, span):if s + span <= right:medianOfFour(ar, s, gap)if (4>num):insertion(ar, left+span//2, right, span)return int(left + num*span//2)else:return medianOfMedians(ar, left+span//2, s-1, span)def partition(alist, left, right, pivotIndex):'''返回第几大的前区'''pivot = alist[pivotIndex]store = leftalist[right], alist[pivotIndex] = alist[pivotIndex], alist[right]for i in range(left, right):if alist[i] <= pivot:alist[i] ,alist[store] = alist[store], alist[i]store += 1alist[right], alist[store] = alist[store], alist[right]return storedef selectMedian(ar, left, right):k =(right - left + 1)//2while k > 0:idx = medianOfMedians(ar, left, right, 1)pivotIndex = partition(ar, left, right, idx)p = left + kif p == pivotIndex:return pivotIndexelif p < pivotIndex:right = pivotIndex - 1else:k = k - (pivotIndex - left + 1)left = pivotIndex + 1return leftdef medianSort(alist, left, right):if (left >= right):returnmid = (right - left + 1) // 2me = selectMedian(alist, left, right)medianSort(alist, left, left + mid -1)medianSort(alist, left + mid + 1, right)
五百万: s h e l l shell shell排序
MedianSortBFPRT已经137秒,可以淘汰了。 s h e l l shell shell排序98秒下一级淘汰。
淘汰:
MedianSortBFPRT 137.97499990463257
list_sort 1.5319998264312744
shellSort 98.6560001373291
merge 39.10199999809265
pivotsort 30.817999839782715
quick_medianOf3 26.94600009918213
heapSort done 66.4060001373291
quick 25.361000299453735
quick非递归 25.13599991798401
quick_partition 26.33399987220764
quick_median_pivot_2cursor 26.187999963760376
quick_end_pivot 24.217000007629395
quick_end 23.700000047683716
radixSort 24.115000009536743
一千万:快速排序
pivotsort,quick_partition都是快速排序,使用轴交换数据。 像前面的quick_medianOf3 是用前中后三个值大小在中间的那个做轴,quick_median_pivot_2cursor 正中间的做轴,quick_end_pivot是以最后一个数做轴,quick_end在数量少到10的时候用插入排序。
if low - 1 - start <= 10:insertionSort(alist[start:low])else:quick_end(alist, start, low - 1)if end - 1 - low <= 10:insertionSort(alist[low + 1:end + 1])else:quick_end(alist, low + 1, end)
淘汰:
shellSort 276.1419999599457
list_sort 3.2100000381469727
heapSort done 142.9500002861023
quick 90.61400008201599
quick_partition 146.29100012779236
merge 81.66299986839294
pivotsort 111.06699991226196
快速排序和归并排序merge 也要淘汰了。
五千万:基数排序
看数据下一个淘汰的是基数排序radixSort和堆排序heapSort。堆没得说像二叉树一样。但是基数尽然比堆还快了600多秒,这里就要看题目了,排序的是自然数,要是带小数的就不能用基数排序了。
淘汰:
quick 1389.9990000724792
pivotsort 1236.2779998779297
list_sort 17.25499963760376
radixSort 170.40100002288818
heapSort done 848.5410001277924
一亿:点数排序
其它的都打印不出来了,python列表排序sort()也要34秒了。
list_sort 34.79900026321411
代码
里面是有一亿重复的数据(0到9)。
本想用x=random.sample(range(1000),10000)生成重复的数据,报错了:
ValueError Sample larger than population or is negative
只好用这种方式:模运算还是很好用的。
x = [i % 10 for i in range(100000000)]
random.shuffle(x)
x = [i % 10 for i in range(100000000)]random.shuffle(x)from collections import defaultdictdef countingSort(A):B = defaultdict(lambda : 0) #bucketsfor i in A:B[i] += 1idx = 0for i in range(len(B)):while B[i] > 0:A[idx] = iidx += 1B[i] -= 1 #x = random_int_list(0,10,50)#x=random.sample(range(1000),10000)dianshux = x.copy()start = time.time() countingSort(dianshux)end = time.time()print("点数排序",end - start)pythenlistx = x.copy()start = time.time() pythenlistx.sort()end = time.time()print("list_sort",end - start)
打印
先看一下这个打印:
点数排序 28.121000051498413
list_sort 7.856000185012817
这也是一亿随机数据啊,怎么list_sort从上面的34秒降到了7秒。看来打败list_sort没有那么容易。
十亿:散列排序
怎么把散的数据列起来?
点数排序 294.4590001106262
list_sort 134.59100008010864
没有打败。
桶排序(hash)
不能用
def hash(a):
return a // 100 #改这里应该好点吧,还是要加大桶
这样的散列函数(hash)把你排在比较低的位置。
匹配——散列法讲到了散列法。散列函数有数字分析法、折叠法、中平方法、基数转换法、除法…还有很多。匹配——rabin_karp是怎么滚动的?讲到的取模运算散列方法把我折磨得不轻。只要充分分析好数据,运用更多的数学知识,一定能打败list_sort。