Wednesday, September 23, 2020

Sebastian Witowski: Sorting Lists

There are at least two common ways to sort lists in Python:

  • With sorted function that returns a new list
  • With list.sort method that modifies list in place

Which one is faster? Let’s find out!

sorted() vs list.sort()

I will start with a list of 1 000 000 randomly shuffled integers. Later on, I will also check if the order matters.

# sorting.py
from random import sample

# List of 1 000 000 integers randomly shuffled
MILLION_RANDOM_NUMBERS = sample(range(1_000_000), 1_000_000)


def test_sort():
    return MILLION_RANDOM_NUMBERS.sort()

def test_sorted():
    return sorted(MILLION_RANDOM_NUMBERS)
$ python -m timeit -s "from sorting import test_sort" "test_sort()"
1 loop, best of 5: 6 msec per loop

$ python -m timeit -s "from sorting import test_sorted" "test_sorted()"
1 loop, best of 5: 373 msec per loop

When benchmarked with Python 3.8 (how I benchmark), sort() is around 60 times as fast as sorted() when sorting 1 000 000 numbers (373/6≈62.167).

But it turns out that the initial order has a massive impact on the final results! What happens when our initial list is already sorted?

MILLION_NUMBERS = list(range(1_000_000))
$ python -m timeit -s "from sorting import test_sort" "test_sort()"
50 loops, best of 5: 6.34 msec per loop

$ python -m timeit -s "from sorting import test_sorted" "test_sorted()"
20 loops, best of 5: 11 msec per loop

Now the difference is much smaller - sorted is only around 70% slower (11/6.34≈1.735).

If you want to run the benchmarks on your computer, make sure to adjust the test_sort and test_sorted functions, so they use the new MILLION_NUMBERS variable (instead of the MILLION_RANDOM_NUMBERS). Make sure you do this update for each of the following tests.

And if we try to sort a list of 1 000 000 numbers initially ordered in descending order:

DESCENDING_MILLION_NUMBERS = list(range(1_000_000, 0, -1))
$ python -m timeit -s "from sorting import test_sort" "test_sort()"
50 loops, best of 5: 6.23 msec per loop

$ python -m timeit -s "from sorting import test_sorted" "test_sorted()"
20 loops, best of 5: 11.2 msec per loop

The results are almost identical as before. The more random our initial list is, the longer it takes sorted() to “sort it out.”

Next, let’s try to sort 1 000 000 numbers where 100 000 elements are shuffled, and the rest are ordered:

# 10% of numbers are random
MILLION_SLIGHTLY_RANDOM_NUMBERS = [*range(900_000), *sample(range(1_000_000), 100_000)]
$ python -m timeit -s "from sorting import test_sort" "test_sort()"
50 loops, best of 5: 8.56 msec per loop

$ python -m timeit -s "from sorting import test_sorted" "test_sorted()"
5 loops, best of 5: 53.8 msec per loop

sorted gets slower as the input list gets more scrambled.

sort is a clear winner. But keep in mind that it modifies the list in place. If you want to preserve the order of the initial list, you should use sorted instead. And sorted can be used with any iterable, while sort only works with lists. If you want to sort a set, then sorted is your only solution.

Conclusions

list.sort() is faster than sorted, so if you can use it, it’s a better solution.

You might want to stick with sorted if:

  • You don’t want to modify the original list. sort performs sorting in-place, so you can’t use it here.
  • You need to sort something else than a list. sort is only defined on lists, so if you want to sort a set or any other collection of items, you have to use sorted instead.

If you want to learn more, the Sorting HOW TO guide from Python documentation contains a lot of useful information.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...