Comments:"algorithm - Plain English explanation of Big O - Stack Overflow"
URL:http://stackoverflow.com/questions/487258/plain-english-explanation-of-big-o
Big-O notation (also called "asymptotic growth" notation) is what functions "look like" when you ignore constant factors and stuff near the origin. We use it to talk about how things scale.
Basics
for "sufficiently" large inputs...
f(x) ∈ O(upperbound)
meansf
"grows no faster than"upperbound
f(x) ∈ Ɵ(justlikethis)
meanf
"grows exactly like"justlikethis
f(x) ∈ Ω(lowerbound)
meansf
"grows no slower than"lowerbound
big-O notation doesn't care about constant factors: the function 9x² is said to "grow exactly like" 10x². Neither does big-O asymptotic notation care about non-asymptotic stuff ("stuff near the origin" or "what happens when the problem size is small"): the function 10x² is said to "grow exactly like" 10x² - x + 2.
Why would you want to ignore the smaller parts of the equation? Because they become completely dwarfed by the big parts of the equation as you consider larger and larger scales; their contribution becomes irrelevant. (See the Example section.)
Put another way, it's all about the ratio. If you divide the actual time it takes by the O(...) bound, you will get a constant factor in the limit of large inputs. Intuitively this makes sense: functions "scale like" one another if you can multiply one to get the other. That is, when we say...
    actualAlgorithmTime(N) ∈ O(bound(N))
    e.g. "time to mergesort N elements is O(N log(N))"
... this means that for "large enough" problem sizes N (if we ignore stuff near the origin), there exists some constant (e.g. 2.5, completely made up) such that:
    actualAlgorithmTime(N)                 e.g. "mergesort_duration(N)
    ────────────────────── < constant            ─────────────────────  < 2.5"
           bound(N)                                    N log(N)
There are many choices of constant; often the "best" choice is known as the "constant factor" of the algorithm... but we often ignore it like we ignore non-largest terms (see the Constant Factors section for why they don't usually matter). You can also think of the above equation as a bound, saying "In the worst-case scenario, the time it takes will never be worse than roughly N*log(N), within a factor of 2.5 (a constant factor we don't care much about)".
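If you want to see that constant emerge, here is a minimal sketch (my own illustration, not part of the original answer) that counts the comparisons an ordinary mergesort performs and divides by N*log2(N); the ratio settles near a constant as N grows:

    import math, random

    def mergesort_comparisons(a):
        """Return (sorted copy of a, number of comparisons performed)."""
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, c1 = mergesort_comparisons(a[:mid])
        right, c2 = mergesort_comparisons(a[mid:])
        merged, comps = [], c1 + c2
        i = j = 0
        while i < len(left) and j < len(right):
            comps += 1
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        merged.extend(left[i:]); merged.extend(right[j:])
        return merged, comps

    for n in (1000, 10000, 100000):
        _, comps = mergesort_comparisons([random.random() for _ in range(n)])
        print(n, comps / (n * math.log2(n)))   # roughly constant, just under 1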
In general, O(...) is the most useful one because we often care about worst-case behavior. If f(x) represents something "bad" like processor or memory usage, then "f(x) ∈ O(upperbound)" means "upperbound is the worst-case scenario of processor/memory usage".
Intuition
This lets us make statements like...
"For large enough inputsize=N, and a constant
factor of 1, if I double the input size...
... I double the time it takes." ( O(N) )
... I quadruple the time it takes." ( O(N²) )
... I add 1 to the time it takes." ( O(log(N)) )
... I don't change the time it takes." ( O(1) )
(with credit to http://stackoverflow.com/a/487292/711085 )
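You can check all four claims mechanically. A small sketch (my own illustration, using idealized step-count functions rather than real timings):

    import math

    def steps_linear(n):    return n               # O(N)
    def steps_quadratic(n): return n * n           # O(N²)
    def steps_log(n):       return math.log2(n)    # O(log(N)), constant factor 1
    def steps_const(n):     return 1               # O(1)

    n = 1_000_000
    print(steps_linear(2*n)    / steps_linear(n))     # 2.0 -> time doubles
    print(steps_quadratic(2*n) / steps_quadratic(n))  # 4.0 -> time quadruples
    print(steps_log(2*n)       - steps_log(n))        # 1.0 -> adds 1
    print(steps_const(2*n)     - steps_const(n))      # 0   -> unchanged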
Applications
As a purely mathematical construct, big-O notation is not limited to talking about processing time and memory. You can use it to discuss the asymptotics of anything where scaling is meaningful, such as:
- the number of possible handshakes among N people at a party (Ɵ(N²), specifically N(N-1)/2, but what matters is that it "scales like" N²)
- the probabilistic expected number of people who have seen some viral marketing as a function of time
- how website latency scales with the number of processing units in a CPU or GPU or computer cluster
- how heat output scales on CPU dies as a function of transistor count, voltage, etc.
Example
For the handshake example, #handshakes ∈ Ɵ(N²). The number of handshakes is exactly N-choose-2 or (N²-N)/2 (each of N people shakes the hands of N-1 other people, but this double-counts handshakes, so divide by 2). However, for very large numbers of people, the linear term N is dwarfed and effectively contributes 0 to the ratio. Therefore the scaling behavior is order N², or the number of handshakes "grows like N²".
    #handshakes(N)
    ────────────── ≈ 1/2
          N²
If you wanted to prove this to yourself, you could perform some simple algebra on the ratio to split it up into multiple terms (lim means "considered in the limit of"; you can ignore it if it makes you feel better):
        N²/2 - N/2         (N²)/2   N/2         1/2
    lim ────────── = lim ( ────── - ─── ) = lim ─── = 1/2
    N→∞     N²      N→∞     N²      N²      N→∞  1
                                   ┕━━━┙
                 this is 0 in the limit of N→∞:
                 graph it, or plug in a really large number for N
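And the same check numerically (a quick sketch, not from the original answer):

    def handshakes(n):
        return n * (n - 1) // 2    # exactly N-choose-2

    for n in (10, 1000, 100000, 10000000):
        print(n, handshakes(n) / n**2)   # 0.45, 0.4995, ... -> approaches 1/2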
Constant factors
Usually we don't care what the specific constant factors are, because they don't affect the way the function grows. For example, two algorithms may both take O(N) time to complete, but one may be twice as slow as the other. We usually don't care too much unless the factor is very large, since optimizing is tricky business ( When is optimisation premature? ); also, the mere act of picking an algorithm with a better big-O will often improve performance by orders of magnitude.
Some asymptotically superior algorithms (e.g. a non-comparison O(N log(log(N))) sort) can have so large a constant factor (e.g. 100000*N log(log(N))), or overhead that is relatively large like O(N log(log(N))) with a hidden + 100*N, that they are rarely worth using even on "big data".
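Using the answer's own made-up constants, you can probe where such an algorithm would actually start beating a plain O(N log(N)) sort (a back-of-the-envelope sketch, my own illustration):

    import math

    # Per-element cost at N = 2**k: a plain O(N log(N)) sort does ~k work
    # per element, while the hypothetical 100000*N*log(log(N)) sort does
    # ~100000*log2(k). The 100000 constant is the made-up one from above.
    for k in (32, 1000, 100000, 2200000):
        plain, fancy = k, 100000 * math.log2(k)
        print(f"N = 2**{k}: plain ~{plain:,.0f}, fancy ~{fancy:,.0f} per element")
    # The fancy sort only pulls ahead around N = 2**2200000 --
    # astronomically larger than any real dataset.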
Why O(N) is sometimes the best you can do, i.e. why we need data structures
O(N) algorithms are in some sense the "best" algorithms if you need to read all your data. The very act of reading a bunch of data is an O(N) operation. Loading it into memory is usually O(N) (or faster if you have hardware support, or no time at all if you've already read the data). However, if you touch or even look at every piece of data (or even every other piece of data), your algorithm will take O(N) time to perform this looking. No matter how long your actual algorithm takes, it will be at least O(N) because it spent that time looking at all the data.
The same can be said for the very act of writing. For example, all algorithms which print out all permutations of a number N are O(N!) because the output is at least that long.
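A quick sanity check of that claim (my own sketch): even just counting the permutations, let alone printing them, takes N! steps.

    import itertools, math

    for n in range(1, 9):
        count = sum(1 for _ in itertools.permutations(range(n)))
        print(n, count, count == math.factorial(n))   # 1, 2, 6, 24, 120, ...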
This motivates the use of data structures: a data structure requires reading the data only once (usually O(N) time), plus some arbitrary amount of preprocessing (e.g. O(N) or O(N log(N)) or O(N²)) which we try to keep small. Thereafter, modifying the data structure (insertions/deletions/etc.) and making queries on the data take very little time, such as O(1) or O(log(N)). You then proceed to make a large number of queries! In general, the more work you're willing to do ahead of time, the less work you'll have to do later on.
For example, say you had the latitude and longitude coordinates of millions of road segments, and wanted to find all street intersections.
- Naive method: If you had the coordinates of a street intersection, and wanted to examine nearby streets, you would have to go through the millions of segments each time, and check each one for adjacency.
- If you only needed to do this once, it would not be a problem to have to do the naive method of O(N) work only once, but if you want to do it many times (in this case, N times, once for each segment), we'd have to do O(N²) work, or 1000000² = 1000000000000 operations. Not good (a modern computer can perform about a billion operations per second).
- If we use a simple structure called a hash table (an instant-speed lookup table, also known as a hashmap or dictionary), we pay a small cost by preprocessing everything in O(N) time. Thereafter, it only takes constant time on average to look up something by its key (in this case, our key is the latitude and longitude coordinates, rounded into a grid; we search the adjacent gridspaces, of which there are only 9, which is a constant). See the sketch after this list.
- Our task went from an infeasible O(N²) to a manageable O(N), and all we had to do was pay a minor cost to make a hash table.
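Here is a minimal sketch of that grid-bucketed hash table, assuming each segment is just a pair of (lat, lon) endpoints; all names and the grid resolution are made up for illustration:

    from collections import defaultdict

    CELL = 0.001   # grid resolution in degrees (a made-up tuning constant)

    def cell_of(lat, lon):
        return (round(lat / CELL), round(lon / CELL))

    def build_grid(segments):
        """O(N) preprocessing: bucket each segment by its endpoints' grid cells."""
        grid = defaultdict(list)
        for seg in segments:               # seg = ((lat1, lon1), (lat2, lon2))
            for (lat, lon) in seg:
                grid[cell_of(lat, lon)].append(seg)
        return grid

    def nearby_segments(grid, lat, lon):
        """O(1) average per query: inspect only the 9 surrounding cells."""
        ci, cj = cell_of(lat, lon)
        found = []
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                found.extend(grid.get((ci + di, cj + dj), []))
        return found   # may contain duplicates; dedupe if it matters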
The moral of the story: a data structure lets us speed up operations. Even more advanced data structures can let you combine, delay, or even ignore operations in incredibly clever ways, like leaving the equivalent of "to-do" notes at junctions in a tree.
Amortized / average-case complexity
There is also the concept of "amortized" or "average case". This is no more than using big-O notation for the expected value of a function, rather than the function itself. For example, some data structures may have a worst-case complexity of O(N) for a single operation, but guarantee that if you do many of these operations, the average-case complexity will be O(1).
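The classic example is a dynamic array that doubles its capacity when full: a single append occasionally costs O(N) (the copy into the bigger array), but the total work over N appends is O(N), i.e. O(1) amortized per append. A small sketch (my own illustration):

    class DynArray:
        def __init__(self):
            self.capacity, self.size, self.copies = 1, 0, 0
            self.data = [None]

        def append(self, x):
            if self.size == self.capacity:       # full: resize, O(N) copy
                self.capacity *= 2
                new = [None] * self.capacity
                for i in range(self.size):
                    new[i] = self.data[i]
                    self.copies += 1
                self.data = new
            self.data[self.size] = x
            self.size += 1

    a = DynArray()
    for i in range(100000):
        a.append(i)
    print(a.copies / a.size)   # total copy work per element stays below 2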
Multidimensional big-O
Most of the time, people don't realize that there's more than one variable at work. For example, in a string-search algorithm, your algorithm may take time O([length of text] + [length of query]), i.e. it is linear in two variables like O(N+M). Other more naive algorithms may be O([length of text]*[length of query]) or O(N*M). Ignoring multiple variables is one of the most common oversights I see in algorithm analysis, and can handicap you when designing an algorithm.
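For instance, here is the naive O(N*M) search (a sketch of the textbook method, not any particular library's implementation):

    def naive_find(text, query):
        """O(N*M) worst case: at each of ~N positions, compare up to M chars."""
        n, m = len(text), len(query)
        for i in range(n - m + 1):
            if text[i:i + m] == query:
                return i
        return -1

    # Both variables matter -- this input is slow in N *and* M:
    print(naive_find("a" * 10000 + "b", "a" * 100 + "b"))
    # Algorithms like Knuth-Morris-Pratt achieve O(N + M) instead.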
The whole story
Keep in mind that big-O is not the whole story. You can drastically speed up some algorithms by using caching, making them cache-oblivious, avoiding bottlenecks by working with RAM instead of disk, using parallelization, or doing work ahead of time -- these techniques are often independent of the order-of-growth "big-O" notation, though you will often see the number of cores in the big-O notation of parallel algorithms.
Also keep in mind that due to hidden constraints of your program, you might not really care about asymptotic behavior. You may be working with a bounded number of values, for example:
- If you're sorting something like 5 elements, you don't want to use the speedy O(N log(N)) quicksort; you want to use insertion sort, which happens to perform well on small inputs. These situations often come up in divide-and-conquer algorithms, where you split up the problem into smaller and smaller subproblems, such as recursive sorting, fast Fourier transforms, or matrix multiplication (see the sketch after this list).
- If some values are effectively bounded due to some hidden fact (e.g. the average human name is softly bounded at perhaps 40 letters, and human age is softly bounded at around 150), terms that depend on them are effectively constant. You can also impose bounds on your input to make terms constant.
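A minimal sketch of that small-input cutoff, as real sorting libraries do; the CUTOFF value here is made up, and real libraries tune it empirically:

    import random

    CUTOFF = 16   # made-up threshold

    def insertion_sort(a, lo, hi):
        """O(N²) in general, but very fast on tiny ranges."""
        for i in range(lo + 1, hi + 1):
            x, j = a[i], i - 1
            while j >= lo and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x

    def hybrid_quicksort(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if hi - lo < CUTOFF:                 # small subproblem: switch algorithms
            insertion_sort(a, lo, hi)
            return
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:                        # Hoare-style partition
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1; j -= 1
        hybrid_quicksort(a, lo, j)
        hybrid_quicksort(a, i, hi)

    data = [random.randint(0, 999) for _ in range(1000)]
    hybrid_quicksort(data)
    assert data == sorted(data)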
In practice, even among algorithms which have the same or similar asymptotic performance, their relative merit may actually be driven by other things, such as: other performance factors (quicksort and mergesort are both O(N log(N)), but quicksort takes advantage of CPU caches); non-performance considerations, like ease of implementation; whether a library is available, and how reputable and maintained the library is.
Many things can implicitly contribute to the running time's constant factor, such as whether you run your algorithm on a 500MHz computer vs a 2GHz computer, whether your programming language is interpreted or using a JIT compiler, whether you are doing a constant amount of extra work in a critical section of code, etc. The effect may be small (e.g. 0.9x speed) or large (e.g. 0.01x speed) compared to a different implementation and/or environment. Do you switch languages to eke out that little extra constant factor of work? That literally depends on a hundred other reasons (necessity, skills, coworkers, programmer productivity, the monetary value of your time, familiarity, workarounds, why not assembly or GPU, etc...), which may be more important than performance.
The above issues, like the choice of programming language, are almost never considered as part of the constant factor (nor should they be); yet one should be aware of them, because sometimes (though rarely) they may not be constant. For example, in CPython, the native priority queue implementation is asymptotically non-optimal (O(log(N)) rather than O(1) for your choice of insertion or find-min); do you use another implementation? Probably not, since the C implementation is probably faster, and there are probably other similar issues elsewhere. There are tradeoffs; sometimes they matter and sometimes they don't.
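For reference, CPython's built-in heapq module is a binary heap: heappush and heappop each take O(log N), while peeking at the minimum via heap[0] is O(1).

    import heapq

    h = []
    for x in (5, 1, 4, 2, 3):
        heapq.heappush(h, x)    # O(log N) per insertion
    print(h[0])                 # O(1) find-min -> 1
    print(heapq.heappop(h))     # O(log N) delete-min -> 1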
Math addenda
For completeness, the precise definition of big-O notation is as follows: f(x) ∈ O(g(x)) means that "f is asymptotically upper-bounded by const*g": ignoring everything below some finite value of x, there exists a constant such that |f(x)| ≤ const * |g(x)|. (The other symbols are as follows: just like O means ≤, Ω means ≥. There are lowercase variants: o means <, and ω means >.) f(x) ∈ Ɵ(g(x)) means both f(x) ∈ O(g(x)) and f(x) ∈ Ω(g(x)) (upper- and lower-bounded by g): there exist some constants such that f will always lie in the "band" between const1*g(x) and const2*g(x). It is the strongest asymptotic statement you can make and roughly equivalent to ==. (Sorry, I elected to delay the mention of the absolute-value symbols until now, for clarity's sake; especially because I have never seen negative values come up in a computer science context.)
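Restating the same definitions compactly (my transcription of the prose above, in LaTeX):

    f(x) \in O(g(x))      \iff \exists\, c > 0,\ x_0 \ \text{s.t.}\ |f(x)| \le c\,|g(x)| \ \text{for all } x > x_0
    f(x) \in \Omega(g(x)) \iff \exists\, c > 0,\ x_0 \ \text{s.t.}\ |f(x)| \ge c\,|g(x)| \ \text{for all } x > x_0
    f(x) \in \Theta(g(x)) \iff f(x) \in O(g(x)) \ \text{and}\ f(x) \in \Omega(g(x))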
People will often use = O(...). It is technically more correct to use ∈ O(...). ∈ means "is an element of". O(N²) is actually an equivalence class, that is, it is a set of things which we consider to be the same. In this particular case, O(N²) contains elements like {2N², 3N², 1/2 N², 2N² + log(N), -N² + N^1.9, ...} and is infinitely large, but it's still a set. People will know what you mean if you use =, however. Additionally, it is often the case that in a casual setting, people will say O(...) when they mean Ɵ(...); this is technically true, since the set of things Ɵ(exactlyThis) is a subset of O(noGreaterThanThis)... and it's easier to type. ;-)