Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

In the event someone is encountering suffix trees for the first time and thinking of using one: the amount of RAM required for suffix trees is obscene.


I totally fell for the "obscene memory" trap myself. My first encounter with suffix trees outside of a textbook was for an ITA Software 'Instant Search' puzzle. The requirement was sub-0.1ms search on a large string database, I went straight for a generalized suffix tree. Then I realized they had asked for the solution to fit within a 1GB heap. :(

I wrote up the full 'war story' of how I had to profile the heap and optimize the node representation (shaving bytes off the edge storage) just to get it to boot without an OOM error: https://www.abahgat.com/blog/the-programming-puzzle-that-got...

It’s the most tangible example I've run into of where theoretical O(n) space complexity meets the reality of object pointer overhead.


I've never grokkeed suffix trees, but isn't possible for them to be O(n) in space (n total length of all strings)? Is there just an unacceptable constant factor overhead? I can imagine the pointer overhead being painful.


Like the other poster said, the rabbit hole continues with suffix arrays (https://en.wikipedia.org/wiki/Suffix_array#Space_efficiency), then compressed suffix arrays (https://en.wikipedia.org/wiki/Compressed_suffix_array).

Also explained by the creator of this: https://www.abahgat.com/project/suffix-tree/

> the human genome can be encoded as a 3GB string constructed out of an alphabet of four characters

> As of 2019, a suffix tree indexing the human genome using state of the art algorithms can easily occupy tens of gigabytes.


Yes, we get the same speed from suffix arrays these days, and much, much less memory usage.

But good luck visualizing what those algorithms do :)




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: