Sorting text

An article, posted 12 days ago filed in sorting, ruby, programming, database, order, databases & sql.

There are a few hard problems in computing. Correctly handling time, naming, preventing off by one errors… sorting text may not be one of them but recently we ran into a discussion where I couldn’t make up my mind anymore. Hence, this post’s topic: sorting text.

Sorting text

The problem

How do you sort the following words:

If you’d ask ruby I’d get:

 %w[cheese Ape Drums dent Beer].sort

Results in:

  1. Ape
  2. Beer
  3. Drums
  4. cheese
  5. dent

Which in my useless and ramshackle programmer’s brain translates to, well why not, it is sorted right?

But then we moved the data into a database which was correctly set up with a proper locale for ‘collation’, a term that I’ve seen but never meant anything to me until this problem. Collation is:

the assembly of written information into a standard order.

(thanks Wikipedia - Collation)

Databases often have a collation property. E.g. my local database has ‘C’ as a collation property. Which means it reproduces the sorting that we found ruby to be returning. C is a computer language that much software is written in. Even the ruby interpreter. It prioritises capital letters over smaller letters because of the ordering of letters in the ASCII table. Capital letters in this table come before the lower case letters.

Code Letter
A 101
B 102
C 103
a 141
b 142
c 143

To offer better sorting, databases have collation rules that are, e.g. en_US or nl_NL. It can be a property of the database, or column, or passed explicitly in a query:

SELECT name from names WHERE ORDER BY street COLLATE "en_US";

would return:

  1. Ape
  2. Beer
  3. cheese
  4. dent
  5. Drums

While

SELECT name from names WHERE ORDER BY street COLLATE "C";

returns

  1. Ape
  2. Beer
  3. Drums
  4. cheese
  5. dent

So now we know why our tester flagged the inconsistency in sorting. He was preparing expected output in ruby, and the results, that were sorted in the database, did not align.

Case doesn’t matter?

Look at the screenshot (index of a Dutch cookbook):

  1. gemengd groen met geitenkaas
  2. Genuese pansotti
  3. gepocheerde eieren…

A nice bit advice I read in this appendix on Rules for Alphabetical Filing in The Gregg Reference Manual (Sabin, W.A.):

the differences between capital and lowercase letters should be ignored

Wikipedia on Alphabetical order:

Capital or upper case letters are generally considered to be identical to their corresponding lower case letters for the purposes of alphabetical ordering, although conventions may be adopted to handle situations where two strings differ only in capitalization.

If you spend a bit more time on these documents, you’ll find that proper sorting is a bit harder, but probably even too hard to do with a simple list of strings.

Just lowercase then?

It is better not to use simply lowercase, because the collation rules may have specific rules on priority of Abc vs ABC and when sorting in database perhaps it has rules that reverse the sort to revert Abc -> ABC to ABC <- Abc. I don’t know to be honest, I tend to prefer to go along with the most advanced option that is offered me at little to no additional cost (just setting the database config right).

But as said, the sorting with collation rules set for a human language are basically like sorting on lower case. Because of this it is fair to simulate the collation rules from the database in ruby by indeed sorting using lowercase (or downcase) letters:

%w[cheese Ape Drums dent Beer].sort{|a,b|a.downcase<=>b.downcase}

Happy sorting.

Op de hoogte blijven?

Maandelijks maak ik een selectie artikelen en zorg ik voor wat extra context bij de meer technische stukken. Schrijf je hieronder in:

Mailfrequentie = 1x per maand. Je privacy wordt serieus genomen: de mailinglijst bestaat alleen op onze servers.