C++ Program for Kemeny-Young Preference Aggregation

William H. Press

University of Texas at Austin

August 6, 2012

This note, along with linked code and executable files, provides an efficient implementation for calculating the Kemeny-Young ordering of a set of "candidates" who have been partially ranked by a set of "voters". Not all voters need rank all candidates.

Contents

What is Kemeny-Young?

Heuristic Algorithm

Program Usage

Small Scale Example

Larger Scale Example

Elections with Many Voters, Few Candidates

Program Source Code and Compiled Binaries

References

What is Kemeny Young?

The Kemeny-Young method^{1-7}
is a way to combine the ordered preference lists of multiple individuals
into a single overall preference order. It can thus be viewed as a voting scheme
that determines not just a single chosen winner, but an entire ordered list. It is
often used in elections whose outcome is the selection of a slate of winners, for
example for *N* vacant positions on a council when there are *M>N* candidates.
After applying Kemeny-Young, the declared winners are its top *N* ranked candidates
out of the full number *M* (whose ranks are all calculated).

Kemeny-Young is a Condorcet method, meaning that its first-ranked candidate would win against all other candidates in individual pairwise contests, whenever there is a candidate with this property. Similarly, Kemeny-Young's second-ranked candidate would win against all other candidates (except the first ranked); and so on.

Mathematically, a Kemeny-Young ordering is defined as one that minimizes the total number of discrepancies among all the voters in their pairwise preferences between all candidates. That is, if the Kemeny-Young ordering ranks candidate X ahead of candidate Y, then any individual voter who ranked Y ahead of X contributes +1 to the total discrepancy. A voter who chose to rank only X, or only Y, or neither, does not contribute to the discrepancy. The discrepancy is summed over all distinct pairs of candidates X and Y.

The Kemeny-Young ordering thus satisfies, loosely speaking, the most
possible voters as regards their stated preferences. Note that it
does not use any information about *how much higher* a voter
ranks X as compared to Y, but only the fact of the relative
ordering. In this regard it is similar to the score in
Kendall's
tau rank correlation coefficient.

More than one ordering can be tied for the smallest number of discrepancies, so the
Kemeny-Young ordering is not necessarily unique. Furthermore, it is known that
the problem of finding even a single minimum discrepancy ordering is
NP hard.^{8,9} So,
in real life, any program for finding the K-Y ordering can be only heuristic.
However fast heuristics, good enough for practical voting applications, do exist.
We implement one such here.

Heuristic Algorithm

It is known that ordering all the candidates by their mean rank (across all voters) provides, in most cases, a good starting guess. Indeed, under some asymptotic assumptions, mean rank is a good approximation to K-Y. So, we always start with this guess.

As an outer loop, we do *Ntry* independent greedy minimizations of the K-Y
score and record the resulting ranks for each candidate. Since each minimization
is partly stochastic (as described below), the results can be different on each
try. We report for each candidate the smallest (best) rank seen, the largest (worst),
the mean rank over the tries, and the standard deviation of ranks seen. When
the number of voters is large it is often the case that only one single ordering is
found repeatedly -- putatively (though not easily provably) the unique K-Y answer.
When multiple orderings are found, all tied for best score, it is up to the user
to decide how to use the output. We recommend increasing *Ntry* until
the order of the mean ranks is stable and defining this order as the outcome.
(See example below.)

A single greedy minimization proceeds by indefinitely repeating these
steps: Draw a random permutation of the candidates. Then, in the order of this
permutation, remove and re-insert each candidate in the order position that minimizes the
K-Y score. If there is a tie between such positions, choose one randomly.
After all candidates have been so moved, then, starting at the top of the list
and moving down, consider in turn each window of size *Mperm* candidates. Trying
all permutations of the candidates within the window, choose the permutation
that minimizes the K-Y score. If all the steps indicated fail to decrease the
K-Y score at all, then exit the indefinite greedy minimization loop.

Typical values for *Ntry* are between 4 and 200 (depending on
your computer speed). Bigger is always better. Computation time is of
course linear in *Ntry*. A typical value for *Mperm* is 7.
Computational time increases exponentially as *Mperm* is
increased, so bigger, while better, is not very practical.

Program Usage

As supplied, the program takes one command-line argument, an integer value *Ntry*.
The program reads the input data from standard input and expects one line per voter.
That line is a series of strings separated by space or tab, the first of which is an
identifier for the voter. The voter identifiers are not used by the program, but are
only to facilitate human preparation of the input file; they may be randomly
assigned, or be all identical. Each input line is taken to be a different voter,
regardless of its voter identifier string.

The second and subsequent strings on each line, space or tab separated, are the names or unique identifiers of candidates in the voter's preference order, from most preferred to least preferred. A voter may thus rank any number of candidates. However, a voter who ranks zero or one candidate will not contribute to the outcome.

Small Scale Example

Ten professors (Brown, Davis, Johnson, Jones, Miller, Moore, Smith, Taylor, Williams, and Wilson) each rank a subset of 13 students (Anderson, Banks, Carter, Garcia, Horton, Kelly, Lee, Meyer, O'Brien, Price, Rice, Schmidt, and Wang) for a departmental prize. The input file, votes.txt, looks like this:

Smith: Anderson Garcia Lee
Johnson: Anderson Kelly Rice Garcia
Williams: Horton Schmidt Wang
Jones: Horton Banks Schmidt Rice
Brown: Anderson Garcia
Davis: O'Brien Wang Price
Miller: Kelly Banks Wang Lee Price
Wilson: Carter Price
Moore: Carter Meyer
Taylor: Horton Meyer

So, for example, Professor Miller's favorite student is Kelly; her least favorite (of those whom she ranked) is Price. Notice that the colons are interpreted by the program as merely part of each voter identifier string. However, they add readability to the file.

The program command line, with *Ntry*=200, is:

kemenyyoung 200 < votes.txt > results.txt

Note the use of redirection (< and >) to specify the input and output files. You can omit the output redirection (> results.txt), in which case the output will print to the screen. On a desktop machine, the program takes a few seconds to run, producing this output to the screen (stderr):

*** There are 13 candidates and 10 voters. ***
*** Please be patient, while I crunch...
*** Score: 35 preferences out of 35 are honored.
*** Now sending results to output file.

The file results.txt contains:

candidate best worst average in 190 tied orderings
Anderson 1 4 1.432 +- 0.601
Horton 1 5 2.153 +- 1.002
Kelly 2 5 2.847 +- 0.690
Banks 4 6 4.563 +- 0.706
O'Brien 1 9 5.132 +- 1.376
Carter 1 11 5.363 +- 1.798
Schmidt 5 8 6.837 +- 0.688
Rice 6 10 8.742 +- 1.062
Wang 8 11 9.163 +- 0.935
Meyer 7 13 9.868 +- 1.849
Garcia 8 11 10.337 +- 0.913
Lee 10 12 11.716 +- 0.536
Price 12 13 12.847 +- 0.360

For these data, we see that there are many possible K-Y orderings that
satisfy all 35 of the pairwise preferences expressed by the voters
(faculty). Indeed, four students are ranked first in at least one
such ordering. However, a consistent ordering of the average ranks of
each student within such orderings does emerge. (Not shown here, increasing
*Ntry* to 2000 produces an identical order by average ranks.)
We also see above that in 10 of 200 tries greedy minimization terminated
on a K-Y score that was not equal to the best found. These 10 failed tries did
not contribute to the results.

Larger Scale Example

Here is an example where 330 papers submitted to a meeting (the candidates) were each evaluated by reviewers chosen from a pool of 141 experts (the voters). Each reviewer was assigned papers in his/her area of expertise, and each paper was assigned to at least 4 reviewers. Some reviewers had broad expertise and were assigned a sample of papers across multiple specialties: This is important so that the comparative preferences don't partition into disjoint sets in unrelated specialties.

The input file has 141 lines like this:

1557: 5866 6013 5710 5860 5902 6028
1656: 5777 5652 5737 5660 5664 5959 5637 5905 5788 5924 5473 5843
1682: 5538 5449 6084 6010 5462
2217: 5915 5988 5757 5850 5978 5994

Here 1656, 1682, and 2217 identify reviewers, and the other numbers identify papers. Reviewer 1682 liked paper 5538 the best, 5462 the worst, out of the 5 that he was asked to review.

With *Ntry*=10, the program runs for about a minute, producing this output:

*** There are 330 candidates and 141 voters. ***
*** Please be patient, while I crunch...
*** Score: 4359 preferences out of 7393 are honored.
*** Now sending results to output file.

Like many real-life elections, there is significant disagreement among the voters, so that only 4359/7393 = 0.590 of pairwise voter preferences are satisfied by the best ordering. The output file looks like this:

candidate best worst average in 1 tied orderings
5800 1 1 1.000 +- 0.000
5747 2 2 2.000 +- 0.000
5665 3 3 3.000 +- 0.000
5564 4 4 4.000 +- 0.000
... (lines omitted) ...
5636 327 327 327.000 +- 0.000
6080 328 328 328.000 +- 0.000
5836 329 329 329.000 +- 0.000
6047 330 330 330.000 +- 0.000

Since the best score was found only once, we here get no information about
the uncertainty of individual scores. As an alternative, you can recompile
the program with the switch ALLSCORES set; this causes all greedy iterations,
not just the best one, to contribute to the result. We don't recommend
this, because the purpose of most elections is to reach a decisive outcome.
Choosing a value for *Ntry* in advance and utilizing only the
best scoring ordering seen seems fairer and more decisive. In the above
example, running with *Ntry*=100, which takes about 10 minutes
on a desktop machine, finds a best ordering that honors 4385/7393 = 0.593
of voter preferences, slightly larger than before. This score is also
found only once, indicating that it is still likely not the best possible.

Elections with Many Voters, Few Candidates

When the number of candidates is small, and the number of voters is large,
as for a city council election with 35 candidates competing for
6 vacancies, and with 3000 voters, the program's heuristics should produce
a unique best ordering, without ambiguity. After an initial linear-time
digestion of the input file, the program should find that K-Y ordering
rapidly. This is because it is exhaustive within every window of
*Mperm* candidates, because there are not many such windows, and
because the large number of voters makes ties in the greedy algorithm's
stochastic steps unlikely.

Program Source Code and Compiled Binaries

The file kemenyyoung.zip contains three files: kemenyyoung.cpp is the single C++ source file, kemenyyoung.exe is a compiled binary for Windows XP and 7, and kemenyyoung is a compiled binary for Linux (g++ in Linux 2.6 on Ubuntu 10.04). The first line of the source file is Windows-specific and should be deleted for Linux compilation.

The source file references a few Numerical Recipes routines, available at the Numerical Recipes web site, but is otherwise conforming standard C++. A typical compilation on a Linux (after setting the environment variable CPLUS_INCLUDE_PATH to point to a directory with the Numerical Recipes routines) is

g++ -O2 kemenyyoung.cpp -o kemenyyoung

References

1. John G. Kemeny, "Mathematics without numbers", Daedalus 88 (1959), pp. 577-591

2. John G. Kemeny and James L. Snell, Mathematical Models in the Social Sciences, Ginn, Boston, 1960. 3. H. P. Young and A. Levenglick, "A Consistent Extension of Condorcet's Election Principle", SIAM Journal on Applied Mathematics 35, no. 2 (1978), pp. 285-300

4. H. P. Young, "Condorcet's Theory of Voting", American Political Science Review 82, no. 2 (1988), pp. 1231-1244

5. H. P. Young, "Optimal ranking and choice from pairwise comparisons", in Information pooling and group decision making edited by B. Grofman and G. Owen (1986), JAI Press, pp. 113-122

6. H. P. Young, "Optimal Voting Rules", Journal of Economic Perspectives 9, no.1 (1995), pp. 51-64

7. H. P. Young, "Group choice and individual judgements", Chapter 9 of Perspectives on public choice: a handbook, edited by Dennis Mueller (1997) Cambridge UP., pp.181-200

8. John Bartholdi, III, Craig Tovey, and Michael Trick. "Voting schemes for which it can be difficult to tell who won the election", Social Choice and Welfare, 6, 157-165 (1989).

9. Vincent Conitzer, Andrew Davenport, and Jayant Kalagnanam, "Improved Bounds for Computing Kemeny Rankings" (2006)