Programmatic Partial-Credit Put-In-Order Grading

Back in autumn 2009, we introduced online tests with several question types. Certain types, multiple choice for example, are simple for a computer program to grade: you tell it what the correct answer is, and if the student marks that answer, they get the credit. Other question types aren’t so simple, and put-in-order is one of them. Different instructors have different methods and rationales for assigning credit to a partially correct put-in-order question, so replicating how a human would grade one is no mean feat. Or so we’ve learned over the last five years.

Our first method was simple: we'd evaluate each item in turn, and if it sat directly before or after the item it was supposed to sit next to, it was counted as correct.

This did a pretty good job of approximating how an instructor might give partial credit, because it focused on the order of items rather than their placement. This method stayed in place until we discovered its principal flaw. Though it was highly unlikely, an incorrect response could receive 100% credit, provided each item was next to at least one correct neighbor.
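To make that flaw concrete, here is a rough Python sketch of neighbor-based grading. It is illustrative only, not Populi's actual code: the function name is made up, and it assumes every item appears exactly once in both the key and the response.

```python
def grade_by_neighbors(key, response):
    """Count items that sit next to at least one of their correct neighbors."""
    # Which item should come directly before / after each item, per the key.
    prev_in_key = {item: key[i - 1] for i, item in enumerate(key) if i > 0}
    next_in_key = {item: key[i + 1] for i, item in enumerate(key) if i < len(key) - 1}

    correct = 0
    for i, item in enumerate(response):
        before_ok = i > 0 and prev_in_key.get(item) == response[i - 1]
        after_ok = i < len(response) - 1 and next_in_key.get(item) == response[i + 1]
        if before_ok or after_ok:
            correct += 1
    return correct


key = [1, 2, 3, 4]
response = [3, 4, 1, 2]  # wrong overall, yet every item has one correct neighbor
print(grade_by_neighbors(key, response))  # 4, i.e. full credit
```

The two halves of the list are swapped, but because each item still touches one of its rightful neighbors, nothing is marked wrong.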

When an instructor brought this to our attention last winter (more than four years after we added online tests!), we quickly revised our methodology to focus on placement. This seemed like a simple, reasonable method: imagine a teaching assistant lining up an answer key next to the student’s response and marking incorrect any item that didn’t match the key.

In reality, though, this sometimes proved much harsher than an actual professor would be, especially if the question had a larger number of options.
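A minimal sketch of placement-based grading shows where the harshness comes from. Again, this is an illustration under assumptions (a hypothetical function name and a made-up six-item question), not the code Populi ran.

```python
def grade_by_placement(key, response):
    """Count items that sit in exactly the same position as in the key."""
    return sum(1 for expected, given in zip(key, response) if expected == given)


key = [1, 2, 3, 4, 5, 6]
response = [1, 2, 4, 5, 6, 3]  # only item 3 is out of place
print(grade_by_placement(key, response))  # 2
```

A human grader would probably say one item is misplaced here. Strict placement marks four items wrong, because moving a single item shifts everything that follows it.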

In fact, not long after the update an instructor showed us a rather harshly graded 25-item put-in-order question; Populi counted 13 options as correct when the instructor would have counted 24. In light of this, we sought out another approach. The best programmatic method for giving partial credit on put-in-order questions would need to take into account more than simple placement, so that it could better replicate how a human teacher would grade without being overly harsh or overly generous. After testing every method we could think of, here’s what we came up with.

The new method aims to give as much credit as is reasonable (as most instructors would) by focusing on what we’re calling chains—that is, two or more correctly-ordered items in a row. First, we locate the longest chain. Then, we use it to figure out whether or not other chains before it or after it are in order. Anything not in a chain is incorrect.

[Image: order_5]
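Finding the chains is the first step, and it can be sketched in a few lines of Python. This is an assumption-laden illustration rather than Populi's implementation: `find_chains` is a made-up name, items are assumed to be unique, and "correctly ordered items in a row" is read as "each item directly follows the previous one in the key".

```python
def find_chains(key, response):
    """Return runs of two or more items that appear consecutively, in key order."""
    position_in_key = {item: i for i, item in enumerate(key)}
    chains, run = [], []
    for item in response:
        # Extend the run while this item directly follows the previous one in the key.
        if run and position_in_key[item] == position_in_key[run[-1]] + 1:
            run.append(item)
        else:
            if len(run) >= 2:
                chains.append(run)
            run = [item]
    if len(run) >= 2:
        chains.append(run)
    return chains


key = [1, 2, 3, 4, 5, 6]
response = [3, 4, 5, 1, 2, 6]
print(find_chains(key, response))  # [[3, 4, 5], [1, 2]]
```

Here 6 belongs to no chain, so under this sketch it would be marked incorrect.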

This method worked well overall, but there were a couple of wrinkles to iron out. One was that the first and last items in the list are at a disadvantage when it comes to chaining: each has only one neighbor to chain with, and so is less likely to be counted as correct. We solved this by treating the top and bottom boundaries of the list as non-credit positions. In other words, if the first or last item is in the correct position, it always counts as being in a chain and receives credit.
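One way to picture the boundary fix is to pad both the key and the response with invisible start and end markers that can chain with their neighbors but earn no credit themselves. The sketch below assumes that reading; `START`, `END`, and `chained_items` are hypothetical names, and items are assumed to be unique.

```python
START, END = object(), object()  # non-credit boundary markers


def chained_items(key, response):
    """Return the response items that chain with a neighbor or a list boundary."""
    padded_key = [START, *key, END]
    padded_response = [START, *response, END]
    position_in_key = {item: i for i, item in enumerate(padded_key)}

    in_chain = set()
    for left, right in zip(padded_response, padded_response[1:]):
        # Two adjacent entries chain when the key also has them back to back.
        if position_in_key[right] == position_in_key[left] + 1:
            in_chain.update((left, right))
    return in_chain - {START, END}  # the boundaries themselves earn nothing


key = [1, 2, 3, 4, 5]
response = [1, 3, 4, 2, 5]
print(sorted(chained_items(key, response)))  # [1, 3, 4, 5]
```

Without the markers only 3 and 4 would form a chain. With them, 1 and 5 also get credit for being first and last where they belong.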

The other wrinkle: a response could have more than one longest chain. Depending on which chain you started with it was possible to come up with a different number of points. Here, starting with the first chain leads to a lower score:

[Image: order_7, grading that starts with eight-nine-ten]

We solved this by grading the question multiple times, as it were. We look at each chain, and then look at the position of each chain in the answer. We then see which chain to use as a starting point to grant the most credit.

  • Above, starting with eight-nine-ten would cause Populi to mark the other two chains wrong (because in the correct order, one-two-three and four-five-six don't come after eight-nine-ten). This would result in a score of 30% of the possible points.
  • Below, starting with one-two-three lets us then say that four-five-six is correct (because it comes after one-two-three). This lets us mark two chains correct, giving the student more credit for the question (60%), so it is preferable to the other option. Starting with four-five-six results in this same score.

[Image: order_8, grading that starts with one-two-three]

There's also a very, very remote possibility that the most credit would be awarded by starting with the second-longest chain. So we also try grading using every longest chain and every chain with one fewer item than the longest chain, just in case.
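Put together, the "grade it several times and keep the best result" idea might look something like the Python sketch below. Treat it as one plausible reading, not Populi's code: all function names are invented, the boundary rule is left out for brevity, and the way other chains are checked against the starting chain (walking outward and accepting a chain only if it falls on the correct side of everything accepted so far) is an assumption. The sample response matches the description above, though the exact position of seven is a guess.

```python
def find_chains(key, response):
    """Return chains in response order as (first, last) positions in the key."""
    pos = {item: i for i, item in enumerate(key)}
    chains, run = [], []
    for item in response:
        if run and pos[item] == run[-1] + 1:
            run.append(pos[item])
        else:
            if len(run) >= 2:
                chains.append((run[0], run[-1]))
            run = [pos[item]]
    if len(run) >= 2:
        chains.append((run[0], run[-1]))
    return chains


def score_from_anchor(chains, anchor):
    """Count items in chains that are in order relative to the starting chain."""
    low, high = chains[anchor]
    credit = high - low + 1
    # Chains later in the response must also come later in the key.
    for first, last in chains[anchor + 1:]:
        if first > high:
            credit += last - first + 1
            high = last
    # Chains earlier in the response must also come earlier in the key.
    for first, last in reversed(chains[:anchor]):
        if last < low:
            credit += last - first + 1
            low = first
    return credit


def best_chain_credit(key, response):
    """Try every longest and second-longest chain as the start; keep the best."""
    chains = find_chains(key, response)
    if not chains:
        return 0
    longest = max(last - first + 1 for first, last in chains)
    candidates = [
        i for i, (first, last) in enumerate(chains)
        if last - first + 1 >= longest - 1
    ]
    return max(score_from_anchor(chains, i) for i in candidates)


key = list(range(1, 11))
response = [8, 9, 10, 1, 2, 3, 7, 4, 5, 6]
chains = find_chains(key, response)
print(score_from_anchor(chains, 0))      # 3 (start with eight-nine-ten)
print(score_from_anchor(chains, 1))      # 6 (start with one-two-three)
print(best_chain_credit(key, response))  # 6, i.e. 60% of ten items
```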

We’re happy to announce this as the new (and hopefully final) method for assigning partial credit to put-in-order questions.

Now, partial credit is, after all, just an option on a feature. But the time we spent working it out and building it was worth it. Professors rely on Populi to save them time with the mundane things (like test-grading), but some of the mundane things are hard to nail. It's actually quite a challenge to replicate a teacher's intuition with rigid, literal software code.

As a bonus, we now show which items were marked incorrect in the test history view so students and teachers can see how the partial-credit grade was derived.

Editor's Note: As you suspected, yes, the title of this article can be sung to the tune of "Supercalifragilisticexpialidocious" from Mary Poppins.