More explicit score

I’d like to see a more explicit scoring mechanism. Right now I’ve completed 25 tasks, and I only have a vague idea of what the scoring system is. It seems to be correlated with the volume of neuron that you’ve added, but sometimes not. Just something that says, “You got X points for A, Y points for B, for a total of P points” would make me feel that the point system isn’t random – that it has a point, if you will :)


Also, in keeping with Jane McGonigal’s gamification principle that immediate feedback is better than delayed feedback (cf. Reality Is Broken), perhaps the score could be displayed in real time, so that you can see how it changes in response to what you do, the moment you do it.

(Disclaimer: I’m not a member of the team, just a guy with some thoughts on the matter!)


I have a feeling that score may tie in to how well your selection maps to those of others on the same task, though I’m not sure about that. If that’s the case, the way to get the highest possible score on a task is to pick a selection that overlaps as much as possible with what’s already been picked.

Immediate feedback works great for tasks where there is a theoretically optimal state that you can check and are trying to approach; for example, an electronics game where you want to build a circuit with fewer components, or a financial game where you want to spend the minimum possible to gain the maximum profit. For EyeWire, the “best answer” is an unknown; there’s no objective way to see if person A has a better answer than person B, so the model is built by comparing each answer and finding the most matches.

For this case, immediate feedback would seem to be counter-productive. The only thing you can compare against is “what previous players have answered”, but (especially for the first few people taking the task) you want them all to make independent judgements in case they’re picking up on a link that’s been missed by someone else. An immediate score would nudge everyone towards counting paths that have already been chosen by someone else as “right” and paths that people didn’t choose as “wrong”, even if they intuitively believe that the non-scoring selection is the right one, and the effect would be self-amplifying.

Please don’t get me wrong, I’m not against the idea of immediate feedback in itself; it can definitely keep people hooked, and that’s a good thing. I’m just not sure how the team could implement it in a way that wouldn’t undermine the project’s aims. And who knows, maybe I’ve made some completely wrong assumptions above :slight_smile:

I’d second aawood’s point about the score. In other posts here, it has been pointed out that the actual answer is not known a priori, so there is no real way to calculate a score other than comparing a cohort of possible answers and looking for a consensus.


The idea is to see what a group of people come up with and go from there.


I think you’re right – the goal is to get an independent judgement from the player, and if immediate scoring were provided on a task that others have done before, then you could lazily get points by just clicking until you maximize your score, which doesn’t help anyone.


One thing that might be useful is a running tally of newly found volume while you’re doing your task. This is not a score, but it does help you see how much “better” you think you are than the algorithm that initially colored the branch in. Obviously you can click on everything and get an enormous new volume, but since that doesn’t have anything to do with scoring, there’s no point in the end. It’s really just an immediate pseudo-reward.

Now, in terms of actual scoring, the first person doing a particular task gets no points, because there’s nothing to compare against. The next person doing that task gets points after completing the task based on how much agreement there is with the first person. And so on for N people. However, there is nothing wrong with awarding points after the fact. If the first person’s work ends up agreeing with four other people, that’s a good indicator that the first person did something right, and there should be an appropriate reward. Even if the feedback is late, late feedback is better than no feedback. So a message could pop up saying, “Congratulations! You’ve been awarded X points for previous tasks that agree with others!” Perhaps a simple formula such as the maximum of zero and the volume that agrees minus the volume that doesn’t match.
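A minimal sketch of that retroactive award in code, assuming we can count the agreeing and disagreeing volumes per task once it closes (the function and variable names are just placeholders):

```python
def retroactive_points(agreed_volume, disagreed_volume):
    """Points awarded once later players have done the same task.

    The award is the volume that agrees minus the volume that doesn't,
    floored at zero so a poor match never costs the player points.
    """
    return max(0, agreed_volume - disagreed_volume)

# Example: 480 voxels matched the eventual consensus, 60 did not.
print(retroactive_points(480, 60))   # 420 points
print(retroactive_points(100, 250))  # 0 points (never negative)
```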

The point is that the scoring cannot be arbitrary, or hidden. You should always have a clear idea of why you got points, and why you got that number of points; otherwise the point system will seem arbitrary, and thus meaningless.


Actually, you make a very good point about messages regarding previous tasks and points. I’m not sure if they correct scores after they have a sample of answers, but it would be a good way to go.


The most annoying ones, which I don’t know if you saw, were the cell bodies that crept in: 30-odd minutes of correction for either zero score or hundreds of points, and the machine grinds to a halt if you don’t switch off the 3D view :slight_smile:

Although, thinking further along these lines, a simple score could be augmented (after a cell has been completed) by an accuracy rating for each player that is normalised against the amount of data processed. This would reveal accuracy to some degree by factoring out the sheer ‘volume’ of results returned.

Yes, normalization is a good idea, so that you don’t get lots of points just because you happen to have been dealt a large volume. It should also account for zero new volume in cases where the neural network got it right and there’s nothing to do.
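As a rough sketch, a normalised accuracy of that sort could be as simple as the fraction of a player’s submitted volume that survives into the consensus (names are illustrative only, and the zero-work case is handled as suggested above):

```python
def accuracy_rating(agreed_volume, submitted_volume):
    """Accuracy normalised against the amount of data a player processed.

    Returns the fraction of the submitted volume that made it into the
    consensus, so being dealt a large task doesn't inflate the rating.
    """
    if submitted_volume == 0:
        return 1.0  # the AI was already right and there was nothing to do
    return agreed_volume / submitted_volume
```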

Great discussion of scoring.  You guys are now officially exactly where we are.  We want the score to be fun and engaging, which means relatively instant feedback.  But we don’t want it to negatively affect the results that we get from people, which means that it can’t be per-click.  We don’t know the right answer a priori, so we’ve got to compare different people’s results with each other.  We don’t want to give people 0 points for tasks, but we don’t want to give people points for doing nothing and/or doing the wrong thing.  We know better how a person did after a number of other people have done the task, but we don’t want people to wait for their scores.  We also don’t want too much overlap for each task, because once we’ve converged on the truth for a particular piece, anyone else working on it is just wasted effort.


So here are some of the thoughts that we’ve had in terms of how to make the scoring more fun, and more transparent.  Please let us know what you think about these ideas, and throw out more if you’ve got them:
  1. Retroactive Scoring - Once a task is completed and closed, we hand out points to everyone
  2. Retroactive Bonuses - Sort of a hybrid.  Completion points for doing a task and bonus points for agreement later
  3. Team Based Scoring - Teams work together to reconstruct neurons on their own.  Teams that do more neurons, or do them more accurately, get more points
  4. Two Types of Task - Sometimes you will be asked to segment neurons, sometimes you will be asked to synthesize the results of other people’s segmentations.  These points would be handed out retroactively too.  Huge points if you found a branch that everyone else missed.
Scoring is one area where we definitely want to improve.  We’d like to be transparent with how we do things, but right now, we’re still looking for the right way to do it.  Feedback is definitely appreciated for these things.  Thanks for starting the discussion!
  1. Retroactive scoring - I do think that this is worth trying. Sure, it might put off people who are only here to get ‘points’ by any means, but the more serious players will understand the objective. This would function rather like the election nights we have in the UK & US; it would create a sense of anticipation.

  2. Hybrid scoring - This could be attempted by correcting scores after the close of a task. Advantage - it gives people a score to work with; disadvantage - be prepared for very petty people questioning where their points have gone if there is a large discrepancy before and after :slight_smile:

  3 & 4. Team scoring & segmentation - This is similar to an idea I put in another thread about competitions in which people or teams would construct a neural trace that would be compared against an AI attempt & the consensus. I would not call it a scoring scheme, rather an alternative way of playing the game. I definitely think that something like this should be done.

This seems to come down to two main questions:

  1. Retroactive or running score?
  2. What should the final score be? One number (accuracy?) or multiple measures (volume of tasks done, and accuracy)?

Thoughts?

How about this. 


A player’s score is increased by max(10, average over other players of 100 * (new voxels + 50 - disagreement) / (new voxels + 50)).

Thus, if a player finds 500 new voxels, with no disagreement, the player’s score will increase by 100(550-0)/550 = 100 points (1 to 4 other players) or 10 points (0 other players).

If a second player finds 700 new voxels, where the previous player found only 500, then the player’s score will increase by 100(750-200)/750 = 73 points.

A third player, finding 500 new voxels, will score the average of 100(550-0)/550 and 100(550-200)/550 = average of 100 and 64 = 82 points.

And so on. Why is it voxels + 50? Because if you didn’t find any new voxels because the AI was perfect, you still get something. And why max with 10? Because if there is no other player to compare against, you should still get something, along with a “medal” for trailblazing. The medals accumulate so that at least you don’t feel somewhat cheated for trailblazing.
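Here’s a small sketch of that immediate score in code; the set-based disagreement count and the names are my own shorthand for the formula above:

```python
def immediate_score(my_voxels, previous_selections):
    """Immediate score on submitting a task.

    my_voxels is the set of newly selected voxel ids; previous_selections
    is a list of the same for players who already did this task.
    Disagreement with another player is the size of the symmetric
    difference between the two selections.
    """
    if not previous_selections:
        return 10  # trailblazer: nothing to compare against yet
    new = len(my_voxels)
    per_player = [
        100 * (new + 50 - len(my_voxels ^ theirs)) / (new + 50)
        for theirs in previous_selections
    ]
    return max(10, sum(per_player) / len(per_player))
```

Run against the example above, a 700-voxel submission compared to an earlier 500-voxel one comes out at about 73, and the third 500-voxel submission averages to about 82, matching the numbers quoted.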

When all five players have completed a given task, and the volume is synthesized, each player gets awarded extra points (with an announcement) like this:

score += 100 * (AI voxels + agreed new voxels) / (AI voxels + submitted new voxels) * ((AI voxels + agreed new voxels) / AI voxels) * (trailblazer ? 2 : 1).

Let’s suppose that there were 1000 voxels that the AI found. Thus, in the above example, the first player gets 100*(1500/1500) * (1500/1000) * 2 = 300 points. The second player gets 100*(1500/1700) * (1500/1000) * 1 = 132 points. The third player gets 100*(1500/1500) * (1500/1000) * 1 = 150 points.

So in the end, it’s:

player 1: 310
player 2: 205
player 3: 232

which intuitively is fair. Remember that 20% of the time you’ll be a trailblazer, so you’ll get the extra bonus.

Why add AI voxels on both sides? Because if the AI was perfect, you should still get something. Why the additional trailblazing factor? To compensate for the first person. And why the factor with AI voxels in the denominator? Because the more voxels you find relative to the AI, the more you should be rewarded.

In the occasional case, the AI finds a small number of voxels and is already perfect. If you find no new voxels, the trailblazer still gets 10 points, and later players get the full 100, since there is nothing to disagree on. But after all five results are in, 200 points goes to the trailblazer, and 100 points to every other player. This is a reward for checking over the AI’s work.

One other case to consider is a troll who clicks on everything to get a massive submitted volume, for example 100,000 voxels. In this case, the troll’s initial score will be max(10, small number) = 10, and the post facto score will be practically zero, which also is intuitively fair. Unfortunately, all other players on that task also get only 10 up front, because they’re averaged against the troll; nevertheless, all players who got it right still get their points post facto, while the troll gets nothing.
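And the post facto award, under the same assumptions (counts of AI, submitted, and agreed voxels, plus a trailblazer flag):

```python
def synthesis_bonus(ai_voxels, submitted_new, agreed_new, trailblazer):
    """Extra points handed out once all five results are synthesised."""
    agreement = (ai_voxels + agreed_new) / (ai_voxels + submitted_new)
    better_than_ai = (ai_voxels + agreed_new) / ai_voxels
    return 100 * agreement * better_than_ai * (2 if trailblazer else 1)

# The worked example: the AI found 1000 voxels, the synthesis kept 500 new ones.
print(synthesis_bonus(1000, 500, 500, True))   # 300 (player 1)
print(synthesis_bonus(1000, 700, 500, False))  # ~132 (player 2)
print(synthesis_bonus(1000, 500, 500, False))  # 150 (player 3)
```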


(Continued)



You should also be able to click on your score and find out why you got that score for a given task. Player 2’s explanation, for example, would read:

The AI found 1000 voxels. You initially scored 73 points out of the maximum 100 because you submitted 700 new voxels and one previous player disagreed by 200 voxels. After synthesis of the results, you scored an additional 132 points out of the maximum 150 (100 times a better-than-AI bonus factor of 1.5x) because, of your total 1700 voxels, the synthesis agreed with 1500. Your total score is 205 out of the maximum 250. (View this task)

Player 1’s explanation would read:

The AI found 1000 voxels. You initially scored 10 points plus a trailblazer award because you were the first player to examine this task. After synthesis of the results, you scored an additional 300 points out of the maximum 300 (100 times a better-than-AI bonus factor of 1.5x, with a trailblazer bonus factor of 2x) because, of your total 1500 voxels, the synthesis agreed with all 1500. Your total score is 310 out of the maximum 310. (View this task)

And player 3’s explanation:

The AI found 1000 voxels. You initially scored 82 points out of the maximum 100 because you submitted 500 new voxels, two previous players agreed with those 500 voxels, and one previous player disagreed by 200 voxels. After synthesis of the results, you scored an additional 150 points out of the maximum 150 (100 times a better-than-AI bonus factor of 1.5x) because, of your total 1500 voxels, the synthesis agreed with all 1500. Your total score is 232 out of the maximum 250. (View this task)
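These explanations could be assembled mechanically from the same numbers used for scoring; a rough sketch (the wording and names are purely illustrative):

```python
def explain_score(ai_voxels, submitted_new, agreed_new,
                  initial_points, initial_max, bonus_points, bonus_max):
    """Build the per-task explanation shown when a player clicks a score."""
    return (
        f"The AI found {ai_voxels} voxels. "
        f"You initially scored {initial_points} out of the maximum {initial_max}. "
        f"After synthesis of the results, you scored an additional {bonus_points} "
        f"out of the maximum {bonus_max} because, of your total "
        f"{ai_voxels + submitted_new} voxels, the synthesis agreed with "
        f"{ai_voxels + agreed_new}. Your total score is "
        f"{initial_points + bonus_points} out of the maximum {initial_max + bonus_max}."
    )
```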



So in summary, this algorithm accomplishes the following:

  • Rewards players for trailblazing.
  • Rewards players for agreeing with each other and, post facto, with the synthesis.
  • Partially rewards players for partially agreeing with each other and, post facto, with the synthesis.
  • Rewards players even if there is nothing to do.
  • Rewards players for being better than the AI.
  • Does not reward players who blindly click.
  • Is fair, but not 100% fair (people respond better to intermittent rewards than to consistent rewards, go figure).

Sidestepping the maths for the moment, I like the idea of AI generated feedback on the scores. It isn’t quite as simple as giving people a number, but with retroactively corrected scores, it would help alleviate the ‘Where’s my score gone, waaa waaa waaa’ response.


This could be augmented by a 3D view that shows the consensus structure and, in a different colour, the player’s disagreement. This would act as a feedback mechanism for learning.

As an aside, it should be noted that immediate points gratification doesn’t factor into projects such as PlanetHunters.org. They are rather popular and have a long feedback time, even if you have found something.

I know this hasn’t been central to the discussion, but I think the main way the current system is “non-gamified” is that trailblazers get no points. As it stands, their work can’t be scored with any accuracy without granting them points after the fact. I’d like to suggest a relatively simple way to solve this in the current format, without unbalancing scores or delaying feedback: when a trailblazer finishes a task, give them a “TB token”. Players can spend a token when they’re given a task, doubling their score for that task once they’re done.


This gives immediate gratification upon finishing a task, same as being granted points does normally. It also feels like a “special” reward for being the first to do something, which is a special situation to be in. The double points won’t/shouldn’t unbalance things because, in effect, you’re doing two tasks and getting two tasks’ worth of points. Essentially, it’ll turn getting a brand new task from an annoyance into a special reward, and add a small extra game-like element to the project.
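A minimal sketch of the token mechanic, assuming the per-task score is already computed elsewhere (the class and method names are my own):

```python
class Player:
    """Tracks trailblazer (TB) tokens and applies the double-score bonus."""

    def __init__(self):
        self.tb_tokens = 0
        self.score = 0

    def finish_task(self, task_score, was_trailblazer, spend_token=False):
        if was_trailblazer:
            self.tb_tokens += 1   # immediate, tangible reward for trailblazing
        if spend_token and self.tb_tokens > 0:
            self.tb_tokens -= 1
            task_score *= 2       # this one task is worth double
        self.score += task_score
        return task_score
```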

On a meta note, if down the line there’s more emphasis on tracking player stats and letting them see each other’s (perhaps to promote friendly rivalries), the race to have the highest “TB tokens granted” should get the highest-level players that much more involved. Especially as there are only so many tasks in total (with some already trailblazed), making these tokens an ultimately limited resource…