I recently wrote an algorithm that uses the answers from 57,000 stories to predict which three topics people might choose for a story with similar words in it. The rigorous way to do this is to set aside 10-20% of the data to test the algorithm and use the rest to “train” it, then run the algorithm on the test set to estimate how likely it is to choose the correct topic from among these 10 choices:
That means that I can accurately predict stories about “knowledge” 96% of the time, but “security” stories only 2.8% of the time. Accuracy correlates only weakly with the number of stories tagged with a topic. Fun is a seldom-used topic, but matches with 67% accuracy; self-esteem is only 0.4% accurate, yet it is tagged in 3X as many stories as fun.
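For anyone who wants to try the same kind of check, here is a minimal sketch of the hold-out test and the per-topic accuracy, not my actual code. The names `split_train_test`, `per_topic_accuracy`, and `predict` are made up for illustration, and I am assuming each story is a `(text, topic)` pair with a single tagged topic, and that a prediction counts as correct when that topic is among the guesses returned.

```python
import random
from collections import defaultdict


def split_train_test(stories, test_fraction=0.2, seed=0):
    """Shuffle a copy of the stories and hold out a fraction for testing."""
    shuffled = list(stories)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Everything after the held-out slice is training data.
    return shuffled[n_test:], shuffled[:n_test]


def per_topic_accuracy(test_stories, predict):
    """For each tagged topic, the share of its test stories where
    predict(text) includes that topic among its guesses."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for text, topic in test_stories:
        totals[topic] += 1
        if topic in predict(text):
            hits[topic] += 1
    return {topic: hits[topic] / totals[topic] for topic in totals}


if __name__ == "__main__":
    # Toy data and a toy predictor that always guesses "fun" (both hypothetical).
    stories = [(f"story {i}", "fun" if i % 2 else "knowledge") for i in range(100)]
    train, test = split_train_test(stories, test_fraction=0.2)
    print(len(train), len(test))
    print(per_topic_accuracy(test, lambda text: ["fun"]))
```

Reporting accuracy per topic, rather than one overall number, is what exposes the gap between a topic like knowledge and a topic like security.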
Next I thought, “maybe the most common words in each reference dictionary are too similar among all 10 topics.” Sure enough, the top words are similar in many of the 10 topics. Words like ‘school’, ‘organization’, and ‘community’ are present in all stories, and so offer no differentiating ability. I should remove them.
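One way to do that removal, sketched under some assumptions: build a top-word list per topic, find the words that appear in every topic's list, and drop them. The function names, the whitespace tokenizer, and the cutoff `n` are all placeholders of mine, not the real reference dictionaries.

```python
from collections import Counter


def top_words(texts, n):
    """Return the n most frequent lowercase, whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(n)]


def drop_shared_words(topic_texts, n=50):
    """topic_texts maps topic -> list of story texts.

    Returns (filtered, shared): each topic's top words with the words
    common to every topic's top list removed, plus the removed set.
    """
    tops = {topic: top_words(texts, n) for topic, texts in topic_texts.items()}
    if not tops:
        return {}, set()
    shared = set.intersection(*(set(words) for words in tops.values()))
    filtered = {
        topic: [word for word in words if word not in shared]
        for topic, words in tops.items()
    }
    return filtered, shared


if __name__ == "__main__":
    # Two made-up topics with a few made-up stories each.
    topic_texts = {
        "fun": ["school games games", "community games"],
        "security": ["school police", "community police police"],
    }
    filtered, shared = drop_shared_words(topic_texts, n=3)
    print(shared)
    print(filtered)
```

Requiring a word to be in every topic's list is the strictest rule. A looser version could drop words that appear in most of the lists, at the risk of throwing away words that still separate a few topics.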