My data scientist friend suggested two changes to the emoji/emoticon script I wrote: sort the list by score, and use tf-idf to calculate the significance of each detected emoticon or emoji and filter on that.
tf-idf stands for “term frequency, inverse document frequency.” The idea is that if a term appears a lot in a document (tweet), then that term must be important. But if the term also appears in a lot of documents, then it must be not-so-important. Here’s the pseudocode to calculate a tf-idf score for each given word in a document:
let tf = number of times the word appears in the document / number of words in the document
let n_containing = the number of documents where the word appears
let idf = math.log(number of documents / (1 + n_containing))
let tf-idf = tf * idf
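A minimal Python sketch of the pseudocode above, assuming each tweet has already been split into a list of tokens and that idf takes the common form log(N / (1 + n_containing)). The function names and the sample tweets are made up for illustration; this is not the script that produced the results below.

```python
import math

def tf(word, tokens):
    """Term frequency: share of the tweet's tokens that equal `word`."""
    if not tokens:
        return 0.0
    return tokens.count(word) / len(tokens)

def n_containing(word, docs):
    """Number of tweets (token lists) in which `word` appears at least once."""
    return sum(1 for tokens in docs if word in tokens)

def idf(word, docs):
    """Inverse document frequency; the +1 avoids dividing by zero (assumed form)."""
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def tf_idf(word, tokens, docs):
    return tf(word, tokens) * idf(word, docs)

# Hypothetical sample tweets, split on whitespace.
docs = [t.split() for t in
        ["I ❤ this", "so good ❤ ❤", "meh", "not today", "ok then", "fine"]]

score = tf_idf("❤", docs[1], docs)  # tf = 2/4, idf = log(6 / 3)
```

One thing to keep in mind with this form: because of the +1, a term that appears in every tweet gets a slightly negative idf.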
To get a single tf-idf value for a word across all tweets, I took the median of its tf-idf scores over the tweets that contained that term.
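As a rough sketch of that aggregation step, assuming tokenized tweets and the log(N / (1 + n_containing)) form of idf, the median can be taken with the standard library. The name `median_tf_idf` and the toy data are hypothetical.

```python
import math
from statistics import median

def median_tf_idf(word, docs):
    """Median tf-idf of `word` over only the tweets that contain it.

    `docs` is a list of token lists. Returns None if no tweet contains the word.
    """
    containing = [tokens for tokens in docs if word in tokens]
    if not containing:
        return None
    # idf is the same for every tweet, so compute it once (assumed form).
    idf = math.log(len(docs) / (1 + len(containing)))
    scores = [tokens.count(word) / len(tokens) * idf for tokens in containing]
    return median(scores)

# Toy data: "a" appears in 3 of 8 tweets, with tf values 0.5, 0.25 and 1.0.
docs = [["a", "b"], ["a", "b", "c", "d"], ["a"],
        ["x"], ["y"], ["z"], ["w"], ["v"]]
result = median_tf_idf("a", docs)  # median tf 0.5 times idf log(8 / 4)
```

Tweets without the term are left out on purpose: their tf-idf is 0, so including them would pull the median to 0 for any term that appears in fewer than half of the tweets.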
So that emojis count as words in a tweet, I defined a word as any run of non-whitespace characters. I matched words with the following regular expression and counted the occurrences of each within the tweet:
\S+
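For illustration, here is how that pattern could be applied with Python's `re` module and a `Counter`. The helper name and the sample tweet are assumptions for the example.

```python
import re
from collections import Counter

# A "word" is any maximal run of non-whitespace characters.
TOKEN_RE = re.compile(r"\S+")

def count_tokens(tweet):
    """Count how often each whitespace-delimited token occurs in a tweet."""
    return Counter(TOKEN_RE.findall(tweet))

counts = count_tokens("love it ❤ ❤ !!!!")
total_words = sum(counts.values())
```

Note that `\S+` does not split on punctuation or between adjacent emojis, so `❤❤` written without a space is counted as a single token that is distinct from `❤`.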
I expanded the tweet list to 16K tweets, and then filtered for terms that appeared at least 20 times and had a median tf-idf of at least 0.01. The only pre-processing I did within the tweet was to replace any whitespace, including newlines, with a single space.
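The pre-processing and the filter might look roughly like this in Python. Two assumptions are baked in: runs of whitespace are collapsed into one space, and both thresholds are treated as inclusive, since the list below includes terms with exactly 20 occurrences. The function names are hypothetical.

```python
import re

def normalize_whitespace(tweet):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", tweet)

def keep_term(occurrences, tf_idf_score, min_occurrences=20, min_tf_idf=0.01):
    """Filter rule: frequent enough and significant enough (inclusive bounds assumed)."""
    return occurrences >= min_occurrences and tf_idf_score >= min_tf_idf

cleaned = normalize_whitespace("so good\n\n❤\t ❤")
```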
Here’s the new list — much nicer! Thanks for the suggestions, Michael!
Term | Sentiment Score | TF-IDF | Occurrences | Positive Score | Negative Score |
❤ | 5.32 | 0.040 | 28 | 151 | -2 |
♡ | 5.09 | 0.023 | 22 | 114 | -2 |
? | 4.58 | 0.059 | 26 | 119 | -0 |
✨ | 4.52 | 0.099 | 21 | 99 | -4 |
? | 3.73 | 0.032 | 37 | 138 | -0 |
? | 3.50 | 0.026 | 42 | 173 | -26 |
? | 3.39 | 0.012 | 101 | 358 | -16 |
? | 3.07 | 0.027 | 30 | 112 | -20 |
‘. | 2.56 | 0.016 | 32 | 90 | -8 |
? | 2.44 | 0.022 | 50 | 145 | -23 |
😉 | 2.36 | 0.035 | 22 | 52 | -0 |
?? | 2.23 | 0.044 | 22 | 63 | -14 |
£ | 2.18 | 0.026 | 28 | 67 | -6 |
?. | 2.00 | 0.020 | 29 | 58 | -0 |
? | 2.00 | 0.020 | 29 | 58 | -0 |
? | 2.00 | 0.040 | 58 | 116 | -0 |
? | 2.00 | 0.020 | 29 | 58 | -0 |
? | 2.00 | 0.020 | 29 | 58 | -0 |
? | 2.00 | 0.020 | 29 | 58 | -0 |
? | 2.00 | 0.040 | 58 | 116 | -0 |
= | 1.72 | 0.031 | 29 | 89 | -39 |
? | -1.58 | 0.023 | 31 | 47 | -96 |
| | 1.56 | 0.012 | 79 | 175 | -52 |
.@ | 1.52 | 0.017 | 40 | 96 | -35 |
@__ | 1.46 | 0.033 | 28 | 79 | -38 |
“@ | 1.41 | 0.023 | 27 | 65 | -27 |
?! | 1.39 | 0.026 | 28 | 67 | -28 |
? | -1.37 | 0.035 | 30 | 22 | -63 |
~ | 1.36 | 0.024 | 39 | 84 | -31 |
– | 1.21 | 0.017 | 38 | 77 | -31 |
? | -1.18 | 0.042 | 22 | 26 | -52 |
!!!! | 1.08 | 0.036 | 36 | 75 | -36 |
?? | 1.06 | 0.020 | 51 | 120 | -66 |
? | 1.05 | 0.049 | 20 | 33 | -12 |
?? | -0.99 | 0.014 | 70 | 97 | -166 |
__: | -0.98 | 0.023 | 41 | 47 | -87 |
] | 0.94 | 0.013 | 67 | 114 | -51 |
? | -0.87 | 0.037 | 23 | 26 | -46 |
[ | 0.79 | 0.014 | 61 | 96 | -48 |
__ | 0.70 | 0.016 | 60 | 129 | -87 |
??? | -0.66 | 0.013 | 85 | 101 | -157 |
🙁 | 0.61 | 0.026 | 31 | 64 | -45 |
??? | 0.55 | 0.051 | 20 | 44 | -33 |
.” | 0.51 | 0.018 | 37 | 53 | -34 |
* | 0.51 | 0.013 | 184 | 295 | -202 |
? | -0.42 | 0.023 | 52 | 77 | -99 |
….. | 0.41 | 0.020 | 37 | 68 | -53 |
? | -0.39 | 0.043 | 23 | 46 | -55 |
???? | 0.33 | 0.041 | 21 | 40 | -33 |
? | 0.29 | 0.017 | 51 | 84 | -69 |
? | -0.29 | 0.039 | 31 | 52 | -61 |
;& | -0.25 | 0.016 | 116 | 126 | -155 |
(term lost) | 0.25 | 0.033 | 36 | 57 | -48 |
” | -0.22 | 0.021 | 27 | 44 | -50 |
?” | 0.03 | 0.018 | 30 | 48 | -47 |