A data scientist friend and I talked briefly about how hard it is to parse text into words. He mentioned that he felt not enough attention was paid to emoticons or emoji in the Twitter sentiment analysis papers he reads.
In this context, a sentiment score is a measure of emotion associated with a word or phrase. Negative scores mean the word is associated with negative emotions, and positive scores mean the word is associated with positive emotions.
Coincidentally, I’d been taking a data science course, and the assignment I was working on concerned Twitter sentiment analysis. It wasn’t too hard to adapt the homework to estimate sentiment for non-alphanumeric characters. The regular expression I used to grab emoticons and emoji was:
[^A-Za-z0-9\s]+
Clearly, the above regex captures more than emoticons and emoji: it also matches punctuation and words written in non-ASCII scripts. Nonetheless, I thought it’d be an interesting start.
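As a rough illustration, here is how that regex could be applied in Python. This is a sketch, not the code from the homework: the function name `extract_terms` and the sample strings are made up for the example.

```python
import re

# Same pattern as above: one or more characters that are neither
# ASCII letters, ASCII digits, nor whitespace.
TERM_RE = re.compile(r"[^A-Za-z0-9\s]+")

def extract_terms(text):
    """Return every run of non-alphanumeric, non-space characters (hypothetical helper)."""
    return TERM_RE.findall(text)

# Punctuation and URL fragments come out alongside the emoticon.
print(extract_terms("Great day!!! :) http://t.co/x @bob"))
# ['!!!', ':)', '://', '.', '/', '@']

# Accented letters are matched too, which is the over-capture noted above.
print(extract_terms("café"))
# ['é']
```

Because adjacent symbols are grouped into one run, fragments such as `://` show up as terms of their own.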
To calculate the sentiment score of each non-alphanumeric term, the formula I used was:
let pos = [cumulative score of positive sentiment words in the tweets in which the term appeared]
let neg = [cumulative score of negative sentiment words in the tweets in which the term appeared]
let count = [number of times the term appeared in the collected tweets]
let sentiment = pos / count - neg / count
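The sentiment words and their scores are taken from the AFINN list. Here neg is the magnitude of the negative score; the table below reports it with a minus sign, so each sentiment score equals (Positive Score + Negative Score) / Occurrences.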
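The formula can be sketched in Python roughly as follows. Several things here are assumptions made for illustration: the function name `score_terms`, the tiny stand-in lexicon (its values are invented, not read from AFINN), the simple word tokenizer, and the choice to add a tweet's scores once per occurrence of a term.

```python
import re
from collections import defaultdict

TERM_RE = re.compile(r"[^A-Za-z0-9\s]+")   # non-alphanumeric terms
WORD_RE = re.compile(r"[A-Za-z]+")         # assumed word tokenizer

# Stand-in lexicon with invented values; a real run would load AFINN.
TOY_LEXICON = {"love": 3, "great": 3, "hate": -3, "bad": -3}

def score_terms(tweets, lexicon):
    """Return {term: (sentiment, count, pos, signed_neg)} for each term."""
    pos = defaultdict(int)    # cumulative positive score per term
    neg = defaultdict(int)    # cumulative negative score (as a magnitude)
    count = defaultdict(int)  # occurrences per term
    for tweet in tweets:
        scores = [lexicon.get(w, 0) for w in WORD_RE.findall(tweet.lower())]
        tweet_pos = sum(s for s in scores if s > 0)
        tweet_neg = sum(-s for s in scores if s < 0)
        # Assumption: every occurrence of a term inherits the tweet's totals.
        for term in TERM_RE.findall(tweet):
            pos[term] += tweet_pos
            neg[term] += tweet_neg
            count[term] += 1
    return {
        t: (pos[t] / count[t] - neg[t] / count[t], count[t], pos[t], -neg[t])
        for t in count
    }

tweets = ["I love this!", "I hate this!", "great, great day!"]
result = score_terms(tweets, TOY_LEXICON)
print(result["!"])  # (2.0, 3, 9, -3)
print(result[","])  # (6.0, 1, 6, 0)
```

The last element of each tuple is the negative score with its sign restored, which matches how the Negative Score column is printed.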
I did a quick-and-dirty run over about 6K English tweets collected from the Twitter sprinkler API. This is not a representative sample by any means, but again, my aim was a quick estimate, not a scientific paper. The results are below; terms that appeared fewer than 10 times are omitted.
Term | Sentiment Score | Occurrences | Positive Score | Negative Score |
@ | 1.41 | 2629 | 6337 | -2630 |
. | 1.19 | 2585 | 5816 | -2733 |
: | 0.99 | 1746 | 3770 | -2033 |
/ | 1.14 | 1470 | 3014 | -1341 |
:// | 1.07 | 1421 | 2871 | -1349 |
‘ | 0.61 | 1119 | 2503 | -1816 |
# | 1.74 | 753 | 1795 | -483 |
, | 1.22 | 692 | 1723 | -877 |
_ | 1.56 | 421 | 1013 | -355 |
! | 2.59 | 356 | 1111 | -190 |
… | 1.67 | 315 | 835 | -310 |
– | 1.27 | 265 | 538 | -201 |
“ | 0.57 | 255 | 498 | -353 |
? | 1.55 | 197 | 482 | -177 |
; | 0.97 | 188 | 447 | -265 |
& | 1.13 | 151 | 400 | -230 |
… | 0.60 | 122 | 234 | -161 |
!! | 2.17 | 114 | 352 | -105 |
( | 1.09 | 76 | 133 | -50 |
) | 1.01 | 68 | 112 | -43 |
_: | -0.28 | 58 | 95 | -111 |
@_ | 0.19 | 57 | 117 | -106 |
.. | 0.75 | 52 | 115 | -76 |
’ | 2.11 | 46 | 131 | -34 |
? | 0.33 | 43 | 84 | -70 |
;& | 0.66 | 38 | 53 | -28 |
❤️ | 3.24 | 37 | 129 | -9 |
* | 0.79 | 34 | 65 | -38 |
!!! | 2.73 | 30 | 100 | -18 |
🙂 | 2.93 | 27 | 98 | -19 |
? | 4.04 | 25 | 106 | -5 |
?? | 1.58 | 24 | 64 | -26 |
[ | 0.95 | 22 | 40 | -19 |
] | 0.64 | 22 | 36 | -22 |
$ | 1.95 | 22 | 62 | -19 |
??? | 0.05 | 21 | 22 | -21 |
…. | -0.05 | 21 | 30 | -31 |
.” | 0.05 | 19 | 23 | -22 |
? | 4.78 | 18 | 86 | 0 |
✨ | 5.00 | 18 | 90 | 0 |
| | 2.41 | 17 | 44 | -3 |
?? | -1.12 | 16 | 25 | -43 |
— | 0.27 | 15 | 33 | -29 |
__ | 1.29 | 14 | 34 | -16 |
% | 1.43 | 14 | 26 | -6 |
? | 2.54 | 13 | 42 | -9 |
“ | 0.69 | 13 | 24 | -15 |
+ | 2.25 | 12 | 33 | -6 |
? | 4.33 | 12 | 54 | -2 |
:… | 2.45 | 11 | 32 | -5 |
? | 0.18 | 11 | 20 | -18 |
??? | 4.00 | 11 | 44 | 0 |
 | 0.36 | 11 | 20 | -16 |
? | -0.40 | 10 | 11 | -15 |
__: | 0.90 | 10 | 23 | -14 |
? | 2.80 | 10 | 32 | -4 |
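As a sanity check on how the columns relate, a few rows can be recomputed from their raw totals. The helper name `sentiment_from_row` is hypothetical; the numbers are copied from the table, with the negative score kept signed as it is printed there.

```python
def sentiment_from_row(occurrences, positive, negative):
    """Recompute a row's score; `negative` is signed, as in the table."""
    return (positive + negative) / occurrences

# (occurrences, positive score, negative score) copied from the table
rows = {
    "@": (2629, 6337, -2630),
    "!": (356, 1111, -190),
    "#": (753, 1795, -483),
}

for term, (occ, pos, neg) in rows.items():
    print(term, round(sentiment_from_row(occ, pos, neg), 2))
# @ 1.41
# ! 2.59
# # 1.74
```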