High Uncertainty

Using tf-idf to measure emoji sentiment

By NinjasPounced | Published December 26, 2015

My data scientist friend suggested two changes to the emoji/emoticon script I wrote: sort the list by score, and use tf-idf to calculate the significance of each detected emoticon or emoji and filter on that.

tf-idf stands for “term frequency, inverse document frequency.” The idea is that if a term appears a lot in a document (tweet), then that term must be important. But if the term also appears in a lot of documents, then it must be not-so-important. Here’s the pseudocode to calculate a tf-idf score for each given word in a document:

let tf = number of times the word appears in the document / number of words in the document
let n_containing = the number of documents where the word appears
let idf = math.log(number of documents) / (1 + n_containing)
let tf-idf = tf * idf
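The pseudocode above can be sketched in Python roughly as follows. The function names and the three sample documents are hypothetical, and each document is assumed to be a non-empty list of words. The idf line follows the pseudocode as written; the more common textbook variant takes the log of the ratio instead, as noted in the comment.

```python
import math

def tf(word, doc_words):
    # Fraction of the document's words that are `word`.
    return doc_words.count(word) / len(doc_words)

def n_containing(word, docs):
    # Number of documents in which `word` appears at least once.
    return sum(1 for doc_words in docs if word in doc_words)

def idf(word, docs):
    # As written in the pseudocode: log(N) divided by (1 + n_containing).
    # The more common variant is math.log(len(docs) / (1 + n_containing)).
    return math.log(len(docs)) / (1 + n_containing(word, docs))

def tf_idf(word, doc_words, docs):
    return tf(word, doc_words) * idf(word, docs)

# Hypothetical sample data: each document is one tweet split into words.
docs = [["i", "❤", "this"], ["so", "tired", "today"], ["❤", "❤", "you"]]
score = tf_idf("❤", docs[2], docs)
print(round(score, 4))
```

Swapping in the textbook idf changes the scale of the scores, so any threshold applied to them afterwards would need to be chosen again.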

To summarize a term's tf-idf across all tweets, I took the median of its tf-idf scores over only the tweets that contained that term.
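A minimal sketch of that aggregation step, assuming the per-tweet scores are already available as one dictionary per tweet (the `per_tweet_scores` structure and the sample values are hypothetical):

```python
from statistics import median

def median_tf_idf(term, per_tweet_scores):
    # Collect the term's score from every tweet that contains it;
    # tweets without the term are skipped rather than counted as zero.
    values = [scores[term] for scores in per_tweet_scores if term in scores]
    return median(values) if values else None

# Hypothetical per-tweet tf-idf scores (term -> score).
per_tweet_scores = [
    {"❤": 0.04, "!": 0.01},
    {"!": 0.03},
    {"❤": 0.02},
    {"❤": 0.09},
]
heart_median = median_tf_idf("❤", per_tweet_scores)
print(heart_median)
```

Skipping tweets that lack the term matters: counting them as zeros would pull the median of any rare term down to zero.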

So that emoji count as words when measuring a tweet's length, I defined a word as any run of non-whitespace characters. I used the following regular expression and counted the occurrences of each match within the tweet:

\S+
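As a rough illustration, the expression can be applied with Python's `re` module and the matches tallied with a `Counter`. The helper name and the sample tweet are made up for this sketch.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\S+")

def count_words(tweet):
    # Every run of non-whitespace characters counts as one word,
    # so a standalone emoji or emoticon is a word too.
    return Counter(WORD_RE.findall(tweet))

counts = count_words("so happy ❤ ❤ :) today")
total_words = sum(counts.values())
print(counts.most_common(1), total_words)
```

One consequence of this definition is that punctuation attached to a word stays part of it, so `happy!` and `happy` are counted as different words.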

I expanded the tweet list to 16K tweets, and then filtered for terms that appeared at least 20 times and had an average tf-idf of at least 0.01. The only pre-processing I did within each tweet was to replace whitespace, including newlines, with a single space.
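The pre-processing and filtering steps might look something like the sketch below. It makes two assumptions: runs of whitespace collapse to one space, and both thresholds are inclusive (the table does list terms with exactly 20 occurrences). The function names and the `stats` layout are hypothetical. The two emoji rows reuse values from the table, while `rare` and `common` are invented to show what gets dropped.

```python
import re

def normalize_whitespace(tweet):
    # Collapse any run of whitespace (spaces, tabs, newlines) to one space.
    return re.sub(r"\s+", " ", tweet)

def keep_terms(stats, min_occurrences=20, min_tf_idf=0.01):
    # stats maps term -> (occurrences, average tf-idf).
    return {
        term: (occurrences, score)
        for term, (occurrences, score) in stats.items()
        if occurrences >= min_occurrences and score >= min_tf_idf
    }

cleaned = normalize_whitespace("so\n\nhappy \t today")
stats = {
    "❤": (28, 0.040),
    "✨": (21, 0.099),
    "rare": (5, 0.200),      # too few occurrences
    "common": (500, 0.001),  # tf-idf too low
}
kept = keep_terms(stats)
print(cleaned, sorted(kept))
```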

Here’s the new list — much nicer! Thanks for the suggestions, Michael!

Term	Sentiment Score	TF-IDF	Occurrences	Positive Score	Negative Score
❤	5.32	0.040	28	151	-2
♡	5.09	0.023	22	114	-2
?	4.58	0.059	26	119	-0
✨	4.52	0.099	21	99	-4
?	3.73	0.032	37	138	-0
?	3.50	0.026	42	173	-26
?	3.39	0.012	101	358	-16
?	3.07	0.027	30	112	-20
‘.	2.56	0.016	32	90	-8
?	2.44	0.022	50	145	-23
😉	2.36	0.035	22	52	-0
??	2.23	0.044	22	63	-14
£	2.18	0.026	28	67	-6
?.	2.00	0.020	29	58	-0
?	2.00	0.020	29	58	-0
?	2.00	0.040	58	116	-0
?	2.00	0.020	29	58	-0
?	2.00	0.020	29	58	-0
?	2.00	0.020	29	58	-0
?	2.00	0.040	58	116	-0
=	1.72	0.031	29	89	-39
?	-1.58	0.023	31	47	-96
|	1.56	0.012	79	175	-52
.@	1.52	0.017	40	96	-35
@__	1.46	0.033	28	79	-38
“@	1.41	0.023	27	65	-27
?!	1.39	0.026	28	67	-28
?	-1.37	0.035	30	22	-63
~	1.36	0.024	39	84	-31
–	1.21	0.017	38	77	-31
?	-1.18	0.042	22	26	-52
!!!!	1.08	0.036	36	75	-36
??	1.06	0.020	51	120	-66
?	1.05	0.049	20	33	-12
??	-0.99	0.014	70	97	-166
__:	-0.98	0.023	41	47	-87
]	0.94	0.013	67	114	-51
?	-0.87	0.037	23	26	-46
[	0.79	0.014	61	96	-48
__	0.70	0.016	60	129	-87
???	-0.66	0.013	85	101	-157
🙁	0.61	0.026	31	64	-45
???	0.55	0.051	20	44	-33
.”	0.51	0.018	37	53	-34
*	0.51	0.013	184	295	-202
?	-0.42	0.023	52	77	-99
…..	0.41	0.020	37	68	-53
?	-0.39	0.043	23	46	-55
????	0.33	0.041	21	40	-33
?	0.29	0.017	51	84	-69
?	-0.29	0.039	31	52	-61
;&	-0.25	0.016	116	126	-155
	0.25	0.033	36	57	-48
”	-0.22	0.021	27	44	-50
?”	0.03	0.018	30	48	-47

Getting Twitter sentiment of emoticons and emoji

By NinjasPounced | Published December 26, 2015

A data scientist friend and I talked briefly about how hard it is to parse text into words. He mentioned that he felt that not enough attention was paid to emoticons or emoji in the Twitter sentiment analysis papers he reads.

In this context, a sentiment score is a measure of emotion associated with a word or phrase. Negative scores mean the word is associated with negative emotions, and positive scores mean the word is associated with positive emotions.

Coincidentally, I’d been taking a data science course, and the assignment I was working on concerned Twitter sentiment analysis. It wasn’t too hard to adapt the homework to estimate sentiment for non-alphanumeric characters. The regular expression I used to grab emoticons and emoji was:

[^A-Za-z0-9\s]+

Clearly, the above regex captures more than emoticons and emoji: it also captures punctuation and valid words in languages written with non-ASCII characters. Nonetheless, I thought it’d be an interesting start.
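A quick way to see what the expression picks up is to run it over a sample string with Python's `re` module. The tweet below is invented for illustration.

```python
import re

TERM_RE = re.compile(r"[^A-Za-z0-9\s]+")

# Each match is a maximal run of characters that are neither ASCII
# letters, ASCII digits, nor whitespace.
terms = TERM_RE.findall("Love this!! ❤ http://t.co/x @friend :)")
print(terms)
# → ['!!', '❤', '://', '.', '/', '@', ':)']
```

The URL and the mention contribute `://`, `.`, `/` and `@`, which is why fragments like these sit near the top of any list ranked by occurrence count.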

To calculate the sentiment score of each non-alphanumeric term, the formula I used was:

let pos = [cumulative score of positive sentiment words of the tweets in which the term appeared]
let neg = [cumulative score of negative sentiment words of the tweets in which the term appeared]
let count = [number of times the term appeared across all tweets]
let sentiment = pos / count - neg / count

The sentiment words and their scores are taken from the AFINN list.
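A rough Python sketch of this formula follows, under several assumptions. The tiny `AFINN_SAMPLE` dictionary is a hypothetical stand-in for the real list. A tweet's scores are added once per occurrence of the term. `neg` is accumulated as a magnitude, which is consistent with the table: for `@`, 6337 / 2629 - 2630 / 2629 ≈ 1.41.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the AFINN list (word -> integer score).
AFINN_SAMPLE = {"love": 3, "happy": 3, "bad": -3, "hate": -3}
TERM_RE = re.compile(r"[^A-Za-z0-9\s]+")

def term_sentiments(tweets, lexicon):
    pos = defaultdict(int)
    neg = defaultdict(int)
    count = defaultdict(int)
    for tweet in tweets:
        scores = [lexicon.get(w, 0) for w in re.findall(r"[a-z]+", tweet.lower())]
        tweet_pos = sum(s for s in scores if s > 0)
        tweet_neg = -sum(s for s in scores if s < 0)  # stored as a magnitude
        # Assumption: the tweet's totals are credited once per occurrence.
        for term in TERM_RE.findall(tweet):
            count[term] += 1
            pos[term] += tweet_pos
            neg[term] += tweet_neg
    return {t: pos[t] / count[t] - neg[t] / count[t] for t in count}

tweets = ["I love this !", "so bad !", "happy happy day !"]
sentiments = term_sentiments(tweets, AFINN_SAMPLE)
print(sentiments)
# → {'!': 2.0}
```

Here `!` collects 9 positive points and 3 negative points over 3 occurrences, giving 9 / 3 - 3 / 3 = 2.0.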

I did a quick and dirty run through about 6K English tweets collected from the Twitter sprinkler API. Not a representative sample by any means, but again, my aim was to get a quick estimate, not to write a scientific paper. The results are below; terms that appeared fewer than 10 times are omitted.

Term	Sentiment Score	Occurrences	Positive Score	Negative Score
@	1.41	2629	6337	-2630
.	1.19	2585	5816	-2733
:	0.99	1746	3770	-2033
/	1.14	1470	3014	-1341
://	1.07	1421	2871	-1349
‘	0.61	1119	2503	-1816
#	1.74	753	1795	-483
,	1.22	692	1723	-877
_	1.56	421	1013	-355
!	2.59	356	1111	-190
…	1.67	315	835	-310
–	1.27	265	538	-201
“	0.57	255	498	-353
?	1.55	197	482	-177
;	0.97	188	447	-265
&	1.13	151	400	-230
…	0.60	122	234	-161
!!	2.17	114	352	-105
(	1.09	76	133	-50
)	1.01	68	112	-43
_:	-0.28	58	95	-111
@_	0.19	57	117	-106
..	0.75	52	115	-76
’	2.11	46	131	-34
?	0.33	43	84	-70
;&	0.66	38	53	-28
❤️	3.24	37	129	-9
*	0.79	34	65	-38
!!!	2.73	30	100	-18
🙂	2.93	27	98	-19
?	4.04	25	106	-5
??	1.58	24	64	-26
[	0.95	22	40	-19
]	0.64	22	36	-22
$	1.95	22	62	-19
???	0.05	21	22	-21
….	-0.05	21	30	-31
.”	0.05	19	23	-22
?	4.78	18	86	0
✨	5.00	18	90	0
|	2.41	17	44	-3
??	-1.12	16	25	-43
—	0.27	15	33	-29
__	1.29	14	34	-16
%	1.43	14	26	-6
?	2.54	13	42	-9
“	0.69	13	24	-15
+	2.25	12	33	-6
?	4.33	12	54	-2
:…	2.45	11	32	-5
?	0.18	11	20	-18
???	4.00	11	44	0
	0.36	11	20	-16
?	-0.40	10	11	-15
__:	0.90	10	23	-14
?	2.80	10	32	-4
