Word frequencies are used in many text mining and information retrieval applications. Thanks to the Internet, it is now easy to find word frequency lists for various languages.
Wiktionary, the sister website of Wikipedia, provides such lists for many languages. I decided to extract two of these lists: the Gutenberg list and the TV movies list.
The Gutenberg list was built from the Project Gutenberg book collection, a collection of classic books that have fallen into the public domain. The list was last updated in 2005, and all books in the collection considered were published before 1923. The TV movies list was built from TV show and movie scripts and transcripts.
I extracted both lists by scraping the HTML of the Wiktionary pages. After some filtering, I got two dictionaries: the Gutenberg dictionary and the movies dictionary. The first contains 36500 words, whereas the second has 37500 words. Both dictionaries will be used in some future text mining projects.
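To give an idea of what the scraping step can look like, here is a minimal sketch using only the Python standard library. It is not the exact code I ran: the row layout (rank, word, count), the sample HTML, the helper names and the "alphabetic words only" filter are all assumptions made for illustration, and the real Wiktionary pages may be laid out differently.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells stay empty
                self.rows.append(self._row)
            self._row = None


def parse_frequency_table(html):
    """Return {word: count}, assuming rows shaped like (rank, word, count)."""
    parser = TableRowParser()
    parser.feed(html)
    freqs = {}
    for row in parser.rows:
        if len(row) < 3:
            continue
        word = row[1].lower()
        try:
            count = float(row[2].replace(",", ""))
        except ValueError:
            continue  # skip rows whose count is not numeric
        if word.isalpha():  # assumed filter: keep purely alphabetic entries
            freqs[word] = freqs.get(word, 0.0) + count
    return freqs


# Hypothetical sample; the counts are made up for the example.
SAMPLE_HTML = """
<table>
<tr><th>Rank</th><th>Word</th><th>Count</th></tr>
<tr><td>1</td><td><a href="/wiki/the">the</a></td><td>1,000</td></tr>
<tr><td>2</td><td><a href="/wiki/you">You</a></td><td>500</td></tr>
<tr><td>3</td><td>don't</td><td>20</td></tr>
</table>
"""

print(parse_frequency_table(SAMPLE_HTML))
```

In this sketch the entry "don't" is dropped by the alphabetic filter and "You" is lowercased, which is one possible way to get clean dictionary keys.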
The first question I was interested in was which English words are the most stable, that is, which words have very close frequencies in the Gutenberg and movies lists. To answer it, I defined the relative frequency change (rfc) metric as follows:
rfc = |f_gut - f_mov| / f_gut
where f_gut and f_mov are a word's frequencies in the Gutenberg and movies lists, respectively.
As can be seen from the equation above, the Gutenberg dictionary is used as the reference, since one usually moves from past to present. The table below shows the words with an rfc of less than 1%.
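The metric and the 1% filter can be sketched in a few lines of Python. This is an illustration rather than the exact code behind the table: the function names and the toy frequencies are made up, and it assumes both dictionaries map a word to a frequency expressed on the same scale (for example, occurrences per million words).

```python
def relative_frequency_change(f_gut, f_mov):
    """rfc = |f_gut - f_mov| / f_gut, with Gutenberg as the reference."""
    return abs(f_gut - f_mov) / f_gut


def stable_words(gutenberg, movies, threshold=0.01):
    """Return (word, rfc) pairs with rfc below threshold, most stable first."""
    common = gutenberg.keys() & movies.keys()  # words present in both lists
    scored = (
        (word, relative_frequency_change(gutenberg[word], movies[word]))
        for word in common
        if gutenberg[word] > 0  # avoid dividing by zero
    )
    return sorted(
        (pair for pair in scored if pair[1] < threshold),
        key=lambda pair: (pair[1], pair[0]),
    )


# Toy frequencies, invented for the example (not taken from the real lists).
gutenberg = {"should": 1000.0, "texas": 200.0, "thou": 900.0, "whale": 50.0}
movies = {"should": 1000.0, "texas": 201.0, "thou": 9.0, "okay": 700.0}

for word, rfc in stable_words(gutenberg, movies):
    print(word, rfc)
```

Only words that appear in both dictionaries can be scored, so words such as "whale" and "okay" in the toy data are ignored.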
I had expected this list to be populated mostly by stop words. This is clearly not the case. Anecdotally, the most stable US state (frequency-wise) is Texas, and the most stable country is Cuba. Chimpanzees did not evolve much over time. Also, the only family-related words that made it onto the list are motherhood and stepmother. Finally, the only biblical character who managed to get on the list, or should I say who managed to be raised from the dead, is Lazarus.
word | rfc (fraction; 0.01 = 1%) |
should | 0.00004743506895271617 |
loyalty | 0.00005496073882532188 |
secretary | 0.00024423521858014236 |
feeding | 0.0003610843091807392 |
healing | 0.000394336394769263 |
compromise | 0.0005731197973101149 |
dot | 0.0009873404309642963 |
technological | 0.0010700187787658674 |
tally | 0.0012378777466739354 |
ken | 0.0013697871390586014 |
texas | 0.00150745934498105 |
grudge | 0.0015268867132275763 |
ventilate | 0.001625039600913739 |
suburban | 0.0016303147019023538 |
disagreement | 0.001790198032184901 |
tasting | 0.0018079890081789077 |
assassin | 0.0019795814675743887 |
heartstrings | 0.002040768273260367 |
unbind | 0.002040768273260367 |
voting | 0.0020869686227272194 |
chimpanzee | 0.0022646009581913863 |
chauffeur | 0.002633344621267412 |
lager | 0.003009632873766722 |
atrium | 0.003009703443265755 |
undamaged | 0.003009720997053414 |
says | 0.0030761090941198934 |
appropriate | 0.003274693070628771 |
glutton | 0.003370892970005746 |
tearing | 0.00340037105305281 |
destroying | 0.0036858691364575658 |
maze | 0.0037089703989563964 |
stepmother | 0.0037425570755429006 |
brace | 0.003792940127059884 |
recently | 0.003820548170542061 |
motherhood | 0.0038334402274800903 |
tang | 0.003888883809406485 |
hydrate | 0.00407996878517319 |
thirty | 0.004080813347742288 |
graphite | 0.004218203321342335 |
clubs | 0.004373836488956027 |
skull | 0.004463195113036589 |
giggling | 0.004933570627853502 |
rift | 0.004934643287717591 |
expletive | 0.004941834641172613 |
rides | 0.004999247306646525 |
endangered | 0.005220246730317241 |
literally | 0.005368216499445233 |
street | 0.005435271139889008 |
turbulence | 0.005492018403877165 |
accelerate | 0.005597772369993267 |
machiavellian | 0.005767630517208131 |
asleep | 0.005953241783944628 |
fact | 0.006120847631338802 |
cuba | 0.0065567675181238795 |
jabber | 0.006764678686678997 |
pylon | 0.006764678686678997 |
sprawl | 0.006764714473773329 |
figurehead | 0.006764738331836927 |
ranch | 0.007263997135806669 |
lazarus | 0.007526007481549322 |
sensational | 0.007556175355208775 |
brent | 0.007700132011442866 |
meg | 0.007722122046453785 |
bondsman | 0.0078260703368953 |
paladin | 0.007826083105353333 |
merry | 0.008309065520959207 |
dont | 0.008393122277909327 |
aramaic | 0.008930954403080598 |
whiff | 0.009016833792812845 |
formula | 0.009106323271107105 |
delicious | 0.00914831363028016 |
either | 0.009188630751667306 |
qualified | 0.009190445500019108 |
ess | 0.009238320769792727 |
reel | 0.009384335141084265 |
participate | 0.009479090187479023 |
integrity | 0.009568233518715032 |
ira | 0.009647947494585262 |
adolescence | 0.009744345608119081 |
moody | 0.009957762135335769 |
saga | 0.009957787946398965 |