On English Words Frequencies From Shakespeare To Modern Soap Operas

By Skander, 2 July, 2012

Word frequencies are used in many text mining and information retrieval applications. Nowadays, and thanks to the Internet, it is easy to find lists of word frequencies for various languages.

Wiktionary, the sister website of Wikipedia, provides such lists for many languages. I decided to extract two of them: the Gutenberg list and the TV movies list.

The Gutenberg list was built from the Project Gutenberg book collection, a collection of classic books that have fallen into the public domain. The list was last updated in 2005. All books in the collection considered were published before 1923. The TV movies list was built from scripts and transcripts of TV shows and movies.

I extracted both lists by scraping the HTML of the Wiktionary pages. After some filtering, I got two dictionaries: the Gutenberg dictionary and the movies dictionary. The first contains 36500 words, whereas the second has 37500 words. Both dictionaries will be used in some future text mining projects.
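The scraping step could look roughly like the sketch below. It is only an illustration under stated assumptions: the sample markup, the rank/word/count column order, the FrequencyTableParser class, and the filtering rule (lowercase the word, keep purely alphabetic entries, merge duplicates) are all hypothetical and are not taken from the actual Wiktionary pages or from the filtering used for this post.

```python
from html.parser import HTMLParser


class FrequencyTableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows only hold <th> cells and stay empty
                self.rows.append(self._row)
            self._row = None


def parse_frequencies(html):
    """Return {word: count}, assuming rows of (rank, word, count)."""
    parser = FrequencyTableParser()
    parser.feed(html)
    parser.close()
    freqs = {}
    for row in parser.rows:
        if len(row) < 3:
            continue
        word = row[1].lower()
        try:
            count = float(row[2])
        except ValueError:
            continue
        # Hypothetical filter: keep alphabetic words, merge case variants.
        if word.isalpha():
            freqs[word] = freqs.get(word, 0.0) + count
    return freqs


# Made-up markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
<tr><th>Rank</th><th>Word</th><th>Count</th></tr>
<tr><td>1</td><td><a href="/wiki/the">the</a></td><td>100</td></tr>
<tr><td>2</td><td>The</td><td>20</td></tr>
<tr><td>3</td><td>o'er</td><td>5</td></tr>
</table>
"""

frequencies = parse_frequencies(SAMPLE_HTML)
print(frequencies)
```

In this toy input the two spellings of "the" are merged into one entry and "o'er" is dropped by the alphabetic filter. A real run would feed the parser the downloaded page source instead of SAMPLE_HTML.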

The first question I was interested in was which English words are the most stable, that is, words whose frequencies are very close in the Gutenberg and movies lists. To find out, I defined the relative frequency change (rfc) metric as follows:

rfc = |f_gut - f_mov| / f_gut

As can be seen from the equation above, the Gutenberg dictionary was used as a reference as one usually moves from past to present. The table below shows the words with an rfc less than 1%.
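As a minimal sketch of how the metric and the 1% cut-off can be computed, assume both dictionaries map a word to its frequency. The function names and the tiny sample dictionaries are made up for illustration; only the formula and the threshold come from the text. Restricting the comparison to words present in both dictionaries is also an assumption.

```python
def relative_frequency_change(f_gut, f_mov):
    """rfc = |f_gut - f_mov| / f_gut, with Gutenberg as the reference."""
    return abs(f_gut - f_mov) / f_gut


def stable_words(gutenberg, movies, threshold=0.01):
    """Return (word, rfc) pairs with rfc below threshold, most stable first.

    Only words present in both dictionaries are compared (an assumption).
    The threshold is a fraction, so 0.01 stands for 1%.
    """
    results = []
    for word, f_gut in gutenberg.items():
        if word in movies and f_gut > 0:
            rfc = relative_frequency_change(f_gut, movies[word])
            if rfc < threshold:
                results.append((word, rfc))
    return sorted(results, key=lambda pair: pair[1])


# Made-up frequencies, not values from the actual lists.
gutenberg = {"should": 1000.0, "texas": 200.0, "thou": 500.0, "whale": 50.0}
movies = {"should": 1000.0, "texas": 201.0, "thou": 5.0, "okay": 900.0}

stable = stable_words(gutenberg, movies)
for word, rfc in stable:
    print(word, rfc)
```

With these toy numbers, "should" and "texas" pass the cut-off, "thou" is rejected because its frequency collapses in the movies dictionary, and "whale" and "okay" are skipped because each appears in only one dictionary.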

I had expected this list to be mostly populated by stop words. This is obviously not the case. Anecdotally, the most stable US state (frequency-wise) is Texas and the most stable country is Cuba. Chimpanzees did not evolve much over time. Also, the only family-related words that made it onto this list are motherhood and stepmother. Finally, the only biblical character who managed to get on the list, or should I say who managed to be raised from the dead, is Lazarus.

word            rfc (%)
should          0.00004743506895271617
loyalty         0.00005496073882532188
secretary       0.00024423521858014236
feeding         0.0003610843091807392
healing         0.000394336394769263
compromise      0.0005731197973101149
dot             0.0009873404309642963
technological   0.0010700187787658674
tally           0.0012378777466739354
ken             0.0013697871390586014
texas           0.00150745934498105
grudge          0.0015268867132275763
ventilate       0.001625039600913739
suburban        0.0016303147019023538
disagreement    0.001790198032184901
tasting         0.0018079890081789077
assassin        0.0019795814675743887
heartstrings    0.002040768273260367
unbind          0.002040768273260367
voting          0.0020869686227272194
chimpanzee      0.0022646009581913863
chauffeur       0.002633344621267412
lager           0.003009632873766722
atrium          0.003009703443265755
undamaged       0.003009720997053414
says            0.0030761090941198934
appropriate     0.003274693070628771
glutton         0.003370892970005746
tearing         0.00340037105305281
destroying      0.0036858691364575658
maze            0.0037089703989563964
stepmother      0.0037425570755429006
brace           0.003792940127059884
recently        0.003820548170542061
motherhood      0.0038334402274800903
tang            0.003888883809406485
hydrate         0.00407996878517319
thirty          0.004080813347742288
graphite        0.004218203321342335
clubs           0.004373836488956027
skull           0.004463195113036589
giggling        0.004933570627853502
rift            0.004934643287717591
expletive       0.004941834641172613
rides           0.004999247306646525
endangered      0.005220246730317241
literally       0.005368216499445233
street          0.005435271139889008
turbulence      0.005492018403877165
accelerate      0.005597772369993267
machiavellian   0.005767630517208131
asleep          0.005953241783944628
fact            0.006120847631338802
cuba            0.0065567675181238795
jabber          0.006764678686678997
pylon           0.006764678686678997
sprawl          0.006764714473773329
figurehead      0.006764738331836927
ranch           0.007263997135806669
lazarus         0.007526007481549322
sensational     0.007556175355208775
brent           0.007700132011442866
meg             0.007722122046453785
bondsman        0.0078260703368953
paladin         0.007826083105353333
merry           0.008309065520959207
dont            0.008393122277909327
aramaic         0.008930954403080598
whiff           0.009016833792812845
formula         0.009106323271107105
delicious       0.00914831363028016
either          0.009188630751667306
qualified       0.009190445500019108
ess             0.009238320769792727
reel            0.009384335141084265
participate     0.009479090187479023
integrity       0.009568233518715032
ira             0.009647947494585262
adolescence     0.009744345608119081
moody           0.009957762135335769
saga            0.009957787946398965

Skander Kort