As I sit here in Southern California awaiting the Academy Awards this weekend, a model I have been working on to better understand the language of film titles is complete. I was able to do a good deal of the work on the airplane out here from New York. The model uses Latent Semantic Indexing (LSI) to identify which film title words are associated with winning festival awards. LSI is a natural language processing technique that is well suited to uncovering hidden patterns in collections of text. Here is a general description of the process that I wrote for an article a few years ago…

LSI maps the contextual relationships between words in terms of common usage patterns across a collection of documents called a repository. For instance, in documents about dogs (the animal), one would expect that the word “dog” would be accompanied by contextually relevant words such as “collar”, “wagging”, “puppy”, or “leash.” These associations are less likely in comparable documents discussing “reptiles.” When a large number of documents are put together as a repository, a statistical measure of these connections can be generated via LSI.

LSI enables an analyst to understand how words relate to one another through the creation of a similarity measure, which reveals whether a given language pattern is used in contexts similar to those of another pattern.
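
To make the description above concrete, here is a minimal sketch of the idea in Python using NumPy. It is not the model behind the film title results: the six-document repository is made up for illustration, the function names are hypothetical, and it uses raw word counts where a real analysis would more likely apply a weighting such as TF-IDF over a much larger repository. The sketch builds a term-by-document matrix, reduces it with a truncated singular value decomposition, and compares words by cosine similarity in the reduced space.

```python
import numpy as np

# Hypothetical toy repository: each string stands in for one document.
docs = [
    "dog collar leash puppy",
    "puppy wagging dog leash",
    "dog puppy collar wagging",
    "reptile scales lizard snake",
    "snake lizard reptile cold",
    "lizard scales snake reptile",
]


def build_term_document_matrix(documents):
    """Return the sorted vocabulary and a term-by-document count matrix."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(documents):
        for word in doc.split():
            matrix[index[word], j] += 1
    return vocab, matrix


def lsi_word_vectors(matrix, k):
    """Truncated SVD: keep only the top-k latent dimensions for each word."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vocab, term_doc = build_term_document_matrix(docs)
vectors = lsi_word_vectors(term_doc, k=2)
word_vec = {word: vectors[i] for i, word in enumerate(vocab)}


def similarity(word_a, word_b):
    """Similarity measure between two words in the reduced LSI space."""
    return cosine_similarity(word_vec[word_a], word_vec[word_b])


print("dog vs leash:  ", round(similarity("dog", "leash"), 3))
print("dog vs reptile:", round(similarity("dog", "reptile"), 3))
```

In this toy repository, “dog” and “leash” score far higher than “dog” and “reptile”, which is the kind of contrast the similarity measure is meant to surface. The choice of `k=2` is an assumption that fits two obvious topics; on real text the number of latent dimensions has to be tuned.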

The math is brutal, but the results are always interesting. The top 50 title words (a natural breaking point) that are associated with winning festival awards are:

  1. war
  2. country
  3. love
  4. mother
  5. happy
  6. son
  7. blue
  8. broken
  9. now
  10. princess
  11. everything
  12. body
  13. day
  14. black
  15. story
  16. up
  17. park
  18. pool
  19. wild
  20. daily
  21. cant
  22. la
  23. run
  24. high
  25. innocent
  26. requiem
  27. august
  28. night
  29. amour
  30. crazy
  31. four
  32. hope
  33. dark
  34. daughter
  35. trouble
  36. bell
  37. color
  38. full
  39. dance
  40. it
  41. dust
  42. want
  43. super
  44. eye
  45. sex
  46. hotel
  47. go
  48. legacy
  49. she
  50. river
