Jon Kleinberg, a professor of computer science at Cornell University, Ithaca, N.Y., has developed a method for a computer to find the topics that dominate a discussion at a particular time by scanning large collections of documents for sudden, rapid bursts of words. Among other tests of the method, he scanned presidential State of the Union addresses from 1790 to the present and created a list of words that eerily reflects historical trends. The technique, he suggests, could have many 'data mining' applications, including searching the Web or studying trends in society as reflected in Web pages.
Kleinberg says he got the idea of searching over time while trying to deal with his own flood of incoming e-mail. He reasoned that when an important topic comes up for discussion, keywords related to the topic will show a sudden increase in frequency. A search for these words that suddenly appear more often might, he theorized, provide ways to categorize messages.
He devised a search algorithm that looks for 'burstiness,' measuring not just the number of times words appear, but the rate of increase in those numbers over time. Programs based on his algorithm can scan text that varies with time and flag the most 'bursty' words. "The method is motivated by probability models used to analyze the behavior of communication networks, where burstiness occurs in the traffic due to congestion and hot spots," he explains.
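To make the idea of scoring a word by its rate of increase concrete, here is a minimal sketch. It is a simplified stand-in and not Kleinberg's actual probability model: it compares each word's relative frequency in two consecutive time periods. The function name `burst_scores`, the `min_count` cutoff, and the add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter


def burst_scores(previous, current, min_count=2):
    """Rank words in `current` by how sharply their rate rose since `previous`.

    Both arguments are lists of tokens from two consecutive time periods.
    This is an illustrative ratio test, not Kleinberg's published model.
    """
    prev_counts, cur_counts = Counter(previous), Counter(current)
    prev_total, cur_total = len(previous), len(current)
    scores = {}
    for word, count in cur_counts.items():
        if count < min_count:
            continue  # too rare in the current period to call a burst
        # Add-one smoothing keeps the ratio finite for words unseen before.
        cur_rate = (count + 1) / (cur_total + 1)
        prev_rate = (prev_counts[word] + 1) / (prev_total + 1)
        scores[word] = cur_rate / prev_rate
    # Highest ratio first: the words whose usage grew the most.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two short "periods" of text.
earlier = "the union is strong the states are united".split()
later = "the war the war the union at war".split()
ranked = burst_scores(earlier, later)
print(ranked[0][0])  # the word with the largest rise in rate
```

In this toy data, 'the' is just as frequent as 'war' in the later period, but only 'war' is new, so it gets the higher score. Raw counts alone would not separate the two.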
A few years ago, Kleinberg suggested that a way to find the most useful Web sites on a particular subject would be to look at the way they are linked to one another. Sites that are 'linked to' by many others are probably 'authorities.' Sites that link to many others are likely to be 'hubs.' The most authoritative sites on a topic would be the ones that are linked to most often by the most active hubs, he reasoned. A variation on this idea is used by Google, and a more formal version is being used in a new search engine called Teoma.
>from *Buzzwords of history, revealed by computer scans, indicate new ways of searching the Web*. february 18, 2003
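The hub and authority idea described above can be sketched as a short iteration in the spirit of the HITS algorithm: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The page names, the fixed iteration count, and the function name `hits` are assumptions for illustration. This is not the code used by any search engine.

```python
def hits(links, iterations=50):
    """Compute hub and authority scores for a small link graph.

    `links` maps each page to the list of pages it links to.
    Returns two dicts: (hub scores, authority scores).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub: sum of authority scores of the pages this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: two link-rich pages and one page with a single link.
links = {
    "hub1": ["a", "b"],
    "hub2": ["a", "b", "c"],
    "lone": ["c"],
}
hub, auth = hits(links)
print(max(hub, key=hub.get))
```

In this graph, pages 'a' and 'c' each have two incoming links, but 'a' ends up with the higher authority score because both of its links come from strong hubs.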
related context
> daypop word bursts. word bursts are heightened usage of certain words in weblogs within the last couple days. they are indicators of what webloggers are writing about right now. feature implemented since february 26, 2003.
> uniting with only a few random links: small-world networking in simulations. february 4, 2003
> how does 'six degrees of separation' work? explanation is personal networking. august 23, 2000. kleinberg's work is a refinement of an earlier study by steven h. strogatz and duncan watts.
> authoritative sources in a hyperlinked environment by jon m. kleinberg [pdf]. introduction of the hits (hyperlink-induced topic search) algorithm. 1998
imago
> amerika administration word burst graph