Automatic News Summarization Using a Dependency Structure

This project aims to create a program that produces a coherent, indicative overview of a given news article. It builds a dependency structure, a graph of the relations between words, across the entire article, uses that structure to identify the important concepts, and recombines those concepts into a high-density summary.


Existing Methods

Headlines

A very short statement of the topic, meant to draw readers to the article.

Human-Written Summaries

A summary written by a person to fit the space it will occupy.

Lead-Excerpt Summary

The first few sentences of the article, trimmed by hand and used as a summary.

General Automatic Summaries

A generic summarization engine ranks the sentences in an article and compiles the top-ranked sentences into a summary.
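
As a rough illustration of this kind of sentence ranking, the sketch below scores each sentence by the average frequency of its words in the article and keeps the top k. The scoring rule, the tiny stop-word list, and the names `split_sentences` and `rank_summary` are assumptions made for this example, not a description of any particular engine.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for", "it"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def rank_summary(text, k=2):
    """Return the k highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

A summary built this way can only reuse whole sentences from the article, which is the limitation the dependency-based approach tries to get around.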

The Process

Input

The user either enters the article through a GUI on Windows XP or saves it as a text file and feeds it to a command-line utility.

Part of Speech Tagging

The words in the article are then tagged with their parts of speech (noun, verb, etc.). Tagging is an important prerequisite for automatically creating the dependencies. It is done with FreeLing's language analysis tools, which use a Hidden Markov Model.
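
To show how a Hidden Markov Model can pick tags, here is a minimal Viterbi sketch over a toy three-tag model. It does not call FreeLing; the tag set, every probability in the tables, and the function name `viterbi` are invented for this example.

```python
import math

# Toy model: all values below are made up for illustration only.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"storm": 0.5, "schools": 0.4, "closed": 0.01},
    "VERB": {"closed": 0.6, "schools": 0.05},
}


def viterbi(words, tags, start, trans, emit, floor=1e-6):
    """Return the most probable tag sequence for words under the HMM."""
    if not words:
        return []

    def log_emit(tag, word):
        # Unseen (tag, word) pairs get a small floor probability.
        return math.log(emit[tag].get(word, floor))

    # best[i][t]: log-probability of the best path ending in tag t at word i.
    best = [{t: math.log(start[t]) + log_emit(t, words[0]) for t in tags}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + math.log(trans[p][t]))
            scores[t] = best[-1][prev] + math.log(trans[prev][t]) + log_emit(t, word)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag to the start.
    tag = max(tags, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]
```

With these toy tables, `viterbi("the storm closed schools".split(), TAGS, START, TRANS, EMIT)` tags "closed" as a verb because the NOUN to VERB transition outweighs its small noun emission probability.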

Dependency Generation

To generate the dependency structure, the program uses the Link Grammar Parser from CMU, which describes many different types of relationships between words. Its link structure is close to a dependency structure when only a certain subset of link types is used.
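
One way to picture the step from links to dependencies is a filter that keeps only some link types and gives each kept link a direction. The sketch below assumes links arrive as (left word, right word, label) triples. The labels S, O, D and A, the head side chosen for each, and the name `links_to_dependencies` are assumptions for illustration, not the subset this project actually uses.

```python
# Hypothetical mapping: link label -> which end of the link is the head.
HEAD_SIDE = {
    "S": "right",  # subject - verb: the verb is the head
    "O": "left",   # verb - object: the verb is the head
    "D": "right",  # determiner - noun: the noun is the head
    "A": "right",  # adjective - noun: the noun is the head
}


def links_to_dependencies(links):
    """Convert (left, right, label) links into (head, dependent) pairs.

    Lowercase subscripts on a label (e.g. "Ss") are ignored, and links
    whose base label is not in HEAD_SIDE are dropped.
    """
    deps = []
    for left, right, label in links:
        base = "".join(ch for ch in label if ch.isupper())
        side = HEAD_SIDE.get(base)
        if side == "left":
            deps.append((left, right))
        elif side == "right":
            deps.append((right, left))
    return deps
```

For example, the links ("storm", "closed", "Ss") and ("closed", "schools", "Op") both become dependencies headed by "closed", while a link with an unlisted label is discarded.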

Ranking

To determine which information is most important, each word is ranked by how many words depend on it and how many words depend on those words in turn.
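
The ranking rule can be sketched as a two-level count over (head, dependent) pairs. Weighting both levels equally, treating repeated words as a single node, and the name `rank_words` are assumptions made for this example.

```python
from collections import defaultdict


def rank_words(dependencies):
    """Score each word from (head, dependent) pairs.

    score = number of direct dependents
            + number of dependents of those dependents.
    """
    dependents = defaultdict(set)
    words = set()
    for head, dep in dependencies:
        dependents[head].add(dep)
        words.update((head, dep))

    scores = {}
    for word in words:
        direct = dependents.get(word, set())
        second_level = sum(len(dependents.get(d, set())) for d in direct)
        scores[word] = len(direct) + second_level
    return scores


# "The storm closed coastal schools" as head -> dependent pairs.
example = [("closed", "storm"), ("closed", "schools"),
           ("storm", "the"), ("schools", "coastal")]
```

Here `rank_words(example)` gives "closed" a score of 4 (two direct dependents, each with one dependent of its own), "storm" and "schools" a score of 1, and the leaf words a score of 0.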

Generation

Finally, the program follows the dependency relations to generate new sentences containing the most important information from the original document.
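
A minimal sketch of this last step, under simplifying assumptions: start at the highest-scoring word, walk down the dependency relations, keep only dependents whose score reaches a threshold, and print the kept words in their original order. The threshold `min_score`, the index-based input format, and the name `regenerate` are hypothetical. A real generator would also need to handle grammaticality and merging across sentences.

```python
def regenerate(tokens, edges, scores, min_score=1):
    """Build a shortened sentence from a scored dependency tree.

    tokens: words in original order.
    edges:  (head_index, dependent_index) pairs.
    scores: one importance score per token.
    """
    if not tokens:
        return ""

    children = {}
    for head, dep in edges:
        children.setdefault(head, []).append(dep)

    # Start from the most important word and follow the dependencies.
    root = max(range(len(tokens)), key=lambda i: scores[i])
    keep, stack = {root}, [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in keep and scores[child] >= min_score:
                keep.add(child)
                stack.append(child)

    # Emit the kept words in their original order.
    return " ".join(tokens[i] for i in sorted(keep))
```

For the tokens "The storm closed coastal schools" with scores 0, 1, 4, 0, 1 and "closed" as the head of "storm" and "schools", the default threshold keeps "storm closed schools" and drops the two leaf words.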