r66 - 04 May 2012 - 10:40:43 - PaolaVelardi

Web Information Retrieval - A.Y. 2011/2012

| Instructor | Telephone | Office hours | Office |
| Paola Velardi | 06-49918356 | send e-mail | Via Salaria 113, 3rd floor, n. 3412 |

Course schedule

II semester:

| When | Where |
| Monday 15.45-17.15 | aula beta |
| Thursday 14.10-15.40 | aula seminari |

Important Notices

This course is taught in English.

On Monday, April 16th, we will hold a mock written test. Topics will be selected from the following: Vector Space Model, Boolean Model, BIM, LSI, query expansion, performance evaluation. Questions may be theoretical or consist of practical exercises.

On Monday the 23rd: lesson on Opinion Mining. On Monday the 7th: written test (it will count for approximately half of the oral exam).

Summary of Course Topics

Architecture of an Information Retrieval system. Tokenisation, stop-word removal and stemming; morphology; selection of index terms, use of thesauri. Inverted indices. Boolean and vector-space retrieval models, ranked retrieval and text-similarity metrics. Performance metrics: recall, precision, and F-measure. Evaluations on benchmark text collections. Probabilistic ranking: binary independence model (BIM) and belief networks. Latent Semantic Indexing. Relevance feedback. Query expansion. Web Information Retrieval. Link analysis: PageRank and HITS. Intelligent Information Retrieval: Information Extraction, Question Answering, Opinion Mining. Multimedia Information Retrieval: Speech, Images, Music, Video.
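As a small illustration of the performance metrics listed above, the following Python sketch computes precision, recall, and the balanced F-measure (F1) for a single query. The function name and the document ids are hypothetical and chosen only for this example; they are not taken from the course material.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one query.

    retrieved: ids of the documents returned by the system
    relevant:  ids of the documents judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall (0 when there are no hits).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical query: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r, f)
```

Here precision is 2/4, recall is 2/3, and F1 is 4/7, which shows how the F-measure sits between the two values but closer to the lower one.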

Textbooks

Exam

  • Written exam on course material (40%)
  • Project (teams of 2-3, joint project with Machine Learning or Web Information Retrieval possible) using Lucene (40%)
  • Read and present (in English) one paper on a "hot" topic. This year the topic is: "Predicting The Future: the impact of "web sentiment" on predictions in Politics, Finance, and Health"
  • The written exam can be taken in two parts, mid-term (April, part I) and end of course (early June, part II), or in July (on the full program)
  • For more complex projects, paper presentation might not be necessary.

Mid Term: Create an Index for a Corpus of Finance and Economy

Mid-term project on Lucene: download the following files: first zipped archive, second zipped archive, terminology.

1. With a Lucene Analyzer, tokenize, stem and remove stopwords from the collection.

2. With IndexWriter, create an index for the terminology (notice that the entries are NOT single terms but multi-word expressions, and must be indexed as such).

3. With QueryParser, verify your retrieval system.
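The three steps above can be sketched in plain Python to make the logic concrete. This is not Lucene code and does not use Analyzer, IndexWriter, or QueryParser: the stop-word list, the naive stemmer, the sample documents, and all function names are assumptions made up for illustration. The point it shows is that multi-word terms are indexed as whole units and that the query goes through the same analysis as the documents.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not Lucene's default list).
STOPWORDS = {"the", "of", "a", "an", "and", "in", "on", "to", "is"}


def stem(token):
    """Naive suffix stripping; a stand-in for a real stemmer such as Porter's."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    """Step 1: lowercase, tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_index(docs, terminology):
    """Step 2: map each multi-word term, kept as one unit, to the documents containing it."""
    keys = {tuple(analyze(term)) for term in terminology}
    keys.discard(())  # ignore terms that vanish after analysis
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = analyze(text)
        for key in keys:
            n = len(key)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == key:
                    index[key].add(doc_id)
                    break
    return index


def search(index, query):
    """Step 3: analyze the query like the documents, then look up the whole phrase."""
    return sorted(index.get(tuple(analyze(query)), set()))


# Hypothetical mini-corpus and terminology.
docs = {
    "d1": "Interest rates rose as the central bank tightened policy.",
    "d2": "The stock market fell; the interest rate was unchanged.",
    "d3": "Banks in the centre of town closed.",
}
terminology = ["interest rate", "central bank", "stock market"]
index = build_index(docs, terminology)

print(search(index, "Interest Rates"))  # ['d1', 'd2']
print(search(index, "bank"))            # [] because single words are not indexed
```

In a Lucene solution the same idea applies: the terminology entries have to reach the index as phrases rather than as independent tokens, and the analyzer used at query time should match the one used at indexing time.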

When you are done, send me an email to receive a set of 3 queries. Return your answers and all your software.

Notice that each term in the terminology.txt file has four real values indicating its relevance. You don't need to use these values, but if you do, consider only the first one.

The project is worth "+2" on your final grade, PROVIDED you send it by the end of May.

Slides and course materials

| Timetable | Topic | PPT | Details |
| 2012 | Introduction | ppt | Introduction, architecture of IR systems, text processing, indexing |
| 2012 | Basic Ranking Models | pptx | Boolean Model, Vector Space Model Version Space |
| 2012 | Lucene | pdf | Lucene text search engine library in Java |
| 2012 | Evaluation | pptx | Performance measures and benchmarking of IR systems |
| 2012 | Query Expansion | ppt | Improving basic IR models |
| 2012 | Latent Semantic Indexing | pptx | Algebraic Ranking Models; example of LSI calculation |
| 2012 | Statistical Ranking Models | ppt | Statistical Ranking Models |
| 2012 | Web Search | ppt | Web IR |
| 2012 | Link Analysis | ppt | Web Page Ranking |
| 2012 | Opinion Mining | zip | Mining Opinions on the Web |
| 2012 | Open Information Extraction | to be prepared | Extracting Information on a Web Scale |
| 2012 | Multimedia IR | to be prepared | Image, Speech, Video IR |

Syllabus

Part A:

  • Introduction, Architecture of IR systems
  • Text processing, Indexing
  • Boolean and Vector Space Models
  • Evaluation methods: experimental and theoretical methods
  • Latent Semantic Indexing
  • Probabilistic Methods
  • Web Information Retrieval
  • Link Analysis

Part B: Intelligent Information Retrieval

  • Open Domain Information Extraction
  • Question Answering
  • Opinion Mining
  • Multimedia Information Retrieval
  • Enterprise Knowledge Management (Prof. M. Missikoff)

Suggested papers on 2012's HOT topic (bibliography)

  • "Predicting Financial Markets: Comparing Survey, News, Twitter and Search Engine Data" Huina Mao, Scott Counts, and Johan Bollen
  • "Sentiment Analysis of Financial News Articles" Robert P. Schumaker, Yulei Zhang and Chun-Neng Huang
  • "Twitter mood predicts the stock market" Johan Bollen, Huina Mao, Xiaojun Zeng
  • "Opinion Mining and Sentiment Analysis" Bo Pang and Lillian Lee
  • "Sentic Computing for social media marketing" Erik Cambria, Marco Grassi, Amir Hussain, Catherine Havasi
  • "Forecasting Stock Market Volatility with Search Engine Query Statistics" R. Beker
  • "Nowcasting with Google Trends in an Emerging Market" Y. Carriere-Swallow, F. Labbe
  • "Googling the present" G. Chamberlin
