Web Information Retrieval - A.Y. 2011/2012
Course schedule
II semester:
| Day | Time | Where |
| Monday | 15.45-17.15 | aula beta |
| Thursday | 14.10-15.40 | aula seminari |
Important Notices
This course is taught in English.
On Monday, April 16th we will hold a simulation of the written test. Topics will be selected from the following: Vector Space Model, Boolean Model, BIM, LSI, query expansion, performance evaluation. Questions may relate to the theory or consist of practical exercises.
On Monday the 23rd: lesson on Opinion Mining.
On Monday the 7th: written test (it will count for approximately half of the oral exam).
Summary of Course Topics
Architecture of an Information Retrieval system. Tokenisation, stop-word removal and stemming; morphology; selection of index terms, use of thesauri. Inverted indices. Boolean and vector-space retrieval models, ranked retrieval and text-similarity metrics. Performance metrics: recall, precision, and F-measure. Evaluations on benchmark text collections. Probabilistic ranking: binary independence model (BIM) and belief networks. Latent Semantic Indexing. Relevance feedback. Query expansion. Web Information Retrieval. Link analysis: Page Rank and HITS. Intelligent Information Retrieval: Information Extraction, Question Answering, Opinion Mining. Multimedia Information Retrieval: Speech, Images, Music, Video.
Textbooks
Exam
- Written exam on course material (40%)
- Project (teams of 2-3, joint project with Machine Learning or Web Information Retrieval possible) using Lucene (40%)
- Read and present (in English) one paper on a "hot" topic. This year the topic is "Predicting the Future: the impact of 'web sentiment' on predictions in Politics, Finance, and Health".
- The written exam can be taken in two steps: mid-term (April, part I) and end of course (early June, part II), or in July (on the full program).
- For more complex projects, paper presentation might not be necessary.
Mid Term: Create an Index for a Corpus of Finance and Economy
Mid-term project on Lucene: download the following files:
first zipped archive
second zipped archive
terminology
1. With a Lucene Analyzer, tokenize, stem, and remove stopwords from the collection.
2. With IndexWriter, create an index for the terminology (note that the entries are NOT single terms but multi-word expressions, and must be indexed as such).
3. With QueryParser, verify your retrieval system.
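The three steps above can be sketched without Lucene to make the data flow concrete. The following is a plain-Python illustration, not Lucene code: the stopword list, the crude suffix stripper, the sample documents, and all function names are assumptions made for this sketch. In the actual project, a Lucene Analyzer, IndexWriter, and QueryParser play these roles.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not Lucene's list).
STOPWORDS = {"the", "of", "a", "an", "and", "in", "on", "to", "is", "are", "for"}


def stem(token):
    """Very crude suffix stripper; a stand-in for a real stemmer such as Porter's."""
    if token.endswith("ing") and len(token) >= 6:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) >= 4:
        return token[:-1]
    return token


def analyze(text):
    """Step 1: lowercase, tokenize, remove stopwords, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_index(docs, terminology):
    """Step 2: index each multi-word terminology entry as ONE unit.

    An entry is posted for a document only if its analyzed tokens occur
    contiguously in the analyzed document text.
    """
    phrases = {term: analyze(term) for term in terminology}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = analyze(text)
        for term, phrase in phrases.items():
            n = len(phrase)
            if n and any(tokens[i:i + n] == phrase
                         for i in range(len(tokens) - n + 1)):
                index[term].add(doc_id)
    return index


def search(index, terminology, query):
    """Step 3: look up the terminology entry whose analyzed form equals the query's."""
    wanted = analyze(query)
    hits = set()
    for term in terminology:
        if analyze(term) == wanted:
            hits |= index.get(term, set())
    return sorted(hits)


# Made-up sample data, for illustration only.
docs = {
    1: "The interest rates of the central bank are rising.",
    2: "Exchange rates and interest in bonds.",
    3: "Rising exchange rate volatility.",
}
terminology = ["interest rate", "exchange rate"]
index = build_index(docs, terminology)

print(search(index, terminology, "interest rates"))
print(search(index, terminology, "exchange rate"))
```

Note that document 2 contains both "interest" and "rates" but is not returned for "interest rates", because the expression is matched as a whole phrase rather than as two independent terms.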
When you are done, send me an email to receive a set of 3 queries. Return your answers and all your software.
Note that each term in the terminology.txt file has four real values indicating its relevance. You don't need to use these values, but if you do, consider only the first one.
The project is worth "+2" on your final grade, PROVIDED you send it by the end of May.
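If you decide to use the relevance values in terminology.txt, a small parser along these lines may help. The exact file layout is not specified on this page, so the sketch ASSUMES that each line holds a multi-word term followed by four whitespace-separated real numbers; the function name and the sample line are made up for illustration and should be adapted to the real file.

```python
def parse_terminology_line(line):
    """Return (term, first_relevance_value) for one terminology line.

    ASSUMPTION: the last four whitespace-separated fields are real values
    and everything before them is the (possibly multi-word) term.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected a term followed by four values: %r" % line)
    term = " ".join(fields[:-4])
    scores = [float(x) for x in fields[-4:]]
    # Only the first of the four values is used, as suggested above.
    return term, scores[0]


# Hypothetical sample line; the real file may use a different layout.
print(parse_terminology_line("interest rate 0.8 0.5 0.3 0.1"))
```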
Slides and course materials
| Timetable | Topic | PPT | Details |
| 2012 | Introduction | ppt | Introduction, architecture of IR systems, text processing, indexing |
| 2012 | Basic Ranking Models | pptx | Boolean Model, Vector Space Model Version Space |
| 2012 | Lucene | pdf | Lucene text search engine library in Java |
| 2012 | Evaluation | pptx | Performance measures and benchmarking of IR systems |
| 2012 | Query Expansion | ppt | Improving basic IR models |
| 2012 | Latent Semantic Indexing | pptx | Algebraic Ranking Models; example of LSI calculation |
| 2012 | Statistical Ranking Models | ppt | Statistical Ranking Models |
| 2012 | Web Search | ppt | Web IR |
| 2012 | Link Analysis | ppt | Web Page Ranking |
| 2012 | Opinion Mining | zip | Mining Opinions on the Web |
| 2012 | Open Information Extraction | to be prepared | Extracting Information on a Web Scale |
| 2012 | Multimedia IR | to be prepared | Image, Speech, Video IR |
Syllabus
Part A:
- Introduction, Architecture of IR systems
- Text processing, Indexing
- Boolean and Vector Space Models
- Evaluation methods: experimental and theoretical methods
- Latent Semantic Indexing
- Probabilistic Methods
- Web Information Retrieval
- Link Analysis
Part B: Intelligent Information Retrieval
- Open Domain Information Extraction
- Question Answering
- Opinion Mining
- Multimedia Information Retrieval
- Enterprise Knowledge Management (Prof. M. Missikoff)
Suggested papers on 2012's HOT topic (bibliography)
- "Predicting Financial Markets: Comparing Survey, News, Twitter and Search Engine Data" Huina Mao, Scott Counts, and Johan Bollen
- "Sentiment Analysis of Financial News Articles" Robert P. Schumaker, Yulei Zhang and Chun-Neng Huang
- "Twitter mood predicts the stock market" Johan Bollen, Huina Mao, Xiaojun Zeng
- "Opinion Mining and Sentiment Analysis" Bo Pang and Lillian Lee
- "Sentic Computing for social media marketing" Erik Cambria, Marco Grassi, Amir Hussain, Catherine Havasi
- "Forecasting Stock Market Volatility with Search Engine Query Statistics" R. Beker
- "Nowcasting with Google Trends in an Emerging Market" Y. Carriere-Swallow, F. Labbe
- "Googling the present" G. Chamberlin