
Web and Social Information Extraction - A.Y. 2015/2016

The course presents algorithms and architectures to retrieve information from the web and to analyze social networks. Topics are information retrieval, web search engines, web mining, social network analytics.

Instructor: Paola Velardi
Telephone: 06-49918356
Office hours: send e-mail
Office: Via Salaria 113 - 3° floor n. 3412

Course schedule

II semester:

When   Where

Monday 14.00-16.00 aula Alfa
Friday 12.00-14.00 aula Alfa or Colossus lab

Course Organization

The course presents architectures and algorithms for extracting information from the Web, covering both Web search engines and on-line social networks. A number of lessons are held in the Colossus lab.

In the lab sessions, students will learn to:

  • Use Lucene, an open-source text-search library, to index and search a document archive
  • Use the Twitter API to track and index Twitter messages, create and analyze word time series, and more
  • Build a web scraper and trace the content of a forum

Self Assessment and Final Project

Self-assessments and the final project are sent to all students by e-mail through the Google Group.

Summary of Course Topics

Architecture of Information Retrieval systems. Tokenisation, stop-word removal and stemming; morphology; selection of index terms, use of thesauri. Inverted indices. Boolean and vector-space retrieval models, ranked retrieval and text-similarity metrics. Performance metrics: recall, precision, and F-measure. Evaluations on benchmark text collections. Latent Semantic Indexing. Relevance feedback. Query expansion.
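
As an illustration of the performance metrics listed above, here is a minimal sketch of set-based precision, recall, and F-measure for a single query. It is not taken from the course material; the function name `precision_recall_f1` and the document identifiers are hypothetical, and the F-measure shown is the balanced variant (F1), which weights precision and recall equally.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical query result: 4 documents retrieved, 3 relevant overall.
    p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

This treats the result as an unordered set; evaluating ranked retrieval would need rank-aware measures on top of it.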

Web Information Retrieval. Browsing and Scraping. Link analysis: Page Rank and HITS.
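
To give an idea of what link analysis computes, the sketch below implements PageRank by power iteration on a tiny hand-made graph. It is an illustrative example, not the course's reference implementation: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page web: A links to B and C, B to C, C back to A.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

In this toy graph, page C collects links from both A and B and ends up with the highest score, while B, linked only from A, ends up with the lowest.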

Social Network Analysis: Opinion Mining, Community Detection, Social Media Analytics, Recommender Systems.



  • a) Written or oral exam on course material (50%)
  • b) Project (teams of 2) (50%). The quality of the developed software is part of the evaluation.


The project is presented after the first 4-5 weeks of the course.

Find here, as an example, the best student project presented in 2014: pdf

Project 2016

This year the project consists of analyzing a Twitter dataset collected around relevant dates of the Presidential Candidates Elections in the U.S.A. The dataset can be accessed by following the instructions in this document: pdf.

A detailed description of what you are expected to do with the dataset will follow shortly.

Projects can be carried out by teams of two. They must be handed in BY DECEMBER 2016 - STRICT DEADLINE - I cannot register your exam if you don't hand in the project first!

We expect a report with project description, figures, tables, etc. along with zipped files with code. No data!


  • The written exam can be taken in two steps, mid-term (April, part I) and end of course (early June, part II), or in July (on the full program). After July, oral exams are scheduled with the instructor.
  • To obtain credit for attendance, students must attend 90% of the lessons and hand in all homework.

Project 2017


Google Group

Please Subscribe to Web and Social 2018 Group
Web and Social 2018 on Google Groups

Slides and course materials (UPDATED = 2017)

Last Update Topic PPT PDF Details
2017 Introduction ppt pdf Introduction, architecture of IR systems, text processing, indexing
2017 Basic Ranking Models pptx pdf Basic ranking models: Boolean, Vector Space
2017 Query Expansion ppt pdf Improving basic IR models
2017 Evaluation pptx pdf Performance measures and benchmarking of IR systems
2017 Latent Semantic Indexing ppt pdf Algebraic Ranking Models
2017 Web Search ppt pdf Web IR: crawling, scraping, searching on the Web; Categorization of web pages
2017 Link Analysis ppt pdf Hyperlink-based Ranking (HITS and Page Rank)
2017 Social Media Analytics ppt pdf Social Network Analysis PART A1: node-centric measures of influence
2017 Social Media Analytics ppt pdf Social Network Analysis PART A2: graph-based measures
2017 Social Media Analytics ppt pdf Social Network Analysis PART B: community detection
2017 Social Media Analytics ppt pdf Spread of Influence in Social Networks (Suggested reading: Chapter 2 of https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412/_media/social_networks.pdf )
2016 Opinion Mining pdf   Searching for opinions on the Web
2017 Recommender Systems pdf pptx Collaborative filtering, Content-based recommenders, Semantic recommenders
2016 Lab: Lucene   pdf Lucene text search engine library in Java (Prof. Giovanni Stilo)
2017 SAX temporal strings   pdf Event mining with SAX (Prof. Giovanni Stilo)
2016 Lab: Maven Core   pdf Maven Core (Prof. Giovanni Stilo)
2016 Lab: Twitter API   pdf Twitter API (Prof. Giovanni Stilo)
2016 Lab: Scraping   pdf Scrapers (Prof. Giovanni Stilo)
2016 Lab: Crawling   pdf Crawling Principles (Prof. Giovanni Stilo)
2017 Lab: Time Series   pdf Tracing Temporal Streams of Words in Twitter (Prof. Giovanni Stilo)
2017 Lab: Graph-G Library     Graph Libraries


Part A: Web Information Retrieval

  • Introduction, Architecture of IR systems
  • Text processing, Indexing
  • Boolean and Vector Space Models
  • Query expansion, understanding users' needs
  • Evaluation methods: experimental and theoretical methods
  • Latent Semantic Indexing
  • Web Information Retrieval
  • Link Analysis

Part B: Social Information Extraction

  • Social Network Analysis: graph-based measures, community detection, temporal analysis
  • Opinion Mining
  • Analyzing users' behaviors (recommenders, population studies, enterprise social networks, applications in e-health)

Topics for Final Dissertation

e-health, event detection in social media and on the web, enterprise social networks, applications to social studies, temporal information retrieval, semantic recommenders

Merit-based scholarships (high grades and very good programming skills required) are available; ask the instructor.

Topic attachments
I Attachment History Action Size Date Who Comment
PowerPointppt 1intro.ppt r2 r1 manage 4262.5 K 2013-03-07 - 13:40 PaolaVelardi  
PowerPointpptx 2VectorSpaceModel.pptx r2 r1 manage 333.6 K 2014-01-20 - 13:57 PaolaVelardi  
PDFpdf CDCommunity2.pdf r1 manage 693.8 K 2017-05-19 - 08:29 PaolaVelardi Carlotta Domeniconi slides persistent roles
Topic revision: r131 - 2017-09-12 - PaolaVelardi
