
Web and Social Information Extraction - A.Y. 2019/2020

The course presents algorithms and architectures to retrieve information from the web and to analyze social networks. Topics are information retrieval, web search engines, web mining, social network analytics.

Instructors: Paola Velardi, Bardh Prenkaj (teaching assistant)
Telephone: 06-49918356
Office hours: send e-mail
Office: Via Salaria 113, 3rd floor, n. 3412

IMPORTANT INFORMATION: From Monday, March 9th, until the end of the Coronavirus emergency, lessons will be held REGULARLY via streaming, following the same schedule as the in-class lessons. Students will receive the videoconference link a few minutes before each lesson begins. Information will be distributed through the course Google group, and you can join the videoconference ONLY with your institutional account. Recorded lessons are available in a Drive folder shared with members of the Google group.

Course schedule

II semester:

When: Monday 14.00-16.00
Where: aula alfa, via Salaria 113 (labs will be held in the Colossus labs, via Salaria 113, from the 2nd week of March)

When: Wednesday 10.00-13.00
Where: aula alfa, via Salaria 113

Course Organization

The course presents architectures and algorithms for extracting information from the Web, covering both Web search engines and online social networks. A number of lessons are held in the Colossus lab.

During the lab, students will learn:

  • To use Lucene, an open-source text-search library, to index and search a document archive
  • To use the Twitter API to track and index Twitter messages, to create and analyze word time series, and more
  • To build a web scraper and trace the content of a forum
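
To give a flavour of what "index and search a document archive" means, here is a minimal sketch of an inverted index with Boolean AND search in plain Python. It does not use Lucene or reproduce its API; the stop-word list, the sample documents, and the function names (`tokenize`, `build_index`, `search_and`) are all illustrative assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Boolean AND: return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)


# Hypothetical three-document archive.
docs = {
    1: "Lucene is an open-source text search library",
    2: "A web scraper traces the content of a forum",
    3: "Search the web with an inverted index",
}
index = build_index(docs)
print(sorted(search_and(index, "web search")))  # [3]
```

A real search library adds ranking, positional postings, and on-disk index structures on top of this basic idea.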

Self Assessment and Final Project

Self-assessments and the final project are sent to all students by email through the Google group.

Summary of Course Topics

Architecture of Information Retrieval systems. Tokenisation, stop-word removal and stemming; morphology; selection of index terms, use of thesauri. Inverted indices. Boolean and vector-space retrieval models, ranked retrieval and text-similarity metrics. Performance metrics: recall, precision, and F-measure. Evaluations on benchmark text collections. Latent Semantic Indexing. Relevance feedback. Query expansion.
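
The performance metrics listed above can be sketched in a few lines. The example below computes set-based precision, recall, and the balanced F-measure (F1) for a single query; the function name and the document ids are hypothetical, and ranked-retrieval variants of these metrics are not covered.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical query: 4 documents returned, 3 documents actually relevant.
p, r, f = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# precision=0.500 recall=0.667 f1=0.571
```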

Web Information Retrieval. Browsing and Scraping. Link analysis: Page Rank and HITS.
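
As an illustration of link analysis, the following is a minimal sketch of PageRank computed by power iteration. The damping factor of 0.85, the uniform handling of dangling pages, and the four-page graph are assumptions made for the example; it is not taken from the course material.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to
    (assumed to contain no duplicate targets).
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical web graph: page C is linked to by A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C ends up with the highest score because it collects links from three pages, while D, which nobody links to, keeps only the teleportation share.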

Social Network analysis: Opinion Mining, Social Network analysis, Community Detection, Social Media Analytics, Recommender systems

Textbooks

Exam

  • a) Written or oral exam on course material (50%)
  • b) Project (teams of 2) (50%). The quality of the developed software is part of the evaluation.

Project

The project is presented after the first 4-5 weeks of the course.

As an example, here is the best project of 2018: wsie_report_moschella_spini.zip

Project 2019

Find here the dataset for the project: datasetwebsocial2019.zip

The description of the project can be downloaded from this link: https://docs.google.com/document/d/1NYVXM_X7qEX95RdFHnw0M3H2b-2BU_72IwmMxWVNca8/edit?usp=sharing.

Project 2018

The project is on Recommender Systems.

The description is provided in Project_Description_2018.pdf

The description includes the link from where you can download the main dataset for the project, the Wiki_MID dataset.

The three additional datasets are S21.tsv, S22_preferences.tsv, and S23.tsv.

Project 2019

to be decided

Google Group

Please subscribe to the Web and Social 2020 group on Google Groups.

Slides and course materials (note: UPDATED = 2020)

Each entry lists: last update, topic, files (PPT, ZIP/PDF), details, and suggested readings.

2020 - Introduction
  Files: 1.WS1.introA.pptx.zip (part 1), 1.WS1.introAB.pptx (part 2), 1.WS1.introB.pptx (part 3)
  Details: Introduction, architecture of IR systems, text processing, indexing
  Suggested readings:
    A survey on Indexing Techniques for Big Data: Taxonomy and Performance Evaluation (2015) https://www.researchgate.net/publication/273082158_A_survey_on_Indexing_Techniques_for_Big_Data_Taxonomy_and_Performance_Evaluation
    The case for learned index structures (April 30th, 2018) https://arxiv.org/pdf/1712.01208.pdf

2020 - Basic Ranking Models
  Files: pptx, pdf
  Details: Basic ranking models: Boolean, Vector Space
  Suggested readings:
    https://nlp.stanford.edu/IR-book/pdf/01bool.pdf

2020 - Query Expansion
  Files: ppt
  Details: Improving basic IR models
  Suggested readings:
    Query_Expansion_Techniques_for_Information_Retriev_1.pdf
    QueryExpansionTechniquesSurvey.pdf

2020 - Eigenvectors, eigenvalues and SVD decomposition; Retrieval with Latent Semantic Indexing & Word Embeddings
  Files: 5b.Embeddings.pptx, 5.LSI.pptx.zip, 5.LSI.pdf.zip, 5b.Embeddings.pdf
  Details: Alternative Ranking Models based on word similarities
  Suggested readings:
    https://cs224d.stanford.edu/lecture_notes/notes1.pdf
    BERT: https://arxiv.org/pdf/1810.04805.pdf
    (video lessons available on drive for Google Group)

2020 - Evaluation
  Files: 3.EvaluationIR.pptx
  Details: Performance measures and benchmarking of IR systems
  Suggested readings:
    https://papers.nips.cc/paper/5867-precision-recall-gain-curves-pr-analysis-done-right.pdf
    see also: https://trec.nist.gov/ for TREC evaluation challenges
    (video lessons available on drive for Google Group)

2020 - Web Search
  Files: 6.WebSearch-compressed.pdf
  Details: Web IR: crawling, scraping, searching on the Web; Categorization of web pages
  Suggested readings:
    WEB_SITE_CLASSIFICATION_FEATURES_AND_ALG.pdf (survey on feature selection for web page classification)
    information-10-00150.pdf (survey on machine learning for document classification)

2019 - Link Analysis
  Files: 7.link_analysis.pptx
  Details: Hyperlink-based Ranking (HITS and Page Rank)
  Suggested readings:
    https://pdfs.semanticscholar.org/43b6/d922bcfcc8f8fcd3e7d22c8dc732653d9571.pdf

2019 - Social Media Analytics A1
  Files: p, 9a.SocialMediaAnalyticsA1-compresso.pdf
  Details: Social Network Analysis PART A1: node-centric measures of influence

2019 - Social Media Analytics A2
  Files: 9b.SocialMediaAnalyticsA2.pptx
  Details: Social Network Analysis PART A2: graph-based measures
  Suggested readings:
    Surveytopknodesinsocialnetworks.pdf

2019 - Social Media Analytics B
  Files: 10.SocialMediaAnalytics_community_detectionB.pptx
  Details: Social Network Analysis PART B: community detection
  Suggested readings:
    https://arxiv.org/ftp/arxiv/papers/1708/1708.00977.pdf
    https://hal.archives-ouvertes.fr/file/index/docid/804234/filename/Survey-on-Social-Community-Detection-V2.pdf

2019 - Social Media Analytics C
  Files: 11.Maximizing-the-Spread-of-Influence.pptx
  Details: Spread of Influence in Social Networks
  Suggested readings:
    Chapter 2 of https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412/_media/social_networks.pdf

2016 - Opinion Mining
  Files: pdf
  Details: Searching for opinions on the Web
  Suggested readings:
    https://link.springer.com/content/pdf/10.1007/s10462-017-9599-6.pdf

2019 - Recommender Systems
  Files: pdf
  Details: Collaborative filtering, Content-based recommenders, Semantic recommenders
  Suggested readings:
    http://shuaizhang.tech/2017/07/28/Summary-of-Recommender-System-Surveys-in-recent-years/ (a portal)
    https://ieeexplore.ieee.org/document/8506344 ("A Survey of Collaborative Filtering-Based Recommender Systems: From Traditional Methods to Hybrid Methods Based on Social Networks" 2018)

LABs
  NOTE: UPDATED LAB MATERIAL, INCLUDING VIDEOS, IS SHARED THROUGH THE GOOGLE GROUP
  Details: Lucene, Crawlers, Scrapers, Social Media, Temporal series

Lab: Graph-G Library
  Details: Graph Libraries

Syllabus

Part A: Web Information Retrieval

  • Introduction, Architecture of IR systems
  • Text processing, Indexing
  • Boolean and Vector Space Models
  • Query expansion, understanding users' needs
  • Ranking and query expansion based on word similarities (Singular Value Decomposition, Word Embeddings)
  • Evaluation methods: experimental and theoretical methods
  • Web Information Retrieval
  • Link Analysis

Part B: Social Information Extraction

  • Social Network Analysis: graph-based measures, community detection, topic diffusion, temporal analysis
  • Opinion Mining
  • Recommender Systems

Topics for Final Dissertation

E-health, network medicine, event detection in social media and on the web, enterprise social networks, applications to social studies, temporal information retrieval, semantic recommenders, prediction and detection of trending topics.

Merit-based scholarships (high grades and very good programming skills required) are available; ask the instructor.

Topic attachments
Attachment               Revision   Size       Date               Who
3.QueryExpansion.pdf     r5         9251.4 K   2020-02-27 15:32   PaolaVelardi
3.QueryExpansion.pptx    r5         5387.4 K   2020-02-27 15:32   PaolaVelardi
5.LSI.pdf.zip            r1         8811.4 K   2020-03-13 15:49   PaolaVelardi
5.LSI.pptx.zip           r1         7130.3 K   2020-03-13 15:49   PaolaVelardi
Topic revision: r177 - 2020-03-26 - PaolaVelardi





 