
Web and Social Information Extraction - A.Y. 2019/2020

The course presents algorithms and architectures to retrieve information from the web and to analyze social networks. Topics include information retrieval, web search engines, web mining, and social network analytics.

Instructors	Telephone	Office hours	Office
Paola Velardi, Bardh Prenkaj (teaching assistant)	06-49918356	send e-mail	Via Salaria 113, 3rd floor, no. 3412

IMPORTANT INFORMATION: From Monday, March 9th until the end of the COVID-19 emergency, lessons will be REGULARLY held in streaming, following the same schedule as the in-class lessons. Students will receive the link to the videoconference a few minutes before the beginning of each lesson. Information will be distributed through the course Google group, and you will be able to join the videoconference ONLY with your institutional account. Video-recorded lessons and labs are available in a drive folder ONLY to members of the Google group.

Course schedule

II semester:

When		Where
Monday	14.00-16.00	Aula Alfa, via Salaria 113 (labs are held in the Colossus lab, via Salaria 113, from the 2nd week of March)
Wednesday	10.00-13.00	Aula Alfa, via Salaria 113

Course Organization

The course presents architectures and algorithms for extracting information from the Web, covering both Web search engines and online social networks. A number of lessons are held in the Colossus lab.

During the lab, students will learn to:

Use Lucene, an open-source text-search library, to index and search a document archive
Use the Twitter API to track and index Twitter messages, create and analyze word time series, and more
Build a web scraper and trace the content of a forum
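
As an illustration of the word time series mentioned above, here is a minimal Python sketch. It does not call the Twitter API: the messages are hypothetical sample data, and the function names (word_time_series, daily_counts) are made up for this example. It only shows the idea of bucketing word occurrences by day.

```python
import re
from collections import Counter, defaultdict
from datetime import date, datetime

# Hypothetical sample data: (ISO timestamp, text) pairs standing in for
# messages that would have been collected from a social platform.
MESSAGES = [
    ("2020-03-09T10:15:00", "Lessons move to streaming today"),
    ("2020-03-09T18:40:00", "Streaming lessons work fine"),
    ("2020-03-10T09:05:00", "Lab on Lucene indexing"),
    ("2020-03-11T11:30:00", "Streaming again, lab recorded"),
]

TOKEN_RE = re.compile(r"[a-z0-9]+")


def word_time_series(messages):
    """Map each word to a Counter of {calendar day: occurrences}."""
    series = defaultdict(Counter)
    for timestamp, text in messages:
        day = datetime.fromisoformat(timestamp).date()
        for word in TOKEN_RE.findall(text.lower()):
            series[word][day] += 1
    return series


def daily_counts(series, word, days):
    """Return the counts of `word` for the given days (0 when absent)."""
    counts = series.get(word, Counter())
    return [counts[day] for day in days]


DAYS = [date(2020, 3, 9), date(2020, 3, 10), date(2020, 3, 11)]
SERIES = word_time_series(MESSAGES)
print(daily_counts(SERIES, "streaming", DAYS))
```

Daily buckets are an arbitrary choice for this sketch; hourly or weekly windows would follow the same pattern with a different key.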

Self Assessment and Final Project

Self-assessments and the final project are sent to all students by e-mail through the course Google group.

Summary of Course Topics

Architecture of Information Retrieval systems. Tokenisation, stop-word removal and stemming; morphology; selection of index terms, use of thesauri. Inverted indices. Boolean and vector-space retrieval models, ranked retrieval and text-similarity metrics. Performance metrics: recall, precision, and F-measure. Evaluations on benchmark text collections. Latent Semantic Indexing. Relevance feedback. Query expansion.
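
Some of the topics above (tokenisation, stop-word removal, inverted indices, Boolean retrieval, and precision/recall/F-measure) can be sketched in a few lines of Python. This is an illustrative toy, not course material: the stop-word list, the four documents, and the set of relevant documents are assumptions made up for the example, and no stemming is applied.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall and balanced F-measure for two id sets."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy collection and relevance judgements.
DOCS = {
    1: "The architecture of information retrieval systems",
    2: "Web search engines and link analysis",
    3: "Information extraction in social networks",
    4: "Retrieval of information on the web",
}
INDEX = build_inverted_index(DOCS)
RETRIEVED = boolean_and(INDEX, "information retrieval")
RELEVANT = {1, 3, 4}
print(RETRIEVED, precision_recall_f1(RETRIEVED, RELEVANT))
```

The Boolean model returns an unranked set; ranked retrieval with the vector-space model would replace the set intersection with a similarity score per document.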

Web Information Retrieval. Browsing and scraping. Link analysis: PageRank and HITS.
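
To give a concrete idea of link analysis, the following Python sketch computes PageRank by power iteration on a small hypothetical graph. The damping factor of 0.85, the tolerance, and the choice to spread the rank of pages without out-links uniformly are common conventions assumed here, not settings taken from the course.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank.

    `links` maps each page to the list of pages it links to. The rank of
    pages without out-links (dangling pages) is spread uniformly.
    """
    nodes = sorted(set(links) | {t for targets in links.values()
                                 for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # Total rank currently held by dangling pages.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        # Teleportation share plus the redistributed dangling share.
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Each page passes its rank in equal parts to its out-links.
        for u in nodes:
            targets = links.get(u, [])
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        delta = sum(abs(new_rank[x] - rank[x]) for x in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical four-page web graph.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
RANKS = pagerank(GRAPH)
for page in sorted(RANKS, key=RANKS.get, reverse=True):
    print(page, round(RANKS[page], 4))
```

Page D has no incoming links, so it keeps only the teleportation share, while C collects rank from three pages and ends up on top.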

Social Network Analysis: opinion mining, community detection, social media analytics, recommender systems.

Textbooks

Ricardo Baeza-Yates and Berthier Ribeiro-Neto: Modern Information Retrieval, Addison Wesley Longman Publishing Co. Inc.
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze: Introduction to Information Retrieval, Cambridge University Press, 2008.
David Easley and Jon Kleinberg: Networks, Crowds, and Markets: Reasoning About a Highly Connected World http://www.cs.cornell.edu/home/kleinber/networks-book/
Chen, Lakshmanan and Castillo: Information and Influence Propagation in Social Networks https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412/_media/social_networks.pdf

Exam

Standard grading policy:

a) Written or oral exam on the course material (50%). For those who attend classes, the written test is split into a mid-term and a final test (usually in April and June).
b) Project (teams of 2) (50%). The quality of the developed software is part of the evaluation.

IMPORTANT: Due to the COVID-19 emergency, the mid-term test is replaced by a new grading policy, which includes an "interactivity grade" to reward active students (those who submit self-assessments and lab exercises, and present their work during streamed classes).

The interactivity grade counts for 1/3 of your final grade and also covers interactivity and contributions during labs.

Final grade will be computed as follows:

1/3 interactivity
1/3 project
1/3 final written test (hopefully in person)

Those with zero or low interactivity will be given the opportunity to recover the missing 1/3 with a more complex written test.

Project

The project is presented after the first 4-5 weeks of the course.

As an example, here is the best project of 2018: wsie_report_moschella_spini.zip

Project 2019

Find here the dataset for the project: datasetwebsocial2019.zip

The description of the project can be downloaded from this link: https://docs.google.com/document/d/1NYVXM_X7qEX95RdFHnw0M3H2b-2BU_72IwmMxWVNca8/edit?usp=sharing

Best project in 2019: MiningEvolvingTopics_best_2019.pdf (the code can be requested from Bardh Prenkaj)

Project 2018

The project is on Recommender Systems.

The description is provided in Project_Description_2018.pdf

The description includes the link from where you can download the main dataset for the project, the Wiki_MID dataset.

The three additional datasets are S21.tsv, S22_preferences.tsv and S23.tsv.

Project 2020

Assigned (Recommender Systems in Tourism).

Google Group

Please subscribe to the Web and Social 2020 group:
Web and Social 2020 on Google Groups

Slides and course materials (note: UPDATED = 2020)

Last Update	Topic	PPT	ZIP/PDF	Details	Suggested readings
2020	Introduction	1.WS1.introAB.pptx (part 2) 1.WS1.introB.pptx (part 3)	1.WS1.introA.pptx.zip (part 1)	Introduction, architecture of IR systems, text processing, indexing	A survey on Indexing Techniques for Big Data: Taxonomy and Performance Evaluation (2015) https://www.researchgate.net/publication/273082158_A_survey_on_Indexing_Techniques_for_Big_Data_Taxonomy_and_Performance_Evaluation The case for learned index structures (April 30th, 2018) https://arxiv.org/pdf/1712.01208.pdf
2020	Basic Ranking Models	pptx	pdf	Basic ranking models: Boolean, Vector Space	https://nlp.stanford.edu/IR-book/pdf/01bool.pdf
2020	Query Expansion	ppt		Improving basic IR models	Query_Expansion_Techniques_for_Information_Retriev_1.pdf QueryExpansionTechniquesSurvey.pdf
2020	Eigenvectors, eigenvalues and SVD decomposition Retrieval with Latent Semantic Indexing & Word Embeddings	5b.Embeddings.pptx	5.LSI.pptx.zip 5.LSI.pdf.zip 5b.Embeddings.pdf	Alternative Ranking Models based on word similarities	https://cs224d.stanford.edu/lecture_notes/notes1.pdf BERT: https://arxiv.org/pdf/1810.04805.pdf (video lessons available on drive for Google Group)
2020	Evaluation	3.EvaluationIR.pptx		Performance measures and benchmarking of IR systems	https://papers.nips.cc/paper/5867-precision-recall-gain-curves-pr-analysis-done-right.pdf see also: https://trec.nist.gov/ for TREC evaluation challenges (video lessons available on drive for Google Group)
2020	Web Search		6.WebSearch-compressed.pdf	Web IR: crawling, scraping, searching on the Web; Categorization of web pages	(video lessons available on drive for Google Group) WEB_SITE_CLASSIFICATION_FEATURES_AND_ALG.pdf (survey on feature selection for web page classification) information-10-00150.pdf (survey on machine learning for document classification )
2020	Link Analysis	7.link_analysis.pptx	7.link_analysis-compressed.pdf	Hyperlink-based Ranking (HITS and Page Rank)	https://pdfs.semanticscholar.org/43b6/d922bcfcc8f8fcd3e7d22c8dc732653d9571.pdf (video lessons available on drive for Google Group)
2020	Social Media Analytics A1		9a.SocialMediaAnalyticsA1.pdf 9a.SocialMediaAnalyticsA1.zip	Social Network Analysis PART A1: node-centric measures of influence	https://arxiv.org/pdf/1907.11229.pdf (a survey on trending topic detection from social networks ) https://www.researchgate.net/publication/277689549_Efficient_temporal_mining_of_micro-blog_texts_and_its_application_to_event_discovery (SAX*) (video lessons available on drive for Google Group)
2020	Social Media Analytics A2	9b.SocialMediaAnalyticsA2.pptx	9b.SocialMediaAnalyticsA2.pdf	Social Network Analysis PART A2: graph-based measures	Surveytopknodesinsocialnetworks.pdf (video lessons available on drive for Google Group)
2020	Social Media Analytics B	10.SocialMediaAnalytics_community_detectionB.pptx.zip	10.SocialMediaAnalytics_community_detectionB.pdf	Social Network Analysis PART B: community detection	https://arxiv.org/ftp/arxiv/papers/1708/1708.00977.pdf https://hal.archives-ouvertes.fr/file/index/docid/804234/filename/Survey-on-Social-Community-Detection-V2.pdf (video lessons available on drive for Google Group)
2020	Social Media Analytics C	11.Maximizing-the-Spread-of-Influence.pptx	11.Maximizing-the-Spread-of-Influence.pdf	Spread of Influence in Social Networks	Chapter 2 of https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412/_media/social_networks.pdf (video lessons available on drive for Google Group)
2020	Opinion Mining		13.OpinionMining.pdf	Searching for opinions on the Web	https://link.springer.com/content/pdf/10.1007/s10462-017-9599-6.pdf
2020	Recommender Systems	12.RecommenderSystems.pptx	12.RecommenderSystems.pdf	Collaborative filtering, Content-based recommenders, Semantic recommenders	http://shuaizhang.tech/2017/07/28/Summary-of-Recommender-System-Surveys-in-recent-years/ (a portal) https://ieeexplore.ieee.org/document/8506344 ("A Survey of Collaborative Filtering-Based Recommender Systems: From Traditional Methods to Hybrid Methods Based on Social Networks" 2018) (video lessons available on drive for Google Group)
	LABs	NOTE: UPDATED LAB MATERIAL, INCLUDING VIDEOS, IS SHARED THROUGH THE GOOGLE GROUP		Lucene, Crawlers, Scrapers, Social Media, Time series	(video labs available on drive for Google Group members)

						Lab: Graph-G Library	Graph Libraries

Syllabus

Part A: Web Information Retrieval

Introduction, Architecture of IR systems
Text processing, Indexing
Boolean and Vector Space Models
Query expansion, understanding users' needs
Ranking and query expansion based on word similarities (Singular Value Decomposition, Word Embeddings)
Evaluation methods: experimental and theoretical methods
Web Information Retrieval
Link Analysis

Part B: Social Information Extraction

Social Network Analysis: graph-based measures, community detection, topic diffusion, temporal analysis
Opinion Mining
Recommender Systems

Topics for Final Dissertation

E-health, network medicine, event detection in social media and on the web, enterprise social networks, applications to social studies, temporal information retrieval, semantic recommenders, prediction and detection of trending topics.

Merit-based scholarships (high grades and very good programming skills required) are available; ask the instructor.
