CMSC 773 Computational Linguistics II

Posted in: Graduate, Spring 2015

Official Course Webpage:

Instructor: Prof. Philip Resnik (resnik _AT_

TA: Raul Guerra (rguerra_AT_

Syllabus: See the schedule of topics.

Prof. Resnik is an awesome teacher! He is not only an expert in natural language processing and linguistics, but he also teaches them in an easy-to-understand way. He is very nice to students and his class is very well organized. Take his class and you’ll enjoy it. Our final project was Learning Depression Patterns from Facebook and Reddit data, which was a lot of fun. His homework has only three grades: high-pass, middle-pass and low-pass; of course you will get a high-pass given enough effort! Great teacher! Reportedly, 72% of his students get an A.

For confidentiality reasons, I have put my homework on a separate page here.

The course was well-structured:

  • Jan 29 Course administrivia, semester plan; some statistical NLP fundamentals
  • Feb 4 Words and lexical association
  • Feb 11 Information theory
  • Feb 18 Maximum likelihood estimation and Expectation Maximization
  • Feb 25 More on EM and HMMs
  • March 4 Parsing, Generalizing CFG
  • March 11 Philip Resnik and Eric Hardisty, Gibbs Sampling for the Uninitiated
  • Mar 25 Supervised classification and evaluation
  • April 1 Deep learning
  • Apr 8 Guest lecture: Raul Guerra on Data Stream Mining in NLP
  • April 15 Structured prediction Noah Smith
  • Apr 29 Machine translation

Our final project paper:



All the references I read for the final project

  1. John_The Big-Five Trait Taxonomy
  2. Mark_Changes In US Spending On Mental Health
  3. Daume_Markov Random Topic Fields; Greene_More than Words – Syntactic Packaging and Implicit Sentiment_NAACL09
  4. Clark_Computational Linguistics and Natural Language Processing
  5. Coppersmith_Quantifying Mental Health Signals in Twitter_2014
  6. Choudhury_Predicting Postpartum Changes in Emotion and Behavior via Social Media_CHI13
  7. Choudhury_Role of Social Media in Tackling Challenges in Mental Health_ACM13
  8. Choudhury_Predicting Depression via Social Media_ICWSM13
  9. ACL_Using Topic Modeling to Improve Prediction of Neuroticism and Depression
  10. Choudhury_AnorexiaOnTumblr_DH2015
  11. Schwartz_Characterizing Geographic Variation in Well-Being Using Tweets_ICWSM13
  12. Solberg_Regulating Human Subjects Research in the Information Age Data Mining on Social Networking Sites_2012
  13. Santen_Computational Prosodic Markers for Autism
  14. Roark_Spoken Language Derived Measures for Detecting_ToASLP11
  15. Radloff_The CES-D scale – a self-report depression scale for research in the general population.
  16. Pestian_Sentiment Analysis of suicide notes_ A shared Task
  17. Prudhommeaux_Classification of atypical language in autism
  18. Pennebaker_Linguistic styles language use as an individual difference
  19. National Expenditures for Mental Health Services & Substance Abuse Treatment

CLIP Talks Summary by Ruofei Du

[Summary] 04/22 CLIP Talk: Paraphrastic Similarity in Text Embedding

Speaker: Kevin Gimpel from Toyota Technological Institute at Chicago (TTIC)

Note taker: Ruofei Du

This is joint work with John Wieting (UIUC), Ang Lu (Tsinghua), Weiran Wang, Mohit Bansal and Karen Livescu


Summary: Dr. Gimpel created two datasets built upon PPDB for predicting similarity between bigrams and between phrases, using crowdsourcing and expert annotation. They use recursive neural networks (RNN) and an addition model, and compare their Paragram embeddings against skip-gram and Hashimoto et al. 2014. They showed that the RNN and the addition model perform better.


Word Embeddings & Similarity & Relatedness

  • How should we evaluate word similarity?
    • WordSim353 (Agirre et al., 2009)
      • 0-10 score: e.g., journey~voyage=9.3, rooster~voyage=0.6;
    • BigramSim (Mitchell and Lapata, 2010)
      • TV set, television set
  • Skip-gram Model (Mikolov et al., 2013); it’s a bag-of-words model according to Prof. Resnik.
    • Train a vector w(t) for a word, predict w(t+i)
    • Give a window size
    • PoS & dependency parsing are better than WordSim (Bansal et al., ACL’14)
  • Differ similarity from relatedness
    • SimLex-999 (Hill et al., 2014)
      • insane~crazy=9.6; new~ancient=0.2
    • BigramPara (this talk)
    • PhrasePara (this talk)
      • cannot be separated from ~ is inseparable from = 5.0
      • Motivation: Why focus on bigrams / phrases?
        • Simpler compositional phenomena
        • clean way to compare compositional architectures
      • Prof. Resnik: locality matters
    • Sentences (MSR Paraphrase Corpus)
  • Dataset Creation
    • Paraphrase Database (Ganitkevitch et al., 2013)
      • imprisoned – verhaftet
    • Annotations permit evaluation of phrase embedding models
    • Why use PPDB?
      • phrase embeddings could improve confidence scores or potentially replace PPDB with a more concise model
    • Mechanical Turk to annotate
      • US workers, 3,000 phrase pairs annotated, less than a day, score 1-5
      • create balanced dataset, retained pairs with lowest mean deviation, randomly split into dev (260) and test (1000)
    • Annotate bigram by ourselves
      • Inter-annotator agreement
      • Spearman’s rho and Cohen’s kappa are used, (Ruofei: very common in HCI)
        • Spearman’s rho reflects the strength of the relationship between the variables, which is represented as a value between -1 and 1
        • Cohen’s kappa reflects the agreement derived from a crosstab table. This metric is used when two people look at the same data and categorize them. The problem with raw agreement is that it does not remove the effect of randomness: you may get a good result just by chance. To claim reproducibility, we want a metric that corrects for chance agreement, and Cohen’s kappa is such a metric.
        • Training samples: contamination~pollution, broad~general
      • Loss function for learning
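        • $\sum_{\langle u,v\rangle \in \mathrm{PPDB}} \left[ \max(0,\, 1 - u\cdot v + u\cdot t) + \max(0,\, 1 - u\cdot v + v\cdot t') \right]$, where $t$ and $t'$ are the negative examples for $u$ and $v$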
        • hinge losses encourage given words to be more similar to each other than either is to its negative example
        • For efficiency, they only do \argmax over current mini-batch
        • regularize by penalizing squared $L_2$ distance
    • Training: PPDB
    • Tuning: WordSim353
    • Test: SimLex-999
    • Comparison
      • Vector Addition
      • Recursive Neural Network (RNN)
      • Convolutional Neural Network (Kim, 2014) with 200 unigram filters static
      • Use Stanford parser for binarized parse
      • recurrent neural networks
    • skip-gram vs paragram vs Hashimoto et al. 2014
    • average over three data splits: adj noun, noun noun, verb noun.
    • support vector regression to predict gold similarities with 5-fold cross validation
    • RNN is better. addition model is better
  • Multiview embedding
    • learn embeddings directly from parallel text
    • active research area
    • Deep canonical Correlation Analysis
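
To make the loss function in the notes above concrete, here is a minimal NumPy sketch of the two-sided hinge loss for a single paraphrase pair. The function names `pair_hinge_loss` and `pick_negative`, the toy vectors, and the reading of "argmax over current mini-batch" as "pick the most similar other vector in the batch" are my assumptions from these notes, not the speaker's code.

```python
import numpy as np

def pair_hinge_loss(u, v, t, t_prime, margin=1.0):
    """Two-sided hinge loss for a paraphrase pair (u, v).

    t is the negative example for u, t_prime the negative example for v.
    """
    pos = u @ v
    loss_u = max(0.0, margin - pos + u @ t)
    loss_v = max(0.0, margin - pos + v @ t_prime)
    return loss_u + loss_v

def pick_negative(anchor_idx, partner_idx, batch):
    """Assumed negative sampling: the vector in the mini-batch with the
    highest dot product with the anchor, excluding the pair itself."""
    scores = batch @ batch[anchor_idx]
    scores[[anchor_idx, partner_idx]] = -np.inf
    return int(np.argmax(scores))

# Toy mini-batch: rows 0 and 1 form the paraphrase pair (u, v).
batch = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [-1.0, 0.0],
])
neg_u = pick_negative(0, 1, batch)
neg_v = pick_negative(1, 0, batch)
loss = pair_hinge_loss(batch[0], batch[1], batch[neg_u], batch[neg_v])
```

The loss is zero only when the pair is more similar to each other than to either negative by at least the margin, which matches the intuition stated in the notes.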


Questions and comments

  • What traditional techniques are used in similarity estimation?
    • K-means, window size
  • Intuition of how the similarity results are (15 vs. 17, what does this mean?)
    • Still work-in-progress
  • Dr. Resnik: take a look at Bioinformatics.
  • What about taking context into account?


[CLIP Summary] 04/29 Automatic Grammatical Error Correction for Language Learners

Topic: Automatic Grammatical Error Correction for Language Learners

Speaker: Joel Tetreault

Note taker: Ruofei Du



This talk provides a great motivating story and history for Grammatical Error Correction (GEC). The speaker described popular methodologies for correcting language learner errors, such as rule-based approaches, data-driven approaches and their corresponding features, and Statistical Machine Translation. Finally, the speaker envisioned the future of the field: adding more features, crowdsourcing, unsupervised learning, and direct applications.


Some examples:



  • Grammatical Error
    • Syntax error -> rule-driven
      • easier to learn
    • Usage error -> conventional usage habits
      • easier to make for beginners (a data-driven approach must be taken!)
  • Motivation
    • 1B+ people speak English as a 2nd language
  • Goal
    • Brief history of GEC
    • Challenge that language learners face
    • Challenge of designing tools to assist learners
    • Approaches
    • Visions for the future
  • Methodologies & Systems Figure
    • IBM (1982~~>1993) MSWord
      • Rule-based Grammars Parsers
    • ALEK (2000)
    • Izumi from Japan on ACL (2003)
      • first to use supervised learning (SL)
      • explored 3 training paradigms
      • well formed
      • artificial errors
      • real errors.
    • CLEC (Chinese Learners Corpus) Google 1TB
    • MSR (2006-2008)
      • Well-formed text from ETS, dissertations, etc.
      • ML: MaxEnt, SVMs, LMs, web counts
    • Web-scale Data with Artificial Errors (2008-2010)
    • Real Errors (2011-2013) 70+ papers
      • HOO, FCE, NUCLE, Lang-8, Wiki Rev
      • 4 shared tasks (2011-2014)
      • MT, Beam Search, Hybrid systems, Joint Learning, System Combination
    • TOEFL11 WIKed, GING, CorrectEngine, Grammar, WhiteSmxxx
  • Learning Errors from Cambridge Learner Corpus (CLC)
    • Verbal Morphology and Tense: 14%
    • Over-regularization of irregular verb
    • Ill-formed tense, participle
    • Preposition Presence and Choice 13%
    • Preposition Choice: to / on / toward / onto
  • Background
    • Corpora: NUCLE, FCE, HOO2011, CLEC; TOEFL11, LANG8, WIKIed, ICLE
    • Shared tasks
  • Data-driven approaches
    • Rule-based Approaches
      • Infinitive formation
        • to talking -> to talk
      • Modal verb + have + past participle
    • Error types that require data driven
    • Training data
      • Well-formed text only
      • Artificial errors
      • Error-annotated learner data
    • Methods
      • classification
      • language models
      • web-based
      • statistical machine translation
    • Machine learning classifier
      • MaxEnt, SVM, averaged perceptron
    • Typical Features
      • N-grams: 1,2,3
      • PoS tags
    • Training on Correct Usage
      • train on examples of correct usage only
      • plenty of well-formed text available
    • Prep Detection Example
    • Heuristic rules that cover cases the classifier misses
    • Best method
      • Derive “real” error rate from annotated learner corpus
    • Artificial Errors
      • GenERRate (2009)
        • Tool for automatically inserting errors given a configuration file
      • Error Annotated Corpora
        • Use writer’s word choice as a feature
        • Han et al. (2010): a large error-annotated corpus is better than positive (+) examples only
      • Comparing Training Paradigms
    • Cahill et al. (2013)
    • Trends:
      • Artificial errors derived from lang-8 proved best on 2 out of 3 test sets
      • Artificial errors models can be competitive with real error models, if enough training data generated.
      • Training on Wikipedia revision yields most consistent system across domains
      • Web-based Methods: google & bing
        • smart formulation of queries : Yi (2008)
        • use methodology to mine L1 specific errors
      • Language Models
        • scores over phrase or sentence for correction
        • meta-learner
        • rank whole sentence outputs from rule-based and SMT systems
  • Statistical Machine Translation
    • Motivation
      • Most work in correction targets specific error types
      • Can we use statistical machine translation (SMT) to do whole sentence error correction
    • Noisy Channel Model
      • Key: Felice et al. (2014) algorithm
      • RBS -> Generated Candidate -> LM -> SMT -> … Times change, people change
  • Current & future directions
    • Current state of affairs
      • shared resources, shared tasks, workshops, lots of papers
    • What is the future of GEC
      • Provide useful feedback to learner
      • Track learner over time and model language development
      • Take into account L1, user context, etc.
      • Integrate with persistent spoken dialogue tutor
      • Annotation and Evaluation
        • Some of people like cats
        • Some people like cats
        • Some of the people like cats
    • Annotation Issues
      • Comprehensive vs. Targeted approaches
      • Tradeoffs of time and coverage vs. cost and “quality”
      • Calculating agreement
    • Crowdsourcing
      • How to efficiently and cheaply collect? (HCI aspect)
    • Unsupervised Methods
      • Lots of unlabeled learner data available
      • How can these sources be leveraged?
        • Levy and Park (2011)
    • Multilingual GEC Challenges
      • Lack of good NLP tools (taggers, parsers, etc. )
      • Lack of large corpora
    • Direct Application of GEC
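
As an illustration of the "artificial errors" training paradigm listed above, here is a minimal sketch that corrupts prepositions in well-formed text to create labelled training examples. It is written in the spirit of tools like GenERRate but does not reproduce that tool's behavior or configuration format; the confusion set `PREPOSITIONS`, the function name, and the example sentence are all hypothetical.

```python
import random

# Hypothetical confusion set; a real system would derive it from learner data.
PREPOSITIONS = ["to", "on", "toward", "onto", "in", "at", "for"]

def inject_preposition_errors(tokens, error_rate, rng):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original preposition where an error was injected,
    and None everywhere else.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if tok in PREPOSITIONS and rng.random() < error_rate:
            wrong = rng.choice([p for p in PREPOSITIONS if p != tok])
            corrupted.append(wrong)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

rng = random.Random(0)
sentence = "she walked to the station on Monday".split()
corrupted, labels = inject_preposition_errors(sentence, error_rate=1.0, rng=rng)
```

In practice `error_rate` would be set from the "real" error rate estimated on an annotated learner corpus, as the notes suggest, rather than 1.0.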


  • Questions & Comments
    • What about …. Machine makes mistakes and machine corrects it…
      • There’s a workshop at ACL to combine both aspects.
    • What about …. Online community that crowd-sourced human translation
      • Good point; it should be included in the Methodologies & Systems figure.
    • Data matters in this field
      • Stupid methods with more data sometimes work better.
  • Personal comments
    • I built a Japanese GEC corpus as an undergraduate before:
    • The lesson learnt is that classifying different grammar errors is very hard to learn
    • They are labeled by university professors & graduate students
    • To make everything correct and easy to search & filter, it was rule-based

[Summary] 04/08/15 CLIP Talk

Title: Weakly Supervised Information Extraction for the Social Web

Speaker: Alan Ritter, Ohio State University, graduated from University of Washington


Goal: read all the public text and maintain a detailed up-to-date database




Social Media

  • Pro: Short, easy to write, instantly and widely disseminated
  • Con: Many irrelevant and redundant messages and lots of unreliable information (information overload, Double Edged Sword)
  • Example, 1 May 2011
    • 3:58 PM EST Helicopter hovering above Abbottabad at 1AM (is a rare event)
    • 10:24 PM EST So I’m told by a reputable person they have killed Osama Bin …
  • Long tail of Smaller Events
    • Spamhaus is currently under a DDoS attack against our website which we are working on mitigating. Our DNSbls are not affected.
  • Big Data
    • We need NLP for Social Media analysis
    • We need to learn models from lots of text
    • Hypothesis: Small, manually annotated datasets won’t scale up to the diversity we see on Twitter



  • NLP breaks on noisy style: e.g. POS issue: NNP/Yess; chunk issue: NP its official Nintendo; NER issue: 3DS in north: LOC America
    • Re-building the NLP pipeline for social media: supervised model -> pos -> shallow parse -> NER
    • Named entity recognition (NER) in Twitter
  • Diversity
    • One phrase usually has many forms such as 2m 2ma 2maa
    • Weakly supervised learning
      • NER in Twitter
        • Plethora of distinctive, infrequent types
          • Bands, Movies, Products, etc
          • Very little training data for these
          • Can’t simply rely on supervised learning
        • Very terse (often contain insufficient context)
      • Idea: Freebase / Wikipedia lists provide a source of supervision, but these lists are highly ambiguous, Example: China as a country or band
      • e.g. JFK
        • On my way to JFK early in the….
        • JFK’s bomber jacket sells for
        • JFK Airport’s Pan Am Worldport
        • Waiting at JFK for our ride…
        • When JFK threw first pitch on…
      • relation extraction
      • event extraction
  • Missing Data in Distant Supervision
    • Leads to errors.
    • Binary Relations: lots of overlapping / correlated features
      • Dependency paths, POS sequences
    • Conditionally trained models are better
      • don’t make poor independence assumptions
    • e.g. Barack Obama -> Honolulu
    • Local extractors + relation mentions + Aggregate relations (born in, lived in), Deterministic OR => maximize likelihood
    • Learning
      • Structured Perceptron
      • MAP-based learning
    • MAP inference
      • Local search almost always optimal
        • With a carefully chosen set of search operators
        • Verified by comparing against optimal solutions found using A* search
          • Out of 100,000 real problems from our data, only missed an optimal solution in 3 cases


Case study: Extracting a Calendar from Twitter

  • Pipeline: Tweets->POS Tag->Temporal Resolution, NER, Event Tagger -> Significance Ranking & Event Classification -> Calendar Entries
  • When it does not work
    • For news, it usually does not work because sentences are longer and the temporal clue sometimes does not correspond to the particular event.
      • e.g. A twin suicide bomb attack at a crowded snooker club on 10 January killed…


Distant Supervision: Events vs. Relations

  • Relations: e.g. born in (Barack Obama, Honolulu)
  • Events:  e.g. air attack


Case study: Cybersecurity Events

  • account Hijacking seed events: e.g., associated press, reuters, us marines, cnn, justin bieber, zuckerberg
  • gathering unlabeled / test data
    • track a keyword associated with the event (e.g. hacked, DDOS, breach)
    • extract named entities
  • learn from unlabeled data and positive seeds
    • augment data likelihood with label regularization term:
      • Log Likelihood – Label regularization (encourage user-generated labels) – L^2 regularization
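
One way to write the objective above explicitly is sketched below. The symbols are assumptions for illustration: $\theta$ are the model parameters, $(x_i, y_i)$ the positive seed examples, $\tilde{p}$ a target proportion of positive labels on the unlabeled data, $\hat{p}_\theta$ the proportion the model predicts there, $D$ a divergence such as KL, and $\lambda_U$, $\lambda_{L_2}$ trade-off weights. The talk notes only give the three terms, not their exact form.

$$
O(\theta) = \sum_{i} \log p_\theta(y_i \mid x_i) \;-\; \lambda_U \, D\!\left(\tilde{p} \,\|\, \hat{p}_\theta\right) \;-\; \lambda_{L_2} \sum_{j} \theta_j^2
$$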


Case study: data driven response generation: potential applications (EMNLP 2011)

  • More natural generation in dialogue systems
  • Conversationally-aware predictive text entry
    • Speech interface to SMS/Twitter
  • e.g. I’m feeling sick -> hope you feel better



  • Combining calendar view, predictive NLP model is a really interesting idea. It can trigger many interesting visualization of temporal events.
  • Can we dynamically track Wikipedia? This is another interesting idea. But why doesn’t Wikipedia support a live streaming API like Twitter does? Seems more engineering than science.
  • Can we learn to extract events (X welcomes Y as our new mayor === Y sworn in as Y mayor)



  • Can we use both spatial and temporal visualization with it?
  • How to learn the relation between a named entity and associated phrases?
    • Answer: use inference techniques, add hidden variables
    • e.g. PRODUCT, top 20 entities: nintendo, ds lite, apple, ipod, generation black, ipod nano, apple iphone, gb black
    • e.g. TV-SHOW, pretty little american skins, nof, order svu
    • e.g. FACILITY voodoo lounge, grand ballroom, crash mansion, sullivan hall, memorials.
  • How about the correlation between FreeBase and Social Media like Twitter for future work?
  • State-tracking over time?
    • Sort of future work
  • More data? Social media in the past? 2 years of data?
    • Generate history of a particular person with temporal events…
    • Generate security event for a particular entity…





[04/01 Talk Summary] Recent Advances and Challenge Problem for Machine Learning in Climate Science


This is my second time attending a talk… Here’s my summary.


Recent Advances and Challenge Problem for Machine Learning in Climate Science


Claire Monteleoni

George Washington University

Note taker: Ruofei Du (UMIACS, University of Maryland, College Park)




The general idea: Machine learning can shed light on climate change


Motivation: Despite the scientific consensus on climate change, drastic uncertainties remain:

How does climate change affect extreme events? According to the Intergovernmental Panel on Climate Change, climate change produces a shifted mean: the PDF is roughly a normal distribution, and under climate change it shifts slightly to the right (toward the warmer side). In addition, there is uncertainty in the extremes, especially at the regional level. A warmer atmosphere can hold more water vapor, which induces heavier precipitation, storms and flooding. Global warming may increase surface evaporation, leading to heat waves and droughts. Possible changes in the El Niño Southern Oscillation may induce floods in some regions and droughts in others.


Insight: Extreme events are rare by definition, so it is very hard to do machine learning on them (not enough evidence). But climate change may affect their distribution. Augmenting historical data with climate model simulations may assist in predicting climate change.


What they did: They published several papers on adaptive averaging of models, online learning with MRFs, HMMs and matrix completion; founded the Workshop on Climate Informatics, running since 2011; and gave a 2014 NIPS tutorial. Finally, they want to exploit topic models from NLP to do transfer learning.


Main types of climate data

  • Past: Historical data, limited amounts and very heterogeneous

  • Present: Observation data, increasingly measured. Large quantities for recent times. Can be unlabeled, sparse, and measured at a higher resolution than the relevant information

  • Past, Present, Future: climate model simulations: very high-dimensional, encode scientific domain knowledge, some information is lost in discretization, future predictions cannot be validated.

  • Including: GCMs / ESMs (CMIP3/5) (Tb/day), Satellite retrievals (Tb/day), Next-generation re-analysis


Challenge Problems in Climate Informatics

  • Past: Paleoclimate reconstruction

  • Local: Climate downscaling. What climate can I expect in my own backyard?

  • Spatiotemporal: Space and time. How to capture dependencies over space and time?

  • Future: Climate model ensembles. How to reduce uncertainty on future predictions

  • Tails/ impacts: extreme events. What are extreme events and how will climate change affect them?

  • Other problems: Climate model is a non-data-driven, complex system of interacting mathematical models, and is based on scientific first principles


Past research: IPCC findings

  • human influence on climate: without human-induced greenhouse gasses, the climate model simulation does not match the true observations well.

  • No one model predicts best all the time, for all variables. The average prediction over all models is a better predictor than any single model. Bayesian approaches have usually been used in climate science since 2008.


Research Question and Methods

Can we do better, using Machine Learning? How should we predict future climates while taking into account the multi-model ensemble?

  • They propose to use adaptive, weighted average prediction, e.g., model B dominates, model E follows, and models A, C, D contribute little.

  • They also explore the tradeoff between “explore” and “exploit”.

  • Since their data is non-stationary, they exploit online learning to solve it. (see Midterm Problem 2.1)

  • They used a generalized Hidden Markov Model to do so. Compared with the multi-model average (baseline), their Learn-$\alpha$ algorithm fits the ground truth best. They won a best paper award for it.

  • In 2012, they observed that climate predictions are being made at higher geospatial resolutions, and set out to model neighborhood influences among geospatial areas.

  • They propose incorporating neighborhood influence: neighborhood-augmented Learn-$\alpha$, i.e., online learning with some geospatial components.

  • It’s an MRF-based approach, but by introducing time $t$, the 2D MRF becomes a 3D cube. They call it regional Learn-$\alpha$. They reduced the cumulative annual loss compared with the naive MRF-based method (AAAI 2012).
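
To illustrate the idea of adaptive, weighted average prediction over a multi-model ensemble, here is a minimal sketch of an exponentially weighted average forecaster. This is a simpler, standard online-learning scheme that I am substituting for illustration; it is not the Learn-$\alpha$ algorithm itself, which, as I understand it, additionally models switching between experts to handle non-stationary data. The learning rate `eta`, the squared loss, and the toy temperature series are all assumptions.

```python
import numpy as np

def online_weighted_average(model_preds, observations, eta=1.0):
    """Exponentially weighted average of K models, updated online.

    model_preds: array of shape (T, K), one column per climate model.
    observations: array of shape (T,), revealed after each prediction.
    Returns the ensemble predictions (T,) and the final weights (K,).
    """
    T, K = model_preds.shape
    weights = np.full(K, 1.0 / K)
    ensemble = np.empty(T)
    for t in range(T):
        # Predict with the current weights, before seeing the observation.
        ensemble[t] = weights @ model_preds[t]
        # Down-weight each model according to its squared error.
        losses = (model_preds[t] - observations[t]) ** 2
        weights = weights * np.exp(-eta * losses)
        weights /= weights.sum()
    return ensemble, weights

# Toy data: a slowly warming series and three biased "models".
truth = np.linspace(14.0, 15.0, 50)
preds = np.column_stack([
    truth + 1.0,   # model A: biased warm
    truth + 0.1,   # model B: close to the truth
    truth - 0.8,   # model C: biased cold
])
ensemble, weights = online_weighted_average(preds, truth)
```

On this toy series the weight mass moves to model B, mirroring the "model B dominates" picture in the notes; the first prediction is just the plain multi-model average.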


Climate Prediction via Matrix Completion


Goal: combine/improve the predictions of the multi-model ensemble of GCMs, using sparse matrix completion.


  • They exploit past observations, and the predictions of the multi-model ensemble of GCMs.

  • Their learning approach is batch and unsupervised.

  • They create a sparse (incomplete) matrix from climate model predictions and observed temperature data.

  • Finally, they apply a matrix completion algorithm to recover the unobserved data.

  • Eventually, they yield predictions of unobserved results.
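
The steps above can be sketched with a generic low-rank completion routine. This is an illustrative iterative truncated-SVD imputation on a toy matrix, not the specific matrix completion algorithm used in the talk; the function name `complete_low_rank`, the rank, the iteration count, and the numbers are assumptions.

```python
import numpy as np

def complete_low_rank(matrix, observed_mask, rank=1, n_iters=300):
    """Fill unobserved entries by alternating between a rank-`rank` SVD
    approximation and re-imposing the observed entries."""
    # Start by filling the gaps with the mean of the observed values.
    fill_value = matrix[observed_mask].mean()
    filled = np.where(observed_mask, matrix, fill_value)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed entries fixed; update only the missing ones.
        filled = np.where(observed_mask, matrix, low_rank)
    return filled

# Toy matrix: rows could be time steps, columns models/observations.
truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 2.0])
mask = np.ones_like(truth, dtype=bool)
mask[0, 2] = mask[2, 1] = mask[3, 0] = False   # hide three entries
observed = np.where(mask, truth, 0.0)
completed = complete_low_rank(observed, mask)
```

This only works because the toy matrix really is low rank; whether climate data has such structure is exactly the question raised in the outlook below.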


Outlook: these results suggest some low intrinsic dimensionality.


They induced some sparsity in the input matrix, which need not ensure low intrinsic dimensionality. Past research also suggests low intrinsic dimensionality: only a small number (~2) of climatological “predictive components” determine the predictive “skill” of climate models. This also suggests future work on tracking a small subset of the ensemble.


Next, how to define extremes is a problem, so they borrowed from topic modeling and LDA. By analogy with mapping a bag of words to topics and words, they cluster geo-locations into climate topics and predictions. The parameters include the number of spatial regions, the number of observations per region, the climate topic, the climate descriptor (a discretized observed climate variable) and the Dirichlet prior.


Next question is how to reconstruct paleoclimate. We have tree rings, ice cores and lake sediment cores to be used.


Can sparse matrix completion techniques play a role? Discover latent structure?


The issue is that there are only many small data sets….shall we use data fusion techniques? Shall we use multi-view learning?


My Questions


Why is LDA the best approach to model climate change? How are climate topics similar to document topics? Are there any specific features in climate topics? Naive transfer learning might not be perfect.


Prof. Jimmy Lin does not buy the idea of using topic modeling, and neither do I.


Relation to My Research


The 3D rendering of climate change may trigger interesting topics in the virtual / augmented reality field. I would like to explore the geo-tagged climate data for future research if applicable.


Ruofei Du

[Summary] Discovering Semantic Themes in Online Reviews

It’s the first time for me to write summaries for CLIP talks. Please correct me if I misunderstood anything.


Speaker: Jorge Mejia


Summarizer: Ruofei Du


Motivation: Sometimes it’s challenging to make sense out of online reviews due to the polarized discussion. To solve the problem, the author tries to discover semantic themes from 130k reviews (15m words) of 2k restaurants. Another important motivation is to investigate the closure of restaurants.


Models: The author proposed two models

  • Text
  • Econometric



  • Preprocessing: stop-word removal, stemming, etc.
  • Regression with document-term matrix (DTM)
  • Traditional approaches: one-dimensional statistical summaries such as word count, sentiment, readability
  • New tool: latent semantic analysis (LSA) (PCA for documents) (dimension reduction, apple, orange -> fruit)
  • Text model: X = U D V’, where X denotes the DTM, U the topic scores, D the variance explained, and V the topics
    • V’ \in {quality_overall, food_efficiency, responsiveness, food_quality, atmosphere}
  • Econometric model of restaurant closure: ~2,000 open, ~500 closures from 2005 to 2013.
  • Challenges: variety of restaurant features.
  • Past research: Matching: coarsened exact matching (King et al. 2011)
    • Coarsen variables: (e.g. years of education (4, 5) -> categories (college, graduate))
    • Sort all units into strata with control and case.
    • Prune any stratum with 0,0: avoid bias
  • Experiment: Start with 2021 open/control, 446 closed/case; CEM results in 605 open/control, 446 closed/case. (n=1051)
  • We should compare case and control samples. (e.g. 446 case with 54.8 num of reviews, 605 controls with 59.5 num of reviews)
  • Past research: Mixed effects logistic regression or longitudinal survival model
    • Pr(closure)_{it} ~ rating + reviews + wordCount + price_point + food_efficiency + atmosphere …
  • Comparison: Base model, semantic variables, closed sample, fixed effects
  • Results: Variables associated with closure
    • Cuisines (American), higher price, fewer reviews, lower ratings
    • Review text focused on
      • less on overall quality
      • less on responsiveness
      • more on efficiency (if the reviewer focuses more on efficiency, the review is more likely a complaint)
  • Validation
    • longitudinal ROC curve: AUC = 0.760
  • Conclusion
    • Semantic structure can be extracted from online reviews.
    • This structure can explain business outcomes
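
The text model X = U D V’ in the notes above can be sketched with a plain SVD in NumPy. The four-term vocabulary and the tiny document-term matrix are made up for illustration; the talk used 130k real reviews, and its preprocessing and weighting are not specified in these notes.

```python
import numpy as np

# Toy document-term matrix: rows are reviews, columns are term counts.
terms = ["tasty", "delicious", "slow", "wait"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 3],
], dtype=float)

# X = U D V': np.linalg.svd returns U, the diagonal of D, and V' directly.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                   # number of semantic themes to keep
topic_scores = U[:, :k] * d[:k]         # per-review score on each theme
topics = Vt[:k]                         # per-term loading of each theme
variance_explained = d**2 / np.sum(d**2)
X_k = topic_scores @ topics             # rank-k approximation of the DTM
```

In this toy example one theme loads only on the service-speed terms and the other only on the food-quality terms; the per-review `topic_scores` are the kind of semantic variables that could then enter the closure regression. Note that without mean-centering, "variance explained" is only loosely the share of squared singular values.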




  1. Real-life troubles (reading online reviews) may trigger interesting research problems (discovering semantic themes).
  2. Extending traditional approaches (one-dimensional statistical summaries such as word count, sentiment, readability) to higher dimensions (DTM) might be a good contribution.
  3. Always ask “so what” of your models to seek significance. In this way, the author turns to the closure of restaurants.


My suggestions and proposed question during the talk:

  •  Temporal and spatial features may contribute a lot to restaurant closure (e.g. in some locations, restaurants may be more likely to close)


Other suggestions

  • Weights of reviewers vary a lot from person to person


Ruofei Du

Last, some other findings:

Vision + Geo + NLP + Psychology is a great future direction: here’s something I found in Instagram.


