Statistical Semantic Classification of Crisis Information, Prashant Khare, Miriam Fernandez

Presentation Transcript

  • 1. Statistical Semantic Classification of Crisis Information. Prashant Khare, Miriam Fernandez, Harith Alani ({prashant.khare, miriam.fernandez, h.alani}@open.ac.uk), Knowledge Media Institute, The Open University, UK
  • 2. Motivation: Example posts from crises. Wildfire: “People of NSW, be careful because there's fires spreading! Stay safe everyone!” Earthquake: “Hundreds of volunteers in Mexico tried to unearth children they hoped were still alive beneath a school's ruins.” Floods: “Two trucks and one car in the water after a road collapse at Hwy 287 and Dillon. #cowx #boulderflood”
  • 3. Motivation: Challenges. A flood of data is generated. For example, over a million tweets on average were generated during Hurricane Harvey (2017), and tweet traffic increased by 500% during the 2011 Japan earthquake. It is almost impossible to manually absorb and process this volume. In addition, characteristics of social media posts, such as short length, colloquialisms, and syntactic issues, make the data harder to process.
  • 4. Motivation: Relevant and Non-Relevant
  • 5. Motivation: FEMA launched an initiative to use public social media data for situational awareness purposes (1). (1) https://www.dhs.gov/sites/default/files/publications/privacy-pia-FEMA-OUSM-April2016.pdf Image source: fema.gov
  • 6. Previous Efforts: Identifying Crisis-Related Information. ML classification methods: supervised approaches, often making use of n-grams, linguistic features, and/or statistical features of tweets; unsupervised approaches, based on keyword processing and clustering. Semantic models: representation of the information emerging from crisis events, providing faceted search of crisis-related information.
  • 7. Hypothesis and Aim. Hypothesis: Semantics can establish consistency across crisis-relevant information and enhance the discriminative power of classifiers. Aim: Complement previous approaches by investigating the impact of semantic features, alongside statistical features, in ML classification.
  • 8. Method: (1) Collect data from CrisisLex.org, a collection of crisis-oriented tweets. (2) Extract statistical features. (3) Semantically enrich tweets via annotation using the Babelfy API. (4) Expand the semantics by incorporating hypernyms from BabelNet. (5) Filter out the less informative and abstract features (using a hypernym hierarchy). (6) Classify using an SVM classifier.
  • 9. Semantic Enrichment of Tweets. Let us consider this tweet from the Colorado Wildfire 2012: “#HighParkFire burn area map as of Monday night 10 p.m. http://t.co/1guBTcXX”. Annotated tokens: burn, area, map, Monday, night. Semantically enriched (augmented hypernyms): area → region; map → representation; burn → destroy; Monday → weekday, day_of_the_week; night → period. Image source: http://babelnet.org/about
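The enrichment shown on this slide can be sketched roughly as follows. This is only an illustration under assumptions: `TOY_HYPERNYMS` is a hand-built dictionary transcribed from the slide's example and stands in for the Babelfy and BabelNet lookups, the function name `enrich` is hypothetical, and no real API is called.

```python
# Hand-built stand-in for BabelNet hypernym lookups (illustrative only),
# transcribed from the example tweet on the slide.
TOY_HYPERNYMS = {
    "burn": ["destroy"],
    "area": ["region"],
    "map": ["representation"],
    "monday": ["weekday", "day_of_the_week"],
    "night": ["period"],
}


def enrich(annotated_tokens, hypernyms=TOY_HYPERNYMS):
    """Return each annotation followed by its hypernyms, without duplicates."""
    features = []
    for token in annotated_tokens:
        key = token.lower()
        for concept in [key] + hypernyms.get(key, []):
            if concept not in features:
                features.append(concept)
    return features


# Annotated tokens of the #HighParkFire tweet, as listed on the slide.
annotations = ["burn", "area", "map", "Monday", "night"]
enriched = enrich(annotations)
print(enriched)
```

Tokens without an entry in the dictionary pass through unchanged, which mirrors the idea that expansion only adds features and never removes the original annotation.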
  • 10. Semantic Features: Annotations, Expansion, & Filtering
  • 11. Semantic Features: Annotations, Expansion, & Filtering. BabelNet: extracted 4 million relations by iteratively querying for hypernyms. Using a directed graph and betweenness centrality, ‘Entity’ (synset ID ‘bn:00031027n’) was found to be the most abstract concept. Using shortest paths, the maximum depth of nodes was found to be 21. Using Information Gain, most of the informative concepts were found between depths 3 and 7.
  • 12. Semantic Features: Annotations, Expansion, & Filtering. Figure: features plotted against levels/Information Gain (training data for the Colorado Wildfire test case).
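The depth-based filtering described for the hypernym hierarchy can be sketched as below. This is a minimal sketch under assumptions: the edges in `TOY_EDGES` are invented for illustration and are not real BabelNet relations, the function names are hypothetical, and depth is taken as the shortest-path distance from the root concept, computed with a breadth-first search. The default range of 3 to 7 follows the depths the slides report as most informative.

```python
from collections import deque

# Hypothetical (hyponym, hypernym) pairs; not real BabelNet data.
TOY_EDGES = [
    ("abstraction", "entity"),
    ("event", "abstraction"),
    ("happening", "event"),
    ("incident", "happening"),
    ("fire", "happening"),
    ("fire", "event"),       # a second, shorter route to the root
    ("wildfire", "fire"),
]


def concept_depths(edges, root):
    """Shortest-path depth of every concept reachable from the root."""
    children = {}
    for hyponym, hypernym in edges:
        children.setdefault(hypernym, []).append(hyponym)
    depths = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depths:  # first visit in BFS is the shortest path
                depths[child] = depths[node] + 1
                queue.append(child)
    return depths


def filter_by_depth(concepts, depths, min_depth=3, max_depth=7):
    """Keep concepts whose depth lies in [min_depth, max_depth]."""
    return [c for c in concepts if min_depth <= depths.get(c, -1) <= max_depth]


depths = concept_depths(TOY_EDGES, root="entity")
kept = filter_by_depth(["entity", "event", "fire", "wildfire", "unknown"], depths)
print(kept)
```

Concepts missing from the graph are dropped rather than kept, which is a design choice of this sketch and not something the slides specify.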
  • 13. Data: CrisisLexT26 (3206 tweets: 1667 related, 1539 not related); 9 events, predominantly in English.
  • 14. Features. Statistical Features (SF): number of nouns, verbs, and pronouns; tweet length; number of words and hashtags; readability (Gunning Fog Index, computed from average sentence length (ASL) and percentage of complex words (PCW) as 0.4*(ASL + PCW)); unigrams. Semantic Features (SemF): Semantic Annotation Features (SemAF); Semantic Expansion Features (SemEF); Semantic Filtering Features (SemFF).
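The readability feature on this slide, 0.4*(ASL + PCW), can be sketched as follows. This is a rough illustration, not the authors' implementation: the vowel-group syllable counter, the three-syllable threshold for "complex" words, and the regex-based sentence and word splitting are all assumptions, and the function names are hypothetical. Real tweets with URLs or abbreviations would need more careful tokenisation.

```python
import re


def count_syllables(word):
    """Crude heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text):
    """Gunning Fog Index: 0.4 * (ASL + PCW), with PCW as a percentage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    asl = len(words) / len(sentences)  # average sentence length in words
    complex_words = [w for w in words if count_syllables(w) >= 3]
    pcw = 100.0 * len(complex_words) / len(words)  # percentage of complex words
    return 0.4 * (asl + pcw)


print(round(gunning_fog("Stay safe everyone!"), 2))
```

Because the syllable counter over-counts some words (for example those ending in a silent "e"), the scores are only indicative; a dictionary-based syllable count would be more faithful.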
  • 15. Experiment. Classifier: Support Vector Machine with linear kernel. Two types of experiments were designed. Crisis Classification Model (train and test on all 9 crisis events), with the following classifier models: SF; SF + SemAF; SF + SemAF + SemEF; SF + SemFF (statistical features and filtered semantic annotations, along with hypernyms). Cross-Crisis Classification (train on 8 crisis events and test on the 9th crisis event). Abbreviations: Statistical Features (SF), Semantic Annotation Features (SemAF), Semantic Expansion Features (SemEF), Semantic Filtering Features (SemFF).
  • 16. Experiment: Crisis Classification (10-fold cross-validation)
  • 17. Experiment: Cross-Crisis Classification
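The cross-crisis setup (train on all events but one, test on the held-out event) can be sketched as a simple split generator. This is a sketch under assumptions: the sample records, event names, and the function `leave_one_event_out` are hypothetical placeholders, and the classifier itself (a linear-kernel SVM in the slides) is deliberately left out. Only the four model names in `MODELS` come from the slides.

```python
# Feature groups per classifier model, as listed on the experiment slide.
MODELS = {
    "SF": ["SF"],
    "SF+SemAF": ["SF", "SemAF"],
    "SF+SemAF+SemEF": ["SF", "SemAF", "SemEF"],
    "SF+SemFF": ["SF", "SemFF"],
}

# Hypothetical toy samples; event names and texts are placeholders.
samples = [
    {"event": "event_a", "text": "post 1", "label": 1},
    {"event": "event_a", "text": "post 2", "label": 0},
    {"event": "event_b", "text": "post 3", "label": 1},
    {"event": "event_c", "text": "post 4", "label": 0},
]


def leave_one_event_out(samples):
    """Yield (held_out_event, train, test), holding out one event at a time."""
    events = sorted({s["event"] for s in samples})
    for held_out in events:
        train = [s for s in samples if s["event"] != held_out]
        test = [s for s in samples if s["event"] == held_out]
        yield held_out, train, test


splits = list(leave_one_event_out(samples))
for held_out, train, test in splits:
    for model_name in MODELS:
        # A linear SVM would be trained on `train` and scored on `test` here.
        print(model_name, held_out, len(train), len(test))
```

Splitting by event rather than by random fold is what makes the test data "entirely new": no post from the held-out crisis is seen during training.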
  • 18. Results and Observations. The numbers of hashtags, nouns, and pronouns are the most relevant statistical features, based on the Information Gain of the features. Beyond the statistical features, semantic expansion surfaced concepts such as ‘Happening’ and ‘Event’ (which are hypernyms of the concepts ‘Incident’, ‘Fire’, ‘Crisis’, ‘Disaster’, and ‘Death’ in BabelNet) among the top 10 attributes.
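Ranking features by Information Gain, as referred to on this slide, can be illustrated with the standard entropy-based definition. This is a minimal sketch: the slides do not say which tool computed the scores, so the functions below are an assumption, and the toy feature and label vectors are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Toy data: 1 = crisis-related, 0 = not related (invented for illustration).
labels = [1, 1, 0, 0]
has_concept_fire = [1, 1, 0, 0]  # perfectly aligned with the labels
has_hashtag = [1, 0, 1, 0]       # independent of the labels

print(information_gain(has_concept_fire, labels))
print(information_gain(has_hashtag, labels))
```

A feature that separates the two classes perfectly recovers the full label entropy, while a feature that is independent of the labels gains nothing.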
  • 19. Results and Observations. Post A: “RT @LarimerCounty: #HighParkFire burn area map as of Monday night 10 p.m.” Misclassified with SF but correctly classified with SF+SemAF: ‘Burn’ did not occur in the training data, but the annotation ‘Fire’ did. Post B: “Colorado wildfires their worst in a decade http://t.co/RtfLmfds” Misclassified with SF+SemAF but correctly classified with SF+SemAF+SemEF: ‘Fire’, a high-IG feature, is a hypernym of the original annotation ‘wildfire’, which was not a ranked feature in the training data. Post C: “RT @RedCross: Thanks to generosity of volunteer blood donors there is currently enough blood on the shelves to meet demand. #BostonMarathon” Misclassified with SF+SemAF+SemEF but correctly classified with SF+SemFF: ‘Thanks’ and ‘Meet’ expanded to abstract features with very low discriminative power, such as ‘Virtue’ and ‘Desire’. Excluding them raised the discriminative power of more informative concepts such as ‘Volunteer’ and ‘Benefactor’ (hypernym of ‘donor’).
  • 20. Take Away. There is potential in mixing statistical and semantic features for classification. The most noteworthy improvement is achieved when the hybrid model is used to classify entirely new data. Semantic expansion can also introduce noise; filtering can help address it. In future, the data should be expanded to more crisis events (currently 5 types of events) and to a larger sample size for each event.
  • 21. Questions!