Multimedia Information Systems Lecture 3 - 8

Natural language processing (NLP)

  • a field of CS and AI concerned with tasks involving human languages, such as human-machine communication
  • Language is involved in many human activities: reading, writing, speaking
  • ability to process massive amounts of textual data (data mining)

NLP applications:
advanced: speech recognition, machine translation.
basic: spelling and grammar checking

Natural language on the web
basic components: text, links, graphics (everything that is visual)
basic formatting: text, links, graphics
Main container for document: page
performance: page
architecture: site
captions: highly stylized and associated with pictures

Webpages

  • contain all kinds of language, URLs, links…

Taxonomy vs. Folksonomy
Taxonomy (top-down)

  • Hierarchical
  • Exclusive: the same item cannot be in two distinct categories

Folksonomy

  • bottom-up
  • not exclusive: an item can be associated with many tags
  • flat
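
The contrast can be made concrete with two small data structures. This is only a sketch: the items, category paths, and tags below are made-up examples, and `items_with_tag` is a hypothetical helper.

```python
# Hypothetical example data; the category and tag names are made up.

# Taxonomy: each item maps to exactly one path in a hierarchy (exclusive).
taxonomy = {
    "sparrow": ("animal", "bird"),
    "salmon": ("animal", "fish"),
}

# Folksonomy: each item maps to a flat set of free-form tags (not exclusive).
folksonomy = {
    "sparrow": {"bird", "small", "garden"},
    "salmon": {"fish", "food", "river"},
}


def items_with_tag(tags_by_item, tag):
    """Return the sorted items that carry the given tag."""
    return sorted(item for item, tags in tags_by_item.items() if tag in tags)
```

In the taxonomy an item has a single place in the tree, while in the folksonomy the same item is reachable through any of its tags.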

Linguistic concepts

  • language is a system of symbols with an agreed-upon meaning that is used by people
  • grammar is a set of rules
  • syntax is sentences and their structure
  • semantics is the meaning of words and their relationships
  • lexicon is the sum total of words in a language
  • semiotics is the theory of signs and symbols

Ambiguity

  • Natural language is highly ambiguous and must be disambiguated
  • Lexical ambiguity
  • syntactic ambiguity
  • referential ambiguity

Ambiguity is ubiquitous

  • speech recognition: “youth in Asia” vs. “euthanasia”
  • syntactic analysis: “I ate noodles with chopsticks” vs. “I ate noodles with meatballs”
  • semantic analysis: “The dog is in the pen” vs. “The ink is in the pen”

Natural languages vs. computer languages

  • ambiguity is the primary difference
  • programming languages are designed to be unambiguous

Why is NLP hard

  • language is hard even for humans; the complexity and degree of specialization of the fields of linguistics show this

Word segmentation

  • A possible solution is maximum matching: start with a pointer at the beginning of the string, choose the longest word in the dictionary that matches at the pointer, then move the pointer past that word and repeat
  • Most successful word segmentation tools are based on machine learning techniques
  • NLTK (nltk.org) is a Python toolkit
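
Maximum matching can be sketched in a few lines. This is a minimal version under assumptions: the dictionary is a plain Python set, and a single character is emitted when no dictionary word matches (that fallback is not from the lecture). The function name `max_match` is just an illustrative choice.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    words = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, then shrink it.
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                break
        else:
            # Assumed fallback: no dictionary word starts here,
            # so emit a single character.
            end = start + 1
        words.append(text[start:end])
        start = end
    return words
```

With the dictionary {"the", "theta", "table", "bled", "own", "down", "there"}, the string "thetabledownthere" comes out as theta / bled / own / there rather than the / table / down / there. The greedy choice of the longest word can go wrong, which is one reason learned segmenters tend to do better.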

Word level NLP

  • Remove ambiguity
  • lexical analysis
  • stop word removal
  • stemming (fishing, fished, fish, fisher)
  • lemmatization: stemming + context
  • morphology (prefix)
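
A toy illustration of lexical analysis, stop word removal, and stemming follows. The stop list and suffix list are assumptions picked for this example; real stemmers (for instance the Porter stemmer) use far more careful rules, and this sketch over-stems many words.

```python
import re

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "was", "in", "of", "and", "to"}

# Suffixes stripped by this toy stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "er", "s")


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Applied to the words from the stemming bullet, `toy_stem` maps fishing, fished, fish, and fisher all to "fish".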

Sentence Level NLP

  • syntax
  • disambiguation (polysemy, anaphora)
  • parsing

Two levels in lexical semantics

  • representing meaning of words
  • word sense disambiguation
  • compositional semantics: how word meanings combine and what the combination means

WordNet

  • semantic database for English

Question Answering

  • factoid
    who, what, when
  • non-factoid
    definition

Text summarization

  • given a piece of text, automatically produce a summary satisfying the required constraints

Document clustering

  • How to organize a large set of documents of various topics

Classification methods

  1. manual classification
  2. Rule-based
    • IDE-type development environments exist for writing the rules

Document vectors

  • Documents are represented as “bags of words”
  • Represented as vectors when used computationally
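
A minimal sketch of the bag-of-words idea, assuming whitespace tokenization and raw counts as vector entries (both are simplifications). The two example documents are made up.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring word order."""
    return Counter(text.lower().split())


def to_vector(bag, vocabulary):
    """Lay the counts out in a fixed vocabulary order."""
    return [bag[word] for word in vocabulary]


# Hypothetical two-document collection.
docs = ["the dog chased the cat", "the cat slept"]
bags = [bag_of_words(d) for d in docs]
# One vector dimension per distinct word in the collection.
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]
```

Every document gets a vector of the same length, one entry per vocabulary word, so documents can be compared numerically.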

Similarity between documents

  • Sum of squared differences between document vectors (Euclidean distance)
  • Angle between vectors
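
Both measures can be written directly on plain lists of numbers. This sketch reads the first as squared Euclidean distance and the second as the cosine of the angle, which is one common interpretation; returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math


def squared_distance(u, v):
    """Sum of squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A smaller distance means more similar documents, while a cosine closer to 1 means a smaller angle. Unlike distance, the cosine ignores vector length, so a long and a short document on the same topic can still score as similar.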

Document categories

  • The categories are just symbolic

  • Data: labeled instances

    • Training data
    • Held-out Data
    • Test data
  • Features: attribute-value pairs which characterize each X

  • Experimentation cycle

    • learn parameters
    • tune hyperparameters on held-out set
    • Compute accuracy on the test set
    • Very important: never “peek” at the test set
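
The data split behind this cycle can be sketched as follows. The 10% fractions, the fixed seed, and the function names are assumptions for illustration, not values from the lecture.

```python
import random


def split_data(instances, held_out_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle labeled instances and split them into three disjoint sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_held = int(len(shuffled) * held_out_frac)
    test = shuffled[:n_test]
    held_out = shuffled[n_test:n_test + n_held]
    train = shuffled[n_test + n_held:]
    return train, held_out, test


def accuracy(predict, labeled):
    """Fraction of (x, y) pairs for which predict(x) == y."""
    return sum(predict(x) == y for x, y in labeled) / len(labeled)
```

Parameters are learned on `train`, hyperparameters are compared with `accuracy` on `held_out`, and `test` is scored once at the very end.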

kNN classification

  • kNN = k nearest neighbors
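
A minimal kNN classifier over document vectors might look like the sketch below. It assumes squared Euclidean distance and a simple majority vote; the choice of distance and the default `k=3` are illustrative, not prescribed by the lecture.

```python
from collections import Counter


def knn_classify(query, training, k=3):
    """Label query by majority vote among its k nearest training vectors.

    training is a list of (vector, label) pairs.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    nearest = sorted(training, key=lambda pair: sq_dist(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

There is no separate training step: the labeled vectors are simply stored, and all the work happens at query time. The value of k is a hyperparameter.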

Q&A session

What is the key task of NLP systems/algorithms?

  • disambiguation

What is the difference between the connectivity matrix and the normalized connectivity matrix?

  • in the normalized matrix, each column sums to 1
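
Column normalization can be sketched as below. The 3-page matrix is a made-up example, and the convention that entry [i][j] is 1 when page j links to page i is an assumption; leaving all-zero columns as zeros is also a choice made here.

```python
def normalize_columns(matrix):
    """Divide each column by its sum; all-zero columns stay all zeros."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_sums = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [
        [matrix[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Hypothetical connectivity matrix for three pages
# (assumed convention: entry [i][j] is 1 if page j links to page i).
connectivity = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
]
normalized = normalize_columns(connectivity)
```

Each column of `normalized` sums to 1, so a column can be read as a distribution over the pages reached from one source page.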

What is polysemy?

  • a single word that has multiple related meanings

What is the difference between text categorization and text clustering?

  • in categorization the labels are known in advance; in clustering they are not

Lecture 6

  • Amplitude
  • Frequency

Lecture 8