Multimedia Information Systems Lecture 3 - 8
Natural language processing (NLP)
- a field of CS and AI concerned with tasks involving human language, such as human-machine communication
- Language is involved in many human activities: reading, writing, speaking
- ability to process massive amounts of textual data (data mining)
NLP applications:
advanced: speech recognition, machine translation.
basic: spelling and grammar checking
Natural language on the web
basic components: text, links, graphics (everything that is visual)
basic formatting: text, links, graphics
Main container for document: page
performance: page
architecture: site
captions: highly stylized and associated with pictures
Webpages
- contain all kinds of language, URLs, links…
Taxonomy vs. Folksonomy
Taxonomy (top-down)
- Hierarchical
- Exclusive : the same item cannot be in two distinct categories
Folksonomy
- bottom-up
- not exclusive: an item can be associated with many tags
- flat
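The contrast above can be sketched as two data structures: a nested tree in which each item has exactly one place, and a flat mapping from items to any number of tags. The category names, items, and helper functions below are made up for illustration.

```python
# Taxonomy: top-down hierarchy; each item appears under exactly one category.
taxonomy = {
    "animals": {
        "mammals": ["dog", "cat"],
        "birds": ["sparrow"],
    },
}

# Folksonomy: flat and bottom-up; each item can carry many user-chosen tags.
folksonomy = {
    "dog": {"pet", "cute", "mammal"},
    "sparrow": {"bird", "garden"},
}

def taxonomy_path(tree, item, path=()):
    """Return the single category path leading to item, or None if absent."""
    for name, child in tree.items():
        if isinstance(child, dict):
            found = taxonomy_path(child, item, path + (name,))
            if found is not None:
                return found
        elif item in child:
            return path + (name,)
    return None

def items_with_tag(tags_by_item, tag):
    """Return all items that carry the given tag, in sorted order."""
    return sorted(item for item, tags in tags_by_item.items() if tag in tags)
```

An item in the taxonomy has one path (for "dog", animals then mammals), while the same item in the folksonomy can be found through any of its tags.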
Linguistic concepts
- language is a system of symbols with an agreed-upon meaning that is used by people
- grammar is a set of rules
- syntax is how sentences are formed and structured
- semantics is the meaning of words and their relationships
- lexicon is the sum total of words in a language
- semiotics is the theory of signs and symbols
Ambiguity
- Natural language is highly ambiguous and must be disambiguated
- Lexical ambiguity
- syntactic ambiguity
- referential ambiguity
Ambiguity is ubiquitous
- speech recognition: “youth in Asia” vs. “euthanasia”
- syntactic analysis: “I ate noodles with chopsticks” vs. “I ate noodles with meatballs”
- semantic analysis “The dog is in the pen” vs “the ink is in the pen”
Natural languages vs. computer languages
- ambiguity is the primary difference
- programming languages are designed to be unambiguous
Why is NLP hard
- language is hard even for humans; the complexity and degree of specialization of the subfields of linguistics show this
Word segmentation
- A possible solution is maximum matching: start at the beginning of the string, choose the longest word in the dictionary that matches there, then continue from the end of that word
- Most successful word segmentation tools are based on machine learning techniques
- NLTK (nltk.org) is a Python toolkit for NLP
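A minimal sketch of maximum matching as described above. The function name, the tiny dictionaries, and the fallback of emitting a single character when nothing matches are assumptions made for illustration.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)  # dictionary must be non-empty
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink it.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character so the loop always advances.
            if candidate in dictionary or length == 1:
                words.append(candidate)
                i += length
                break
    return words

small_dictionary = {"i", "ate", "noodles"}
english_dictionary = {"the", "table", "down", "there", "theta", "bled", "own"}

good = max_match("iatenoodles", small_dictionary)
bad = max_match("thetabledownthere", english_dictionary)
```

Here `good` is segmented as intended, while `bad` comes out as "theta bled own there" instead of "the table down there". Greedy matching has no way to back out of an early long match, which is one reason learned segmenters tend to do better.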
Word level NLP
- Remove ambiguity
- lexical analysis
- stop word removal
- stemming (fishing, fished, fish, fisher)
- lemmatization: stemming + context
- morphology (prefix)
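Stop word removal and stemming from the list above can be sketched in a few lines. The stop word set and suffix list here are small illustrative assumptions, and the suffix stripper is a toy, not the Porter algorithm that NLTK provides as `nltk.stem.PorterStemmer`.

```python
# Illustrative subsets only; real systems use much larger lists and rule sets.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to"}
SUFFIXES = ("ing", "ed", "er", "s")

def remove_stop_words(tokens):
    """Drop very common words that carry little content."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

stems = [naive_stem(w) for w in ["fishing", "fished", "fish", "fisher"]]
content_words = remove_stop_words("the dog is in the pen".split())
```

All four forms in the stemming example reduce to "fish". A toy stemmer like this will over-strip many other words, which is part of why lemmatization also looks at context.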
Sentence Level NLP
- syntax
- disambiguation (polysemy, anaphora)
- parsing
Two levels in lexical semantics
- representing meaning of words
- word sense disambiguation
- compositional semantics: how word meanings combine and what the combination means
WordNet
- a lexical database of English organized by semantic relations (e.g. sets of synonyms)
Question Answering
- factoid: who, what, when
- non-factoid: definition
Text summarization
- given a piece of text, automatically produce a summary satisfying the required constraints
Document clustering
- How to organize a large set of documents of various topics
Classification methods
- manual classification
- Rule-based
- IDE-type development environments exist for writing such rules
Document vectors
- Documents are represented as “bags of words”
- Represented as vectors when used computationally
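A small sketch of turning documents into bag-of-words count vectors. The whitespace tokenizer and the function names are simplifying assumptions; word order is discarded and only the counts are kept.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def build_vocabulary(docs):
    """Sorted list of all distinct terms; its order defines the vector dimensions."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def to_vector(doc, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

docs = ["the dog is in the pen", "the ink is in the pen"]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(d, vocabulary) for d in docs]
```

With this vocabulary order (dog, in, ink, is, pen, the), the two sentences differ only in the "dog" and "ink" dimensions.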
Similarity between documents
- Sum of squared differences between document vectors
- Angle between vectors
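Both measures can be written directly from their definitions. The function names are illustrative, and the example vectors are assumed to be term counts for two short documents. For vectors $u$ and $v$, the angle $\theta$ satisfies $\cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$.

```python
import math

def squared_distance(u, v):
    """Sum of squared component-wise differences (squared Euclidean distance)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    """Angle between u and v in degrees, clamped against rounding error."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(cos))

doc_a = [1, 1, 0, 1, 1, 2]
doc_b = [0, 1, 1, 1, 1, 2]
distance = squared_distance(doc_a, doc_b)
similarity = cosine_similarity(doc_a, doc_b)
```

The angle ignores vector length, so a long document and a short one on the same topic can still be close, whereas the squared distance grows with differences in length.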
Document categories
The categories are just symbolic
Data: labeled instances
- Training data
- Held-out data
- Test data
Features: attribute-value pairs which characterize each X
Experimentation cycle
- learn parameters
- tune hyperparameters on held-out set
- Compute accuracy on the test set
- Very important: never “peek” at the test set
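The cycle above, sketched on a toy one-dimensional problem. The classifier (a threshold set to the training mean plus a margin), the data, and all names are assumptions chosen to keep the example small. The point is which split each step is allowed to touch.

```python
def fit(train, margin):
    """Learn the parameter (a threshold) from the training data only."""
    mean = sum(x for x, _ in train) / len(train)
    return mean + margin

def predict(threshold, x):
    """Label 1 if x lies above the threshold, else 0."""
    return 1 if x > threshold else 0

def accuracy(threshold, data):
    """Fraction of (x, label) pairs classified correctly."""
    correct = sum(1 for x, y in data if predict(threshold, x) == y)
    return correct / len(data)

train_set = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (10, 1)]
held_out_set = [(4, 0), (5, 0), (6, 1), (7, 1)]
test_set = [(2.5, 0), (5, 0), (6, 1), (12, 1)]

# Tune the hyperparameter (margin) using the held-out set, never the test set.
candidate_margins = [-2, 0, 2]
best_margin = max(
    candidate_margins,
    key=lambda m: accuracy(fit(train_set, m), held_out_set),
)

# Only after all choices are fixed is the test set used, exactly once.
final_threshold = fit(train_set, best_margin)
test_accuracy = accuracy(final_threshold, test_set)
```

If the margin were chosen by looking at `test_set`, the reported accuracy would no longer estimate performance on unseen data.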
kNN classification
- kNN = k nearest neighbors
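A minimal kNN sketch: a new document vector gets the majority label of its k closest labeled vectors. The two-dimensional vectors (imagined as counts of two hypothetical terms), the labels, and the choice of squared Euclidean distance are assumptions for illustration.

```python
from collections import Counter

def squared_distance(u, v):
    """Sum of squared component-wise differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def knn_classify(query, labeled_vectors, k=3):
    """Return the majority label among the k training vectors nearest to query."""
    ranked = sorted(labeled_vectors, key=lambda pair: squared_distance(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: [count of "goal", count of "vote"] per document.
training = [
    ([3, 0], "sports"),
    ([2, 1], "sports"),
    ([4, 1], "sports"),
    ([0, 3], "politics"),
    ([1, 4], "politics"),
    ([0, 2], "politics"),
]

label_a = knn_classify([3, 1], training, k=3)
label_b = knn_classify([1, 3], training, k=3)
```

kNN has no training step beyond storing the labeled data; k is a hyperparameter and would be tuned on held-out data. An odd k avoids tied votes when there are two classes.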
Q&A session
What is the key task of NLP systems/algorithms?
- disambiguation
What is the difference between the connectivity matrix and the normalized connectivity matrix?
- in the normalized matrix, each column sums to 1
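A short sketch of column normalization. The notes do not state the link convention, so the example assumes entry [i][j] = 1 means page j links to page i; the matrix itself and the handling of all-zero columns (left as zeros) are also assumptions.

```python
def normalize_columns(matrix):
    """Divide each entry by its column sum so that every non-zero column sums to 1."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    col_sums = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [
        [matrix[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]

# Hypothetical 3-page web: columns are source pages, rows are target pages.
connectivity = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
]
normalized = normalize_columns(connectivity)
column_totals = [sum(row[c] for row in normalized) for c in range(3)]
```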
What is polysemy?
- a single word having multiple related meanings
What is the difference between text categorization and text clustering?
- in categorization the labels are defined in advance; in clustering the groups are discovered from the data
Lecture 6
- Amplitude
- Frequency