- easy
- last time: all code, all math. this time, no code, no math.
- simple generalized linear models, not panel data or time series
- introductory, practical, short
- unstructured == text documents; you could conceivably take a similar approach with audio or visual data if you know the equivalent preprocessing (DSP or image processing), but I haven't done it
- motivation?
- improve overall predictive power of model
- imputation: fill in sparse or missing values in your structured dataset
- e.g., you only have gender, visit reason, or some other categorical value for a subset of observations
- or you have labeled data and equivalent corpora for observations outside your dataset
- infer relationships between a variable of interest and some latent random variable or variables you believe are represented in some way, obvious or not, in your unstructured data
- discussions of interactions with staff
- operational incidents
- ongoing operational issues
- really important to understand whatever your "customer" is going to care most about, of course
- what does your structured data look like?
- how many observations you got?
- sparse, dense?
- missing values?
- What does your unstructured data look like?
- what's the link?
- predictions about authors?
- about objects of discussion?
- 1-to-1 vs 1-to-many (hierarchical)
- how much of it you got?
- labeled? how?
- you have some reason to believe there is valuable information in there somewhere
- what's the link?
- How important are specific model evaluation techniques?
- statistical significance
- do you need p-values? do you need all the p-values?
- How do you feel about precision and recall? RMSE? RMAE? (quick sketch of these just below)
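If precision/recall or RMSE is what your customer wants, scikit-learn's metrics cover it. A minimal sketch with invented numbers; RMSE is taken as the square root of MSE to stay version-safe across sklearn releases:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, mean_squared_error

# toy classification labels, purely illustrative
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])
print(precision_score(y_true, y_pred))  # of the predicted 1s, how many were right
print(recall_score(y_true, y_pred))     # of the actual 1s, how many we caught

# toy continuous predictions for the regression side
y_obs = np.array([2.5, 0.0, 2.1])
y_hat = np.array([3.0, -0.5, 2.0])
print(np.sqrt(mean_squared_error(y_obs, y_hat)))  # RMSE
```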
- bag-of-ngrams
- chunking
- bag-of-chunks
- stopword filtering (a sketch of these preprocessing steps follows below)
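A minimal sketch of bag-of-ngrams with stopword filtering, using scikit-learn's CountVectorizer on made-up documents. Chunking would come from NLTK's chunkers and is omitted here to keep the sketch self-contained:

```python
from sklearn.feature_extraction.text import CountVectorizer

# two toy documents, invented for illustration
docs = [
    "the staff was friendly and the wait was short",
    "long wait, rude staff, will not return",
]

# unigrams + bigrams, with English stopwords dropped before counting
vec = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)               # sparse document-term matrix

print(vec.get_feature_names_out())        # surviving ngram vocabulary (sklearn >= 1.0)
print(X.toarray())                        # counts per document
```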
- The Kitchen Sink approach
- just throw it all together and regularize the hell out of it (sketched below)
- or use some deep learning methods
- pro: "simple"
- pro: usually performs reasonably well, often very well
- con: no simple - or even complicated - answer for "why"
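A minimal kitchen-sink sketch on synthetic data: structured columns stacked next to tf-idf text features, with a heavy L2 penalty doing the "regularize the hell out of it" part:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# invented data: text, two structured columns (e.g. age, is_member), outcome
docs = ["friendly staff", "rude staff, long wait", "short wait", "never again"]
structured = np.array([[34, 1], [51, 0], [29, 1], [44, 0]], dtype=float)
y = np.array([1, 0, 1, 0])

# stack structured features next to the sparse text features
X = hstack([csr_matrix(structured), TfidfVectorizer().fit_transform(docs)])

# small C = strong L2 penalty; the penalty does the work of feature selection
clf = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
print(clf.predict_proba(X)[:, 1])
```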
- Heuristics
- pro: simple to explain/interpret
- con: difficult to capture subtlety/nuance in any real way (the sketch below shows a miss on negation)
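A heuristic sketch with a hypothetical keyword list; the second call shows how easily negation slips past this kind of rule:

```python
# hand-written keyword flag: trivial to explain to a stakeholder
COMPLAINT_WORDS = {"rude", "dirty", "slow", "broken"}

def complaint_flag(text):
    """1 if any complaint keyword appears, else 0."""
    return int(bool(set(text.lower().split()) & COMPLAINT_WORDS))

print(complaint_flag("The staff was rude and slow"))     # 1, as intended
print(complaint_flag("not slow at all, actually fast"))  # 1 -- negation lost
```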
- Full generative model
- pro: full probability model
- pro: you are probably smarter
- con: difficult to do well
- con: probably not worth the effort
- con: time consuming
- Meeting halfway...
- Sub-classifiers, with predictions used as features in core model
- need sub-labels
- propagate the uncertainty (pass predicted probabilities through, not hard labels; see the sketch below)
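A sketch of the sub-classifier route, all data invented: a small text classifier is trained on the sub-labeled subset, and its predicted probability (rather than a hard 0/1 label) becomes a feature in the core model:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# toy corpus with sub-labels ("is this a staff complaint?")
docs = ["manager was rude", "great service", "cashier ignored us",
        "lovely visit", "slow line, rude staff", "quick and friendly"]
staff_complaint = np.array([1, 0, 1, 0, 1, 0])   # sub-labels
repeat_visit = np.array([0, 1, 0, 1, 0, 1])      # core outcome
age = np.array([[40.0], [31.0], [55.0], [28.0], [47.0], [33.0]])  # structured covariate

# sub-classifier on the text
vec = CountVectorizer()
sub_clf = LogisticRegression().fit(vec.fit_transform(docs), staff_complaint)
p_complaint = sub_clf.predict_proba(vec.transform(docs))[:, 1].reshape(-1, 1)

# core model sees the structured covariate plus the soft prediction
X_core = np.hstack([age, p_complaint])
core = LogisticRegression().fit(X_core, repeat_visit)
print(core.predict_proba(X_core)[:, 1])
```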
- unsupervised and semi-supervised "summarization", with results used as features in core model
- clustering
- abstraction
- dimensionality reduction
- need lots of data (the sketch below is toy-sized)
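A toy sketch of the dimensionality-reduction flavor, using truncated SVD (LSA) from scikit-learn; gensim's topic models would slot into the same spot. Real use wants far more documents than this:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# invented review snippets
docs = [
    "rude staff long wait", "friendly staff short wait",
    "clean store helpful staff", "dirty store long line",
    "helpful and friendly", "long wait dirty tables",
]

X_text = TfidfVectorizer().fit_transform(docs)

# compress the sparse ngram space into 2 dense "summary" columns per doc,
# which then go into the core model as ordinary features
svd = TruncatedSVD(n_components=2, random_state=0)
doc_summary = svd.fit_transform(X_text)
print(doc_summary)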
- Sub-classifiers, with predictions used as features in core model
- econometrics and expectations: generated features complicate the usual story
- interpretation of coefs (a coefficient on a predicted feature is conditional on the sub-model that produced it)
- significance testing (see the statsmodels sketch below)
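A sketch of what the core model can look like once a text-derived feature sits in the design matrix, using statsmodels on synthetic data. The p-value on the generated feature deserves extra skepticism, since the feature is itself an estimate:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
revenue_rank = rng.normal(0, 1, n)    # structured covariate
p_complaint = rng.uniform(0, 1, n)    # stand-in for a sub-classifier's probability
y = 2.0 + 0.5 * revenue_rank - 1.5 * p_complaint + rng.normal(0, 1, n)

# Gaussian GLM gives the familiar coefficient table with p-values
X = sm.add_constant(np.column_stack([revenue_rank, p_complaint]))
model = sm.GLM(y, X, family=sm.families.Gaussian()).fit()
print(model.summary())
```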
- predict receptiveness of potential targets for a promotional campaign
- use for rank ordering
- structured data: demographic info, customer activity
- unstructured data: social media presence/postings
- predicting operational metrics at different locations of a business
- use for informing business decisions, changes in operations
- structured data: operational metrics, geographic/region demographics
- unstructured data: social reviews and survey responses for location
- predict merchant upside for participating in a "daily deal" promotion
- use for rank ordering sales decisions and informing business decisions
- structured data: operational metrics, geographic/region demographics
- unstructured data: social reviews and survey responses for location
- tools matter generally
- community, tutorials, etc
- SAS user group
- can't help you
- python: results from the first three packages below (numpy/scipy being item zero) can be fed into statsmodels
0. numpy/scipy
- NLTK (comprehensive python NLP package)
- scikit-learn (python machine-learning package)
- gensim (python topic modeling toolkit)
- theano (deep learning)
- R
- tm (R package)
- Hadley Wickham gives good package
- Questions?
- everything will be on my personal website: www.obscureanalytics.com
- and on GitHub