{
"metadata": {
"celltoolbar": "Slideshow",
"name": "",
"signature": "sha256:4e615f69521d1c7447158f9c3bcb7e9b23d635a0ab252dd7242206800ec209f5"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Text Mining: A Case Study"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#Outline\n",
"\n",
"* Text Classification Examples"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Machine Learning/Classification Pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Plain Vanilla approach: TF-IDF weighting + Support Vector Machine (SVM) "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Demo: banner classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Text Classification Examples\n",
"\n",
"## Banner Classification based on the raw text from a receipt\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#Text Classification Examples\n",
"\n",
"##Categorization based on item description \n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Machine Learning/Classification Pipeline\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"\n",
"
Banner Classification Pipeline
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#Term Frequency\n",
"\n",
"> The number of times that a given word(s) occur(s) on the receipts of the given banner (document).\n",
"\n",
"### Term Frequency Matrix (10000 receipts)\n",
"\n",
"\n",
"**word** | \n",
"**Walmart(2240)** | \n",
"**non-Walmart (7760)** | \n",
"
\n",
"\n",
"\n",
"*live* | \n",
"$$1934$$ | \n",
"$$204$$ | \n",
"
\n",
"\n",
"\n",
"*money* | \n",
"$$1871$$ | \n",
"$$88$$ | \n",
"
\n",
"\n",
"\n",
"*walmart* | \n",
"$$1632$$ | \n",
"$$29$$ | \n",
"
\n",
"\n",
"\n",
"*manager* | \n",
"$$1529$$ | \n",
"$$24$$ | \n",
"
\n",
"
\n",
"\n",
"* binary case 'live live' is only 1."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Inverse Document Frequency\n",
"\n",
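 Measure of">
"> A measure of how much information a word provides, that is, whether the term is common or rare across all receipts (documents).\n",
"\n",
"The document frequency is the number of receipts containing the given word; the IDF is the logarithm of the total number of receipts (10000) divided by that count:\n",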
"\n",
"\n",
"\n",
"**word** | \n",
"**Walmart(2240)** | \n",
"**non-Walmart (7760)** | \n",
"
\n",
"\n",
"\n",
"*live* | \n",
"$$\\log\\frac{10000}{1934}$$ | \n",
"$$\\log\\frac{10000}{204}$$ | \n",
"
\n",
"\n",
"\n",
"*money* | \n",
"$$\\log\\frac{10000}{1871}$$ | \n",
"$$\\log\\frac{10000}{88}$$ | \n",
"
\n",
"\n",
"\n",
"*walmart* | \n",
"$$\\log \\frac{10000}{1632}$$ | \n",
"$$\\log \\frac{10000}{29}$$ | \n",
"
\n",
"\n",
"\n",
"*manager* | \n",
"$$\\log \\frac{10000}{1529}$$ | \n",
"$$\\log \\frac{10000}{24}$$ | \n",
"
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# TF-IDF + SVM\n",
"\n",
"* multiply TF with IDF = TF-IDF matrix\n",
"\n",
"\n",
"* use the TF-IDF matrix to get the features\n",
"\n",
"\n",
"* pump the features into SVM\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#Tools\n",
"\n",
"\n",
""
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"\n",
"fn = 'c:/work/fun/ds-meetup/data.csv'\n",
"data = pd.read_csv(fn)\n",
"data[3:10]"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" banner_key | \n",
" text | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" whole_foods | \n",
" WFZLE FOODS Invest in a future without poverty... | \n",
"
\n",
" \n",
" 4 | \n",
" dillons_marketplace | \n",
" dlito 11.4#;1111,770dIons.;_um Great food. Low... | \n",
"
\n",
" \n",
" 5 | \n",
" sams_club | \n",
" LLUR MANAGER J CUNNINGHAM (907) 522 - 2333 ANC... | \n",
"
\n",
" \n",
" 6 | \n",
" cvs | \n",
" CVS13114airmacy 10623 618(3(0110N, RIVERVIEW, ... | \n",
"
\n",
" \n",
" 7 | \n",
" rite_aid | \n",
" 1120 331 1ith us, ifs personal. Stcre #00443 3... | \n",
"
\n",
" \n",
" 8 | \n",
" walmart | \n",
" Wallmart Save money. Live better. Self Checkou... | \n",
"
\n",
" \n",
" 9 | \n",
" walmart | \n",
" Walmart Save money. Livn better. 205 1 7si 972... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 1,
"text": [
" banner_key text\n",
"3 whole_foods WFZLE FOODS Invest in a future without poverty...\n",
"4 dillons_marketplace dlito 11.4#;1111,770dIons.;_um Great food. Low...\n",
"5 sams_club LLUR MANAGER J CUNNINGHAM (907) 522 - 2333 ANC...\n",
"6 cvs CVS13114airmacy 10623 618(3(0110N, RIVERVIEW, ...\n",
"7 rite_aid 1120 331 1ith us, ifs personal. Stcre #00443 3...\n",
"8 walmart Wallmart Save money. Live better. Self Checkou...\n",
"9 walmart Walmart Save money. Livn better. 205 1 7si 972..."
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
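"# Share of receipts per banner, largest first\n",
"stat = data[['banner_key']].copy()  # copy, so adding a column does not touch `data`\n",
"stat['ratio'] = 0  # placeholder column; aggregate(len) turns it into a per-banner count\n",
"stat = (stat.groupby('banner_key').aggregate(len) / float(stat.shape[0])).sort(['ratio'], ascending=False)\n",
"stat[:10]"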
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ratio | \n",
"
\n",
" \n",
" banner_key | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" walmart | \n",
" 0.2240 | \n",
"
\n",
" \n",
" target | \n",
" 0.0913 | \n",
"
\n",
" \n",
" walgreens | \n",
" 0.0460 | \n",
"
\n",
" \n",
" publix | \n",
" 0.0405 | \n",
"
\n",
" \n",
" kroger | \n",
" 0.0399 | \n",
"
\n",
" \n",
" cvs | \n",
" 0.0338 | \n",
"
\n",
" \n",
" costco | \n",
" 0.0273 | \n",
"
\n",
" \n",
" dollar_tree | \n",
" 0.0249 | \n",
"
\n",
" \n",
" safeway | \n",
" 0.0208 | \n",
"
\n",
" \n",
" meijer | \n",
" 0.0189 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
" ratio\n",
"banner_key \n",
"walmart 0.2240\n",
"target 0.0913\n",
"walgreens 0.0460\n",
"publix 0.0405\n",
"kroger 0.0399\n",
"cvs 0.0338\n",
"costco 0.0273\n",
"dollar_tree 0.0249\n",
"safeway 0.0208\n",
"meijer 0.0189"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#let us focus on only the biggest banner: walmart ==> binary classifier\n",
"\n",
"#map ~walmart ==> other\n",
"data['banner_key'][~data['banner_key'].isin(['walmart'])] = 0\n",
"data['banner_key'][data['banner_key'].isin(['walmart'])] = 1\n",
"data[3:10]"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" banner_key | \n",
" text | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 0 | \n",
" WFZLE FOODS Invest in a future without poverty... | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" dlito 11.4#;1111,770dIons.;_um Great food. Low... | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" LLUR MANAGER J CUNNINGHAM (907) 522 - 2333 ANC... | \n",
"
\n",
" \n",
" 6 | \n",
" 0 | \n",
" CVS13114airmacy 10623 618(3(0110N, RIVERVIEW, ... | \n",
"
\n",
" \n",
" 7 | \n",
" 0 | \n",
" 1120 331 1ith us, ifs personal. Stcre #00443 3... | \n",
"
\n",
" \n",
" 8 | \n",
" 1 | \n",
" Wallmart Save money. Live better. Self Checkou... | \n",
"
\n",
" \n",
" 9 | \n",
" 1 | \n",
" Walmart Save money. Livn better. 205 1 7si 972... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
" banner_key text\n",
"3 0 WFZLE FOODS Invest in a future without poverty...\n",
"4 0 dlito 11.4#;1111,770dIons.;_um Great food. Low...\n",
"5 0 LLUR MANAGER J CUNNINGHAM (907) 522 - 2333 ANC...\n",
"6 0 CVS13114airmacy 10623 618(3(0110N, RIVERVIEW, ...\n",
"7 0 1120 331 1ith us, ifs personal. Stcre #00443 3...\n",
"8 1 Wallmart Save money. Live better. Self Checkou...\n",
"9 1 Walmart Save money. Livn better. 205 1 7si 972..."
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.svm import LinearSVC\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.grid_search import GridSearchCV"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"prompt_number": 38
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"\"\" Get train data; X=input & y=target \"\"\"\n",
"#only 200 samples\n",
"X_train = data['text'][:200]\n",
"y_train = data['banner_key'][:200].astype(int)"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"prompt_number": 29
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"\"\" Pipeline: raw text ==> TFIDF ==> Linear SVM ==> banner \"\"\"\n",
"pl = Pipeline([\n",
" ('vectorizer', TfidfVectorizer(sublinear_tf=True,analyzer='word')),\n",
" ('classifier', LinearSVC(C=1))\n",
" ])"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"prompt_number": 39
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"\"\" Setup the paramaters \"\"\"\n",
"parameters = {'vectorizer__use_idf':[True,False], \n",
" 'vectorizer__ngram_range':[(1,3)], \n",
" 'vectorizer__binary':[True,False], \n",
" 'classifier__dual':[True], \n",
" 'classifier__C':[1,10]} "
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"prompt_number": 40
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"\"\" GridSearch w/ cross-validation \"\"\"\n",
"n_cores = 1\n",
"grid_search = GridSearchCV(pl, parameters, cv = 5, scoring = 'f1', \n",
" n_jobs = n_cores, verbose=1, refit=True, \n",
" iid=False) \n",
"grid_search.fit(X_train, y_train) #Search the best parameter setting"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.7s\n",
"[Parallel(n_jobs=1)]: Done 40 out of 40 | elapsed: 31.6s finished\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Fitting 5 folds for each of 8 candidates, totalling 40 fits\n"
]
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 41,
"text": [
"GridSearchCV(cv=5,\n",
" estimator=Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, charset=None,\n",
" charset_error=None, decode_error=u'strict',\n",
" dtype=, encoding=u'utf-8', input=u'content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1),...ling=1, loss='l2', multi_class='ovr', penalty='l2',\n",
" random_state=None, tol=0.0001, verbose=0))]),\n",
" fit_params={}, iid=False, loss_func=None, n_jobs=1,\n",
" param_grid={'vectorizer__use_idf': [True, False], 'vectorizer__ngram_range': [(1, 3)], 'classifier__C': [1, 10], 'vectorizer__binary': [True, False], 'classifier__dual': [True]},\n",
" pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='f1',\n",
" verbose=1)"
]
}
],
"prompt_number": 41
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'f1 score : %.2f%%' % (grid_search.best_score_*100)\n",
"print(\"Best parameter set:\")\n",
"best_parameters = grid_search.best_estimator_.get_params()\n",
"for param_name in sorted(parameters.keys()):\n",
" print(\"\\t%s: %r\" % (param_name, best_parameters[param_name]))\n",
"\n",
"clf_best = grid_search.best_estimator_"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"f1 score : 97.89%\n",
"Best parameter set:\n",
"\tclassifier__C: 1\n",
"\tclassifier__dual: True\n",
"\tvectorizer__binary: False\n",
"\tvectorizer__ngram_range: (1, 3)\n",
"\tvectorizer__use_idf: False\n"
]
}
],
"prompt_number": 42
}
],
"metadata": {}
}
]
}