Elasticsearch Taste Plugin is Mahout Taste-based Collaborative Filtering implementation. This plugin provides the following features of a recommendation engine on Elasticsearch:
- Data management for Users/Items/Preferences.
- Item-based Recommender.
- User-based Recommender.
- Similar Users/Contents
- Text Analysis
Taste | Tested on elasticsearch |
---|---|
master | 2.4.X |
2.3.0 | 2.3.5 |
1.5.0 | 1.5.1 |
0.4.0 | 1.4.0 |
0.3.1 | 1.3.1 |
Note that this plugin supports Java 8 or the above. See README_river.md under Elasticsearch Taste 0.4.0.
Please file an issue. (Japanese forum is here.)
$ $ES_HOME/bin/plugin install org.codelibs/elasticsearch-taste/2.3.0
In this section, using MovieLens data set, you can learn about Taste plugin. For more information about MovieLens, see MovieLens site. The preference data set is u.data, it contains user id, item id, rating and timestamp. Download it and then insert data by Event API:
curl -o u.data http://files.grouplens.org/datasets/movielens/ml-100k/u.data
cat u.data | awk '{system("curl -XPOST localhost:9200/movielens/_taste/event?pretty -d \"{\\\"user\\\":{\\\"id\\\":" $1 "},\\\"item\\\":{\\\"id\\\":" $2 "},\\\"value\\\":" $3 ",\\\"timestamp\\\":" $4 "000}\"")}'
By the above request, the preference values are stored in "movielens" index. After inserting data, check them in the index:
curl -XGET "localhost:9200/movielens/_search?q=*:*&pretty"
To compute recommended items from users, execute the following request:
curl -XPOST localhost:9200/_taste/action/recommended_items_from_user -d '{
"num_of_items": 10,
"data_model": {
"cache": {
"weight": "100m"
}
},
"index_info": {
"index": "movielens"
}
}'
The result is stored in movielens/recommendation. The response of the above request has a action name at the "name" property. To check if your request is finished, send a GET request:
curl -XGET "localhost:9200/_taste/action?pretty"
If the response contains the action name, your request is not completed yet.
For checking the result, send the following query:
curl -XGET "localhost:9200/movielens/recommendation/_search?q=*:*&pretty"
A value of "user_id" property is a target user, "items" property is recommended items which contains "item_id" and "value" property. A "value" property is a value of similarity.
The computation might take a long time... If you want to stop it, execute below:
curl -XDELETE localhost:9200/_taste/action/{action_name}
You can evaluate parameters, such as similarity and neighborhood, with the following "evaluate_items_from_user" action.
curl -XPOST localhost:9200/_taste/action/evaluate_items_from_user -d '{
"evaluation_percentage": 1.0,
"training_percentage": 0.9,
"margin_for_error": 1.0,
"data_model": {
"cache": {
"weight": "100m"
}
},
"index_info": {
"index": "movielens"
},
"neighborhood": {
"factory": "org.codelibs.elasticsearch.taste.neighborhood.NearestNUserNeighborhoodFactory",
"neighborhood_size": 100
},
"evaluator": {
"id": "movielens_result",
"factory": "org.codelibs.elasticsearch.taste.eval.RMSEvaluatorFactory"
}
}'
The result is stored in report type:
curl -XGET "localhost:9200/movielens/report/_search?q=evaluator_id:movielens_result&pretty"
{
"_index": "movielens",
"_type": "report",
"_id": "COscswwlQIugP2Rf0XKH-w",
"_version": 1,
"_score": 1,
"_source": {
"evaluation": {
"score": 1.0199981244413419,
"preference": {
"total": 10092,
"no_estimate": 178,
"success": 6671,
"failure": 3243
},
"time": {
"average_processing": 18,
"total_processing": 184813,
"max_processing": 22881
}
"target": {
"test": 915,
"training": 943
}
}
"@timestamp": "2014-05-07T21:15:43.948Z",
"evaluator_id": "movielens_result",
"report_type": "user_based",
"config": {
"training_percentage": 0.9,
"evaluation_percentage": 1,
"margin_for_error": 1
}
}
}
Calcurating similar users, run the following request:
curl -XPOST localhost:9200/_taste/action/similar_users -d '{
"num_of_users": 10,
"data_model": {
"cache": {
"weight": "100m"
}
},
"index_info": {
"index": "movielens"
}
}'
If you check similar users for User ID 1, type the below:
curl -XGET "localhost:9200/movielens/user_similarity/_search?q=user_id:1&pretty"
This plugin has own ID, such as user_id or item_id. If you want to access your system ID, use the following search API:
curl -XGET "localhost:9200/{index}/{type}/_taste/{user|item}/{your_user_or_item_id}?pretty"
For example, if you get similar users from ID=115:
curl -XGET localhost:9200/movielens/recommendation/_taste/user/1
You can create a term vector from your index which has a field with "term_vector". For the following example, text data are stored into "description" field in "ap" index.
curl -XPOST localhost:9200/ap?pretty
curl -XPOST localhost:9200/ap/article/_mapping?pretty -d '{
"article" : {
"properties" : {
"id" : {
"type" : "string",
"index" : "not_analyzed"
},
"label" : {
"type" : "string",
"index" : "not_analyzed"
},
"description" : {
"type" : "string",
"analyzer" : "standard",
"term_vector": "with_positions_offsets_payloads",
"store" : true
}
}
}
}'
curl -o ap.txt http://mallet.cs.umass.edu/ap.txt
sed -e "s/[\'\`\\]//g" ap.txt | awk -F"\t" '{sub(" ","",$1);system("curl -XPOST localhost:9200/ap/article/" $1 " -d \"{\\\"id\\\":\\\"" $1 "\\\",\\\"label\\\":\\\"" $2 "\\\",\\\"description\\\":\\\"" $3 "\\\"}\"")}'
To generate the term vector, run the following river:
curl -XPOST localhost:9200/_taste/action/generate_term_values?pretty -d '{
"source": {
"index": "ap",
"type": "article",
"fields": ["description"]
},
"event": {
"user_index": "ap_term",
"user_type": "doc",
"item_index": "ap_term",
"item_type": "term",
"preference_index": "ap_term",
"preference_type": "preference"
}
}'
source property specify index, type and fields as source information, and event property is output information which is user/item/preference format.
This plugin manages data of Users, Items and Preferences on Elasticsearch. These data are stored in each index and type. By default, they are in:
Data | Index | Type |
---|---|---|
User | (Any) | user |
Item | (Any) | item |
Preference | (Any) | preference |
User index manages id and information about itself. User index contains the following properties:
Name | Type | Description |
---|---|---|
system_id | string | Unique ID. If your system has User ID, use it. |
user_id | long | Unique User ID for this plugin. |
@timestamp | date | Created/Updated time. |
(Any) | (any) | You can use other information, such as age, tel,... |
Item index manages an item information and contains:
Name | Type | Description |
---|---|---|
system_id | string | Unique ID. If your system has Item ID, use it. |
item_id | long | Unique Item ID for this plugin. |
@timestamp | date | Created/Updated time. |
(Any) | (any) | You can use other information, such as price, publish date,... |
Preference index manages values rated by an user for an item.
Name | Type | Description |
---|---|---|
user_id | long | Unique User ID for this plugin. |
item_id | long | Unique Item ID for this plugin. |
value | float | Value rated by user_id for item_id. |
@timestamp | date | Created/Updated time. |
You can create/update/delete the above index and document with Elasticsearch manner.
Taste plugin provides an useful API to register a preference value. If you send it to http://.../{index}/_taste/event, user_id and item_id are generated and then the preference value is inserted. You can send multiple data with one request by using carriage return.
For example, if User ID is "U0001", Item ID is "I1000" and the preference(rating) value is 10.0, the request is below:
curl -XPOST localhost:9200/sample/_taste/event -d '{ user: { id: "U0001" }, item: { id: "I1000" }, value: 10.0 }'
If you send multiple data, the request is below:
curl -XPOST localhost:9200/sample/_taste/event -d \
'{ user: { id: "U0001" }, item: { id: "I1000" }, value: 10.0 }
{ user: { id: "U0002" }, item: { id: "I1001" }, value: 15.0 }'
Values of user_id, item_id and @timestamp are generated automatically. They are stored in "sample" index. The index name can be changed to any name you want.
Recommended items for an user are computed from similar users. The computing process is started by creating a river configuration. If 10 recommended items are generated from similar users, the river configuration is:
curl -XPOST localhost:9200/_taste/action/recommended_items_from_user -d '{
"num_of_items": 10,
"max_duration": 120,
"data_model": {
"class": "org.codelibs.elasticsearch.taste.model.ElasticsearchDataModel",
"scroll": {
"size": 1000,
"keep_alive": 60
},
"cache": {
"weight": "100m"
}
},
"index_info": {
"preference": {
"index": "sample",
"type": "preference"
},
"user": {
"index": "sample",
"type": "user"
},
"item": {
"index": "sample",
"type": "item"
},
"recommendation": {
"index": "sample",
"type": "recommendation"
},
"field": {
"user_id": "user_id",
"item_id": "item_id",
"value": "value",
"items": "items",
"timestamp": "@timestamp"
}
},
"similarity": {
"factory": "org.codelibs.elasticsearch.taste.similarity.LogLikelihoodSimilarityFactory"
},
"neighborhood": {
"factory": "org.codelibs.elasticsearch.taste.neighborhood.NearestNUserNeighborhoodFactory",
"min_similarity": 0.9,
"neighborhood_size": 10
}
}'
The configuration is:
Name | Type | Description |
---|---|---|
num_of_items | int | The number of recommended items. |
max_duration | int | Max duration for computing(min). |
data_model.class | string | Class name for DataModel implementation. |
data_model.scroll | object | Elasticsearch scroll parameters. |
data_model.cache | string | Cache size for the data model. |
index_info | object | Index information(index/type/property name). |
similarity.factory | string | Factroy name for Similarity implementation. |
neighborhood.factory | string | Factroy name for Neighborhood implementation. |
The recommended items are stored in sample/recommendation. You can see the result by:
curl -XGET "localhost:9200/sample/recommendation/_search?q=*:*&pretty"
The value of "items" property is recommended items.
To evaluate parameters for generating recommended items, you can use the following "evaluate_items_from_user" action.
curl -XPOST localhost:9200/_taste/action/evaluate_items_from_user -d '{
"evaluation_percentage": 0.5,
"training_percentage": 0.9,
"data_model": {
"class": "org.codelibs.elasticsearch.taste.model.ElasticsearchDataModel",
"scroll": {
"size": 1000,
"keep_alive": 60
},
"cache": {
"weight": "100m"
}
},
"index_info": {
"preference": {
"index": "sample",
"type": "preference"
},
"user": {
"index": "sample",
"type": "user"
},
"item": {
"index": "sample",
"type": "item"
},
"report": {
"index": "sample",
"type": "report"
},
"field": {
"user_id": "user_id",
"item_id": "item_id",
"value": "value",
"timestamp": "@timestamp"
}
},
"similarity": {
"factory": "org.codelibs.elasticsearch.taste.similarity.LogLikelihoodSimilarityFactory"
},
"neighborhood": {
"factory": "org.codelibs.elasticsearch.taste.neighborhood.NearestNUserNeighborhoodFactory"
}
}'
Name | Type | Description |
---|---|---|
evaluation_percentage | float | A sampling rate for a data model. 50% data is used if 0.5, |
training_percentage | float | A rate for a training data set of sampled data model. If 0.9, 90% data is used as training one and 10% data is for testing. |
The result is stored in report index(This example is sample/report).
Recommended items for an item are computed from similar items. The computing process is started by creating a river configuration. If 10 recommended items are generated from similar items, the river configuration is:
curl -XPOST localhost:9200/_taste/action/recommended_items_from_item -d '{
"num_of_items": 10,
"max_duration": 120,
"data_model": {
"class": "org.codelibs.elasticsearch.taste.model.ElasticsearchDataModel",
"scroll": {
"size": 1000,
"keep_alive": 60
},
"cache": {
"weight": "100m"
}
},
"index_info": {
"preference": {
"index": "sample",
"type": "preference"
},
"user": {
"index": "sample",
"type": "user"
},
"item": {
"index": "sample",
"type": "item"
},
"item_similarity": {
"index": "sample",
"type": "item_similarity"
},
"field": {
"user_id": "user_id",
"item_id": "item_id",
"value": "value",
"items": "items",
"timestamp": "@timestamp"
}
},
"similarity": {
"factory": "org.codelibs.elasticsearch.taste.similarity.LogLikelihoodSimilarityFactory"
}
}'
The recommended items are stored in sample/item_similarity. You can see the result by:
curl -XGET "localhost:9200/sample/item_similarity/_search?q=*:*&pretty"
The value of "items" property is recommended items.