Examples
-
Language detection
Let's write a simple language detector. It will recognize German and English texts, but you can easily add more languages.
We will need the NaiveText gem, so go ahead and install it:
gem install NaiveText
We will also need some example texts from both languages, so create a directory training with two subdirectories, training/german and training/english, and put some training material into them. Normally a few texts will do, but I decided to use a list of commonly used words for each language. You can find it together with the code here.
require "NaiveText"

class LanguageDetector
  def initialize
    german_examples = ExamplesFactory.from_files('training/german')
    english_examples = ExamplesFactory.from_files('training/english')
    categories_config = [
      { name: 'german', examples: german_examples },
      { name: 'english', examples: english_examples }
    ]
    # Keep the classifier in an instance variable so other methods can use it.
    @classifier = NaiveText.build(categories: categories_config)
  end
end
The code above does two things:
- It loads our examples (saved previously).
- It builds a classifier.
To build a simple classifier you only need to specify an array of categories, each consisting of a name and some example texts that guide the classifier.
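To get a feeling for what a classifier can do with such categories, here is a minimal sketch of a word-counting naive Bayes classifier in plain Ruby. It is not how the NaiveText gem is implemented. The class name TinyBayes, the tokenizer, the add-one smoothing and the equal priors are all assumptions made purely for illustration.

```ruby
# Illustration only: a tiny word-count naive Bayes classifier.
# TinyBayes is a hypothetical name, not part of the NaiveText gem.
class TinyBayes
  # categories: array of hashes like { name: 'german', examples: ['der die das'] }
  def initialize(categories)
    @counts = categories.to_h do |category|
      words = category[:examples].flat_map { |text| tokenize(text) }
      [category[:name], words.tally]
    end
    @vocabulary_size = @counts.values.flat_map(&:keys).uniq.size
  end

  # Returns the name of the category with the highest log score.
  def classify(text)
    @counts.keys.max_by { |name| log_score(name, text) }
  end

  private

  # Splits a text into lowercase words.
  def tokenize(text)
    text.downcase.scan(/\p{L}+/)
  end

  # Sum of log word likelihoods with add-one (Laplace) smoothing,
  # so unseen words do not zero out a category. Equal priors are assumed.
  def log_score(name, text)
    counts = @counts[name]
    total = counts.values.sum + @vocabulary_size
    tokenize(text).sum { |word| Math.log((counts.fetch(word, 0) + 1.0) / total) }
  end
end

classifier = TinyBayes.new([
  { name: 'german',  examples: ['der die das und ist nicht ein text'] },
  { name: 'english', examples: ['the and is not a this text of'] }
])
puts classifier.classify('this is the text') # ==> english
puts classifier.classify('das ist ein text') # ==> german
```

The more example words a category shares with the input, the higher its score, which is why a plain list of common words per language already works reasonably well as training material.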
Next we will write two methods to interface with the classifier:
require "NaiveText"

class LanguageDetector
  def initialize
    ... # like above
  end

  # Returns the name of the best-matching category as a string.
  def get_language(args)
    @classifier.classify(args[:text]).name
  end

  # Returns the probabilities for all categories.
  def probabilities(args)
    @classifier.probabilities(args[:text])
  end
end
The first method, get_language, takes a text as a keyword argument and returns the corresponding language as specified in our categories array. Note that classifier.classify returns a category, so you need to call name on it to get the language name as a string.
The second method, probabilities, also takes a text as a keyword argument, but it returns a collection of probabilities. You can print it using puts (example below).
Now we can use our language detector:
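How a collection of probabilities can come out of a classifier is perhaps easiest to see in a small sketch. Assuming the classifier computes one log score per category (an assumption, the gem may well work differently), the scores can be normalized so that they sum to one. The method name normalize and the example scores are made up for illustration.

```ruby
# Illustration only: turn per-category log scores into probabilities.
# The method name and the input scores are hypothetical, not NaiveText's API.
def normalize(log_scores)
  # Subtract the largest score before exponentiating to avoid underflow.
  max = log_scores.values.max
  weights = log_scores.transform_values { |score| Math.exp(score - max) }
  total = weights.values.sum
  weights.transform_values { |weight| weight / total }
end

probabilities = normalize({ 'german' => -42.0, 'english' => -40.0 })
probabilities.each do |name, probability|
  puts format('%-8s %.3f', name, probability)
end
```

A text that mixes languages ends up with a less lopsided distribution, which is exactly what makes the probabilities more informative than the single best category.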
... # Code from above

detector = LanguageDetector.new
puts detector.get_language(text: 'This is obviously an english text') # ==> english
puts detector.get_language(text: 'Der Text ist deutsch') # ==> german

multi_language = 'This is an english text containing just one german word "Herren" meaning gentleman. Lets see what happens.'
puts detector.probabilities(text: multi_language)
The full source of the example can be found here.
-
Spam filter
In this example we will write a simple spam filter and integrate it into a Rails app that lists user posts. It shows how you can reduce spam in user-created Rails models.
We begin by creating a simple Rails app.
$ rails new SpamFree
Next we scaffold the posts resource, which has one field for the post's text and one for the content_type. The content_type of a post is one of the following strings:
'spam', 'content' or 'verified_content'
and will be set by our spam filter. As you would expect, posts with a content_type of 'spam' are spam and aren't accessible to users. The difference between 'verified_content' and 'content' is that verified_content has been checked by you (or a moderator) and is guaranteed to contain no spam. It is used to train the spam filter.
So go ahead and scaffold your Rails app:
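The roles of the three values can be summarized in a few lines of plain Ruby. This is only a sketch: the constant and method names are hypothetical and not part of the app, and it assumes that spam posts also serve as training examples (for the spam category).

```ruby
# Illustration only: the three content_type values and what each one implies.
# The constant and method names below are hypothetical, not part of the app.
CONTENT_TYPES = %w[spam content verified_content].freeze

# Everything except spam is shown to users.
def visible?(content_type)
  content_type != 'spam'
end

# Only spam and moderator-verified posts serve as training examples.
def training_example?(content_type)
  %w[spam verified_content].include?(content_type)
end

CONTENT_TYPES.each do |type|
  puts "#{type}: visible=#{visible?(type)} training=#{training_example?(type)}"
end
```

Unverified 'content' is deliberately left out of the training data, since it has only been judged by the filter itself.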
$ rails g scaffold Post text:string content_type:string
$ rake db:migrate
We will need the NaiveText gem, so go ahead and put it in your Gemfile:
gem 'NaiveText'
Make sure to run bundler:
$ bundle
Also be sure to set the root route to 'posts#index'.
For easier querying we add three scopes to the Post model (learn more about scopes in the Rails guides).
class Post < ActiveRecord::Base
  scope :spam, -> { where(content_type: 'spam') }
  scope :content, -> { where.not(content_type: 'spam') }
  scope :verified, -> { where(content_type: 'verified_content') }
end
The actual spam filter is pretty simple. It is a plain Ruby class with one public class method:
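class SpamFilter
  # Returns true if the post's text is classified as spam.
  def self.spam?(post)
    build_classifier
    @classifier.classify(post.text).name == 'spam'
  end

  # Rebuilds the classifier from the current spam and verified posts.
  def self.build_classifier
    categories_config = [
      { name: 'spam', examples: Post.spam, weight: 1 },
      { name: 'content', examples: Post.verified, weight: 10 }
    ]
    @classifier = NaiveText.build(categories: categories_config, default: 'content')
  end
  # `private` does not apply to `def self.` methods, so hide it explicitly.
  private_class_method :build_classifier
end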
To integrate the spam filter into our Rails app, we need to make some adjustments to the PostsController. First, we only want to list posts that are not categorized as spam, so in the index action we replace Post.all with Post.content.
class PostsController < ApplicationController
  def index
    # Changed Post.all to Post.content
    @posts = Post.content
  end
  ...
The same goes for our set_post method.
# Changed Post.find to Post.content.find
def set_post
  @post = Post.content.find(params[:id])
end
We also need to remove the content_type from our permitted parameters. This way no user can submit a post and send a content_type along.
# Remove content_type
def post_params
  params.require(:post).permit(:text)
end
The real filtering happens in the controller's create action: we simply set the post's content_type based on the response of SpamFilter.spam?.
def create
  @post = Post.new(post_params)
  if SpamFilter.spam?(@post)
    @post.content_type = 'spam'
  else
    @post.content_type = 'content'
  end
  respond_to ...
end
Before we can use our app, we need to make sure to put some example posts in our seed file
Post.create([
  { text: 'Hello I am a friendly post', content_type: 'verified_content' },
  { text: 'Arrgh this is a spam post', content_type: 'spam' }
])
and run
$ rake db:seed
Conclusion
This solution, as it stands, is just a first dip into the problem of filtering spam from user input. The current solution has some drawbacks:
- Performance: Every call to SpamFilter.spam? rebuilds the classifier, so each newly created post triggers a query for all spam and verified posts.
- Verification: You need to manually verify content to get good results.
- Inflexibility: Right now users can't mark posts as spam, nor can they do anything if their post is falsely classified as spam.
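One possible way to tackle the performance drawback is to build the classifier once and reuse it until the training data changes. The following is only a sketch in plain Ruby: CachedClassifier and its builder block are hypothetical names, and the block merely stands in for the call to NaiveText.build.

```ruby
# Illustration only: build an expensive object once and reuse it.
# CachedClassifier and its builder block are hypothetical names.
class CachedClassifier
  def initialize(&builder)
    @builder = builder
    @mutex = Mutex.new
    @classifier = nil
  end

  # Returns the cached classifier, building it on first use.
  def get
    @mutex.synchronize { @classifier ||= @builder.call }
  end

  # Drops the cache, e.g. after a moderator verifies new posts.
  def reset!
    @mutex.synchronize { @classifier = nil }
  end
end

builds = 0
cache = CachedClassifier.new do
  builds += 1 # stands in for NaiveText.build(categories: ...)
  "classifier ##{builds}"
end

3.times { cache.get }
puts builds # ==> 1, the builder ran only once
cache.reset!
cache.get
puts builds # ==> 2, rebuilt after the reset
```

The price for this is staleness: newly verified or newly flagged posts only influence the filter after reset! has been called, and on a multi-process server every process holds its own copy.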
Make sure to check out the source code.