Spammers Are Vermin

by Jacob 19. July 2009 09:32

My apologies if you’ve tried to access my personal blogs recently. I’ve been inundated by comment spammers and it has been a tremendous pain in the buttocks getting them straightened out. For a while, I was getting only a half dozen or so a day. Short comments about what an amazing blog/post it was and that they’d definitely be back and/or bookmark/subscribe.

I could manually delete them without too much inconvenience for a while. Lately, though, there’s been a staggering increase in these weasels so I’ve adopted measures a little more… drastic.

A Comment Filter BlogEngine.Net Extension

I noticed that most of these spammers shared some distinctive characteristics. Many of them put down the same email address, for example. I also noticed that there were only three or four websites generally involved. Since the spam exists for the purpose of Google pagerank manipulation, the website is probably the important thing to note.

Now, I looked for a BE.Net extension that’d do this already. Unfortunately, most of the comment filters I found were tied into Akismet or some other blog filter service. That’s more overhead than I really want (in terms of configuration, registering, and complexity etc.). All I really need is something to check the email address, website, and maybe IP address against a known blacklist I can maintain myself. That shouldn’t be difficult, right?

Adventures in Comment Filtering

On the surface, these things weren’t that hard to accomplish. BlogEngine.Net has some quirks, though, that got in my way until I figured them out. For those interested, I’m going to explain them here. If you want to skip the gory details, head down to the next section. Or if you just want the extension, download it, pop it into the App_Code/Extensions folder and season to taste.

Finding the Right Event

My first impulse was to look at the Comment object for useful events to extend. Comment.Validating looked like a good candidate so I tried that one out. Unfortunately, that event never got hit on my blog. It took me a bit to realize that this is because I don’t actually validate comments. Validating comments is a setting where a comment doesn’t show up until it is approved. Since I only do blog maintenance once a day or so, I don’t want to prevent comments from showing up for that long. Validating comments would pretty much stop discussions in their tracks and I don’t want that.

Once I remembered that comments are managed on the Page object, things went much better. The Page.AddingComment event turned out to be the one I wanted.

ExtensionParameter Fun

This is the one that held me up the longest. ExtensionParameters can be assigned types that include things like “DropDown” and “ListBox”. That seemed like exactly the kind of thing I could use for my filters. You see, each filter will be of a limited number of valid types: “Website”, “Email”, “IP Address”, or “Length” (I added Length when I noticed that all these messages are really short and I might want to account for that in my filter).

Unfortunately, these ParamType values are a complete red herring for tabular data storage. I noticed that BE.Net wasn’t actually storing my selection when I tried to add filter entries. The thing is that BE.Net stores tabular values on each parameter in the DataStore and only maintains a link to them by the order in which they appear. So my parameters in the DataStore look like this once saved:

<Parameters>
  <Name>Filter</Name>
  <Label>Filter</Label>
  <MaxLength>100</MaxLength>
  <Required>true</Required>
  <KeyField>true</KeyField>
  <Values>http://www.sonicity.com/</Values>
  <Values>http://www.unlockprivateprofiles.com/</Values>
  <Values>http://www.lastminutejoy.de/</Values>
  <Values>http://www.mooladays.com/</Values>
  <Values>http://www.dbpclan.com/</Values>
  <Values>200</Values>
  <Values>email002545@hotmail.com</Values>
  <Values>http://www.ramshyam.com/</Values>
  <ParamType>String</ParamType>
  <SelectedValue/>
</Parameters>
<Parameters>
  <Name>FilterType</Name>
  <Label>Filter Type</Label>
  <MaxLength>100</MaxLength>
  <Required>true</Required>
  <KeyField>false</KeyField>
  <Values>Website</Values>
  <Values>Website</Values>
  <Values>Website</Values>
  <Values>Website</Values>
  <Values>Website</Values>
  <Values>Length</Values>
  <Values>Email</Values>
  <Values>Website</Values>
  <ParamType>String</ParamType>
  <SelectedValue/>
</Parameters>

It looks to me like list types (DropDown, ListBox, etc.) were mainly implemented with scalar settings in mind rather than the tabular settings this extension needs. This is unfortunate, but I can’t see an easy way to alter the architecture to support list types in tabular data. I could create my own custom admin page for the extension (and I still may) but that’s more work than I wanted to do to get this running.
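To make the "linked only by order" point concrete, here is a minimal sketch of how the two columns can be re-paired and checked against a comment. It is not the extension's actual code: the IsBlocked helper, its parameters, and the sample comment values are hypothetical, and treating Length as a minimum character count is my assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-ins for the two parameter columns as they come out of the DataStore:
// entry i of one list belongs with entry i of the other.
var filters = new List<string> { "http://www.sonicity.com/", "200", "email002545@hotmail.com" };
var filterTypes = new List<string> { "Website", "Length", "Email" };

// Zip re-joins the columns by position, which is the only link between them.
var rules = filters.Zip(filterTypes, (value, type) => (Type: type, Value: value)).ToList();

// Compares two strings while ignoring case and a trailing slash.
static bool Same(string a, string b) =>
    string.Equals(a.TrimEnd('/'), b.TrimEnd('/'), StringComparison.OrdinalIgnoreCase);

// Hypothetical helper: returns true when any rule matches the comment.
// Unknown filter types (a typo, for instance) never match.
static bool IsBlocked(IEnumerable<(string Type, string Value)> rules,
                      string email, string website, string ip, string content)
{
    foreach (var rule in rules)
    {
        bool hit = rule.Type switch
        {
            "Website" => Same(website, rule.Value),
            "Email" => Same(email, rule.Value),
            "IP Address" => Same(ip, rule.Value),
            // Assumption: Length is the minimum number of characters allowed.
            "Length" => int.TryParse(rule.Value, out var min) && content.Length < min,
            _ => false
        };
        if (hit) return true;
    }
    return false;
}

Console.WriteLine(IsBlocked(rules, "reader@example.com", "http://example.org/",
                            "10.0.0.1", "Nice post!"));
```

Because the pairing depends on position alone, anything that reorders one column without the other silently changes what every filter means.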

The Extension

So my comment extension has been up and working for a day or two now and things have calmed down a lot. This is a good thing. I can’t say that it is extensively tested for the simple reason that I don’t get many legitimate comments on a regular basis.

Configuration is pretty simple as long as you don’t typo the Filter Type value. Each filter is its own entry in the tabular list on top.

[Image: comment filter configuration screen]

Talking Back to Spammers

When I noticed that it still looks to the user like their comment is saved (because the comment is still part of the page object, it just isn’t saved to the DataStore), I had an inspiration. Since the comment is still displayed to the person who posted it (though not to anyone else), that’s an opportunity to make sure that someone running afoul of my length requirement doesn’t end up wondering what happened. Plus, it gives me a chance to tell spammers that they’ve been noticed (yeah, that’s of dubious value and I may rethink this, but for now, it just makes me feel better). If you enlarged the image above, you’ll see that there are templated values that will be used to replace the comment content. I can be as nasty as I want and the only ones who see it will be the spammers—though you’ll probably want to take it easy on those who stumble on your length filter (if any).

Spammers Should Die

A day or so after this filter went into effect I started to get new messages. These are clever little plays for sympathy saying things like “my comment got eaten but anyway… <regular spiel here>”. Or another “my blog is getting lots of comment spam, do know any way to help?” The website links were still classic spam sites so these weren’t real users looking for help. Cheeky little locusts, aren’t they? Seriously, someone with the right skills needs to hunt these bastards down and rearrange key organs into innovative new patterns.


Programming | Software

Multi-blog Obsession

by Jacob 6. April 2009 05:26

The multi-blog data provider for BlogEngine.Net has been taking up a lot of my brain space lately, to the point that I’m able to announce that it is installed and working “in the wild” on a hosted site (though not in anything like a heavy-load situation). I now have a copy of both my dev site and my personal site up and running from the same directory (and the same database). Frankly, I didn’t think it’d be as easy as it was. This success prompted me to create a 2.0 release (that is now up on the CodePlex site).

Getting Static

My main fear was with the heavy use of static variables in BlogEngine.Net. You see, BE.Net loads all the data into memory using static List variables. I found this out when I went looking for the best way to store a BlogId (so that it didn’t have to parse a URL every time a request came through).

While there are pros and cons to keeping your entire blog in memory (pro: speed and ease, con: memory bloat and a large delay on any request that triggers a data load), my concern was how an application would react when it had to serve two sets of data. Fortunately, it seems that even when two sites share an application pool on IIS, they still keep their static spaces separate. I’m not sure what I would have done if they hadn’t, but I was spared the tragedy.

Configuration

Installing the blog provider mainly involves copying the binary into the /bin directory and then updating the web.config to point to the right driver. There are three providers in your web.config that are affected.

Blog Provider

The blog provider handles the blog data: settings, posts, categories and suchlike. Add the provider and update the “defaultProvider” attribute and you’re ready to go.

<BlogEngine>
  <blogProvider defaultProvider="SQLBlogProvider">
    <providers>
      <add name="SQLBlogProvider" type="BlogEngine.SQLServer.SqlBlogProvider, BlogEngine.SQLServer" connectionStringName="BE"/>
    </providers>
  </blogProvider>
</BlogEngine>

Membership Provider

The membership provider handles user authentication and management (stuff like changing passwords and such). Technically, you don’t need to change this, but if you don’t, the users will be the same across blogs (not a problem if you aren’t multi-blogging). I frankly haven’t tested whether a mixed configuration actually works, but it should. Again, add the provider and update the “defaultProvider” attribute and you’re ready to go.

<membership defaultProvider="LinqMembershipProvider">
  <providers>
    <clear/>
    <add name="LinqMembershipProvider" type="BlogEngine.SQLServer.LinqMembershipProvider, BlogEngine.SQLServer" passwordFormat="Hashed" connectionStringName="BE"/>
  </providers>
</membership>

Role Provider

The role provider handles authorization and which users are assigned to which roles. Again, you don’t technically have to change this if you don’t need it. Also again, it’s simply a matter of adding the provider and changing the “defaultProvider” attribute.

<roleManager defaultProvider="LinqRoleProvider" enabled="true" cacheRolesInCookie="true" cookieName=".BLOGENGINEROLES">
  <providers>
    <clear/>
    <add name="LinqRoleProvider" type="BlogEngine.SQLServer.LinqRoleProvider, BlogEngine.SQLServer" connectionStringName="BE"/>
  </providers>
</roleManager>

Multiple-blog Configuration

To set stuff up for multiple blogs, you’ll need to run a script or two in your database and add an attribute to all the providers. There are two script files (included in both the binary and source files): one for setting up the initial database changes (DatabaseSchemaChanges.sql, which mostly adds tables) and another for adding the base values for a new blog (AddNewBlog.sql).

I wanted to make this easier by having the driver do the updates for you. That may still happen in the future, but since BlogEngine.Net itself requires manually running a script if you want to use the database provider I decided not to sweat it too hard. Presumably, anyone running in a database has to be running scripts manually anyway so this isn’t going to be a show stopper.

The provider will run just fine after running either script, even if you aren’t using multiple blogs. In other words, just because the database changed doesn’t mean that the single-blog installation is hosed. The exception to this is the “be_Settings” table. If you’re going to run for a while with a single blog after running the first script, you’ll want to add a default to the BlogId column so it doesn’t choke when you insert and update settings.

Both scripts are “templated” so you can change key factors (a table prefix on the first and a couple of blog values in the second). Filling in the template is a matter of hitting ctrl-shift-M in Query Analyzer or SQL Server Management Studio. That’ll bring up a prompt for what values you want those template variables to have.

The final thing to set up is a multiblog attribute on the providers. That’ll make your providers look something like this.

<add name="SQLBlogProvider" type="BlogEngine.SQLServer.SqlBlogProvider, BlogEngine.SQLServer" connectionStringName="BE" multiblog="true"/>

The provider selects the blog it wants to deliver based on three configured values.

  • Host is the base address. The provider matches the Host value against the end of the host (so rabidpaladin.com will match “rabidpaladin.com”, “www.rabidpaladin.com” and “blog.rabidpaladin.com”).
  • Path is the rest of the Url. The provider matches the Path value against the start of the requested path.
  • Port is the port (if any) in the Url. Honestly, I threw this one in there as much for my testing as for any real-world use I expect it to see.
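The matching rules above can be sketched roughly like this. It is an illustration of the three rules only, not the provider's code: the FindBlogId helper, the tuple layout, the "/dev" path, and the convention that a Port of 0 means "any port" are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical blog entries. More specific paths are listed first because
// this sketch simply returns the first entry that matches.
var blogs = new List<(int Id, string Host, string Path, int Port)>
{
    (1, "rabidpaladin.com", "/dev", 0),
    (2, "rabidpaladin.com", "/", 0),
};

// Returns the Id of the first matching blog, or -1 when nothing matches.
static int FindBlogId(IEnumerable<(int Id, string Host, string Path, int Port)> blogs, Uri url)
{
    foreach (var blog in blogs)
    {
        // Host is compared against the END of the requested host name.
        bool hostOk = url.Host.EndsWith(blog.Host, StringComparison.OrdinalIgnoreCase);
        // Path is compared against the START of the requested path.
        bool pathOk = url.AbsolutePath.StartsWith(blog.Path, StringComparison.OrdinalIgnoreCase);
        // Assumption: 0 means the blog does not care which port was used.
        bool portOk = blog.Port == 0 || blog.Port == url.Port;
        if (hostOk && pathOk && portOk) return blog.Id;
    }
    return -1;
}

Console.WriteLine(FindBlogId(blogs, new Uri("http://blog.rabidpaladin.com/dev/post/example.aspx")));
```

One thing the sketch makes obvious: a pure "ends with" test on the host is generous, so a Host value should be specific enough that it can't match somebody else's domain.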

Tags

One thing I added (at the provider level) is that when a post comes in without any tags, the provider takes a moment to scan for tags in the post body. This is a feature I did the initial work for in Subtext so porting it over was a matter of a couple minutes. Any time a post is inserted into the database, the provider checks if it has tags yet. If no tags are present, it will scan the content for appropriate anchor markup (like those produced for Technorati tags). That means that on import, my posts all had their tags correctly populated, which saved me from either re-tagging them by hand or losing the tags altogether. That I was able to avoid the brain-damaged tag handling of BlogEngine.Net is just a bonus (they lower-case tags on creation and then re-capitalize them when serving them up).
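For the curious, the scan amounts to something like the sketch below. This is not the provider's actual code: the ScanForTags name is made up, and I'm assuming the anchors are marked with rel="tag" (the convention Technorati-style tag links use). A regular expression is good enough for markup you generated yourself; arbitrary HTML would want a real parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects the link text of every <a ... rel="tag"> anchor,
// preserving the original capitalization and skipping exact duplicates.
static List<string> ScanForTags(string html)
{
    var tags = new List<string>();
    var pattern = new Regex(@"<a\b[^>]*\brel\s*=\s*[""']tag[""'][^>]*>(.*?)</a>",
                            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    foreach (Match m in pattern.Matches(html))
    {
        string tag = m.Groups[1].Value.Trim();
        if (tag.Length > 0 && !tags.Contains(tag)) tags.Add(tag);
    }
    return tags;
}

// Made-up post body for illustration.
string body = "<p>Post text</p>"
    + "<a href=\"http://technorati.com/tag/Linq\" rel=\"tag\">Linq</a>, "
    + "<a href='http://example.com/tags/BlogEngine.Net' rel='tag'>BlogEngine.Net</a>";

Console.WriteLine(string.Join(", ", ScanForTags(body)));
```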

Other Stuff

As I said, this should get you set up. Since I used this blog provider from the start on both my blogs, I can verify that the import tool works just fine in a multi-blog configuration. As far as BlogEngine.Net is aware, it’s doing the same stuff it always has. Indeed, the only change I made from BlogEngine.Net’s standard v1.4.5 release was in UrlRewrite.cs to allow links produced by Subtext to still work (so I don’t throw errors on old links).

else if (url.Contains("/POST/") || url.Contains("/ARCHIVE/"))

I submitted a patch at one time to have this hit the base source code but apparently it wasn’t deemed worthy.

Also, I found that running the provider in IIS7 is a bit tricky. Since BlogEngine.Net loads extensions from the database on application start, you’ll get errors if you are configured for “Integrated” mode. That’s because “Integrated” mode (quite properly) fires the application start event before the HttpContext.Request is populated (which is what I’m using to determine what blog is being requested). Setting the application pool to “Classic” mode will solve this “problem”.

Looking Forward

My blogs are still running Subtext at their base addresses. I’m still not quite ready to take the plunge on BlogEngine.Net. I am, however, undoubtedly one step closer.


Programming

Multiple Blog Data

by Jacob 29. March 2009 09:01

So I have a working LINQ to SQL provider for BlogEngine.Net. Now what? Given a little spare time, how about I see if I can’t use it to support running multiple blogs from the same installation? More importantly, see if I can use it to support running multiple blogs from the same database?

Doing just that turns out not to be all that difficult.

Scheming

The current architecture for BlogEngine.Net’s data already has a bit more cohesion than it technically needs. All the objects have their own individual Ids and those Ids are used to relate objects to each other (though there is one exception). Since every object already has its own Id (usually a Guid), splitting objects into separate blogs isn’t the chore it might otherwise have been.

There are two options when it comes to dividing items up into multiple blogs. First, each object can have a column added to its table to indicate which blog it is associated with. Second, you can create a cross-reference table that associates a blog Id with the object Id for the blog.

Columns

My initial impulse in most cases would be to add a BlogId column to the tables where it is needed. The reason is simple: objects belonging to the blog are in a true parent-child relationship and that relationship is generally best expressed as a field on the child indicating its parent. The relationship can (and really should) be enforced with a foreign key constraint on the column to ensure that the relationship is intact.

Cross-references

Having cross-reference tables is a bit more problematic and carries with it some maintenance and performance concerns. Not only does it force a join when you want to read the objects for a specific blog, but it means that insert, update, and delete commands now have to involve two tables instead of just one. One advantage of cross-reference tables is that they’re easier to extract back out if you need to devolve your data. Additionally, foreign key constraint integrity is triggered when the cross-reference entry is created instead of on your blog objects themselves, making your touch a bit lighter if you have other actors in the system.

Complicating Things

No decision is best for every occasion, and when it came time to design how I wanted multiple blogs to work, I was really reluctant to mess with the native tables of BlogEngine.Net. I’m not sure if my hesitation is a matter of respect for a project I’m not involved in or if I’m just being unreasonably squeamish, but I eventually chose to go the cross-reference route. My main reasoning is that I wanted my intrusions to remain light and easily devolved.

I ♥ Linq

Now, normally, adding a super-structure on top of an existing infrastructure is a real pain. Editing all your SQL statements manually becomes an exercise in precision string manipulation and if you’re working through stored procedures… ugh. Linq made this really easy.

Here’s an example from the FillProfiles method of the blog provider.

var profileData = from p in context.Profiles
                  select p;
if (isMultiBlog)
{
    profileData = from p in profileData
                  join bp in context.BlogProfiles on p.ProfileID equals bp.ProfileId
                  where bp.BlogId == Utils.GetBlogId()
                  select p;
}

The initial select is good for the general case. It pulls all the objects from the Profiles table. The filter for multiple blogs is added in the if block. Note that the second select references the first (“from p in profileData”). Linq knows that the second “from” is a refinement of the first and puts them together logically. Since Linq defers execution of the query until it’s actually used, the query sent to the server includes the full constraint (i.e. filtering happens on the database). Here’s the statement that’s actually sent.

SELECT [t0].[ProfileID], [t0].[UserName], [t0].[SettingName], [t0].[SettingValue]
FROM [dbo].[be_Profiles] AS [t0]
INNER JOIN [dbo].[be_BlogProfiles] AS [t1] ON [t0].[ProfileID] = [t1].[ProfileId]
WHERE [t1].[BlogId] = @p0

This method ensures that you only take the hit of the join if you are in a multi-blog setup. And without pulling everything to the client.
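If you want to see the composition and the deferred execution without a database handy, the same pattern works against in-memory lists. This is only a LINQ to Objects stand-in, so no SQL is generated; the ProfileNames helper and the sample data are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-ins for the Profiles table and the BlogProfiles cross-reference.
var profiles = new List<(int ProfileId, string UserName)> { (1, "alice"), (2, "bob"), (3, "carol") };
var blogProfiles = new List<(int BlogId, int ProfileId)> { (10, 1), (20, 2), (10, 3) };

// Builds the query in two steps; the join is only layered on for multi-blog setups.
static IEnumerable<string> ProfileNames(
    IEnumerable<(int ProfileId, string UserName)> profiles,
    IEnumerable<(int BlogId, int ProfileId)> blogProfiles,
    bool isMultiBlog, int blogId)
{
    var query = from p in profiles select p;
    if (isMultiBlog)
    {
        // Refines the first query rather than replacing it.
        query = from p in query
                join bp in blogProfiles on p.ProfileId equals bp.ProfileId
                where bp.BlogId == blogId
                select p;
    }
    return query.Select(p => p.UserName);
}

var names = ProfileNames(profiles, blogProfiles, true, 10);

// Nothing has executed yet, so a row added now still shows up in the results.
blogProfiles.Add((10, 2));

Console.WriteLine(string.Join(", ", names)); // alice, bob, carol
```

With LINQ to SQL the shape is the same, except that the deferred query is translated to a single SQL statement when it is finally enumerated.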

Settings

I had some fun with the Settings table because it is an exception to BlogEngine.Net’s Id rigor. It has interesting impact on the Linq situation, but I think I’ll give it its own (short) post later.

Beta Available

So I tested this in my own home-grown environment and it seems to work as expected. In consequence, I’ve created a new release at the project homepage. I’m calling it a beta, though it barely warrants the label. I worry that it has only been tested in a single environment. If you’re a hardy soul and a BE.Net user, please give it a go. I’ll be spending some time getting it set up and tested in an actual public setting with my personal blogs here shortly. As always, I welcome feedback at CodePlex, in the comments, or via email.


Programming

scruffylookingcatherder.com

    Disclaimer
    The opinions expressed herein are my own personal opinions and do not represent my employer's view in anyway.

    © Copyright 2010 Scruffy-looking Cat Herder