I’m doing some more Yahoo Pipes work: aggregating and filtering blog feeds. I’ve created a feed called Chardonnay that combines a whitelist with a highly filtered set of search results, and I’ll eventually make a less-filtered “2-Buck Chuck” version and an even more heavily filtered Eiswein version.
My basic rule of thumb for the Chardonnay feed is that if a blog’s signal-to-noise ratio is less than 3:1 or so, I bump it into Tier 2 (the tiers are described below). That’s not to say those blogs don’t have any good content, but I’m trying to keep my feed at a signal-to-noise ratio of at least 8:2.
For the Eiswein feed, I’m aiming for a 9:1 signal-to-noise ratio. In order to do that, I have to filter everything, including myself. =)
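To make the ratios concrete, here is a minimal Python sketch of the rule of thumb. It is not how the pipe itself works; the function names and the idea of hand-counting "signal" and "noise" posts per blog are assumptions made for illustration.

```python
from fractions import Fraction

# Assumed cutoff, taken from the "3:1 or so" rule of thumb above.
TIER1_MIN_RATIO = Fraction(3, 1)


def assign_tier(signal_posts: int, noise_posts: int) -> int:
    """Return 1 if a blog stays whitelisted, 2 if it gets bumped to the filtered tier."""
    if noise_posts == 0:
        return 1  # nothing but signal, so it stays whitelisted
    ratio = Fraction(signal_posts, noise_posts)
    return 1 if ratio >= TIER1_MIN_RATIO else 2


def signal_share(signal_posts: int, noise_posts: int) -> Fraction:
    """Convert a signal:noise ratio into the fraction of posts that are signal."""
    return Fraction(signal_posts, signal_posts + noise_posts)
```

Read this way, 3:1 means three quarters of a blog's posts are on topic, the 8:2 Chardonnay target is four fifths, and the 9:1 Eiswein target is nine tenths.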
As for 2-Buck Chuck, well, let’s say it’s so unfiltered that it has chunks^wpieces of sediment in it. It’s also hard to build something like this and then intentionally disable the quality controls you’ve built.
“Why the wine motif?” you ask. Well, I was looking for something that has a range of price and quality, so wine fit right in. I bought www.chateaublogsville.com, which will be the entry site for the three security blog feeds. It might take me a couple of weeks to get a simple site up, but in the meantime you’re free to subscribe to any of the feeds.
One thing that I’m finding out about blog feeds: for the Chardonnay feed, I had to look at a couple of approaches to feed aggregation. I started out with a list of people I link to and a desire to have a Google and Technorati catch-all search to find relevant information from little-known feeds. After a couple of days of data munging, I noticed that the source feeds fit into the following groups:
- Tier 1: Feeds that I want to let through pretty much unfiltered (mine, Matasano, Curphey, ISM-Community, Bejtlich, etc.)
- Tier 2: Feeds that need to be filtered for relevancy (Security Bloggers Network members, news site aggregators that I haven’t whitelisted above)
- Tier 3: Feeds that need to be filtered for spam and then filtered for relevancy whilst wearing lead gloves (Technorati and Google searches)
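The three tiers can be sketched in a few lines of Python. This is only an illustration of the idea, not an export of the actual pipe: the keyword lists, the dict-shaped feed items, and the function names are all hypothetical.

```python
import re

# Hypothetical keyword lists; the real filters use their own terms.
SPAM_WORDS = {"casino", "timeshare", "timeshares"}
TOPIC_WORDS = {"security", "vulnerability", "exploit"}


def words(item: dict) -> set:
    """Lowercased words from an item's title and summary."""
    text = f"{item.get('title', '')} {item.get('summary', '')}".lower()
    return set(re.findall(r"[a-z0-9']+", text))


def is_spam(item: dict) -> bool:
    return bool(words(item) & SPAM_WORDS)


def is_relevant(item: dict) -> bool:
    return bool(words(item) & TOPIC_WORDS)


def keep(item: dict, tier: int) -> bool:
    """Apply the checks that match the trust level of the item's source feed."""
    if tier == 1:
        return True                                    # whitelisted, unfiltered
    if tier == 2:
        return is_relevant(item)                       # relevancy only
    if tier == 3:
        return not is_spam(item) and is_relevant(item)  # spam check, then relevancy
    raise ValueError(f"unknown tier: {tier}")


def aggregate(feeds) -> list:
    """feeds is an iterable of (tier, items) pairs; returns the items that survive."""
    return [item for tier, items in feeds for item in items if keep(item, tier)]
```

Keeping the tier as a property of the source feed, rather than of each item, means promoting or demoting a blog is a one-line change.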
Now that I write it all down, it sounds exactly like writing email filters or SIEM tuning or any one of a bazillion other uses that you could have for filtering, so I’ve once again recreated ideas that already exist. Of course, I probably could have saved some time by approaching the problem from this angle, but really I had to move the ideas around a dozen different ways until they fit in a way that made sense.
The funny thing is that I had the hardest time filtering on “privacy.” I was getting too much junk off the blog search feeds (privacy of timeshares, that kind of thing), so what I’m playing with is killing “privacy” from the main filter and then filtering the search feeds on “privacy” plus a second keyword.
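Here is a rough Python sketch of that two-keyword rule. The list of second keywords is made up for the example, and the function name is hypothetical; the point is only that “privacy” alone never passes.

```python
import re

# Hypothetical companion keywords; the real list would be tuned by hand.
SECOND_KEYWORDS = {"breach", "encryption", "hipaa", "data"}


def matches_privacy_pair(text: str, second_keywords=SECOND_KEYWORDS) -> bool:
    """True only if the text mentions "privacy" AND at least one companion keyword."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return "privacy" in tokens and bool(tokens & set(second_keywords))
```

With this list, a post titled “Privacy of timeshares in Florida” is dropped, while “Privacy after the data breach” gets through.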
The usual disclaimers apply here: I’m playing with content provided by other people, so I don’t even remotely pretend to have any control over it. A couple of pieces of junk will slip through the filters. And because the source of the filters is open for the world to see, you can cheat them by including the right words.