The second document returned by this query is this:
title: Kellogg recalls cereal because of glass fragments
description: Investigators say Joshua Herrin, 25, attacked the man during a road rage incident NEW YORK (AP) - Sony unveiled its next-generation gaming system, the PlayStation 4, and promised social and remote capabilities. Wednesday's announcement gives the
Clearly Bing indexed too much here. If you look at the 13abc.com page, you can see that it matched the term 'playstation' in the top stories block: http://www.screencast.com/t/0HfKC6bfunq0
The text in the description about the road rage incident is no longer visible on the 13abc.com page.
Technically we've already 'fixed' the issue on our side by no longer trusting Bing: we download the pages ourselves, remove all the clutter, and then validate whether our query terms are still present.
But of course this has a big impact on cost. We have to pay Microsoft for each page of 15 results received through the API, and after a few days of measuring we found we had to throw away 15-20% of the results delivered.
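For anyone who wants to try the same workaround, the re-validation step described above can be sketched roughly as follows. This is only a minimal illustration, not our production code: the set of tags treated as clutter and the names MainTextExtractor, extract_main_text and terms_present are assumptions made for the example, and a real page such as the 13abc.com one would need site-specific rules to identify blocks like 'top stories'.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags hold navigation/teaser clutter. Real sites often use
# plain <div> blocks, so this set would need site-specific tuning.
CLUTTER_TAGS = {"script", "style", "nav", "aside", "footer", "header"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any clutter tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside at least one clutter tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in CLUTTER_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in CLUTTER_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_main_text(html):
    """Return the page text with clutter blocks removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def terms_present(html, terms):
    """True only if every query term appears as a word in the cleaned text."""
    words = set(re.findall(r"\w+", extract_main_text(html).lower()))
    return all(term.lower() in words for term in terms)


page = (
    "<html><body>"
    "<aside><a href='#'>Sony unveils PlayStation 4</a></aside>"
    "<p>Kellogg recalls cereal because of glass fragments</p>"
    "</body></html>"
)
print(terms_present(page, ["playstation"]))        # term only in the clutter block
print(terms_present(page, ["kellogg", "cereal"]))  # terms in the article body
```

A result would be kept only when terms_present returns True for the downloaded page, which is exactly where the extra download cost comes from.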
So my questions are:
- Is there a way to avoid these "false" positives?
- Is Microsoft aware of this issue? If so, is there a plan to improve the quality? If not, how can I escalate this with Microsoft?
This won't resolve the issue of false positives, but it does reduce the number returned.
I implemented a negative-word filter in my code that gets added to each query. For example, searching for Seattle Mariners MLB returns a ton of gambling and other "news" articles that I don't want, so I add -bet -bookmaking -gambling -gamble (and the list goes on). This means that the results I do get are pretty clean. I additionally added filters to remove complete spam domains that generate daily entries for sports they have NOTHING to do with.
Not ideal, but thought I would share because it made a huge difference for me in my baseball fan apps.
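In case it helps, here is a rough sketch of the two filters described above. It is not my actual app code: the helper names, the example blocked domain, and the assumption that each result is a dict with a "url" key are all made up for illustration, so adapt them to whatever shape your API client returns.

```python
from urllib.parse import urlparse

# Negative words from the post; the real list is longer.
NEGATIVE_TERMS = ["bet", "bookmaking", "gambling", "gamble"]

# Hypothetical spam domain, used here only as a placeholder.
BLOCKED_DOMAINS = {"spam-sports.example"}


def build_query(query, negative_terms=NEGATIVE_TERMS):
    """Append each negative word to the query with a leading minus sign."""
    return " ".join([query] + ["-" + term for term in negative_terms])


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def filter_results(results, blocked=BLOCKED_DOMAINS):
    """Drop results hosted on blocked domains.

    Assumes each result is a dict with a "url" key (an assumption for this
    sketch, not the API's documented response shape).
    """
    return [r for r in results if not is_blocked(r["url"], blocked)]


print(build_query("Seattle Mariners MLB"))
```

The negative words cut down what the search returns in the first place, while the domain filter runs on the results afterwards.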