
Welcome Back!

Posted by admin in December 3rd 2008  

That was some extended vacation. I must apologize for the lack of recent blog posts. Between the holidays and the hectic schedule we’ve kept over the past couple of months, I’ve completely ignored this place. Let’s not make a habit of that :)

Stay tuned for some new content and articles.

Brian

No Comment
under: General

How does a search engine decide which duplicate to show in search results?

Posted by admin in August 6th 2008  

Let’s start with a question we have all thought about at one point or another, a question that our past two days’ articles have been leading up to.

“How does a search engine decide which duplicate to show in search results, and which ones not to show?”

How do they choose? PageRank? First one published? Shortest URL? Article with the most links?

It doesn’t seem to be any one signal. It’s not PageRank alone, or distance from the root directory. It’s probably not the first one published: many sites are dynamic, so the timestamp on the original may be later than the one on the copy, and whichever copy is spidered first might be the one the search engines treat as the oldest. It doesn’t appear to be perceived authority. It could have something to do with the number and quality of inbound and outbound links on a page. It could be a mix of all of those things and others.

So what is it then? Let’s dive into some research papers and find out!

Collapsing Equivalent Results

Thanks, Microsoft.

A new patent application published by Microsoft discusses some of the signals that may be used to determine which results to show, and which to filter, at least possibly in Windows Live Search.

It may not include all of the signals being looked at - some of those might be trade secrets.

The practices at Google and Yahoo and Ask.com may be different.

But, all of the major search engines are striving to create good user experiences for people who search using their services. And all of them want to avoid duplicate results filling up the early spots on search result pages. The patent application does provide some insight into what search engines consider in choosing which pages to show, and which to hide.

I was surprised by a couple of the factors, and by the appearance of something I believe I’ve seen Matt Cutts refer to as “Pretty URLs.”

System and method for optimizing search results through equivalent results collapsing
Invented by Brett D. Brewer
Assigned to Microsoft
US Patent Application 20060248066
Published November 2, 2006
Filed: April 28, 2005

Abstract

A system and method are provided for optimizing a set of search results typically produced in response to a query. The method may include detecting whether two or more results access equivalent content and selecting a single user-preferred result from the two or more results that access equivalent content. The method may additionally include creating a set of search results for display to a user, the set of search results including the single user-preferred result and excluding any other result that accesses the equivalent content. The system may include a duplication detection mechanism for detecting any results that access equivalent content and a user-preferred result selection mechanism for selecting one of the results that accesses the equivalent content as a user-preferred result.

The Duplicate Content Problem

1. A search engine finds documents that match queries and assigns them scores to determine the order in which they should be displayed.

2. Pages that may be very relevant as results may also be duplicates, or near duplicates, of each other.

3. Example: www.ymca.net and www.ymca.net/index.jsp lead to the same content with the first URL redirecting to the second one. And, www.ymca.com and www.ymca.com/index.jsp could be mirrors of www.ymca.net.

4. A search engine might include all four results in the top ten results of a search for the query “ymca”.

5. This is a bad user experience, because it keeps the searcher from seeing other results that might also be relevant, on the first page of results.
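
The collapsing idea can be sketched in a few lines of Python. This is only an illustration of the problem described above, not the method from the patent application: the `Result` class, the scores, the `content_id` fingerprint, and the example.org URL are all made up for the example. It simply keeps the highest-scoring result from each group of equivalent pages, which frees up the other slots on the first page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    url: str
    score: float      # relevance score assigned by the ranker (made-up values)
    content_id: str   # hypothetical fingerprint shared by equivalent pages


def collapse(results, limit=10):
    """Keep only the best-scoring result for each content_id, in rank order."""
    seen = set()
    kept = []
    for r in sorted(results, key=lambda r: r.score, reverse=True):
        if r.content_id in seen:
            continue  # an equivalent page is already in the list
        seen.add(r.content_id)
        kept.append(r)
        if len(kept) == limit:
            break
    return kept


results = [
    Result("www.ymca.net", 0.97, "ymca-home"),
    Result("www.ymca.net/index.jsp", 0.96, "ymca-home"),
    Result("www.ymca.com", 0.95, "ymca-home"),
    Result("www.ymca.com/index.jsp", 0.94, "ymca-home"),
    Result("www.example.org/ymca-history", 0.90, "other-page"),
]

collapsed_urls = [r.url for r in collapse(results)]
print(collapsed_urls)
```

Note that this sketch just keeps whichever duplicate scored highest. Choosing the version users would actually prefer is a separate step, which is what the rest of the patent application is about.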

Choosing One Result

The system described would include:

* A crawler that visits web pages, and indexes and stores results in an index/storage system.

* Ranking components that may rank located results in response to searchers’ queries.

* Results storage components which may have a cache for recently stored results and an index system for storage of additional results.

* A duplication detection mechanism which would detect results having duplicate content. A technique for detecting duplicates referenced in the patent application involves using “shingleprints” as described in another Microsoft U.S. patent application, Method for duplicate detection and suppression.

* A result selection module decides which result to display to searchers, regardless of whether shingleprints or other methods are used to determine which are duplicates.

Result Selection Module

Some parts which may be included in the result selection module:

  • A query-independent ranking component (something like PageRank, a page quality score, some other measure, or a combination of these),
  • A result analysis component,
  • A navigation model selection mechanism,
  • A click-through rate determination component,
  • A user-preferred result selection mechanism, and
  • Result storage.

Upon finding that results are duplicates, or very near duplicates, those results would be placed in Result Storage, but the search engine would not display them all.

The Result Selection Module would determine (through the result analysis component) which was the “user preferred selection” (via the user-preferred result selection mechanism) to show in response to the query.

A different URL might be chosen as the URL that the search engine actually uses to navigate to the page (chosen via the navigation model selection mechanism).

Some Factors the Results Analysis Component Might Consider

* Extension - .com might be a better choice than .net - it “appeals” to users because they understand it

* Shorter URLs - In the YMCA example above, the user-preferred version of the URL may be www.ymca.com both because “.com” is more common than “.net” and because the www.ymca.com URL is shorter than the two “index.jsp” results.

* The Navigational Model Selection might choose a different URL - while the searcher is shown www.ymca.com, the link might actually go to www.ymca.com/index.jsp, which is selected by the navigation model selection mechanism and is stored in the result storage area, in order to save the user a redirect. Eliminating redirects leads to the fastest result.

* The URL might contain keywords that appear in the query. In that case, the URL acts as a document summary. So, www.sfgiants.com might be a better choice than www.mlb.com/sf/id1223/xyx.com when the query is “sf giants”

* Searcher Location or language - A different duplicate might be chosen based upon where the person searching is from. So a London-based searcher might see www.example.co.uk where a New York searcher would get www.example.com

* Popularity - how well the page is linked to by other sites might be determined by the query-independent ranking component.

* Click-through rates might be tested, and the version of the URL with the highest rate may be identified by the click-through rate determination component, acting upon the assumption that high click-through rates indicate that users find the result satisfactory.

* Fewest redirects - as determined by the navigation model.

The user-preferred result selection mechanism uses input from the query independent ranking component, the result analysis component, and the click through determination component to select a user-preferred result. (That sounds much better than the technical term I’ve seen Matt Cutts use regarding displayed URLs in results in the context of redirects - the “prettiest URL.”)
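
Here is a rough Python sketch of how a few of those factors could be combined, using the YMCA example. Everything in it is an assumption made for illustration: the `Candidate` class, the redirect counts, and the order in which the factors are weighed are mine, not something the patent application spells out. It picks one URL to display (query words in the URL, then “.com”, then shortest) and a possibly different URL on the same host to navigate to (fewest redirects).

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Candidate:
    url: str        # written without a scheme, as in the post
    redirects: int  # hypothetical number of redirects before the content loads


def host(url):
    # Prefix "//" so urlsplit treats the first part as the host name.
    return urlsplit("//" + url).netloc


def display_key(candidate, query):
    compact_query = query.replace(" ", "").lower()
    return (
        compact_query in candidate.url.lower(),  # query words appear in the URL
        host(candidate.url).endswith(".com"),    # ".com" assumed most familiar
        -len(candidate.url),                     # shorter URLs preferred
    )


def choose(candidates, query):
    """Return (display_url, navigation_url) for a group of duplicates."""
    display = max(candidates, key=lambda c: display_key(c, query))
    same_host = [c for c in candidates if host(c.url) == host(display.url)]
    navigation = min(same_host, key=lambda c: c.redirects)
    return display.url, navigation.url


CANDIDATES = [
    Candidate("www.ymca.net", 1),
    Candidate("www.ymca.net/index.jsp", 0),
    Candidate("www.ymca.com", 1),
    Candidate("www.ymca.com/index.jsp", 0),
]

print(choose(CANDIDATES, "ymca"))
```

With these made-up inputs the searcher would be shown www.ymca.com while the link itself points at www.ymca.com/index.jsp, which matches the behavior described in the list above. Signals like click-through rate, popularity, and searcher location are left out to keep the sketch short.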

Conclusion

So, something like PageRank does matter when it comes to filtering equivalent results, as do searcher location, click-through rates, number of redirects, words used in URLs, length of URL, choice of TLD, and possibly other signals.

The other interesting thing here is that a search engine may display one URL for searchers, and use a different one for navigation - Pretty URLs for people, and more direct URLs to navigate to the page.

2 Comments
under: Search Engine Concepts
Tags: content duplication, duplicate content, search engines, search results, serps

The Great Duplicate Content Myth

Posted by admin in August 5th 2008  

Yesterday we discussed the HOW portion of detecting duplicate content. Today I want to get into the actual process itself.

A widespread theory in the SEO world states that duplicate content not only carries a heavy penalty, but in fact can and will lead to a domain being banned or deindexed. Today I am going to discuss why I believe that this is not only unfounded, but perhaps completely untrue.

Let’s start with some facts and figures. I’ve had the pleasure of reading dozens of research papers from MSN, Yahoo, Google, and other leading members of the academic and professional search arena. From these papers it’s easy to determine that duplicate content detection is entirely possible in theory and at least partly in practice, but I believe the “practice” portion is where almost everyone may be wrong.

So what would it take for the big G to pull off duplicate content testing in the real world? Well, let’s start by looking at the numbers. Let’s assume it’s still 2004 and Google still has “only” 8 billion pages in its index. Estimates show that they have several PETABYTES of data across their datacenters. So I’m Joe Webmaster and I put up a page about sprinklers. Does anyone here really believe that Google, or anyone else on this planet, actually has enough processing power to take my single page about sprinklers, shingle it, and compare it to the other 7,999,999,999 pages of content, each of which needs to be shingled as well? Shingling, as we discussed yesterday, is the process by which search engines tell unique content from duplicate content. The problem is that it is a very intensive calculation, because you’re not comparing A to B; you’re comparing every document against all other documents. I think they call this an O(n²) problem, and it happens to be very expensive in CPU time. Unless a page is flagged to begin with, it would be cost and time prohibitive to carry out such an expensive calculation on every page in the data set.
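
To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python. It assumes the naive approach described above, where pages are compared directly against one another, and the function names are mine, purely for illustration.

```python
def comparisons_for_new_page(n):
    """Comparisons needed to check one page against the other n - 1 pages."""
    return n - 1


def pairwise_comparisons(n):
    """Comparisons needed to check every page against every other page once."""
    return n * (n - 1) // 2


INDEX_SIZE = 8_000_000_000  # the "only" 8 billion pages from the 2004 scenario

print(f"{comparisons_for_new_page(INDEX_SIZE):,}")  # 7,999,999,999
print(f"{pairwise_comparisons(INDEX_SIZE):,}")      # 31,999,999,996,000,000,000
```

So one new page means roughly 8 billion comparisons, while checking the whole index against itself means roughly 32 quintillion. That second figure is the O(n²) part.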

So if this is the case, what is duplicate content used for? What is the scope of the data Google is looking for? I believe they check for duplicate content on a PER DOMAIN BASIS, meaning they take a single domain, check the content and run comparisons to give the overall domain a content quality or duplicate content quality score. Let’s see why that makes sense on several levels. First, it’s within the ability of their crawler to do such a thing from a CPU processing power perspective. It also makes sense that they would factor this into the overall quality score for a domain.

Now the evidence:

1) A year ago I put up a 100 percent clone of Wikipedia. I used the Wikipedia template, I copied the data from their database, etc. This new domain was 100 percent identical to wikipedia.org.

The result? I rank well for thousands of terms, the domain has almost 1 million pages indexed in Google, and it receives 3-5K uniques per day. So much for a duplicate content penalty. Of course the content is highly unique from page to page on the domain, but it isn’t unique when the scope is expanded to include the entire internet.

2) PublicBlend.com - By definition all social media sites contain 100 percent duplicate content that would never pass a shingling algorithm. All of our stories come directly from other web pages. In fact they are direct copies of articles from all over the internet.

The result? PublicBlend.com has been steadily growing in search engine traffic every month and now receives over 3,000 uniques a day from Google alone. (We recently changed the domain name, so the indexing has started over.)

3) News sites, not just social media, but regular news media as well. Reuters is the source for 90 percent of the news on the net. Everyone duplicates their stories word for word yet they all rank well for the resulting stories.

I hope the above sparks some debate and discussion on the topic of duplicate content. It may also raise some other interesting questions:

From a white hat perspective, what happens when 50 spam sites scrape your feed?  Will your content get penalized or will the spam sites get penalized? How would a search engine determine who wrote the article first? Would they simply rely on domain trust? If so that opens the door to all sorts of gaming options using old trusted domains.

7 Comments
under: Search Engine Concepts
Tags: black hat, duplicate content, myth, myth busters, penalty, search engine, seo, white hat

Duplicate Content Dissected

Posted by admin in August 4th 2008  

I’ve read seemingly hundreds of forum posts discussing duplicate content, none of which gave the full picture, leaving me with more questions than answers. I decided to spend some time doing research to find out exactly what goes on behind the scenes. Here is what I have discovered.

Most people are under the assumption that duplicate content is looked at on the page level when in fact it is far more complex than that. Simply saying that “by changing 25 percent of the text on a page it is no longer duplicate content” is not a true or accurate statement. Let’s examine why that is.

To gain some understanding we need to take a look at the k-shingle algorithm that may or may not be in use by the major search engines (my money is that it is in use). I’ve seen the following used as an example, so let’s use it here as well.

Let’s suppose that you have a page that contains the following text:

The swift brown fox jumped over the lazy dog.

Before we get to this point the search engine has already stripped all tags and html from the page leaving just this plain text behind for us to take a look at.

The shingling algorithm essentially finds word groups within a body of text in order to determine the uniqueness of the text. The first thing it does is strip out all stop words like “and”, “the”, “of”, and “to”. It also strips out all filler words, leaving us with only the action words, which are considered the core of the content. Once this is done, the following “shingles” are created from the above text. (I’m going to include the stop words for simplicity.)

The swift brown fox
swift brown fox jumped
brown fox jumped over
fox jumped over the
jumped over the lazy
over the lazy dog

These are essentially like unique fingerprints that identify this block of text. The search engine can now compare this “fingerprint” to other pages in an attempt to find duplicate content. As duplicates are found, a “duplicate content” score is assigned to the page. If too many “fingerprints” match other documents, the score becomes high enough that the search engines flag the page as duplicate content, sending it to supplemental hell or, worse, deleting it from their index completely.

Now suppose a second page contains the following text:

My old lady swears that she saw the lazy dog jump over the swift brown fox.

The above gives us the following shingles:

my old lady swears
old lady swears that
lady swears that she
swears that she saw
that she saw the
she saw the lazy
saw the lazy dog
the lazy dog jump
lazy dog jump over
dog jump over the
jump over the swift
over the swift brown
the swift brown fox

Comparing these two sets of shingles we can see that only one matches (“the swift brown fox”). Thus it is unlikely that these two documents are duplicates of one another. No one but Google knows what the percentage match must be for these two documents to be considered duplicates, but some thorough testing would sure narrow it down ;).
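
If you want to play with this yourself, here is a minimal Python sketch of the comparison, run on the two example sentences. It keeps the stop words, just like the lists above, and it scores the overlap with the Jaccard measure (shared shingles divided by all distinct shingles). That measure is my own choice for illustration; I am not claiming it is the one any search engine actually uses.

```python
import re


def shingles(text, k=4):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles across both sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "The swift brown fox jumped over the lazy dog."
doc_b = ("My old lady swears that she saw the lazy dog "
         "jump over the swift brown fox.")

set_a = shingles(doc_a)
set_b = shingles(doc_b)
shared = set_a & set_b

print(len(set_a), len(set_b))   # 6 13
print(shared)                   # {'the swift brown fox'}
print(round(jaccard(set_a, set_b), 3))  # 0.056
```

One shared shingle out of 18 distinct ones works out to a similarity of about 5.6 percent, which is why these two pages would be very unlikely to be treated as duplicates.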

So what can we take away from the above examples? First and foremost, we quickly begin to realize that detecting duplicate content is far more involved than saying “document A and document B are 50 percent similar”. Second, we can see that people adding “stop words” and “filler words” to avoid duplicate content are largely wasting their time. It’s the “action” words that should be the focus. Changing action words without altering the meaning of a body of text may very well be enough to get past these algorithms. Then again, there may be other mechanisms at work that we can’t yet see, rendering that impossible as well. I suggest experimenting and finding what works for you in your situation.

1 Comment
under: Search Engine Concepts
Tags: content, duplicate content, k-shingle, penalties, search engines, shingle

Welcome to BlackHat360.com

Posted by admin in August 2nd 2008  

BlackHat360 is a site dedicated to all things BlackHat. For those joining us who don’t know, blackhat is a type of SEO, or Search Engine Optimization, that is often misunderstood. We are here to dispel some of the myths surrounding the technique by educating people on the various methods and practices commonly used. We’re just starting out, so bear with us while we bring you new information and tools. Be sure to stop by our forums for in-depth discussions and related information. BlackHat360 Forums

1 Comment
under: General
Tags: blackhat, blackhat360, forums, search engine optimization, seo, welcome

©2006-2009 BlackHat360