What is the tech stack behind Google Search Engine?

The original Google algorithm was called PageRank, named after inventor Larry Page (though, fittingly, the algorithm does rank web pages). 

After 17 years of work by many software engineers, researchers, and statisticians, Google search uses algorithms upon algorithms upon algorithms.

How does Google’s indexing algorithm (so it can do things like fuzzy string matching) technically structure its index?

  • There is no single technique that works.
  • At a basic level, all search engines have something like an inverted index, so you can look up words and associated documents. There may also be a forward index.
  • One way of constructing such an index is by stemming words. Stemming is done with an algorithm that boils words down to their basic root. The most famous stemming algorithm is the Porter stemmer.
  • However, there are other approaches. One is to build n-grams, sequences of n letters, so that you can do partial matching. You often would choose multiple n’s, and thus have multiple indexes, since some n-letter combinations are common (e.g., “th”) for small n’s, but larger values of n undermine the intent.
  • I don’t know that we can say “nothing absolute is known.” Look at misspellings: Google can resolve a lot of them. This isn’t surprising; we’ve had spellcheckers for at least 40 years. However, the less common a misspelling, the harder it is for Google to catch.
  • One cool thing about Google is that they have been studying and collecting data on searches for more than 20 years. I don’t mean that they have been studying searching or search engines (although they have been), but that they have been studying how people search. They process several billion search queries each day. They have developed models of what people really want, which often isn’t what they say they want. That’s why they track every click you make on search results… well, that and the fact that they want to build effective models for ad placement.
  • Each year, Google changes its search algorithm around 500–600 times. While most of these changes are minor, Google occasionally rolls out a “major” algorithmic update (such as Google Panda and Google Penguin) that affects search results in significant ways.

    For search marketers, knowing the dates of these Google updates can help explain changes in rankings and organic website traffic and ultimately improve search engine optimization. Below, we’ve listed the major algorithmic changes that have had the biggest impact on search.

  • Originally, Google’s indexing algorithm was fairly simple.

    It took a starting page and added each unique word on the page to the index (a word that occurred more than once on the page was only counted once), or incremented that word’s count if it was already in the index.

    The page was indexed by the number of references the algorithm found to the specific page. So each time the system found a link to the page on a newly discovered page, the page count was incremented.

    When you did a search, the system would identify all the pages with those words on them and show you the ones that had the most links to them.

    As people searched and visited pages from the search results, Google would also track the pages that people would click to from the search page. Those that people clicked would also be identified as a better quality match for that set of search terms. If the person quickly came back to the search page and clicked another link, the match quality would be reduced.

    Now, Google is using natural language processing, a method of trying to guess what the user really wants. From that it finds similar words that might give a better set of results, based on searches done by millions of other people like you. It might assume that you really meant some other word instead of the word you used in your search terms, or it might simply include matches for those other words in the list alongside matches for the words you provided.

    It really all boils down to the fact that Google has been monitoring a lot of people doing searches for a very long time. It has a huge list of websites and search terms that have done the job for a lot of people.

    There are a lot of proprietary algorithms, but the real magic is that they’ve been watching you and everyone else for a very long time.
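The simple early scheme described above (unique words into an inverted index, pages ranked by inbound links) can be sketched in a few lines. This is a hypothetical toy in Python, not Google’s implementation; the data, function names, and ranking rule are all illustrative assumptions:

```python
from collections import defaultdict

def build_inverted_index(pages):
    """pages: dict mapping url -> page text.
    Maps each word to the set of urls containing it; a word that
    occurs more than once on a page is only counted once."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index[word].add(url)
    return index

def rank_by_inlinks(urls, links):
    """links: dict mapping url -> number of inbound links found so far.
    Pages with the most links pointing to them come first."""
    return sorted(urls, key=lambda u: links.get(u, 0), reverse=True)

# Made-up crawl data for illustration.
pages = {
    "a.com": "cheap flights to paris",
    "b.com": "paris travel guide flights hotels",
    "c.com": "cooking pasta at home",
}
links = {"a.com": 3, "b.com": 7, "c.com": 1}

index = build_inverted_index(pages)
hits = index["flights"] & index["paris"]   # pages containing both words
print(rank_by_inlinks(hits, links))        # → ['b.com', 'a.com']
```

The click-feedback idea from the answer would then adjust these scores over time, boosting pages that searchers click and stay on.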

What programming language powers Google’s search engine core?

C++, mostly. There are little bits in other languages, but the core of both the indexing system and the serving system is C++.

How does Google handle the technical aspect of fuzzy matching? How is the index implemented for that?

  • With n-grams and word stemming, plus correction of misspelled words. N-grams allow partial matching of almost anything.
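As a rough illustration of the n-gram part of that answer, here is a toy character-trigram index in Python. The vocabulary, padding scheme, and overlap scoring are illustrative assumptions, not Google’s actual method:

```python
from collections import defaultdict

def ngrams(word, n=3):
    """All character n-grams of a word, padded with '$' so that
    prefixes and suffixes can match too."""
    padded = f"${word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_ngram_index(vocabulary, n=3):
    """Map each n-gram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index

def fuzzy_candidates(query, index, n=3):
    """Rank vocabulary words by how many n-grams they share with the query,
    so a misspelling still finds its neighbors."""
    scores = defaultdict(int)
    for gram in ngrams(query, n):
        for word in index.get(gram, ()):
            scores[word] += 1
    return sorted(scores, key=scores.get, reverse=True)

vocab = ["search", "serach", "sparse", "speech"]
index = build_ngram_index(vocab)
print(fuzzy_candidates("saerch", index)[0])  # → search
```

With small n (e.g., 2), common grams like “th” match too many words; with large n, a single typo destroys most of the shared grams, which is why one might keep indexes for several values of n.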

Use a ping service. Ping services can speed up your indexing process.

  1. Search Google for “pingmylinks”
  2. Click on the “add url” in the upper left corner.
  3. Submit your website, making sure to use all the submission tools, and your site should be indexed within hours.

Our ranking algorithm simply doesn’t rank google.com highly for the query “search engine.” There is not a single, simple reason why this is the case. If I had to guess, I would say that people who type “search engine” into Google are usually looking for general information about search engines or about alternative search engines, and neither query is well-answered by listing google.com.

To be clear, we have never manually altered the search results for this (or any other) specific query.

When I tried the query “search engine” on Bing, the results were similar; bing.com was #5 and google.com was #6.

What is the search algorithm used by the Google search engine? What is its complexity?

The basic idea is an inverted index: for each word, keep a list of the documents on the web that contain it.

Responding to a query means retrieving the matching documents (basically by intersecting the lists for the query words), processing them (extracting quality signals for each document/query pair), ranking them (using document signals like PageRank, query signals, and query/document signals), and then returning the top 10 documents.

Here are some tricks for doing the retrieval part efficiently:
– distribute the whole thing over thousands and thousands of machines
– do it in memory
– caching
– looking first at the query word with the shortest document list
– keeping the documents in the list in reverse PageRank order so that we can stop early once we find enough good quality matches
– keep lists for pairs of words that occur frequently together
– shard by document id, this way the load is somewhat evenly distributed and the intersection is done in parallel
– compress messages that are sent across the network
etc
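Two of those tricks, intersecting sorted posting lists and starting with the shortest list, can be sketched as follows. This is a simplified Python sketch with made-up doc ids, not production code:

```python
def intersect_postings(lists):
    """Intersect sorted posting lists of doc ids, starting with the
    shortest list so each merge step scans as few candidates as possible."""
    if not lists:
        return []
    lists = sorted(lists, key=len)          # shortest-list-first trick
    result = lists[0]
    for postings in lists[1:]:
        merged, i, j = [], 0, 0
        while i < len(result) and j < len(postings):
            if result[i] == postings[j]:
                merged.append(result[i]); i += 1; j += 1
            elif result[i] < postings[j]:
                i += 1
            else:
                j += 1
        result = merged
        if not result:                      # early exit: intersection is empty
            break
    return result

# Doc ids in increasing order; in production, keeping lists in reverse
# PageRank order instead would let us stop once we have enough good matches.
search = [2, 5, 9, 14, 21, 30]
engine = [5, 9, 30]
print(intersect_postings([search, engine]))  # → [5, 9, 30]
```

Sharding the lists by doc id, as the answer mentions, would let many machines run this same intersection on disjoint slices in parallel.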

Jeff Dean in this great talk explains quite a few bits of the internal Google infrastructure. He mentions a few of the previous ideas in the talk.

He goes through the evolution of the Google Search Serving Design and through MapReduce while giving general advice about building large scale systems.

https://www.youtube.com/watch?v=modXC5IWTJI&t=30s

Here’s a link to his slides:

As for complexity, it’s pretty hard to analyze because of all the moving parts, but Jeff mentions that the latency per query is about 0.2 s and that each query touches 1000 computers on average.

Is Google’s LaMDA conscious? A philosopher’s view (theconversation.com)

LaMDA is Google’s latest artificial intelligence (AI) chatbot. Blake Lemoine, a Google AI engineer, has claimed it is sentient. He’s been put on leave after publishing his conversations with LaMDA.

If Lemoine’s claims are true, it would be a milestone in the history of humankind and technological development.

Google strongly denies LaMDA has any sentient capacity.

[Image: phone screen showing the text “LaMDA: our breakthrough conversation technology”]

