Thread: Search Engine Design and Development - The Malaysian Initiatives

  1. #1
    Join Date
    Aug 2006
    Location
    Malaysia
    Posts
    1,576
    Rep Power
    188

    Search Engine Design and Development - The Malaysian Initiatives

    Background: The so-called Malaysian search engines (e.g. Cari, Catcha, Skali and others) deliver disappointing search results.

    To survive, they have turned into providers of business directories and forums, which are more lucrative.

    Objectives:
    - To discuss the ideal architecture of a search engine (SE) from backend to frontend, since an SE differs from the usual web-based application.
    Its backend (e.g. the web crawler) can be a stand-alone network application that talks to the internet over HTTP.
    - To discuss workable search algorithms.
    - To discover the "secret recipes" of search algorithms.
    - To learn, share and discuss SE terms and jargon.
    - To discuss the programming languages used for the SE backend and frontend.
    - To discuss new technologies (e.g. P2P, distributed networks, AI such as neural networks, spamming) that can be applied in an SE.
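
    To make the "stand-alone backend" idea concrete, here is a rough Python sketch of a breadth-first crawler that uses only the standard library. It is not how any particular engine does it. The names LinkExtractor, fetch_http and crawl are made up for illustration, and a real crawler would also need politeness delays, robots.txt checks and proper error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url):
    """Download one page over HTTP (hypothetical default fetcher)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, fetch=fetch_http, max_pages=10):
    """Breadth-first crawl starting at seed; returns {url: html}."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid loops
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```

    The fetch function is passed in as a parameter so the crawl logic can be tried against canned pages before it is pointed at the live web.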

    Amateurs, students (including Master's and PhD) and researchers specialising in data mining, networking, AI, digital linguistics or data communication are very welcome to participate.
    Last edited by genzy; 09-05-2007 at 01:14 AM.

  2. #2
    OpenWebSpider - Open Source Web Spider and Search Engine

    The OpenWebSpider project was born from the idea that the internet is free and all information must be freely available to all users. Built entirely on free software and being open source, OpenWebSpider aims to be the base for a new search engine developed by a community of open-source developers.

  3. #3
    Google started with two workers (two founders) in 1998.

    http://en.wikipedia.org/wiki/Google
    Google was co-founded by Larry Page and Sergey Brin while they were students at Stanford University, and the company was first incorporated as a privately held company on September 7, 1998.

    Currently Google has 12,238 full-time employees (as of March 31, 2007).

    The Google Search Engine Whitepaper (summary of the PhD thesis) by the Google founders
    http://infolab.stanford.edu/~backrub/google.html
    The Anatomy of a Large-Scale Hypertextual Web Search Engine
    Sergey Brin and Lawrence Page
    Computer Science Department, Stanford University, Stanford, CA 94305

    In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/
    To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date.
    Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
    Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google
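
    For anyone who wants to experiment with the link-structure idea behind the PageRank keyword above, here is a small Python sketch of the usual power-iteration approach. It is an illustration and not Google's production code. It uses the normalised variant where all ranks sum to 1, and the damping value of 0.85, the iteration count and the toy graph are assumptions picked for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns {page: rank}, with the ranks summing to about 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(toy_web).items()):
        print(page, round(score, 3))
```

    In the toy graph, page c is linked from both a and b, so it ends up with the highest rank, while b (linked only from a) ends up lowest.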

    ......................


    Figure 1.0: High Level Google Architecture

    .......................

    Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park in 1993. Currently, he is a Ph.D. candidate in computer science at Stanford University where he received his M.S. in 1995. He is a recipient of a National Science Foundation Graduate Fellowship. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data.

    Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering at the University of Michigan Ann Arbor in 1995. He is currently a Ph.D. candidate in Computer Science at Stanford University. Some of his research interests include the link structure of the web, human computer interaction, search engines, scalability of information access interfaces, and personal data mining.

  4. #4
    Simplicity and Enterprise Search
    A New Model for Managing Your Enterprise Information
    http://www.google.com/enterprise/pdf...erprise_wp.pdf
    “Make everything as simple as possible, but not simpler.” – Albert Einstein


    1. Fast, accurate search results. To be successful, enterprise search must be powerful enough to deliver the most relevant information, consistently and efficiently, whenever and wherever it’s needed.
    2. Minimal administrative overhead. Enterprise search must be quick enough to deploy and easy enough to manage that the cost of installing and maintaining it won’t exceed the benefit.
    3. An intelligible user interface. Enterprise search must be simple and effective enough that users will actually use it.


    Search quality: Deliver the goods
    To fully realize the value of the information assets your business creates:
    • Information must be readily and reliably accessible to everyone who’s entitled to view it.
    • The information delivered must be current and relevant (the user needs the right document, usually in the most recent version).
    • A clear, accurate ranking system should guide users swiftly and accurately to the data they need.
    • Your intranet search should put your whole organization on the same page, providing a consistent view of information across your company, while keeping sensitive documents secure.


    Complex problems, simple solutions
    Innovation and simplicity are often the best way to attack a complex problem.
    The first Google search engine – built with 30 off-the-shelf PCs running the free Linux operating system – is a case in point. Google’s design coupled innovative algorithms with a clustered approach to hardware infrastructure that capitalized on the falling prices of PCs, disk drives, memory, bandwidth, and data centers, and on the availability of continually faster, cheaper processors. The open source Linux operating system was chosen for similar reasons: it was well supported and reliable, could be customized at will, and cost nothing to use. Grid computing enabled the modular, scalable framework into which these elements fit.

    This whole infrastructure provided a robust ecology in which Google’s data indexing and retrieval algorithms could thrive. The result was simple, powerful, flexible, and highly scalable – as evidenced by the fact that Google’s architecture remains essentially the same now as then, though with about 1,000 times as many machines.


    SIX WAYS THAT POOR SEARCH WASTES COMPANY TIME AND RESOURCES:
    1. Time lost to ineffective search
    2. Time and money lost to administration of search systems and data (both IT staff time and maintenance contracts)
    3. Time spent tweaking and weighting documents to satisfy the requirements of complex systems
    4. The lost value of missing company information
    5. The lost value of undocumented employee knowledge
    6. Revenue lost through delays in time-to-market


    THE CRITERIA FOR GOOD ENTERPRISE SEARCH:
    Fast, accurate search results.
    To be successful, enterprise search must be powerful enough to deliver the most relevant information, consistently and efficiently, whenever and wherever it’s needed.

    Minimal administrative overhead.
    Enterprise search must be quick enough to deploy and easy enough to manage that the cost of installing and maintaining it won’t exceed the benefit.

    An intelligible user interface.
    Enterprise search must be simple and effective enough that users will actually use it.

  5. #5
    Google's FAQs for Googlebot
    http://www.google.com/support/webmasters/bin/topic.py?topic=8843
    How Google crawls my site

    Using a robots.txt file
    o How do I request that Google not crawl parts or all of my site?
    o How do I use a robots.txt file to control access to my site?
    o How do I create a robots.txt file?
    o Where do I place my robots.txt file?
    o How do I block Googlebot?
    o I don't want to list every file that I want to block. Can I use pattern matching?
    o How do I test my robots.txt file?
    o If I change my robots.txt file or upload a new one, how soon will it take effect?
    o I don't want certain pages of my site to be indexed, but I want to show AdSense ads on those pages. Can I do that?
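
    Since the questions above are all about robots.txt, here is a short Python sketch of how a home-made crawler could honour such a file using the standard library's urllib.robotparser. The sample rules, the example.test host and the helper names build_parser and allowed are assumptions for illustration. They are not taken from Google's FAQ.

```python
from urllib.robotparser import RobotFileParser

# Sample rules (made up): keep Googlebot out of /private/,
# and keep every other crawler out of /tmp/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, user_agent, url):
    """Return True if user_agent may fetch url under the parsed rules."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent, url in [
        ("Googlebot", "http://example.test/private/report.html"),
        ("Googlebot", "http://example.test/index.html"),
        ("OtherBot", "http://example.test/tmp/cache.html"),
    ]:
        print(agent, url, allowed(rules, agent, url))
```

    Note that a crawler matching a specific User-agent group follows only that group, so in this sample Googlebot is not bound by the /tmp/ rule in the catch-all group.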

    Googlebot
    o Why doesn't Google index all of the pages of my site?
    o How often will Googlebot access my web pages?
    o Googlebot is crawling my site too fast. What can I do?
    o Why is Googlebot asking for a file called robots.txt that isn't on my server?
    o Why is Googlebot trying to download incorrect links from my server? Or from a server that doesn't exist?
    o Why is Googlebot downloading information from our "secret" web server?
    o Why isn't Googlebot obeying my robots.txt file?
    o Why are there hits from multiple machines at Google.com, all with user-agent Googlebot?
    o Can you tell me the IP addresses from which Googlebot crawls so that I can filter my logs?
    o Why don't the pages of my site that Googlebot crawled show up in your index?
    o What kinds of links does Googlebot follow?
    o How do I prevent Googlebot from following links on my pages?
    o How do I tell Googlebot not to crawl a single outgoing link on a page?
    o Why is Googlebot downloading the same page on my site multiple times?
    o What is Feedfetcher, and why is it ignoring my robots.txt file?
    o What can I do if Google is creating too high a load on my server?
    o Why did my firewall report unauthorized access from Google?

    Feedfetcher
    o How do I add my feed to the search results for Google's personalized homepage or Google Reader?
    o How do I request that Google not retrieve some or all of my site's feeds?
    o How often will Feedfetcher retrieve my feeds?
    o Feedfetcher is retrieving my site's feeds too frequently. What can I do?
    o Why is Feedfetcher trying to download incorrect links from my server, or from a server that doesn't exist?
    o Why is Feedfetcher downloading information from our "secret" web server?
    o Why isn't Feedfetcher obeying my robots.txt file?
    o Why are there hits from multiple machines at Google.com, all with user-agent Feedfetcher?
    o Can you tell me the IP addresses from which Feedfetcher makes requests so that I can filter my logs?
    o Why is Feedfetcher downloading the same page on my site multiple times?
    o Why don't the feeds from my site that Feedfetcher requested show up in your index?
    o What kinds of links does Feedfetcher follow?

  6. #6
    http://en.wikipedia.org/wiki/Googlebot
