YaCy: Difference between revisions

From Doc-Wiki
imported>Burghardt
imported>Burghardt
 
(11 intermediate revisions by the same user are not shown)
Line 1: Line 1:
== Search engine advantages ==
+
== Search Engine (Dis-) Advantages ==
We try to implement a Search Engine for all of our Computer Science related web sites at our University.
+
We would like to have a Search Engine for all of our Computer Science web sites at our University - searching on both research and study related content.
   
Of course you can always use Google, but you will either get more hits than you want or you set a filter ("<tt>site:uni-goettingen.de</tt>") and miss relevant content from other local resources off of <tt>*.uni-goettingen.de</tt>. So the main goal for ''this'' Search Engine is to ''restrict'' the search namespace to relevant sites.
+
Of course you can always use Google to search the web. But you will either get more hits than you want, or you set a filter ("<tt>site:uni-goettingen.de</tt>") and miss relevant content from other local resources off of <tt>*.uni-goettingen.de</tt>. So the main goal for ''our'' Search Engine is to ''restrict'' the search name space to relevant sites. The next try after dropping Google is to use the integrated search engine embedded on www.uni-goettingen.de on the top right. You will notice this is... Google - again!
   
The next try after leaving Google is to use the integrated search engine embedded on www.uni-goettingen.de. This has two major problems for us:
+
Also the embedded variant still has those two major problems for us:
* You still get too many hits. There is just no integrated way to restrict the search to pages below (for example) <tt>www.informatik.uni-goettingen</tt> etc. All content of all Institutes and projects are in the single [[GCMS]] database. There is no separate namespace for each Institute for Faculty.
+
* You still get too many hits. There is just no integrated way to restrict the search to pages below (for example) <tt>www.informatik.uni-goettingen</tt> etc. All content of all Institutes and projects are presented below <tt>www.uni-goettingen.de</tt>. There is no separate name space for each Institute or Faculty.
* On the other hand this search engine does ''not'' include GCMS-external but relevant sites like <nowiki>http://www.swe.informatik.uni-goettingen.de, http://www.math-cs.uni-goettingen.de</nowiki> etc.
+
* On the other hand this search engine does ''not'' include GCMS-external but relevant sites like <nowiki>http://www.swe.informatik.uni-goettingen.de or http://www.math-cs.uni-goettingen.de</nowiki>.
   
The kind of content we are preparing this search engine for is unrestricted and language agnostic
 
* Study related pages
 
* Research related pages including external "project"-pages
 
* University infrastructure pages - if there is a relationship regarding "our" Computer Science
 
* Technical infrastructure pages - if there is a relationship regarding "our" Computer Science
 
   
 
== Help us make this engine actually usable ==
We can index public pages only. This is unfortunate as especially locked down areas like group Wikis and project sites would benefit from a central search engine even more than these public pages. This problem may be solved later...
 
 
Please check the search results for your own area of interest. If you find something relevant missing '''please communicate the URLs to include in the index'''. Please use the dedicated email address
 
* <tt>search ät informatik.uni-goettingen.de</tt>
  +
for this kind of feedback. <small>You may also use the "Discussion"-page of this article to gather URLs. It is editable by everyone after login, this is a ''wiki''.</small>
   
  +
Any help optimizing the crawl process to increase the quality of the index is welcome! This includes adding missing content as well as working on a stringent blacklist to reduce the noise.
== Implementation ==
 
* Virtual machine (hosted at [[Gwdg]] ) running Debian GNU/Linux
 
   
  +
== Drawbacks ==
  +
* [[JavaScript]] required
 
* We can index publicly available pages only. This is unfortunate as especially those locked down areas like group Wikis and internal project sites would benefit from a central search engine even more than these public pages. This problem may be solved later -- currently marked "won't fix" for obvious reasons.
   
  +
== GCMS ==
== Help us make this engine actually usable ==
 
  +
The search engine standard view is embedded in [[GCMS]] on the right side of the startpage of the Institute. The direct URL (currently) is https://www.uni-goettingen.de/de/suche/575958.html.
Please check the search results for your own area of interest. If you find something relevant missing '''please communicate the URLs to include in the index''' -- [[User:Burghardt]]
 
 
   
 
== See also ==
 
== See also ==
Line 28: Line 27:
   
 
== Links ==
 
== Links ==
  +
* '''https://search.informatik.uni-goettingen.de''' -- plain Search
* http://yacy.net/
 
* '''http://search.informatik.uni-goettingen.de''' -- plain Search
+
* http://yacy.net/ -- the Source
* http://search.informatik.uni-goettingen.de/Status.html -- some additional information
 
 
   
   

Latest revision as of 09:05, 30 July 2018

Search Engine (Dis-) Advantages

We would like to have a search engine covering all Computer Science web sites at our University, searching both research and study related content.

Of course you can always use Google to search the web. But you will either get more hits than you want, or you set a filter ("site:uni-goettingen.de") and miss relevant content from other local resources outside of *.uni-goettingen.de. So the main goal of our search engine is to restrict the search namespace to relevant sites. The next thing to try after dropping Google is the integrated search engine embedded at the top right of www.uni-goettingen.de. You will notice this is... Google, again!

The embedded variant still has two major problems for us:

  • You still get too many hits. There is no built-in way to restrict the search to pages below (for example) www.informatik.uni-goettingen etc. All content of all Institutes and projects is presented below www.uni-goettingen.de. There is no separate namespace for each Institute or Faculty.
  • On the other hand, this search engine does not include GCMS-external but relevant sites like http://www.swe.informatik.uni-goettingen.de or http://www.math-cs.uni-goettingen.de.


Help us make this engine actually usable

Please check the search results for your own area of interest. If you find that something relevant is missing, please send us the URLs to include in the index. Please use the dedicated email address

  • search ät informatik.uni-goettingen.de

for this kind of feedback. You may also use the "Discussion" page of this article to collect URLs. It is editable by everyone after login; this is a wiki.

Any help optimizing the crawl process to increase the quality of the index is welcome! This includes adding missing content as well as working on a stringent blacklist to reduce the noise.
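
To illustrate how an allow-list of relevant hosts and a blacklist of noisy URL patterns can work together, here is a minimal Python sketch. It is not YaCy's configuration format or API. The two hosts are the ones named in this article; the blacklist patterns and the function name should_index are hypothetical and chosen only for illustration.

```python
import re
from urllib.parse import urlsplit

# Hosts named in this article as relevant; a real list would be longer.
ALLOWED_HOSTS = {
    "www.swe.informatik.uni-goettingen.de",
    "www.math-cs.uni-goettingen.de",
}

# Hypothetical noise patterns (assumption, not taken from the real blacklist).
BLACKLIST_PATTERNS = [
    re.compile(r"[?&]action=(edit|history)\b"),  # wiki edit/history views
    re.compile(r"/calendar/\d{4}/"),             # endless calendar pages
]


def should_index(url: str) -> bool:
    """Return True if the URL is on an allowed host and matches no blacklist pattern."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    if host not in ALLOWED_HOSTS:
        return False
    return not any(p.search(url) for p in BLACKLIST_PATTERNS)
```

Reporting a missing site corresponds to adding a host to the allow-list, while reporting noise corresponds to adding a pattern to the blacklist.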

Drawbacks

  • JavaScript required
  • We can index publicly available pages only. This is unfortunate, as locked-down areas such as group wikis and internal project sites would benefit from a central search engine even more than these public pages. This problem may be solved later; it is currently marked "won't fix" for obvious reasons.

GCMS

The standard view of the search engine is embedded in GCMS on the right side of the start page of the Institute. The direct URL (currently) is https://www.uni-goettingen.de/de/suche/575958.html.

See also

  • ...

Links