Autonomy HTTP Fetch 2.3 rev 14

HTTP Fetch version 2.3.x - revision 14

Administrator’s Guide

Information in this document is subject to change without notice. No part of this document may be reproduced or
transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express permission
of Autonomy Systems Ltd.

Windows is a trademark of Microsoft Corp., UNIX is a trademark of X/OPEN Ltd.

Copyright © 2005 Autonomy. All rights reserved. IDOL server and HTTP Fetch are trademarks of Autonomy Systems Ltd.

Table of Contents

Preface
    Autonomy
    Contact
    Downloading manual updates from Automater
    Typographical conventions
    Related documentation
1. Autonomy infrastructure
    IDOL server
    Connectors
    Interfaces
    Distributed systems
    Administration
    PODS
    Data flow and security
2. Introduction
    System architecture
3. Installation
    System requirements
    Required information
    Installing HTTP Fetch under Windows NT or 2000
        Directory structure: Windows
    Installing HTTP Fetch under UNIX
        Directory structure: UNIX
4. Configuring HTTP Fetch
    Displaying help on configuration settings
    Modifying configuration parameter values
    Configuration file sections
        [License] section
        [Service] section
        [Default] section
        [Spider] section
5. Getting started with HTTP Fetch
    The HTTP Fetch spidering process
    Starting the HTTP Fetch Administration Utility
    Disabling indexing into IDOL server
    Creating a new spider
    Activating a spider
    Starting the HTTP Fetch service
    Viewing the spider’s output
    Stopping the HTTP Fetch service
    Deactivating a spider
    Deleting a spider
    Enabling indexing into IDOL server
6. Using HTTP Fetch
    Selecting web sites to spider
    Configuring HTTP Fetch spiders
        Minimum spider configuration
        Authentication
7. Retrieving up-to-date web pages
    Configuring frequent spider runs
    Configuring date checking
        Specifying date formats
    Filtering by date
    Foreign language dates
    Handling undated documents
    Tutorial: Configuring a spider to retrieve the latest content
8. Retrieving relevant web pages
    Filtering by text strings
        Retrieving documents that contain required text strings
        Rejecting documents that contain unwanted text strings
        Using wildcards in text strings
    Tutorial: Setting up spiders to retrieve relevant documents
9. Optimizing HTTP Fetch
    Specifying the scope of the spider
    Configuring the spidering run time and speed
    Setting an optimum number of connections or using a proxy server
    Grouping similar spiders in the same configuration file
        Global and default settings
        Using multiple configuration files
    Rejecting pages that contain little or no content
    Limiting the overall quantity of data downloaded
    Setting the spider’s identity to suit the web site that it is spidering
    Importing and indexing documents in batches
10. HTTP Fetch Log Files
    The <InstallationName>.log
    licence.log and service.log files
    <MySpider>.log
11. Troubleshooting
Appendix A: Service port commands
    GetConfig
    GetLogStream
    GetLogStreamNames
    GetStatistics
    GetStatus
    GetStatusInfo
    MergeConfig
    SetConfig
    Stop
Appendix B: The NTLM Proxy module
Appendix C: Error messages
Glossary
Index

Preface

Autonomy

Autonomy employs a fundamentally different and unique combination of technologies to enable
computers to form an understanding of a page of text, web pages, emails, voice, documents
and people.
Autonomy's solution is therefore able to power any application dependent upon unstructured
information within every market sector, including: e-commerce, customer relationship
management, knowledge management, enterprise information portals and online publishing
applications.
This is evidenced by the significant penetration of the technology in a diversity of vertical
markets and has been achieved principally because every market sector needs to manage and
leverage the benefits of unstructured information.
Autonomy was founded in 1996 and has offices in Boston, Chicago, Dallas, San Francisco,
New York, and Washington, D.C. in the United States, as well as offices throughout EMEA,
including Amsterdam, Brussels, Cambridge, Frankfurt, Milan, Paris, Oslo, and Sydney. In July
1998, the company went public on the EASDAQ exchange (EASDAQ:AUTN). Autonomy
floated on The NASDAQ National Market (NASDAQ: AUTNY) in May 2000, and on the London
Stock Exchange (LSE: AU.) in November 2000.
Contact

To contact Autonomy, please get in touch with your nearest location listed below.

Europe and South Pacific
Autonomy Systems Ltd.
Cambridge Business Park
Cowley Road
Cambridge
CB4 0WZ
Help Desk: +44 (0) 800 0 282 858
Switchboard: +44 (0) 1223 448 000
Fax: +44 (0) 1223 448 001
Email for information: autonomy@autonomy.com
Email for support: uksupport@autonomy.com
The Help Desk operates from 9.30 am to 6.00 pm (GMT) Monday to Friday.
Website: www.autonomy.com
USA
Autonomy Inc.
One Market
Spear Street Tower
San Francisco
CA 94105
Help Desk: +1 877 333 7744
Switchboard: +1 415 243 9955
Fax: +1 415 243 9984
Email for information: info@us.autonomy.com
Email for support: support@us.autonomy.com
The Help Desk operates from 9.30 am to 6.00 pm (CST) Monday to Friday, toll-free.
Website: www.autonomy.com
Downloading manual updates from Automater

To assist you in utilizing the benefits that Autonomy’s solutions offer you, Autonomy provides
free downloads of the latest available documentation.
To download documentation updates:
1. Enter the following URL in your web browser's Address field: http://automater.autonomy.com
2. Enter your Username and Password, and click on the Login button.
3. Click on the Download menu option.
4. Under the Documentation and Release Notes heading, click on the Click here link,
   then click on the Manuals folder to display the latest available manual versions. You can
   display any of the manuals in your browser and download them.
Note: the manual's version number (for example, version 4.1.x) corresponds to the product
version. The last number of the product version has been replaced with an x for all manuals
as this number relates to minor product releases that have no effect on the documentation. If
a manual has a revision number (for example revision 5), it indicates that this manual has
been revised since it was first released. Automater always contains the latest available
revision of all manuals.
Typographical conventions

Autonomy documentation uses the following typographical conventions.

Formatting convention    Type of information
Bold type                References to any of the following:
                             Interface options (for example, menus or buttons)
                             Actions
                             Parameters
Courier font             Configuration examples
<text>                   A string that needs to be replaced with a personal setting. For
                         example, <port> indicates that you have to specify a port
                         number, [<MySection>] indicates that you have to specify a
                         section name, and so on.
                         Note that this only applies where the text does not explicitly refer to
                         XML. Another exception is the instructions for writing ACI
                         templates (an appendix to product manuals where this is
                         applicable), where personal settings are indicated by italic type.
Related documentation

You should use the HTTP Fetch manual in conjunction with the following:

Import Module manual
    Autonomy’s Import Module is an integral part of all Autonomy connectors. The Import
    Module manual provides information on how you can configure the settings that determine
    how content is treated during the importing process (before it is passed to IDOL server).
IDOL server manual
    IDOL server lies at the center of any Autonomy infrastructure, storing and processing the
    data that connectors index into it. The IDOL server manual describes the operations that
    IDOL server can perform, with detailed descriptions of how to set them up.
DIH manual
    The DIH (Distributed Index Handler) manual contains details on how you can use a DIH to
    distribute aggregated documents across multiple IDOL servers.
Best Practices Guide
    The Best Practices Guide provides useful hints and tips on setting up and configuring
    Autonomy solutions, as well as examples of how to combine multiple products effectively.
IAS manual
    The IAS manual contains details on how you can use Autonomy’s Intellectual Asset
    Protection System (IAS) to ensure secure access through authentication and role
    permissions.
DiSH manual
    The DiSH (Distributed Service Handler) manual contains details on how you can use a
    DiSH server to administer and control multiple Autonomy services.
Online help
    The online help details the configuration settings that are available for HTTP Fetch. Please
    refer to Displaying help on configuration settings in chapter 4, Configuring HTTP Fetch,
    for details on how to display help.
1. Autonomy infrastructure

"Today, 80% of business is conducted on unstructured information." Gartner Group
"85 per cent of all data stored is held in an unstructured format." Butler Group
"Unstructured data doubles every three months." Gartner Group

Information that you need in order to conduct business successfully comprises the following types:

In the past, companies could only make use of 20% of the information that was relevant to them. In
order to deal with this information they used keyword search engines, tagging schemes, collaborative
filtering or linguistic methods. These methods were not only costly and time-inefficient but also
non-scalable and inaccurate, and they took the focus away from core business.
80% of relevant information could not be utilized.

Autonomy's software infrastructure allows you to utilize 100% of the information that is relevant to you.
It automates all the business processes that formerly had to be dealt with manually.
By developing a patented combination of Bayesian Inference, Shannon's information theory and
pattern matching, Autonomy has enabled computers to understand unstructured, structured and
semi-structured information. This means that Autonomy's software infrastructure solves a fundamental
problem that affects every industry, and can be used in virtually any application that handles
unstructured information:
    E-Commerce
    CRM
    Knowledge Management
    Business Intelligence
    Enterprise Information Portals
    Online Publishing
Autonomy's software infrastructure is fully scalable and allows you to process information:
    automatically
    in real time
    in any language

IDOL server

Using Autonomy connectors, Autonomy's Intelligent Data Operating Layer (IDOL) server integrates
unstructured, semi-structured and structured information from multiple repositories through an
understanding of the content, delivering a real time environment in which operations across
applications and content are automated, removing all the manual processes involved in getting the
right information to the right people at the right time.

Connectors

Connectors enable automatic content aggregation from any type of local or remote repository (for
example, a database, a web site, a real-time telephone conversation etc.), forming a unified solution
across all information assets within the organization.

Interfaces

Portlets are windows that can be set up in Autonomy's Portal-in-a-Box or third party portals. Each
portlet contains an application that allows the portal's end users to benefit from a variety of IDOL
server functionality.
Retina™ is an easy-to-use web interface application that provides a full range of retrieval methods
that adjust to the individual user’s proficiency.
Autonomy Desktop Suite™ brings the power of Autonomy to every desktop. Conducting a real-time
analysis of the ideas involved in the content of any opened desktop application, Desktop
Suite’s ActiveKnowledge or Active Windows Extensions module provides real-time links to
relevant internal and external information without the user being needlessly diverted from his work
in progress to perform an exasperating search or retrieval operation.

Distributed systems

Autonomy’s distribution solutions facilitate linear scaling of systems through faster command
execution and reduction of processing time.
DAH™ (Distributed Load Handler) enables the distribution of ACI (Autonomy Content
Infrastructure) action commands to multiple Autonomy IDOL servers, providing failover and load
balancing.
DIH™ (Distributed Index Handler) enables distributed indexing of documents into multiple
Autonomy IDOL servers, providing failover and load balancing.

Administration

DiSH™ (Distributed Service Handler) provides crucial maintenance, administration, control and
monitoring functionality for the Autonomy infrastructure. DiSH delivers a unified way to
communicate with all Autonomy services, such as connectors, DIH, DAH and so on, from a
centralized location.
Autonomy Service Dashboard™ is a stand-alone web application that allows administrators to
manage all Autonomy modules and services running locally or remotely.
The Dashboard communicates with the Distributed Service Handler (DiSH) module that is the
back end process for monitoring and controlling all the Autonomy child services. Autonomy
Service Dashboard provides the administrator with a list of all child services that DiSH is
monitoring, together with control buttons and status information.

PODS

Autonomy’s Product Orientated Drop-in Solutions (PODS) allow Autonomy solutions to be easily integrated
with third party applications and solution providers. PODS enable organizations to make their existing
applications compatible with IDOL with minimal configuration and administration requirements. Making
IDOL server a part of any solution delivers the direct benefits of content automation and the ability to
perform a vast range of IDOL server operations, irrespective of file format or location.

Data flow and security

Aggregation & Distribution
Connectors aggregate content from various repositories and index it into IDOL server or, if the content
needs to be distributed across multiple IDOL servers, a DIH (Distributed Index Handler).
Querying & Distribution
User queries are sent from a front end directly to IDOL server or distributed to multiple IDOL servers
using the DAH (Distributed Load Handler).
Distributed Administration
The DiSH (Distributed Service Handler) enables administrators to maintain, configure and control
multiple Autonomy services via the Autonomy Service Dashboard, a front-end web interface.
Security
The Autonomy IAS (Intellectual Asset Protection System) ensures secure access through
authentication and role permissions. When a user logs on to a front end (for example, Retina or a 3rd
party portal) his authentication details are sent to IDOL server which returns the user's security details
to the front end, where they are stored until the user logs off or his session times out. Every time the
user issues a query, his security details are attached to the query string that is sent to IDOL server.
The group servers store the user group information of repositories that store users in groups. This
allows the front end to quickly retrieve user security information from the group servers, and send the
query and the user's security information to IDOL server in order to check if the user is permitted to
view result documents before they are displayed to the user.
IDOL server passes the user's security details to the security libraries for the data repositories that
contain result documents for the user's query. The security libraries then check the user's security
details against the ACLs for the documents that match the query. If the user is entitled to view a
document, it is returned as a result to the front end.
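
The final check in this flow can be pictured with a short sketch. It only illustrates the idea of comparing a user's security details with per-document ACLs; it is not Autonomy's security library, and the Document type, the group names, the URLs and the function name are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A result document with a hypothetical ACL: the groups allowed to view it."""
    reference: str
    allowed_groups: set = field(default_factory=set)


def permitted_results(user_groups, matches):
    """Return only the matching documents whose ACL shares a group with the user."""
    user_groups = set(user_groups)
    return [doc for doc in matches if doc.allowed_groups & user_groups]


# Hypothetical documents that matched a query, each with its own ACL.
matches = [
    Document("http://intranet.example/hr/policy.html", {"hr", "managers"}),
    Document("http://intranet.example/eng/spec.html", {"engineering"}),
]

# Hypothetical security details (group memberships) attached to the user's query.
visible = permitted_results({"engineering", "staff"}, matches)
print([doc.reference for doc in visible])
```

In this sketch a document is returned only if the user belongs to at least one group named in its ACL; a real repository may apply richer rules, such as explicit denials.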

2. Introduction

HTTP Fetch is a powerful tool for retrieving web site documents and indexing them into an Autonomy
IDOL server. The HTTP Fetch service uses spiders to find web pages and to process the web pages
for content and links to other web sites. HTTP Fetch can run several spiders simultaneously, making
efficient use of processor capability and bandwidth.
Once a spider has finished retrieving documents from the web site, HTTP Fetch imports the content
that the spider has retrieved into Autonomy’s indexing file format (IDX) or XML, and automatically
indexes the data into IDOL server. HTTP Fetch can retrieve content from various document types,
including web documents, Word, Excel and PDF files.
HTTP Fetch is highly configurable and can be scaled and tuned to meet your exact requirements. You
can configure HTTP Fetch to process the documents during importing and indexing, and to add
custom information to the document record in IDOL server.
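
To make the phrase "process the web pages for content and links" more concrete, the following is a minimal sketch of that single step using only the Python standard library. It is not how HTTP Fetch is implemented; the class name, the function name and the example URLs are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the visible text and the absolute link targets of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every <a href="..."> as an absolute URL.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Keep non-empty text fragments as the page content.
        if data.strip():
            self.text_parts.append(data.strip())


def process_page(base_url, html):
    """Return (content, links) for a page that has already been downloaded."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return " ".join(collector.text_parts), collector.links


# Hypothetical page, standing in for a document retrieved by a spider.
page = (
    '<html><body><h1>News</h1><p>Read <a href="/about.html">about us</a> '
    'or visit <a href="http://other.example/">a partner</a>.</p></body></html>'
)
content, links = process_page("http://www.example.com/index.html", page)
print(links)
```

A spider would typically add the collected links to its list of pages still to visit and pass the content on for importing; both steps are left out of this sketch.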

System architecture

HTTP Fetch can be installed in a number of ways. The simplest is to install one HTTP Fetch service on
a single machine, and configure it to aggregate documents from a number of web sites.
You can also configure more complex systems, for example with more than one HTTP Fetch indexing
data into IDOL server. Other system architectures are also possible.

3. Installation

System requirements

Minimum suggested server requirements

Windows NT and 2000 (Intel)
    200 MHz Pentium processor
    64 MB RAM
    2 GB hard disk recommended
UNIX
    128 MB of RAM
    2 GB hard disk recommended
    2.2.12 kernel
Solaris 2.5 (SUN SPARC)
    128 MB of RAM
    2 GB hard disk recommended

Required information

Before installing HTTP Fetch you should have the following information ready:
Installation Name
If you are licensed to have multiple installations, then each installation requires a distinct name to
differentiate it from other installations.
License Holder Name
The name of the person or company to which the product is licensed. If you have an evaluation version
and later obtain a full license key, you need to update the Holder, Key and Full settings in the
[License] section of the configuration file with your full license key details.
Firewall/Proxy server settings
If a firewall or proxy server is between the machine you wish to run HTTP Fetch on and the Internet,
then you need these settings in order to gather documents from remote web sites. These settings
include the IP Address, the port number of the firewall or proxy server, and the username and
password you use to gain access.
Autonomy IDOL server settings
HTTP Fetch automatically indexes documents into your IDOL server. You need to know the settings
for the IDOL server you want to index the data into. These settings are: the IP address or host name of
the IDOL server machine, the Query and Index port numbers and the name of the Database into which
HTTP Fetch will index the documents. You can find information on these settings in the IDOL server
configuration file.
Indexing Details
HTTP Fetch stores indexing information in files, and IDOL server is instructed to index these files. You
need to decide where you want these files to reside, and you need to know the full path of that directory
as seen from the HTTP Fetch service and as seen from IDOL server.
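
As an illustration of the license update mentioned under License Holder Name, the sketch below rewrites the Holder, Key and Full settings in the [License] section of a configuration file. It assumes that the file can be read as a plain INI file by Python's configparser, which this guide does not confirm; the file name and all values in the usage comment are hypothetical, and the value expected by the Full setting should be taken from your license details. Note that configparser does not preserve comments, so work on a backup copy.

```python
import configparser


def apply_full_license(config_path, holder, key, full):
    """Write new Holder, Key and Full values into the [License] section.

    Assumption: the configuration file is INI-compatible. Comments in the
    file are not preserved by configparser.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep setting names such as "Holder" as written
    parser.read(config_path)

    if not parser.has_section("License"):
        parser.add_section("License")

    # Setting names come from the guide; the values are supplied by the caller.
    parser["License"]["Holder"] = holder
    parser["License"]["Key"] = key
    parser["License"]["Full"] = full

    with open(config_path, "w") as handle:
        parser.write(handle)


# Hypothetical usage (file name and values are placeholders):
# apply_full_license("HTTPFetch.cfg", "Example Corp", "<your license key>", "<Full value>")
```

Editing the three settings by hand in a text editor achieves the same result; the sketch is only useful if several installations need the same change.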

Installing HTTP Fetch under Windows NT or 2000

To install under Windows, insert the HTTP Fetch CD-ROM into your CD-ROM drive, and start the
installation by double-clicking on the HTTP Fetch-2.3.x.exe program in the root directory of the
CD-ROM in Windows Explorer.
Read and follow all installation instructions on the screen carefully.
1. The installation opens with the Welcome dialog.
   Read the text and click on Next.
2. If you have purchased a fully enabled version of HTTP Fetch:
   The Autonomy HTTP Fetch License dialog is displayed. Read the license agreement and click on Next to
   accept it. If this dialog is not displayed, check whether the CD-ROM contains the licensekey.txt file,
   and copy this into the same directory as the HTTP Fetch installer. Contact Autonomy Helpdesk if you do
   not have the license key. Please see the Contact section in the Preface for the Helpdesk contact details.
   If you are installing the HTTP Fetch 30 day evaluation version:
   The License Details dialog is displayed. Read the text and click on Next.
   The Limited Evaluation License Agreement dialog is displayed. Read the license agreement and click on
   Next to accept it.
   The Enter license holder name dialog is displayed. Enter the license holder’s name and click on Next.
3. The Installation Name dialog is displayed.
   Enter a unique name for the HTTP Fetch installation, and click on Next. Note that the unique
   name must not contain any spaces. By default, the installation name is HTTPFetch.
4. The Choose Destination Location dialog is displayed.
   Select the directory in which you want to install HTTP Fetch, and click on Next. By default HTTP
   Fetch is installed in C:\Autonomy\<InstallationName>, but you can use the Browse button to
   navigate to another location.
5. The Select Components dialog is displayed.
   Specify which HTTP Fetch components you want to install by checking the appropriate boxes, and
   click on Next when you are finished. The following components are available:
   Fetch Service & example configuration file
       The main HTTP Fetch component. You must check this box for HTTP Fetch to be able to
       operate.
   Fetch Administration Utility
       A utility for administering and configuring HTTP Fetch.
   Fetch Administration Online Help
       Online help information for the Administration Utility.
6. The DiSH Settings dialog is displayed.
   Enter the following details, and click on Next:
   DiSH Service Port
       The port number that HTTP Fetch will use for DiSH communication. Note: This port must
       not be used by any other service.
   Service Control Clients
       The IP address of machines that are permitted to control the HTTP Fetch service. Note
       that if you want to permit a number of machines to control the HTTP Fetch service, you
       can use a wildcard in the IP address.
       For example: enter 187.*.*.* to permit any machine whose IP address begins with 187 to
       control the HTTP Fetch service.
   Service Status Clients
       The IP address of machines that are permitted to access the HTTP Fetch status but are
       not permitted to control it. Note that if you want to permit a number of machines to access
       the HTTP Fetch status, you can use a wildcard in the IP address.
       For example: enter 187.*.*.* to permit any machine whose IP address begins with 187 to
       access the HTTP Fetch status.
7. The Proxy Settings dialog is displayed.
   If you use a firewall or proxy server to access the internet, enter the following details for the proxy
   server, and click on Next:
   Proxy Server
       The IP address (or name) of the machine on which the proxy server is located.
   Proxy Port
       The port used by the proxy server.
   Proxy Username
       The username for the proxy server.
   Proxy Password
       The password for the proxy server.
8. The IDOL server Settings dialog is displayed.
   Enter the following details for the IDOL server into which you want to index the documents that
   HTTP Fetch retrieves:
   Hostname
       The IP address (or name) of the machine on which IDOL server is running.
   Query Port
       The port number that is used to send queries to IDOL server.
   Index Port
       The port number that is used to index documents into IDOL server.
9. The UserAgent Settings dialog is displayed.
   Enter some contact information, such as an e-mail address, so that the sites that HTTP Fetch
   spiders can contact you.
10. The Indexing Details dialog is displayed.
    Enter the following details for the indexing process, and click on Next:
    IDOL server reads index files from
        The path to the folder from which the IDOL server reads the index files.
    Database
        The name of the IDOL server database into which you want HTTP Fetch to index the
        documents that it fetches.
11. The Start Installation dialog is displayed.
    Click on Next to confirm the settings that you have made and to start the installation. Alternatively,
    click on Back to return to previous dialogs and make the appropriate changes.
12. The Installing dialog is displayed.
    The progress of the installation process is indicated. If you want to abort the installation process,
    click Cancel.
13. The Start Menu Shortcuts dialog is displayed.
    Specify whether you want to add HTTP Fetch shortcuts to the Start menu and click on Next.
14. The Services dialog is displayed.
    Specify whether you want the HTTP Fetch service to be started immediately after the installation
    is complete, and click on Next.
15. The Installation Complete dialog is displayed.
    HTTP Fetch has been installed successfully. Click on Finish to exit the installation.
You can now edit the HTTP Fetch configuration file (which is located in the
C:\Autonomy\<InstallationName> folder) and start the HTTP Fetch Service.
Directory structure: Windows
Once the installation of HTTP Fetch is completed, your installation directory contains the following files and subdirectories (folders are marked "folder"):
Conv Tables (folder)            Folder containing text files for language conversion.
Filters (folder)
    binslave.cfg                Configuration file for binslave.exe.
    binslave.exe                Executable that parses binary files looking for ASCII content, which can be imported into IDOL server.
    importslave.exe             Executable that generates IDX files for IDOL server.
    omnislave.cfg               Omnislave configuration file that contains the Omnislave settings.
    omnislave.exe               Omnislave executable that parses files not in HTML or PDF format to IDX files.
    pdfslave.cfg                Configuration file for pdfslave.exe.
    pdfslave.exe                Executable that parses PDF files to IDX files.
    various .dat files          Data files used by HTTP Fetch.
    various .dll files          Filters used by Omnislave to convert various document types to HTML format.
httpfetch.dat                   Data file used by HTTP Fetch.
ImportWizard.exe                Plugin for the HTTP Fetch Admin application, which controls how files are imported.
Install.log                     Installation log file that lists the details of the installation process.
<InstallationName>.cfg          Configuration file that contains the HTTP Fetch settings.
<InstallationName>.exe          HTTP Fetch executable.
<InstallationName>Admin.cfg     Configuration file for the HTTP Fetch Admin application.
<InstallationName>Admin.exe     Executable for the HTTP Fetch Admin application.
Unwise.exe                      Executable to uninstall HTTP Fetch from your computer.
When you start HTTP Fetch for the first time, the following files are created:
HTTPFETCHADMIN-SPIDER (folder)  Folder that contains data used by HTTP Fetch.
<InstallationName>.lck          Lock file which prevents multiple instances of HTTP Fetch running simultaneously.
<InstallationName>.log          Log file for HTTP Fetch messages.
<InstallationName>.pid          Process identification file for HTTP Fetch.
<InstallationName>.str          Status information for the HTTP Fetch service.
<InstallationName>cfg.log       Log file for configuration changes in HTTP Fetch.
license.log                     Log file of license details for HTTP Fetch.
service.log                     Log file of service records.
<MyFetch>.log                   Log file for the <MyFetch> spider.
<MyFetch>.status                File that records the site structure for the <MyFetch> spider.
Installing HTTP Fetch under UNIX
To install HTTP Fetch under UNIX, insert the HTTP Fetch CD-ROM into your CD-ROM drive. Read and follow all installation instructions on the screen carefully.
1. Copy the HTTP Fetch installer from the CD to your local disk.
2. Uncompress and extract the installer using the command:
   tar -zxvf <Installer>.tar.Z
   This creates a subdirectory called HTTP Fetch-2.3.x.
3. Move to the new HTTP Fetch directory.
4. Run the installer script ./Setup.sh with the following arguments:
   <Base_Name>
       Enter a unique name for the HTTP Fetch installation. Note that the unique name must not contain any spaces.
   <Main dir (fully qualified)>
       Enter the full path of the main installation directory.
   <DiSH Service Port>
       Enter the port number that HTTP Fetch will use for DiSH communication.
   <User Agent Contact Details>
       Enter some contact information, such as an e-mail address, so that the site that HTTP Fetch spiders can contact you.
   <IP Address for IDOL server>
       Enter the IP address (or name) of the machine on which IDOL server is running.
   <Index Port Number for IDOL server>
       Enter the port that HTTP Fetch uses to index information into IDOL server.
   <Database name in IDOL server>
       Enter the name of the IDOL server database into which you want HTTP Fetch to index the documents that it fetches.
5. The settings that you have entered are displayed. Confirm the settings by pressing Return. Alternatively, cancel the installation by pressing Ctrl+C.
6. When HTTP Fetch has been installed successfully, you are returned to the command line prompt.
You can now edit the HTTP Fetch configuration file, located in the main installation directory, and start the HTTP Fetch service.
Directory structure: UNIX
Once the installation of HTTP Fetch is completed, your installation directory contains the following files:
<BaseName>.cfg          Configuration file that contains the HTTP Fetch settings.
<BaseName>.exe          HTTP Fetch executable.
<BaseName>spider.exe    Executable for the HTTP Fetch spider.
Setup.sh                Setup script for HTTP Fetch.
Various .txt files      Text conversion tables used during the importing process.
Various .so files       Files used by HTTP Fetch.
binslave.cfg            Configuration file for Binslave.
binslave.exe            Executable that parses binary files looking for ASCII content, which can be imported into IDOL server.
importslave.exe         Executable that generates IDX files for IDOL server.
omnislave.cfg           Omnislave configuration file that contains the Omnislave settings.
omnislave.exe           Omnislave executable that parses files not in HTML or PDF format to IDX files.
pdfslave.cfg            Configuration file for pdfslave.exe.
pdfslave.exe            Executable that parses PDF files to IDX files.
phraselist.dat          File that Binslave uses.
phraselist_doc.dat      File that Binslave uses.
phraselist_ppt.dat      File that Binslave uses.
phraselist_qxd.dat      File that Binslave uses.
pptconv.dat             File that HTTP Fetch uses.
startfetch.sh           Script for starting HTTP Fetch.
stopfetch.sh            Script for stopping HTTP Fetch.
uninstall.sh            Executable to uninstall HTTP Fetch from your computer.

4. Configuring HTTP Fetch
The settings that determine how HTTP Fetch operates are contained in the <InstallationName>.cfg configuration file, which is located in your installation directory. You can modify these settings in order to customize HTTP Fetch according to your requirements.
You can modify the configuration file directly, using a text editor, or via the HTTP Fetch Administration Utility for Windows. For details on using the HTTP Fetch Administration Utility, please refer to the Help files in the HTTP Fetch Administration Utility, which you can launch from the Utility by clicking on Help.
Displaying help on configuration settings
All available configuration parameters are detailed in the HTTP Fetch HTML help.
To generate HTML help:
1. Issue the following command from the command line:
   <HTTP Fetch_InstallationDirectory _and_InstallationName>.exe -help
   This command creates the following HTML files in your HTTP Fetch installation directory:
   <InstallationName>cfghelp.html
   <InstallationName>cfgindex.html
   <InstallationName>main.html
2. Open <InstallationName>main.html to view information on HTTP Fetch configuration parameters.
Note: the configuration file sections in which each configuration parameter can be used are listed under Allowed in Sections.
Modifying configuration parameter values
Entering Boolean values
For parameters that require Boolean settings, the following settings are interchangeable:
TRUE = true = ON = on = Y = y = 1
FALSE = false = OFF = off = N = n = 0
Entering string values
If the value that you want to enter for a parameter that requires a string contains quotation marks, you must put the value into quotation marks and escape each quotation mark that the string contains by putting a backslash in front of it.
For example:
FIELDSTART0="<font face=\"arial\"size=\"+1\"><b>"
Here the beginning and end of the string are indicated by quotation marks, while all quotation marks that are contained in the string are escaped.
If you want to enter a comma-separated list of strings for a parameter, and one of the strings contains a comma, you must indicate the start and the end of this string with quotation marks.
For example:
ParameterName=cat,dog,bird,"wing,beak",turtle
If any string within a comma-separated list contains quotation marks, you must put this string into quotation marks and escape the quotation marks in the string by putting a backslash in front of them.
For example:
ParameterName="<font face=\"arial\"size=\"+1\"><b>",dog,bird,"wing,beak",turtle
Applying modifications to HTTP Fetch's operation
New configuration settings only take effect once the HTTP Fetch service is stopped and restarted.
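The Boolean and string rules above can be illustrated with a short Python sketch. It is not part of HTTP Fetch and does not claim to reproduce how HTTP Fetch itself reads the configuration file; the function names parse_bool and split_csv are hypothetical. The sketch assumes that Boolean spellings are matched without regard to case and that the escape character in front of a quotation mark is a backslash.

```python
TRUE_VALUES = {"true", "on", "y", "1"}
FALSE_VALUES = {"false", "off", "n", "0"}


def parse_bool(value):
    """Map the interchangeable Boolean spellings to a Python bool."""
    v = value.strip().lower()  # assumption: case is ignored
    if v in TRUE_VALUES:
        return True
    if v in FALSE_VALUES:
        return False
    raise ValueError(f"not a Boolean setting: {value!r}")


def split_csv(value):
    """Split a comma-separated value, honouring quoted items and escaped quotes."""
    items, current, in_quotes = [], [], False
    i = 0
    while i < len(value):
        ch = value[i]
        # A backslash followed by a quotation mark is a literal quotation mark.
        if ch == "\\" and i + 1 < len(value) and value[i + 1] == '"':
            current.append('"')
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes      # start or end of a quoted item
        elif ch == "," and not in_quotes:
            items.append("".join(current))  # a comma outside quotes ends the item
            current = []
        else:
            current.append(ch)
        i += 1
    items.append("".join(current))
    return items


print(split_csv('cat,dog,bird,"wing,beak",turtle'))
```

Under these assumptions the last line prints the five items cat, dog, bird, "wing,beak" (as one item, without the surrounding quotation marks) and turtle.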
Configuration file sections
The HTTP Fetch configuration file comprises a number of sections, which represent different areas that you can configure by setting appropriate parameters.
The HTTP Fetch configuration file contains the following sections:
    [License]
    [Service]
    [Default]
    [Spider]
Note: You should take care not to overload your Internet connection by using up all available bandwidth, or by requesting too much information from a site in too short a time. This can prevent other users from accessing the site and could cause your IP address to be banned from the site.
For import parameters that you can specify for spiders in the configuration file's [Default] and [Spider] sections, see the Autonomy Import Module documentation.
[License] section
The [License] section contains licensing details, which you should not change.
For example:
[License]
Holder=My Company
Key=01234567890
Operations=132|RqrHHAu5zC53pbxPrtEczRM8ld6Se+4V4D38Qi5EjS3Y6A=
[Service] section
The [Service] section contains settings that determine which machines are permitted to use and control the HTTP Fetch service.
For example:
[Service]
ServicePort=13610
ServiceControlClients=127.0.0.1
ServiceStatusClients=127.0.0.1
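The wildcard form that the client settings accept (for example 187.*.*.*) can be pictured with the following Python sketch. It is an illustration only: the guide does not describe how HTTP Fetch evaluates these entries, so the sketch assumes that each * stands for exactly one octet of a dotted IPv4 address. The function names ip_matches and is_permitted are hypothetical.

```python
def ip_matches(client_ip, pattern):
    """Compare a dotted IPv4 address with a pattern such as 187.*.*.*"""
    ip_parts = client_ip.strip().split(".")
    pattern_parts = pattern.strip().split(".")
    if len(ip_parts) != 4 or len(pattern_parts) != 4:
        return False
    # Assumption: a * matches any single octet, anything else must be equal.
    return all(p == "*" or p == o for p, o in zip(pattern_parts, ip_parts))


def is_permitted(client_ip, patterns):
    """Return True if the address matches at least one permitted pattern."""
    return any(ip_matches(client_ip, p) for p in patterns)


print(is_permitted("187.20.3.4", ["127.0.0.1", "187.*.*.*"]))
```

With the [Service] example above, only 127.0.0.1 would be permitted; a pattern such as 187.*.*.* would also admit every address that begins with 187.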
[Default] section
The [Default] section contains the default settings that apply to all the spiders that you define in [<MySpider>] sections. If you want to specify different settings for an individual spider, you can set them in its [<MySpider>] section, in which case the default settings are overridden for that spider.
For example:
[Default]
DREHost=127.10.9.2
IndexPort=2001
Database=News
LogFile=spider.log
nSockets=16
SpiderRepeatSecs=86400
SpiderCycles=-1
Depth=99
SpiderStartTime=03:00
SiteDuration=43200
MaxPages=2000
MaxPagesCheck=1
ImportStripLinks=true
FollowRedirect=true
StayOnSite=true
MinPageSize=4090
MaxPageSize=163840
MaxLinksPerPage=100
PageTimeout=100
CantHaveCheck=129
CantHaveCSVS=*archive*
AfterDate=-365
BeforeDate=7
SpiderAs=HTTP Fetch Spider, contact spider@MyCompany.com
DateFormats=DDMMYYYY,YYMMDD,YYMMD,DDMONTHYYYY,MMDDYY,MMDD
BatchProcess=IMPORT
BatchSize=200
ImportMetaToFields=true
FollowRobotProtocol=true
PageDelay=30
[Spider] section
The [Spider] section lists the spiders that you want to run. Each listed spider has its own [<MySpider>] section, in which you configure the settings that determine how that spider runs. Any settings that you specify for a spider override the [Default] settings (but only for this spider).
For example:
[Spider]
Number=2
0=AUTONOMY
1=MyCompany
[AUTONOMY]
URL=http://www.autonomy.com/
Directory=AUTONOMY
LogFile=AUTONOMY.log
[MyCompany]
URL=http://www.mycompany.com/
Directory=MyCompany
LogFile=MyCompany.log
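The relationship between the [Default] section, the [Spider] list and the individual spider sections can be sketched in Python as follows. This is a reading aid rather than the HTTP Fetch implementation: it uses the standard configparser module, which is an assumption about the file syntax (it may not accept every construct that HTTP Fetch accepts), and the function name effective_settings is hypothetical. The sample text is a shortened version of the examples above.

```python
import configparser

SAMPLE = """
[Default]
DREHost=127.10.9.2
IndexPort=2001
LogFile=spider.log
StayOnSite=true

[Spider]
Number=2
0=AUTONOMY
1=MyCompany

[AUTONOMY]
URL=http://www.autonomy.com/
Directory=AUTONOMY
LogFile=AUTONOMY.log

[MyCompany]
URL=http://www.mycompany.com/
Directory=MyCompany
LogFile=MyCompany.log
"""


def effective_settings(text):
    """Return {spider name: merged settings} for every spider listed in [Spider]."""
    parser = configparser.ConfigParser(
        interpolation=None,            # keep values such as URLs verbatim
        delimiters=("=",),             # only = separates a name from its value
        comment_prefixes=("//",),      # lines starting with // are ignored
    )
    parser.optionxform = str           # preserve the case of parameter names
    parser.read_string(text)

    defaults = dict(parser["Default"]) if parser.has_section("Default") else {}
    result = {}
    for index in range(parser.getint("Spider", "Number")):
        name = parser.get("Spider", str(index))
        merged = dict(defaults)
        merged.update(dict(parser[name]))  # per-spider values override defaults
        result[name] = merged
    return result


for spider_name, settings in effective_settings(SAMPLE).items():
    print(spider_name, settings["LogFile"], settings["DREHost"])
```

Each spider keeps its own LogFile value while inheriting DREHost from [Default], which mirrors the override rule described above.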

5. Getting started with HTTP Fetch
Once you have installed HTTP Fetch, you can set up spiders to retrieve documents from web sites. The following sections take you through the processes of:
    creating and activating a new spider
    starting and stopping the HTTP Fetch service
    viewing the log and index files that the spider generates
    deactivating and deleting the spider when you no longer need it.
The HTTP Fetch spidering process
HTTP Fetch spiders search web pages for links and content. First, the spider processes the page content and either accepts or rejects the page for retrieval. If the page is accepted, the spider looks for links on the page. If the page is rejected, the spider looks for links only if you have configured it to follow links on rejected pages. In both cases, the links are filtered before they are added to the spidering queue. The spider then retrieves the page content of accepted pages, requests the next link in its queue, and the process repeats.
The spidering process ends when one of the following occurs:
    the spider has no more links to follow
    the maximum spider run-time has been reached
    the spider has downloaded the maximum permitted number of pages
    the total permitted download size has been reached
If the spider ends because there are no more links to follow, the spidering process is complete. In all other cases the retrieval process reaches an upper limit and the spider does not follow all the links that it has queued up.
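The queue-driven process described above can be sketched as a small Python simulation. It is a simplified model, not the HTTP Fetch implementation: the site is an in-memory dictionary instead of real web pages, only the page-count limit is modelled (the run-time and download-size limits are left out), and the names spider, accept_page, accept_link and follow_rejected are hypothetical.

```python
from collections import deque


def spider(site, start, accept_page, accept_link,
           max_pages=2000, follow_rejected=False):
    """Simulate the spidering loop on `site`, a {url: (text, [links])} map."""
    queue, seen, retrieved = deque([start]), {start}, []
    while queue:
        if len(retrieved) >= max_pages:
            # Upper limit reached: queued links are left unfollowed.
            return retrieved, "page limit reached"
        url = queue.popleft()
        text, links = site.get(url, ("", []))
        accepted = accept_page(url, text)      # accept or reject the page
        if accepted:
            retrieved.append(url)
        if accepted or follow_rejected:
            for link in links:
                # Links are filtered before they join the queue.
                if link not in seen and accept_link(link):
                    seen.add(link)
                    queue.append(link)
    return retrieved, "no more links"


SITE = {
    "a": ("home", ["b", "c"]),
    "b": ("news", ["a", "d"]),
    "c": ("archive", ["e"]),
    "d": ("more news", []),
    "e": ("deep page", []),
}

pages, reason = spider(SITE, "a",
                       accept_page=lambda url, text: "archive" not in text,
                       accept_link=lambda link: True)
print(pages, reason)
```

In this run page c is rejected, so its link to e is never queued unless follow_rejected is set to True; lowering max_pages shows the case in which the spider stops with links still in its queue.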
Starting the HTTP Fetch Administration Utility
If you have installed the HTTP Fetch service on a Windows platform, you can use the HTTP Fetch Administration Utility to configure the HTTP Fetch service and spiders via a graphical interface.
You can launch the HTTP Fetch Administration Utility:
Using Windows Explorer
1. Use Windows Explorer to navigate to the installation directory.
2. Double-click on the HTTPFetchAdmin.exe icon.
From the Start menu (provided that you added Start menu shortcuts during installation)
1. Click on Start.
2. Navigate to the HTTP Fetch menu.
3. Click on Configure Autonomy HTTP Fetch.
You can find details on how to use the Administration Utility in its help files.
Disabling indexing into IDOL server
You can set up and test a spider’s configuration without indexing data into IDOL server. You can disable indexing into IDOL server in order to view the spider’s results document and check your configuration.
To disable indexing into IDOL server:
1. Open the configuration file in a text editor. The configuration file is split into sections; each section header is in square brackets, for example [License].
2. Scroll through the configuration file to the [Default] section.
3. Locate the DREHost setting. For example, if you configured your HTTP Fetch installation to index into an IDOL server with the IP address 192.168.23.1, the DREHost setting is the following line in the configuration file:
   DREHost=192.168.23.1
4. Insert double forward slashes at the start of the line to comment out the DREHost setting. The HTTP Fetch service ignores commented-out lines when it reads the configuration file. For example:
   //DREHost=192.168.23.1
5. Save and close the configuration file.
Creating a new spider
For each web site that you want to spider, you need to create a spider.
To create a spider:
1. Open the HTTP Fetch configuration file in a text editor.
2. At the end of the file, specify the name that you want your spider to have on a new line and in square brackets. For the Autonomy HTTP Fetch test web site, call your spider AUTONOMYTEST by entering the following line:
   [AUTONOMYTEST]
   The name that you choose must be unique and contain no spaces.
3. Below the [<MyNewSpider>] line, specify the URL from which your spider starts, the directory in which the spider stores temporary files and the name of the log file, using the settings URL, Directory and LogFile.
   To spider the Autonomy HTTP Fetch test web site, you need to enter the following settings:
   URL=http://test1.autonomy.com/httpfetch/
   Directory=AUTONOMYTEST
   LogFile=autonomytest.log
4. Save and close the configuration file.
Activating a spider
If you want your new spider to run when you next start the HTTP Fetch service, you need to activate it first. Your spider only runs if it has been activated. You can configure several spiders within one configuration file and specify which of the spiders you want to run by activating or deactivating them.
To activate a spider:
1. Open the configuration file in a text editor.
2. In the [Spider] section, increase the Number setting by 1 and add a new line with the number and name of your spider. Spiders are numbered from 0.
   To activate the AUTONOMYTEST spider, enter:
   Number=1
   0=AUTONOMYTEST
   The name that you enter in this list must match the spider name exactly.
3. Save and close the configuration file.
4. Restart the HTTP Fetch service to execute your changes.
Starting the HTTP Fetch service
You need to manually start the HTTP Fetch service to run the spiders. If the HTTP Fetch service is already running, you need to stop it before restarting. Refer to Stopping the HTTP Fetch service on page 32.
When the HTTP Fetch service is started for the first time, the license.log and service.log files are created.
Start the HTTP Fetch service using one of the following methods:
Windows HTTP Fetch Administration Utility
1. Launch the HTTP Fetch Administration Utility.
2. Click on the Play icon in the Fetch service box.
Windows Explorer
Double-click on the <InstallationName>.exe file in your installation directory.
Windows services
1. Display the Windows Services dialog.
2. Select the <HTTP Fetch installation name> service, and click on the Start button to start HTTP Fetch.
3. Click on the Close button to close the Services dialog.
UNIX startfetch.sh script
Change directory to the installation directory and enter the command ./startfetch.sh
Viewing the spider’s output
The AUTONOMYTEST spider creates, or updates, the following directory and files in the HTTP Fetch installation directory:
AUTONOMYTEST (directory)
AUTONOMYTEST.log
AUTONOMYTEST.status
AUTONOMYTEST.idx
Directory
The spider stores the documents that it aggregates from the web site in the <MySpider> directory. When the spider has aggregated the data from the web site, HTTP Fetch imports the documents into the <MySpider>.idx file and deletes the documents stored in the <MySpider> directory.
The spider logs the site structure in the <MySpider> directory and the structure files are not deleted until the next spider cycle.
Log file
The spider’s activities are logged in the <MySpider>.log file, located in the installation directory. To see what the spider has done so far, you can view the log file by opening it in a text editor. When the spider completes its spidering and fetching, it generates a Statistics section at the end of the log file.
Refer to HTTP Fetch Log Files on page 73 for a full description of the <MySpider>.log file.
Status file
The <MySpider>.status file logs information about the site structure and document contents. This file is used by HTTP Fetch to determine whether a document has changed since it was last aggregated by HTTP Fetch. You must not delete this file if you want HTTP Fetch to check documents for changes before downloading them.
IDX file
The Import Module imports the documents that an HTTP Fetch spider aggregates into the IDOL server IDX file format. You can see what is to be indexed into IDOL server by viewing the contents of this file. By default, the IDX file is named <MySpider>.idx and located in the installation directory. To view the IDX file, open it in a text editor.
For example, the IDX file generated by the AUTONOMYTEST spider, from the Autonomy HTTP Fetch test web site, should look something like this:
#DREREFERENCE http://test1.autonomy.com/httpfetch/
#DRETITLE
Autonomy TestingWelcome to the Autonomy HTTP Fetch test web site.
#DRESECTION 0
#DREDATE 1095065503
#DREDBNAME default
#DREFILENAME C:\Autonomy\HTTPFetch\AUTONOMYTEST\AUTONOMYTEST-1.html
#DRESTORECONTENT yes
#DREFIELD DRETITLE="Autonomy TestingWelcome to the Autonomy HTTP Fetch test web site."
#DREFIELD DREREFERENCE="http://test1.autonomy.com/httpfetch/"
#DREFIELD SPIDERDATE="1095065472"
#DREFIELD SPIDERDOMAIN="test1.autonomy.com"
#DREFIELD SPIDERPAGEDEPTH="0"
#DREFIELD SPIDERPAGELINKS="3"
#DREFIELD SPIDERPAGEDURATION="0"
#DREFIELD SPIDERPAGESIZE="1224"
#DREFIELD CONTENTTYPE="text/html"
#DREFIELD IMPORTBODYLEN="610"
#DREFIELD IMPORTMETALEN="90"
#DREFIELD IMPORTLINKLEN="33"
#DREFIELD IMPORTTITLELEN="0"
#DREFIELD IMPORTQUALITY="105"
#DREFIELD DREDATE="1095065503"
#DREFIELD DREFILENAME="C:\Autonomy\HTTPFetch\AUTONOMYTEST\AUTONOMYTEST-1.html"
#DRECONTENT
Autonomy Testing
Welcome to the Autonomy HTTP Fetch test web site.
If you can see this in your AUTONOMYTEST.idx file, HTTP Fetch has
installed successfully and the AUTONOMYTEST spider is working
correctly.
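If you want to inspect an IDX file with a script rather than a text editor, a sketch along the following lines may help. It is based only on the layout of the example above and is not a full description of the IDX format: it assumes that each #DREFIELD entry sits on a single line in the form NAME="value", that the file holds a single document, and that everything after the #DRECONTENT line is the document content. The function name parse_idx_fields is hypothetical.

```python
import re

# Matches lines such as: #DREFIELD SPIDERPAGELINKS="3"
FIELD_RE = re.compile(r'^#DREFIELD\s+(\w+)="(.*)"\s*$')


def parse_idx_fields(idx_text):
    """Collect #DREFIELD name/value pairs and the text after #DRECONTENT."""
    fields, content, in_content = {}, [], False
    for line in idx_text.splitlines():
        if in_content:
            content.append(line)             # content runs to the end of the text
        elif line.startswith("#DRECONTENT"):
            in_content = True
        else:
            match = FIELD_RE.match(line)
            if match:
                fields[match.group(1)] = match.group(2)
    return fields, "\n".join(content)


EXCERPT = (
    "#DREREFERENCE http://test1.autonomy.com/httpfetch/\n"
    '#DREFIELD SPIDERDOMAIN="test1.autonomy.com"\n'
    '#DREFIELD SPIDERPAGELINKS="3"\n'
    "#DRECONTENT\n"
    "Autonomy Testing\n"
    "Welcome to the Autonomy HTTP Fetch test web site."
)

fields, content = parse_idx_fields(EXCERPT)
print(fields["SPIDERDOMAIN"], fields["SPIDERPAGELINKS"])
```

A field whose value is wrapped over two lines would be skipped by this sketch, so join such lines before parsing.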
Stopping the HTTP Fetch service
You can stop HTTP Fetch from running using:
the stop script (for UNIX)
services (for NT)
1. Display the Windows Services dialog.
2. Select the <HTTP Fetch installation name> service, and click on the Stop button to stop HTTP Fetch.
3. Click on the Close button to close the Services dialog.
the service port
Send the following command to HTTP Fetch’s service port (you need to have specified a service port in the HTTP Fetch configuration file):
http://<host>:<Service_Port>/action=stop
<host>
    The IP address (or name) of the machine on which HTTP Fetch is running.
<Service_Port>
    HTTP Fetch’s service port (which is specified in the [Service] section of the HTTP Fetch configuration file).
HTTP Fetch Administration Utility
1. Launch the HTTP Fetch Administration Utility.
2. Click on the Stop icon in the Fetch service box.
Deactivating a spider
If you want to stop a spider from running but keep its configuration settings for future use, you can deactivate the spider. You need to deactivate a spider before you delete it.
To deactivate a spider:
1. Open the configuration file in a text editor.
2. In the [Spider] section, decrease the Number setting by 1 and delete the line <N>=<MySpider>.
   For example, to deactivate the AUTONOMYTEST spider, change the Number setting from 1 to 0 and delete the line
   0=AUTONOMYTEST
3. Save and close the configuration file.
4. Restart the HTTP Fetch service to execute your changes.
When you have more than one spider in your configuration file, you need to ensure that the spider numbering remains continuous when you remove spiders from the list.
For example, you have the following [Spider] section in your configuration file:
[Spider]
Number=3
0=News
1=UKNews
2=AsiaNews
If you want to deactivate the UKNews spider, you can change the Number setting and edit the numbering of the spiders list so that the section is as follows:
[Spider]
Number=2
0=News
1=AsiaNews
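The renumbering rule can be expressed as a short Python sketch. It only rewrites the lines of a [Spider] section held in memory and is not a tool that ships with HTTP Fetch; the function name deactivate is hypothetical, and the sketch assumes that the numbered lines are already listed in order.

```python
def deactivate(spider_lines, name):
    """Drop `name` from the body of a [Spider] section and renumber the rest."""
    names = []
    for line in spider_lines:
        key, _, value = line.partition("=")
        if key.strip().isdigit():          # numbered entries such as 1=UKNews
            names.append(value.strip())
    if name not in names:
        raise ValueError(f"spider is not active: {name}")
    names.remove(name)
    # Rebuild the section body with a matching count and continuous numbering.
    return [f"Number={len(names)}"] + [f"{i}={n}" for i, n in enumerate(names)]


print(deactivate(["Number=3", "0=News", "1=UKNews", "2=AsiaNews"], "UKNews"))
```

For the example above this yields Number=2, 0=News and 1=AsiaNews, the same section body as the manually edited version.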
Deleting a spider
When you have no further need for documents from a certain web site, or when a web site no longer exists, you can delete the spider associated with that web site. If you want to temporarily remove a spider from the HTTP Fetch service, without deleting all of its settings, refer to Deactivating a spider on page 33.
1. Open the configuration file in a text editor.
2. If your spider is currently activated, you need to deactivate it. Refer to Deactivating a spider on page 33.
3. Delete the [<MySpider>] section for the spider that you want to delete. This section starts at the line [<MySpider>] and ends at the start of the next spider or at the end of the configuration file.
4. Save and close the configuration file.
5. Restart the HTTP Fetch service to execute your changes.
Enabling indexing into IDOL server
When you want to start indexing data into IDOL server, you need to configure the IDOL server details for HTTP Fetch.
1. Open the configuration file in a text editor.
2. Scroll through the configuration file to the [Default] section.
3. If you commented out the DREHost setting at the start of this step-by-step guide, you can remove the double forward slashes from the line that the DREHost setting is in.
   If you have no DREHost setting in the [Default] section, you need to insert the setting on a new line. For example, if you want your HTTP Fetch installation to index into an IDOL server with the IP address 192.168.23.1, insert the following line into the configuration file’s [Default] section:
   DREHost=192.168.23.1
4. Save and close the configuration file.

6. Using HTTP Fetch
When you use HTTP Fetch to retrieve documents from web sites, you need to:
    choose the web sites that you want to spider
    configure your spiders to retrieve the web site content
Selecting web sites to spider
When you choose web sites to spider, you need to find web sites that have the following features, to ensure that the documents that HTTP Fetch retrieves are relevant to your requirements.
Relevant and complete documents
    The pages of the web site contain full length, relevant documents. IDOL server can conceptualize and categorize longer documents more accurately than shorter documents.
HTML formatting
    HTTP Fetch processes web pages that use only HTML and CSS formatting more efficiently than those that include JavaScript elements. When you select web sites to spider, choose sites that use straightforward formatting. HTTP Fetch does not read text that is contained within graphics.
    Note that HTTP Fetch can aggregate other document types, including Microsoft Word, Microsoft Excel and PDF files. These files are not processed in the same way as web pages.
Logical and consistent structure
    HTTP Fetch retrieves documents most efficiently from web sites that have a logical and consistent structure. The spidering process is most efficient when there are no duplications of the documents on the web site.
Scripting
    If the web site that you have chosen to spider contains scripting, such as JavaScript, HTTP Fetch may retrieve the scripts as well as the text content of the web site. It is better to choose web pages and web sites that do not use JavaScript.
Links
    If you want your spider to follow links on web pages, you need to ensure that the links are standard HTML, JavaScript or JSP links. HTTP Fetch can also handle JavaScript or JSP menus and links that are embedded in Macromedia Flash components.
    HTTP Fetch works more efficiently when the web site that you are spidering has a single URL for each page. If a page is duplicated on the web site, or linked to using different URLs, the page is downloaded and processed more than once.
Metadata
    Web pages can contain useful information in the metadata tags, such as the date or author of the document, which you can view in the web page source code. You can configure the Import Module to include the document metadata when the document is indexed into IDOL server. Please refer to the Import Module documentation for full details.
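When you assess a candidate web site, it can be useful to list the metadata tags that a page exposes before you configure the Import Module. The following Python sketch does this with the standard html.parser module. It is an inspection aid only and does not reflect how the Import Module itself reads metadata; the class name MetaCollector and the function name extract_meta are hypothetical, and the sample page is invented for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content   # ignore tags such as charset


def extract_meta(html):
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


PAGE = (
    "<html><head><title>Example</title>"
    '<meta name="Author" content="J. Smith">'
    '<meta name="date" content="2005-01-31"/>'
    '<meta charset="utf-8">'
    "</head><body>Text</body></html>"
)

print(extract_meta(PAGE))
```

A page that returns an empty result here carries no name/content metadata for the Import Module to pick up.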
Configuring HTTP Fetch spiders
When you configure your spiders, you need to:
    specify a minimum set of configuration parameters
    configure authentication settings for web sites that require you to log in
Specific guidelines for configuring spiders for particular purposes are given in Retrieving up-to-date web pages on page 41 and Retrieving relevant web pages on page 51.
To improve the efficiency of your HTTP Fetch configuration, follow the guidelines in Optimizing HTTP Fetch on page 59.
You can use the Import Module to manipulate documents’ data and metadata during import. Please refer to the Import Module documentation for information on how to use and configure these options.
Minimum spider configuration
When you create a spider in the HTTP Fetch configuration file, you need to specify at least the following settings:
URL
    You must specify a valid URL for your spider to start at. The starting web page must contain links to other web pages if you want more than one page to be retrieved. You must include the initial http:// in your configuration setting.
    To configure your spider’s starting point:
    Set the URL parameter to the address that you want your spider to start at.
Directory
    HTTP Fetch uses the directory that you specify to store files from the web site that it has downloaded, before importing them into the IDOL server indexing format, IDX.
    To configure the directory in which the spider stores downloaded files:
    Set the Directory parameter to the directory that you want your spider to store downloaded files in.
LogFile
    The HTTP Fetch spider logs its activities into the file that you specify for LogFile.
    To specify the log file:
    Set the LogFile parameter to the name of the file into which you want your spider to log its activities. Include the directory path of the file if you want the file to be in a directory other than the installation directory.
StayOnSite
    You can configure your HTTP Fetch spider to stay on the web site that it starts on, or allow it to leave the starting web site and go to external web sites, in domains that are different to the starting web site domain. By default, the spider stays in the starting web site domain.
    To configure the spider to go to external web sites:
    Set the StayOnSite parameter to false.
FollowRobotProtocol
    The robots protocol is used by web site administrators to allow or disallow access to robots, via a file named robots.txt in the root directory of the web site.
    To view the robots protocol for a web site:
    In your web browser address bar, enter the web site domain followed by /robots.txt. The robots protocol, if it exists for that web site, is displayed in your web browser.
    For example, to view the robots protocol for the Autonomy web site, enter http://www.autonomy.com/robots.txt in your web browser address bar.
    By default, HTTP Fetch spiders follow the robots protocol. If the robots protocol disallows your spider access to the web site, or to certain areas of the web site, you can set your spider to ignore the robots protocol.
    To configure your spider not to follow the robots protocol:
    Set the FollowRobotProtocol parameter to false.
SiteDuration
    When spidering ends, HTTP Fetch stops the processes that the spider has been using and writes the statistics information into the log file. It is important to set a maximum length of time for spidering so that the spider can finish properly before the next spider cycle starts. This length of time must be less than the interval between spider cycles.
    To set a maximum spidering time for your HTTP Fetch spider:
    Set the SiteDuration parameter to the maximum time you allow, in seconds.
Authentication
If you want your spider to retrieve documents from web sites that require authentication with a user name and password, you need to configure the authentication settings in the [<MySpider>] section of the HTTP Fetch configuration file.
You can find full details of the authentication settings in the HTTP Fetch online help. See Displaying help on configuration settings on page 19 for information on how to display the help file.
To enable your spider to access a web site using authentication:
1. Open the configuration file in a text editor.
2. In the relevant [<MySpider>] section, enter the URL for the web site that you want to spider using
the LoginURL parameter.
For example:
LoginURL=http://<URL>
3. Use the LoginMethod parameter to specify the authentication method that you want the spider to use.
There are two varieties of login methods that web sites use:
Pop-up screen
The web site displays a pop-up screen for you to enter your user name and password. You
need to set the LoginMethod parameter to AUTHENTICATE.
Embedded form
The web site has a login form embedded into the page. You need to set the LoginMethod
parameter to either FORMGET or FORMPOST. Try one method and change your setting
to the alternative method if the first one fails.
For example:
LoginMethod=AUTHENTICATE
4. If the web site that you want to spider uses user name and password fields that are not called
username and password, you need to include the field names in your spider’s configuration,
using the LoginUserField and LoginPassField settings.
For example, to login to a web site that has an email field instead of a username field, and a
passcode field in the place of a password field, you can configure your spider as follows:
LoginUserField=email
LoginPassField=passcode
5. You need to encrypt the user name and password before entering them into the LoginUserValue
and LoginPassValue settings. Double-click on autpassw.exe in the installation directory to
display the Autonomy Password Encryption Utility.
6. Enter your user name and click on Encrypt. The encrypted version of your user name is shown.
Copy the encrypted user name into the LoginUserValue setting in the configuration file.
For example:
LoginUserValue=9t7YyPA
7. Enter your plain text password and click on Encrypt. The encrypted version of your password is
shown. Copy the encrypted password into the LoginPassValue setting in the configuration file.
For example:
LoginPassValue=9t7YyIre3cze3trC38nw
8. Click on Close to exit the Autonomy Password Encryption Utility.
9. Save and close the configuration file.
10. Restart the HTTP Fetch service to implement your changes.

Using NTLM Proxy to access sites that use NTLM authentication
If HTTP Fetch fails to access a site, even though you have configured the login information, the site
may require NTLM authentication. In this case, HTTP Fetch can only access this site if you have
installed and configured the NTLM Proxy module. Please refer to the appendix The NTLM Proxy
module for details on how to find out whether a web site requires NTLM authentication, how to install
and configure the NTLM Proxy module, and how to configure HTTP Fetch to point to the NTLM Proxy
module.
Using SSL for secure connections
SSL is widely used on the World Wide Web for authenticated and encrypted
communication between clients and servers. You can configure your spider to use SSL by specifying
an SSL type. If you do not specify an SSL type, the spider does not use SSL.
To configure your spider to use SSL:
1. Open the configuration file in a text editor.
2. In the relevant [Default] section, specify the security type that your spider uses, using the
SecurityType setting. For example, to configure your spider to use SSL version 3, enter the
following line:
SecurityType=SSL_V3
3. Save and close the configuration file.
4. Restart the HTTP Fetch service to implement your changes.

7. Retrieving up-to-date web pages
You can use HTTP Fetch to retrieve the latest content from web sites by:
- configuring the frequency of the spider runs
- enabling the date checking and date filtering options in HTTP Fetch
- setting up foreign language date recognition
- configuring your spider to handle undated documents.

Configuring frequent spider runs
You can configure your HTTP Fetch spiders to run at regular intervals, for a specific number of cycles
and at a specific time of day.
To configure HTTP Fetch to run regularly:
1. Open the configuration file in a text editor.
2. In the [Default] section, use the SpiderRepeatSecs setting to specify how often you want the
spiders to run, in seconds. For example, to run the spider weekly, insert the following line:
SpiderRepeatSecs=604800
3. In the [Default] section, use the SpiderCycles setting to specify the number of times that you
want your spider to run, or set SpiderCycles to -1 if you want the spider to run an indefinite
number of times. For example, to spider a web site 20 times, insert the following line:
SpiderCycles=20
4. In the [Default] section, use the SpiderStartTime setting to specify the time at which you want
your spiders to start running, using the 24-hour clock hh:mm format. For example, to start
spidering at 11:30pm, insert the following line:
SpiderStartTime=23:30
5. In the relevant [<MySpider>] section, use the SiteDuration setting to specify the maximum
length of time that the spider can run for. For example, to limit the spider to two hours, that is 7200
seconds, insert the following line:
SiteDuration=7200
6. Save and close the configuration file.
7. Restart the HTTP Fetch service to implement your changes.
Notes:
- You must set a value for the SiteDuration parameter that is less than the time interval between
spider cycles to ensure that the spider processes end and the spider statistics are written into the log
file.
- The settings configured in the [Default] section apply to all spiders in the configuration file. It is not
possible to configure spiders within the same configuration file to have different start times, repeat
intervals or numbers of cycles.
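
The second-based values above are easy to get wrong by hand. The following Python sketch works them out and checks the SiteDuration rule from the notes. The helper names are illustrative only and are not part of HTTP Fetch.

```python
# Illustrative helpers (not part of HTTP Fetch) for working out the
# second-based values used by SpiderRepeatSecs and SiteDuration.

SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def to_seconds(days=0, hours=0):
    """Convert a number of days and hours into seconds."""
    return days * SECONDS_PER_DAY + hours * SECONDS_PER_HOUR


def duration_fits(site_duration, spider_repeat_secs):
    """Return True if the spider must stop before the next cycle starts."""
    return site_duration < spider_repeat_secs


weekly = to_seconds(days=7)      # candidate value for SpiderRepeatSecs
two_hours = to_seconds(hours=2)  # candidate value for SiteDuration

print("SpiderRepeatSecs =", weekly)
print("SiteDuration =", two_hours)
print("SiteDuration shorter than repeat interval:", duration_fits(two_hours, weekly))
```

A weekly interval works out to 604800 seconds and two hours to 7200 seconds, matching the examples in the procedure.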

Configuring date checking
You can configure HTTP Fetch to look for date information in the page URL, the page header or the
page content. When it looks in the URL, the spider can check the date before or after
download. Use the DateCheck parameter to configure where and how the spider looks for
date information.
You can combine the DateCheck options by adding their values together. For example, to search the
URL and the page header, add the values 1 and 4. This gives 5, which is the value that you need to
set DateCheck to.
If you configure HTTP Fetch to reject undated or out-of-date documents, by default the HTTP Fetch
spider still searches the rejected documents for links to add to its spidering. You can configure the
spider to ignore links on rejected pages.
To ignore links on pages that have been rejected by the spider, set the FollowLinksOnRejectedPage
parameter to false.
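
Because each DateCheck option value is a distinct power of two, adding the values is the same as setting individual bits. The Python sketch below illustrates this using the option values documented in this section. The constant and function names are invented for readability and do not appear in HTTP Fetch.

```python
# DateCheck option values as documented in this guide. The names are
# assumptions made for this sketch only.
CHECK_URL = 1
CHECK_HEADER = 4
CHECK_CONTENT = 8
IGNORE_CASE = 64
BEFORE_DOWNLOAD = 128
REJECT_UNDATED = 256

OPTION_NAMES = {
    CHECK_URL: "search URL",
    CHECK_HEADER: "search page header",
    CHECK_CONTENT: "search page content",
    IGNORE_CASE: "ignore case",
    BEFORE_DOWNLOAD: "check before download",
    REJECT_UNDATED: "reject undated documents",
}


def combine(*options):
    """Add the selected option values to obtain one DateCheck value."""
    return sum(set(options))  # set() guards against adding an option twice


def describe(value):
    """List the documented options contained in a DateCheck value."""
    return [name for bit, name in OPTION_NAMES.items() if value & bit]


print(combine(CHECK_URL, CHECK_HEADER))
print(describe(129))
```

Decoding a value in this way is a quick check that a number copied into the configuration file selects the options you intended.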
Set DateCheck to:   In order to:
1                   Search the URL for date information
4                   Search the page header for date information
8                   Search the page content for date information
64                  Ignore the case of the date information
128                 Search the document for its date before download
256                 Reject the document if the date is not found

Specifying date formats
You use the following date format specifiers when configuring HTTP Fetch to find document dates.
Note: You must specify at least one date format when you have configured HTTP Fetch to check the
document for the date.
Notes:
- You can combine the date format specifiers to describe a full date. For example, 07/08/04 is
represented by DD/MM/YY.
- If you want to use a date format that includes spaces and punctuation, use quotation marks
around the specifiers. For example, 7 August 2004 is represented by "D+ LONGMONTH YYYY".
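
To show how combined specifiers describe a date, the Python sketch below translates a format string into a regular expression and tests it against sample text. This guide does not document how HTTP Fetch matches dates internally, so the mapping from specifiers to patterns is an assumption made for illustration.

```python
import re

# Assumed regular-expression equivalents of the documented specifiers.
SPECIFIERS = {
    "LONGMONTH": r"[A-Za-z]+",
    "SHORTMONTH": r"[A-Za-z]{3}",
    "YYYY": r"\d{4}",
    "YY": r"\d{2}",
    "MM": r"\d{2}",
    "M+": r"\d{1,2}",
    "DD": r"\d{2}",
    "D+": r"\d{1,2}",
    "HH": r"\d{2}",
    "H+": r"\d{1,2}",
    "NN": r"\d{2}",
    "N+": r"\d{1,2}",
    "SS": r"\d{2}",
    "S+": r"\d{1,2}",
    "ZZZ": r"[A-Za-z]{3}",
}

# Try longer specifiers first so that YYYY is not read as YY followed by YY.
ORDERED = sorted(SPECIFIERS, key=len, reverse=True)


def format_to_regex(date_format):
    """Translate a date format such as DD/MM/YY into a compiled pattern."""
    pieces = []
    i = 0
    while i < len(date_format):
        for spec in ORDERED:
            if date_format.startswith(spec, i):
                pieces.append(SPECIFIERS[spec])
                i += len(spec)
                break
        else:
            # Not a specifier: treat the character as literal punctuation.
            pieces.append(re.escape(date_format[i]))
            i += 1
    return re.compile("".join(pieces))


print(bool(format_to_regex("DD/MM/YY").fullmatch("07/08/04")))
print(bool(format_to_regex("D+ LONGMONTH YYYY").fullmatch("7 August 2004")))
```

The quotation marks used in the configuration file are not part of the format itself, so they are left out of the strings passed to the function.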

Specifier    Explanation
YY           2-digit year (for example, 99, 04)
YYYY         4-digit year (for example, 1999, 2003)
LONGMONTH    Full month name (for example, March, October)
SHORTMONTH   3-letter month (for example, Mar, Oct)
MM           2-digit month (for example, 03, 10)
M+           1 or 2-digit month (for example, 3, 10)
DD           2-digit day (for example, 01, 15)
D+           1 or 2-digit day (for example, 1, 15)
HH           2-digit hour (for example, 05, 16)
H+           1 or 2-digit hour (for example, 5, 16)
NN           2-digit minute (for example, 09, 43)
N+           1 or 2-digit minute (for example, 9, 43)
SS           2-digit second (for example, 08, 32)
S+           1 or 2-digit second (for example, 9, 32)
ZZZ          3-letter time zone code (for example, GMT, PDT)

Filtering by date
You can configure your HTTP Fetch spider to accept or reject dated documents by specifying an
acceptable date range, relative to the date on which you run the spider. You can specify the date range
using the BeforeDate and AfterDate settings.
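
The relative date range can be pictured with the short Python sketch below, in which both settings are day offsets from the date of the spider run. The function name is hypothetical, and treating both boundary dates as acceptable is an assumption, because this guide does not state how HTTP Fetch handles a document dated exactly on a boundary.

```python
from datetime import date, timedelta


def in_date_range(doc_date, run_date, after_date, before_date):
    """Check a document date against AfterDate/BeforeDate style offsets.

    after_date and before_date are numbers of days relative to run_date.
    Boundaries are treated as inclusive here (an assumption).
    """
    earliest = run_date + timedelta(days=after_date)
    latest = run_date + timedelta(days=before_date)
    return earliest <= doc_date <= latest


run = date(2004, 9, 10)
# Offsets of -6 and 1 give a window from 4 to 11 September 2004.
print(in_date_range(date(2004, 9, 8), run, -6, 1))
print(in_date_range(date(2004, 8, 20), run, -6, 1))
```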
Notes:
- The BeforeDate is the date before which the document must be dated.
- The AfterDate is the date after which the document must be dated.
To configure date filtering:
1. Open the configuration file in a text editor.
2. In the relevant [<MySpider>] section, specify the date range in which the document date must lie
using the AfterDate and BeforeDate settings.
For example, to retrieve documents that are dated after the current date minus six days, and
before the current date plus one day:
AfterDate=-6
BeforeDate=1
3. Save and close the configuration file.
4. Restart the HTTP Fetch service to implement your changes.

Foreign language dates
If you are spidering web sites that are not in English, you can configure your spider to interpret foreign
language dates. You need to enter the full and abbreviated month names into the configuration file.
To configure your spider to interpret foreign language dates:
1. Open the HTTP Fetch configuration file in a text editor.
2. In the relevant [<MySpider>] section, specify the names and abbreviations of the months using
the DateLongMonthCSVs and DateMonthCSVs settings.
For example, to configure your spider to read date information in Spanish, you can enter:
DateLongMonthCSVs=enero,febrero,marzo,abril,mayo,junio,julio,agosto,septiembre,octubre,noviembre,diciembre
DateMonthCSVs=enero,feb,mar,abr,mayo,jun,jul,agosto,sept,oct,nov,dic
3. Save and close the configuration file.
4. Restart the HTTP Fetch service to implement your changes.
If you do not specify DateLongMonthCSVs or DateMonthCSVs, the spider uses the English settings.

Handling undated documents
HTTP Fetch can reject or accept undated documents. If you want to reject undated documents, add
256 to the DateCheck setting. By default, HTTP Fetch spiders accept undated documents and assign
the current date to them. You can configure the spider to assign any particular date, or to assign a date
that is offset from the date on which the document was aggregated.
To set the date for undated documents:
1. Open the configuration file in a text editor.
2. In the relevant [<MySpider>] section, specify either a fixed date or an offset from the current date using
the DefaultDate setting.
For example:
To assign the fixed date 18th August 2004 to undated documents:
DefaultDate=2004/08/18
To assign a date three weeks before the current date:
DefaultDate=-3weeks
3. Save and close the configuration file.
4. Restart the HTTP Fetch service to implement your changes.
Notes:
- If you specify a date for DefaultDate, the format is yyyy/mm/dd, for consistency with IDOL server.
- You can specify the relative date in days, weeks or years. For example, to specify a date ten days
after the spidering date, enter +10days.
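
The two forms of DefaultDate can be illustrated with the Python sketch below, which turns a value into a calendar date. It handles only the forms shown in this section (yyyy/mm/dd, and a signed number followed by days, weeks or years). The exact syntax that HTTP Fetch accepts is not specified here, so the parsing rules and the function name are assumptions.

```python
import re
from datetime import date, timedelta


def resolve_default_date(value, today):
    """Return the date that a DefaultDate style value stands for."""
    fixed = re.fullmatch(r"(\d{4})/(\d{2})/(\d{2})", value)
    if fixed:
        year, month, day = (int(part) for part in fixed.groups())
        return date(year, month, day)

    relative = re.fullmatch(r"([+-]?\d+)(days|weeks|years)", value)
    if not relative:
        raise ValueError("unrecognised DefaultDate value: " + value)

    amount = int(relative.group(1))
    unit = relative.group(2)
    if unit == "days":
        return today + timedelta(days=amount)
    if unit == "weeks":
        return today + timedelta(weeks=amount)
    # Years: shift the calendar year, falling back to 28 February when
    # 29 February does not exist in the target year.
    try:
        return today.replace(year=today.year + amount)
    except ValueError:
        return today.replace(year=today.year + amount, day=28)


today = date(2004, 9, 22)
print(resolve_default_date("2004/08/18", today))
print(resolve_default_date("-3weeks", today))
print(resolve_default_date("+10days", today))
```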

Tutorial: Configuring a spider to retrieve the latest content
In this tutorial you are going to set up a spider to automatically aggregate the latest press releases
from the Autonomy web site every week during off-peak periods until further notice. In order to do this,
you need to research the web site, then configure a spider to fetch the Autonomy press releases.

Researching the web site
Open the Autonomy web site, http://www.autonomy.com, in your browser. Spend a few moments
looking through the web site to locate the press releases and to become familiar with the web site
structure before considering the following points:
1. Web site structure
The Autonomy web site contains a large number of web pages. In addition, there are many PDF
files and links to external web sites. The press releases are located within the section
http://www.autonomy.com/content/Press/Archives/ and filed according to year.
To access the press releases directly:
Set the URL parameter to http://www.autonomy.com/content/Press/Archives/
The press release web pages are within two links of the Autonomy Press Release Archives page.
You can limit your spider to access only those pages that are within two links of the starting page
by using the Depth parameter. The depth of a web page is the number of links that the spider
needs to follow from the starting page until it reaches the web page.
To limit the spider to documents that are within two links of the starting page:
Set the Depth parameter to 2.
2. Page date format
Web pages can be dated in the main body of the page, in the page headers, for example the title
bar, or in the URL. The spider looks for dates by matching the page components with date formats
that you specify. If the page date is in the URL, it can be checked by the HTTP Fetch spider before
download.
In the case of the Autonomy Press Releases web site, the date is given in the web page’s URL and
body content.
Refer to Configuring date checking on page 42 for details of the date checking options and to
Specifying date formats on page 43 for more information on the format codes.

Location    Example             Format specifier
Page body   September 8, 2004   "LONGMONTH D+, YYYY"
URL         2004/0908           YYYY/MMDD

To check the date in the URL before download:
Set the DateCheck parameter to 129 (1 + 128).
Set the DateFormat parameter to YYYY/MMDD. Note that the format includes the forward slash,
as used in the URL.
3. Spidering frequency
The HTTP Fetch service can run spiders at regular intervals, specified in seconds.
To spider the web site weekly:
Set the SpiderRepeatSecs parameter to 604800.
4. Fetching the new content only
By default, your spider fetches all press releases, including those that it has retrieved before.
You can use the AfterDate and BeforeDate parameters to specify a date range that the document
dates must be in. Both parameters are stated as a number of days relative to the spidering date.
Notes:
- The BeforeDate is the date before which the document must be dated.
- The AfterDate is the date after which the document must be dated.
To accept only documents that have a date that is after one week ago and before or on the
spidering date:
Set the AfterDate parameter to -7.
Set the BeforeDate parameter to 0.
5. Number of spider cycles
HTTP Fetch can be configured to run for a specified number of cycles, or it can be set to run
indefinitely. The Autonomy Press Releases spider must run until further notice.
To run the fetch indefinitely:
Set the SpiderCycles parameter to -1.
6. Spidering time
You can run your HTTP Fetch spiders at a specific time, for example overnight when the web site
has low traffic.
To start the fetch at a particular time:
Set the SpiderStartTime parameter to 23:00.
7. Spidering time period
When spidering ends, HTTP Fetch stops the processes that the spider has been using and writes
the statistics information into the log file. It is important to set a maximum length of time for
spidering so that the spider can finish properly before the next spider cycle starts. This time length
must be less than the time interval between spiders. For the Autonomy Press Releases spider,
limited to a small section of the web site and fetching one or two new documents in each cycle,
twelve hours is sufficient.
To set a maximum spidering time of twelve hours:
Set the SiteDuration parameter to 43200.
8. Robots protocol
The Autonomy web site robots protocol prohibits spiders from entering the /Content/ subdirectory
in which the Press Releases are stored. To ensure that your Autonomy Press Releases spider can
access the web pages, you need to configure your spider to ignore the robots protocol.
To ignore the robots protocol:
Set the FollowRobotProtocol parameter to false.

Configuring the Autonomy Press Releases spider
Using the information that you have found from the web site and from your requirements, you can set
up and activate the Autonomy Press Releases spider by inserting the following lines into the
appropriate sections of the configuration file. Make sure that you do not duplicate the section headings
or settings within the configuration file. The spider does not run until you save the configuration file and
restart the HTTP Fetch service.
This example assumes that the Autonomy Press Releases spider is the only spider in the configuration
file; edit the [Spider] section if there are other spiders in the file.
[Default]
SpiderRepeatSecs=604800
SpiderCycles=-1
SpiderStartTime=23:00
[Spider]
n=1
0=AutonomyPressReleases
[AutonomyPressReleases]
URL=http://www.autonomy.com/content/Press/Archives/
Directory=AutonomyPressReleases
Log file=AutonomyPressReleases.log
Depth=2
DateCheck=129
DateFormat=YYYY/MMDD
AfterDate=-7
BeforeDate=0
SiteDuration=43200
FollowRobotProtocol=false
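
Before restarting the service, it can be useful to check that the values are consistent with each other. The Python sketch below reads the tutorial settings with the standard configparser module and verifies that each listed spider has its own section and a SiteDuration shorter than SpiderRepeatSecs. It assumes that n in the [Spider] section is the number of spiders and that the numbered entries name their sections. The check_config function is illustrative and is not an HTTP Fetch tool.

```python
import configparser

# The tutorial settings, including FollowRobotProtocol from step 8.
CONFIG_TEXT = """
[Default]
SpiderRepeatSecs=604800
SpiderCycles=-1
SpiderStartTime=23:00
[Spider]
n=1
0=AutonomyPressReleases
[AutonomyPressReleases]
URL=http://www.autonomy.com/content/Press/Archives/
Directory=AutonomyPressReleases
Depth=2
DateCheck=129
DateFormat=YYYY/MMDD
AfterDate=-7
BeforeDate=0
SiteDuration=43200
FollowRobotProtocol=false
"""


def check_config(text):
    """Return a list of problems found in the configuration text."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep setting names case-sensitive
    parser.read_string(text)

    repeat = parser.getint("Default", "SpiderRepeatSecs")
    problems = []
    for index in range(parser.getint("Spider", "n")):
        name = parser.get("Spider", str(index))
        if not parser.has_section(name):
            problems.append("missing section [" + name + "]")
            continue
        duration = parser.getint(name, "SiteDuration")
        if duration >= repeat:
            problems.append(name + ": SiteDuration is not below SpiderRepeatSecs")
    return problems


print(check_config(CONFIG_TEXT) or "no problems found")
```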

8. Retrieving relevant web pages
You can use HTTP Fetch to retrieve only those documents that are relevant to your requirements by
configuring your spiders to:
- retrieve documents only if they contain particular characters, words or phrases
- reject documents that contain unwanted characters, words or phrases

Filtering by text strings
You can configure your spider to search the document contents for particular text strings (such as
words, phrases or characters) that a document either must or cannot contain in order for it to be
downloaded. The HTTP Fetch spider can search the page URL, header or content for the text strings
that you specify in the configuration file. Use the following settings for filtering by text strings:
MustHaveCheck and MustHaveCSVs
To configure document filtering on text strings that the documents must contain.
CantHaveCheck and CantHaveCSVs
To configure the document filtering on text strings that the documents can’t contain.
If you specify text strings that the document must contain, and you also specify text strings that the
document cannot contain, the spider retrieves a document only if it contains at least one
MustHaveCSVs text string and none of the CantHaveCSVs text strings.
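
The combined rule can be written as a short Python sketch: a document is kept only if it contains at least one required string and none of the excluded strings. The function name and arguments are hypothetical, and the sketch compares plain text case-sensitively without modelling the MustHaveCheck or CantHaveCheck options.

```python
def should_retrieve(text, must_have=(), cant_have=()):
    """Apply MustHaveCSVs / CantHaveCSVs style filtering to a document.

    An empty must_have list means that no string is required.
    """
    if must_have and not any(s in text for s in must_have):
        return False
    return not any(s in text for s in cant_have)


page = "Tony Blair spoke in the House of Commons today."
print(should_retrieve(page, must_have=["Tony Blair", "Michael Howard"]))
print(should_retrieve(page, must_have=["Tony Blair"], cant_have=["Commons"]))
```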

Retrieving documents that contain required text strings
If you set a spider to search for text strings that the document must have, the spider retrieves only
documents that contain the strings. You need to configure two settings:
MustHaveCSVs
Specify the text string(s) that you want HTTP Fetch to search for.
MustHaveCheck
Allows you to control where HTTP Fetch looks for the text strings, whether or not HTTP Fetch is
case-sensitive and, if checking the URL, whether to check the URL before or after
download.
You can use the following values for the MustHaveCheck setting:

MustHaveCheck value   Action
1                     Search the URL for text strings
4                     Search the page header for text strings
8                     Search the page content for text strings
64                    Ignore the case of the text strings
128                   Search the document for text strings before download

You can specify more than one action by adding the MustHaveCheck values of the separate
actions together. For example, to search both the URL and the page header, add the values 1 and 4
to obtain 5, and set the MustHaveCheck parameter to that value.
By default the HTTP Fetch spider searches documents that it has rejected for links to add to its
spidering. You can configure the spider to ignore links on rejected pages.
To ignore links on pages that have been rejected by the spider:
Set the FollowLinksOnRejectedPage parameter to false.
To retrieve documents containing the text strings that you have specified in particular sections of the
web page:
1. Open the HTTP Fetch configuration file in a text editor.
2. In the relevant [<MySpider>] section, specify the MustHaveCheck options that you want your
spider to use.
For example, if you want the spider to check the page URL (MustHaveCheck value 1) and the
page content (MustHaveCheck value 8) before the document is downloaded (MustHaveCheck
value 128), you need to use a total MustHaveCheck value of 137 (1 + 8 + 128 = 137).
MustHaveCheck=137
3. Specify the text strings that you want the spider to search for, using the MustHaveCSVs setting.
For example, if you want the spider to retrieve only those documents which contain either or both
of the text strings "Tony Blair" and "Michael Howard" you would enter the following line:
MustHaveCSVs="Tony Blair","Michael Howard"
4. Save and close the configuration file.
5. Restart the HTTP Fetch service to implement your changes.
If you specify more than one text string in the MustHaveCSVs setting, the page is downloaded if one
or more of the specified text strings are found in the document.
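
As an illustration of how a MustHaveCheck value selects where to search, the Python sketch below decodes the value and looks for the strings in the chosen parts of a page. The constant names and the matches function are assumptions made for this example. The value 128, which controls timing rather than location, is ignored here.

```python
# MustHaveCheck option values as documented above; names are illustrative.
SEARCH_URL = 1
SEARCH_HEADER = 4
SEARCH_CONTENT = 8
IGNORE_CASE = 64


def matches(check_value, strings, url="", header="", content=""):
    """Return True if any string occurs in a part selected by check_value."""
    parts = []
    if check_value & SEARCH_URL:
        parts.append(url)
    if check_value & SEARCH_HEADER:
        parts.append(header)
    if check_value & SEARCH_CONTENT:
        parts.append(content)

    if check_value & IGNORE_CASE:
        parts = [part.lower() for part in parts]
        strings = [s.lower() for s in strings]

    return any(s in part for part in parts for s in strings)


body = "Tony Blair answered questions from Michael Howard."
print(matches(137, ["Tony Blair", "Michael Howard"], content=body))
print(matches(1, ["Tony Blair"], url="http://example.com/news", content=body))
```

The second call selects only the URL, so the string in the page body is not found.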

Rejecting documents that contain unwanted text strings
If you set a spider to search for text strings that the document can’t have, the spider retrieves only
documents that do not contain the text strings. You need to configure two settings:
CantHaveCSVs
Specify the text string(s) that you want HTTP Fetch to search for.
CantHaveCheck
Allows you to control where HTTP Fetch looks for the text strings, whether or not HTTP Fetch is
case-sensitive and, if checking the URL, whether or not to check the URL before or after
download.
You can use the following values for the CantHaveCheck setting.
You can specify more than one action by adding the CantHaveCheck values of the separate
actions together. For example, to search both the URL and the page header, combine the first
two actions’ values together by adding the values 1 and 4 to o