Class NetworkCrawler

All Implemented Interfaces:
DataProvider

public class NetworkCrawler extends AbstractListCrawler<URL>
Provider for data files fetched directly from the network.

This class handles a list of URLs pointing to data files or zip/jar archives on the net. Since the net is not a tree structure, the list elements cannot be top-level elements browsed recursively as in DirectoryCrawler; they must be data files or zip/jar archives.

Files fetched from the network can be cached locally on disk. This avoids overly frequent network access when the URLs are remote ones (for example original internet URLs).

If the URL points to a remote server (typically on the web) on the other side of a proxy server, you need to configure the networking layer of your application to use the proxy. For a typical authenticating proxy as used in many corporate environments, this can be done as follows, using for example the AuthenticatorDialog graphical authenticator class found in the tests directories:

   System.setProperty("http.proxyHost",     "proxy.your.domain.com");
   System.setProperty("http.proxyPort",     "8080");
   System.setProperty("http.nonProxyHosts", "localhost|*.your.domain.com");
   Authenticator.setDefault(new AuthenticatorDialog());
 
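Where a graphical prompt is not an option (for example in a batch job), a fixed-credentials authenticator could stand in for AuthenticatorDialog. The following is only a minimal sketch using the standard java.net.Authenticator API; the class name FixedProxyAuthenticator and the credentials are hypothetical placeholders, not part of the library.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

/** Hypothetical non-graphical authenticator that answers proxy challenges only. */
class FixedProxyAuthenticator extends Authenticator {

    private final String user;
    private final char[] password;

    FixedProxyAuthenticator(String user, char[] password) {
        this.user = user;
        this.password = password.clone();
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // Decline challenges that do not come from a proxy,
        // so the credentials are never sent to an origin server.
        if (getRequestorType() != RequestorType.PROXY) {
            return null;
        }
        return new PasswordAuthentication(user, password);
    }

    public static void main(String[] args) {
        // Placeholder credentials; a real application would read them from a secure source.
        Authenticator.setDefault(new FixedProxyAuthenticator("user", "secret".toCharArray()));
    }
}
```

Restricting the answer to proxy challenges is a design choice of this sketch; it can be relaxed if the remote server itself also requires authentication.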

All registered filters are applied.

Zip archive entries are supported recursively.
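To illustrate what recursive support for zip entries means, the sketch below walks an archive with java.util.zip and descends into any entry that is itself a zip or jar. It is not the ZipJarCrawler implementation; the class NestedZipSketch, its method names, and the "!/" separator used in the printed names are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper listing data entries of an archive, nested archives included. */
class NestedZipSketch {

    static List<String> listLeafEntries(InputStream in, String prefix) throws IOException {
        List<String> names = new ArrayList<>();
        // Not closed here: closing it would also close the enclosing stream.
        ZipInputStream zip = new ZipInputStream(in);
        for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
            if (e.isDirectory()) {
                continue;
            }
            String full = prefix + e.getName();
            if (full.endsWith(".zip") || full.endsWith(".jar")) {
                // The current entry is itself an archive: read it in place, recursively.
                names.addAll(listLeafEntries(zip, full + "!/"));
            } else {
                names.add(full);
            }
        }
        return names;
    }

    /** Builds a one-entry archive in memory, only to feed the demonstration. */
    static byte[] zipOf(String name, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry(name));
            out.write(content);
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] inner = zipOf("data.txt", new byte[] {42});
        byte[] outer = zipOf("inner.zip", inner);
        // prints [inner.zip!/data.txt]
        System.out.println(listLeafEntries(new ByteArrayInputStream(outer), ""));
    }
}
```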

This is a simple application of the visitor design pattern for list browsing.

Author:
Luc Maisonobe
  • Constructor Details

    • NetworkCrawler

      public NetworkCrawler(URL... inputs)
      Build a network crawler.

      The default timeout is set to 10 seconds.

      Parameters:
      inputs - list of input file URLs
  • Method Details

    • setTimeout

      public void setTimeout(int timeout)
      Set the timeout for connection.
      Parameters:
      timeout - connection timeout in milliseconds
    • getCompleteName

      protected String getCompleteName(URL input)
      Get the complete name of an input.
      Specified by:
      getCompleteName in class AbstractListCrawler<URL>
      Parameters:
      input - input to consider
      Returns:
      complete name of the input
    • getBaseName

      protected String getBaseName(URL input)
      Get the base name of an input.
      Specified by:
      getBaseName in class AbstractListCrawler<URL>
      Parameters:
      input - input to consider
      Returns:
      base name of the input
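As an illustration of the difference between a complete name and a base name for a URL input, the sketch below takes the complete name to be the full URL text and the base name to be the last path segment. This interpretation, the class UrlNamesSketch, and the example.org URL are assumptions made for the example; the actual behavior is defined by the class itself.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

/** Hypothetical helper showing one plausible reading of the two names. */
class UrlNamesSketch {

    /** Full textual form of the URL. */
    static String completeName(URL input) {
        return input.toExternalForm();
    }

    /** Last segment of the URL path, i.e. the file name without its location. */
    static String baseName(URL input) {
        return new File(input.getPath()).getName();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Placeholder URL, used only for its shape.
        URL url = URI.create("https://example.org/data/tai-utc.dat").toURL();
        System.out.println(completeName(url)); // prints https://example.org/data/tai-utc.dat
        System.out.println(baseName(url));     // prints tai-utc.dat
    }
}
```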
    • getZipJarCrawler

      protected ZipJarCrawler getZipJarCrawler(URL input)
      Get a zip/jar crawler for an input.
      Specified by:
      getZipJarCrawler in class AbstractListCrawler<URL>
      Parameters:
      input - input to consider
      Returns:
      zip/jar crawler for the input
    • getStream

      protected InputStream getStream(URL input) throws IOException
      Get the stream to read from an input.
      Specified by:
      getStream in class AbstractListCrawler<URL>
      Parameters:
      input - input to read from
      Returns:
      stream to read the content of the input
      Throws:
      IOException - if the input cannot be opened for reading
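For readers unfamiliar with how a connection timeout in milliseconds relates to opening a stream from a URL, here is a minimal sketch based on the standard java.net.URLConnection API. It assumes the timeout is applied as a connect timeout before the stream is requested; the class StreamSketch and its open method are hypothetical and not part of this class.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper opening a URL with a connection timeout. */
class StreamSketch {

    static InputStream open(URL input, int timeoutMillis) throws IOException {
        URLConnection connection = input.openConnection();
        // The value is in milliseconds, so 10 seconds is 10000.
        connection.setConnectTimeout(timeoutMillis);
        // Throws IOException if the resource cannot be opened for reading.
        return connection.getInputStream();
    }

    public static void main(String[] args) throws IOException {
        // A local temporary file keeps the demonstration independent of the network.
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        try (InputStream in = open(tmp.toUri().toURL(), 10_000)) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // prints hello
        } finally {
            Files.delete(tmp);
        }
    }
}
```

The same helper accepts http or https URLs; a file URL is used here only so that the example runs without network access.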