Restrict parser network access

Parsing can cause network requests to be performed, especially when a URI is given as an argument, such as with raptor_parse_uri(). There may also be indirect requests, such as with the GRDDL parser, which retrieves URIs depending on the results of the initial parse. Some of these requests may be unwanted or may need to be filtered, which can be done in three ways.

Filtering parser network requests with feature RAPTOR_FEATURE_NO_NET

The parser feature RAPTOR_FEATURE_NO_NET can be set with raptor_set_feature() to forbid all network requests. This approach allows no customisation; for that, see the URI filters described in the following sections.

  rdf_parser = raptor_new_parser("rdfxml");

  /* Disable internal network requests */
  raptor_set_feature(rdf_parser, RAPTOR_FEATURE_NO_NET, 1);

Filtering parser network requests with raptor_www_set_uri_filter()

raptor_www_set_uri_filter() sets a filtering function that is applied to all URIs retrieved by a WWW connection. Such a connection can be used for parsing when it is operated by hand.

void write_bytes_handler(raptor_www* www, void *user_data,
                         const void *ptr, size_t size, size_t nmemb)
{
  raptor_parser* rdf_parser=(raptor_parser*)user_data;
  raptor_parse_chunk(rdf_parser, (unsigned char*)ptr, size*nmemb, 0);
}

int uri_filter(void* filter_user_data, raptor_uri* uri) {
  /* return non-0 to forbid the request */
  return 0; /* 0 allows the request */
}

int main(int argc, char *argv[]) { 
  ...

  rdf_parser = raptor_new_parser("rdfxml");
  www = raptor_new_www();

  /* filter all URI requests */
  raptor_www_set_uri_filter(www, uri_filter, filter_user_data);

  /* make WWW write bytes to parser */
  raptor_www_set_write_bytes_handler(www, write_bytes_handler, rdf_parser);

  raptor_start_parse(rdf_parser, uri);
  raptor_www_fetch(www, uri);
  /* tell the parser that we are done */
  raptor_parse_chunk(rdf_parser, NULL, 0, 1);

  raptor_www_free(www);
  raptor_free_parser(rdf_parser);

  ...
}
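
The body of uri_filter() is left open above. As a minimal sketch of one possible policy, the helper below decides on a plain C string whether a URI should be forbidden, using a prefix allow-list. The function name forbid_uri_string() and the example.org prefixes are hypothetical and are not part of Raptor; the helper only follows the callback's convention of returning non-0 to forbid a request.

```c
#include <string.h>

/* Hypothetical allow-list: only URIs that start with one of these
 * prefixes may be fetched. The prefixes are illustrative only. */
static const char* const allowed_prefixes[] = {
  "http://example.org/",
  "https://example.org/"
};

/* Returns 0 to allow the URI and non-0 to forbid it, matching the
 * return convention of the uri_filter() callback. */
int forbid_uri_string(const char* uri_string)
{
  size_t count = sizeof(allowed_prefixes) / sizeof(allowed_prefixes[0]);
  size_t i;

  /* Forbid anything that cannot be inspected */
  if(!uri_string)
    return 1;

  for(i = 0; i < count; i++) {
    size_t len = strlen(allowed_prefixes[i]);
    /* strncmp() returns 0 when the first len bytes are equal */
    if(!strncmp(uri_string, allowed_prefixes[i], len))
      return 0;
  }

  /* No prefix matched: forbid the request */
  return 1;
}
```

Inside uri_filter(), the string form of the URI could be obtained with raptor_uri_as_string() and passed to this helper. Keeping the trailing slash in each prefix avoids accepting look-alike hosts such as http://example.org.evil.test/.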

Filtering parser network requests with raptor_parser_set_uri_filter()

raptor_parser_set_uri_filter() sets a filtering function that is applied to all URIs that the parser sees. It operates on the internal raptor_www object that the parser uses to retrieve URIs, in the same way as described in the previous section.

  int uri_filter(void* filter_user_data, raptor_uri* uri) {
    /* return non-0 to forbid the request */
    return 0; /* 0 allows the request */
  }

  rdf_parser = raptor_new_parser("rdfxml");
  raptor_parser_set_uri_filter(rdf_parser, uri_filter, filter_user_data);

  /* parse content as normal */
  raptor_parse_uri(rdf_parser, uri, base_uri);
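
The filter_user_data pointer makes it possible to keep state between filter calls. The sketch below shows one way this might be used: a request budget that forbids further retrievals once a limit is reached, which could be useful with parsers that fetch URIs indirectly. The fetch_budget structure and budget_forbids_request() are hypothetical names, not part of Raptor, and whether the initial URI counts against the budget depends on how the parser invokes the filter.

```c
#include <stddef.h>

/* Hypothetical state, passed to the parser as filter_user_data */
typedef struct {
  int requests_seen;  /* number of requests the filter was asked about */
  int max_requests;   /* requests allowed before the filter forbids */
} fetch_budget;

/* Returns 0 while the budget lasts and non-0 once it is used up,
 * matching the return convention of the uri_filter() callback. */
int budget_forbids_request(void* filter_user_data)
{
  fetch_budget* budget = (fetch_budget*)filter_user_data;

  /* Without state there is nothing to check against: forbid */
  if(budget == NULL)
    return 1;

  budget->requests_seen++;
  return budget->requests_seen > budget->max_requests;
}
```

A uri_filter() callback could simply return budget_forbids_request(filter_user_data), with a fetch_budget initialised by the caller and passed as the last argument of raptor_parser_set_uri_filter().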

Setting timeout for parser network requests with feature RAPTOR_FEATURE_WWW_TIMEOUT

If the feature RAPTOR_FEATURE_WWW_TIMEOUT is set to a number greater than 0, it is used as the timeout in seconds for retrieving URIs during parsing (primarily for GRDDL). This uses raptor_www_set_connection_timeout() internally.

  rdf_parser = raptor_new_parser("grddl");

  /* set internal URI retrieval maximum time to 5 seconds */
  raptor_set_feature(rdf_parser, RAPTOR_FEATURE_WWW_TIMEOUT, 5);