1. basc_warc.WarcFile — Managing WARC files

This class is how you create and manage WARC files.

class basc_warc.WarcFile(records=[])

A WARC (Web ARChive) file.

1.1. Creating a new record

You can create a new record from a basc_warc.WarcFile.

WarcFile.create_record(record_type, defaults=True)

Create a new blank record.

Parameters:
  • record_type (str) – WARC record type.
  • defaults (bool) – Create new record with WARC-Record-ID and WARC-Date.
Returns:

New record.

Return type:

basc_warc.Record
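
As a minimal sketch of calling create_record(), the helper below wraps the documented call and checks the record type first. The helper name make_blank_record and the type check are illustrative assumptions, not part of basc_warc; the listed type names are the standard WARC record types, and the library itself may accept other values.

```python
# Standard WARC record types; the validation is an assumption of this
# sketch, not behaviour documented for basc_warc.
STANDARD_RECORD_TYPES = {
    "warcinfo", "response", "resource", "request",
    "metadata", "revisit", "conversion", "continuation",
}


def make_blank_record(warc_file, record_type, defaults=True):
    """Create a blank record through the documented create_record() call.

    warc_file is expected to be a basc_warc.WarcFile (or any object with
    the same create_record(record_type, defaults=True) method).
    """
    if record_type not in STANDARD_RECORD_TYPES:
        raise ValueError("unknown WARC record type: %r" % (record_type,))
    # defaults=True asks for WARC-Record-ID and WARC-Date to be filled in.
    return warc_file.create_record(record_type, defaults=defaults)
```

Typical use would be make_blank_record(basc_warc.WarcFile(), 'response'). Whether the new record is also attached to the file is not stated here, so the sketch only returns it.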

1.2. Adding specific records

These functions let you add standard types of records easily.

WarcFile.add_warcinfo_record(fields={}, operator=None, software=None, robots=None, hostname=None, ip=None, http_header_user_agent=None, http_header_from=None)

Add a warcinfo record to this file.

Parameters:
  • fields (dict) – Fields for this record.
  • operator (string) – Contact information for the operator who created this resource. A name or a name and email address is recommended.
  • software (string) – Software and software version used to create this WARC resource (defaults to BASC-Warc’s version information).
  • robots (string) – The robots policy followed by the harvester creating this WARC resource. The string 'classic' indicates the 1994 web robots exclusion standard rules are being obeyed.
  • hostname (string) – The hostname of the machine that created this WARC resource, such as “crawling17.archive.org”.
  • ip (string) – The IP address of the machine that created this WARC resource, such as “123.2.3.4”.
  • http_header_user_agent (string) – The HTTP ‘user-agent’ header usually sent by the harvester along with each request. If ‘request’ records are used to save verbatim requests, this information is redundant.
  • http_header_from (string) – The HTTP ‘From’ header usually sent by the harvester along with each request (redundant when ‘request’ records are used, as above).
Returns:

Index of the newly added record.
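
The parameters above can be passed as keyword arguments. The following is a sketch only: the helper name add_crawl_info and its argument choices are hypothetical, and it relies solely on the add_warcinfo_record() signature shown above.

```python
def add_crawl_info(warc_file, operator, hostname, ip, robots="classic"):
    """Add a warcinfo record describing the crawl and return its index.

    warc_file is expected to be a basc_warc.WarcFile. The software field
    is left unset so the library's documented default is used.
    """
    return warc_file.add_warcinfo_record(
        operator=operator,   # name, or name and email address
        robots=robots,       # 'classic' = 1994 robots exclusion rules
        hostname=hostname,
        ip=ip,
    )
```

Fields left as None, such as http_header_user_agent, can be skipped when 'request' records already store the verbatim requests.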

1.3. Adding custom records

These functions let you add basc_warc.Record objects directly into this WARC file.

In a threaded application, if you are adding multiple records that relate to each other, you should use the basc_warc.WarcFile.add_records() function, as this will ensure the given records are adjacent.

WarcFile.add_record(record)

Add the given Record to our records.

Parameters: record (basc_warc.Record) – Record to add to this WARC file.
Returns: The index of the added record.

WarcFile.add_records(*records)

Add the given Records to our records.

Parameters: records (basc_warc.Record) – Records to add to this WARC file, passed as separate positional arguments.
Returns: Indexes of the added records.
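
To illustrate the threading advice in this section, here is a rough sketch in which each worker thread adds a related pair of records with a single add_records() call, so the pair stays adjacent. The function names and the request/response pairing are assumptions made for the example; only add_records(*records) comes from this documentation.

```python
import threading


def add_related(warc_file, *records):
    """Add related records in one call so they end up next to each other."""
    return warc_file.add_records(*records)


def add_pairs_concurrently(warc_file, pairs):
    """Add each (request, response) pair from its own thread.

    Each pair goes through a single add_records() call rather than two
    add_record() calls, which could be interleaved with other threads.
    """
    threads = [
        threading.Thread(target=add_related, args=(warc_file, request, response))
        for request, response in pairs
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

The order of the pairs relative to one another depends on thread scheduling; only the adjacency within each pair is relied on here.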

1.4. Writing files out

To write a WARC file out, call basc_warc.WarcFile.bytes() and write the returned bytes to a file opened in binary mode.

WarcFile.bytes(compress_records=False)

Return bytes to write.

Parameters: compress_records (bool) – Whether to apply gzip compression to records.
Returns: Bytes that represent this WARC file.
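
A short sketch of writing the output to disk might look like the following. The helper name write_warc and the path argument are hypothetical; the only library call used is bytes(compress_records=False) as documented above.

```python
def write_warc(warc_file, path, compress_records=False):
    """Serialise warc_file and write it to path; return the byte count.

    warc_file is expected to be a basc_warc.WarcFile.
    """
    data = warc_file.bytes(compress_records=compress_records)
    # The file must be opened in binary mode because bytes() returns bytes.
    with open(path, "wb") as handle:
        handle.write(data)
    return len(data)
```

By convention, gzip-compressed WARC files are usually named with a .warc.gz suffix and uncompressed ones with .warc, so the caller may want to choose the path to match compress_records.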