BASC-WARC

Bibliotheca Anonoma’s library for creating and managing WARC files.

Warning

This library is not yet in alpha; it is in the planning / pre-alpha stage. If you use it, ANYTHING can change without notice, everything can be overhauled, and development may stop entirely without warning.

This library is primarily being written for BASC-Archiver, and is planned to be integrated into a new or existing downloading library.

Planned Features

  • Python 2/3 compatibility.
  • Thread-safe.
  • Streaming reading/writing of WARC files, for dealing with very large files on systems with smaller amounts of memory.
  • CDX file creation and management.
  • Included scripts that do useful work, possibly allowing viewing or extracting information and files from WARCs / appending WARCs / creating CDX files from WARCs, similar to megawarc, CDX-Writer, or warctools.

License

Written in 2015 by Daniel Oaks <daniel@danieloaks.net>

To the extent possible under law, the author(s) have dedicated all copyright and related and neighboring rights to this software to the public domain worldwide. This software is distributed without any warranty.

You should have received a copy of the CC0 Public Domain Dedication along with this software. If not, see http://creativecommons.org/publicdomain/zero/1.0/.

Library

basc_warc.WarcFile — Managing WARC files

This class is how you create and manage WARC files.

class basc_warc.WarcFile(records=[])

A WARC (Web ARChive) file.

Creating a new record

You can create a new record from a basc_warc.WarcFile.

WarcFile.create_record(record_type, defaults=True)

Create a new blank record.

Parameters:
  • record_type (str) – WARC record type.
  • defaults (bool) – Create new record with WARC-Record-ID and WARC-Date.
Returns:

New record.

Return type:

basc_warc.Record
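
As a rough illustration of what defaults=True might fill in, the sketch below builds the two fields with the standard library only. The helper name default_fields is hypothetical and this is not BASC-WARC's own code; it assumes a urn:uuid record ID and a UTC timestamp in the ISO 8601 form used by the WARC format, and it targets Python 3.

```python
import uuid
from datetime import datetime, timezone


def default_fields():
    """Return illustrative default WARC fields (hypothetical helper)."""
    return {
        # A globally unique identifier, written as a URI in angle brackets.
        'WARC-Record-ID': '<urn:uuid:{}>'.format(uuid.uuid4()),
        # UTC timestamp in the form YYYY-MM-DDThh:mm:ssZ.
        'WARC-Date': datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'),
    }


fields = default_fields()
```

Every call produces a fresh record ID, which is what keeps IDs unique within a file.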

Adding specific records

These functions let you add standard types of records easily.

WarcFile.add_warcinfo_record(fields={}, operator=None, software=None, robots=None, hostname=None, ip=None, http_header_user_agent=None, http_header_from=None)

Add a warcinfo record to this file.

Parameters:
  • fields (dict) – Fields for this record.
  • operator (string) – Contact information for the operator who created this resource. A name or a name and email address is recommended.
  • software (string) – Software and software version used to create this WARC resource (defaults to BASC-WARC’s version information).
  • robots (string) – The robots policy followed by the harvester creating this WARC resource. The string 'classic' indicates the 1994 web robots exclusion standard rules are being obeyed.
  • hostname (string) – The hostname of the machine that created this WARC resource, such as “crawling17.archive.org”.
  • ip (string) – The IP address of the machine that created this WARC resource, such as “123.2.3.4”.
  • http_header_user_agent (string) – The HTTP ‘user-agent’ header usually sent by the harvester along with each request. If ‘request’ records are used to save verbatim requests, this information is redundant.
  • http_header_from (string) – The HTTP ‘From’ header usually sent by the harvester along with each request (redundant when ‘request’ records are used, as above).
Returns:

Index of the newly added record.
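
The named parameters above correspond to the standard warcinfo field names, with underscores in place of hyphens. The sketch below shows one way such arguments could be merged and serialized as "name: value" lines. Both helpers (warcinfo_fields and warc_fields_bytes) are hypothetical and are not part of BASC-WARC; the operator value is a made-up example.

```python
def warcinfo_fields(fields=None, **named):
    """Merge explicit fields with keyword arguments (hypothetical helper)."""
    merged = dict(fields or {})
    for key, value in named.items():
        if value is not None:
            # e.g. http_header_user_agent -> http-header-user-agent
            merged[key.replace('_', '-')] = value
    return merged


def warc_fields_bytes(fields):
    """Serialize fields as 'name: value' lines, each ending with CRLF."""
    lines = ['{}: {}\r\n'.format(name, value) for name, value in fields.items()]
    return ''.join(lines).encode('utf-8')


info = warcinfo_fields(
    operator='Example Operator <operator@example.org>',
    robots='classic',
    hostname='crawling17.archive.org',
    ip=None,  # unset values are skipped
)
body = warc_fields_bytes(info)
```

Using None as the default, rather than a shared dict, avoids the usual pitfall of mutable default arguments in this sketch.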

Adding custom records

These functions let you add basc_warc.Record objects directly into this WARC file.

In a threaded application, if you are adding multiple records that relate to each other, you should use the basc_warc.WarcFile.add_records() function, as this will ensure the given records are adjacent.

WarcFile.add_record(record)

Add the given Record to our records.

Parameters:record (basc_warc.Record) – Record to add to this WARC file.
Returns:The index of the added record.

WarcFile.add_records(*records)

Add the given Records to our records.

Parameters:records (list of basc_warc.Record) – Records to add to this WARC file.
Returns:Indexes of the added records.
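
To show why a single add_records() call can keep related records adjacent in a threaded application, here is a minimal sketch of lock-protected storage. RecordList is a hypothetical stand-in written for this example, not the library's implementation, and plain strings stand in for Record objects.

```python
import threading


class RecordList(object):
    """Hypothetical stand-in for the record storage inside a WARC file."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def add_record(self, record):
        """Append one record and return its index."""
        with self._lock:
            self._records.append(record)
            return len(self._records) - 1

    def add_records(self, *records):
        """Append several records under one lock so they stay adjacent."""
        with self._lock:
            start = len(self._records)
            self._records.extend(records)
            return list(range(start, start + len(records)))


store = RecordList()
first = store.add_record('warcinfo')
pair = store.add_records('request', 'response')
```

Calling add_record() twice from one thread would allow another thread to insert a record in between; holding the lock for the whole batch rules that out.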

Writing files out

To write files out, you simply use the basc_warc.WarcFile.bytes() function and write the output to a file.

WarcFile.bytes(compress_records=False)

Return bytes to write.

Parameters:compress_records (bool) – Whether to apply gzip compression to records.
Returns:Bytes that represent this WARC file.
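
The sketch below illustrates what per-record compression can look like, assuming compress_records means that each record becomes its own gzip member. The join_records helper and the sample record bytes are hypothetical; this is not the library's code.

```python
import gzip


def join_records(records, compress_records=False):
    """Concatenate serialized records, optionally gzipping each one."""
    if compress_records:
        # One gzip member per record, so a reader can start at any record.
        return b''.join(gzip.compress(record) for record in records)
    return b''.join(records)


records = [b'first record\r\n\r\n', b'second record\r\n\r\n']
plain = join_records(records)
packed = join_records(records, compress_records=True)

# gzip.decompress reads all concatenated members back as one stream.
restored = gzip.decompress(packed)
```

Either result can then be written to a file opened in binary mode ('wb').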

basc_warc.Record — WARC Records

You use this class to create and add new records.

Creation

You construct a record from a record type, an optional header, and an optional content block.

class basc_warc.Record(record_type, header=None, block=None)

A record in a WARC file.

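Parameters:
  • record_type (str) – WARC record type.
  • header (basc_warc.RecordHeader) – Header for this record.
  • block – Content block for this record, such as a basc_warc.RecordBlock or basc_warc.WarcinfoBlock.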

basc_warc.RecordHeader - WARC Record Header

This class is a header for a basc_warc.Record object.

Fields

The following methods let you set standard WARC fields.

class basc_warc.RecordHeader(fields={})

A header for a WARC record.

Parameters:fields (dict) – Fields to create this header with.

RecordHeader.set_field(name, value)

Set field to the given value.

Parameters:
  • name (string) – Name of the field.
  • value (string or int) – Value of the field.
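
For context, header fields end up as "name: value" lines in the serialized record. The sketch below shows that layout, assuming a WARC/1.0 version line; header_bytes is a hypothetical helper and not a BASC-WARC function.

```python
def header_bytes(fields, version='WARC/1.0'):
    """Serialize header fields into a WARC record header (hypothetical helper)."""
    lines = [version]
    for name, value in fields.items():
        # format() renders both string and int values.
        lines.append('{}: {}'.format(name, value))
    # Each line ends with CRLF; an empty line closes the header.
    return ('\r\n'.join(lines) + '\r\n\r\n').encode('utf-8')


header = header_bytes({
    'WARC-Type': 'resource',
    'Content-Length': 11,
})
```

Accepting ints as well as strings is convenient for fields such as Content-Length.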

Simple field access

These are convenient ways to access certain fields.

RecordHeader.record_id

ID of this Record; it should be unique within the WARC file.

RecordHeader.date

Datetime at which the data capture that created this Record started.

WARC Record blocks

You use any of these classes as a content block for a basc_warc.Record object.

Bytes

This is the standard type of block, and lets you expose a series of bytes (a file, HTTP request, etc) as a block for a basc_warc.Record.

class basc_warc.RecordBlock(content=None)

Block for an arbitrary record.

Parameters:content (bytes) – Block of content to expose in this record.

warcinfo Block

This type of block is used for warcinfo Records, and lets you easily set keys and values.

class basc_warc.WarcinfoBlock(fields={})

Block for a warcinfo record.

Parameters:fields (dict) – Fields to create this block with.

WarcinfoBlock.set_field(name, value)

Set field to given value.

Parameters:
  • name (string) – Name of the field.
  • value (string or int) – Value of the field.