BASC-WARC
Bibliotheca Anonoma’s library for creating and managing WARC files.
Warning
This library is still in the planning / pre-alpha stage and has not reached alpha. If you use it, anything can change without notice, everything can be overhauled, and development may stop entirely without warning.
This library is primarily being written for BASC-Archiver, and is planned to be integrated into a new or existing downloading library.
Planned Features
- Python 2/3 compatibility.
- Thread-safe.
- Streaming reading/writing of WARC files, for dealing with very large files on systems with smaller amounts of memory.
- CDX file creation and management.
- Included scripts that do useful work, possibly allowing viewing or extracting information and files from WARCs / appending WARCs / creating CDX files from WARCs, similar to megawarc, CDX-Writer, or warctools.
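To illustrate the streaming goal in the list above, here is a minimal sketch of reading an uncompressed WARC stream one record at a time. It uses only the standard library and is not BASC-WARC code: the function name `iter_warc_records` is hypothetical, header continuation lines are ignored, and each block is still read in one piece (a reader meant for very large payloads would hand out the block in bounded chunks instead).

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) pairs from an uncompressed WARC byte stream.

    Hypothetical helper: only one record is held in memory at a time.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line in (b"\r\n", b"\n"):
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # Content-Length says how many bytes of block follow the headers.
        length = int(headers.get("Content-Length", "0"))
        block = stream.read(length)
        yield headers, block


sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"Content-Length: 3\r\n"
    b"\r\n"
    b"abc"
    b"\r\n\r\n"
)

records = list(iter_warc_records(io.BytesIO(sample)))
```

Because the function is a generator, a caller can loop over a file opened in binary mode and discard each record before the next one is parsed.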
License
Written in 2015 by Daniel Oaks <daniel@danieloaks.net>
To the extent possible under law, the author(s) have dedicated all copyright and related and neighboring rights to this software to the public domain worldwide. This software is distributed without any warranty.
You should have received a copy of the CC0 Public Domain Dedication along with this software. If not, see http://creativecommons.org/publicdomain/zero/1.0/.
Library
basc_warc.WarcFile - Managing WARC files
This class is how you create and manage WARC files.
class basc_warc.WarcFile(records=[])
A WARC (Web ARChive) file.
Creating a new record
You can create a new record from a basc_warc.WarcFile.
WarcFile.create_record(record_type, defaults=True)
Create a new blank record.
Parameters:
- record_type (str) – WARC record type.
- defaults (bool) – Create the new record with WARC-Record-ID and WARC-Date already set.
Returns: New basc_warc.Record.
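As a rough illustration of what the two default fields usually contain, the sketch below builds a WARC-Record-ID as a bracketed `urn:uuid` URI and a WARC-Date as a UTC timestamp. The helper name `default_record_fields` is hypothetical and this is not how BASC-WARC necessarily generates them; it targets Python 3.

```python
import uuid
from datetime import datetime, timezone


def default_record_fields(record_type):
    """Build the header fields a new record might start with (hypothetical helper)."""
    now = datetime.now(timezone.utc).replace(microsecond=0)
    return {
        "WARC-Type": record_type,
        # Record IDs are URIs wrapped in angle brackets; a random UUID is a common choice.
        "WARC-Record-ID": "<urn:uuid:%s>" % uuid.uuid4(),
        # Timestamps are written in UTC, e.g. 2015-03-01T12:00:00Z.
        "WARC-Date": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


fields = default_record_fields("resource")
```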
Adding specific records
These functions let you add standard types of records easily.
WarcFile.add_warcinfo_record(fields={}, operator=None, software=None, robots=None, hostname=None, ip=None, http_header_user_agent=None, http_header_from=None)
Add a warcinfo record to this file.
Parameters:
- fields (dict) – Fields for this record.
- operator (string) – Contact information for the operator who created this resource. A name or a name and email address is recommended.
- software (string) – Software and software version used to create this WARC resource (defaults to BASC-Warc’s version information).
- robots (string) – The robots policy followed by the harvester creating this WARC resource. The string 'classic' indicates the 1994 web robots exclusion standard rules are being obeyed.
- hostname (string) – The hostname of the machine that created this WARC resource, such as “crawling17.archive.org”.
- ip (string) – The IP address of the machine that created this WARC resource, such as “123.2.3.4”.
- http_header_user_agent (string) – The HTTP ‘user-agent’ header usually sent by the harvester along with each request. If ‘request’ records are used to save verbatim requests, this information is redundant.
- http_header_from (string) – The HTTP ‘From’ header usually sent by the harvester along with each request (redundant when ‘request’ records are used, as above).
Returns: Index of the newly added record.
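For orientation, a warcinfo record's block is conventionally a list of `name: value` lines (the `application/warc-fields` format). The sketch below serializes such a block with the standard library only; `warcinfo_body` and the sample operator and software values are made up for illustration and are not part of BASC-WARC.

```python
from collections import OrderedDict


def warcinfo_body(fields):
    """Serialize fields as 'name: value' lines ending in CRLF (hypothetical helper)."""
    lines = ["%s: %s\r\n" % (name, value) for name, value in fields.items()]
    return "".join(lines).encode("utf-8")


info = OrderedDict([
    ("operator", "Jane Doe <jane@example.org>"),  # made-up contact
    ("software", "example-crawler/0.1"),          # made-up software string
    ("robots", "classic"),
    ("hostname", "crawling17.archive.org"),
    ("ip", "123.2.3.4"),
])
body = warcinfo_body(info)
```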
Adding custom records
These functions let you add basc_warc.Record objects directly into this WARC file.
In a threaded application, if you are adding multiple records that relate to each other, you should use the basc_warc.WarcFile.add_records() function, as this will ensure the given records are adjacent.
WarcFile.add_record(record)
Add the given Record to our records.
Parameters: record (basc_warc.Record) – Record to add to this WARC file.
Returns: The index of the added record.
WarcFile.add_records(*records)
Add the given Records to our records.
Parameters: records (list of basc_warc.Record) – Records to add to this WARC file.
Returns: Indexes of the added records.
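The adjacency guarantee described above can be pictured with a small stand-in class. This is a sketch under the assumption that a single lock guards the record list; `RecordList` is hypothetical and does not reflect BASC-WARC's actual internals. Three threads each add request/response pairs, and every pair stays together because the whole group is appended under one lock acquisition.

```python
import threading


class RecordList(object):
    """Hypothetical stand-in for the record storage inside a WARC file object."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def add_record(self, record):
        with self._lock:
            self._records.append(record)
            return len(self._records) - 1

    def add_records(self, *records):
        # One lock acquisition covers the whole group, so no other thread
        # can slip a record in between them.
        with self._lock:
            start = len(self._records)
            self._records.extend(records)
            return list(range(start, start + len(records)))


store = RecordList()


def worker(name):
    for i in range(100):
        store.add_records((name, i, "request"), (name, i, "response"))


threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b", "c")]
for t in threads:
    t.start()
for t in threads:
    t.join()

records = store._records
# Every even index should hold a request followed directly by its response.
pairs_adjacent = all(
    records[i][:2] == records[i + 1][:2]
    and records[i][2] == "request"
    and records[i + 1][2] == "response"
    for i in range(0, len(records), 2)
)
```

Calling `add_record` twice from the same thread would take the lock twice, which leaves a window for another thread to insert a record in between.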
Writing files out
To write files out, you simply use the basc_warc.WarcFile.bytes() function and write the output to a file.
WarcFile.bytes(compress_records=False)
Return bytes to write.
Parameters: compress_records (bool) – Whether to apply gzip compression to records.
Returns: Bytes that represent this WARC file.
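The sketch below shows one way the write-out step can look, assuming that compressing records means gzipping each record as its own gzip member (the usual convention for `.warc.gz` files; whether BASC-WARC does exactly this is an assumption). `join_records` is a hypothetical stand-in for the bytes-producing step, and the code targets Python 3.

```python
import gzip
import os
import tempfile


def join_records(serialized_records, compress_records=False):
    """Concatenate already-serialized records, optionally gzipping each one."""
    if compress_records:
        # Each record becomes its own gzip member; concatenated members
        # still form a valid gzip stream.
        return b"".join(gzip.compress(r) for r in serialized_records)
    return b"".join(serialized_records)


raw_records = [
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n",
]

path = os.path.join(tempfile.mkdtemp(), "example.warc.gz")
with open(path, "wb") as handle:  # binary mode, since the payload is bytes
    handle.write(join_records(raw_records, compress_records=True))

# Reading the file back through gzip yields the uncompressed records in order.
with gzip.open(path, "rb") as handle:
    round_trip = handle.read()
```

Whatever produces the bytes, the file should be opened in binary mode (`"wb"`) so that the CRLF line endings are written unchanged.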
basc_warc.Record - WARC Records
You use this class to create and add new records.
Creation
The constructor below creates a record of the given type.
class basc_warc.Record(record_type, header=None, block=None)
A record in a WARC file.
Parameters:
- record_type (string) – Name of this type of record, e.g. 'warcinfo'.
- header (RecordHeader) – A basc_warc.RecordHeader object.
- block (RecordBlock) – A basc_warc.RecordBlock object.
basc_warc.RecordHeader - WARC Record Header
This class is a header for a basc_warc.Record object.
Fields
The following methods let you set standard WARC fields.
class basc_warc.RecordHeader(fields={})
A header for a WARC record.
Parameters: fields (dict) – Fields to create this header with.
RecordHeader.set_field(name, value)
Set field to the given value.
Parameters:
- name (string) – Name of the field.
- value (string or int) – Value of the field.
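To show where such fields end up, here is a sketch of how a set of header fields is laid out on disk in a WARC record: a version line, one `Name: value` line per field, and a blank line. `serialize_header` is a hypothetical helper and not BASC-WARC's serializer; integer values such as a content length are formatted as text. It relies on dictionaries keeping insertion order (Python 3.7+).

```python
def serialize_header(fields, version="WARC/1.0"):
    """Lay out header fields as WARC header lines (hypothetical helper)."""
    lines = [version]
    for name, value in fields.items():
        lines.append("%s: %s" % (name, value))  # %s also formats ints
    # CRLF after every line, plus a blank line to end the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")


header_fields = {"WARC-Type": "resource", "Content-Length": 3}
data = serialize_header(header_fields)
```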
WARC Record blocks
You use any of these classes as a content block for a basc_warc.Record object.
Bytes
This is the standard type of block, and lets you expose a series of bytes (a file, HTTP request, etc.) as a block for a basc_warc.Record.
warcinfo Block
This type of block is used for warcinfo Records, and lets you easily set keys and values.
class basc_warc.WarcinfoBlock(fields={})
Block for a warcinfo record.
Parameters: fields (dict) – Fields to create this block with.
WarcinfoBlock.set_field(name, value)
Set field to the given value.
Parameters:
- name (string) – Name of the field.
- value (string or int) – Value of the field.