1. basc_warc.WarcFile: Managing WARC files
This class is how you create and manage WARC files.
class basc_warc.WarcFile(records=[])
A WARC (Web ARChive) file.
1.1. Creating a new record
You can create a new record from a basc_warc.WarcFile instance.
WarcFile.create_record(record_type, defaults=True)
Create a new blank record.
Parameters:
- record_type (str) – WARC record type.
- defaults (bool) – Create the new record with WARC-Record-ID and WARC-Date.
Returns: A new basc_warc.Record.
Return type: basc_warc.Record
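For orientation, the two default fields can be sketched with the standard library alone. This is not BASC-Warc's code: the helper name default_warc_headers is hypothetical, and it only illustrates the conventional shape of these values in the WARC format (a urn:uuid identifier in angle brackets and a UTC timestamp).

```python
import uuid
from datetime import datetime, timezone


def default_warc_headers(now=None):
    """Return illustrative WARC-Record-ID and WARC-Date values.

    Hypothetical helper for illustration; not part of basc_warc.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        # A globally unique identifier, written as a URI in angle brackets.
        "WARC-Record-ID": "<urn:uuid:{}>".format(uuid.uuid4()),
        # UTC timestamp in the form YYYY-MM-DDThh:mm:ssZ.
        "WARC-Date": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Passing defaults=False to create_record presumably leaves these two fields for the caller to fill in; that reading is an assumption based on the parameter description above.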
1.2. Adding specific records
These functions let you add standard types of records easily.
WarcFile.add_warcinfo_record(fields={}, operator=None, software=None, robots=None, hostname=None, ip=None, http_header_user_agent=None, http_header_from=None)
Add a warcinfo record to this file.
Parameters:
- fields (dict) – Fields for this record.
- operator (string) – Contact information for the operator who created this resource. A name or a name and email address is recommended.
- software (string) – Software and software version used to create this WARC resource (defaults to BASC-Warc’s version information).
- robots (string) – The robots policy followed by the harvester creating this WARC resource. The string 'classic' indicates the 1994 web robots exclusion standard rules are being obeyed.
- hostname (string) – The hostname of the machine that created this WARC resource, such as “crawling17.archive.org”.
- ip (string) – The IP address of the machine that created this WARC resource, such as “123.2.3.4”.
- http_header_user_agent (string) – The HTTP ‘user-agent’ header usually sent by the harvester along with each request. If ‘request’ records are used to save verbatim requests, this information is redundant.
- http_header_from (string) – The HTTP ‘From’ header usually sent by the harvester along with each request (redundant when ‘request’ records are used, as above).
Returns: Index of the newly added record.
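The named parameters above correspond to fields that the WARC format suggests for the body of a warcinfo record, which is a block of "name: value" lines (media type application/warc-fields). A minimal sketch of that layout follows, using only the standard library. The helper warcinfo_body is hypothetical, the example values are made up, and how BASC-Warc itself serializes these fields is an assumption.

```python
def warcinfo_body(fields):
    """Serialize a dict as CRLF-terminated 'name: value' lines.

    Hypothetical helper for illustration; not part of basc_warc.
    Entries whose value is None are skipped, on the assumption that
    None means "not supplied".
    """
    lines = [
        "{}: {}\r\n".format(name, value)
        for name, value in fields.items()
        if value is not None
    ]
    return "".join(lines).encode("utf-8")


body = warcinfo_body({
    "operator": "Jane Doe <jane@example.org>",  # example value only
    "robots": "classic",
    "hostname": "crawling17.archive.org",
    "ip": None,  # skipped in the output
    "http-header-user-agent": "ExampleBot/1.0",  # example value only
})
```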
1.3. Adding custom records
These functions let you add basc_warc.Record objects directly into this WARC file.
In a threaded application, if you are adding multiple records that relate to each other, you should use the basc_warc.WarcFile.add_records() function, as this ensures the given records are adjacent.
WarcFile.add_record(record)
Add the given Record to our records.
Parameters: record (basc_warc.Record) – Record to add to this WARC file.
Returns: The index of the added record.
WarcFile.add_records(*records)
Add the given Records to our records.
Parameters: records (list of basc_warc.Record) – Records to add to this WARC file.
Returns: Indexes of the added records.
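The adjacency guarantee described for threaded applications can be illustrated with a small stand-in container. This is a sketch under assumptions, not BASC-Warc's implementation: the class RecordList and its use of a single threading.Lock are hypothetical, chosen only to show why one batched call keeps related records together while separate single-record calls may be interleaved by other threads.

```python
import threading


class RecordList:
    """Stand-in for a record container; not basc_warc's implementation."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def add_record(self, record):
        # One lock acquisition per record: another thread may append
        # between two consecutive calls.
        with self._lock:
            self._records.append(record)
            return len(self._records) - 1

    def add_records(self, *records):
        # One lock acquisition for the whole batch keeps it contiguous.
        with self._lock:
            start = len(self._records)
            self._records.extend(records)
            return list(range(start, start + len(records)))


def worker(store, name):
    # Each request/response pair is added in a single batched call.
    for i in range(100):
        store.add_records((name, i, "request"), (name, i, "response"))


store = RecordList()
threads = [threading.Thread(target=worker, args=(store, n)) for n in "abcd"]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the threads finish, every request entry in the stand-in list is immediately followed by its matching response, regardless of how the threads were scheduled.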
1.4. Writing files out
To write a file out, call the basc_warc.WarcFile.bytes() function and write the returned bytes to a file.
WarcFile.bytes(compress_records=False)
Return bytes to write.
Parameters: compress_records (bool) – Whether to apply gzip compression to records.
Returns: Bytes that represent this WARC file.
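A minimal sketch of the write step follows. The helper write_warc is hypothetical; it relies only on the bytes(compress_records=...) signature documented above and accepts any object exposing that method.

```python
def write_warc(warc_file, path, compress_records=False):
    """Write warc_file.bytes(...) to path and return the byte count.

    warc_file is expected to be a basc_warc.WarcFile; any object with a
    compatible bytes() method works. Hypothetical helper, not part of
    basc_warc.
    """
    data = warc_file.bytes(compress_records=compress_records)
    # Binary mode: the data is bytes and must not be newline-translated.
    with open(path, "wb") as handle:
        handle.write(data)
    return len(data)
```

For example, write_warc(warc_file, "crawl.warc") writes an uncompressed file. Files written with compress_records=True conventionally use the .warc.gz extension.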