A storage place for samples of all kinds. Typically focused around large-scale collection of phishing kits and other email/website artifacts.
The basic plan at the moment is to treat each submitted archive as a “job”. There are “known good” jobs and “other” jobs. A “known good” job would be something like the Wordpress installer, while an “other” job would be a backup of a compromised or phishing site. There’s likely plenty of commonality between them, but the interesting parts are the differences.
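The job model above could be sketched roughly as follows. The `Job` class and its fields are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the "job" concept: a submitted archive plus a flag
# marking whether it came from a known-good source (e.g. the Wordpress
# installer) or is an "other" submission (e.g. a compromised-site backup).
@dataclass
class Job:
    archive_name: str
    known_good: bool = False
    file_hashes: list = field(default_factory=list)  # sha256 of each extracted file

wordpress = Job("wordpress-6.0.zip", known_good=True)
suspicious = Job("site-backup.tar.gz")  # defaults to known_good=False
```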
If you have any kind of suggestion or issue, please create a GitHub issue - I’ll gladly discuss it. Pull requests for features or fixes are even better :)
Starting the web interface:
pipenv run python -m dewar web
Starting the ingestor (not really working yet)
pipenv run python -m dewar ingestor
Internal “element” types
- file - the file reference is always the sha256 hash of the file
- job - a collection of files that group together, typically encapsulated in an archive file as you ingest it
- bucket - a place where files are stored or ingested from (ie, storage, incoming-knowngood)
  - This should be a simple string reference, so storage backends can implement the dir() function and return a list of files regardless of the method of storage.
- other (ie, processing results)?
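The element types above might be modelled like this. The class names and shapes are assumptions for illustration, not the project's actual types:

```python
from dataclasses import dataclass

# Illustrative sketch of the element types: files are always referenced by
# their sha256 hash, and buckets are plain string names.
@dataclass(frozen=True)
class FileRef:
    sha256: str  # the only reference to a file is its sha256 hash

@dataclass(frozen=True)
class Bucket:
    name: str  # simple string reference, e.g. "storage" or "incoming-knowngood"

# the sha256 of the empty string, used here as a stand-in hash
ref = FileRef("e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")
bucket = Bucket("incoming-knowngood")
```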
- store each file only once, identified by its sha256 hash
- compression or optimisation is up to the storage backend
- built to have swappable bits, so if you’ve got big fat database servers for metadata you can use them, or if you want to store files in mongodb/elastic/whatever, go ahead.
- simple tokenisation of file contents could help find uncommon code structures or things that’d lead to IOCs or allow for tracking over time
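Storing each file once, keyed by its sha256 hash, is the classic content-addressed pattern. A minimal in-memory sketch (the `HashStore` class is an assumption; real backends may compress or optimise however they like):

```python
import hashlib

# Minimal content-addressed store: identical content always produces the
# same key, so a file is only ever stored once regardless of how many
# archives it appears in.
class HashStore:
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # no-op if already stored
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

store = HashStore()
h1 = store.put(b"<?php echo 'kit'; ?>")
h2 = store.put(b"<?php echo 'kit'; ?>")  # duplicate upload: same hash, one copy
```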
Various bits to build
- ingestion methods:
  - “known_good” - automatically tagged as good
  - “other” - known_good = False
- [ ] have a simple API for submitting files, part of the frontend
- ingestion pipelines
  - [ ] simple single threaded widget
  - [ ] pubsub queue with multiple nodes doing things
- storage backends
- metadata backends
- processing of samples
- processing pipelines
- Data interaction
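The “simple single threaded widget” pipeline could look roughly like this. All names here are assumptions sketching the flow, not the project's code:

```python
import hashlib

# Rough single-threaded ingestion sketch: each submitted archive becomes a
# "job", every member file is hashed, and the whole job is tagged known_good
# (or not) depending on which ingestion method it arrived through.
def ingest(files, known_good=False):
    job = {"known_good": known_good, "files": {}}
    for name, data in files:
        digest = hashlib.sha256(data).hexdigest()
        job["files"][name] = digest  # files are tracked by sha256 only
    return job

job = ingest([("index.php", b"<?php echo 'hi'; ?>")])  # an "other" submission
```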
Starting a new backend implementation
An example would be a Storage backend. The “base” template is
Storage backends should always be imported as `from dewar.storage.<backend> import Storage` so they can be used consistently. The S3 implementation is then imported as `from dewar.storage.s3 import Storage`.
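Because every backend module exposes a class named `Storage`, callers can swap backends by changing only the import path. The sketch below demonstrates the pattern with a fake in-memory module standing in for e.g. `dewar.storage.s3` (the `load_storage` helper and the module registration are assumptions, not the project's actual wiring):

```python
import importlib
import sys
import types

# Register a fake backend module so the example is self-contained; a real
# backend would simply live at dewar/storage/<backend>.py.
fake = types.ModuleType("dewar_storage_memory")

class Storage:
    def __init__(self):
        self.files = {}

fake.Storage = Storage
sys.modules["dewar_storage_memory"] = fake

def load_storage(module_path: str):
    # equivalent to: from <module_path> import Storage
    return importlib.import_module(module_path).Storage

Backend = load_storage("dewar_storage_memory")
store = Backend()
```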
Methods that storage backends should support (inspired by HTTP verbs)
- get (contents and metadata)
- put (contents and metadata)
- update (update metadata)
- head (check a file exists and return metadata, or False if it doesn’t exist)
- search (by metadata, or maybe file contents?)
- dir (list contents of a Bucket)
- put_hash (with metadata)
- get (generic)
- put (generic)
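The verb list above suggests an abstract base class that every backend fills in. A sketch covering a subset of the verbs (signatures are assumptions; `put_hash` and the generic get/put are omitted for brevity, and the real “base” template may differ):

```python
import abc
import hashlib

# Hypothetical storage-backend interface implied by the verb list.
class BaseStorage(abc.ABC):
    @abc.abstractmethod
    def get(self, sha256: str):
        """Return (contents, metadata) for a stored file."""

    @abc.abstractmethod
    def put(self, contents: bytes, metadata: dict) -> str:
        """Store contents and metadata, returning the sha256 reference."""

    @abc.abstractmethod
    def update(self, sha256: str, metadata: dict):
        """Update the metadata of an existing file."""

    @abc.abstractmethod
    def head(self, sha256: str):
        """Return metadata if the file exists, else False."""

    @abc.abstractmethod
    def search(self, **metadata):
        """Find file hashes matching the given metadata fields."""

    @abc.abstractmethod
    def dir(self, bucket: str):
        """List the contents of a Bucket."""

# Toy in-memory implementation, just to show the interface in use.
class MemoryStorage(BaseStorage):
    def __init__(self):
        self._data = {}

    def get(self, sha256):
        return self._data[sha256]

    def put(self, contents, metadata):
        digest = hashlib.sha256(contents).hexdigest()
        self._data[digest] = (contents, metadata)
        return digest

    def update(self, sha256, metadata):
        contents, _ = self._data[sha256]
        self._data[sha256] = (contents, metadata)

    def head(self, sha256):
        return self._data[sha256][1] if sha256 in self._data else False

    def search(self, **metadata):
        return [k for k, (_, m) in self._data.items()
                if all(m.get(f) == v for f, v in metadata.items())]

    def dir(self, bucket):
        return list(self._data)

store = MemoryStorage()
h = store.put(b"abc", {"bucket": "storage"})
```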