MTRNord 070af63942 Copy templates over		2022-10-11 14:42:23 +02:00
.github/workflows	Make sure that we actually lfs pull	2022-10-10 21:42:53 +02:00
crates/model_server	Log index access	2022-10-10 22:15:53 +02:00
input	Berty the Bert	2022-10-10 19:58:50 +02:00
logs/scalars	Move to python script	2022-09-26 16:08:07 +02:00
models	Move to a transfer training approach to not get issues due to the dataset. (hyperparameter search disabled as training it needs ~200-400GB which I currently dont have free)	2022-10-01 02:48:31 +02:00
supply-chain	Berty the Bert	2022-10-10 19:58:50 +02:00
.editorconfig	Berty the Bert	2022-10-10 19:58:50 +02:00
.gitattributes	Initial experiments	2022-09-25 02:12:42 +02:00
.gitignore	Add new sample of spam and make sure that data send for review is sanitized	2022-09-28 18:29:03 +02:00
Cargo.lock	Berty the Bert	2022-10-10 19:58:50 +02:00
Cargo.toml	Use a rust server to serve the model via a highlevel api	2022-09-27 20:30:38 +02:00
Dockerfile	Copy templates over	2022-10-11 14:42:23 +02:00
LICENSE	Create LICENSE	2022-09-25 03:14:50 +02:00
README.md	Move to a transfer training approach to not get issues due to the dataset. (hyperparameter search disabled as training it needs ~200-400GB which I currently dont have free)	2022-10-01 02:48:31 +02:00
bert.ipynb	Berty the Bert	2022-10-10 19:58:50 +02:00
dataset_analysis.ipynb	Berty the Bert	2022-10-10 19:58:50 +02:00
model_v2.py	Berty the Bert	2022-10-10 19:58:50 +02:00

README.md

Matrix-Spam ML

This project consists of tooling to generate a spam detection model for the Matrix protocol.

It uses TensorFlow to build the model in Python, and provides a Rust server that exposes APIs for interacting with the model and extending it.

The code base is currently moving fast. Expect it to change rapidly.

Usage

Training

To train the model, you need a set of labeled data. This data lives at ./input/MatrixData and is a TSV file. Do not include URLs or newlines in the messages: newlines are stripped anyway, and URLs tend to degrade the model's results.
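As a rough illustration of reading such a TSV file, here is a minimal sketch using Python's standard csv module. The column layout is an assumption (label in the first column, message text in the second), and read_labeled_rows is a hypothetical helper name, not part of this project. Check the actual file before relying on it.

```python
import csv
from typing import Iterator, TextIO, Tuple


def read_labeled_rows(stream: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (label, message) pairs from a tab-separated stream.

    Assumption: column 0 holds the label and column 1 holds the message text.
    The real layout of ./input/MatrixData may differ.
    """
    # QUOTE_NONE treats quote characters in messages as ordinary text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], row[1]
```

To use it on the training data, open the file with open("./input/MatrixData", newline="", encoding="utf-8") and pass the file object to read_labeled_rows.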

To train the model, run python3 model_v2.py. This trains the model and saves it to ./model/. Make sure TensorFlow is installed first.

Notes about the data

Remove all URLs, HTML tags, and newlines, and collapse duplicate whitespace. Any of these can easily reduce accuracy.
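The cleanup described above could be sketched as follows. This is not the project's own preprocessing: the regular expressions and the helper name sanitize_message are assumptions made for illustration, and the URL pattern only catches links starting with http://, https://, or www.

```python
import re

# Assumed patterns for this sketch; the project's own preprocessing may differ.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
WHITESPACE_RE = re.compile(r"\s+")


def sanitize_message(text: str) -> str:
    """Remove HTML tags, URLs, and newlines, then collapse duplicate whitespace."""
    # Tags go first so that URLs inside attributes (href="...") vanish with the tag.
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    # \s+ matches newlines too, so this also flattens the message to a single line.
    return WHITESPACE_RE.sub(" ", text).strip()
```

Tags are removed before URLs on purpose: the URL pattern matches up to the next whitespace, so running it first on markup like an anchor tag would swallow the link text as well.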

Running the server

To run the server, run cargo run --release. This will start the server on port 3000.

If you don't see any log output, try prepending RUST_LOG=info to the command.

API

POST /test

This endpoint takes a JSON body with the following format:

{
    "input_data": "This is a message to be classified"
}

It returns a JSON response with the following format, where score is a float:

{
    "input_data": "This is a message to be classified",
    "score": 1.1349515e-24
}

Unlike with the training data, you do not have to strip URLs from the input. However, stripping HTML tags might yield better results.
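A minimal client sketch for this endpoint, using only the Python standard library. It assumes the server is reachable at http://localhost:3000 and that it expects a Content-Type: application/json header. The function names (build_request, parse_score, classify) are hypothetical and chosen for this example.

```python
import json
import urllib.request

# Assumption: the server runs locally on the port mentioned in this README.
BASE_URL = "http://localhost:3000"


def build_request(message: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /test request with the documented JSON body."""
    body = json.dumps({"input_data": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/test",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed to be required
        method="POST",
    )


def parse_score(raw: bytes) -> float:
    """Extract the float score from the documented JSON response."""
    return float(json.loads(raw)["score"])


def classify(message: str, base_url: str = BASE_URL) -> float:
    """Send a message to the running server and return its score."""
    with urllib.request.urlopen(build_request(message, base_url), timeout=10) as response:
        return parse_score(response.read())
```

With the server running, classify("This is a message to be classified") returns the score as a Python float. Building the request and parsing the response are kept separate so both can be checked without a live server.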

Support

For support you can join #matrix-spam-ml:midnightthoughts.space on Matrix.