Lab 4b: Watermarking

About This Lab#

Throughout this lab, you will encounter two types of interactive elements:

![Activity](../mlu_utils/images/activity.png) No coding is needed for an activity. You try to understand a concept, answer questions, or run a code cell.

![Challenge](../mlu_utils/images/challenge.png) Challenges are where you test your understanding by implementing something new or taking a short quiz.

Please work through this notebook from top to bottom to avoid errors due to missing code or context.

Table of Contents#

This notebook demonstrates how to use various techniques that can help improve the safety and security of LLM-backed applications. The coding examples cover watermarking as an authentication technique.

1. Install and import libraries#

Let’s start by installing all required packages as specified in the requirements.txt file and importing several libraries.

%%capture
!pip3 install -r ../requirements.txt --quiet
!rm -rf lm-watermarking
!git clone https://github.com/jwkirchenbauer/lm-watermarking.git --quiet

import warnings, sys, os

warnings.filterwarnings("ignore")
cwd = os.getcwd()
print(f"current working directory >>>>> {cwd} \n")
sys.path.append(cwd + "/lm-watermarking/")

import json
from IPython.display import Markdown

current working directory >>>>> /home/sagemaker-user/mlu-eep-generative-ai/Module 2 - Responsible Generative AI/Labs/Lab-4 

2. Watermarking for authentication#

Potential harms of LLMs can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. Watermarks can be embedded with negligible impact on text quality, and can be detected using efficient open-source algorithms without access to the language model API or model parameters. The watermark works by selecting a randomized set of “green” tokens before each token is generated, and then softly promoting the use of green tokens during sampling. For more details about watermarks for LLMs, see the paper “A Watermark for Large Language Models” by Kirchenbauer et al.
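To make the green-list idea concrete, here is a minimal sketch of a single generation step using only the Python standard library. It is an illustration under stated assumptions, not the implementation in the lm-watermarking repository: the helper names `green_list` and `watermarked_probs`, the toy vocabulary of 8 tokens, and the way the random generator is seeded from the previous token are all hypothetical choices made for this example.

```python
import math
import random

def green_list(prev_token: int, vocab_size: int, gamma: float, hash_key: int = 15485863) -> set:
    """Pseudo-randomly pick a gamma fraction of the vocabulary, seeded by the previous token.

    hash_key is an arbitrary constant chosen for this sketch.
    """
    rng = random.Random(hash_key * (prev_token + 1))
    token_ids = list(range(vocab_size))
    rng.shuffle(token_ids)
    return set(token_ids[: int(gamma * vocab_size)])

def watermarked_probs(logits: list, green: set, delta: float) -> list:
    """Add delta to the logits of green tokens, then apply a softmax."""
    biased = [x + delta if i in green else x for i, x in enumerate(logits)]
    peak = max(biased)
    exps = [math.exp(x - peak) for x in biased]
    total = sum(exps)
    return [e / total for e in exps]

vocab_size = 8
logits = [0.0] * vocab_size  # a flat toy distribution over 8 tokens
green = green_list(prev_token=42, vocab_size=vocab_size, gamma=0.25)
probs = watermarked_probs(logits, green, delta=2.0)

for token_id, prob in enumerate(probs):
    colour = "green" if token_id in green else "red"
    print(f"token {token_id}: {colour:5s} p={prob:.3f}")
```

Because the green list depends only on the previous token and a fixed key, a detector that knows the key can rebuild the same list later without access to the model.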

First, you need to load a tokenizer and a model that give you access to the tokens and their associated logit values. This means you will need to use a Hugging Face 🤗 or other third-party LLM that you can run locally. Bedrock-hosted models can be queried but not downloaded, so they are not suitable for this demo.

The following uses a tiny LLM, dlite-v2-124m, derived from OpenAI’s smallest GPT-2 model and fine-tuned on a single GPU. dlite-v2-124m is not a state-of-the-art model. We use it here to demonstrate watermarking in a lean setup on a CPU instance. If you have access to larger, GPU-enabled instances, feel free to try larger models available on Hugging Face.

import IPython

# you can uncomment and auto-restart if you run into issues 
#IPython.get_ipython().kernel.do_shutdown(restart=True)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch 

model_id = "aisquared/dlite-v2-124m"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    padding_side="left",
)
# reuse the EOS token for padding (the usual convention for GPT-2 style models)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Load tiny model in BF16 precision
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

2026-06-11 22:39:44.057191: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`

With the model and tokenizer loaded, you can now generate watermarked output. WatermarkLogitsProcessor is a LogitsProcessor that you pass to model.generate. At every generation step it receives the logits for the next token and adds a bias to the tokens on the current green list, using the hyperparameter values you specify. The most important parameters to specify are:

gamma: Gamma denotes the fraction of the vocabulary that will be in each green list.
delta: The magnitude of the logit bias delta determines the strength of the watermark.

As a baseline generation setting, default values of gamma=0.25 and delta=2.0 are suggested. Reduce delta if text quality is negatively impacted.
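To get a feel for how gamma and delta interact, the following sketch computes the probability of sampling a green token in the simplest possible case, where all unbiased logits are equal. The function name `green_probability` and the flat-logits assumption are illustrative only; with a real model the effect depends on how peaked the next-token distribution is.

```python
import math

def green_probability(gamma: float, delta: float) -> float:
    """Probability mass on green tokens when all unbiased logits are equal.

    A fraction gamma of the tokens has its weight multiplied by exp(delta);
    the remaining (1 - gamma) keeps weight 1.
    """
    boosted = gamma * math.exp(delta)
    return boosted / (boosted + (1.0 - gamma))

for delta in (0.0, 1.0, 2.0, 4.0):
    print(f"gamma=0.25, delta={delta}: P(green) = {green_probability(0.25, delta):.3f}")
```

Under this assumption the suggested defaults (gamma=0.25, delta=2.0) raise the chance of a green token from 0.25 to roughly 0.71. A larger delta makes the watermark easier to detect but distorts the distribution more, which is why reducing delta is the first thing to try when text quality suffers.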

from extended_watermark_processor import WatermarkLogitsProcessor
from transformers import LogitsProcessorList

# instantiate watermarking processor
watermark_processor = WatermarkLogitsProcessor(
    vocab=list(tokenizer.get_vocab().values()),
    gamma=0.25,
    delta=2.0,
    seeding_scheme="selfhash",
)

# tokenize input
tokenized_input = tokenizer("What did you do today?", return_tensors="pt").to(model.device)

# generate output tokens, passing the logits through the watermark processor
output_tokens = model.generate(
    **tokenized_input,
    pad_token_id=50256,
    logits_processor=LogitsProcessorList([watermark_processor])
)

# isolate newly generated tokens as only those are watermarked, the input/prompt is not
output_tokens = output_tokens[:, tokenized_input["input_ids"].shape[-1] :]

# convert back to text
output_text = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]

Have a look at the resulting text.

Markdown(output_text)

I was born on March 31st, 1891.

I was born on March

Let’s now try to detect the watermarked text.

The WatermarkDetector is the detector for all watermarks imprinted with WatermarkLogitsProcessor. It needs to be given the exact same settings that were given during text generation to replicate the watermark greenlist generation and so detect the watermark. This includes the correct device that was used during text generation, the correct tokenizer, the correct seeding_scheme name, and parameters.
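The need for matching settings can be illustrated with a toy, standard-library simulation. This is a sketch under assumptions, not the library's code: the "model" is a uniform distribution over a hypothetical vocabulary of 1000 token ids, the helpers `green_list`, `generate`, and `z_score` are made up for this example, and a differing `hash_key` stands in for any mismatch in seeding settings between generation and detection.

```python
import math
import random

VOCAB_SIZE = 1000
GAMMA, DELTA = 0.25, 2.0

def green_list(prev_token: int, hash_key: int) -> set:
    """Rebuild the green list for one position from the previous token and a key."""
    rng = random.Random(hash_key * (prev_token + 1))
    token_ids = list(range(VOCAB_SIZE))
    rng.shuffle(token_ids)
    return set(token_ids[: int(GAMMA * VOCAB_SIZE)])

def generate(num_tokens: int, hash_key: int, seed: int = 0) -> list:
    """Sample tokens from a flat toy distribution with green tokens boosted by exp(DELTA)."""
    sampler = random.Random(seed)
    tokens = [0]  # start token
    for _ in range(num_tokens):
        green = green_list(tokens[-1], hash_key)
        weights = [math.exp(DELTA) if t in green else 1.0 for t in range(VOCAB_SIZE)]
        tokens.append(sampler.choices(range(VOCAB_SIZE), weights=weights)[0])
    return tokens

def z_score(tokens: list, hash_key: int) -> float:
    """Count green tokens under the given key and compare with the expected count."""
    hits = sum(tok in green_list(prev, hash_key) for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1.0 - GAMMA))

text = generate(200, hash_key=15485863)
z_match = z_score(text, hash_key=15485863)   # detector uses the generation key
z_mismatch = z_score(text, hash_key=7919)    # detector uses a different key

print(f"matching settings:   z = {z_match:.2f}")
print(f"mismatched settings: z = {z_mismatch:.2f}")
```

With the matching key the detector rebuilds the same green lists and sees far more green tokens than chance would produce. With a different key the rebuilt lists are unrelated to the ones used during generation, so the green fraction falls back to about gamma and the text looks unwatermarked.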

The detector below shows a high confidence that the input text has been watermarked.

from extended_watermark_processor import WatermarkDetector

watermark_detector = WatermarkDetector(
    vocab=list(tokenizer.get_vocab().values()),
    gamma=0.25,  # should match original setting
    seeding_scheme="selfhash",  # should match original setting
    device=model.device,  # must match the original rng device type
    tokenizer=tokenizer,
    z_threshold=4.0,
    normalizers=[],
    ignore_repeated_ngrams=True,
)

score_dict = watermark_detector.detect(
    output_text
)  # or any other text of interest to analyze

score_dict

{'num_tokens_scored': 13,
 'num_green_tokens': 11,
 'green_fraction': 0.8461538461538461,
 'z_score': 4.963972767957701,
 'p_value': np.float64(3.453281582346863e-07),
 'z_score_at_T': tensor([1.7321, 2.4495, 3.0000, 3.4641, 3.8730, 4.2426, 3.7097, 4.0825, 4.4264,
         4.7469, 5.0483, 5.3333, 4.9640, 4.9640, 4.9640, 4.9640, 4.9640]),
 'prediction': True,
 'confidence': np.float64(0.9999996546718418)}
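The z_score and p_value in this output can be checked by hand. The sketch below assumes the detector uses a one-proportion z-test, in which each scored token is green with probability gamma when no watermark is present; the helper name `z_and_p` is hypothetical and the p-value is computed with the standard normal survival function via `math.erfc`.

```python
import math

def z_and_p(num_green: int, num_scored: int, gamma: float):
    """One-sided z-test for 'more green tokens than chance would give'."""
    expected = gamma * num_scored
    std = math.sqrt(num_scored * gamma * (1.0 - gamma))
    z = (num_green - expected) / std
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal
    return z, p

z, p = z_and_p(num_green=11, num_scored=13, gamma=0.25)
print(f"z = {z:.4f}, p = {p:.3e}, confidence = {1.0 - p:.10f}")
```

With 11 green tokens out of 13 scored and gamma=0.25, only 3.25 green tokens would be expected by chance, which yields a z-score of about 4.96. That is above the z_threshold of 4.0 passed to the detector, hence the prediction of True.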

Now compare with the watermark detector applied to regularly generated text, i.e., text generated without the watermark processor. In this case, the detector correctly predicts that the text is not watermarked.

# tokenize input
tokenized_input = tokenizer("What did you do today?", return_tensors="pt").to(model.device)

# generate output tokens without the watermark processor
output_tokens = model.generate(
    **tokenized_input,
    pad_token_id=50256,
)

# isolate newly generated tokens so that the prompt is not scored by the detector
output_tokens = output_tokens[:, tokenized_input["input_ids"].shape[-1] :]

# convert back to text
output_text = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]

score_dict = watermark_detector.detect(output_text)

score_dict

{'num_tokens_scored': 17,
 'num_green_tokens': 6,
 'green_fraction': 0.35294117647058826,
 'z_score': 0.9801960588196068,
 'p_value': np.float64(0.16349467479900753),
 'z_score_at_T': tensor([-0.5774, -0.8165, -1.0000, -1.1547, -1.2910, -1.4142, -1.5275, -0.8165,
         -0.9623, -0.3651,  0.1741,  0.0000, -0.1601,  0.3086,  0.1491,  0.5774,
          0.9802]),
 'prediction': False}

Activity: Try your own watermarking#

    Try your own prompt and generate a watermarked response to it. Also try changing the parameters of WatermarkLogitsProcessor to see how the output changes.
    Note: due to the limited capabilities of the dlite-v2-124m model, not all experiments might work. You can try more capable LLMs if you have access to larger instances to run them.

############## CODE HERE ####################

############## END OF CODE ##################

3. Quizzes#

Well done on completing the lab! Now, it's time for a brief knowledge assessment.

Challenge: Knowledge Assessment#

    Answer the following questions to test your understanding of watermarking LLM outputs.

import sys
sys.path.append('..')

from mlu_utils.quiz_questions import lab4b_question1

lab4b_question1.display()

# run this cell when you finish the notebook to clean up your environment
!rm -rf ./lm-watermarking

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Conclusion#

In this lab, you have:

    Learned about watermarking as an authentication technique for LLM outputs
    Implemented watermarking using the WatermarkLogitsProcessor
    Detected watermarks in generated text using the WatermarkDetector
    Explored how watermarking can help monitor and audit LLM usage

Additional Resources#

    Microsoft PromptBench
    Prompting Guide Techniques
    UpTrain AI