# Processing Text

This section covers processing text strings with PII Eraser's `/text/*` REST API routes.
## Main Features

- **Massive Context Window**: PII Eraser supports up to 1M tokens per API request out of the box. This means you can process entire documents, transcripts, or database exports in a single API call—no chunking, no splitting, no reassembly logic required. The context window can be raised beyond 1M via the `max_tokens` parameter in `config.yaml`.
- **Automatic Language Handling**: PII Eraser does not require a language code or a separate language-detection step; inputs are routed automatically to the correct model. You can send German, French, English, or mixed-language text in the same request and PII Eraser will detect entities correctly across all of them.
- **Batch Processing**: All `/text/*` endpoints accept a list of strings, allowing you to process multiple texts in a single HTTP request. This reduces network overhead for high-volume workloads.
## Counting Tokens

Before processing large documents, you may want to check the token count to estimate processing time or verify that inputs fall within the configured `max_tokens` limit. Use the `/text/count_tokens` endpoint:
```python
import json
import requests

payload = {
    "text": [
        "Herr Dr. Stefan Müller wohnt in der Schillerstraße 42, 80336 München.",
        "Please contact Jane Doe at [email protected] for further details."
    ]
}
r = requests.post("http://localhost:8000/text/count_tokens", json=payload)
print(json.dumps(r.json(), indent=4))
```
Response:

The `total_tokens` value is the sum of tokens across all strings in the request. This is the same token count reported in the `stats` object of detect and transform responses.
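A common pattern is to gate processing on the reported count before sending a large batch. The helper below is a minimal sketch of that check; the `fits_in_context` function name and the 1M default are illustrative, and the `total_tokens` field is assumed to be read from the response as described above:

```python
def fits_in_context(total_tokens: int, max_tokens: int = 1_000_000) -> bool:
    """Decide whether a batch can be sent in one request given the configured limit."""
    return total_tokens <= max_tokens

# In practice, total_tokens would come from the /text/count_tokens response:
#   r = requests.post("http://localhost:8000/text/count_tokens", json=payload)
#   total = r.json()["total_tokens"]
print(fits_in_context(43))         # the two example strings fit comfortably
print(fits_in_context(1_200_000))  # would require raising max_tokens in config.yaml
```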
## Detect

Use the `/text/detect` endpoint when you need to know which sensitive entities are present and where they are, without altering the original text. This is useful for analytics dashboards, compliance audits, or flagging documents for human review. This endpoint can also be used for NER (Named Entity Recognition) use cases.
Request:
```python
import json
import requests

payload = {
    "text": [
        "Stefan Müller lives at Schillerstraße 42, 80336 Munich. "
        "His Steuernummer is 181/815/08155 and he can be reached at +49 89 123456."
    ],
    "entity_types": ["NAME", "ADDRESS", "TAX_ID", "PHONE"]
}
r = requests.post("http://localhost:8000/text/detect", json=payload)
print(json.dumps(r.json(), indent=4))
```
Response:

The response provides a list of lists containing all detected entities, along with their start and end positions in each input string and the detection confidence. Note that low-confidence entities are filtered out and are not returned. Please see the API Reference for further details.
```json
{
    "entities": [
        [
            {
                "entity_type": "NAME",
                "start": 0,
                "end": 13,
                "score": 0.9902657270431519
            },
            {
                "entity_type": "ADDRESS",
                "start": 23,
                "end": 54,
                "score": 0.9985126256942749
            },
            {
                "entity_type": "TAX_ID",
                "start": 76,
                "end": 89,
                "score": 0.5953037142753601
            },
            {
                "entity_type": "PHONE",
                "start": 115,
                "end": 128,
                "score": 0.9985167384147644
            }
        ]
    ],
    "stats": {
        "total_tokens": 43,
        "tps": 2065.14
    }
}
```
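Because detect leaves the input untouched, the `start`/`end` offsets can be sliced directly against the original string. A minimal sketch using the request and response above:

```python
text = (
    "Stefan Müller lives at Schillerstraße 42, 80336 Munich. "
    "His Steuernummer is 181/815/08155 and he can be reached at +49 89 123456."
)
entities = [
    {"entity_type": "NAME", "start": 0, "end": 13},
    {"entity_type": "ADDRESS", "start": 23, "end": 54},
    {"entity_type": "TAX_ID", "start": 76, "end": 89},
    {"entity_type": "PHONE", "start": 115, "end": 128},
]

# Slice each detection out of the unmodified input text.
found = {e["entity_type"]: text[e["start"]:e["end"]] for e in entities}
print(found["NAME"])    # Stefan Müller
print(found["TAX_ID"])  # 181/815/08155
```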
## Transform

Use the `/text/transform` endpoint when you need to modify the text to remove or obscure PII, PCI, or other sensitive entities. This is the primary endpoint for anonymization pipelines, LLM pretraining, and data export workflows.

The `operator` parameter, set either in the API request or in the `config.yaml` file, controls how detected entities are transformed.
| Operator | Description | Example Input | Example Output |
|---|---|---|---|
| `redact` | The default operator. Replaces the entity with a semantic type tag. Recommended for most applications. | `Call Stefan Müller` | `Call <NAME>` |
| `mask` | Replaces characters with a configurable symbol (default `#`). Recommended for ASR transcripts. | `ID: 181/815/08155` | `ID: #############` |
| `hash` | Replaces the entity with a deterministic SHA-256 or SHA-512 hash. Recommended for pseudonymization. | `Stefan Müller` | `a8b92f1c...` |
| `redact_constant` | Replaces all entities with the same static string regardless of type. | `Call Stefan Müller` | `Call <REDACTED>` |
Operators can only be customized via the `config.yaml` file. For example, you can change the masking character from `#` to `*`, or switch to SHA-512 hashing. See the Config File Reference for the full list of operator options.
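As an illustration only, such a customization might look like the fragment below. Apart from `max_tokens`, `operator`, `replace_value`, and `hash_type`, which this section mentions, the key names here (notably `masking_char`) are assumptions, not taken from the Config File Reference; consult that reference for the authoritative schema:

```yaml
# Illustrative sketch only — key names marked "assumed" are hypothetical.
max_tokens: 1000000
operator: mask
masking_char: "*"            # assumed name for the mask symbol option (default "#")
hash_type: sha512            # switch the hash operator from SHA-256 to SHA-512
replace_value: "<REDACTED>"  # used by redact_constant
```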
### Redact

`redact` is the default operator. It replaces each entity with a tag like `<NAME>` or `<EMAIL>`, preserving the semantic meaning of the sentence. This is the most common choice for analytics and anonymization pipelines.

The `redact_constant` operator functions similarly to `redact`, except every entity is replaced by the `replace_value` specified in `config.yaml`.
Request:
```python
import json
import requests

payload = {
    "text": [
        "Alicia lives in Montpellier."
    ],
    "operator": "redact"
}
r = requests.post("http://localhost:8000/text/transform", json=payload)
print(json.dumps(r.json(), indent=4))
```
Response:

The response provides a list of texts with detected entities transformed according to the `operator`. It also includes an `entities` list containing the entity types and start/end positions of the transformed entities in the output text, plus a processing `stats` object. Please see the API Reference for further details.
```json
{
    "text": [
        "<NAME> lives in <LOCATION>."
    ],
    "entities": [
        [
            {
                "entity_type": "NAME",
                "output_start": 0,
                "output_end": 6
            },
            {
                "entity_type": "LOCATION",
                "output_start": 16,
                "output_end": 26
            }
        ]
    ],
    "stats": {
        "total_tokens": 11,
        "tps": 852.34
    }
}
```
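Since `output_start`/`output_end` refer to the transformed string, they can be used to locate each replacement tag in the output. A minimal sketch against the response above:

```python
out = "<NAME> lives in <LOCATION>."
entities = [
    {"entity_type": "NAME", "output_start": 0, "output_end": 6},
    {"entity_type": "LOCATION", "output_start": 16, "output_end": 26},
]

# The offsets map each entity to its replacement tag in the transformed text.
for e in entities:
    print(e["entity_type"], "->", out[e["output_start"]:e["output_end"]])
```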
### Mask

`mask` is ideal for PCI data, ASR transcripts, and call center data, where replacing characters with a symbol produces a more natural, readable output. The mask character and other parameters can be set in `config.yaml`.
Request:
```python
import json
import requests

payload = {
    "text": [
        "Alicia lives in Montpellier."
    ],
    "operator": "mask"
}
r = requests.post("http://localhost:8000/text/transform", json=payload)
print(json.dumps(r.json(), indent=4))
```
Response:
```json
{
    "text": [
        "###### lives in ###########."
    ],
    "entities": [
        [
            {
                "entity_type": "NAME",
                "output_start": 0,
                "output_end": 6
            },
            {
                "entity_type": "LOCATION",
                "output_start": 16,
                "output_end": 27
            }
        ]
    ],
    "stats": {
        "total_tokens": 11,
        "tps": 1015.53
    }
}
```
### Hash

The `hash` operator replaces each entity with its SHA-256 or SHA-512 hash, depending on the `hash_type` set in `config.yaml`.
Request:
```python
import json
import requests

payload = {
    "text": [
        "Alicia lives in Montpellier."
    ],
    "operator": "hash"
}
r = requests.post("http://localhost:8000/text/transform", json=payload)
print(json.dumps(r.json(), indent=4))
```
Response:
```json
{
    "text": [
        "2a4f079d2c3bd979ae519dc09fdbe9b7ef3b913996a0b5d970ab35abe895224f lives in ef348bd17202c36d669d4bce1806ddaa5e33366880fc9865218048c62a2f87c4."
    ],
    "entities": [
        [
            {
                "entity_type": "NAME",
                "output_start": 0,
                "output_end": 64
            },
            {
                "entity_type": "LOCATION",
                "output_start": 74,
                "output_end": 138
            }
        ]
    ],
    "stats": {
        "total_tokens": 11,
        "tps": 926.66
    }
}
```
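The value of deterministic hashing is that the same entity always maps to the same token, so pseudonymized records can still be linked or counted. The sketch below uses Python's `hashlib` purely to illustrate that property; PII Eraser's actual hashing (encoding, salting, and `hash_type`) is configured server-side and its output may not match a plain local hash:

```python
import hashlib

def pseudonym(entity: str) -> str:
    # Plain SHA-256 for illustration only; the service's hashing is set in config.yaml.
    return hashlib.sha256(entity.encode("utf-8")).hexdigest()

# The same entity always yields the same pseudonym...
assert pseudonym("Alicia") == pseudonym("Alicia")
# ...so occurrences can be linked across documents without revealing the name.
print(pseudonym("Alicia")[:16])
```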
## Detect vs. Transform Entities
While both endpoints return an entities array, they serve different purposes. Use the table below to understand how the data structures differ:
| Feature | `/text/detect` | `/text/transform` |
|---|---|---|
| Context | Corresponds to individual detections in the input text. | Corresponds to the resulting entities in the output text. |
| Character Indices | Positions denote where the untransformed entity sits in the original string. | Positions denote where the transformed entity sits in the processed string. |
| Overlap Handling | Allowed. Detections from different models may overlap or nest. | Merged. Overlapping detections are consolidated into single, non-overlapping entities. |
| Confidence Score | Included. The `score` represents the model's confidence in the detection. | N/A. Because an entity may be the result of multiple merged detections, no single score is provided. |
Use `/text/detect` when you need to perform granular analysis or custom logic on raw findings. Use `/text/transform` when you need a "clean" version of the text with PII safely handled and mapped to its new coordinates.