Introducing the GneissWeb dataset
The amount and quality of data that a model is trained on play a vital role in determining the performance of a large language model (LLM). High-quality data, in particular, can significantly boost an LLM's ability to generalize to a wide range of downstream tasks. To better serve the needs of IBM's burgeoning family of Granite models, this team focused on producing a 10 trillion-token dataset, named GneissWeb, that is of higher quality than other open datasets of comparable size. Gneiss (pronounced "nice") is a durable metamorphic rock, a fitting namesake for the IBM open-source Granite models trained on it.
In this post, we introduce the GneissWeb dataset, along with the recipe we used to produce it. The GneissWeb recipe consists of sharded exact substring deduplication and a judiciously constructed ensemble of quality filters. Below, we present the key evaluations that guided our design choices and provide filtering thresholds that can be used to filter the dataset to match the token and quality needs of Stage-1 (early pre-training) or Stage-2 (annealing) datasets.
Our evaluations demonstrate that GneissWeb outperforms state-of-the-art large open datasets (those with more than 5T tokens). Specifically, ablation models trained on GneissWeb outperform those trained on FineWeb V1.1.0 by 2.73 percentage points in terms of the average score computed on a set of 11 benchmarks (both zero-shot and few-shot) commonly used to evaluate pre-training datasets. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), ablation models trained on GneissWeb outperform those trained on FineWeb V1.1.0 by 1.75 percentage points. In the future, we plan to release a detailed technical paper with fine-grained details, along with the IBM Data Prep Kit recipe used to create the GneissWeb dataset.
Hugging Face introduced FineWeb V1.1.0, a large-scale dataset for LLM pre-training, consisting of 15 trillion tokens (which take up 44TB of disk space). FineWeb is derived from 96 Common Crawl snapshots, focusing on English text by applying a series of processing steps, including language classification, deduplication, and heuristic rule-based quality filters. Models trained on FineWeb have been shown to outperform those trained on other publicly available datasets, such as C4, RefinedWeb, Dolma, RedPajama v2, SlimPajama, and The Pile. While we focused on FineWeb V1.1.0 to prepare GneissWeb, our recipe can also be applied to FineWeb V1.2, which was recently released.
Subsequently, Hugging Face released two smaller but higher quality versions called FineWeb.Edu (also referred to as FineWeb-Edu-Small) and FineWeb.Edu.Score2 (also referred to as FineWeb-Edu-Large), derived from FineWeb. These datasets consist of 1.3 trillion and 5.4 trillion tokens respectively. The smaller high-quality versions of FineWeb are created by retaining documents perceived to have higher educational value from the original FineWeb dataset.
We started with the goal of distilling roughly 10 trillion high-quality tokens from FineWeb V1.1.0, so that we would obtain a sufficiently large number of quality tokens suitable for Stage-1 pre-training. Unlike the FineWeb.Edu families, which rely on a single quality annotator and perform aggressive filtering, we developed a multi-faceted ensemble of quality annotators to enable fine-grained quality filtering. This allowed us to achieve a finer trade-off between the quality and quantity of the tokens retained. While the GneissWeb recipe focused on obtaining more than 10 trillion high-quality tokens suitable for Stage-1 pre-training, it is also possible to adapt the recipe by tuning filtering parameters to produce smaller, higher-quality datasets suitable for Stage-2 (annealing) training.
The GneissWeb dataset was obtained by applying the following processing steps to FineWeb:
- Exact substring deduplication at line level
- Custom-built fastText quality filter
- Custom-built fastText category classifier
- Custom-built category-aware readability score quality filter
- Custom-built category-aware extreme-tokenized quality filter
These were applied in the order shown in Fig. 1.
The net impact was that the dataset size of 15 trillion tokens was filtered down to approximately 10 trillion tokens. In subsequent sections, we'll describe the overall performance obtained using GneissWeb compared to other baselines. We'll then dive into each of these processing steps in detail and the impact each has individually, as measured through a series of ablations.
To compare GneissWeb against the baselines, we trained decoder models with 1.4B, 3B, and 7B parameters based on the Llama architecture. These were trained on 35B tokens (roughly optimal according to the Chinchilla scaling law) to obtain signals and select hyperparameters for each processing step. We further trained ablation models on 100B tokens (roughly three times the Chinchilla-optimal amount) as well as 350B tokens to validate the performance of each processing step. The data was tokenized using the StarCoder tokenizer, and training was done with a sequence length of 8,192.
The baselines from which equivalent data was subsampled and used for this comparison included:
Dataset | Number of Tokens |
---|---|
FineWeb V1.1.0 | 15T |
FineWeb-Edu-Score-2 | 5.4T |
DCLM-Baseline | 3.8T |
Dolma | 3T |
FineWeb-Edu | 1.3T |
RefinedWeb | 600B |
Fig. 2 shows how the subsamples were created for the FineWeb baselines as well as for GneissWeb. A similar strategy was used to create the subsamples for the other baselines.
We trained and evaluated our models on an LSF (Load Sharing Facility) cluster, with each node equipped with eight H100 GPUs. For training tasks involving 35 billion tokens, we typically trained models with 1.4 billion trainable parameters across 64 GPUs. For more compute-intensive tasks, we scaled up to 128 or 256 GPUs to reduce training time. For evaluation tasks, we generally used 8 GPUs.
The tokens for an experimental dataset are read from IBM’s GPFS (General Parallel File System) to minimize network traffic during training. With this computational infrastructure, the training speed of an FSDP model with 1.4 billion parameters is approximately 32,000 tokens/GPU/sec. Consequently, training the model with 35 billion tokens on 64 GPUs typically takes about 4.6 hours. Model checkpoints are saved regularly and evaluated in real time, with results automatically uploaded, stored and visualized.
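As a rough sanity check, the quoted training time follows directly from the token budget, GPU count, and per-GPU throughput:

```python
# Back-of-the-envelope check of the training time quoted above.
tokens = 35e9                      # 35 billion training tokens
gpus = 64                          # H100 GPUs used for the 1.4B-parameter runs
tokens_per_gpu_per_sec = 32_000    # observed FSDP throughput

hours = tokens / (gpus * tokens_per_gpu_per_sec) / 3600
print(f"{hours:.1f} hours")        # ~4.7 hours, in line with the ~4.6 hours reported
```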
We evaluated our ablation models using lm-evaluation-harness on two categories of tasks: 11 High-Signal tasks (0-shot and few-shot) and 20 Extended tasks (0-shot and few-shot).
Since ablations are performed by training 'small' models (1.4B parameters) on a few billion tokens (typically 35B), it is important to identify benchmarks that provide a good signal at this relatively small scale. Similar to FineWeb, we used the following criteria for selecting the 11 high-signal/early-signal tasks: accuracy above random guessing, accuracy increasing monotonically over training, and small variance across runs. These tasks are shown in Fig. 3 and cover the commonsense reasoning, reading comprehension, world knowledge, and language understanding task categories. We used both the zero-shot and few-shot variations of these tasks.
The high-signal tasks were used to analyze individual ingredients and possible recipe combinations via ablations. After we narrowed a few candidate recipes using these signals, we used the extended set of benchmarks to evaluate the model’s ability to generalize.
The extended tasks shown in Fig. 4 are a superset of the high-signal tasks. Besides the task categories of commonsense reasoning, reading comprehension, world knowledge, and language understanding, the extended set also includes symbolic problem solving. Here, too, we use both zero-shot and few-shot variations.
The extended task set includes some tasks that are not in the high-signal set. These tasks are useful, but at ablation scale they may have high standard deviation (like PubMedQA), remain at random-guessing accuracy throughout the training run (like MMLU), or stay above random guessing without improving over training (like GSM8k). However, these tasks are useful indicators of larger-model performance and have thus been retained in the extended task set.
These differences are visible in Fig. 5, which compares the high-signal tasks with the tasks that are in the extended set but excluded from the high-signal set. The average accuracy increases over training for the former and remains relatively static for the latter; this was one criterion for excluding those tasks from the high-signal set.
The high-signal tasks also show a lower coefficient of variation than the excluded tasks, as shown in Figure 6. The coefficient of variation is calculated as the ratio of the standard deviation of the average score to its mean, where the statistics are computed across three random training seeds. A lower coefficient of variation indicates more stable results, due to lower variance across random seeds, which makes the high-signal tasks more reliable at the ablation scale.
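As a concrete illustration, the coefficient of variation for a benchmark can be computed from the average scores of the three seed runs (a minimal sketch; the scores below are placeholders):

```python
import statistics

def coefficient_of_variation(seed_scores):
    """Standard deviation of the per-seed average scores divided by their mean."""
    return statistics.stdev(seed_scores) / statistics.mean(seed_scores)

# Hypothetical average scores of one benchmark across three random training seeds.
print(coefficient_of_variation([56.1, 56.4, 56.0]))
```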
At the 1.4 billion parameter model size trained on 350 billion tokens:
Large Datasets (5T+ Tokens), suitable for Stage-1 Pre-Training
Dataset | Tokens | High-Signal Eval Score | Extended Eval Score |
---|---|---|---|
FineWeb.V1.1.0 | 15T | 56.26 ± 0.14 | 47.33 ± 0.3 |
GneissWeb | 10T | 58.40 ± 0.19 (+2.14) | 48.82 ± 0.27 (+1.49) |
FineWeb-Edu-Score-2 | 5.4T | 57.36 ± 0.42 | 48.16 ± 0.29 |
Small Datasets (<5T Tokens), which can be used for Stage-2 Pre-Training
Dataset | Tokens | High-Signal Eval Score | Extended Eval Score |
---|---|---|---|
DCLM-Baseline | 3.8T | 61.36 ± 0.11 | 51.09 ± 0.42 |
Dolma | 3T | 54.18 ± 0.65 | 47.39 ± 0.75 |
FineWeb-Edu | 1.3T | 58.44 ± 0.14 | 48.91 ± 0.13 |
RefinedWeb | 0.6T | 57.77 ± 0.10 | 48.11 ± 0.3 |
Figure 7: Average scores of 1.4 billion parameter models trained on 350 billion tokens randomly sampled from state-of-the-art open datasets. Scores are averaged over three random seeds used for data sampling and are reported along with standard deviations. GneissWeb performs the best among the class of large datasets.
The datasets evaluated are broken down into those above 5 trillion tokens in size and those below. The former are useful for Stage-1 training and are the primary focus of this study. The latter are useful for Stage-2 training; with certain tuning of the filtering parameters, a version of GneissWeb could be produced for this regime as well.
For the datasets with more than 5 trillion tokens, Fig. 8 shows the performance broken down into the various categories of tasks: commonsense reasoning, language understanding, reading comprehension, world knowledge, and symbolic problem solving. As shown, GneissWeb is not only the best overall but also leads in every category except world knowledge.
Dataset | Tokens | Commonsense Reasoning | Language Understanding | Reading Comprehension | World Knowledge | Symbolic Problem Solving | Average |
---|---|---|---|---|---|---|---|
FineWeb.V1.1.0 | 15T | 45.23 | 47.58 | 62.67 | 39.01 | 26.16 | 47.17 |
GneissWeb | 10T | 45.53 | 48.77 | 65.21 | 41.09 | 27.92 | 48.82 |
FineWeb-Edu-score-2 | 5.4T | 45.32 | 47.2 | 63.29 | 42.24 | 27.25 | 48.16 |
Figure 8: Comparison of average evaluation scores grouped by categories for 1.4 billion models trained on 350 billion tokens.
In Fig. 9, we show the progression of accuracy over training on the high-signal tasks for the 1.4 billion parameter model trained on 350 billion tokens. For all three datasets compared, accuracy increases over time, and the accuracy of GneissWeb is consistently higher than that of FineWeb and FineWeb-Edu-score-2.
At the 3 and 7 billion parameter model sizes with 350 billion tokens:
Given that training models of 3 and 7 billion parameters requires considerably more compute, as does evaluating them, we restricted the comparison to the large datasets (FineWeb and FineWeb-Edu-Score-2). We see that the 7 billion parameter models do better than the 3 billion parameter models, and that the models trained on GneissWeb outperform those trained on FineWeb.V1.1.0 and FineWeb-Edu-score-2.
At the 3 billion parameter model size, models trained on GneissWeb outperform those trained on FineWeb.V1.1.0 by 2.52 percentage points in terms of the average score computed on the set of 11 high-signal benchmarks (both zero-shot and few-shot), and by 1.95 percentage points on the extended benchmarks (both zero-shot and few-shot).
Dataset | High-Signal Eval Score | Extended Eval Score |
---|---|---|
FineWeb.V1.1.0 | 60.31 ± 0.21 | 50.15 ± 0.07 |
GneissWeb | 62.83 ± 0.24 (+2.52) | 52.1 ± 0.22 (+1.95) |
FineWeb-Edu-score-2 | 61.63 ± 0.04 | 51.13 ± 0.17 |
Figure 10: Comparison of average evaluation scores on high-signal and extended eval tasks at 3B model size. Scores are averaged over three random seeds used for data sampling and are reported along with standard deviations.
This gain increases further at the 7B model size: models trained on GneissWeb outperform those trained on FineWeb.V1.1.0 by 2.73 percentage points in terms of the average score computed on the set of 11 high-signal benchmarks (both zero-shot and few-shot), and by 1.75 percentage points on the extended benchmarks (both zero-shot and few-shot).
Dataset | High-Signal Eval Score | Extended Eval Score |
---|---|---|
FineWeb.V1.1.0 | 64.61 ± 0.23 | 53.39 ± 0.25 |
GneissWeb | 67.34 ± 0.26 (+2.73) | 55.14 ± 0.28 (+1.75) |
FineWeb-Edu-score-2 | 65.51 ± 0.34 | 54.61 ± 0.31 |
Figure 12: Comparison of average evaluation scores on high-signal and extended eval tasks at a 7B model size. Scores are averaged over three random seeds used for data sampling and are reported along with standard deviations.
In this section, we describe the key ingredients of the GneissWeb recipe that provide significant gains, explaining each component (or processing step) along with the evaluation results of its individual ablation experiments.
Removing duplicates from training data has been shown to reduce memorization and improve model performance (Lee et al., 2022). FineWeb applied per-snapshot fuzzy deduplication and removed near-duplicate documents using the MinHash algorithm. Furthermore, FineWeb applied a repetition filter, an intra-document deduplication step that removes documents with many repeated lines and paragraphs. However, duplicates still remain at the sequence level within and across documents. Such repeated substrings bypass FineWeb's document-level deduplication steps for several reasons: they may not represent a significant enough portion of a document, or a single document may include repeated sections from various documents.
We apply exact substring deduplication to remove any substring of a predetermined length that repeats verbatim more than once, adapting the suffix-array-based implementation from Lee et al. (2022). Exact substring deduplication can be tuned through two hyperparameters: the length-threshold (the minimum length of repeated text sequences) and the frequency-threshold. We use a length-threshold of 50, consistent with the implementation in Lee et al. (2022).
We make several modifications to the exact substring deduplication implementation from Lee et al. (2022) to run it at scale, and we adapt it to remove exact substring duplicates in a sharded manner. In particular, we shard each snapshot of FineWeb-V1.1.0 into sets of roughly equal size and apply exact substring deduplication on each shard independently. Also, rather than removing all copies of a duplicate substring, we retain the first occurrence of each duplicate substring and remove only subsequent matches exceeding 50 consecutive tokens.
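The sketch below illustrates the sharded, keep-first-occurrence behavior using a simplified hash-set over 50-token windows; the actual pipeline uses the suffix-array-based implementation adapted from Lee et al. (2022), so this is an approximation for illustration only.

```python
LENGTH_THRESHOLD = 50  # minimum length (in tokens) of a repeated sequence

def dedup_shard(shard_docs):
    """Simplified exact-substring dedup over one shard of tokenized documents.

    The first occurrence of any 50-token window is kept; subsequent verbatim
    repeats of that window (within or across documents) are removed. Each shard
    is processed independently, so `seen` is local to the shard.
    """
    seen = set()
    deduped = []
    for tokens in shard_docs:
        keep = [True] * len(tokens)
        for start in range(max(len(tokens) - LENGTH_THRESHOLD + 1, 0)):
            window = tuple(tokens[start:start + LENGTH_THRESHOLD])
            if window in seen:
                for i in range(start, start + LENGTH_THRESHOLD):
                    keep[i] = False          # drop the repeated span
            else:
                seen.add(window)             # first occurrence is retained
        deduped.append([tok for tok, k in zip(tokens, keep) if k])
    return deduped
```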
In Fig. 14, we show the progression of accuracy over training on the high-signal tasks for the 1.4 billion parameter model trained on 350 billion tokens. For both datasets compared, accuracy increases over time, and the accuracy obtained with exact substring deduplication is consistently higher, ending at 57.39 versus 55.99 for the baseline.
The fastText family of binary classifiers has been shown to perform well in identifying high-quality pre-training documents. Specifically, DCLM trained a fastText classifier on a mix of instruction-formatted data (OpenHermes-2.5) and high-scoring posts from ELI5, and demonstrated its effectiveness for quality filtering, surpassing compute-heavy methods such as AskLLM (prompting an LLM to ask whether a document is helpful). After annotating a subset of documents using the DCLM fastText classifier, we observed that it favors well-structured, well-formatted documents (such as those with bullet points), but tends to miss high-quality informational documents without substantial formatting.
In addition to the DCLM fastText classifier, we trained a custom fastText classifier on a mix of high-quality synthetic data and data annotated by an LLM as having high educational value. Specifically, we used 400,000 documents, equally split between positive (i.e., high-quality) and negative (i.e., low-quality) classes. We obtained the 200,000 positive documents as follows:
- 190,000 synthetic documents randomly sampled from the Cosmopedia dataset — an open synthetic dataset consisting of textbooks, blogposts, stories, posts and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.
- 10,000 documents with high educational value, selected as follows: we annotated 600,000 random documents from FineWeb.V1.1.0, using the Mixtral-8x22B-Instruct model to score each document between 1 and 5 for its educational quality (with 5 being the highest), using a prompt similar to the one used by FineWeb-Edu. We then selected 10,000 random documents with scores >= 4.
As the negative documents, we selected 200,000 of the 600,000 Mixtral-annotated documents with scores <= 2.
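A minimal sketch of how such a classifier can be trained and applied with the supervised fastText package, assuming the 400,000 labeled documents have been written to `train.txt` in fastText's `__label__` format (the file name, label names, and hyperparameters are illustrative, not the exact settings we used):

```python
import fasttext

# train.txt: one document per line, prefixed with its label, e.g.
#   __label__hq <text of a Cosmopedia or high-educational-value document>
#   __label__lq <text of a low-scoring Mixtral-annotated document>
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.1, epoch=5, wordNgrams=2,   # illustrative hyperparameters
)
model.save_model("cosmopedia_edu_fasttext.bin")

# At filtering time, each document receives a predicted label and a confidence.
labels, probs = model.predict("example document text without newlines ...")
is_high_quality = labels[0] == "__label__hq"
```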
We performed an ablation in which we combined the DCLM fastText filter and our Cosmopedia-Edu fastText filter using an OR rule; that is, we retained documents that at least one of the two filters marks as high-quality. Using the OR rule allowed us to achieve performance similar to the AND rule (wherein documents are retained only if both classifiers mark them as high-quality) and better performance than the individual fastText classifiers, while retaining a substantially larger number of tokens.
In Figure 15, we show the plot of the average eval score on high-signal tasks versus the number of training tokens for a 1.4 billion parameter model. We observe that filtering with the combination of fastText classifiers outperforms the FineWeb.V1.1.0 baseline throughout the training.
Readability scores are formulas based on text statistics (such as sentence length, average number of words, and number of syllables) designed to assess how easily a text can be read and understood. We apply readability scores as a novel quality metric to identify and filter hard-to-read, low-quality documents.
A large number of readability score formulas have been developed to assess text difficulty. We experimented with several of them and selected the McAlpine-EFLAW readability score. The McAlpine-EFLAW score of a document is computed as the number of words plus the number of mini-words (words of <= 3 characters), divided by the number of sentences. A lower score means the document is easier to understand for a reader with English as a foreign language. Through ablation experiments comparing readability scores, we determined that McAlpine-EFLAW yields the best results.
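In code, the score described above reduces to a few simple counts (a minimal sketch; whitespace word splitting and punctuation-based sentence splitting are simplifications of the full formula's tokenization rules):

```python
import re

def mcalpine_eflaw(text: str) -> float:
    """McAlpine EFLAW readability: (words + mini-words) / sentences,
    where mini-words are words of three characters or fewer."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    mini_words = [w for w in words if len(w.strip(".,;:!?\"'")) <= 3]
    return (len(words) + len(mini_words)) / max(len(sentences), 1)
```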
We analyzed the readability score distributions of documents grouped by category. Specifically, we considered the documents from the following three snapshots of FineWeb-V1.1.0: CC-MAIN-2024-10, CC-MAIN-2023-40, and CC-MAIN-2023-14, and computed the top-level category of each document using WatsonNLP hierarchical text categorization, which is based on the Interactive Advertising Bureau (IAB) Tech Lab categories taxonomy. We observe that the readability score distributions in certain categories, such as science, education, technology, and medical health, differ from the overall distribution across all categories. This variation can be attributed to the observation that many documents in these categories demand a higher level of education to understand and thus have high readability scores, leading to a higher average readability score in these categories.
Based on this observation, there is a risk of losing high-quality documents if a single threshold, selected from the overall data distribution, is applied to all documents. Guided by the readability score distributions in different categories, we leverage the category information of documents and develop a category-aware readability score quality filter as part of our ensemble quality filter. In general, we use a more lenient threshold for these specific categories, to avoid filtering out documents with potential educational value solely because of their high readability scores; this results in better performance than filtering without category information.
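A sketch of the category-aware decision, assuming a per-document category annotation is already available (the threshold values are placeholders, not the tuned values used for GneissWeb):

```python
KEY_CATEGORIES = {"science", "education", "technology & computing", "medical health"}

# Placeholder cutoffs: key categories get a more lenient (higher) maximum score,
# since harder-to-read text in these categories is often still educational.
DEFAULT_MAX_EFLAW = 30.0
LENIENT_MAX_EFLAW = 40.0

def passes_readability_filter(eflaw_score: float, category: str) -> bool:
    limit = LENIENT_MAX_EFLAW if category in KEY_CATEGORIES else DEFAULT_MAX_EFLAW
    return eflaw_score <= limit
```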
In Figure 16, we show the progression of accuracy over training on the high-signal tasks for the 1.4 billion parameter model trained on 35 billion tokens. For both datasets compared, accuracy increases over time, and the accuracy obtained with the readability score quality filter is consistently higher, ending at 53.20 versus 51.94 for the baseline.
After manually inspecting the fastText quality annotations and readability scores of a large number of low-quality documents, we found that several abnormal documents were mislabeled by these annotators. We observed a peculiar pattern after tokenizing these documents: while most of them had similar lengths, they produced significantly different token counts. To quantify this effect, we propose novel annotations that effectively leverage information from the "pre-tokenization" stage (document character length, document size) and the "post-tokenization" stage (token counts) to identify potential low-quality documents. We refer to documents with an extremely high or low number of tokens per character (or tokens per byte) as extreme-tokenized documents. See Figure 17 for a schematic.
We analyzed the distributions of TokensPerChar and TokensPerByte for documents grouped by category. Specifically, we considered the documents from the following 3 snapshots from FineWeb-V1.1.0: CC-MAIN-2024-10, CC-MAIN-2023-40 and CC-MAIN-2023-14, and computed the top-level category for each document using the WatsonNLP hierarchical text categorization. The WatsonNLP categorization is based on the Interactive Advertising Bureau (IAB) Tech Lab categories taxonomy. We observe that the distributions are generally bell-shaped for each category, but the values of the mean and variance differ by category. Furthermore, we observe that low-quality documents typically fall into the two extremes of the distribution. Therefore, we characterize extreme-tokenized documents of a given category as those falling into the two extremes of the TokensPerChar (or TokensPerByte) distribution for the category.
Guided by the distributions of TokensPerChar and TokensPerByte in different categories, we leverage the category information of documents and develop a category-aware extreme-tokenized quality filter as part of our ensemble quality filter. At a high level, we use stricter thresholds on TokensPerChar/TokensPerByte for documents outside the key categories and use more lenient thresholds for documents in these key categories.
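The annotations themselves are simple ratios, and the filter keeps only documents whose ratio falls inside category-specific bounds (a sketch with placeholder cutoffs; `tokenizer` stands in for the StarCoder tokenizer used in our ablations):

```python
def tokens_per_char(text: str, tokenizer) -> float:
    """Post-tokenization token count divided by pre-tokenization character count."""
    return len(tokenizer.encode(text)) / max(len(text), 1)

def passes_extreme_tokenized_filter(ratio: float, category: str, bounds: dict) -> bool:
    """Keep documents whose TokensPerChar falls between the category's lower and
    upper cutoffs; key categories get wider (more lenient) bounds."""
    low, high = bounds.get(category, bounds["default"])
    return low <= ratio <= high

# Placeholder cutoffs for illustration only.
bounds = {"default": (0.18, 0.35), "science": (0.15, 0.40)}
```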
In Figure 18, we show the progression of accuracy over training on the high-signal tasks for the 1.4 billion parameter model trained on 35 billion tokens. For both datasets compared, accuracy increases over time, and the accuracy obtained with the extreme-tokenized quality filter ends at 52.78, higher than the baseline at 51.94.
As mentioned above, the quality score distributions in certain categories that tend to contain documents requiring a higher education level to understand differ from the overall distribution across all categories in our dataset. In particular, we observe that the following IAB categories supported by WatsonNLP categorization have significantly different distributions than the overall distribution: science, education, technology & computing, and medical health. Thus, for each of these key categories, we annotate whether each document falls into the category.
To perform category classification on the 96 snapshots of FineWeb-V1.1.0 at scale, we train four binary fastText category classifiers, one for each of the four key categories. Specifically, we generated labeled data using WatsonNLP hierarchical categorization and used the supervised fastText package to train each classifier on the following documents:
- Positive documents: 400,000 documents randomly sampled from the documents labeled with that specific category with a confidence score of 0.95 or above.
- Negative documents: 400,000 documents randomly sampled from the documents labeled with any category other than these four, with a confidence score of 0.95 or above.
Each classifier takes a document as input and produces a label indicating whether the document belongs to the category, along with a confidence score in [0, 1]. We used our trained document category classifiers to annotate all the snapshots of FineWeb-V1.1.0. We leverage these category annotations in our category-aware readability score and extreme-tokenized quality filters, which results in better performance than filtering without category information.
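A sketch of how the four trained classifiers can be used to annotate a document, assuming each model is saved as `<category>.bin` (the file names, label string, and confidence cutoff are illustrative):

```python
import fasttext

CATEGORIES = ["science", "education", "technology_and_computing", "medical_health"]
models = {c: fasttext.load_model(f"{c}.bin") for c in CATEGORIES}

def annotate_categories(text: str, min_confidence: float = 0.5) -> dict:
    """For each key category, record whether its binary classifier assigns the
    positive label with sufficient confidence."""
    flat_text = text.replace("\n", " ")   # fastText predicts on single lines
    annotations = {}
    for category, model in models.items():
        labels, probs = model.predict(flat_text)
        annotations[category] = (labels[0] == "__label__positive"
                                 and probs[0] >= min_confidence)
    return annotations
```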
There are various ways to combine the key ingredients and build a recipe, including deciding which components to include and in what order, as well as designing ensemble filtering rules using multiple quality annotators. We performed rigorous ablations by combining the key ingredients in multiple variations and sequences, with the aim of maximizing downstream task performance under the constraint of retaining at least 10 trillion tokens from FineWeb.V1.1.0.
The GneissWeb recipe illustrated in Figure 1 produces the highest performance gain. It consists of first applying exact substring deduplication, then computing category and quality annotations, and finally applying the ensemble quality filter, as shown in Figure 1. We obtain the GneissWeb dataset of 10 trillion tokens by applying this recipe to the 15 trillion tokens in the 96 snapshots of FineWeb-V1.1.0. We prepared GneissWeb using a version of the IBM Data Prep Kit, which will be open-sourced in the future.
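At a high level, the recipe of Figure 1 can be summarized as the following pipeline sketch, where the three callables are hypothetical stand-ins for the steps described in this post:

```python
def gneissweb_recipe(snapshot_shards, exact_substring_dedup, annotate, ensemble_filter):
    """Per-shard pipeline: sharded dedup -> quality/category annotation -> ensemble filter."""
    for shard in snapshot_shards:
        docs = exact_substring_dedup(shard)            # sharded exact substring dedup
        annotated = [annotate(doc) for doc in docs]    # quality + category annotations
        yield [doc for doc in annotated if ensemble_filter(doc)]
```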
Equipped with fastText classifiers, category-aware readability score filter, and category-aware extreme-tokenized documents filter, we perform ablations over various ensemble filtering rules. We first select the thresholds for category-aware readability score filter and category-aware extreme-tokenized filter as discussed in the above sections. Then, we tune the thresholds for fastText classifiers for a given ensemble filtering rule such that at least 10 trillion tokens are retained from the 15 trillion tokens of FineWeb-V1.1.0. Specifically, we consider the following two ensemble aggregation rules:
Using the notation:
A: Custom-built fastText quality filter
B: Custom-built category-aware readability score quality filter, leveraging the custom-built fastText category classifier
C: Custom-built category-aware extreme-tokenized quality filter, leveraging the custom-built fastText category classifier
Exact substring deduplication → ((A AND B) OR (A AND C))
GneissWeb ensemble filtering rule: A document is retained if either the fastText combination and the category-aware readability score filter agree, or the fastText combination and the category-aware extreme-tokenized filter agree. Here, the fastText combination is the logical OR of the two fastText classifiers, i.e., at least one of the fastText classifiers marks the document as high-quality. See the detailed rule in Figure 1.
Exact substring deduplication → (A AND B AND C)
Ensemble filtering rule 2: A document is retained only if the fastText combination agrees, the category-aware readability score filter agrees, and the category-aware extreme-tokenized filter agrees. Note that this rule is equivalent to applying the filters sequentially (in arbitrary order).
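With the notation above, the two rules differ only in how the boolean filter outcomes are combined (a minimal sketch; `a`, `b`, and `c` are the per-document decisions of filters A, B, and C):

```python
def gneissweb_rule(a: bool, b: bool, c: bool) -> bool:
    """GneissWeb rule: (A AND B) OR (A AND C), i.e., A AND (B OR C)."""
    return (a and b) or (a and c)

def ensemble_rule_2(a: bool, b: bool, c: bool) -> bool:
    """Rule 2: A AND B AND C, equivalent to applying the filters sequentially."""
    return a and b and c
```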
Figure 20 shows the average eval score on high-signal tasks as well as extended tasks for the filtering rules along with the baseline of FineWeb-V1.1.0. We observe that the GneissWeb filtering ensemble rule outperforms the other rule on both high-signal and extended tasks.
Dataset | High-Signal Eval Score | Extended Eval Score |
---|---|---|
FineWeb.V1.1_7b | 61.05 ± 0.25 | 51.01 ± 0.28 |
Recipe2_7b | 62.65 ± 0.37 | 51.82 ± 0.41 |
GneissWeb_7b | 63.09 ± 0.10 (+2.04) | 52.33 ± 0.24 (+1.32) |
Figure 20: Comparison of ablations at 7B model size for 100 billion tokens.
This blog presents the GneissWeb dataset, produced by IBM Research using an internal version of the IBM Data Prep Kit. GneissWeb is built from 96 Common Crawl snapshots and outperforms other state-of-the-art open datasets of comparable size. We continue to perform further data ablation experiments and plan to open-source the recipe via the IBM Data Prep Kit. We are currently processing the latest seven snapshots, which we aim to include in GneissWeb after conducting further evaluations and verification.