Benjamin Feuer

Hello! I am a Ph.D. candidate in the Department of Computer Science and Engineering at NYU. I am a member of the DICE Lab and an active collaborator with the AI startups Oumi.AI, Arthur.AI, and Abacus.AI. Previously, I received a BA in Film Studies from Wesleyan University, an MFA in Screenwriting from Columbia University, and an MS in Computer Science from New York University. My awards include a NeurIPS Spotlight and the Deborah M. Rosenthal Award (Best CS Qualifying Exam).

Research: My research interests are wide-ranging; recent topics include data-centric factors in machine learning systems; robust LLM benchmarking, evaluation, and alignment; and scalable data integration for very large databases.

Education: I am currently working toward my Ph.D. in Computer Science at New York University, advised by Chinmay Hegde. Previously, I studied at Columbia University and Wesleyan University. Other frequent collaborators include Micah Goldblum, Colin White, John P. Dickerson, and Juliana Freire.

News

  • 2025/01/31 New paper (+ code). LiveBench is a leading benchmark for foundation models; unlike traditional benchmarks, its questions and categories evolve over time. Featured in the Gemini, Qwen, and DeepSeek technical reports. ICLR 2025.
  • 2025/01/31 New first-author paper (+ code). Do LLM-judge preferences translate to progress on other, more concrete alignment metrics? No: LLM judges exhibit implicit biases; without being prompted to, they reweight the judgment criteria they are given and introduce standards of their own. This work emphasizes the importance of blending ground-truth and LLM-judge evaluations for foundation models. ICLR 2025.
  • 2025/01/31 New paper. We demonstrate a distortion-free watermarking method for images, based on a combination of a diffusion model's initial noise and generated Fourier patterns. ICLR 2025.
  • 2024/09/28 New first-author paper (+ code) describing BioTrove, the largest publicly accessible dataset designed to advance AI for biodiversity applications. We also release a suite of CLIP models trained on a subset of 40 million captioned images, introduce several new benchmarks for rigorous assessment, and report zero-shot accuracy evaluations. NeurIPS 2024 (Spotlight), USDA Highlighted Project.
  • 2024/09/28 New first-author paper (+ code) describing TuneTables, a novel tabular classification and regression model that is competitive with boosted trees and can scale to problems of any size. NeurIPS 2024.
  • 2024/06/24 New paper (+ code) introducing LiveBench, a benchmark for LLMs designed with test-set contamination and objective evaluation in mind. LiveBench limits potential contamination by releasing new questions monthly and by basing questions on recently released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has a verifiable, objective ground-truth answer, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge (a minimal scoring sketch appears after this list). Featured in VentureBeat.
  • 2024/02/13 New paper (+ code) benchmarking the performance of tabular algorithms on the largest suite of datasets to date. NeurIPS 2023 (Datasets and Benchmarks).
  • 2023/11/07 New first-author paper studying the effects of two important dataset-level constituents: label set design and class balance. NeurIPS 2023 (1st Workshop on Attributing Model Behavior at Scale).
  • 2023/10/28 New first-author paper investigating sketching and feature-selection methods for prior-fitted networks. NeurIPS 2023 (Second Table Representation Learning Workshop).
  • 2023/10/27 New first-author paper (+ code) introducing ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping that enables large language models to solve column type annotation (CTA) problems in a fully zero-shot manner (a pipeline sketch appears after this list). VLDB 2024.
  • 2023/08/01 New first-author paper introducing JANuS (Joint Annotations and Names Set), a collection of four new training datasets with images, labels, and corresponding captions, and conducting controlled investigations of factors contributing to robustness in image classification. TMLR 2023.
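
To make the judge-free scoring idea in the LiveBench item concrete, here is a minimal sketch in Python of checking model answers against verifiable ground truth, with no LLM judge in the loop. Everything below, from the normalize helper to the toy questions, is illustrative and is not LiveBench's actual code.

    import re

    def normalize(answer: str) -> str:
        """Lowercase, trim, and collapse whitespace so formatting quirks don't affect scoring."""
        return re.sub(r"\s+", " ", answer.strip().lower())

    def score_exact(model_output: str, ground_truth: str) -> float:
        """Return 1.0 iff the normalized answer matches the objective ground truth."""
        return float(normalize(model_output) == normalize(ground_truth))

    # Hypothetical questions with verifiable answers, in the spirit of the benchmark.
    questions = [
        {"prompt": "What is 17 * 6?", "answer": "102"},
        {"prompt": "Spell 'benchmark' backwards.", "answer": "kramhcneb"},
    ]
    outputs = ["102", "  kramhcneB"]  # hypothetical model completions

    accuracy = sum(score_exact(o, q["answer"]) for o, q in zip(outputs, questions)) / len(questions)
    print(f"accuracy: {accuracy:.2f}")  # 1.00, scored automatically

Because scoring reduces to string comparison against ground truth, it is cheap, reproducible, and immune to the implicit judge biases discussed in the alignment item above.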
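
Similarly, here is a hypothetical sketch of the four-stage zero-shot pipeline named in the ArcheType item: context sampling, prompt serialization, model querying, and label remapping. The stage names follow the paper's outline; every function body, the label set, and the canned model reply are stand-ins, not ArcheType's implementation.

    import random

    LABEL_SET = ["name", "date", "price", "country"]  # toy closed label set

    def sample_context(column_values: list[str], k: int = 5) -> list[str]:
        """Context sampling: draw a small, representative subset of cell values."""
        return random.sample(column_values, min(k, len(column_values)))

    def serialize_prompt(samples: list[str]) -> str:
        """Prompt serialization: pack sampled values and candidate labels into one prompt."""
        return (f"Column values: {', '.join(samples)}\n"
                f"Choose the best type from {LABEL_SET}. Answer with one word.")

    def query_model(prompt: str) -> str:
        """Model querying: stand-in for a call to an LLM API."""
        return " Country."  # canned completion for illustration

    def remap_label(raw: str) -> str:
        """Label remapping: snap a free-form completion back onto the closed label set."""
        cleaned = raw.strip().lower()
        for label in LABEL_SET:
            if label in cleaned:
                return label
        return "unknown"

    column = ["France", "Japan", "Brazil", "Kenya", "Norway", "Chile"]
    prompt = serialize_prompt(sample_context(column))
    print(remap_label(query_model(prompt)))  # -> "country"

The remapping stage is what makes the loop robust to free-form completions: even a noisy reply like " Country." snaps back onto the closed label set.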

Publications

A full list of my publications can be found here.