Benjamin Feuer

Hello! I am a Ph.D. candidate in the Department of Computer Science and Engineering at NYU. I am a member of the DICE Lab and an active collaborator with the AI startups Oumi.AI, Arthur.AI, and Abacus.AI. Previously, I received a BA in Film Studies from Wesleyan University, an MFA in Screenwriting from Columbia University, and an MS in Computer Science from New York University. My awards include a NeurIPS Spotlight and the Deborah M. Rosenthal Award (Best CS Qualifying Exam).

Research: My research interests are wide-ranging; recent topics include data-centric factors in machine learning systems; robust LLM benchmarking, evaluation, and alignment; and scalable data integration for very large databases.

Education: I am currently working toward my Ph.D. in Computer Science at New York University, advised by Chinmay Hegde. Previously, I studied at Columbia University and Wesleyan University. Other frequent collaborators include Micah Goldblum, Colin White, John P. Dickerson, and Juliana Freire.

News

  • 2025/01/31 New paper (+ code). LiveBench is a leading benchmark for foundation models; unlike traditional benchmarks, its questions and categories evolve over time. Featured in the Gemini, Qwen, and DeepSeek technical reports. ICLR 2025.
  • 2025/01/31 New first-author paper (+ code). Do LLM-judge preferences translate to progress on other, more concrete alignment metrics? No: LLM judges exhibit implicit biases; without being prompted to, they reweight the judgment criteria they are given and introduce standards of their own. This work emphasizes the importance of blending ground-truth and LLM-judge evaluations for foundation models. ICLR 2025.
  • 2025/01/31 New paper. We demonstrate a distortion-free watermarking method for images, based on a combination of a diffusion model's initial noise and generated Fourier patterns. ICLR 2025.
  • 2024/09/28 New first-author paper (+ code) describing BioTrove, the largest publicly accessible dataset designed to advance AI for biodiversity applications. We also release a suite of CLIP models trained on a subset of 40 million captioned images, introduce several new benchmarks for rigorous assessment, and report zero-shot accuracy evaluations. NeurIPS 2024 (Spotlight), USDA Highlighted Project.
  • 2024/09/28 New first-author paper (+ code) describing TuneTables, a novel tabular classification and regression model that is competitive with boosted trees and can scale to problems of any size. NeurIPS 2024.
  • 2024/06/24 New paper (+ code) introducing LiveBench, a benchmark for LLMs designed with test-set contamination and objective evaluation in mind. LiveBench limits potential contamination by releasing new questions monthly and by basing questions on recently released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has a verifiable, objective ground-truth answer, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge (a minimal scoring sketch appears after this list). Featured in VentureBeat.
  • 2024/02/13 New paper (+ code) benchmarking the performance of tabular algorithms on the largest suite of datasets to date. NeurIPS 2023 (Datasets and Benchmarks).
  • 2023/11/07 New first-author paper studying the effects of two important dataset-level constituents: label set design and class balance. NeurIPS 2023 (1st Workshop on Attributing Model Behavior at Scale).
  • 2023/10/28 New first-author paper investigating sketching and feature-selection methods for prior-fitted networks. NeurIPS 2023 (Second Table Representation Learning Workshop).
  • 2023/10/27 New first-author paper (+ code) introducing ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping that enables large language models to solve column type annotation (CTA) problems in a fully zero-shot manner (a pipeline sketch appears after this list). VLDB 2024.
  • 2023/08/01 New first-author paper introducing JANuS (Joint Annotations and Names Set), a collection of four new training datasets with images, labels, and corresponding captions, and conducting controlled investigations of factors contributing to robustness in image classification. TMLR 2023.
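
To make the judge-free scoring idea in the LiveBench item concrete, here is a minimal sketch in Python of checking model answers against verifiable ground truth, with no LLM judge in the loop. Everything below, from the normalize helper to the toy questions, is illustrative and is not LiveBench's actual code.

    import re

    def normalize(answer: str) -> str:
        """Lowercase, trim, and collapse whitespace so formatting quirks don't affect scoring."""
        return re.sub(r"\s+", " ", answer.strip().lower())

    def score_exact(model_output: str, ground_truth: str) -> float:
        """Return 1.0 iff the normalized answer matches the objective ground truth."""
        return float(normalize(model_output) == normalize(ground_truth))

    # Hypothetical questions with verifiable answers, in the spirit of the benchmark.
    questions = [
        {"prompt": "What is 17 * 6?", "answer": "102"},
        {"prompt": "Spell 'benchmark' backwards.", "answer": "kramhcneb"},
    ]
    outputs = ["102", "  kramhcneB"]  # hypothetical model completions

    accuracy = sum(score_exact(o, q["answer"]) for o, q in zip(outputs, questions)) / len(questions)
    print(f"accuracy: {accuracy:.2f}")  # 1.00, scored automatically

Because scoring reduces to string comparison against ground truth, it is cheap, reproducible, and immune to the implicit judge biases discussed in the alignment item above.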
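
Similarly, here is a hypothetical sketch of the four-stage zero-shot pipeline named in the ArcheType item: context sampling, prompt serialization, model querying, and label remapping. The stage names follow the paper's outline; every function body, the label set, and the canned model reply are stand-ins, not ArcheType's implementation.

    import random

    LABEL_SET = ["name", "date", "price", "country"]  # toy closed label set

    def sample_context(column_values: list[str], k: int = 5) -> list[str]:
        """Context sampling: draw a small, representative subset of cell values."""
        return random.sample(column_values, min(k, len(column_values)))

    def serialize_prompt(samples: list[str]) -> str:
        """Prompt serialization: pack sampled values and candidate labels into one prompt."""
        return (f"Column values: {', '.join(samples)}\n"
                f"Choose the best type from {LABEL_SET}. Answer with one word.")

    def query_model(prompt: str) -> str:
        """Model querying: stand-in for a call to an LLM API."""
        return " Country."  # canned completion for illustration

    def remap_label(raw: str) -> str:
        """Label remapping: snap a free-form completion back onto the closed label set."""
        cleaned = raw.strip().lower()
        for label in LABEL_SET:
            if label in cleaned:
                return label
        return "unknown"

    column = ["France", "Japan", "Brazil", "Kenya", "Norway", "Chile"]
    prompt = serialize_prompt(sample_context(column))
    print(remap_label(query_model(prompt)))  # -> "country"

The remapping stage is what makes the loop robust to free-form completions: even a noisy reply like " Country." snaps back onto the closed label set.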

Publications

A full list of my publications can be found here.