Jellyfish: A Large Language Model for Data Preprocessing

$^{1}$Haochen Zhang, $^{2}$Yuyang Dong, $^{1,3}$Chuan Xiao, $^{2}$Masafumi Oyamada

$^{1}$Osaka University, $^{2}$NEC Corporation, $^{3}$Nagoya University
{chou.koushin, chuanx}@ist.osaka-u.ac.jp, {dongyuyang, oyamada}@nec.com

Abstract

This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Whereas the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7 – 13B models) as universal DP task solvers. We select a collection of datasets across four representative DP tasks and construct instruction-tuning data using serialization and knowledge injection techniques tailored to DP. As such, the instruction-tuned LLMs empower users to manually craft instructions for DP. Meanwhile, they can operate on a local, single, and low-priced GPU, ensuring data security and enabling further tuning. Our experiments show that our dataset constructed for DP instruction tuning, namely Jellyfish, effectively enhances LLMs’ DP performance and barely compromises their abilities in NLP tasks. By tuning Mistral-7B and OpenOrca-Platypus2-13B with Jellyfish, the models become competitive with state-of-the-art DP methods and generalize well to unseen tasks. The models’ performance rivals that of GPT series models, and their interpretation offers enhanced reasoning capabilities compared to GPT-3.5. The 7B and 13B Jellyfish models are available at Hugging Face:
https://huggingface.co/NECOUDBFM/Jellyfish-7B
https://huggingface.co/NECOUDBFM/Jellyfish-13B

* Haochen Zhang and Yuyang Dong contributed equally to this work. Yuyang Dong is the corresponding author.

1 Introduction

The proliferation of large language models (LLMs) has catalyzed a diverse array of applications, extending beyond the domain of NLP to encompass a wide range of fields that require the processing of natural language data. Notably, LLMs have been applied in areas such as software engineering [80, 93], computer simulation [107, 24], data analytics [8, 88], and tabular data processing [54, 64, 115].

This paper focuses on the utilization of LLMs for data preprocessing (DP), a critical step in the data mining pipeline that involves transforming raw data into a manageable and processable format ready for use. Over the past decades, significant strides have been made in various DP tasks. Until 2021, most efforts were concentrated on one or two specific tasks such as error detection (ED) [28, 68], data imputation (DI) [83, 67, 69], schema matching (SM) [113], and entity matching (EM) [43, 57]. A key challenge in developing generic solutions to DP is that these tasks differ in nature: they deal with errors, anomalies, matches, etc. and require different actions such as detection, repairing, and alignment.

Figure 1: Overview of instruction-tuning a large language model for data preprocessing.

With the advent of LLMs like GPT-3 and subsequent versions, researchers have found a key to address this challenge, spurring the development of generic solutions for a wider array of DP tasks [73, 112]. The application of LLMs in DP has the following strengths: (1) LLMs can process natural language. Most LLMs provide a prompting interface with which users can interact and assign tasks in natural language, contrasting with existing DP solutions that require computer programming or specific tools (e.g., HoloClean [83] and Magellan [43]). (2) With the knowledge acquired through training on vast amounts of data, LLMs are universal problem solvers capable of identifying errors, anomalies, and matches in data (and particularly unseen datasets in unseen tasks), aligning with the aims of DP tasks without needing human-engineered rules [82]. (3) LLMs are excellent reasoners [42], enabling them to not only return DP results but also provide the reasons for these results. In this sense, their answers are more interpretable than those of other deep learning approaches. (4) LLMs can be conditioned by few- [5] or zero-shot [42] prompting. As such, we can condition the criteria for DP tasks (e.g., the degree of matching) using few-shot examples or zero-shot prompts, contrasting with traditional solutions based on a threshold [86, 43] or a time-consuming training process to fit to the data [69].

Despite these strengths, existing LLM-based solutions to DP [73, 112, 44], with reliance on GPT APIs, have raised concerns about data breaches, as evidenced by OpenAI’s first confirmed data breach involving ChatGPT [76]. Another limitation is the difficulty in domain specification [73]. When dealing with data from highly specialized domains, training the LLMs used in these solutions can be costly (e.g., GPT-3.5) and even unavailable due to frozen parameters (e.g., GPT-4), posing difficulty in customizing the model.

In response to the aforementioned challenges, we propose Jellyfish, a dataset for tuning LLMs for various DP tasks. Jellyfish distinguishes itself with several key features: (1) Jellyfish is used for building universal DP task solvers by instruction-tuning [114] LLMs on the following tasks: ED and DI for data cleaning, and SM and EM for data integration. (2) Jellyfish is suited to 7 – 13B models which can operate on a local, single, and low-priced GPU, ensuring data security and allowing further tuning. (3) Capable of understanding natural language, the tuned model allows users to manually craft instructions for DP tasks and apply prompt engineering techniques to tailor it to specific tasks and datasets. (4) Unlike many existing methods that rely heavily on handcrafted knowledge during inference [83, 81], Jellyfish features domain knowledge in its instruction-tuning dataset and enables optional knowledge injection during inference. (5) It includes reasoning data for tuning the model’s interpretation ability, which provides natural language explanations of the model’s outputs.

As depicted in Figure 1, Jellyfish is constructed by manually selecting data from several public datasets widely used for DP evaluation. By instance serialization, raw data is serialized into the prompts used to tune the model. By knowledge injection, task- and dataset-specific knowledge – particularly domain knowledge that can be extended to unseen datasets – is infused into the prompts. Moreover, we resort to GPT-4 to generate reasoning data. As such, the tuned model’s interpretation distills GPT-4’s knowledge in reasoning about DP results.

Our evaluation of Jellyfish focuses on tuning OpenOrca-Platypus2-13B (as Jellyfish-13B) and Mistral-7B-Instruct-v0.2 (as Jellyfish-7B), and compares with two categories of methods: non-LLM methods – typically solutions based on machine learning (ML) or pre-trained language models (PLMs) prior to the prevalence of LLMs – and LLM methods – typically GPT series methods. The results show that Jellyfish-13B consistently outperforms non-LLM methods on its seen datasets. Its effectiveness on unseen datasets is comparable to that of non-LLM methods on their respective seen datasets. Meanwhile, Jellyfish-7B also exhibits competitiveness, especially on DI tasks. In two case studies of unseen tasks, Jellyfish models deliver strong performance, showcasing their generalizability to a wider range of DP tasks beyond the four tasks used for tuning. Their performance rivals that of GPT series models, and Jellyfish-7B even outperforms GPT-4 on the attribute value extraction task. Our evaluation reveals the impact of data configuration in Jellyfish, and discovers that tuning with Jellyfish barely compromises models’ abilities in NLP tasks. Furthermore, additional experiments demonstrate the advantage of Jellyfish interpretation over GPT-3.5 in reasoning capabilities as well as the effectiveness of the techniques employed in building Jellyfish.

Our contributions are summarized as follows.

• We develop Jellyfish, a dataset for instruction-tuning LLMs as universal DP task solvers.

• LLMs tuned with Jellyfish showcase several notable features: universal model design, moderate model size, assurance of data security, feasibility for further tuning, natural language instruction handling, optional specification of prior knowledge, and model interpretability.

• Our experiments demonstrate Jellyfish-7B and 13B models’ effectiveness in DP task solving, generalizability to new tasks beyond what they are tuned for, and the superior reasoning abilities.

The rest of the paper is organized as follows: Section 2 introduces the DP tasks targeted by our model and briefly reviews LLMs. Section 3 describes the Jellyfish dataset for instruction tuning. Section 4 introduces how to use Jellyfish for solving DP tasks. Section 5 discusses the extensions to unseen tasks. Section 6 reports experimental results and analysis. Section 7 reviews related works on DP. Section 8 concludes this paper.

2 Preliminaries

2.1 Data Preprocessing

In data mining, DP is a crucial step that deals with noise, missing values, inconsistencies, and heterogeneity in data. Major DP procedures include data cleaning, data integration, data transformation, and data reduction [26]. In this initial exploration of LLMs for DP, we concentrate on tabular data, one of the most common data types. Our data model operates on relational tables specified by schemas. We assume all attributes are either numerical values (incl. binary values) or textual values (incl. categorical values). Diverging from the traditional definition that presents the entire dataset and finds or fixes all the errors (or matches, etc.) within, we define the problem by handling one record (or a pair, depending on the task) at a time, so the prompt can be easily written and its length is within LLMs’ token limitation. Next, we outline the DP tasks involved in this study:

(1) Error Detection (ED): Given a record (i.e., a tuple in a relational table) and an attribute, our task is to detect whether there is an error in the cell value of this attribute. (2) Data Imputation (DI): Given a record and an attribute such that cell value for this attribute is missing, our task is to infer its correct value. (3) Schema Matching (SM): Given a pair of attributes represented in the form of (name, description), our task is to find whether they refer to the same attribute. (4) Entity Matching (EM): Given a pair of records, our task is to infer whether they refer to the same entity.

The above four tasks collectively form the most critical part of DP and are used for instruction-tuning. Besides, we consider two unseen tasks which belong to the intersection of DP and other topics: (1) Column Type Annotation (CTA): Given a table with no header, our task is to infer the type of each column from a set of predefined semantic types (e.g., name, time, location). (2) Attribute Value Extraction (AVE): Given a text description of an entity and a set of predefined attributes, extract attribute values from the text description.

We term each input object an instance, i.e., a record for ED and DI, a pair of attributes for SM, a pair of records for EM, a table or a column for CTA, and a text description for AVE.
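As a rough illustration of the instance notion for the four seen tasks, the sketch below models each task input as a Python dataclass. All class and field names are hypothetical and chosen for this example only; they are not taken from the Jellyfish implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# A record maps attribute names to cell values; None marks a missing value.
Record = Dict[str, Optional[str]]


@dataclass
class EDInstance:
    """Error detection: is the value of `attribute` in `record` erroneous?"""
    record: Record
    attribute: str


@dataclass
class DIInstance:
    """Data imputation: infer the missing value of `attribute` in `record`."""
    record: Record
    attribute: str


@dataclass
class SMInstance:
    """Schema matching: do two (name, description) attributes match?"""
    left: Tuple[str, str]
    right: Tuple[str, str]


@dataclass
class EMInstance:
    """Entity matching: do two records refer to the same entity?"""
    left: Record
    right: Record


# Hypothetical usage: one DI instance whose "city" cell is missing.
di = DIInstance(record={"name": "Cafe Roma", "city": None}, attribute="city")
```

Handling one such object per prompt is what keeps the prompt short enough to stay within the token limitation.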

2.2 Large Language Models

With advancements in the field of natural language processing (NLP), LLMs have become one of the hottest topics in the AI research community. Representative LLMs include OpenAI’s GPT series (in particular, GPT-3, 3.5, and 4), Anthropic’s Claude, Google’s Gemini, Mistral AI’s Mistral [37], Meta’s Llama [97] and Llama 2 [98], as well as their variants that can be found at Hugging Face [35]. Due to their superb ability to process natural language, LLMs have not only been used in NLP applications (e.g., ChatGPT and Claude), but also catalyzed the rise of LLM-powered autonomous agents [101] as AI assistants (e.g., by GPTs) or tools for engineering [80, 31] or simulation [108, 107] purposes. Another popular LLM-centric research direction is retrieval-augmented generation (RAG) [52, 53], which gives LLMs access to external information to improve generation performance. We refer readers to [118] for a survey on LLMs. Some LLMs are open-source (e.g., Llama and Llama 2), and they can be fine-tuned with additional tasks to improve their abilities in logical reasoning, question answering, and so on. Among these fine-tuning approaches, instruction tuning [114] has become a prevalent one which further trains LLMs on a dataset consisting of (instruction, output) pairs in a supervised fashion, hence bridging the gap between the next-word prediction objective of LLMs and the users’ objective of having LLMs adhere to human instructions. For efficiency of fine-tuning, parameter-efficient fine-tuning (PEFT) approaches enable adaptation of LLMs to downstream applications without fine-tuning all the parameters. Notable methods are adapter tuning [32], prefix-tuning [56], and low-rank adaptation (LoRA) [33]. In particular, LoRA achieves significantly fewer trainable parameters and no additional inference latency, and has become a prevalent PEFT approach.
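For reference, the low-rank update at the core of LoRA [33] can be written as below. The symbols $W_0$, $A$, $B$, $d$, $k$, and $r$ are notation introduced here for illustration and are not used elsewhere in this paper.

```latex
h = W_0 x + \Delta W x = W_0 x + B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

The pre-trained weight $W_0$ stays frozen and only $A$ and $B$ are trained, i.e., $r(d + k)$ parameters instead of $dk$. Because $BA$ can be merged into $W_0$ after tuning, the adapted model runs with no additional inference latency.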

In addition to the strengths outlined in Section 1, we discuss the limitations of LLMs in the context of DP: (1) LLMs often require substantial computational resources, thereby increasing the cost of use and compromising the efficiency and scalability of DP on large-scale data. (2) Due to token limitation (the maximum input length, e.g., 4k tokens for GPT-3.5) and lack of memory for keeping historical information, the input to the LLM is often instance-by-instance, and the DP results may exhibit inconsistency across different instances. Simply raising the token limitation (e.g., 128k tokens for GPT-4-turbo) does not solve the problem, because performance may degrade due to increased lengths of input [60]. (3) LLMs sometimes exhibit hallucination [117], i.e., they generate text that is plausible-sounding but factually incorrect or non-sensical, as they lack a fundamental understanding of the world and rely solely on the patterns they learned during training.

3 Instruction Tuning with Jellyfish

3.1 Dataset Preparation

For the four seen tasks, we choose a series of datasets that have been widely used in previous studies and cover a variety of application domains. (1) ED: Adult and Hospital, used in [28]; (2) DI: Buy and Restaurant, used in [69]; (3) SM: MIMIC-III and Synthea, used in [113]; (4) EM: Amazon-Google, Beer, DBLP-ACM, DBLP-GoogleScholar, Fodors-Zagats, and iTunes-Amazon from the Magellan data repository [16]. We use the publicly available version of these datasets [73], where errors and missing values are already injected into the datasets of ED and DI, respectively. The statistics are provided in Table 1.

Table 1: The Jellyfish dataset statistics. #Positives indicates the number of positive instances, i.e., there is an error (for ED) or the two objects match (for SM and EM).

Task	Dataset	#Instances	#Positives
ED	Adult	550 $\times$ 2	35 $\times$ 2
ED	Hospital	1710 $\times$ 2	44 $\times$ 2
DI	Buy	586	N/A
DI	Restaurant	778	N/A
SM	MIMIC-III	7000	11
SM	Synthea	5000	18
EM	Amazon-Google	6874	699
EM	Beer	359	54
EM	DBLP-ACM	5000	885
EM	DBLP-GoogleScholar	5000	924
EM	Fodors-Zagats	757	88
EM	iTunes-Amazon	430	105

For determining the number of instances in each dataset, a rationale is that the datasets across different tasks should be balanced so that no dataset dominates the entire corpus. In particular, we undertake the following efforts to prepare data: (1) Given the disproportionately low number of positive instances compared to negative ones, we incorporate all positive instances available in the datasets into Jellyfish. (2) For ED, since missing values can be interpreted as either errors or non-errors depending on the context, we create two versions of each instance with missing values during tuning: one treating the missing values as errors and the other as non-errors. This duplication is guided by the injection of knowledge, which will be detailed later in this section. (3) In instances of low-quality data for a task (e.g., SM), we moderately increase the percentage of the data for this task in the corpus, to ensure the LLM can learn effectively. Upon the above preparation, we split each dataset into training and validation with a ratio of 80:20. Next, we introduce how we transform raw data to instructions. We prepare instruction data and reasoning data, one for tuning the task solving ability and the other for tuning the interpretation ability. They can be jointly used for tuning a Jellyfish model.
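The sketch below illustrates one possible reading of step (2) and of the 80:20 split. It is not the authors' preparation script: the function names, the dictionary layout, and the wording of the two knowledge sentences are all assumptions made for this example.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Record = Dict[str, Optional[str]]

# Hypothetical knowledge sentences; the wording used in Jellyfish may differ.
MISSING_IS_ERROR = "Missing values (nan) should be treated as errors."
MISSING_IS_NOT_ERROR = "Missing values (nan) should not be treated as errors."


def ed_versions(record: Record, attribute: str, label: str) -> List[dict]:
    """Return the tuning examples derived from one ED instance.

    A present cell value yields a single example with the given label.
    A missing cell value yields two examples whose labels follow the
    injected knowledge rather than the original label.
    """
    base = {"record": record, "attribute": attribute}
    if record.get(attribute) is not None:
        return [{**base, "knowledge": "", "label": label}]
    return [
        {**base, "knowledge": MISSING_IS_ERROR, "label": "Yes"},
        {**base, "knowledge": MISSING_IS_NOT_ERROR, "label": "No"},
    ]


def train_valid_split(instances: Sequence, train_ratio: float = 0.8,
                      seed: int = 0) -> Tuple[list, list]:
    """Shuffle deterministically and split into training and validation."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

For example, `train_valid_split(range(586))` yields 469 training and 117 validation items; which items land in each part depends on the (assumed) seed.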

3.2 Instruction Data

To prepare the instruction data for an LLM, we need to serialize (a.k.a. contextualize) each instance in the raw data to a prompt. The prompt contains the task description, the instance content, and any injected knowledge. To describe our techniques for constructing the instruction data for training, we use an example for an instance in the Beer dataset used for EM, as shown in Figure 2.

At the beginning, there is a system message guiding the model behavior. Here, we instruct the model to act as an AI assistant to answer the user’s question, and its response should always respect this constraint. Then, we describe the DP task, i.e., EM in this example. The following part refers to injected knowledge. There are two types of injected knowledge: (1) general knowledge that applies to many datasets, and (2) specific knowledge that only applies to the given dataset. In this example, the knowledge belongs to general knowledge and concerns missing values. Such knowledge injection may prevent the model from incorrectly handling certain values in the dataset, especially when training data is noisy. The following part pertains to the instance content. Finally, there is a question presented to the model, and the output format is specified afterwards.
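The prompt layout described above (system message, task description, injected knowledge, instance content, question, output format) can be sketched as follows. This is a minimal illustration: the function names and every prompt sentence are assumptions made for this example, not the exact wording used in Jellyfish.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def serialize_record(record: Record) -> str:
    """Serialize a record as 'attribute: value' pairs; missing cells become 'nan'."""
    return ", ".join(
        f"{attr}: {'nan' if value is None else value}"
        for attr, value in record.items()
    )


def build_em_prompt(left: Record, right: Record, knowledge: str = "") -> str:
    """Assemble an entity matching prompt in the order described in the text."""
    parts = [
        # 1. System message guiding the model behavior (illustrative wording).
        "You are an AI assistant that answers the user's question.",
        # 2. Task description.
        "Your task is to determine whether two records refer to the same entity.",
    ]
    if knowledge:
        # 3. Injected knowledge, general or dataset-specific.
        parts.append(knowledge)
    parts += [
        # 4. Instance content.
        f"Record A: [{serialize_record(left)}]",
        f"Record B: [{serialize_record(right)}]",
        # 5. Question, followed by 6. the output format.
        "Are record A and record B the same entity?",
        "Choose your answer from: [Yes, No]",
    ]
    return "\n".join(parts)
```

Keeping the knowledge part optional mirrors the distinction between built-in knowledge learned during tuning and knowledge supplied explicitly in the prompt.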

Whereas in the above example we specify knowledge on missing values, there are other forms of general knowledge used in tuning, including error types and terminology. For example, for ED, we inform the model of the fact that errors can include, but are not limited to, spelling errors, inconsistencies, or values that do not make sense for that attribute; for EM, we instruct the model to consider the full name of an attribute and its acronym to determine if the two values are the same. Specific knowledge highly depends on the application domain, mainly including constraints or rules that pertain to the dataset. For example, in publication datasets, authors’ names may occur in different forms and different orders even for the same article. Additionally, the model can be configured to assign greater importance to certain attributes. In the context of product data, for example, the model is directed to prioritize the comparison of product numbers. Specific knowledge can be applicable to datasets within the same domain, thereby enhancing the model’s performance on unseen datasets, particularly in scenarios where prior knowledge about these datasets is absent. Overall, the knowledge injected through tuning becomes the built-in knowledge of the model and can be used even without user-specification during inference.

3.3 Reasoning Data

The reasoning data uses the same set of datasets as instruction data, except that we use smaller numbers of instances, e.g., 2500 for SM and roughly 360 per dataset for EM, with detailed statistics given in Table 2.

Table 2: Jellyfish-preview dataset statistics. #Positives indicates the number of positive instances, i.e., there is an error (for ED) or the two objects match (for SM and EM).

Task	Dataset	#Instances	#Positives
ED	Adult	550	35
ED	Hospital	1250	40
DI	Buy	586	N/A
DI	Restaurant	600	N/A
SM	Synthea	2500	15
EM	Amazon-Google	359	31
EM	Beer	359	54
EM	DBLP-ACM	359	71
EM	DBLP-GoogleScholar	359	69
EM	Fodors-Zagats	359	41
EM	iTunes-Amazon	360	92

This smaller set of training data is also used for initial experiments on selecting base LLMs, and thus the tuning process using this set of data is dubbed Jellyfish-preview. The prompt in reasoning data is similar to the instruction data introduced above. The only difference is the reasoning instructions, as given in Appendix B.

Unlike the labeled ground truth for instruction data, we resort to GPT-4 to retrieve reasonable answers in reasoning data. Such practice is also used in constructing training data for various Llama and Llama 2 variants such as Alpaca [94] and Orca [72].

4 Data Preprocessing with Jellyfish Models

Given a dataset in CSV format, the task solver uses an instance serializer that iterates through all the instances and transforms each instance to a prompt. The prompt is the same as the instruction and reasoning data for tuning the task solver and interpreter, respectively. We apply general knowledge for DP tasks, e.g., missing values in matching tasks and error types in ED. The task solver also provides a knowledge injector with which users can input dataset-specific knowledge, such as the domain knowledge (e.g., constraints) outlined in the previous section. Such user-specified knowledge is optional.
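A minimal sketch of such an instance serializer for ED is given below, using Python's standard `csv` module. The function name, the prompt sentences, and the `user_knowledge` parameter are hypothetical; only the general ED knowledge sentence paraphrases the error types listed in Section 3.2.

```python
import csv
import io
from typing import Iterator

GENERAL_ED_KNOWLEDGE = (
    "Errors may include, but are not limited to, spelling errors, "
    "inconsistencies, or values that do not make sense for the attribute."
)


def iter_ed_prompts(csv_text: str, attribute: str,
                    user_knowledge: str = "") -> Iterator[str]:
    """Yield one ED prompt per CSV row, i.e., one instance at a time."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        cells = []
        for name, value in row.items():
            shown = value if value else "nan"  # empty cell -> missing value
            cells.append(f"{name}: {shown}")
        lines = [
            "Your task is to detect whether the given attribute of the record contains an error.",
            GENERAL_ED_KNOWLEDGE,
        ]
        if user_knowledge:  # optional dataset-specific knowledge
            lines.append(user_knowledge)
        lines += [
            f"Record: [{', '.join(cells)}]",
            f"Attribute for verification: {attribute}",
            f'Is there an error in the value of the "{attribute}" attribute? Answer Yes or No.',
        ]
        yield "\n".join(lines)
```

Each yielded string would then be sent to the tuned model as a separate request.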

Feature Engineering. Users can optionally select a subset of features to improve performance. For instance, for EM in the Beer dataset, name and factory are more relevant features, while style and ABV are less relevant. Hence users may choose to use only name and factory as attributes. Such feature engineering can also be implemented in the prompt as specific knowledge, e.g., you should only consider name and factory and ignore other attributes.
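Both options can be sketched in a few lines. The helper names below are hypothetical; the knowledge sentence follows the example wording given above.

```python
from typing import Dict, Optional, Sequence

Record = Dict[str, Optional[str]]


def select_features(record: Record, keep: Sequence[str]) -> Record:
    """Option 1: drop less relevant attributes before serialization."""
    keep_set = set(keep)
    return {attr: value for attr, value in record.items() if attr in keep_set}


def feature_knowledge(keep: Sequence[str]) -> str:
    """Option 2: keep all attributes but state the preference as specific knowledge."""
    return f"You should only consider {' and '.join(keep)} and ignore other attributes."
```

The first option also shortens the prompt, whereas the second leaves the full record visible to the model.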

Prompt Engineering. Prompt engineering [105] is the process of structuring text to enhance the model performance. We incorporate few-shot prompting [5], which conditions the Jellyfish models to learn from a small selection of examples drawn from the dataset. The prompts for few-shot examples are reported in Appendix D.
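As an illustration of few-shot prompting, the sketch below prepends labeled examples to the query instance. The layout ("Example i", "Answer:") is an assumption made for this example; the prompts actually used are those in Appendix D.

```python
from typing import List, Tuple


def build_few_shot_prompt(instructions: str,
                          examples: List[Tuple[str, str]],
                          query: str) -> str:
    """Place (instance, answer) examples between the instructions and the query."""
    blocks = [instructions]
    for i, (instance, answer) in enumerate(examples, start=1):
        blocks.append(f"Example {i}:\n{instance}\nAnswer: {answer}")
    # The query ends with an open "Answer:" for the model to complete.
    blocks.append(f"{query}\nAnswer:")
    return "\n\n".join(blocks)
```

In the experiments of Section 6, three examples per dataset are used, covering both positive and negative cases.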

Batch prompting [9] is a prompt engineering technique designed to enable models to perform inference in batches, rather than processing single instances individually. This approach involves presenting multiple instances within a single prompt, with the model instructed to respond to all of them concurrently. Though proven effective for GPT-3.5 and GPT-4 in reducing token consumption and execution time [112], we have opted not to employ this method in the current version of Jellyfish. Our concern is that overburdening a 7B or 13B model with an excessive number of tokens, even when staying within its token limitation, could lead to diminished attention. This might significantly impair performance, potentially resulting in the model overlooking responses to some instances. We anticipate that future research, utilizing larger models or increasing the token input capacity, could effectively address this limitation.
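For illustration only, since Jellyfish does not employ it, batch prompting can be sketched as below. The numbering scheme and the answer format are assumptions of this example. The parser returns `None` for any instance the model did not answer, which is the failure mode of concern for smaller models.

```python
import re
from typing import List, Optional


def build_batch_prompt(task_description: str, instances: List[str]) -> str:
    """Pack several instances into one prompt as numbered questions."""
    lines = [
        task_description,
        "Answer every question below, one per line, in the form '<number>: <answer>'.",
    ]
    for i, instance in enumerate(instances, start=1):
        lines.append(f"Question {i}: {instance}")
    return "\n".join(lines)


def parse_batch_answers(response: str, n: int) -> List[Optional[str]]:
    """Map '<number>: <answer>' lines back to instances; None marks a skipped one."""
    answers: List[Optional[str]] = [None] * n
    for line in response.splitlines():
        match = re.match(r"\s*(\d+)\s*[:.]\s*(\S.*)", line)
        if match:
            index = int(match.group(1))
            if 1 <= index <= n:
                answers[index - 1] = match.group(2).strip()
    return answers
```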

5 Extensions to Unseen Tasks

In Section 4, we introduce DP task solving and focus on seen tasks. For unseen tasks, we consider two case studies: CTA and AVE, as outlined in Section 2.1. Jellyfish models can be easily extended to support them by employing the prompt engineering techniques in existing LLM-based solutions, hence simplifying their use in unseen tasks.

Column Type Annotation. As a task in the realm of table understanding, CTA is an essential DP step for data search [7], knowledge base completion [85], and data integration in a data lake [25]. We follow the two-stage pipeline proposed in [44], which was designed for ChatGPT and based on chain-of-thought [104], a technique that enables complex reasoning capabilities through intermediate reasoning steps.

Given a table to be annotated, in the first stage, the model predicts the domain of the table. In the second stage, given a set of predefined types, the model determines the type of column based on sample values extracted from it. The chain-of-thought prompt instructs the model in a step-by-step manner. For example, to predict the domain of the table, there are four steps: (1) look at the input and make a table out of it, (2) look at the cell values in detail, (3) decide if the table describes domain A, domain B … and (4) answer with the domain. Then, the model follows this prompt to cope with the task. The column type selection in the second stage works in the same way, except that table is replaced by column and domains are replaced by candidate types.
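The two-stage prompts can be sketched as follows. The four steps are taken from the description above; the function names, the framing sentences, and the input layout are assumptions of this example rather than the exact prompt of [44].

```python
from typing import List


def _steps(unit: str, label: str, options: List[str]) -> List[str]:
    """Four chain-of-thought steps, shared by both stages."""
    return [
        f"1. Look at the input and make a {unit} out of it.",
        "2. Look at the cell values in detail.",
        f"3. Decide if the {unit} describes one of: {', '.join(options)}.",
        f"4. Answer with the {label}.",
    ]


def build_domain_prompt(table_text: str, domains: List[str]) -> str:
    """Stage 1: predict the domain of the whole table."""
    lines = ["Follow these steps:", *_steps("table", "domain", domains),
             f"Input:\n{table_text}"]
    return "\n".join(lines)


def build_type_prompt(sample_values: List[str], candidate_types: List[str]) -> str:
    """Stage 2: pick the type of one column from its sample values."""
    lines = ["Follow these steps:", *_steps("column", "type", candidate_types),
             "Input:\n" + " || ".join(sample_values)]
    return "\n".join(lines)
```

One natural design, assumed here, is that the domain predicted in stage 1 narrows the candidate types passed to stage 2.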

Attribute Value Extraction. Given a text description, AVE is an information extraction task that discovers missing values of attributes and reconstructs a table. For this task, we follow the prompt in [4] designed for GPT-4. The prompt is simple, beginning with the task description. Then, the instance content follows, with the description of the entity and the attribute to be extracted. Finally, an exception rule is mentioned: if the attribute cannot be extracted, the model should answer “N/A”.
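A minimal sketch of this prompt and of handling the exception rule is shown below. The sentences and helper names are hypothetical and only follow the structure described above (task description, instance content, exception rule); they are not the exact prompt of [4].

```python
from typing import Optional


def build_ave_prompt(description: str, attribute: str) -> str:
    """Task description, then instance content, then the exception rule."""
    return "\n".join([
        "Extract the value of the target attribute from the entity description.",
        f"Description: {description}",
        f"Attribute: {attribute}",
        'If the attribute cannot be extracted, answer "N/A".',
    ])


def parse_ave_answer(answer: str) -> Optional[str]:
    """Return the extracted value, or None when the model answers N/A."""
    cleaned = answer.strip().strip('"')
    return None if cleaned.upper() == "N/A" else cleaned
```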

We also would like to mention that Jellyfish models enable further fine-tuning. Users may choose to condition the model for specific DP tasks or domains to seek better performance. Moreover, Jellyfish models can be utilized for multiple tasks in a DP pipeline, e.g., data cleaning followed by data integration on the same sets of data. It is likely that the DP tasks within this pipeline belong to the same domain. In this case, Jellyfish models may deliver consistency in handling the data in different tasks due to the built-in domain knowledge acquired through instruction tuning for DP.

6 Experiments

6.1 Experimental Setup

Datasets. Apart from the seen datasets in Jellyfish (Section 3), we use the following datasets as unseen data. CTA and AVE are used for case studies on unseen tasks. (1) ED: Flights and Rayyan, used in [68]; (2) DI: Flipkart [22] and Phone [84] from Kaggle; (3) SM: CMS, used in [113]; (4) EM: Abt-Buy and Walmart-Amazon (despite the same domain as the Amazon dataset used in Jellyfish, the entities belong to a different category of products) from the Magellan data repository [16]; (5) CTA: SOTAB, used in [44]; (6) AVE: AE-110k and OA-Mine, used in [4]. The statistics of the datasets are reported in Table 3. We generate train/valid/test splits following the protocol in [28] for Adult and Hospital, [69] for Flipkart and Phone, and [113] for MIMIC-III and CMS. The other datasets have already been provided with splits [73, 44, 4]. A subset of the train/valid splits is used in Jellyfish, as reported in Table 1.

Table 3: Testing dataset statistics. #Train and #Valid numbers only apply to GPT-3.5 on AVE and non-LLM methods on other tasks.

Task	Type	Dataset	#Train	#Valid	#Test	#Total
ED	Seen	Adult	550	550	9900	11000
ED	Seen	Hospital	1710	190	17101	19001
ED	Unseen	Flights	715	714	12832	14261
ED	Unseen	Rayyan	501	502	8997	10000
DI	Seen	Buy	469	117	65	651
DI	Seen	Restaurant	622	156	86	864
DI	Unseen	Flipkart	6240	0	2675	8915
DI	Unseen	Phone	2537	0	1194	3731
SM	Seen	MIMIC-III	51264	6408	6408	64080
SM	Seen	Synthea	23709	2964	2964	29637
SM	Unseen	CMS	22784	2848	2564	28196
EM	Seen	Amazon-Google	6874	2293	2293	11460
EM	Seen	Beer	268	91	91	450
EM	Seen	DBLP-ACM	6417	2473	2473	11363
EM	Seen	DBLP-GoogleScholar	17223	5742	5742	28707
EM	Seen	Fodors-Zagats	567	190	189	946
EM	Seen	iTunes-Amazon	321	109	109	539
EM	Unseen	Abt-Buy	5743	1916	1916	9575
EM	Unseen	Walmart-Amazon	6144	2049	2049	10242
CTA	Unseen	SOTAB	356	0	250	606
AVE	Unseen	AE-110k	4360	0	1482	5842
AVE	Unseen	OA-Mine	7360	0	2451	9811

Table 4: DP performance on seen tasks, with winner in boldface and runner-up underlined. Few-shot is disabled for Jellyfish models on seen datasets and enabled on unseen datasets. “–” indicates numbers not reported in prior works.

Task	Type	Dataset	Best of non-LLM	GPT-3	GPT-3.5	GPT-4	Table-GPT	Jellyfish-7B	Jellyfish-7B-I	Jellyfish-13B
ED	Seen	Adult	99.10	99.10	92.01	92.01	–	94.70	91.96	99.33
ED	Seen	Hospital	94.40	97.80	90.74	90.74	–	95.09	96.27	95.59
ED	Unseen	Flights	81.00	–	–	83.48	–	65.30	66.92	82.52
ED	Unseen	Rayyan	79.00	–	–	81.95	–	73.81	69.82	90.65
DI	Seen	Buy	96.50	98.50	98.46	100	–	98.46	96.92	100
DI	Seen	Restaurant	77.20	88.40	94.19	97.67	–	86.05	88.37	89.53
DI	Unseen	Flipkart	68.00	–	–	89.94	–	81.87	79.44	81.68
DI	Unseen	Phone	86.70	–	–	90.79	–	83.67	85.00	87.21
SM	Seen	MIMIC-III	20.00	–	–	40.00	–	43.14	40.00	40.00
SM	Seen	Synthea	38.50	45.20	57.14	66.67	–	55.55	44.44	56.00
SM	Unseen	CMS	50.00	–	–	19.35	–	20.00	13.79	59.26
EM	Seen	Amazon-Google	75.58	63.50	66.50	74.21	70.10	81.29	80.83	81.34
EM	Seen	Beer	94.37	100	96.30	100	96.30	96.30	96.55	96.77
EM	Seen	DBLP-ACM	98.99	96.60	96.99	97.44	93.80	98.54	98.88	98.98
EM	Seen	DBLP-GoogleScholar	95.70	83.80	76.12	91.87	92.40	94.89	95.16	98.51
EM	Seen	Fodors-Zagats	100	100	100	100	100	100	100	100
EM	Seen	iTunes-Amazon	97.06	98.20	96.40	100	94.30	96.30	96.30	98.11
EM	Unseen	Abt-Buy	89.33	–	–	92.77	–	79.78	82.38	89.58
EM	Unseen	Walmart-Amazon	86.89	87.00	86.17	90.27	82.40	78.22	85.64	89.42

LLMs. We mainly instruction-tune two LLMs with Jellyfish: (1) OOP2-13B, short for OpenOrca-Platypus2-13B [50], a Llama 2 variant with enhanced reasoning capabilities and logic proficiency; (2) Mistral-7B, short for Mistral-7B-Instruct-v0.2 [37], a prevalent 7B model. The tuned models using instruction data are dubbed Jellyfish-13B and Jellyfish-7B, respectively. We also tune Mistral-7B with both instruction and reasoning data, dubbed Jellyfish-7B-I. The hyperparameter setup is provided in Appendix A. Injected knowledge is reported in Appendix C. When few-shot prompting is enabled, we equip LLMs with three examples for each dataset, covering both positive and negative examples (Appendix D).

For inference, the prompts are the same as instruction and reasoning data, respectively. We apply general knowledge for DP tasks, e.g., missing values in matching tasks and error types in ED. Dataset-specific knowledge is not used.

Baseline DP Methods. We categorize existing methods into non-LLM methods and LLM methods. For non-LLM methods, we select the following baselines, in line with [73]: (1) ED: HoloDetect [28] and Raha [68]; (2) DI: IPM [69]; (3) SM: SMAT [113]; (4) EM: Ditto [57] and Unicorn [99]; (5) CTA: RoBERTa [63]. For their performance, we follow the best numbers reported in prior works [73, 44, 99]. Other methods such as Baran [67], HoloClean [83], and DODUO [92], have been shown to be outperformed by the above competitors [69, 73, 44], and hence are not compared here.

LLM methods are GPT-3 (text-davinci-002), GPT-3.5 (gpt-3.5-turbo-0301), Table-GPT [54] (GPT-3.5 fine-tuned for tables), GPT-4 (gpt-4-0314), Stable Beluga 2 70B [66], and SOLAR 70B [100]. We follow the numbers reported in [73, 112, 4].

Metrics. For DP task solving, we measure accuracy for DI, F1 score for ED, SM, EM, and AVE, and micro-F1 for CTA, all reported on a 100-scale.
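For completeness, the standard definitions of these metrics are restated below, where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives. This notation is introduced here and is not used elsewhere in the paper.

```latex
\mathrm{Accuracy} = \frac{\#\,\text{correct predictions}}{\#\,\text{instances}}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 P R}{P + R}
```

Micro-F1 applies the same $F_1$ formula after summing $TP$, $FP$, and $FN$ over all classes, so that frequent classes weigh more than rare ones.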

Environment. Training and inference of LLMs are accelerated with NVIDIA A100 GPUs with 80GB graphics memory. We employ LoRA [33] and FlashAttention-2 [15] to speed up tuning and vLLM with PagedAttention [47] to speed up inference.

6.2 DP Performance

6.2.1 Seen Tasks

We evaluate the performance on the seen tasks used for tuning. Table 4 reports the accuracy for DI and F1 score for the other three tasks. For Jellyfish models, few-shot prompting is disabled on seen datasets and enabled on unseen datasets. Among all the competitors, GPT-4 generally performs the best. This is expected, as it is the most advanced model with the largest number of parameters. However, its performance on SM is mediocre. Jellyfish-13B is generally the runner-up model and significantly outperforms GPT-4 on SM. Its accuracy or F1 score on the unseen datasets is over 80%, except on SM. In addition, Jellyfish-13B outperforms non-LLM methods on all but one seen dataset, and on all unseen datasets. Note that for non-LLM methods, because they need training on the input dataset, all the datasets are seen for them. This means that even without training on these datasets, Jellyfish-13B’s performance still surpasses the performance of non-LLM methods with training. Comparing Jellyfish-13B with GPT-3, GPT-3.5, and Table-GPT, Jellyfish-13B wins in more cases. Meanwhile, the 7B Jellyfish models also exhibit competitiveness, especially for DI, despite their small model size. By comparing Jellyfish-7B and Jellyfish-7B-I, we find that adding reasoning data to tuning does not significantly impair the model’s task solving ability.

Table 5: Precision (P), recall (R), and F1 score on SM.

Type	Dataset	Model
Type	Dataset	SMAT			GPT-4			Jellyfish-13B
		P	R	F1	P	R	F1	P	R	F1
Seen	MIMIC-III	11.5	84.6	20.2	33.33	50.0	40.0	45.45	35.71	40.0
Seen	Synthea	24.4	90.9	38.5	71.42	62.5	66.67	41.18	87.50	56.00
Unseen	CMS	33.9	95.0	50.0	60.0	11.5	19.35	57.14	61.54	59.26

Among the four tasks, SM is the hardest, and all the competitors report relatively low F1 scores. Looking into the datasets, we find that even humans have difficulty telling whether two attributes match, given only their names and descriptions. To compare the methods in more detail, we report precision and recall in Table 5. The non-LLM method, SMAT, reports the highest recall, yet with very low precision. Among its results, only 1 out of every 3 to 9 is a true positive. This is because many SM-tailored methods seek high recall, in order to find more candidates for further verification. Jellyfish-13B exhibits relatively high precision (41% – 57%), and is close to GPT-4 on the unseen dataset of CMS. This suggests that Jellyfish-13B can be used as a verification method (roughly 1 out of 2 is a true positive) on top of a filtering approach (e.g., SMAT).
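The relation between the precision, recall, and F1 columns of Table 5, as well as the "1 out of k" reading of precision, can be checked with a short sketch. The helper names below are illustrative assumptions. The inputs are the rounded values from Table 5, so the recomputed F1 may differ from the reported one in the last digit.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (both on a 100-scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def results_per_true_positive(precision):
    """Number of returned matches that contain one true positive on average."""
    return 100.0 / precision


# (method, dataset, precision, recall) as reported in Table 5.
table5_rows = [
    ("SMAT", "MIMIC-III", 11.5, 84.6),
    ("SMAT", "CMS", 33.9, 95.0),
    ("Jellyfish-13B", "Synthea", 41.18, 87.50),
    ("Jellyfish-13B", "CMS", 57.14, 61.54),
]

for method, dataset, p, r in table5_rows:
    print(
        f"{method:14s} {dataset:10s} "
        f"F1={f1_from_pr(p, r):6.2f}  "
        f"1 true positive per {results_per_true_positive(p):.1f} results"
    )
```

For SMAT, the precision values of 11.5 and 33.9 correspond to about one true positive in every 8.7 and 2.9 results, respectively, which is where the "1 out of 3 – 9" range comes from.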

Table 6: Micro-F1 score on the unseen DP task of CTA, few-shot disabled for Jellyfish models.

Dataset	Model
Dataset	RoBERTa (159 shots)	RoBERTa (356 shots)	GPT-3.5	GPT-4	Jellyfish-7B	Jellyfish-7B-I	Jellyfish-13B
SOTAB	79.20	89.73	89.47	91.55	83.54	80.89	82.00

Table 7: F1 score on the unseen DP task of AVE, few-shot disabled for Jellyfish models.

Dataset	Model
Dataset	Stable Beluga 2 70B	SOLAR 70B	GPT-3.5	GPT-4	Jellyfish-7B	Jellyfish-7B-I	Jellyfish-13B
AE-110k	52.10	49.20	61.30	55.50	74.17	76.85	58.12
OA-Mine	50.80	55.20	62.70	68.90	75.35	76.04	55.96

6.2.2 Unseen Tasks

Table 6 reports the performance comparison on CTA. RoBERTa needs fine-tuning for this task. We report its results with two options, one with 159 shots of training data and the other with 356 shots, following the numbers in [44]. GPT-4 still performs the best. Even without any tuning for this task, Jellyfish models still outperform RoBERTa fine-tuned with 159 shots. Another observation is that Jellyfish-7B is slightly better than Jellyfish-13B. We think this could be attributed to the higher generalizability of the Mistral-7B model.

For AVE, we report results in Table 7. The two 7B Jellyfish models are by far the winners, showcasing superb generalizability to this unseen task. Jellyfish-13B also surpasses the two 70B models.

Table 8: Impact of instruction tuning for DP on the unseen task of CTA. “+ task” denotes the model tuned for the task.

OOP2-13B	+ ED	+ DI	+ SM	+ EM	Jellyfish-13B
56.40	74.20	79.20	76.70	71.50	82.00

Table 9: Impact of prompt engineering on the unseen task of CTA, varying options in stages and chain-of-thought (CoT) over Jellyfish-13B.

One-stage, w/o CoT	One-stage, w/ CoT	Two-stage, w/o CoT	Two-stage, w/ CoT
51.50	58.00	67.00	82.00

To drill down to the impact of tuning on unseen tasks, we investigate CTA as an example. Table 8 helps us find out which task contributes the most to this unseen task. When tuning with only one task, the model reports a micro-F1 in the range of 71% – 79%, with DI being the highest. We suppose this is because DI is exactly the inverse operation of CTA, i.e., DI fills in the value of an attribute, whereas CTA infers the type of an attribute given a set of sample values. Moreover, the four tasks jointly contribute to an overall micro-F1 of 82%, which surpasses the performance of tuning with only DI, showcasing the usefulness of the other tasks as well.

Further, we conduct an ablation study to examine the impact of prompting and report the results in Table 9. The two-stage pipeline performs better than the one-stage pipeline, and chain-of-thought, which splits the inference of column types into four steps, is also useful, in line with the observation in [44]. This demonstrates that the prompt engineering techniques developed for existing LLM-based solutions also work for Jellyfish-13B. As a result, designing prompts for Jellyfish-13B on unseen tasks becomes much easier, as users may directly follow those used in existing works.

Table 10: NLP performance on the Open LLM Leaderboard.

Model	MMLU	WinoGrande	ARC	TruthfulQA	GSM8K	HellaSwag	Average
Model	(5-shot)	(0-shot)	(25-shot)	(0-shot)	(8-shot)	(10-shot)
OOP2-13B	54.49	74.03	62.63	52.56	25.32	83.24	58.71
Jellyfish-13B	53.04 (-1.45)	74.19 (+0.16)	62.88 (+0.25)	52.56 (+0.00)	24.26 (-1.06)	83.16 (-0.08)	58.35 (-0.36)
Mistral-7B	62.91	73.88	63.48	66.91	41.32	84.79	65.55
Jellyfish-7B	62.08 (-0.83)	72.69 (-1.19)	63.48 (+0.00)	64.76 (-2.15)	37.91 (-3.41)	84.48 (-0.31)	64.23 (-1.32)

6.3 NLP Performance

We compare Jellyfish models and their original models on various NLP benchmarks [30, 87, 59, 11, 14, 111] used in the Open LLM Leaderboard [20], as shown in Table 10. For OOP2-13B, the performance on NLP tasks barely decreases after tuning for DP, with a 0.36 drop on average, and even improves on two benchmarks. For Mistral-7B, the NLP performance deterioration is more significant, but still within a 1.32 drop on average. We think this is because the 7B model, with fewer parameters, is more prone to losing some of its original capability after fine-tuning [65].
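The averages and average drops quoted above can be reproduced from the per-benchmark scores in Table 10. The sketch below is only a consistency check. It assumes the average is the unweighted mean over the six benchmarks and that the drop is the difference between the two rounded averages.

```python
# Per-benchmark scores from Table 10, in the order:
# MMLU, WinoGrande, ARC, TruthfulQA, GSM8K, HellaSwag.
scores = {
    "OOP2-13B": [54.49, 74.03, 62.63, 52.56, 25.32, 83.24],
    "Jellyfish-13B": [53.04, 74.19, 62.88, 52.56, 24.26, 83.16],
    "Mistral-7B": [62.91, 73.88, 63.48, 66.91, 41.32, 84.79],
    "Jellyfish-7B": [62.08, 72.69, 63.48, 64.76, 37.91, 84.48],
}


def average(values):
    """Unweighted mean, rounded to two decimals as in Table 10."""
    return round(sum(values) / len(values), 2)


def average_change(base, tuned):
    """Change in the rounded average after tuning (negative means a drop)."""
    return round(average(scores[tuned]) - average(scores[base]), 2)


for base, tuned in [("OOP2-13B", "Jellyfish-13B"), ("Mistral-7B", "Jellyfish-7B")]:
    print(
        f"{base}: {average(scores[base]):.2f} -> "
        f"{tuned}: {average(scores[tuned]):.2f} "
        f"({average_change(base, tuned):+.2f})"
    )
```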

6.4 Impact of Data Configuration in Jellyfish

We study how data configuration in the four tasks used for tuning impacts the performance. For this set of experiments, we randomly sample data from the datasets in Table 1 and disable the data preparation techniques in Section 3.1, in order to see the impact of dataset size clearly.

To simplify the evaluation, we first tune OOP2-13B with data for a single DP task and evaluate its effect. By varying the amount of data, Figure 3 displays how the tuning data for a specific task affects the DP performance. In general, the four tasks are all useful in improving the overall DP performance. For intra-task performance (e.g., ED to ED), as expected, the tuning data has a significantly positive impact. For inter-task performance, ED and SM are generally positive to other tasks, while DI and EM report negative effects. Such impact on the overall DP performance is also observed when we increase the amount of tuning data (e.g., doubling EM from 21k to 43k). We also find that DI can benefit from all the other three tasks. We think this is because the other three tasks all contain correct values for the attributes, thereby enhancing the model’s ability to fill in missing values. In addition, the benefit of increasing tuning data for SM is obvious, in line with our data preparation technique in Section 3.1, i.e., in case of low-quality data, the amount of data should be increased to ensure the LLM can learn effectively. Overall, this experiment suggests that, for the sake of DP performance, we should moderately increase data for SM and reduce data for DI.

Next, we study the impact of tuning OOP2-13B with multi-task data and plot the results in Figure 4. By feeding the tuning set with data for more tasks, it is obvious that they jointly contribute to better DP performance, and the improvement is consistent. When the data is fully utilized, as indicated by (1, 1, 1, 1), the model achieves the best performance. Based on the above results, we construct the Jellyfish data by appropriately choosing the size of data for each task. Moreover, with the data preparation techniques (Section 3.1) applied, Jellyfish-13B, even with a smaller amount of tuning data, performs better than (1, 1, 1, 1) in Figure 4.

Then, we evaluate how the data for a specific DP task affects the NLP performance and report the results in Figure 5. In general, ED and EM exhibit positive impacts on the overall NLP performance. By increasing the amount of tuning data, all the tasks, except DI, are positive to NLP tasks. Specifically, SM turns from negative to positive when the dataset size is doubled, whereas the trend for DI is reversed, resulting in a significant drop. To drill down to each benchmark, all the four tasks are positive to WinoGrande, while they are generally negative to MMLU, and neutral to the other benchmarks, roughly in line with the results in Table 10. This experiment indicates that we need to choose an appropriate data size for each DP task, specifically, with moderately less data for DI, to prevent the model from losing its NLP capability.

We also test the impact of tuning OOP2-13B with multi-task data on its NLP performance over the six benchmarks used in Table 10. The results are reported in Figure 6. The general trend is that with data for more tasks, the NLP performance has a drop, yet this change, as shown in more sporadic points, is less consistent than what we observed in Figure 4. It is noteworthy that the overall decrease in NLP performance is moderate, with an average of 0.36 (from 58.71 to 58.35) for Jellyfish-13B.

Table 11: Impact of base models. “+ DP(P)” denotes the model tuned with Jellyfish-preview data. Knowledge injection is disabled.

Task	Dataset	Model
Task	Dataset	Llama-2-13B-Chat	OO-13B (Llama 2 + OpenOrca)	Platypus2-13B (Llama 2 + Open-Platypus)	OOP2-13B (Llama 2 + OpenOrca + Open-Platypus)	Mistral-7B	Llama-2-13B-Chat + DP(P)	OO-13B + DP(P)	Platypus2-13B + DP(P)	OOP2-13B + DP(P)	Mistral-7B + DP(P)
ED	Adult	5.92	33.67	7.73	42.77	20.66	93.62	93.49	93.49	96.62	98.79
ED	Hospital	8.78	64.05	6.29	63.24	37.09	81.55	89.67	90.58	92.01	94.13
DI	Buy	95.38	75.38	41.54	89.23	76.92	92.31	90.77	87.69	100	98.46
DI	Restaurant	90.70	88.37	86.05	81.40	18.75	89.53	90.70	88.37	89.53	88.37
SM	Synthea	0.97	0.00	0.68	22.22	26.67	22.22	22.22	28.57	36.36	25.00
EM	Amazon-Google	14.58	25.62	25.64	36.70	36.51	40.00	49.77	42.35	48.20	69.03
	Beer	39.13	81.48	11.76	85.71	69.57	95.55	93.33	93.33	96.55	92.86
	DBLP-ACM	45.95	78.84	0.00	78.86	85.30	97.45	97.66	97.35	97.35	97.51
	DBLP-GoogleScholar	35.71	56.07	40.73	59.48	59.54	92.27	92.22	92.87	92.83	92.30
	Fodors-Zagats	42.86	84.21	39.56	92.68	66.67	97.67	100	100	100	100
	iTunes-Amazon	30.43	63.53	0.00	57.45	70.97	96.15	96.15	96.15	96.30	96.30

6.5 Impact of Base Models

In addition to OOP2-13B and Mistral-7B evaluated in the above experiments, we also consider the following LLMs, which are the basis for constructing OOP2-13B: (1) Llama-2-13B-Chat, the chat model of Llama 2; (2) OO-13B, which is short for OpenOrcaxOpenChat-Preview2-13B, Llama 2-13B fine-tuned with OpenOrca [58]; and (3) Platypus2-13B, Llama 2-13B fine-tuned with Open-Platypus [49].

Table 11 reports the results of various LLMs tuned for DP tasks. Here, we use Jellyfish-preview data (Section 3.3) – smaller than the full tuning data – for fast tuning and comparison. For models without tuning, it can be seen that OO-13B, tuned with augmented FLAN, roughly performs better than Llama 2. Platypus2-13B, tuned with Open-Platypus, though not delivering better overall performance than Llama 2, jointly contributes to the superiority of OOP2-13B. This advantage is also observed when Jellyfish-preview is applied, with OOP2-13B + DP(P) being the winner on 5 out of 11 datasets. Moreover, with Jellyfish-preview, we also observe the advantage of Platypus2-13B over Llama 2, showcasing its usefulness not only jointly but also individually. Meanwhile, we also observe the competitiveness of Mistral-7B, which is on a par with 13B models and even better on ED. Overall, this evaluation justifies the effectiveness of instruction tuning for DP and demonstrates the usefulness of enhancing reasoning capabilities (OO-13B) and logic proficiency (Platypus2-13B).

Table 12: Impact of knowledge injection on EM. “+ EM(P)” denotes the model tuned using only EM datasets in Jellyfish-preview. Few-shot is disabled.

Type	Dataset	Model
Type	Dataset	OOP2-13B	+ EM(P) w/o knowledge	+ EM(P) w/ knowledge
Seen	Amazon-Google	36.70	47.54	50.53
	Beer	85.71	85.71	92.86
	DBLP-ACM	78.86	85.33	90.26
	DBLP-GoogleScholar	59.48	90.46	91.54
	Fodors-Zagats	92.68	100	100
	iTunes-Amazon	57.45	98.11	98.18
Unseen	Abt-Buy	61.78	83.35	84.44
Unseen	Walmart-Amazon	67.29	71.71	73.18

6.6 Impact of Knowledge Injection

To evaluate the impact of knowledge injection, we consider the case of EM as an example and compare the models tuned with the EM datasets in Jellyfish-preview. Table 12 reports the results for OOP2-13B and its tuned version with knowledge either injected or not. Comparing OOP2-13B and the one without injected knowledge, as expected, the performance on EM is significantly raised. When we turn on knowledge injection, the performance further improves and the improvement is consistent. Furthermore, the improvement is also observed on unseen datasets, because like seen datasets of Amazon-Google and Beer, they are also product data. This observation suggests that the domain knowledge learned through tuning indeed enhances the model’s generalizability to unseen datasets.

Table 13: Head-to-head comparison of GPT-3.5 and Jellyfish-7B-I on interpretation, judged by GPT-4.

Task	Dataset	Model
Task	Dataset	GPT-3.5	Jellyfish-7B-I
ED	Adult	11	9
ED	Hospital	9	11
DI	Buy	0	20
DI	Restaurant	10	10
SM	Synthea	8	12
EM	Amazon-Google	8	12
	Beer	7	13
	DBLP-ACM	8	12
	DBLP-GoogleScholar	4	16
	Fodors-Zagats	12	8
	iTunes-Amazon	19	1
Total		96	124
Winning Rate		43.64%	56.36%

6.7 Evaluation of Interpretation

We evaluate the performance of Jellyfish-7B-I and compare it with GPT-3.5 (gpt-3.5-turbo-0613). Given an answer output by Jellyfish-7B-I, we generate reasons using both Jellyfish-7B-I and GPT-3.5, and request GPT-4 to decide which one is better. Note that GPT-4 is unaware of the correct answer to the question in DP. As such, it needs to judge by its own analysis of the question as well. We do not choose other LLMs (in particular, Gemini Pro and Claude 2) as the judge because these models fail to convincingly compare Jellyfish-7B-I’s and GPT-3.5’s answers. They tend to choose overly simplified answers over comprehensive ones, and even prefer GPT-3.5’s answers to GPT-4’s.

Table 13 reports the head-to-head comparison of the two models. Jellyfish-7B-I defeats GPT-3.5 on 8 out of 11 datasets, achieving an overall winning rate of 56% across 220 tests. To delve into detailed comparison, we show an example on the Amazon-Google dataset in EM. The prompt is as follows.

The reason given by Jellyfish-7B-I is as follows. The two products are not the same. Product A is “Adobe Dreamweaver CS3” specifically for Mac, while Product B is “Adobe CS3 Web Premium”. Adobe Dreamweaver is a web development tool used for designing, coding, and managing websites. On the other hand, Adobe CS3 Web Premium is a suite of applications that includes Adobe Dreamweaver along with other software like Photoshop, Illustrator, and Flash. Therefore, Product B offers more than just the Dreamweaver software, which makes it different from Product A.
No

The reason given by GPT-3.5 is as follows. No, Product A and Product B are not the same product. The reason is that although both products are from Adobe and are part of the CS3 suite, they have different names and prices. Product A is specifically named “Adobe Dreamweaver CS3 [Mac]” and is priced at $399.0, while Product B is named “Adobe CS3 Web Premium” and is priced at $1659.99. The difference in names and prices indicates that they are different products within the CS3 suite, likely with different features and functionalities.
Therefore, Product A and Product B are not the same product.
Final answer: No.

GPT-4 compares the outputs of both models in four aspects (we summarize GPT-4’s response due to its length): (1) Structure: Both models effectively follow the instruction to provide a detailed explanation before the final answer. (2) Focus: Jellyfish-7B-I emphasizes the different functionalities of the software, while GPT-3.5 focuses more on the difference in names and prices. (3) Adherence to Instruction: GPT-3.5 includes an unnecessary repetition of the conclusion before the final answer, which is not strictly in line with the instruction to finish the response with the final answer only. (4) Clarity: Jellyfish-7B-I presents a more concise and focused response by avoiding repetition and sticking to the instruction more closely. Based on the analysis, GPT-4 decides that Jellyfish-7B-I performed better because it adhered more strictly to the original instruction by providing a clear, concise response without unnecessary repetition. In addition, with its built-in knowledge, Jellyfish-7B-I pointed out the difference in functionalities, whereas GPT-3.5 merely described the difference on the surface.

Furthermore, we find that when reasoning, GPT-3.5 even fails to respond with a correct answer of matching or not for EM, as shown in its mediocre performance on datasets like Amazon-Google and DBLP-GoogleScholar in Table 4. In contrast to the above example of Jellyfish-7B-I’s landslide win, GPT-3.5 only has a slight edge when it wins. For instance, in an example of the Amazon-Google dataset, GPT-4 points out that GPT-3.5 has more focused justification and additional insights into the implications of the differences between the products, yet it also mentions that GPT-3.5’s repetition of the final answer is a minor deviation from the instruction’s format.
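As a consistency check, the totals and winning rates in Table 13 can be recomputed from the per-dataset counts. The sketch below assumes that each of the 11 datasets contributes 20 pairwise judgments and that the winning rate is a model's share of all 220 judgments.

```python
# (task, dataset) -> (judgments won by GPT-3.5, judgments won by Jellyfish-7B-I),
# copied from Table 13.
judgments = {
    ("ED", "Adult"): (11, 9),
    ("ED", "Hospital"): (9, 11),
    ("DI", "Buy"): (0, 20),
    ("DI", "Restaurant"): (10, 10),
    ("SM", "Synthea"): (8, 12),
    ("EM", "Amazon-Google"): (8, 12),
    ("EM", "Beer"): (7, 13),
    ("EM", "DBLP-ACM"): (8, 12),
    ("EM", "DBLP-GoogleScholar"): (4, 16),
    ("EM", "Fodors-Zagats"): (12, 8),
    ("EM", "iTunes-Amazon"): (19, 1),
}

gpt_total = sum(gpt for gpt, _ in judgments.values())
jellyfish_total = sum(jf for _, jf in judgments.values())
all_tests = gpt_total + jellyfish_total

gpt_rate = round(100 * gpt_total / all_tests, 2)
jellyfish_rate = round(100 * jellyfish_total / all_tests, 2)

print(f"GPT-3.5:        {gpt_total} wins, {gpt_rate:.2f}%")
print(f"Jellyfish-7B-I: {jellyfish_total} wins, {jellyfish_rate:.2f}%")
```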

6.8 Comparison of Efficiency

Instruction tuning takes around 5 hours for Jellyfish-13B, 2.5 hours for Jellyfish-7B, and 3.5 hours for Jellyfish-7B-I. For inference, Jellyfish-13B takes 0.08 – 0.15 seconds on average to process an instance. As a reference, GPT-4 takes an average of 1 – 8 seconds per instance. Although LLMs require substantial computational resources, thereby increasing the cost of use and compromising efficiency, some non-LLM methods, such as RoBERTa and those built upon it (e.g., IPM), need fine-tuning when applied to unseen datasets. This fine-tuning time should be counted towards the total time expense for a fair comparison. Moreover, advanced learning techniques enable Jellyfish models to be quantized [61] or distilled to improve efficiency, which will be considered in the future.
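To put the per-instance latencies in perspective, the following sketch bounds the speed ratio between the two models from the reported ranges. It assumes the endpoints of the two ranges can be paired freely, so the result is only a rough bound rather than a measured speedup. The function name is illustrative.

```python
def speedup_bounds(fast_range, slow_range):
    """Lower and upper bounds on slow/fast latency ratio from two (min, max) ranges."""
    fast_min, fast_max = fast_range
    slow_min, slow_max = slow_range
    # Smallest ratio: fastest slow-model time over slowest fast-model time.
    # Largest ratio: slowest slow-model time over fastest fast-model time.
    return slow_min / fast_max, slow_max / fast_min


# Average seconds per instance, as reported in this section.
jellyfish_13b_latency = (0.08, 0.15)
gpt4_latency = (1.0, 8.0)

lower, upper = speedup_bounds(jellyfish_13b_latency, gpt4_latency)
print(f"Jellyfish-13B is roughly {lower:.1f}x to {upper:.0f}x faster per instance")
```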

7 Related Works

Since works on LLMs have been introduced in Section 2.2, we briefly review related works on DP here.

Seen Tasks. The tasks targeted in this paper collectively form the most critical part of DP, and they have been extensively studied.

•

ED: Traditional methods mainly depend on hand-crafted rules [12], pattern discovery [13], outlier detection [79], or statistical modeling [34, 102]. Recent works employ more advanced ML techniques such as few-shot learning based on a noisy channel model (HoloDetect) [28], or resort to a series of ML pipelines (Raha) [68], including feature engineering, clustering, and classification.
•

DI: While rule-based solutions [83, 91] remain one of the prevalent approaches, another stream of works develops ML models for this task, including variational autoencoders [75], generative adversarial networks [110], and attention mechanisms [106, 96]. To seek better imputation performance, recent progress utilizes PLMs to capture semantics [69].
•

SM: The use of similarity matrices is a traditional way [86]. More advanced methods utilize ML techniques [23], including deep learning models [90]. SMAT [113] is an approach leveraging attention-based deep learning. A recent attempt employs GPT-4 for SM [89].
•

EM: The procedure is divided into blocking and in-block pairwise matching for the sake of efficiency. Blocking groups pairs of entities that potentially match into the same block, and then pairwise matching is performed within each block to find matching entities. Traditional solutions for blocking mostly rely on attribute equivalence, hashes, or similarities [77]. Recently, the feasibility of using DL methods for blocking has also been examined [95], following the use of DL for pairwise matching [71]. In addition, there are tools that handle both steps such as Magellan [43] and Ditto [57]. A recent evaluation validates the effectiveness of in-context learning in enhancing LLMs’ EM performance [78].
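The two-step EM procedure described above (blocking followed by in-block pairwise matching) can be illustrated with a minimal sketch. It uses attribute-equivalence blocking and a token-level Jaccard similarity as a stand-in matcher. The records, field names, and threshold are hypothetical, and a real system would replace the Jaccard test with a learned matcher such as Ditto or an LLM.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(text_a, text_b):
    """Token-level Jaccard similarity between two strings."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def block_by_key(records, key):
    """Blocking by attribute equivalence: records sharing `key` form a block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def match_within_blocks(blocks, text_field, threshold):
    """Pairwise matching restricted to record pairs in the same block."""
    matches = []
    for members in blocks.values():
        for left, right in combinations(members, 2):
            if jaccard(left[text_field], right[text_field]) >= threshold:
                matches.append((left["id"], right["id"]))
    return matches


# Hypothetical product records, for illustration only.
records = [
    {"id": 1, "brand": "adobe", "title": "Adobe Dreamweaver CS3 Mac"},
    {"id": 2, "brand": "adobe", "title": "Dreamweaver CS3 for Mac by Adobe"},
    {"id": 3, "brand": "adobe", "title": "Adobe CS3 Web Premium"},
    {"id": 4, "brand": "microsoft", "title": "Microsoft Office 2007"},
]

blocks = block_by_key(records, "brand")
matches = match_within_blocks(blocks, "title", threshold=0.6)
print(matches)
```

Blocking keeps the number of comparisons low: only the three pairs inside the "adobe" block are compared, instead of all six pairs among the four records.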

Unseen Tasks. We review the related studies on CTA and AVE.

•

CTA: As a typical table understanding task, it often appears in the studies on table representation learning [36, 17, 92]. These approaches fine-tune PLMs, typically BERT [18] and its variants [38, 63]. Recently, ChatGPT has been utilized to solve this task [44].
•

AVE: Early approaches employ LSTM-CRF [46, 120]. With the prevalence of PLMs, many solutions to AVE, like those to CTA, resort to using BERT [109, 103, 121]. A recent work [4] considered fine-tuning GPT-3.5 and prompting GPT-4, and compared them with open-source LLMs like Stable Beluga 2 [66] and SOLAR [100].

Generic Solution. Whereas the above solutions are specialized for a task, recent progress developed generic solutions to DP based on GPT-3 [73], or GPT-3.5 and GPT-4 [112], basically employing various prompt engineering techniques on frozen LLMs. Fine-tuning GPT-3.5 and ChatGPT for a variety of table-related tasks has also been investigated [54], and several DP tasks are covered.

Other DP Tasks. Besides the ones covered by this paper, there are many other DP tasks. We name a few examples.

•

Data repairing corrects erroneous values in a dataset. Typical solutions are HoloClean [83] and Baran [67]. HoloClean can detect errors and subsequently repair them. Baran only repairs errors and resorts to Raha to detect them. Recent advancements [51, 81] utilized Bayesian inference to capture dependencies between attributes.
•

Data fusion is the process of integrating multiple data sources that contain information about the same set of entities, with possibly conflicting attribute values. Surveys of early attempts are available [55, 6], with a detailed comparison of various fusion methods on deep web data [55]. More recent endeavors targeted multi-truth data fusion [2] and golden record [29].
•

Data transformation is the process of converting data from one format into another format. Notable approaches are transformation by user-specified examples [27] and learning from large collections of paired table columns [39]. In addition, the aforementioned generic DP solution also covers this task [73].

Data Preparation. DP is also studied in the name of data preparation, which manipulates raw data into a form that can be readily analyzed. A notable Python library is DataPrep [48]. In addition to the DP tasks listed above, data augmentation [10, 70, 62, 119] is another key operation in data preparation. Another line of work studies dataset discovery [3, 45, 21, 74], particularly for integrating data lake tables [41] where joinable [19], unionable [40], and related table search [116] are often used for identifying candidates. Despite search speed being a key concern, LLMs are anticipated to be used on top of their outcomes for automated integration in a data lake [1].

8 Conclusions

We studied the problem of instruction-tuning LLMs as universal DP task solvers. By devising a series of data preparation, instance serialization, and knowledge injection techniques, we proposed Jellyfish, a dataset for this purpose. LLMs tuned with Jellyfish are adept at understanding natural language, enabling users to craft instructions manually for DP tasks. Another notable feature of Jellyfish is its reasoning data, which can be used for tuning for interpretation, thereby providing explanations of the model’s outputs. For evaluation, we tuned two models, Jellyfish-7B and Jellyfish-13B, which can operate on a local GPU without compromising data security. The experiments demonstrated their competitiveness against existing DP solutions, impressive generalizability to new tasks, the ability to retain performance in NLP tasks, as well as the competence in interpreting the model’s output.

Future research directions include expanding Jellyfish to encompass more DP tasks, such as data repairing and data transformation. Furthermore, we are considering the development of a quantized or distilled model to enhance processing speed, as well as a multi-agent system for adaptable, conversational, code-free DP pipelines.

Acknowledgements

This work is supported by NEC Corporation and JSPS Kakenhi 22H03903, 23H03406, 23K17456, and JST CREST JPMJCR22M2. We thank Prof. Makoto Onizuka, Yuki Arase, and Yuya Sasaki for providing equipment support for completing this research.

References

[1] S. Arora, B. Yang, S. Eyuboglu, A. Narayan, A. Hojel, I. Trummer, and C. Ré. Language models enable simple systems for generating structured views of heterogeneous data lakes. arXiv preprint arXiv:2304.09433, 2023.
[2] F. Azzalini, D. Piantella, E. Rabosio, and L. Tanca. Enhancing domain-aware multi-truth data fusion using copy-based source authority and value similarity. The VLDB Journal, 32(3):475–500, 2023.
[3] A. Bogatu, A. A. Fernandes, N. W. Paton, and N. Konstantinou. Dataset discovery in data lakes. In ICDE, pages 709–720. IEEE, 2020.
[4] A. Brinkmann, R. Shraga, and C. Bizer. Product attribute value extraction using large language models. arXiv preprint arXiv:2310.12537, 2023.
[5] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020.
[6] G. K. Canalle, A. C. Salgado, and B. F. Loscio. A survey on data fusion: what for? in what form? what is next? Journal of Intelligent Information Systems, 57:25–50, 2021.
[7] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L.-D. Ibáñez, E. Kacprzak, and P. Groth. Dataset search: a survey. The VLDB Journal, 29(1):251–272, 2020.
[8] L. Cheng, X. Li, and L. Bing. Is GPT-4 a good data analyst? arXiv preprint arXiv:2305.15038, 2023.
[9] Z. Cheng, J. Kasai, and T. Yu. Batch prompting: Efficient inference with large language model APIs. arXiv preprint arXiv:2301.08721, 2023.
[10] N. Chepurko, R. Marcus, E. Zgraggen, R. C. Fernandez, T. Kraska, and D. Karger. ARDA: automatic relational data augmentation for machine learning. arXiv preprint arXiv:2003.09758, 2020.
[11] F. Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
[12] X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458–469. IEEE, 2013.
[13] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247–1261, 2015.
[14] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[15] T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
[16] S. Das, A. Doan, P. S. G. C., C. Gokhale, P. Konda, Y. Govind, and D. Paulsen. The magellan data repository. https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository.
[17] X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu. TURL: Table understanding through representation learning. ACM SIGMOD Record, 51(1):33–40, 2022.
[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[19] Y. Dong, C. Xiao, T. Nozawa, M. Enomoto, and M. Oyamada. DeepJoin: Joinable table discovery with pre-trained language models. arXiv preprint arXiv:2212.07588, 2022.
[20] Hugging Face. Open LLM leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2024.
[21] G. Fan, J. Wang, Y. Li, D. Zhang, and R. Miller. Semantics-aware dataset discovery from data lakes with contextualized column-based representation learning. arXiv preprint arXiv:2210.01922, 2022.
[22] Flipkart.com. Flipkart products. https://www.kaggle.com/datasets/PromptCloudHQ/flipkart-products.
[23] A. Gal, H. Roitman, and R. Shraga. Learning to rerank schema matches. IEEE Transactions on Knowledge and Data Engineering, 33(8):3104–3116, 2019.
[24] C. Gao, X. Lan, N. Li, Y. Yuan, J. Ding, Z. Zhou, F. Xu, and Y. Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives. arXiv preprint arXiv:2312.11970, 2023.
[25] R. Hai, C. Koutras, C. Quix, and M. Jarke. Data lakes: A survey of functions and systems. IEEE Transactions on Knowledge and Data Engineering, 2023.
[26] J. Han, J. Pei, and H. Tong. Data mining: concepts and techniques. Morgan Kaufmann, 2022.
[27] Y. He, X. Chu, K. Ganjam, Y. Zheng, V. Narasayya, and S. Chaudhuri. Transform-data-by-example (TDE) an extensible search engine for data transformations. PVLDB, 11(10):1165–1177, 2018.
[28] A. Heidari, J. McGrath, I. F. Ilyas, and T. Rekatsinas. HoloDetect: Few-shot learning for error detection. In SIGMOD, pages 829–846, 2019.
[29] A. Heidari, G. Michalopoulos, I. F. Ilyas, and T. Rekatsinas. Record fusion via inference and data augmentation. ACM/IMS Journal of Data Science, 2023.
[30] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[31] S. Hong, X. Zheng, J. Chen, Y. Cheng, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, et al. MetaGPT: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[32] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for NLP. In ICML, pages 2790–2799. PMLR, 2019.
[33] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[34] Z. Huang and Y. He. Auto-detect: Data-driven error detection in tables. In SIGMOD, pages 1377–1392, 2018.
[35] Hugging Face. Llama and llama 2 variants. https://huggingface.co/models?other=llama, 2023.
[36] H. Iida, D. Thai, V. Manjunatha, and M. Iyyer. Tabbie: Pretrained representations of tabular data. arXiv preprint arXiv:2105.02584, 2021.
[37] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
[38] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.
[39] Z. Jin, Y. He, and S. Chaudhuri. Auto-transform: learning-to-transform by patterns. PVLDB, 13(12):2368–2381, 2020.
[40] A. Khatiwada, G. Fan, R. Shraga, Z. Chen, W. Gatterbauer, R. J. Miller, and M. Riedewald. SANTOS: Relationship-based semantic table union search. SIGMOD, 1(1):1–25, 2023.
[41] A. Khatiwada, R. Shraga, W. Gatterbauer, and R. J. Miller. Integrating data lake tables. PVLDB, 16(4):932–945, 2022.
[42] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
[43] P. Konda, S. Das, A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, et al. Magellan: toward building entity matching management systems over data science stacks. PVLDB, 9(13):1581–1584, 2016.
[44] K. Korini and C. Bizer. Column type annotation using ChatGPT. arXiv preprint arXiv:2306.00745, 2023.
[45] C. Koutras, G. Siachamis, A. Ionescu, K. Psarakis, J. Brons, M. Fragkoulis, C. Lofi, A. Bonifati, and A. Katsifodimos. Valentine: Evaluating matching techniques for dataset discovery. In ICDE, pages 468–479. IEEE, 2021.
[46] Z. Kozareva, Q. Li, K. Zhai, and W. Guo. Recognizing salient entities in shopping queries. In ACL, pages 107–111, 2016.
[47] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In SOSP, 2023.
[48] SFU Data Science Lab. DataPrep. https://dataprep.ai/.
[49] A. N. Lee, C. J. Hunter, and N. Ruiz. Platypus: Quick, cheap, and powerful refinement of LLMs. arXiv preprint arXiv:2308.07317, 2023.
[50] A. N. Lee, C. J. Hunter, N. Ruiz, B. Goodson, W. Lian, G. Wang, E. Pentland, A. Cook, C. Vong, and "Teknium". OpenOrcaPlatypus: Llama2-13B model instruct-tuned on filtered OpenOrcaV1 GPT-4 dataset and merged with divergent STEM and logic dataset model. https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B, 2023.
[51] A. Lew, M. Agrawal, D. Sontag, and V. Mansinghka. PClean: Bayesian data cleaning at scale with domain-specific probabilistic programming. In AISTATS, pages 1927–1935. PMLR, 2021.
[52] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 33:9459–9474, 2020.
[53] H. Li, Y. Su, D. Cai, Y. Wang, and L. Liu. A survey on retrieval-augmented text generation. arXiv preprint arXiv:2202.01110, 2022.
[54] P. Li, Y. He, D. Yashar, W. Cui, S. Ge, H. Zhang, D. R. Fainman, D. Zhang, and S. Chaudhuri. Table-GPT: Table-tuned GPT for diverse table tasks. arXiv preprint arXiv:2310.09263, 2023.
[55] X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? arXiv preprint arXiv:1503.00303, 2015.
[56] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
[57] Y. Li, J. Li, Y. Suhara, A. Doan, and W.-C. Tan. Deep entity matching with pre-trained language models. PVLDB, 14(1):50–60, 2020.
[58] W. Lian, B. Goodson, E. Pentland, A. Cook, C. Vong, and "Teknium". OpenOrca: An open dataset of GPT augmented FLAN reasoning traces. https://huggingface.co/Open-Orca/OpenOrca, 2023.
[59] S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
[60] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
[61] S.-y. Liu, Z. Liu, X. Huang, P. Dong, and K.-T. Cheng. LLM-FP4: 4-bit floating-point quantized transformers. arXiv preprint arXiv:2310.16836, 2023.
[62] T. Liu, J. Fan, Y. Luo, N. Tang, G. Li, and X. Du. Adaptive data augmentation for supervised learning over missing data. PVLDB, 14(7):1202–1214, 2021.
[63] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[64] W. Lu, J. Zhang, J. Zhang, and Y. Chen. Large language model for table processing: A survey. arXiv preprint arXiv:2402.05121, 2024.
[65] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, and Y. Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747, 2023.
[66] D. Mahan, R. Carlow, L. Castricato, N. Cooper, and C. Laforte. Stable beluga 2. https://huggingface.co/stabilityai/StableBeluga2, 2023.
[67] M. Mahdavi and Z. Abedjan. Baran: Effective error correction via a unified context representation and transfer learning. PVLDB, 13(12):1948–1961, 2020.
[68] M. Mahdavi, Z. Abedjan, R. Castro Fernandez, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. Raha: A configuration-free error detection system. In SIGMOD, pages 865–882, 2019.
[69] Y. Mei, S. Song, C. Fang, H. Yang, J. Fang, and J. Long. Capturing semantics for imputation with pre-trained language models. In ICDE, pages 61–72. IEEE, 2021.
[70] Z. Miao, Y. Li, and X. Wang. Rotom: A meta-learned data augmentation framework for entity matching, data cleaning, text classification, and beyond. In SIGMOD, pages 1303–1316, 2021.
[71] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In SIGMOD, pages 19–34, 2018.
[72] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707, 2023.
[73] A. Narayan, I. Chami, L. Orr, and C. Ré. Can foundation models wrangle your data? PVLDB, 16(4):738–746, 2022.
[74] F. Nargesian, K. Pu, B. Ghadiri-Bashardoost, E. Zhu, and R. J. Miller. Data lake organization. IEEE Transactions on Knowledge and Data Engineering, 35(1):237–250, 2022.
[75] A. Nazabal, P. M. Olmos, Z. Ghahramani, and I. Valera. Handling incomplete heterogeneous data using VAEs. Pattern Recognition, 107:107501, 2020.
[76] OpenAI. March 20 ChatGPT outage: Here’s what happened, 2023.
[77] G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas. Blocking and filtering techniques for entity resolution: A survey. ACM Computing Surveys, 53(2):1–42, 2020.
[78] R. Peeters and C. Bizer. Entity matching using large language models. arXiv preprint arXiv:2310.11244, 2023.
[79] N. Prokoshyna, J. Szlichta, F. Chiang, R. J. Miller, and D. Srivastava. Combining quantitative and logical data cleaning. PVLDB, 9(4):300–311, 2015.
[80] C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, and M. Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
[81] J. Qin, S. Huang, Y. Wang, J. Zhu, Y. Zhang, Y. Miao, R. Mao, M. Onizuka, and C. Xiao. BClean: A Bayesian data cleaning system. arXiv preprint arXiv:2311.06517, 2023.
[82] S. Razniewski, A. Yates, N. Kassner, and G. Weikum. Language models as or for knowledge bases. arXiv preprint arXiv:2110.04888, 2021.
[83] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(10):1190–1201, 2017.
[84] A. Reviews. Amazon reviews: Unlocked mobile phones. https://www.kaggle.com/datasets/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones.
[85] D. Ritze, O. Lehmberg, Y. Oulabi, and C. Bizer. Profiling the potential of web tables for augmenting cross-domain knowledge bases. In WWW, pages 251–261, 2016.
[86] T. Sagi and A. Gal. Schema matching prediction with applications to data source discovery and dynamic ensembling. The VLDB Journal, 22:689–710, 2013.
[87] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
[88] J. Savelka, K. D. Ashley, M. A. Gray, H. Westermann, and H. Xu. Can GPT-4 support analysis of textual data in tasks requiring highly specialized domain expertise? arXiv preprint arXiv:2306.13906, 2023.
[89] E. Sheetrit, M. Brief, M. Mishaeli, and O. Elisha. ReMatch: Retrieval enhanced schema matching with LLMs. arXiv preprint arXiv:2403.01567, 2024.
[90] R. Shraga, A. Gal, and H. Roitman. Adnev: Cross-domain schema matching using deep similarity matrix adjustment and evaluation. PVLDB, 13(9):1401–1415, 2020.
[91] S. Song, Y. Sun, A. Zhang, L. Chen, and J. Wang. Enriching data imputation under similarity rule constraints. IEEE transactions on knowledge and data engineering, 32(2):275–287, 2018.
[92] Y. Suhara, J. Li, Y. Li, D. Zhang, Ç. Demiralp, C. Chen, and W.-C. Tan. Annotating columns with pre-trained language models. In SIGMOD, pages 1493–1503, 2022.
[93] D. Tang, Z. Chen, K. Kim, Y. Song, H. Tian, S. Ezzini, Y. Huang, J. Klein, and T. F. Bissyandé. Collaborative agents for software engineering. arXiv preprint arXiv:2402.02172, 2024.
[94] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, 3(6):7, 2023.
[95] S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, and A. Doan. Deep learning for blocking in entity matching: a design space exploration. PVLDB, 14(11):2459–2472, 2021.
[96] S. Tihon, M. U. Javaid, D. Fourure, N. Posocco, and T. Peel. DAEMA: Denoising autoencoder with mask attention. In ICANN, pages 229–240, 2021.
[97] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[98] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[99] J. Tu, J. Fan, N. Tang, P. Wang, G. Li, X. Du, X. Jia, and S. Gao. Unicorn: A unified multi-tasking model for supporting matching tasks in data integration. Proceedings of the ACM on Management of Data, 1(1):1–26, 2023.
[100] Upstage. Solar-0-70b-16bit. https://huggingface.co/upstage/SOLAR-0-70b-16bit, 2023.
[101] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023.
[102] P. Wang and Y. He. Uni-detect: A unified approach to automated error detection in tables. In SIGMOD, pages 811–828, 2019.
[103] Q. Wang, L. Yang, B. Kanagal, S. Sanghai, D. Sivakumar, B. Shu, Z. Yu, and J. Elsas. Learning to extract attribute value from product via question answering: A multi-task approach. In KDD, pages 47–55, 2020.
[104] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
[105] L. Weng. Prompt engineering. https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/, 2023.
[106] R. Wu, A. Zhang, I. Ilyas, and T. Rekatsinas. Attention-based learning for missing data imputation in HoloClean. MLSys, 2:307–325, 2020.
[107] Z. Wu, R. Peng, X. Han, S. Zheng, Y. Zhang, and C. Xiao. Smart agent-based modeling: On the use of large language models in computer simulations. arXiv preprint arXiv:2311.06330, 2023.
[108] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
[109] H. Xu, W. Wang, X. Mao, X. Jiang, and M. Lan. Scaling up open tagging from tens to thousands: Comprehension empowered attribute value extraction from product title. In ACL, pages 5214–5223, 2019.
[110] J. Yoon, J. Jordon, and M. Schaar. GAIN: Missing data imputation using generative adversarial nets. In ICML, pages 5689–5698, 2018.
[111] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
[112] H. Zhang, Y. Dong, C. Xiao, and M. Oyamada. Large language models as data preprocessors. arXiv preprint arXiv:2308.16361, 2023.
[113] J. Zhang, B. Shin, J. D. Choi, and J. C. Ho. SMAT: An attention-based deep learning solution to the automation of schema matching. In ADBIS, pages 260–274. Springer, 2021.
[114] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, T. Zhang, F. Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
[115] T. Zhang, X. Yue, Y. Li, and H. Sun. Tablellama: Towards open large generalist models for tables. arXiv preprint arXiv:2311.09206, 2023.
[116] Y. Zhang and Z. G. Ives. Finding related tables in data lakes for interactive data science. In SIGMOD, pages 1951–1966, 2020.
[117] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al. Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
[118] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
[119] Z. Zhao and R. Castro Fernandez. Leva: Boosting machine learning performance with relational embedding data augmentation. In SIGMOD, pages 1504–1517, 2022.
[120] G. Zheng, S. Mukherjee, X. L. Dong, and F. Li. Opentag: Open attribute value extraction from product profiles. In KDD, pages 1049–1058, 2018.
[121] T. Zhu, Y. Wang, H. Li, Y. Wu, X. He, and B. Zhou. Multimodal joint attribute prediction and value extraction for e-commerce product. arXiv preprint arXiv:2009.07162, 2020.

Appendix A Model Setup

The hyperparameter setup for tuning a Jellyfish model is:

• lora_target: q_proj, k_proj, v_proj, o_proj;
• per_device_train_batch_size: 2;
• gradient_accumulation_steps: 2;
• learning_rate: 3e-5;
• num_train_epochs: 5.0;
• lora_rank: 32;
• lora_alpha: 32.
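
As a rough illustration, the tuning hyperparameters above can be collected in a plain Python dictionary and used to derive two quantities that follow from them. The dictionary layout, the helper names, and the `num_devices` parameter are assumptions made for this sketch and are not taken from the actual training scripts; the alpha/rank scaling reflects the usual LoRA convention rather than anything stated in this appendix.

```python
# Values copied from the list above; the dict layout itself is an assumption.
train_config = {
    "lora_target": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 2,
    "learning_rate": 3e-5,
    "num_train_epochs": 5.0,
    "lora_rank": 32,
    "lora_alpha": 32,
}


def effective_batch_size(config, num_devices=1):
    """Number of samples contributing to one optimizer step.

    num_devices is a hypothetical parameter; the GPU count is not given here.
    """
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


def lora_scaling(config):
    """Common LoRA convention: the low-rank update is scaled by alpha / rank."""
    return config["lora_alpha"] / config["lora_rank"]


print(effective_batch_size(train_config))  # samples per optimizer step on one device
print(lora_scaling(train_config))          # scaling applied to the LoRA update
```

With these values, one device accumulates 2 × 2 = 4 samples per optimizer step, and the LoRA scaling factor is 32 / 32 = 1.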

The following hyperparameters are used for inference:

• temperature: 0.35;
• top_p: 0.9;
• top_k: 10.
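
To make the effect of these three settings concrete, the following is a minimal, library-free sketch of temperature scaling followed by top-k and top-p (nucleus) filtering. It is not the code path of any particular inference engine: the function names are hypothetical, and real implementations may apply the filters in a different order or handle ties differently.

```python
import math
import random


def filter_distribution(logits, temperature=0.35, top_k=10, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens (renormalized via k_mass below).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, rng=None, **kwargs):
    """Draw one token index from the filtered distribution."""
    rng = rng or random.Random()
    dist = filter_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

A temperature below 1 sharpens the distribution, so with the settings above the nucleus often shrinks to a handful of tokens. This suits DP tasks, where the expected output is a short answer such as yes or no.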

Appendix B Data Construction Prompts

B.1 Instruction Data

For instruction data, we show the prompt for each task, using one dataset as an example. We then show the prompt for reasoning data, which differs slightly from that for instruction data. The prompts for inference are the same as those for tuning, except that dataset-specific knowledge is optional. The prompts for reasoning ground-truth collection and head-to-head judging are used with GPT-4.

B.2 Reasoning Data

To construct reasoning data, we use the following prompt.

We use the following prompt to collect ground truth from GPT-4 (for the Beer dataset in EM).

In the above prompt, we inject a piece of knowledge specific to the dataset. With this additional knowledge, GPT-4 can produce high-quality reasoning results. Note that such knowledge is not given to Jellyfish models, because it is not always available for unseen datasets. In addition to the injected knowledge, GPT-4 also receives a hint stating the correct answer (yes or no). As such, we can guarantee that the generated reasoning always points in the correct direction. A sample answer from GPT-4 is given below.

The two products listed are not the same. Product A is named "Sequoia American Amber Ale" and is produced by "Wig And Pen". In contrast, Product B is "Aarhus Cains Triple A American Amber Ale" and is produced by "Aarhus Bryghus". Despite both being types of American Amber Ale, the names and manufacturers of the two products are different, indicating that they are distinct products.
No

Appendix C Injected Knowledge

Table 14: General knowledge.

Prompt
Missing values (N/A or "nan") should not be used as a basis for your decision.
If there are missing values, you should make inferences based only on the information that is available.

Table 15: Task-specific knowledge.

Task	Prompt
ED	Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don’t make sense given the context of the whole record. (Used when showing the whole record)
	Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don’t make sense for that attribute. (Used when showing only one attribute)
	Capitalization should not be a factor in deciding whether there is an error or not.
DI	Note that values such as ’nan’ and ’N/A’ mean missing values, and they are not considered as errors. (Used when we decide not to treat missing values as errors)
DI	Note that values such as ’nan’ and ’N/A’ mean missing values, and they ARE errors. (Used when we decide to treat missing values as errors)
EM	To determine if two values are identical, you need to examine both their full names and corresponding acronyms.

Table 16: Dataset-specific knowledge.

Task	Dataset	Prompt
ED	Adult	Both the ’age’ attribute and the ’hoursperweek’ attribute can represent a range of integer values.
	Adult	Verify the consistency of target attribute with related attributes to identify any errors.
	Hospital	The value of attribute "score" can be a percentage number.
DI	Restaurant	The city can often be deduced from the area code of the phone number and the specific street name.
EM	Amazon-Google	Different editions, versions, or operating systems for the same software are all considered as different products.
	Amazon-Google	You should compare the two product numbers first.
	Beer	Note that different factories can belong to the same parent company.
	Beer	Beverages that undergo different production processes, such as the use of various types of wood in the barrelling process, may be considered distinct products.
	Fodors-Zagats	The type of a specific restaurant might vary between different datasets.
	iTunes-Amazon	The length of the same song might vary slightly across different datasets due to rounding or data entry discrepancies.
	DBLP-ACM	The names of authors might be presented in various formats or sequences, even when referring to the same article.
	DBLP-GoogleScholar	The names of authors might be presented in various formats or sequences, even when referring to the same article.
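
The three levels of knowledge in Tables 14 to 16 can be combined into one block of prompt text. The sketch below shows one possible way to do so for EM, using sentences copied from the tables. The container names, the function name, and the order and newline-joining of the pieces are assumptions for illustration and not the exact prompt construction used for Jellyfish.

```python
# Sentences copied from Tables 14-16; only a subset is included in this sketch.
GENERAL_KNOWLEDGE = [
    'Missing values (N/A or "nan") should not be used as a basis for your decision.',
    "If there are missing values, you should make inferences based only on "
    "the information that is available.",
]
TASK_KNOWLEDGE = {
    "EM": [
        "To determine if two values are identical, you need to examine both "
        "their full names and corresponding acronyms.",
    ],
}
DATASET_KNOWLEDGE = {
    ("EM", "Beer"): [
        "Note that different factories can belong to the same parent company.",
    ],
}


def injected_knowledge(task, dataset=None):
    """Join general, task-specific, and (optionally) dataset-specific knowledge."""
    lines = list(GENERAL_KNOWLEDGE)
    lines += TASK_KNOWLEDGE.get(task, [])
    if dataset is not None:
        # Dataset-specific knowledge is optional, e.g. for unseen datasets.
        lines += DATASET_KNOWLEDGE.get((task, dataset), [])
    return "\n".join(lines)


print(injected_knowledge("EM", "Beer"))
```

Passing no dataset name yields only the general and task-specific sentences, which mirrors the case where dataset-specific knowledge is unavailable at inference time.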

Appendix D Few-Shot Prompting

We apply few-shot prompting by manually selecting a subset of data instances from the dataset and labeling them. For instance, a few-shot example for the Beer dataset is presented as follows:

The example follows the same format of instance content, question, and output format as in the instruction data. It also provides the answer, indicated by ### Response: Yes. Whereas we only show a positive example here, it is suggested to include both positive and negative examples. After the final example, the instance to be processed is presented in the prompt, and the model follows the same output format as demonstrated in the examples.
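
The layout described above (labeled examples first, then the unlabeled instance) can be sketched as follows. Only the `### Response:` marker comes from the text; the function name, the record serialization, the question wording, and the placeholder beer records are hypothetical and serve only to show the structure.

```python
RESPONSE_MARKER = "### Response:"


def build_few_shot_prompt(examples, instance, question):
    """Concatenate labeled examples, then the unlabeled instance to be processed."""
    blocks = [
        f"{content}\n{question}\n{RESPONSE_MARKER} {answer}"
        for content, answer in examples
    ]
    # The final block stops at the marker so that the model completes the answer.
    blocks.append(f"{instance}\n{question}\n{RESPONSE_MARKER}")
    return "\n\n".join(blocks)


# Hypothetical serialized records, for illustration only.
examples = [
    ("Product A: [name: beer X, factory: brewery P]\n"
     "Product B: [name: beer X, factory: brewery P]", "Yes"),
    ("Product A: [name: beer X, factory: brewery P]\n"
     "Product B: [name: beer Y, factory: brewery Q]", "No"),
]
question = "Are Product A and Product B the same product?"
prompt = build_few_shot_prompt(
    examples,
    "Product A: [name: beer Z, factory: brewery R]\n"
    "Product B: [name: beer Z, factory: brewery R]",
    question,
)
print(prompt)
```

The example list here contains one positive and one negative case, in line with the suggestion to include both.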

Since ground truths are usually not available in real applications, users can handcraft few-shot examples for inference. On the other hand, few-shot examples can be automatically generated by randomly injecting errors for ED and DI, such as missing values, typographical/formatting errors, and randomly swapping values for two columns in a tuple or two tuples in a column. For SM and EM, we can employ rule-based methods (e.g., blocking rules [43]) to quickly find a few matches and use them as few-shot examples.
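
A minimal sketch of the automatic generation idea for ED and DI is given below, covering three of the corruption types mentioned above: missing values, typographical errors, and swapping the values of two columns in a tuple. All function names and the sample record are hypothetical, and the typo model (deleting one random character) is a simplification chosen for this sketch.

```python
import random


def inject_missing(record, attr):
    """Replace one attribute value with a missing-value marker."""
    corrupted = dict(record)
    corrupted[attr] = "nan"
    return corrupted


def inject_typo(record, attr, rng):
    """Delete one random character from the attribute value."""
    corrupted = dict(record)
    value = str(record[attr])
    if len(value) >= 2:  # leave very short values untouched
        i = rng.randrange(len(value))
        corrupted[attr] = value[:i] + value[i + 1:]
    return corrupted


def swap_columns(record, attr_a, attr_b):
    """Swap the values of two attributes within one tuple."""
    corrupted = dict(record)
    corrupted[attr_a], corrupted[attr_b] = record[attr_b], record[attr_a]
    return corrupted


# Hypothetical clean record, for illustration only.
rng = random.Random(0)
clean = {"name": "Cafe Example", "city": "Osaka", "phone": "06-0000-0000"}

# Each entry pairs a record with the ED answer for the corrupted attribute.
ed_examples = [
    (clean, "city", "No"),
    (inject_typo(clean, "city", rng), "city", "Yes"),
    (swap_columns(clean, "name", "city"), "city", "Yes"),
]
# For DI, the removed value serves as the ground truth to be imputed.
di_example = (inject_missing(clean, "city"), "city", clean["city"])
```

Because the clean record is known, the label of each generated example is available without manual annotation.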

D.1 Error Detection

The few-shot examples for the Flights and Rayyan datasets are given as follows.

D.2 Data Imputation

The few-shot examples for the Flipkart and Phone datasets are given as follows.

D.3 Schema Matching

The few-shot examples for the CMS dataset are given as follows.

D.4 Entity Matching

The few-shot examples for the Abt-Buy and Walmart-Amazon datasets are given as follows.