HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: eqparbox

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: CC BY-NC-SA 4.0
arXiv:2312.01678v4 [cs.AI] 13 Mar 2024

Jellyfish: A Large Language Model for Data Preprocessing

11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPTHaochen Zhang, 22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPTYuyang Dong, 1,313{}^{1,3}start_FLOATSUPERSCRIPT 1 , 3 end_FLOATSUPERSCRIPTChuan Xiao, 22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPTMasafumi Oyamada
11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPTOsaka University, 22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPTNEC Corporation, 33{}^{3}start_FLOATSUPERSCRIPT 3 end_FLOATSUPERSCRIPTNagoya University
{chou.koushin, chuanx}@ist.osaka-u.ac.jp, dongyuyang, oyamada@nec.com
Abstract

This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Whereas the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7 – 13B models) as universal DP task solver. We select a collection of datasets across four representative DP tasks and construct instruction-tuning data using serialization and knowledge injection techniques tailored to DP. As such, the instruction-tuned LLMs empower users to manually craft instructions for DP. Meanwhile, they can operate on a local, single, and low-priced GPU, ensuring data security and enabling further tuning. Our experiments show that our dataset constructed for DP instruction tuning, namely Jellyfish, effectively enhances LLMs’ DP performances and barely compromises their abilities in NLP tasks. By tuning Mistral-7B and OpenOrca-Platypus2-13B with Jellyfish, the models deliver competitiveness compared to state-of-the-art DP methods and strong generalizability to unseen tasks. The models’ performance rivals that of GPT series models, and the interpretation offers enhanced reasoning capabilities compared to GPT-3.5. The 7B and 13B Jellyfish models are available at Hugging Face:
https://huggingface.co/NECOUDBFM/Jellyfish-7B
https://huggingface.co/NECOUDBFM/Jellyfish-13B

**footnotetext: Haochen Zhang and Yuyang Dong contributed equally to this work. Yuyang Dong is the corresponding author.

1 Introduction

The proliferation of large language models (LLMs) has catalyzed a diverse array of applications, extending beyond the domain of NLP to encompass a wide range of fields that require the processing of natural language data. Notably, LLMs have been applied in areas such as software engineering [80, 93], computer simulation [107, 24], data analytics [8, 88], and tabular data processing [54, 64, 115].

This paper focuses on the utilization of LLMs for data preprocessing (DP), a critical step in the data mining pipeline that involves transforming raw data into a manageable and processable format ready for use. Over the past decades, significant strides have been made in various DP tasks. Until 2021, most efforts were concentrated on one or two specific tasks such as error detection (ED) [28, 68], data imputation (DI) [83, 67, 69], schema matching (SM) [113], and entity matching (EM) [43, 57]. A key challenge in developing generic solutions to DP is that these tasks differ in nature: they deal with errors, anomalies, matches, etc. and require different actions such as detection, repairing, and alignment.

Refer to caption
Figure 1: Overview of instruction-tuning a large language model for data preprocessing.

With the advent of LLMs like GPT-3 and subsequent versions, researchers have found a key to address this challenge, spurring the development of generic solutions for a wider array of DP tasks [73, 112]. The application of LLMs in DP has the following strengths: (1) The primary strengths of using LLMs in DP lie in their ability to process natural language. Most LLMs provide a prompting interface with which users can interact and assign tasks in natural language, contrasting with existing DP solutions that require computer programming or specific tools (e.g., HoloClean [83] and Magellan [43]). (2) With the knowledge acquired through training on vast amounts of data, LLMs are universal problem solvers capable of identifying errors, anomalies, and matches in data (and particularly unseen datasets in unseen tasks), aligning with the aims of DP tasks without needing human-engineered rules [82]. (3) LLMs are excellent reasoners [42], enabling them to not only return DP results but also provide the reasons for these results. In this sense, their answers are more interpretable than those of other deep learning approaches. (4) LLMs can be conditioned by few- [5] or zero-shot [42] prompting. As such, we can condition the criteria for DP tasks (e.g., the degree of matching) using few-shot examples or zero-shot prompts, contrasting with traditional solutions based on a threshold [86, 43] or a time-consuming training process to fit to the data [69].

Despite these strengths, existing LLM-based solutions to DP [73, 112, 44], with reliance on GPT APIs, have raised concerns about data breaches, as evidenced by OpenAI’s first confirmed data breach involving ChatGPT [76]. Another limitation is the difficulty in domain specification [73]. When dealing with data from highly specialized domains, training the LLMs used in these solutions can be costly (e.g., GPT-3.5) and even unavailable due to frozen parameters (e.g., GPT-4), posing difficulty in customizing the model.

In response to the aforementioned challenges, we propose Jellyfish, a dataset for tuning LLMs for various DP tasks. Jellyfish distinguishes itself with several key features: (1) Jellyfish is used for building universal DP task solvers by instruction-tuning [114] LLMs to the following tasks: ED and DI for data cleaning, and SM and EM for data integration. (2) Jellyfish is suited to 7 – 13B models which can operate on a local, single, and low-priced GPU, ensuring data security and allowing further tuning. (3) Capable of understanding natural language, the tuned model allows users to manually craft instructions for DP tasks and apply prompt engineering techniques to tailor it to specific tasks and datasets. (4) Unlike many existing methods that rely heavily on handcrafted knowledge during inference [83, 81], Jellyfish features domain knowledge in its instruction-tuning dataset and enables optional knowledge injection during inference. (5) It includes reasoning data for tuning model’s interpretation ability that provides natural language explanations of the model’s outputs.

As depicted in Figure 1, Jellyfish is constructed by manually selecting data from several public datasets widely used for DP evaluation. By instance serialization, raw data is serialized into the prompts used to tune the model. By knowledge injection, task- and dataset-specific knowledge – particularly domain knowledge that can be extended to unseen datasets – is infused to the prompts. Moreover, we resort to GPT-4 to generate reasoning data. As such, the tune model’s interpretation distills GPT-4’s knowledge in reasoning DP results.

Our evaluation of Jellyfish focuses on tuning OpenOrca-Platypus2-13B (as Jellyfish-13B) and Mistral-7B-Instruct-v0.2 (as Jellyfish-7B), and compares with two categories of methods: non-LLM methods – typically solutions based on machine learning (ML) or pre-trained language models (PLMs) prior to the prevalence of LLMs – and LLM methods – typically GPT series methods. The results show that Jellyfish-13B consistently outperforms non-LLM methods on its seen datasets. Its effectiveness on unseen datasets is comparable to that of non-LLM methods on their respective seen datasets. Meanwhile, Jellyfish-7B also exhibits competitiveness, especially on DI tasks. In two case studies of unseen tasks, Jellyfish models deliver strong performance, showcasing their generalizability to a wider range of DP tasks beyond the four tasks used for tuning. Their performance rivals that of GPT series models, and Jellyfish-7B even outperforms GPT-4 on the attribute value extraction task. Our evaluation reveals the impact of data configuration in Jellyfish, and discovers that tuning with Jellyfish barely compromises models’ abilities in NLP tasks. Furthermore, additional experiments demonstrate the advantage of Jellyfish interpretation over GPT-3.5 in reasoning capabilities as well as the effectiveness of the techniques employed in building Jellyfish.

Our contributions are summarized as follows.

  • We develop Jellyfish, a dataset for instruction-tuning LLMs as universal DP task solvers.

  • LLMs tuned with Jellyfish showcase several notable features: universal model design, moderate model size, assurance of data security, feasibility for further tuning, natural language instruction handling, optional specification of prior knowledge, and model interpretability.

  • Our experiments demonstrate Jellyfish-7B and 13B models’ effectiveness in DP task solving, generalizability to new tasks beyond what they are tuned for, and the superior reasoning abilities.

The rest of the paper is organized as follows: Section 2 introduces the DP tasks targeted by our model and briefly reviews LLMs. Section 3 describes the Jellyfish dataset for instruction tuning. Section 4 introduces how to use Jellyfish for solving DP tasks. Section 5 discusses the extensions to unseen tasks. Section 6 reports experimental results and analysis. Section 7 reviews related works on DP. Section 8 concludes this paper.

2 Preliminaries

2.1 Data Preprocessing

In data mining, DP is a crucial step that deals with noise, missing values, inconsistencies, and heterogeneity in data. Major DP procedures include data cleaning, data integration, data transformation, and data reduction [26]. In this initial exploration of LLMs for DP, we concentrate on tabular data, one of the most common data types. Our data model operates on relational tables specified by schemas. We assume all attributes are either numerical values (incl. binary values) or textual values (incl. categorical values). Diverging from the traditional definition that presents the entire dataset and finds or fixes all the errors (or matches, etc.) within, we define the problem by handling one record (or a pair, depending on the task) at a time, so the prompt can be easily written and its length is within LLMs’ token limitation. Next, we outline the DP tasks involved in this study:

(1) Error Detection (ED): Given a record (i.e., a tuple in a relational table) and an attribute, our task is to detect whether there is an error in the cell value of this attribute. (2) Data Imputation (DI): Given a record and an attribute such that cell value for this attribute is missing, our task is to infer its correct value. (3) Schema Matching (SM): Given a pair of attributes represented in the form of (name, description), our task is to find whether they refer to the same attribute. (4) Entity Matching (EM): Given a pair of records, our task is to infer whether they refer to the same entity.

The above four tasks collectively form the most critical part of DP and are used for instruction-tuning. Besides, we consider two unseen tasks which belong to the intersection of DP and other topics: (1) Column Type Annotation (CTA): Given a table with no header, our task is to infer the type of each column from a set of predefined semantic types (e.g., name, time, location). (2) Attribute Value Extraction (AVE): Given a text description of an entity and a set of predefined attributes, extract attribute values from the text description.

We term each input object an instance, i.e., a record for ED and DI, a pair of attributes for SM, a pair of records for EM, a table or a column for CTA, and a text description for AVE.

2.2 Large Language Models

With advancements in the field of natural language processing (NLP), LLMs have become one of the hottest topics in the AI research community. Representative LLMs include OpenAI’s GPT series (in particular, GPT-3, 3.5, and 4), Anthropic’s Claude, Google’s Gemini, Mistral AI’s Mistral [37], Meta’s Llama [97] and Llama 2 [98], as well as their variants that can be found at Hugging Face [35]. Due to their superb ability to process natural language, LLMs have not only been used in NLP applications (e.g., ChatGPT and Claude), but also catalyzed the rise of LLM-powered autonomous agents [101] as AI assistants (e.g., by GPTs) or tools for engineering [80, 31] or simulation [108, 107] purposes. Another popular LLM-centric research direction is retrieval-augmented generation (RAG) [52, 53], which gives LLMs access to external information to improve generation performance. We refer readers to [118] for a survey on LLMs. Some LLMs are open-source (e.g., Llama and Llama 2), and they can be fine-tuned with additional tasks to improve their abilities in logical reasoning, question answering, and so on. Among these fine-tuning approaches, instruction tuning [114] has become a prevalent one which further trains LLMs on a dataset consisting of (instruction, output) pairs in a supervised fashion, hence bridging the gap between the next-word prediction objective of LLMs and the users’ objective of having LLMs adhere to human instructions. For efficiency of fine-tuning, parameter-efficient fine-tuning (PEFT) approaches enable adaptation of LLMs to downstream applications without fine-tuning all the parameters. Notable methods are adapter tuning [32], prefix-tuning [56], and low-rank adaptation (LoRA) [33]. In particular, LoRA achieves significantly fewer trainable parameters and no additional inference latency, and has become a prevalent PEFT approach.

In addition to the strengths outlined in Section 1, we discuss the limitations of LLMs in the context of DP: (1) LLMs often require substantial computational resources, thereby increasing the cost of use and compromising the efficiency and scalability of DP on large-scale data. (2) Due to token limitation (the maximum input length, e.g., 4k tokens for GPT-3.5) and lack of memory for keeping historical information, the input to the LLM is often instance-by-instance, and the DP results may exhibit inconsistency across different instances. Simply raising the token limitation (e.g., 128k tokens for GPT-4-turbo) does not solve the problem, because performance may degrade due to increased lengths of input [60]. (3) LLMs sometimes exhibit hallucination [117], i.e., they generate text that is plausible-sounding but factually incorrect or non-sensical, as they lack a fundamental understanding of the world and rely solely on the patterns they learned during training.

3 Instruction Tuning with Jellyfish

3.1 Dataset Preparation

For the four seen tasks, we choose a series of datasets that have been widely used in previous studies and cover a variety of application domains. (1) ED: Adult and Hospital, used in [28]; (2) DI: Buy and Restaurant, used in [69]; (3) SM: MIMIC-III and Synthea, used in [113]; (4) EM: Amamzon-Google, Beer, DBLP-ACM, DBLP-GoogleScholar, Fodors-Zagats, and iTunes-Amazon from the Magellan data repository [16]. We use the publicly available version of these datasets [73], where errors and missing values are already injected to the datasets of ED and DI, respectively. The statistics are provided in Table 1.

Table 1: The Jellyfish dataset statistics. #Positives indicates the number of positive instances, i.e., there is an error (for ED) or the two objects match (for SM and EM).
Task Dataset #Instances #Positives
ED Adult 550×\times×2 35×\times×2
Hospital 1710×\times×2 44×\times×2
DI Buy 586 N/A
Restaurant 778 N/A
SM MIMIC-III 7000 11
Synthea 5000 18
EM Amazon-Google 6874 699
Beer 359 54
DBLP-ACM 5000 885
DBLP-GoogleScholar 5000 924
Fodors-Zagats 757 88
iTunes-Amazon 430 105

For determining the number of instances in each dataset, a rationale is that the dataset across different tasks should be balanced and there is no dataset dominating the entire corpus. In particular, we undertake the following efforts to prepare data: (1) Given the disproportionately low number of positive instances compared to negative ones, we incorporate all positive instances available in the datasets into Jellyfish. (2) For ED, since missing values can be interpreted as either errors or non-errors depending on the context, we create two versions of each instance with missing values during tuning: one treating the missing values as errors and the other as non-errors. This duplication is guided by the injection of knowledge, which will be detailed later in this section. (3) In instances of low-quality data for a task (e.g., SM), we moderately increase the percentage of the data for this task in the corpus, to ensure the LLM can learn effectively. Upon the above preparation, we split each dataset into training and validation with a ratio of 80:20. Next, we introduce how we transform raw data to instructions. We prepare instruction data and reasoning data, one for tuning the task solving ability and the other for tuning the the interpretation ability. Then can be jointly used for tuning a Jellyfish model.

3.2 Instruction Data

To prepare the instruction data for an LLM, we need to serialize (a.k.a. contextualize) each instance in the raw data to a prompt. The prompt contains the task description, the instance content, and any injected knowledge. To describe our techniques for constructing the instruction data for training, we use an example for an instance in the Beer dataset used for EM, as shown in Figure 2.

Refer to caption
Figure 2: Prompt of instruction data. Boldface in parenthesis are comments not prompted to the model.

At the beginning, there is a system message guiding the model behavior. Here, we instruct the model to act as an AI assistant to answer the user’s question, and its response should always respect this constraint. Then, we describe the DP task, i.e., EM in this example. The following part refers to injected knowledge. There are two types of injected knowledge: (1) general knowledge that applies to many datasets, and (2) specific knowledge that only applies to the given dataset. In this example, the knowledge belongs to general knowledge and concerns with missing values. Such knowledge injection may prevent the model incorrectly handling certain values in the dataset, especially when training data is noisy. The following part pertain to the instance content. Finally, there is a question presented to the model, and the output format is specified afterwards.

Whereas in the above example we specify knowledge on missing values, there are other forms of general knowledge used in tuning, including error types and terminology. For example, for ED, we inform the model of the fact that errors can include, but are not limited to, spelling errors, inconsistencies, or values that do not make sense for that attribute; for EM, we instruct the model to consider the full name of an attribute and its acronym to determine if the two values are the same. Specific knowledge highly depends on the application domain, mainly including constraints or rules that pertain to the dataset. For example, in publication datasets, authors’ names may occur in different forms and different orders even for the same article. Additionally, the model can be configured to assign greater importance to certain attributes. In the context of product data, for example, the model is directed to prioritize the comparison of product numbers. Specific knowledge can be applicable to datasets within the same domain, thereby enhancing the model’s performance on unseen datasets, particularly in scenarios where prior knowledge about these datasets is absent. Overall, the knowledge injected through tuning becomes the built-in knowledge of the model and can be used even without user-specification during inference.

3.3 Reasoning Data

The reasoning data uses the same set of datasets as instruction data, except that we use roughly smaller numbers of instances: 2500 for SM and 360 for EM, with detailed statistics given in Table 2.

Table 2: Jellyfish-preview dataset statistics. #Positives indicates the number of positive instances, i.e., there is an error (for ED) or the two objects match (for SM and EM).
Task Dataset #Instances #Positives
ED Adult 550 35
Hospital 1250 40
DI Buy 586 N/A
Restaurant 600 N/A
SM Synthea 2500 15
EM Amazon-Google 359 31
Beer 359 54
DBLP-ACM 359 71
DBLP-GoogleScholar 359 69
Fodors-Zagats 359 41
iTunes-Amazon 360 92

This smaller set of training data is also used for initial experiments on selecting base LLMs, and thus the tuning process using this set of data is dubbed Jellyfish-preview. The prompt in reasoning data is similar to the instruction data introduced above. The only difference is the reasoning instructions, as given in Appendix B.

Unlike the labeled ground truth for instruction data, we resort to GPT-4 to retrieve reasonable answers in reasoning data. Such practice is also used in constructing training data for various Llama and Llama 2 variants such as Alpaca [94] and Orca [72].

4 Data Preprocessing with Jellyfish Models

Given a dataset in CSV format, the task solver uses an instance serializer that iterates through all the instances and transforms each instance to a prompt. The prompt is same as the instruction and reasoning data for tuning the task solver and interpreter, respectively. We apply general knowledge for DP tasks, e.g., missing values in matching tasks and error types in ED. The task solver also provides a knowledge injector with which users can input dataset-specific knowledge, such as the domain knowledge (e.g., constraints) outlined in the previous section. Such user-specified knowledge is optional.

Feature Engineering. Users can optionally select a subset of features to improve performance. For instance, for EM in the Beer dataset, name and factory are more relevant features, while style and ABV are less relevant. Hence users may choose to use only name and factory as attributes. Such feature engineering can be also implemented in the prompt as specific knowledge, e.g., you should only consider name and factory and ignore other attributes.

Prompt Engineering. Prompt engineering [105] is the process of structuring text to enhance the model performance. We incorporate few-shot prompting [5], which conditions the Jellyfish models to learn from a small selection of examples drawn from the dataset. The prompts for few-shot examples are reported in Appendix D.

Batch prompting [9] is a prompt engineering technique designed to enable models to perform inference in batches, rather than processing single instances individually. This approach involves presenting multiple instances within a single prompt, with the model instructed to respond to all of them concurrently. Though proven effective for GPT-3.5 and GPT-4 in reducing token consumption and execution time [112], we have opted not to employ this method in the current version of Jellyfish. Our concern is that overburdening a 7B or 13B model with an excessive number of tokens, even when staying within its token limitation, could lead to diminished attention. This might significantly impair performance, potentially resulting in the model overlooking responses to some instances. We anticipate that future research, utilizing larger models or increasing the token input capacity, could effectively address this limitation.

5 Extensions to Unseen Tasks

In Section 4, we introduce DP task solving and focus on seen tasks. For unseen tasks, we consider two case studies: CTA and AVE, as outlined in Section 2.1. Jellyfish models can be easily extended to support them by employing the prompt engineering techniques in existing LLM-based solutions, hence simplifying its use in unseen tasks.

Column Type Annotation. As a task in the realm of table understanding, CTA is an essentially DP step for data search [7], knowledge base completion [85], and data integration a data lake [25]. We follow the two-stage pipeline proposed in [44], which was designed for ChatGPT and based on chain-of-thought [104], a technique that enables complex reasoning capabilities through intermediate reasoning steps.

Given a table to be annotated, in the first stage, the model predicts the domain of the table. In the second stage, given a set of predefined types, the model determines the type of column based on sample values extracted from it. The chain-of-thought prompt instructs the model in a step-by-step manner. For example, to predict the domain of the table, there are four steps: (1) look at the input and make a table out of it, (2) look at the cell values in detail, (3) decide if the table describes domain A, domain B … and (4) answer with the domain. Then, the model follows this prompt to cope with the task. The column type selection in the second stage works in the same way, except that table is replaced by column and domains are replaced by candidate types.

Attribute Value Extraction. Given a text description, AVE is an information extraction task that discovers missing values of attributes and reconstructs a table. For this task, we follow the prompt in [4] designed for GPT-4. The prompt is simple, beginning with the task description. Then, the instance content follows, with the description of the entity and the attribute to be extracted. Finally, an exception rule is mentioned: if the attribute cannot be extracted, the model should answer “N/A”.

We also would like to mention that Jellyfish models enable further fine-tuning. Users may choose to condition the model for specific DP tasks or domains to seek better performance. Moreover, Jellyfish models can be utilized for multiple tasks in a DP pipeline, e.g., data cleaning followed by data integration on the same sets of data. It is likely that the DP tasks within this pipeline belong to the same domain. In this case, Jellyfish models may deliver consistency in handling the data in different tasks due to the built-in domain knowledge acquired through instruction tuning for DP.

6 Experiments

6.1 Experimental Setup

Datasets. Apart from the seen datasets in Jellyfish (Section 3), we use following datasets as unseen data. CTA and AVE are used for case studies on unseen tasks. (1) ED: Flights and Rayyan, used in [68]; (2) DI: Flipkart [22] and Phone [84] from Kaggle; (3) SM: CMS, used in [113]; (4) EM: Abt-Buy and Walmart-Amazon (despite the same domain as the Amazon dataset used in Jellyfish, the entities belong to a different category of products) from the Magellan data repository [16]; (5) CTA: SOTAB, used in [44]; (6) AVE: AE-110k and OA-Mine, used in [4]. The statistics of the datasets are reported in Table 3. We generate train/valid/test splits following the protocol in [28] for Adult and Hospital, [69] for Flipkart and Phone, and [113] for MIMIC-III and CMS. The other datasets have already been provided with splits [73, 44, 4]. A subset of the train/valid splits is used in Jellyfish, as reported in Table 1.

Table 3: Testing dataset statistics. #Train and #Valid numbers only apply to GPT-3.5 on AVE and non-LLM methods on other tasks.
Task Type Dataset #Train #Valid #Test #Total
ED Seen Adult 550 550 9900 11000
Hospital 1710 190 17101 19001
Unseen Flights 715 714 12832 14261
Rayyan 501 502 8997 10000
DI Seen Buy 469 117 65 651
Restaurant 622 156 86 864
Unseen Flipkart 6240 0 2675 8915
Phone 2537 0 1194 3731
SM Seen MIMIC-III 51264 6408 6408 64080
Synthea 23709 2964 2964 29637
Unseen CMS 22784 2848 2564 28196
EM Seen Amazon-Google 6874 2293 2293 11460
Beer 268 91 91 450
DBLP-ACM 6417 2473 2473 11363
DBLP-GoogleScholar 17223 5742 5742 28707
Fodors-Zagats 567 190 189 946
iTunes-Amazon 321 109 109 539
Unseen Abt-Buy 5743 1916 1946 9575
Walmart-Amazon 6144 2049 2049 10242
CTA Unseen SOTAB 356 0 250 606
AVE Unseen AE-110K 4360 0 1482 5842
OA-Mine 7360 0 2451 9811
Table 4: DP performance on seen tasks, with winner in boldface and runner-up underlined. Few-shot is disabled for Jellyfish models on seen datasets and enabled on unseen datasets. “–” indicates numbers not reported in prior works.
Task Type Dataset Model

Best of non-LLM

GPT-3

GPT-3.5

GPT-4

Table-GPT

Jellyfish-7B

Jellyfish-7B-I

Jellyfish-13B

ED Seen Adult 99.10 99.10

92.01

92.01

94.70

91.96

99.33
Hospital

94.40

97.80

90.74

90.74

95.09

96.27

95.59

Unseen Flights

81.00

83.48

65.30

66.92

82.52
Rayyan

79.00

81.95

73.81

69.82

90.65
DI Seen Buy

96.50

98.50

98.46

100

98.46

96.92

100
Restaurant

77.20

88.40

94.19 97.67

86.05

88.37

89.53

Unseen Flipkart

68.00

89.94

81.87

79.44

81.68

Phone

86.70

90.79

83.67

85.00

87.21
SM Seen MIMIC-III

20.00

40.00

43.14 40.00 40.00
Synthea

38.50

45.20

57.14 66.67

55.55

44.44

56.00

Unseen CMS 50.00

19.35

20.00

13.79

59.29
EM Seen Amazon-Google

75.58

63.50

66.50

74.21

70.10

81.29

80.83

81.34
Beer

94.37

100

96.30

100

96.30

96.30

96.55

96.77

DBLP-ACM 98.99

96.60

96.99

97.44

93.80

98.54

98.88

98.98
DBLP-GoogleScholar 95.70

83.80

76.12

91.87

92.40

94.89

95.16

98.51
Fodors-Zagats 100 100 100 100 100 100 100 100
iTunes-Amazon

97.06

98.20

96.40

100

94.30

96.30

96.30

98.11

Unseen Abt-Buy

89.33

92.77

79.78

82.38

89.58
Walmart-Amazon

86.89

87.00

86.17

90.27

82.40

78.22

85.64

89.42

LLMs. We mainly instruction-tune two LLMs with Jellyfish: (1) OOP2-13B, short for OpenOrca-Platypus2-13B [50], a Llama 2 variant with enhanced reasoning capabilities and logic proficiency; (2) Mistral-7B, short for Mistral-7B-Instruct-v0.2 [37], a prevalent 7B model. The tuned models using instruction data are dubbed Jellyfish-13B and Jellyfish-7B, respectively. We also tune Mistral-7B with both instruction and reasoning data, dubbed Jellyfish-7B-I. The hyperparameter setup is provided in Appendix A. Injected knowledge is reported in Appendix C. When few-shot prompting is enabled, we equip LLMs with three examples for each dataset, covering both positive and negative examples (Appendix D).

For inference, the prompts are the same as instruction and reasoning data, respectively. We apply general knowledge for DP tasks, e.g., missing values in matching tasks and error types in ED. Dataset-specific knowledge is not used.

Baseline DP Methods. We categorize existing methods into non-LLM methods and LLM methods. For non-LLM methods, we select the following baselines, in line with [73]: (1) ED: HoloDetect [28] and Raha [68]; (2) DI: IPM [69]; (3) SM: SMAT [113]; (4) EM: Ditto [57] and Unicorn [99]; (5) CTA: RoBERTa [63]. For their performance, we follow the best numbers reported in prior works [73, 44, 99]. Other methods such as Baran [67], HoloClean [83], and DODUO [92], have been shown to be outperformed by the above competitors [69, 73, 44], and hence are not compared here.

LLM methods are GPT-3 (text-davinci-002), GPT-3.5 (gpt-3.5-turbo-0301), Table-GPT [54] (GPT-3.5 fine-tuned for tables), GPT-4 (gpt-4-0314), Stable Beluga 2 70B [66], and SOLAR 70B [100]. We follow the numbers reported in [73, 112, 4].

Metrics. For DP task solving, we measure accuracy for DI, F1 score for ED, DI, EM, and AVE, and micro-F1 for CTA, all reported on a 100-scale.

Environment. Training and inference of LLMs are accelerated with NVIDIA A100 GPUs with 80GB graphics memory. We employ LoRA [33] and FlashAttention-2 [15] to speed up tuning and vLLM with PageAttention [47] to speed up inference.

6.2 DP Performance

6.2.1 Seen Tasks

We evaluate the performance on the seen tasks used for tuning. Table 4 reports the accuracy for DI and F1 score for the other three tasks. For Jellyfish models, few-shot prompting is disabled on seen datasets and enabled for unseen datasets. Among all the competitors, GPT-4 generally performs the best. This is expected, as it is the most advanced model with the largest number of parameters. However, its performance on SM is mediocre. Jellyfish-13B is generally the runner-up model and significantly outperforms GPT-4 on SM. Its accuracy or F1 score on the unseen datasets are over 80%, except on SM. In addition, Jellyfish-13B outperforms non-LLM methods on all but one seen datasets, and on all unseen datasets. Note that for non-LLM methods, because they need training on the input dataset, all the datasets are seen for them. This means that even without training on these datasets, Jellyfish-13B’s performance still surpasses the performance of non-LLM methods with training. Comparing Jellyfish-13B with GPT-3, GPT-3.5, and Table-GPT, Jellyfish-13B wins in more cases. Meanwhile, the 7B Jellyfish models also exhibit competitiveness, especially for DI, despite a small model size. By comparing Jellyfish-7B and Jellyfish-7B-I, we find that adding reasoning data to tuning does not significantly impair the model’s task solving ability.

Table 5: Precision (P), recall (R), and F1 score on SM.
Type Dataset Model
SMAT GPT-4 Jellyfish-13B
P R F1 P R F1 P R F1
Seen MIMIC-III 11.5 84.6 20.2 33.33 50.0 40.0 45.45 35.71 40.0
Synthea 24.4 90.9 38.5 71.42 62.5 66.67 41.18 87.50 56.00
Unseen CMS 33.9 95.0 50.0 60.0 11.5 19.35 57.14 61.54 59.26

Among the four tasks, SM is the hardest task, and all the competitors report relatively low F1 score. Looking into the datasets, we find that even humans have difficulties in telling whether the two attributes match, given only name and description. To compare the methods in more detail, we report precision and recall in Table 5. The non-LLM method, SMAT, reports the highest recall, yet with a very low precision. Among its results, only 1 out of 3 – 9 is true positive. This iss because many SM-tailored methods seek high recall, in order to find more candidates for further verification. Jellyfish-13B exhibits relatively high precision (41% – 57%), and is close to GPT-4 on the unseen dataset of CMS. This suggests that Jellyfish-13B can be used as a verification method (1 out of 2 is true positive) on top of a filtering approach (e.g., SMAT).

Table 6: Micro-F1 score on the unseen DP task of CTA, few-shot disabled for Jellyfish models.
Dataset Model

RoBERTa (159 shots)

RoBERTa (356 shots)

GPT-3.5

GPT-4

Jellyfish-7B

Jellyfish-7B-I

Jellyfish-13B

SOTAB

79.20

89.73

89.47

91.55

83.54

80.89

82.00

Table 7: F1 score on the unseen DP task of AVE, few-shot disabled for Jellyfish models.
Dataset Model

Stable Beluga 2 70B

SOLAR 70B

GPT-3.5

GPT-4

Jellyfish-7B

Jellyfish-7B-I

Jellyfish-13B

AE-110k

52.10

49.20

61.30

55.50

74.17

76.85

58.12

OA-Mine

50.80

55.20

62.70

68.90

75.35

76.04

55.96

6.2.2 Unseen Tasks

Table 6 reports the performance comparison on CTA. RoBERTa needs fine-tuning for this task. We report its results with two options, one for 159 shots of training data, the other for 356 shots of training data, following the numbers in [44]. GPT-4 still performs the best. Even without any tuning for this task, Jellyfish models still outperform RoBERTa fine-tuned with 159 shots. Another observation is that Jellyfish-7B is slightly better than Jellyfish-13B. We think this could be attributed to higher generalizability of Mistral-7B model.

For AVE, we report results in Table 7. the two 7B Jellyfish models are by far the winners, showcasing superb generalizability to this unseen task. Jellyfish-13B also surpasses the two 70B models.

Table 8: Impact of instruction tuning for DP on the unseen task of CTA. “+ task” denotes the model tuned for the task.
OOP2-13B + ED + DI + SM + EM Jellyfish-13B
56.40 74.20 79.20 76.70 71.50 82.00
Table 9: Impact of prompt engineering on the unseen task of CTA, varying options in stages and chain-of-thought (CoT) over Jellyfish-13B.

One-stage, w/o CoT

One-stage, w/ CoT

Two-stage, w/o CoT

Two-stage, w/ CoT

51.50

58.00

67.00

82.00

To drill down to the impact of tuning on unseen tasks, we investigate CTA as an example. Table 8 helps us find out which task contributes the most to this unseen task. When tuning with only one task, the model reports a micro-F1 in the range of 71% – 79%, with DI being the highest. We suppose this is because DI is exactly the inverse operation of CTA, i.e., DI fills the value of an attribute, and meanwhile CTA infers the type of an attribute given a set of sample values. Moreover, the four tasks jointly contributes to an overall micro-F1 of 82% and it surpasses the performance of tuning with only DI, showcasing the usefulness of other tasks as well.

Further, we conduct an ablation study to study the impact of prompting and report the results in Table 9. The two-stage pipeline performs better than the one-stage pipeline, and chain-of-thought, which splits the inference of column types into four steps, is also useful, in line with the observation in [44]. This demonstrates that the prompt engineering techniques developed for existing LLM-based solutions also work for Jellyfish-13B. In doing so, the design of prompts for Jellyfish-13B on unseen tasks is rendered much easier, as users may directly follow those used in existing works.

Table 10: NLP performance on the Open LLM Leaderboard.
Model

MMLU

WinoGrande

ARC

TruthfulQA

GSM8K

HellaSwag

Average

(5-shot)

(0-shot)

(25-shot)

(0-shot)

(8-shot)

(10-shot)

OOP2-13B

54.49

74.03

62.63

52.56

25.32

83.24

58.71

Jellyfish-13B

53.04 (-1.45)

74.19 (+0.16)

62.88 (+0.25)

52.56 (+0.00)

24.26 (-1.06)

83.16 (-0.08)

58.35 (-0.36)

Mistral-7B

62.91

73.88

63.48

66.91

41.32

84.79

65.55

Jellyfish-7B

62.08 (-0.83)

72.69 (-1.19)

63.48 (+0.00)

64.76 (-2.15)

37.91 (-3.41)

84.48 (-0.31)

64.23 (-1.32)

6.3 NLP Performance

We compare Jellyfish models and their original models on various NLP benchmarks [30, 87, 59, 11, 14, 111] used in the Open LLM Leaderboard [20], as shown in Table 10. For OOP-13B, its performance on NLP tasks barely decreases after tuning for DP, with a 0.36 drop on average, and even improves on two benchmarks. For Mistral-7B, the NLP performance deterioration is more significant, but still within a 1.32 drop on average. We think this is because the 7B model, with fewer parameters, is more prone to lose some of its original capability after fine-tuning [65].

Refer to caption
Figure 3: Impact of tuning with single-task data on DP performance. Above red line is positive.
Refer to caption
Figure 4: Impact of tuning with multi-task data on DP performance. Numbers in parenthesis indicate the percentage of data used for each task.
Refer to caption
Figure 5: Impact of tuning with single-task data on NLP performance. Above red line is positive.
Refer to caption
Figure 6: Impact of tuning with multi-task data on NLP performance. Numbers in parenthesis indicate the percentage of data used for each task.

6.4 Impact of Data Configuration in Jellyfish

We study how data configuration in the four tasks used for tuning impacts the performance. For this set of experiments, we randomly sample data from the datasets in Table 1 and disable the data preparation techniques in Section 3.1, in order to see the impact of dataset size clearly.

To simplify the evaluation, we first tune OOP2-13B with data for a single DP task and evaluate its effect. By varying the amount of data, Figure 3 displays how the tuning data for a specific task affects the DP performance. In general, the four tasks are all useful in improving the overall DP performance. For intra-task performance (e.g., ED to ED), as expected, the tuning data has a significantly positive impact. For inter-task performance, ED and SM are generally positive to other tasks, while DI and EM report negative effects. Such impact on the overall DP performance is also observed when we increase the amount of tuning data (e.g., doubling EM from 21k to 43k). We also find that DI can benefit from all the other three tasks. We think this is because the other three tasks all contain correct values for the attributes, thereby enhancing the model’s ability in filling missing values. In addition, the benefit of increasing tuning data for SM is obvious, in line with our data preparation technique in Section 3.1, i.e., in case of low-quality data, the amount of data should be increased to ensure the LLM can learn effectively. Overall, this experiment suggests that for the sake of DP performance, we shall moderately increase data for SM reduce data for DI.

Next, we study the impact of tuning OOP2-13B with multi-task data and plot the results in Figure 4. By feeding the tuning set with data for more tasks, it is obvious that they jointly contribute to better DP performance, and the improvement is consistent. When fully utilized the data, as indicated by (1, 1, 1, 1), it achieves the best performance. Based on the above results, we construct the Jellyfish data by appropriately choosing the size of data for each task. Moreover, with the data preparation techniques (Section 3.1) applied, Jellyfish-13B, even with less amount of tuning data, performs better than (1, 1, 1, 1) in Figure 4,

Then, we evaluate how the data for a specific DP task affects the NLP performance and report the results in Figure 5. In general, ED and EM exhibit positive impacts on the overall NLP performance. By increasing the amount of tuning data, all the tasks, except DI, are positive to NLP tasks. Specifically, SM turns from negative to positive when the dataset size is doubled, whereas the trend for DI is reversed, resulting in a significant drop. To drill down to each benchmark, all the four tasks are positive to WinoGrande, while they are generally negative to MMLU, and neutral to the other benchmarks, roughly in line with the results in Table 10. This experiment indicates that we need to choose an appropriate data size for each DP task, specifically, with moderately less data for DI, to prevent the model from losing its NLP capability.

We also test the impact of tuning OOP2-13B with multi-task data on its NLP performance over the six benchmarks used in Table 10. The results are reported in Figure 6. The general trend is that with data for more tasks, the NLP performance has a drop, yet this change, as shown in more sporadic points, is less consistent than what we observed in Figure 4. It is noteworthy that the overall decrease in NLP performance is moderate, with an average of 0.36 (from 58.71 to 58.35) for Jellyfish-13B.

Table 11: Impact of base models. “+ DP(P)” denotes the model tuned with Jellyfish-preview data. Knowledge injection is disabled.
Task Dataset Model

Llama-2-13B-Chat

OO-13B (Llama 2 + OpenOrca)

Platypus2-13B (Llama 2 + Open-Platypus)

OOP2-13B (Llama 2 + OpenOrca + Open-Platypus)

Mistral-7B

Llama-2-13B-Chat + DP(P)

OO-13B + DP(P)

Platypus2-13B + DP(P)

OOP2-13B + DP(P)

Mistral-7B + DP(P)

ED Adult

5.92

33.67

7.73

42.77

20.66

93.62

93.49

93.49

96.62

98.79
Hospital

8.78

64.05

6.29

63.24

37.09

81.55

89.67

90.58

92.01

94.13
DI Buy 95.38

75.38

41.54

89.23

76.92

92.31

90.77

87.69

100

98.46

Restaurant 90.70

88.37

86.05

81.40

18.75

89.53

90.70

88.37

89.53

88.37

SM Synthea

0.97

0.00

0.68

22.22

26.67

22.22

22.22

28.57

36.36

25.00

EM Amazon-Google

14.58

25.62

25.64

36.70

36.51

40.00

49.77

42.35

48.20

69.03

Beer

39.13

81.48

11.76

85.71

69.57

95.55

93.33

93.33

96.55

92.86

DBLP-ACM

45.95

78.84

0.00

78.86

85.30

97.45

97.66

97.35

97.35

97.51

DBLP-GoogleScholar

35.71

56.07

40.73

59.48

59.54

92.27

92.22

92.87

92.83

92.30

Fodors-Zagats

42.86

84.21

39.56

92.68

66.67

97.67

100 100 100 100
iTunes-Amazon

30.43

63.53

0.00

57.45

70.97

96.15

96.15

96.15

96.30 96.30

6.5 Impact of Base Models

In addition to OOP2-13B and Mistral-7B evaluated in the above experiments, we also consider the following LLMs, which are the basis for constructing OOP2-13B: (1) Llama-2-13B-Chat, the chat model of Llama 2. (2) OO-13B, which is short for OpenOrcaxOpenChat-Preview2-13B, Llama 2-13B fine-tuned with OpenOrca [58]; (3) Platypus2-13B, Llama 2-13B fine-tuned with Open-Platypus [49].

Table 11 report the results of various LLMs tuned for DP tasks. Here, we use Jellyfish-preview data (Section 3.3) – smaller than the full tuning data – for fast tuning and comparison. For models without tuning, it can be seen that OO-13B, tuned with augmented FLAN, roughly performs better than Llama 2. Platypus2-13B, tuned with Open-Platypus, though not delivering better overall performance than Llama 2, jointly contributes to the superiority of OOP2-13B. This advantage is also observed when Jellyfish-preview is applied, with OOP2-13B + DP(P) being the winner on 5 out of 11 datasets. Moreover, with Jellyfish-preview, we also observe the advantage of Platypus2-13B over Llama 2, showcasing its usefulness not only jointly but also individually. Meanwhile, we also observe the competitiveness of Mistral-7B, on a par with 13B models and even better on ED. Overall, this evaluation justifies the effectiveness of instruction tuning for DP and demonstrates the usefulness of enhancing reasoning capabilities (OO-13B) and logic proficiency (Platypus2-13B).

Table 12: Impact of knowledge injection on EM. “+ EM(P)” denotes the model tuned using only EM datasets in Jellyfish-preview. Few-shot is disabled.
Type Dataset Model

OOP2-13B

+ EM(P) w/o knowledge

+ EM(P) w/ knowledge

Seen

Amazon-Google

36.70

47.54

50.53

Beer

85.71

85.71

92.86

DBLP-ACM

78.86

85.33

90.26

DBLP-GoogleScholar

59.48

90.46

91.54

Fodors-Zagats

92.68

100 100

iTunes-Amazon

57.45

98.11

98.18
Unseen

Abt-Buy

61.78

83.35

84.44

Walmart-Amazon

67.29

71.71

73.18

6.6 Impact of Knowledge Injection

To evaluate the impact of knowledge injection, we consider the case of EM as an example and compare the models tuned with the EM datasets in Jellyfish-preview. Table 12 reports the results for OOP2-13B and its tuned version with knowledge either injected or not. Comparing OOP2-13B and the one without injected knowledge, as expected, the performance on EM is significantly raised. When we turn on knowledge injection, the performance further improves and the improvement is consistent. Furthermore, the improvement is also observed on unseen datasets, because like seen datasets of Amazon-Google and Beer, they are also product data. This observation suggests that the domain knowledge learned through tuning indeed enhances the model’s generalizability to unseen datasets.

Table 13: Head-to-head comparison of GPT-3.5 and Jellyfish-7B-I on interpretation, judged by GPT-4.
Task Dataset Model

GPT-3.5

Jellyfish-7B-I

ED Adult 11

9

Hospital

9

11
DI Buy

0

20
Restaurant 10 10
SM Synthea

8

12
EM Amazon-Google

8

12
Beer

7

13
DBLP-ACM

8

12
DBLP-GoogleScho8lar

4

16
Fodors-Zagats 12

8

iTunes-Amazon 19

1

Total

96

124
Winning Rate

43.64%

56.36%

6.7 Evaluation of Interpretation

We evaluate the performance of Jellyfish-7B-I and compare it with GPT-3.5 (gpt-3.5-turbo-0613). Given an answer output by Jellyfish-7B-I, we generate reasons using both Jellyfish-7B-I and GPT-3.5, and request GPT-4 to decide which one is better. Note that GPT-4 is unaware of the correct answer to the question in DP. As such, it needs to judge by its own analysis of the question as well. The reason why we do not choose other LLMs (in particular, Gemini Pro and Claude 2) as judge is because these models fail to convincingly compare Jellyfish-7B-I’s and GPT-3.5’s answers. They tend to choose overly simplified answers over comprehensive ones, and even prefer GPT-3.5’s answers to GPT-4’s.

Table 13 reports the head-to-head comparison of the two models. Jellyfish-7B-I defeats GPT-3.5 on 8 out of 11 datasets, achieving an overall winning rate of 56% across 220 tests. To delve into detailed comparison, we show an example on the Amazon-Google dataset in EM. The prompt is as follows.

Head-to-Head Instance You are an AI assistant that follows instruction extremely well. The user will give you a question. Your task is to answer as faithfully as you can. While answering, provide detailed explanation and justify your answer. Answer your reason first, then finish your response in a separate line with and ONLY with your final answer. Product A: [name: ”adobe dreamweaver cs3 [ mac ]”, manufacturer: ”adobe”, price: ”399.0”] Product B: [name: ”adobe cs3 web premium”, price: ”1659.99”] Are Product A and Product B the same Product?

The reason given by Jellyfish-7B-I is as follows. The two products are not the same. Product A is ”Adobe Dreamweaver CS3” specifically for Mac, while Product B is ”Adobe CS3 Web Premium”. Adobe Dreamweaver is a web development tool used for designing, coding, and managing websites. On the other hand, Adobe CS3 Web Premium is a suite of applications that includes Adobe Dreamweaver along with other software like Photoshop, Illustrator, and Flash. Therefore, Product B offers more than just the Dreamweaver software, which makes it different from Product A.
No

The reason given by GPT-3.5 is as follows. No, Product A and Product B are not the same product. The reason is that although both products are from Adobe and are part of the CS3 suite, they have different names and prices. Product A is specifically named ”Adobe Dreamweaver CS3 [Mac]” and is priced at $399.0, while Product B is named ”Adobe CS3 Web Premium” and is priced at $1659.99. The difference in names and prices indicates that they are different products within the CS3 suite, likely with different features and functionalities.
Therefore, Product A and Product B are not the same product.
Final answer: No.

GPT-4 compares the outputs of both models in four aspects (we summarize GPT-4’s response due to its length): (1) Structure: Both models effectively follow the instruction to provide a detailed explanation before the final answer. (2) Focus: Jellyfish-7B-I emphasizes the different functionalities of the software, while GPT-3.5 focuses more on the difference in names and prices. (3) Adherence to Instruction: GPT-3.5 includes an unnecessary repetition of the conclusion before the final answer, which is not strictly in line with the instruction to finish the response with the final answer only. (4) Clarity: Jellyfish-7B-I presents a more concise and focused response by avoiding repetition and sticking to the instruction more closely. Based on the analysis, GPT-4 decides that Jellyfish-7B-I performed better because it adhered more strictly to the original instruction by providing a clear, concise response without unnecessary repetition. In addition, with its built-in knowledge, Jellyfish-7B-I pointed out the difference in functionalities, whereas GPT-3.5 merely described the difference on the surface.

Furthermore, we find out that when reasoning, GPT-3.5 even fails to respond with a correct answer of matching or not for EM, as shown in its mediocre performance on datasets like Amazon-Google and DBLP-GoogleScholar in Table 4. In contrast to the above example of Jellyfish-7B-I’s landslide win, GPT-3.5 only has a slight edge when it wins. For instance, in an example of the Amazon-Google dataset, GPT-4 points out that GPT-3.5 has more focused justification and additional insights into the implications of the differences between the products, yet it also mentioned that GPT-3.5’s repetition of the final answer is a minor deviation from the instruction’s format.

6.8 Comparison of Efficiency

Instruction tuning spends around 5 hours for Jellyfish-13B, 2.5 hours for Jellyfish-7B, and 3.5 hours for Jellyfish-7B-I. For inference, Jellyfish-13B spends 0.08 – 0.15 seconds on average to process an instance. As a reference, GPT-4 spends an average of 1 – 8 seconds per instance. Although LLMs require substantial computational resources, thereby increasing the cost of use and compromising the efficiency, some non-LLM methods, such as RoBERTa and those built upon it (e.g., IPM), need fine-tuning when applied to unseen datasets. This fine-tuning time should be counted towards total time expense for fair comparison. Moreover, advanced learning techniques enables Jellyfish models to be quantized [61] or distilled to improve efficiency, which will be considered in the future.

7 Related Works

Since works on LLMs have been introduced in Section 2.2, we briefly review related works on DP here.

Seen Tasks. The tasks targeted in this paper collectively form the most critical part of DP, and they have been extensively studied.

  • ED: Traditional methods mainly depend on hand-crafted rules [12], pattern discovery [13], outlier detection [79], or statistical modeling [34, 102]. Recent works employ more advanced ML techniques such as few-shot learning based on a noisy channel model (HoloDetect) [28], or resort to a series of ML pipelines (Raha) [68], including feature engineering, clustering, and classification.

  • DI: While rule-based solutions [83, 91] remain one of the prevalent approaches, another stream of works develops ML models for this task, including variational autoencoders [75], generative adversarial networks [110], and attention mechanisms [106, 96]. To seek better imputation performance, recent progress utilizes PLMs to capture semantics [69].

  • SM: The use of similarity matrices is a traditional way [86]. More advanced methods utilize ML techniques [23], including deep learning models [90]. SMAT [113] is an approach leveraging attention-based deep learning. A recent attempt employs GPT-4 for SM [89].

  • EM The procedure is divided into blocking and in-block pairwise matching for the sake of efficiency. Blocking groups pairs of entities that potentially match into the same block, and then pairwise matching is performed within each block to find matching entities. Traditional solutions for blocking mostly rely on attribute equivalence, hashes, or similarities [77]. Recently, the feasibility of using DL methods for blocking has also been examined [95], following the use of DL for pairwise matching [71]. In addition, there are tools that handle both steps such as Megallan [43] and Ditto [57]. A recent evaluation validates the effectiveness of in-context learning in enhancing LLMs’ EM performance [78].

Unseen Tasks. We review the related studies on CTA and AVE.

  • CTA As a typical table understanding task, it often appears in the studies on table representation learning [36, 17, 92]. These approaches fine-tune PLMs, typically BERT [18] and its variants [38, 63]. Recently, ChatGPT has been utilized to solve this task [44].

  • AVE Early approaches employ LSTM-CRF [46, 120]. With the prevalence of PLMs, like CTA, many solutions to AVE resort to using BERT [109, 103, 121]. A recent work [4] considered fine-tuning GPT-3.5 and prompting GPT-4, and compared with open-source LLMs like Stable Beluga 2 [66] and SOLAR [100].

Generic Solution. Whereas the above solutions are specialized for a task, recent progress developed generic solutions to DP based on GPT-3 [73], or GPT-3.5 and GPT-4 [112], basically employing various prompt engineering techniques on frozen LLMs. Fine-tuning GPT-3.5 and ChatGPT for a variety of table-related tasks has also been investigated [54], and several DP tasks are covered.

Other DP Tasks. Besides the ones covered by this paper, there are many other DP tasks. We name a few examples.

  • Data repairing corrects erroneous values in a dataset. Typical solutions are HoloClean [83] and Baran [67]. HoloClean can detect errors and perform repairing subsequently. Baran only repair errors and resort to Raha to detect errors. Recent advancements [51, 81] utilized Bayesian inference to capture dependencies between attributes.

  • Data fusion is the process of integrating multiple data sources that contain information about the same set of entities, with possibly conflicting attribute values. Surveys of early attempts are available [55, 6], with a detailed comparison of various fusion methods on deep web data [55]. More recent endeavors targeted multi-truth data fusion [2] and golden record [29].

  • Data transformation is the process of converting data from one format into another format. Notable approaches are transformation by user-specified examples [27] and learning from large collections of paired table columns [39]. In addition, the aforementioned generic DP solution also covers this task [73].

Data Preparation. DP is also studied in the name of data preparation, which manipulates raw data into a form that can be readily analyzed. A notable Python library is DataPrep [48]. In addition to the DP tasks listed above, data augmentation [10, 70, 62, 119] is another key operation in data preparation. Another line of work studies dataset discovery [3, 45, 21, 74], particularly for integrating data lake tables [41] where joinable [19], unionable [40], and related table search [116] are often used for identifying candidates. Despite search speed being a key concern, LLMs are anticipated to be used on top of their outcomes for automated integration in a data lake [1].

8 Conclusions

We studied the problem of instruction-tuning LLMs as universal DP task solvers. By devising a series of data preparation, instance serialization, and knowledge injection techniques, we proposed Jellyfish, a dataset for this purpose. LLMs tuned with Jellyfish are adept at understanding natural language, enabling users to craft instructions manually for DP tasks. Another notable feature of Jellyfish is its reasoning data, which can be used for tuning for interpretation, thereby providing explanations of the model’s outputs. For evaluation, we tuned two models, Jellyfish-7B and Jellyfish-13B, which can operate on a local GPU without compromising data security. The experiments demonstrated their competitiveness against existing DP solutions, impressive generalizability to new tasks, the ability of retaining performance in NLP tasks, as well as the competence in interpreting the model’s output.

Future research directions include expanding Jellyfish to encompass more DP tasks, such as data repairing and data transformation. Furthermore, we are considering the development of a quantized or distilled model to enhance processing speed, as well as a multi-agent system for adaptable, conversational, code-free DP pipeline.

Acknowledgements

This work is supported by NEC Corporation and JSPS Kakenhi 22H03903, 23H03406, 23K17456, and JST CREST JPMJCR22M2. We thank Prof. Makoto Onizuka, Yuki Arase, and Yuya Sasaki for providing equipment support for completing this research.

References

  • [1] S. Arora, B. Yang, S. Eyuboglu, A. Narayan, A. Hojel, I. Trummer, and C. Ré. Language models enable simple systems for generating structured views of heterogeneous data lakes. arXiv preprint arXiv:2304.09433, 2023.
  • [2] F. Azzalini, D. Piantella, E. Rabosio, and L. Tanca. Enhancing domain-aware multi-truth data fusion using copy-based source authority and value similarity. The VLDB Journal, 32(3):475–500, 2023.
  • [3] A. Bogatu, A. A. Fernandes, N. W. Paton, and N. Konstantinou. Dataset discovery in data lakes. In ICDE, pages 709–720. IEEE, 2020.
  • [4] A. Brinkmann, R. Shraga, and C. Bizer. Product attribute value extraction using large language models. arXiv preprint arXiv:2310.12537, 2023.
  • [5] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020.
  • [6] G. K. Canalle, A. C. Salgado, and B. F. Loscio. A survey on data fusion: what for? in what form? what is next? Journal of Intelligent Information Systems, 57:25–50, 2021.
  • [7] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L.-D. Ibáñez, E. Kacprzak, and P. Groth. Dataset search: a survey. The VLDB Journal, 29(1):251–272, 2020.
  • [8] L. Cheng, X. Li, and L. Bing. Is GPT-4 a good data analyst? arXiv preprint arXiv:2305.15038, 2023.
  • [9] Z. Cheng, J. Kasai, and T. Yu. Batch prompting: Efficient inference with large language model APIs. arXiv preprint arXiv:2301.08721, 2023.
  • [10] N. Chepurko, R. Marcus, E. Zgraggen, R. C. Fernandez, T. Kraska, and D. Karger. ARDA: automatic relational data augmentation for machine learning. arXiv preprint arXiv:2003.09758, 2020.
  • [11] F. Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
  • [12] X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458–469. IEEE, 2013.
  • [13] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247–1261, 2015.
  • [14] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • [15] T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
  • [16] S. Das, A. Doan, P. S. G. C., C. Gokhale, P. Konda, Y. Govind, and D. Paulsen. The magellan data repository. https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository.
  • [17] X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu. TURL: Table understanding through representation learning. ACM SIGMOD Record, 51(1):33–40, 2022.
  • [18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [19] Y. Dong, C. Xiao, T. Nozawa, M. Enomoto, and M. Oyamada. DeepJoin: Joinable table discovery with pre-trained language models. arXiv preprint arXiv:2212.07588, 2022.
  • [20] H. Face. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2024.
  • [21] G. Fan, J. Wang, Y. Li, D. Zhang, and R. Miller. Semantics-aware dataset discovery from data lakes with contextualized column-based representation learning. arXiv preprint arXiv:2210.01922, 2022.
  • [22] Flipkart.com. Flipkart products. https://www.kaggle.com/datasets/PromptCloudHQ/flipkart-products.
  • [23] A. Gal, H. Roitman, and R. Shraga. Learning to rerank schema matches. IEEE Transactions on Knowledge and Data Engineering, 33(8):3104–3116, 2019.
  • [24] C. Gao, X. Lan, N. Li, Y. Yuan, J. Ding, Z. Zhou, F. Xu, and Y. Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives. arXiv preprint arXiv:2312.11970, 2023.
  • [25] R. Hai, C. Koutras, C. Quix, and M. Jarke. Data lakes: A survey of functions and systems. IEEE Transactions on Knowledge and Data Engineering, 2023.
  • [26] J. Han, J. Pei, and H. Tong. Data mining: concepts and techniques. Morgan kaufmann, 2022.
  • [27] Y. He, X. Chu, K. Ganjam, Y. Zheng, V. Narasayya, and S. Chaudhuri. Transform-data-by-example (TDE) an extensible search engine for data transformations. PVLDB, 11(10):1165–1177, 2018.
  • [28] A. Heidari, J. McGrath, I. F. Ilyas, and T. Rekatsinas. HoloDetect: Few-shot learning for error detection. In SIGMOD, pages 829–846, 2019.
  • [29] A. Heidari, G. Michalopoulos, I. F. Ilyas, and T. Rekatsinas. Record fusion via inference and data augmentation. ACM/JMS Journal of Data Science, 2023.
  • [30] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • [31] S. Hong, X. Zheng, J. Chen, Y. Cheng, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, et al. MetaGPT: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
  • [32] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for NLP. In ICML, pages 2790–2799. PMLR, 2019.
  • [33] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • [34] Z. Huang and Y. He. Auto-detect: Data-driven error detection in tables. In SIGMOD, pages 1377–1392, 2018.
  • [35] Hugging Face. Llama and llama 2 variants. https://huggingface.co/models?other=llama, 2023.
  • [36] H. Iida, D. Thai, V. Manjunatha, and M. Iyyer. Tabbie: Pretrained representations of tabular data. arXiv preprint arXiv:2105.02584, 2021.
  • [37] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • [38] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.
  • [39] Z. Jin, Y. He, and S. Chauduri. Auto-transform: learning-to-transform by patterns. PVLDB, 13(12):2368–2381, 2020.
  • [40] A. Khatiwada, G. Fan, R. Shraga, Z. Chen, W. Gatterbauer, R. J. Miller, and M. Riedewald. SANTOS: Relationship-based semantic table union search. SIGMOD, 1(1):1–25, 2023.
  • [41] A. Khatiwada, R. Shraga, W. Gatterbauer, and R. J. Miller. Integrating data lake tables. PVLDB, 16(4):932–945, 2022.
  • [42] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
  • [43] P. Konda, S. Das, A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, et al. Magellan: toward building entity matching management systems over data science stacks. PVLDB, 9(13):1581–1584, 2016.
  • [44] K. Korini and C. Bizer. Column type annotation using ChatGPT. arXiv preprint arXiv:2306.00745, 2023.
  • [45] C. Koutras, G. Siachamis, A. Ionescu, K. Psarakis, J. Brons, M. Fragkoulis, C. Lofi, A. Bonifati, and A. Katsifodimos. Valentine: Evaluating matching techniques for dataset discovery. In ICDE, pages 468–479. IEEE, 2021.
  • [46] Z. Kozareva, Q. Li, K. Zhai, and W. Guo. Recognizing salient entities in shopping queries. In ACL, pages 107–111, 2016.
  • [47] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In SOSP, 2023.
  • [48] S. F. U. D. S. Lab. DataPrep. https://dataprep.ai/.
  • [49] A. N. Lee, C. J. Hunter, and N. Ruiz. Platypus: Quick, cheap, and powerful refinement of LLMs. arXiv preprint arXiv:2308.07317, 2023.
  • [50] A. N. Lee, C. J. Hunter, N. Ruiz, B. Goodson, W. Lian, G. Wang, E. Pentland, A. Cook, C. Vong, and ”Teknium”. OpenOrcaPlatypus: Llama2-13B model instruct-tuned on filtered OpenOrcaV1 GPT-4 dataset and merged with divergent STEM and logic dataset model. https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B, 2023.
  • [51] A. Lew, M. Agrawal, D. Sontag, and V. Mansinghka. PClean: Bayesian data cleaning at scale with domain-specific probabilistic programming. In AISTATS, pages 1927–1935. PMLR, 2021.
  • [52] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS, 33:9459–9474, 2020.
  • [53] H. Li, Y. Su, D. Cai, Y. Wang, and L. Liu. A survey on retrieval-augmented text generation. arXiv preprint arXiv:2202.01110, 2022.
  • [54] P. Li, Y. He, D. Yashar, W. Cui, S. Ge, H. Zhang, D. R. Fainman, D. Zhang, and S. Chaudhuri. Table-GPT: Table-tuned GPT for diverse table tasks. arXiv preprint arXiv:2310.09263, 2023.
  • [55] X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? arXiv preprint arXiv:1503.00303, 2015.
  • [56] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
  • [57] Y. Li, J. Li, Y. Suhara, A. Doan, and W.-C. Tan. Deep entity matching with pre-trained language models. PVLDB, 14(1):50–60, 2020.
  • [58] W. Lian, B. Goodson, E. Pentland, A. Cook, C. Vong, and “Teknium”. OpenOrca: An open dataset of GPT augmented FLAN reasoning traces. https://https://huggingface.co/Open-Orca/OpenOrca, 2023.
  • [59] S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
  • [60] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
  • [61] S.-y. Liu, Z. Liu, X. Huang, P. Dong, and K.-T. Cheng. LLM-FP4: 4-bit floating-point quantized transformers. arXiv preprint arXiv:2310.16836, 2023.
  • [62] T. Liu, J. Fan, Y. Luo, N. Tang, G. Li, and X. Du. Adaptive data augmentation for supervised learning over missing data. PVLDB, 14(7):1202–1214, 2021.
  • [63] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • [64] W. Lu, J. Zhang, J. Zhang, and Y. Chen. Large language model for table processing: A survey. arXiv preprint arXiv:2402.05121, 2024.
  • [65] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, and Y. Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747, 2023.
  • [66] D. Mahan, R. Carlow, L. Castricato, N. Cooper, and C. Laforte. Stable beluga 2. https://huggingface.co/stabilityai/StableBeluga2, 2023.
  • [67] M. Mahdavi and Z. Abedjan. Baran: Effective error correction via a unified context representation and transfer learning. PVLDB, 13(12):1948–1961, 2020.
  • [68] M. Mahdavi, Z. Abedjan, R. Castro Fernandez, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. Raha: A configuration-free error detection system. In SIGMOD, pages 865–882, 2019.
  • [69] Y. Mei, S. Song, C. Fang, H. Yang, J. Fang, and J. Long. Capturing semantics for imputation with pre-trained language models. In ICDE, pages 61–72. IEEE, 2021.
  • [70] Z. Miao, Y. Li, and X. Wang. Rotom: A meta-learned data augmentation framework for entity matching, data cleaning, text classification, and beyond. In SIGMOD, pages 1303–1316, 2021.
  • [71] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In SIGMOD, pages 19–34, 2018.
  • [72] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707, 2023.
  • [73] A. Narayan, I. Chami, L. Orr, and C. Ré. Can foundation models wrangle your data? PVLDB, 16(4):738–746, 2022.
  • [74] F. Nargesian, K. Pu, B. Ghadiri-Bashardoost, E. Zhu, and R. J. Miller. Data lake organization. IEEE Transactions on Knowledge and Data Engineering, 35(1):237–250, 2022.
  • [75] A. Nazabal, P. M. Olmos, Z. Ghahramani, and I. Valera. Handling incomplete heterogeneous data using VAEs. Pattern Recognition, 107:107501, 2020.
  • [76] OpenAI. March 20 ChatGPT outage: Here’s what happened, 2023.
  • [77] G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas. Blocking and filtering techniques for entity resolution: A survey. ACM Computing Surveys, 53(2):1–42, 2020.
  • [78] R. Peeters and C. Bizer. Entity matching using large language models. arXiv preprint arXiv:2310.11244, 2023.
  • [79] N. Prokoshyna, J. Szlichta, F. Chiang, R. J. Miller, and D. Srivastava. Combining quantitative and logical data cleaning. PVLDB, 9(4):300–311, 2015.
  • [80] C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, and M. Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
  • [81] J. Qin, S. Huang, Y. Wang, J. Zhu, Y. Zhang, Y. Miao, R. Mao, M. Onizuka, and C. Xiao. BClean: A bayesian data cleaning system. arXiv preprint arXiv:2311.06517, 2023.
  • [82] S. Razniewski, A. Yates, N. Kassner, and G. Weikum. Language models as or for knowledge bases. arXiv preprint arXiv:2110.04888, 2021.
  • [83] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(10):1190–1201, 2017.
  • [84] A. Reviews. Amazon reviews: Unlocked mobile phones. https://www.kaggle.com/datasets/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones.
  • [85] D. Ritze, O. Lehmberg, Y. Oulabi, and C. Bizer. Profiling the potential of web tables for augmenting cross-domain knowledge bases. In WWW, pages 251–261, 2016.
  • [86] T. Sagi and A. Gal. Schema matching prediction with applications to data source discovery and dynamic ensembling. The VLDB Journal, 22:689–710, 2013.
  • [87] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
  • [88] J. Savelka, K. D. Ashley, M. A. Gray, H. Westermann, and H. Xu. Can gpt-4 support analysis of textual data in tasks requiring highly specialized domain expertise? arXiv preprint arXiv:2306.13906, 2023.
  • [89] E. Sheetrit, M. Brief, M. Mishaeli, and O. Elisha. Rematch: Retrieval enhanced schema matching with llms. arXiv preprint arXiv:2403.01567, 2024.
  • [90] R. Shraga, A. Gal, and H. Roitman. Adnev: Cross-domain schema matching using deep similarity matrix adjustment and evaluation. PVLDB, 13(9):1401–1415, 2020.
  • [91] S. Song, Y. Sun, A. Zhang, L. Chen, and J. Wang. Enriching data imputation under similarity rule constraints. IEEE transactions on knowledge and data engineering, 32(2):275–287, 2018.
  • [92] Y. Suhara, J. Li, Y. Li, D. Zhang, Ç. Demiralp, C. Chen, and W.-C. Tan. Annotating columns with pre-trained language models. In SIGMOD, pages 1493–1503, 2022.
  • [93] D. Tang, Z. Chen, K. Kim, Y. Song, H. Tian, S. Ezzini, Y. Huang, and J. K. T. F. Bissyande. Collaborative agents for software engineering. arXiv preprint arXiv:2402.02172, 2024.
  • [94] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, 3(6):7, 2023.
  • [95] S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, and A. Doan. Deep learning for blocking in entity matching: a design space exploration. PVLDB, 14(11):2459–2472, 2021.
  • [96] S. Tihon, M. U. Javaid, D. Fourure, N. Posocco, and T. Peel. DAEMA: Denoising autoencoder with mask attention. In ICANN, pages 229–240, 2021.
  • [97] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [98] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [99] J. Tu, J. Fan, N. Tang, P. Wang, G. Li, X. Du, X. Jia, and S. Gao. Unicorn: A unified multi-tasking model for supporting matching tasks in data integration. Proceedings of the ACM on Management of Data, 1(1):1–26, 2023.
  • [100] Upstage. Solar-0-70b-16bit. https://huggingface.co/upstage/SOLAR-0-70b-16bit, 2023.
  • [101] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023.
  • [102] P. Wang and Y. He. Uni-detect: A unified approach to automated error detection in tables. In SIGMOD, pages 811–828, 2019.
  • [103] Q. Wang, L. Yang, B. Kanagal, S. Sanghai, D. Sivakumar, B. Shu, Z. Yu, and J. Elsas. Learning to extract attribute value from product via question answering: A multi-task approach. In KDD, pages 47–55, 2020.
  • [104] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
  • [105] L. Weng. Prompt engineering. https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/, 2023.
  • [106] R. Wu, A. Zhang, I. Ilyas, and T. Rekatsinas. Attention-based learning for missing data imputation in HoloClean. MLSys, 2:307–325, 2020.
  • [107] Z. Wu, R. Peng, X. Han, S. Zheng, Y. Zhang, and C. Xiao. Smart agent-based modeling: On the use of large language models in computer simulations. arXiv preprint arXiv:2311.06330, 2023.
  • [108] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
  • [109] H. Xu, W. Wang, X. Mao, X. Jiang, and M. Lan. Scaling up open tagging from tens to thousands: Comprehension empowered attribute value extraction from product title. In ACL, pages 5214–5223, 2019.
  • [110] J. Yoon, J. Jordon, and M. Schaar. GAIN: Missing data imputation using generative adversarial nets. In ICML, pages 5689–5698, 2018.
  • [111] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  • [112] H. Zhang, Y. Dong, C. Xiao, and M. Oyamada. Large language models as data preprocessors. arXiv preprint arXiv:2308.16361, 2023.
  • [113] J. Zhang, B. Shin, J. D. Choi, and J. C. Ho. SMAT: An attention-based deep learning solution to the automation of schema matching. In ADBIS, pages 260–274. Springer, 2021.
  • [114] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, T. Zhang, F. Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
  • [115] T. Zhang, X. Yue, Y. Li, and H. Sun. Tablellama: Towards open large generalist models for tables. arXiv preprint arXiv:2311.09206, 2023.
  • [116] Y. Zhang and Z. G. Ives. Finding related tables in data lakes for interactive data science. In SIGMOD, pages 1951–1966, 2020.
  • [117] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
  • [118] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • [119] Z. Zhao and R. Castro Fernandez. Leva: Boosting machine learning performance with relational embedding data augmentation. In SIGMOD, pages 1504–1517, 2022.
  • [120] G. Zheng, S. Mukherjee, X. L. Dong, and F. Li. Opentag: Open attribute value extraction from product profiles. In KDD, pages 1049–1058, 2018.
  • [121] T. Zhu, Y. Wang, H. Li, Y. Wu, X. He, and B. Zhou. Multimodal joint attribute prediction and value extraction for e-commerce product. arXiv preprint arXiv:2009.07162, 2020.

Appendix A Model Setup

The hyperparameter setup for tuning a Jellyfish model is:

  • lora_target: q_proj, k_proj, v_proj, o_proj;

  • per_device_train_batch_size: 2;

  • gradient_accumulation_steps: 2;

  • learning_rate: 3e-5;

  • num_train_epochs: 5.0;

  • lora_rank: 32;

  • lora_alpha: 32.

The following hyperparameters are used for inference:

  • temperature: 0.35;

  • top_p: 0.9;

  • top_k: 10.

Appendix B Data Construction Prompts

B.1 Instruction Data

For instruction data, we show the prompt for each task, using one dataset as an example. Then, we show the prompt for reasoning data, which slightly differs from instruction data. The prompts for inference are the same as tuning, except that dataset-specific knowledge is optional. The prompts for reasoning ground truth collection and head-to-head judge are used for GPT-4.

Instruction Data – ED (Adult) (system message) You are an AI assistant that follows instruction extremely well. User will give you a question. Your task is to answer as faithfully as you can. (task description) Your task is to determine if there is an error in the value of a specific attribute within the whole record provided. The attributes may include age, workclass, education, marital status, occupation, relationship, race, sex, hours per week, country, and income. (injected knowledge) Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don’t make sense given the context of the whole record. (instance content) Record [age: ”18-21”, workclass: ”Private”, education: ”Some-college”, maritalstatus: ”Never-married”, occupation: ”Other-service”, relationship: ”Own-child”, race: ”White”, sex: ”Male”, hoursperweek: ”30”, country: ”United-States”, income: ”eLssThan50K”] Attribute for Verification: [income: ”eLssThan50K”] (question) Is there an error in the value of the ”income” attribute? (output format) Choose your answer from: [Yes, No]
Instruction Data – DI (Restaurant) (system message) You are an AI assistant that follows instruction extremely well. User will give you a question. Your task is to answer as faithfully as you can. (task description) You are presented with a restaurant record that is missing a specific attribute: the city. Your task is to deduce or infer the city of the restaurant using the available information in the record. You may be provided with fields like ’Name’, ’Address’, ’Phone’, and ’Type’ to help you in the inference. (instance content) Record: [name: ”darbar”, addr: ”44 w. 56th st.”, phone: ”212-432-7227”, type: ”indian”]. (question) Based on the provided restaurant record, what would you infer is the value for the missing attribute ”City”? (output format) Answer the name of the city.
Instruction Data – SM (MIMIC-III) (system message) You are an AI assistant that follows instruction extremely well. User will give you a question. Your task is to answer as faithfully as you can. (task description) Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables. Each attribute will be described by its name and a brief description. Your goal is to assess if they refer to the same information based on these names and descriptions provided. (instance content) Attribute A is [name: ”visit_occurrence-visit_end_date”, description: ”the end date of the visit. if this is a one-day visit the end date should match the start date.”]. Attribute B is [name: ”admissions-dischtime”, description: ”dischtime provides the date and time the patient was discharged from the hospital.”]. (question) Are Attribute A and Attribute B semantically equivalent? (output format) Choose your answer from: [Yes, No]
Instruction Data – EM (Beer) (system message) You are an AI assistant that follows instruction extremely well. User will give you a question. Your task is to answer as faithfully as you can. (task description) You are tasked with determining whether two Products listed below are the same based on the information provided. Carefully compare all the attributes before making your decision. (injected knowledge) Note that missing values (N/A or ”nan”) should not be used as a basis for your decision. (instance content) Product A: [name: ”Sequoia American Amber Ale”, factory: ”Wig And Pen”] Product B: [name: ”Aarhus Cains Triple A American Amber Ale”, factory: ”Aarhus Bryghus”] (question) Are Product A and Product B the same Product? (output format) Choose your answer from: [Yes, No]

B.2 Reasoning Data

To construct reasoning data, we use the following prompt.

Reasoning Data (system message) [same as Instruction Data] While answering, provide detailed explanation and justify your answer. (task description – question) [same as Instruction Data] (output format) After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].

We use the following prompt to collect ground truth from GPT-4 (for the Beer dataset in EM).

Reasoning Ground Truth Collection (system message – output format) [same as Reasoning Data] (injected knowledge) Note that different factories can belong to the same parent company. The company name of Product B may occur in its product name. (answer hint) You can use the ”Hint” below, but your response cannot contain any information from it. Hint: the final answer is ”No”

In the above prompt, we inject a piece of knowledge specific to the dataset. With this additional knowledge, GPT-4 can produce high-quality reasoning result. Note that such knowledge is not prompted to Jellyfish models, as for unseen datasets such knowledge is not always available. In addition to the injected knowledge, GPT-4 also receives a hint to the answer of yes or no. As such, we can guarantee that the output reason always points to the correct direction. A sample answer from GPT-4 is given below.

The two products listed are not the same. Product A is named ”Sequoia American Amber Ale” and is produced by ”Wig And Pen”. In contrast, Product B is ”Aarhus Cains Triple A American Amber Ale” and is produced by ”Aarhus Bryghus”. Despite both being types of American Amber Ale, the names and manufacturers of the two products are different, indicating that they are distinct products.
No

Appendix C Injected Knowledge

Table 14: General knowledge.
Prompt

Missing values (N/A or ”nan”) should not be used as a basis for your decision.

If there are missing values, you should make inferences based only on the information that is available.

Table 15: Task-specific knowledge.
Task Prompt
ED

Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don’t make sense given the context of the whole record. (Used when showing the whole record)

Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don’t make sense for that attribute. (Used when showing only one attribute)

Capitalization should not be a factor in deciding whether there is an error or not.

DI

Note that values such as ’nan’ and ’N/A’ mean missing vaules, and they are not considered as errors. (used when we decide not to treat missing values as errors)

Note that values such as ’nan’ and ’N/A’ mean missing values, and they ARE errors. (used when we decide to treat missing values as errors)

EM

To determine if two values are identical, you need to examine both their full names and corresponding acronyms.

Table 16: Dataset-specific knowledge.
Task Dataset Prompt
ED Adult

Both the ’age’ attribute and the ’hoursperweek’ attribute can represent a range of integer values.

Verify the consistency of target attribute with related attributes to identify any errors.

Hospital

The value of attribute ”score” can be a percentage number.

DI Restaurant

The city can often be deduced from the area code of the phone number and the specific street name.

EM Amazon-Google

Different editions, versions, or operating systems for the same software are all considered as different products.

You should compare the two product numbers first.

Beer

Note that different factories can belong to the same parent company.

Beverages that undergo different production processes, such as the use of various types of wood in the barrelling process, may be considered distinct products.

Fodors-Zagats

The type of a specific restaurant might vary between different datasets.

iTunes-Amazon

The length of the same song might vary slightly across different datasets due to rounding or data entry discrepancies.

DBLP-ACM

The names of authors might be presented in various formats or sequences, even when referring to the same article.

DBLP-GoogleScholar

The names of authors might be presented in various formats or sequences, even when referring to the same article.

Appendix D Few-Shot Prompting

We apply few-shot prompting by manually selecting a subset of data instances from the dataset and labeling them. For instance, a few-shot example for the Beer dataset is presented as follows:

Few-Shot Prompting (system message – injected knowledge) [same as Instruction Data] (1st example’s instance content) ### Instruction: Product A: [name: ”Shirt Tail Amber”, factory: ”Iron Hill Brewery & Restaurant”] Product B: [name: ”Iron Hill Shirt Tail Amber”, factory: ”Iron Hill Maple Shade”] (1st example’s question) Are Product A and Product B the same Product? (1st example’s output format) Choose your answer from: [Yes, No] (1st example’s answer) ### Response: Yes (other examples) (instance content – output format) [same as Instruction Data] ### Response:

The example follows the same format of instance content, question, and output format as in the instruction data. It also provides the answer indicated by ### Response: Yes. Whereas we only show an positive example here, it is suggested to include both positive and negative examples. After the final example, the instance to be processed is presented in the prompt, and the model follows the same output format as demonstrated in the examples.

Since ground truths are usually not available in real applications, users can handcraft few-shot examples for inference. On the other hand, few-shot examples can be automatically generated by randomly injecting errors for ED and DI, such as missing values, typographical/formatting errors, and randomly swapping values for two columns in a tuple or two tuples in a column. For SM and EM, we can employ rule-based methods (e.g., blocking rules [43]) to quickly find a few matches and use them as few-shot examples.

D.1 Error Detection

The few-shot examples for the Flights and Rayyan datasets are given as follows.

Flights – 1st Example ### Instruction: Record [datasource: ”flightview”, flight: ”AA-3063-SLC-LAX”, scheduled departure time: ”nan”, actual departure time: ”8:40 p.m.”, scheduled arrival time: ”nan”, actual arrival time: ”9:11 p.m.”] Attribute for Verification: [scheduled departure time: ”nan”] Question: Is there an error in the value of the ”scheduled departure time” attribute? Choose your answer from: [Yes, No] ### Response: Yes
Flights – 2st Example ### Instruction: Record [datasource: ”aa”, flight: ”AA-3823-LAX-DEN”, scheduled departure time: ”9:00 p.m.”, actual departure time: ”nan”, scheduled arrival time: ”12/02/2011 12:15 a.m.”, actual arrival time: ”nan”] Attribute for Verification: [scheduled arrival time: ”12/02/2011 12:15 a.m.”] Question: Is there an error in the value of the ”scheduled arrival time” attribute? Choose your answer from: [Yes, No] ### Response: Yes
Flights – 3rd Example ### Instruction: Record [datasource: ”flightview”, flight: ”AA-616-DFW-DTW”, scheduled departure time: ”9:05 a.m.”, actual departure time: ”10:11 a.m.”, scheduled arrival time: ”12:35 p.m.”, actual arrival time: ”1:18 p.m.”] Attribute for Verification: [datasource: ”flightview”] Question: Is there an error in the value of the ”datasource” attribute? Choose your answer from: [Yes, No] ### Response: No
Rayyan – 1st Example ### Instruction: Record [article_title: ”A re-appraisal of screening for colour vision impairments”, article_language: ”nan”, journal_title: ”Child: Care, Health & Development”, jounral_abbreviation: ”nan”, journal_issn: ”0305-1862”, article_jvolumn: ”23”, article_jissue: ”5”, article_jcreated_at: ”1/1/97”, article_pagination: ”391-398”, author_list: ””D. M. B. Hall”,”E. Holroyd””] Attribute for Verification: [jounral_abbreviation: ”nan”] Question: Is there an error in the value of the ”jounral_abbreviation” attribute? Choose your answer from: [Yes, No] ### Response: Yes
Rayyan – 2nd Example ### Instruction: Record [article_title: ”Nurturing students’ interest in primary care research through summer training in meta-analysis.”, article_language: ”eng”, journal_title: ”Academic Medicine: Journal Of The Association Of American Medical Colleges”, jounral_abbreviation: ”nan”, journal_issn: ”1040-2446”, article_jvolumn: ”76”, article_jissue: ”5”, article_jcreated_at: ”5/1/01”, article_pagination: ”526”, author_list: ””L N Meurer””] Attribute for Verification: [article_jissue: ”5”] Question: Is there an error in the value of the ”article_jissue” attribute? Choose your answer from: [Yes, No] ### Response: No
Rayyan – 3rd Example ### Instruction: Record [article_title: ”Different renal toxicity profiles in the association of cyclosporine and tacrolimus with sirolimus in rats.”, article_language: ”eng”, journal_title: ”Nephrology, dialysis, transplantation : official publication of the European Dialysis and Transplant Association - European Renal Association”, jounral_abbreviation: ”Nephrol. Dial. Transplant.”, journal_issn: ”1460-2385”, article_jvolumn: ”23”, article_jissue: ”10”, article_jcreated_at: ”10/1/08”, article_pagination: ”3111-9”, author_list: ””N\u033cria Lloberas”,”Marcel la Franquesa”,”Josep M Cruzado”,”Josep M Griny\ufffd_”,”In\u0329s Rama”,”Gabriela Alperovich”,”Immaculada Herrero-Fresneda”,”Joan Torras”,”Pepita Gim\u0329nez-Bonaf\u0329””] Attribute for Verification: [article_pagination: ”3111-9”] Question: Is there an error in the value of the ”article_pagination” attribute? Choose your answer from: [Yes, No] ### Response: Yes

D.2 Data Imputation

The few-shot examples for the Flikpkart and Phone datasets are given as follows.

Flipkart – 1st Example ### Instruction: Record: [Product Name: ”Himmlisch ST381 Magnetic Sun Shade For Maruti Alto”, description: ”Himmlisch ST381 Magnetic Sun Shade For Maruti Alto (Side Window) Price: Rs. 1,899 Beat the heat this summer and feel like a VIP with Himmlisch Car Window Magnetic Sunshades. These magnetic sunshades create a mesh layer to stops the heat. Magnet border gets easily stick to your car window door edges (No need of Suction cups) Features: Block UV Rays Keeps Car Cool Easy to install and remove Durable and Exact Fit Provides Complete privacy Resists Heat Mesh Type Sunshade Package Contents: 1 x Set Of 4 Magnetic Sunshades,Specifications of Himmlisch ST381 Magnetic Sun Shade For Maruti Alto (Side Window) General Brand Himmlisch Model Number ST381 Magnetic Placement Position Side Window Color Black Dimensions Weight 4000 g Depth 1.1 cm In the Box Sales Package 4 Sun Shade Pack of 4”] Based on the provided product record, what would you infer is the value for the missing attribute ”brand”? Answer the name of the brand. ### Response: Himmlisch
Flipkart – 2nd Example ### Instruction: Record: [Product Name: ”dilli bazaaar Bellies, Corporate Casuals, Casuals”, description: ”Key Features of dilli bazaaar Bellies, Corporate Casuals, Casuals Material: Fabric Occasion: Ethnic, Casual, Party, Formal Color: Pink Heel Height: 0,Specifications of dilli bazaaar Bellies, Corporate Casuals, Casuals General Occasion Ethnic, Casual, Party, Formal Ideal For Women Shoe Details Weight 200 g (per single Shoe) - Weight of the product may vary depending on size. Heel Height 0 inch Outer Material Fabric Color Pink”] Based on the provided product record, what would you infer is the value for the missing attribute ”brand”? Answer the name of the brand. ### Response: dilli bazaaar
Flipkart – 3rd Example ### Instruction: Record: [Product Name: ”Shining Diva Alloy Yellow Gold Bangle Set”, description: ”Shining Diva Alloy Yellow Gold Bangle Set (Pack of 2) Price: Rs. 499 Accentuate Your Feminine Charm Wearing This Beautiful Bangle From The House Of Shining Diva. Made From Premium Quality Material, It Will Retain Its Quality And Lustre For Years To Come. This Bangle Is Lightweight And Skin Friendly. Featuring A Stylish Design And Great Finish, It Will Definitely Give Your Overall Look An Ethereal Dimension. This Bangle Will Surely Catch Your Fancy At Once. It Is Worth Investing In And Will Definitely Get You Noticed. This Bangle Comes In A Set Of Two.Accentuate Your Feminine Charm Wearing This Beautiful Bangle From The House Of Shining Diva. Made From Premium Quality Material, It Will Retain Its Quality And Lustre For Years To Come. This Bangle Is Lightweight And Skin Friendly. Featuring A Stylish Design And Great Finish, It Will Definitely Give Your Overall Look An Ethereal Dimension. This Bangle Will Surely Catch Your Fancy At Once. It Is Worth Investing In And Will Definitely Get You Noticed. This angle Comes In A Set Of Two.”] Based on the provided product record, what would you infer is the value for the missing attribute ”brand”? Answer the name of the brand. ### Response: Shining Diva
Phone – 1st Example ### Instruction: Record: Record: [Product Name: ”UNLOCKED RIM BlackBerry Pearl Flip 8220 Smart Cell Phone - Red”] Based on the provided cellphone record, what would you infer is the value for the missing attribute ”brand”? Answer the name of the brand. ### Response: BlackBerry
Phone – 2nd Example ### Instruction: Record: [Product Name: ”OtterBox Apple iPhone 4 & 4S Protective ION Defender Series Case (Retail Packaging) Black”] Based on the provided cellphone record, what would you infer is the value for the missing attribute ”brand”? Answer the name of the brand. ### Response: OtterBox
Phone – 3rd Example ### Instruction: Record: [Product Name: ”DTECH @ 2 PECES! Universal Ring Grip/Stand Holder for any Smart Device,Universal Black Bunker Ring Stand Holder for Apple iPhone 4 4s iphone 5 Samsung Galaxy s3 SIII Samsung GALAXY S6,S6 EDGE.Note II iPad 2 3 ipad mini iPod Nokia LG HTC One X etc,RING Essentials ” Cell Phone and Tablets Anti Drop Ring for iPhone 6 plus iPad mini iPad2 iPad iPod Samsung GALAXY NOTE S5 Universal Mobile Devices”] Based on the provided cellphone record, what would you infer is the value for the missing attribute ”brand”? Answer the name of the brand. ### Response: DTECH

D.3 Schema Matching

The few-shot examples for the CMS dataset are given as follows.

CMS – 1st Example ### Instruction: Attribute A is [name: ”condition_occurrence-condition_source_value”, description: ”the source code for the condition as it appears in the source data. this code is mapped to a standard condition concept in the standardized vocabularies and the original code is stored here for reference.”] Attribute B is [name: ”inpatientclaims-admtng_icd9_dgns_cd”, description: ”claim admitting diagnosis code”] Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No] ### Response: Yes
CMS – 2nd Example ### Instruction: Attribute A is [name: ”provider-npi”, description: ”the national provider identifier (npi) of the provider.”] Attribute B is [name: ”outpatientclaims-op_physn_npi”, description: ”operating physician – national provider identifier number”] Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No] ### Response: Yes
CMS – 3rd Example ### Instruction: Attribute A is [name: ”visit_detail-visit_detail_start_datetime”, description: ”the date and time of the visit started.”] Attribute B is [name: ”outpatientclaims-desynpuf_id”, description: ”beneficiary code”] Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No] ### Response: No

D.4 Entity Matching

The few-shot examples for the Abt-Buy and Walmart-Amazon datasets are given as follows.

Abt-Buy – 1st Example ### Instruction: Product A: [name: ”samsung s3 black multimedia player yps3jab”, description: ”samsung s3 black multimedia player yps3jab 4 gb internal flash memory 1.8 ’ tft lcd display touch-sensitive led controls multi-formats support dnse 2.0 sound engine fm tuner and recorder with presets up to 25 hours audio playback up to 4 hours video playback black finish”] Product B: [name: ”samsung 4gb portable mltimdia plyr blk yps-s3jab / xaa”, description: ”nan”] Are Product A and Product B the same Product? Choose your answer from: [Yes, No] ### Response: Yes
Abt-Buy – 2nd Example ### Instruction: Product A: [name: ”sony white 8 ’ portable dvd player dvpfx820w”, description: ”sony dvp-fx820 white 8 ’ portable dvd player dvpfx820w swivel & flip screen with dual sensor for remote control control buttons on screen bezel 12 bit video dac with 108 mhz processing removable , rechargeable battery & car adapter included white finish”] Product B: [name: ”toshiba sd-p71s portable dvd player”, description: ”toshiba sd-p71s 7 ’ portable dvd player”] Are Product A and Product B the same Product? Choose your answer from: [Yes, No] ### Response: No
Abt-Buy – 3rd Example ### Instruction: Product A: [name: ”sony xplod 10-disc add-on cd/mp3 changer cdx565mxrf”, description: ”sony xplod 10-disc add-on cd/mp3 changer cdx565mxrf cd/cd-r/cd-rw and mp3 playback mp3 decoding d-bass 12-second advanced electronic shock protection fm modulator 9 modulation frequencies wireless remote”] Product B: [name: ”sony cdx-565mxrf 10-disc cd/mp3 changer”, description: ”nan”] Are Product A and Product B the same Product? Choose your answer from: [Yes, No] ### Response: Yes
Walmart-Amazon – 1st Example ### Instruction: Product A: [name: ”d-link dgs-1005g 5-port gigabit desktop switch”, modelno: ”dgs1005g”] Product B: [name: ”d-link dgs-1005g 5-port gigabit desktop switch”, modelno: ”dgs-1005g”] Are Product A and Product B the same Product? Choose your answer from: [Yes, No] ### Response: Yes
Walmart-Amazon – 2nd Example ### Instruction: Product A: [name: ”nzxt phantom crafted series atx full tower steel chassis black”, modelno: ”nzxt phantom”] Product B: [name: ”nzxt crafted series atx full tower steel chassis - phantom white”, modelno: ”phantom white”] Are Product A and Product B the same Product? Choose your answer from: [Yes, No] ### Response: No
Walmart-Amazon – 3rd Example ### Instruction: Product A: [name: ”at t prepaid gophone samsung a187 with bluetooth blue”, modelno: ”a187”] Product B: [name: ”samsung a107 prepaid gophone at t”, modelno: ”a107”] Are Product A and Product B the same Product? Choose your answer from: [Yes, No] ### Response: No