Transforming BioASQ Factoid Questions into SQuAD Format: A Preprocessing Approach

BioASQ is a large-scale question-answering dataset widely used for training and evaluating machine learning models for biomedical question answering. Converting BioASQ to the popular SQuAD (Stanford Question Answering Dataset) format can be challenging due to differences in the structure and representation of the two datasets. This article provides a Python script to preprocess the BioASQ factoid question dataset into SQuAD format.
Overview of the Script
The Python script reads BioASQ data from a JSON file, fetches abstracts and titles from PubMed using PMIDs (PubMed IDs) to construct the “context” field, and structures the data into the SQuAD format.
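Before diving into the code, it helps to see the shapes involved. The sketch below shows a minimal, hypothetical BioASQ factoid entry (the field names match what the script reads; the PMID, id, and offsets are made-up illustrative values) alongside the kind of SQuAD-style record the script produces for it.

# Minimal, hypothetical BioASQ factoid entry (illustrative values only)
bioasq_example = {
    "questions": [{
        "id": "abc123",
        "type": "factoid",
        "body": "Which gene is mutated in cystic fibrosis?",
        "documents": ["http://www.ncbi.nlm.nih.gov/pubmed/12345678"],
        "exact_answer": [["CFTR"]]
    }]
}

# Corresponding SQuAD-style record the script emits
# (the context is built from the fetched PubMed title and abstract)
squad_example = {
    "version": "v1.0",
    "data": [{
        "title": "BioASQ Dataset",
        "paragraphs": [{
            "context": "<article title>. <abstract text>",
            "qas": [{
                "id": "abc123_12345678_0",
                "question": "Which gene is mutated in cystic fibrosis?",
                "answers": [{"text": "CFTR", "answer_start": 42}],  # offset is illustrative
                "is_impossible": False
            }]
        }]
    }]
}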
Detailed Breakdown of the Code
1. Imports and Dependencies:
The script begins by importing the necessary libraries:
import json
import requests
import xml.etree.ElementTree as ET
import time
import os
- json is used for reading and writing JSON files.
- requests is used for making HTTP requests to the PubMed API.
- xml.etree.ElementTree is used for parsing XML responses.
- time is used to add delays between API requests to avoid overwhelming the server.
- os is used for file path manipulations.
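Of these, only requests is a third-party package; the other modules ship with the Python standard library. A minimal, optional guard you could place at the top of the script to fail early when it is missing:

# Fail early with a readable message if the only third-party dependency is missing
try:
    import requests  # noqa: F401
except ImportError:
    raise SystemExit("The 'requests' package is required; install it with pip before running this script.")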
2. Fetching PubMed Abstract and Title:
The function fetch_pubmed_abstract_and_title(pmid) is responsible for fetching the abstract and title of a given PubMed article using its PMID. It constructs the URL for the PubMed API, sends a request, and parses the XML response to extract the title and abstract.
def fetch_pubmed_abstract_and_title(pmid):
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={pmid}&retmode=xml"
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes
        try:
            root = ET.fromstring(response.content)
            abstract_text = root.find(".//AbstractText")
            article_title = root.find(".//ArticleTitle")
            if abstract_text is not None and article_title is not None:
                context = f"{article_title.text}. {abstract_text.text}"
                return context
            else:
                print(f"No abstract or title found for PMID {pmid}")
                return ""
        except ET.ParseError as e:
            print(f"XML parsing error for PMID {pmid}: {str(e)}")
            print(f"Response content: {response.content[:500]}...")  # Print first 500 characters
            return ""
    except requests.RequestException as e:
        print(f"Request error for PMID {pmid}: {str(e)}")
        return ""
- The function constructs the PubMed API URL using the provided pmid.
- It sends an HTTP GET request to the PubMed API and checks for errors.
- If the request is successful, it parses the XML response to extract the abstract and title.
- If both the abstract and title are found, it concatenates them to form the context and returns it.
- If any error occurs (request or parsing), it prints an error message and returns an empty string.
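To sanity-check the fetcher in isolation, you can call it with a single PMID before running the full conversion. A minimal sketch (the PMID below is just a placeholder; substitute one from your own BioASQ file):

# Quick check of the fetcher with a single, placeholder PMID
sample_pmid = "12345678"  # replace with a real PMID from your dataset
context = fetch_pubmed_abstract_and_title(sample_pmid)
if context:
    print(f"Fetched {len(context)} characters of context")
    print(context[:200])  # preview the first 200 characters
else:
    print("No context returned; check the PMID or the error messages above")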
3. Preprocessing BioASQ to SQuAD:
The preprocess_bioasq_to_squad(input_json) function reads the input JSON, processes each question, fetches the corresponding abstracts, and converts the data into SQuAD format. It includes logic to handle factoid-type questions, extract PMIDs, and structure the data appropriately.
def preprocess_bioasq_to_squad(input_json):
    # Parse the input JSON
    data = json.loads(input_json)

    # Initialize the SQuAD format structure
    squad_format = {
        "version": "v1.0",
        "data": [{
            "title": "BioASQ Dataset",
            "paragraphs": []
        }]
    }

    # Process each question in the dataset
    for question in data['questions']:
        # Only process factoid type questions
        if question['type'] != 'factoid':
            continue

        # Process each document URL
        for doc_url in question['documents']:
            # Extract PMID from the URL
            pmid = doc_url.split('/')[-1]

            # Fetch the abstract and title from PubMed
            context = fetch_pubmed_abstract_and_title(pmid)
            if not context:
                print(f"Skipping PMID {pmid} due to empty abstract or title")
                continue  # Skip if no abstract or title found

            # Process answers
            exact_answers = question['exact_answer']
            if not isinstance(exact_answers, list):
                exact_answers = [exact_answers]

            # Flatten nested lists
            flat_answers = []
            for answer in exact_answers:
                if isinstance(answer, list):
                    flat_answers.extend(answer)
                else:
                    flat_answers.append(answer)

            # Create a separate QA pair for each answer
            for i, answer in enumerate(flat_answers):
                paragraph = {
                    "context": context,
                    "qas": [{
                        "id": f"{question['id']}_{pmid}_{i}",
                        "question": question['body'],
                        "answers": [],
                        "is_impossible": True
                    }]
                }

                # Find the start position of the answer in the context
                answer_start = context.lower().find(answer.lower())
                if answer_start != -1:
                    paragraph['qas'][0]['answers'].append({
                        "text": answer,
                        "answer_start": answer_start
                    })
                    paragraph['qas'][0]['is_impossible'] = False

                # Add the paragraph to the main dataset
                squad_format['data'][0]['paragraphs'].append(paragraph)

            # Add a small delay to avoid overwhelming the PubMed server
            time.sleep(0.5)

    return json.dumps(squad_format, indent=2)
- The function reads the BioASQ JSON data.
- It initializes a dictionary to hold the SQuAD format data.
- For each question in the dataset, it checks whether the question is of type factoid.
- For each document URL associated with the question, it extracts the PMID and fetches the abstract and title from PubMed.
- It processes the exact answers and flattens any nested lists of answers.
- For each answer, it creates a QA pair, finds the answer’s start position in the context, and updates the paragraph structure.
- It adds a small delay between requests to avoid overwhelming the PubMed server.
- The function returns the processed data in SQuAD format as a JSON string.
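Because SQuAD-style training relies on answer_start pointing at the exact answer span, it is worth verifying the offsets after conversion. A small sketch, assuming squad_json holds the string returned by preprocess_bioasq_to_squad:

# Verify that each recorded answer_start actually points at the answer text
converted = json.loads(squad_json)
for paragraph in converted["data"][0]["paragraphs"]:
    context = paragraph["context"]
    for qa in paragraph["qas"]:
        for ans in qa["answers"]:
            start = ans["answer_start"]
            span = context[start:start + len(ans["text"])]
            # Compare case-insensitively, since the script locates answers with lower()
            if span.lower() != ans["text"].lower():
                print(f"Span mismatch for QA {qa['id']}: expected {ans['text']!r}, got {span!r}")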
4. Reading Input and Writing Output:
The script reads the BioASQ input JSON file, processes it using the preprocess_bioasq_to_squad function, and writes the converted data to a new JSON file.
# Read the input JSON from the file
input_file_path = '/content/test_process.json'
with open(input_file_path, 'r') as file:
    input_json = file.read()

# Preprocess the data
squad_json = preprocess_bioasq_to_squad(input_json)

# Determine the output file name
input_file_name = os.path.basename(input_file_path)
output_file_name = os.path.splitext(input_file_name)[0] + '_preprocessed.json'
output_file_path = os.path.join(os.path.dirname(input_file_path), output_file_name)

# Write the preprocessed data to the new file
with open(output_file_path, 'w') as file:
    file.write(squad_json)

print(f"Preprocessing complete. Output saved to '{output_file_path}'")
- The script reads the input JSON file containing the BioASQ data.
- It calls the preprocess_bioasq_to_squad function to convert the data into SQuAD format.
- It determines the output file name based on the input file name.
- It writes the converted data to a new JSON file.
- It prints a message indicating the location of the output file.
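As a quick sanity check after the file is written, you can reload the output and count how many paragraphs ended up with a located answer span (a small sketch, reusing output_file_path from above):

# Reload the output and report simple coverage statistics
with open(output_file_path, 'r') as f:
    converted = json.load(f)

paragraphs = converted["data"][0]["paragraphs"]
answerable = sum(1 for p in paragraphs if not p["qas"][0]["is_impossible"])
print(f"{len(paragraphs)} paragraphs total, {answerable} with a located answer span")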
Customization
To customize this script for your needs, you can:
- Modify the URL and request parameters in the fetch_pubmed_abstract_and_title function if you need to fetch additional data or use a different API.
- Adjust the way questions and answers are processed in the preprocess_bioasq_to_squad function to handle different types of questions or additional answer formats.
- Change the file paths and input/output handling to suit your environment or workflow.
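As one example of the first customization point, the efetch URL can carry extra query parameters, such as an NCBI API key. The sketch below is a hypothetical helper for building that URL; the key value is a placeholder, and you should consult the E-utilities documentation for the parameters your account supports.

# Hypothetical helper: build the efetch URL, optionally appending an NCBI API key
NCBI_API_KEY = "your-api-key-here"  # placeholder value

def build_efetch_url(pmid, api_key=None):
    """Return the efetch URL for a PMID, optionally including an API key."""
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={pmid}&retmode=xml"
    if api_key:
        url += f"&api_key={api_key}"
    return url

# Example: url = build_efetch_url("12345678", api_key=NCBI_API_KEY)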
Conclusion
This article provides a detailed walk-through of a Python script that transforms the BioASQ factoid dataset into SQuAD format.
Resources:
Google Colab Link: https://colab.research.google.com/drive/1ehhfsvRkAS_SKZbPbKloV3ndEXWkmVI8?usp=sharing