Content from API Introduction


Last updated on 2025-10-14

Estimated time: 15 minutes

Overview

Questions

  • What is an API?
  • What is a 401 status code?

Objectives

  • Understand what an API is and how it works
  • Understand what HTTP requests are

Introduction


HTML websites are a widespread means of sharing information on the internet. It is unsurprising, then, that scraping websites is a common practice in research to obtain information from the web in an automated way.

However, scraping websites has many downsides; it would be much easier if a computer program could instead communicate with a data provider directly, requesting exactly the information that is needed for the research purpose. This is what Application Programming Interfaces (APIs) accomplish.

Nowadays, many organizations have drastically restricted access to their public APIs. The open-source community remains strong, however, and there are plenty of good examples of open public APIs, such as the scholarly database OpenAlex.

Its API can be reached at:

https://api.openalex.org/

Exploring an API


When you open this link in your browser, you won’t see much at first. That’s because we haven’t actually specified which data we would like to retrieve. Luckily, that page links to the API documentation, a crucial source of information when communicating with an API.

Reading through the documentation, you will find many so-called endpoints, URLs that represent resources such as publications:

https://api.openalex.org/works

You can copy-paste this link into a browser. Alternatively, you can use curl on the command line. In the next chapter, we will also see how to use Python and R to obtain data from the API.

Here is a subset of the data displayed when opening the works URL in a browser. OpenAlex returns data as JSON, a common data format for these kinds of APIs. We won’t cover the specifics of the format here, but a quick web search can answer most questions.

JSON

{
  "id":"https://openalex.org/W1775749144",
  "doi":"https://doi.org/10.1016/s0021-9258(19)52451-6",
  "title":"PROTEIN MEASUREMENT WITH THE FOLIN PHENOL REAGENT",
  "display_name":"PROTEIN MEASUREMENT WITH THE FOLIN PHENOL REAGENT",
  "publication_year":1951,
  "publication_date":"1951-11-01"
}
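To give a feel for how JSON maps onto familiar data structures, here is a minimal sketch in Python that parses a trimmed version of the record above. The `json` module turns JSON text into a plain dictionary:

```python
import json

# The record shown above, trimmed to a few fields
raw = '''{
  "id": "https://openalex.org/W1775749144",
  "publication_year": 1951,
  "publication_date": "1951-11-01"
}'''

record = json.loads(raw)  # JSON text -> Python dict
print(record["publication_year"])  # -> 1951
```

Once parsed, fields are accessed by key just like any other dictionary.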

When looking at the meta.count field, you’ll notice that the total number of publications available via works is 270,765,445, which is quite large. For the purpose of this workshop, we would like to reduce this number by applying filters. This is what most APIs are designed to support, and it is very useful when you only need a specific subset of the data.

Let’s say we are only interested in publications from 2024 written by at least one author affiliated with the University of Amsterdam. To filter by institution, OpenAlex uses the ROR identifier.

We can modify the URL like this:

https://api.openalex.org/works?filter=institutions.ror:04dkp9463,publication_year:2024

Take a look at the API documentation to explore other filter parameters.
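Jumping ahead slightly to the programming chapters, such a filtered URL can also be assembled programmatically rather than typed by hand. A minimal sketch using Python’s standard library (`urllib.parse`):

```python
from urllib.parse import urlencode

base = "https://api.openalex.org/works"
params = {"filter": "institutions.ror:04dkp9463,publication_year:2024"}

# safe=":," keeps the filter's separators readable, matching the URL above
url = f"{base}?{urlencode(params, safe=':,')}"
print(url)
```

Building the query string from a dictionary avoids typos and makes it easy to swap in other filter values.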

HTTP status codes


We’ve probably all encountered the famous 404 error message when visiting a page that does not exist. APIs usually provide more informative error messages.

For example, if we insert a typo into one of our parameters, say replacing publication_year with pppublication_year, we will get a message like this:

Invalid query parameters error

This error tells us what to look for.

When making HTTP requests from Python or R, we will see that handling error messages is necessary for your code to run without interruption. Below is a list of the most common status codes and their meanings.

Code  Name                   Meaning
200   OK                     The request has succeeded.
204   No Content             The server has completed the request, but doesn’t need to return any data.
400   Bad Request            The request is badly formatted.
401   Unauthorized           The request requires authentication.
404   Not Found              The requested resource could not be found.
408   Request Timeout        The server gave up waiting for the client.
500   Internal Server Error  An error occurred in the server.
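The names in this table are standard reason phrases, so they are also available programmatically. A small sketch using Python’s built-in http.HTTPStatus, which maps a numeric code to its phrase:

```python
from http import HTTPStatus

def describe(code):
    # Map a numeric status code to its standard reason phrase
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code} Unknown"
    return f"{code} {status.phrase}"

for code in (200, 204, 400, 401, 404, 408, 500):
    print(describe(code))
```

A helper like this is handy when logging responses, since a bare number such as 401 is less readable than "401 Unauthorized".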

Content from Getting data



Estimated time: 20 minutes

Overview

Questions

  • How can I read the dataset into Python or R?
  • How can I access data via an API using Python or R?

Objectives

  • Read a dataset from file or fetch data via API

Before looking at the LiteLLM API, we want to make sure we have a dataset to work with. We will be using publication data from OpenAlex. You can either load a prepared dataset (see the Setup instructions) or use the API programmatically and obtain the data yourself.

Callout

You do this part on your own. Follow the instructions for your language.

Method 1: read data from file
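As a minimal sketch in Python, the prepared dataset could be read with the standard library’s csv module. The filename works.csv is an assumption here; use whatever filename the Setup instructions gave you:

```python
import csv
from pathlib import Path

def read_works(path="works.csv"):
    # "works.csv" is a hypothetical filename from the setup instructions
    p = Path(path)
    if not p.exists():
        print(f"{path} not found - check the Setup instructions")
        return []
    with p.open(newline="", encoding="utf-8") as f:
        # Each row becomes a dict keyed by the CSV header
        return list(csv.DictReader(f))

works = read_works()
print(f"Loaded {len(works)} records")
```

In practice you may prefer a dataframe library such as pandas; the idea is the same.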


Method 2: get data via API


Callout

For this part, we will need an HTTP library for either Python or R. We will use the most common libraries, but if you are already familiar with another one, feel free to use it.

GET vs. POST requests

The two most relevant HTTP request types for our purposes are GET and POST.

In simple terms, GET requests retrieve data from the server without sending any data in the request body, while POST requests send data and receive data in response.

When interacting with a language model, we are most likely going to send input data (prompts) to the server, which means we will be making POST requests. To retrieve data from OpenAlex, we will be using GET.
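A GET request to OpenAlex can be sketched with the requests library as below. The per-page parameter is an OpenAlex paging option; the actual network call is commented out so the sketch runs anywhere:

```python
import requests

BASE = "https://api.openalex.org/works"
params = {
    "filter": "institutions.ror:04dkp9463,publication_year:2024",
    "per-page": 25,  # OpenAlex paging parameter (assumed default batch size)
}

# Prepare the request without sending it, to inspect the final URL
prepared = requests.Request("GET", BASE, params=params).prepare()
print(prepared.url)

# Uncomment to actually send the request (requires network access):
# response = requests.get(BASE, params=params, timeout=30)
# response.raise_for_status()       # raises on 4xx/5xx status codes
# data = response.json()
# print(data["meta"]["count"])
```

Passing a params dictionary lets requests handle the URL encoding for us.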

Content from LiteLLM API



Estimated time: 25 minutes

Overview

Questions

  • What can I use the API for?

Objectives

  • Understand the API endpoints
  • Understand how requests are made
Callout

In this episode, we will be working with the LiteLLM proxy server. This infrastructure setup is specific to the University of Amsterdam. However, if your institution uses LiteLLM as well, you should be able to follow these instructions, too.

Overview


The API can be reached at https://ai-research-proxy.azurewebsites.net/

When you open the link in a browser, you will see the so-called Swagger UI, a user interface that lists all of the API’s endpoints and lets you test them, as long as you have an API key.

You will notice that the list is quite long; however, most of these endpoints are not relevant for this workshop, nor for most research purposes.

Callout

For this part of the workshop, you will need a working API key.

Using Swagger UI


To explore and test the available API endpoints, the Swagger UI is an ideal starting point. Before you can execute requests, you need to authorize (top-right button) by entering your API key.

Let’s have a look at a simple GET request using /models. We don’t need to pass any data; simply press Execute. The status code of our request should be 200, and the response body should contain a list of available models like this:

JSON

{
  "data": [
    {
      "id": "gpt-4",
      "object": "model",
      "created": 1677610602,
      "owned_by": "openai"
    },
    {
      "id": "text-embedding-ada-002",
      "object": "model",
      "created": 1677610602,
      "owned_by": "openai"
    },
    {
      "id": "gpt-4.1",
      "object": "model",
      "created": 1677610602,
      "owned_by": "openai"
    },
    {
      "id": "Llama-3.3-70B-Instruct",
      "object": "model",
      "created": 1677610602,
      "owned_by": "openai"
    }
  ]
}

This information comes in handy once we start sending input text, as we need to specify the exact model to use.
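To pull just the model names out of such a response, a short sketch parsing a trimmed version of the JSON above:

```python
import json

# The /models response shown above, trimmed to two entries
raw = '''{"data": [
  {"id": "gpt-4", "object": "model"},
  {"id": "text-embedding-ada-002", "object": "model"}
]}'''

# Collect the "id" field of every entry under "data"
models = [entry["id"] for entry in json.loads(raw)["data"]]
print(models)  # -> ['gpt-4', 'text-embedding-ada-002']
```

The same list comprehension works on the full response returned by the live endpoint.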

Using Python and R


In the previous chapter, we installed and used the requests (Python) and httr2 (R) libraries. We will now use those again to make requests to the LiteLLM API.

Since we now need to authorize ourselves with an API key, we must send additional information to the server: so-called headers. In essence, we specify that we are sending JSON data and would like to receive JSON data in return. We also send the API key with each request so that the server can authorize us.
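Putting this together, a POST request to the proxy could look like the sketch below. The /chat/completions route and Bearer-token header are standard for LiteLLM proxies; the key is a placeholder, and the actual send is commented out so the sketch runs without credentials:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder - replace with your own key
BASE = "https://ai-research-proxy.azurewebsites.net"

# Headers: announce JSON in both directions and authenticate with the key
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

# A minimal chat payload; the model id must match one from /models
payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
}

# Uncomment to send (requires a valid key and network access):
# response = requests.post(f"{BASE}/chat/completions",
#                          headers=headers, json=payload, timeout=60)
# response.raise_for_status()
# print(response.json()["choices"][0]["message"]["content"])
print("Request prepared for model:", payload["model"])
```

Note that requests’ json= argument serializes the payload and sets the Content-Type header for us, but spelling the headers out makes the mechanics explicit.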

Content from Examples and exercises



Estimated time: 65 minutes

Overview

Questions

  • How can I apply this knowledge?

Objectives

  • Explore, experiment, showcase

We’ve covered the basics of what an API is and how to interact with one, and we have looked at two API specifications. It is now time to put this knowledge into practice. Below you’ll find links to language-specific scripts with GenAI use cases, as well as additional links for further reading.

Discussion

In pairs or groups of three, use the available data to come up with a generative AI (research) application. Possible outcomes include tables, graphs, summaries, or small pieces of software (advanced). Ideally, you can share your outcomes with the other workshop participants at the end of the exercise.

Example 1: SDG classification

We read a subset of publications and abstracts, and use structured outputs to retrieve labels for publication titles and abstracts.

Example 2: RAGs

Retrieval-augmented generation (RAG) can be a powerful technique for generating more reliable output. By providing a set of domain-specific documents, relevant passages can be retrieved from those documents and supplied to the LLM before it generates its answer to the user.

To implement RAG, you will need an embedding model in addition to the LLM, e.g. text-embedding-ada-002.
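The retrieval step can be illustrated with a toy sketch. In a real system, the embedding model (e.g. text-embedding-ada-002) would turn each document and the query into high-dimensional vectors; here the vectors are made up and three-dimensional purely for illustration:

```python
import math

# Made-up embeddings for three documents (real ones have hundreds of dimensions)
documents = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.2, 0.8, 0.1],
    "doc3": [0.1, 0.2, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # made-up embedding of the user's question

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank documents by similarity; the top hits would be passed to the LLM
ranked = sorted(documents, key=lambda d: cosine(documents[d], query_vector),
                reverse=True)
print(ranked[0])  # -> doc1
```

The documents most similar to the query are then included in the prompt, grounding the model’s answer in your own material.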