Francesco Montanari

The INSPIRE search engine provides an API for automated searches of physics bibliographic references, so information can be retrieved through a programmatic query interface. Several applications use this API. For instance, some look up all the \cite{...} references in a file and output the corresponding \bibitem{...} entries. INSPIRE itself provides an online reference extractor and a bibliography generator.

Pyinspire (modified BSD license) retrieves results from the INSPIRE HEP database from the command line. A complete list of options is available through pyinspire.py --help. Let's consider the following query from the command line:

$ pyinspire.py -b -s "find a Maldacena and date 1997 and topcite 1000+"
INFO:pyinspire:Search of INSPIRE started...

@article{Maldacena:1997re,
      author         = "Maldacena, Juan Martin",
      title          = "{The Large N limit of superconformal field theories and
                        supergravity}",
      journal        = "Int. J. Theor. Phys.",
      volume         = "38",
      year           = "1999",
      pages          = "1113-1133",
      doi            = "10.1023/A:1026654312961",
      note           = "[Adv. Theor. Math. Phys.2,231(1998)]",
      eprint         = "hep-th/9711200",
      archivePrefix  = "arXiv",
      primaryClass   = "hep-th",
      reportNumber   = "HUTP-97-A097, HUTP-98-A097",
      SLACcitation   = "%%CITATION = HEP-TH/9711200;%%"
}

We instructed pyinspire to download the bibtex references (flag -b) resulting from the query passed as a string after the -s flag. The string can use any syntax accepted by the INSPIRE query format. In this case we looked for references by author Maldacena, dated 1997 and cited at least 1000 times (topcite 1000+). This matched one entry in the INSPIRE database, and the corresponding bibtex entry was printed on screen. Options to return formats more manageable by databases, such as JSON, are also available.

While pyinspire is meant to be used as a command, its main function can easily be imported in Python scripts, which makes it possible to handle particular cases programmatically. Although the interface is very simple, it retrieves bibliographic references automatically and with good flexibility, thanks to the feature-rich INSPIRE API query format.

Let's suppose that a bibtex reference file has to be created starting from inhomogeneous reference lists. For example, one reference list may be a .bib bibtex file, and another may be a list of \bibitem entries. The two lists may have overlapping references, but their citation labels differ and do not correspond to those in the INSPIRE database. However, let's say that both lists report the arXiv number of each entry. Then we can parse the files to look up all the arXiv numbers and send a query to INSPIRE based on them. The query will return a consistent .bib bibtex file.

First, we have to match all arXiv references. This is easily done using regular expressions (regexps). The following function reads a string (that will be the content of a file) and returns a list of the arXiv numbers found, each still carrying its prefix. It matches patterns such as arXiv:1234.5678 or arXiv:gr-qc/1234567 ignoring the case, since the prefix may appear in any case combination (e.g., arxiv, arXiv, ARXIV).

import re

PREFIX = 'arxiv:'

def get_arxiv_ids(string):
    """Read string and return a list containing arxiv numbers. Match
    patterns such as `arXiv:1234.5678` or `arXiv:gr-qc/1234567`.

    """
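    # Require at least one ID character, so that a bare "arxiv:" with
    # nothing after it does not produce an empty query later on.
    regexp = r'[A-Za-z0-9.\-\/]+'
    return re.findall(PREFIX+regexp, string, re.IGNORECASE)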

(The variable PREFIX does not need to be global and would be better passed as an argument of the function. Here it is declared as a global variable only to keep this simple example free of an extra function argument.)
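As a sketch of the alternative just mentioned, the prefix can be passed as a keyword argument. The name find_arxiv_ids and the trailing-dot clean-up are assumptions added for illustration, not part of the original script. re.escape() is used so that a prefix containing regexp metacharacters is still matched literally.

```python
import re

def find_arxiv_ids(string, prefix='arxiv:'):
    """Return the arXiv matches found in string, prefix included.

    Hypothetical variant of get_arxiv_ids() that takes the prefix as
    an argument instead of reading a global variable.
    """
    pattern = re.escape(prefix) + r'[A-Za-z0-9.\-/]+'
    matches = re.findall(pattern, string, re.IGNORECASE)
    # A sentence-ending period is captured by the character class,
    # so strip trailing dots from each match.
    return [m.rstrip('.') for m in matches]

# Made-up input mixing a \bibitem line and a .bib field.
sample = r"""
\bibitem{Mal97} J. Maldacena, arXiv:hep-th/9711200.
eprint = "ARXIV:1234.5678",
"""
print(find_arxiv_ids(sample))
# ['arXiv:hep-th/9711200', 'ARXIV:1234.5678']
```

Without the rstrip() step, the first match would keep the period that ends the \bibitem line, since dots are allowed inside arXiv numbers.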

Then, we define a function that prints to stdout the INSPIRE query result based on the arXiv numbers listed in a file.

from pyinspire.pyinspire import get_text_from_inspire

def get_bibtex(myfile):
    """Print to stdout the inspire query result based on arxiv numbers
    listed in myfile.

    """
    with open(myfile, 'r') as f:
        string = f.read()

    arxiv_ids = set(get_arxiv_ids(string))

    resultformat = 'bibtex'
    tags = None

    for arxiv in sorted(arxiv_ids):
        result = get_text_from_inspire(search=arxiv[len(PREFIX):],
                                       resultformat=resultformat,
                                       ot=tags)
        print(result)

The function above starts by reading the given file. It calls get_arxiv_ids() to retrieve the arXiv numbers listed anywhere in the input file. The list is converted to a set not because we need set operations, but simply as an easy way to remove duplicates. Then, for each arXiv number in alphabetical order, it calls get_text_from_inspire() to send a query to INSPIRE. This function is provided by the pyinspire package and is simple to use, as it takes just three arguments:

search: search string to use in the query. In our case it coincides with the arXiv number. Note that we remove PREFIX (i.e., 'arxiv:') from the query, as it may be problematic with older arXiv number formats. To do that we slice the string instead of using Python's replace() method, since the prefix may appear in any case combination (arxiv, arXiv, ARXIV, ...).
resultformat: string containing the name of the format ('brief', 'bibtex', 'latexEU', 'latexUS'). In our case it is the bibtex format.
ot: tags to be included in MarcXML or JSON output. We don't need it.
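The reason for slicing rather than calling replace() can be checked on a small example. The sketch below also strips the prefix before building the set, since the same paper cited once as arXiv:... and once as ARXIV:... would otherwise survive as two distinct set elements. The helper name strip_prefix and the sample matches are hypothetical.

```python
PREFIX = 'arxiv:'

def strip_prefix(match, prefix=PREFIX):
    """Drop the leading prefix by position, whatever its case."""
    return match[len(prefix):]

matches = ['arXiv:hep-th/9711200', 'ARXIV:hep-th/9711200', 'arxiv:1234.5678']

# str.replace() is case sensitive, so only the lowercase spelling is removed.
print([m.replace(PREFIX, '') for m in matches])
# ['arXiv:hep-th/9711200', 'ARXIV:hep-th/9711200', '1234.5678']

# Slicing by length removes the prefix in every case combination, and
# doing it before building the set merges the two spellings of the
# same reference.
print(sorted({strip_prefix(m) for m in matches}))
# ['1234.5678', 'hep-th/9711200']
```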

The function can be called as get_bibtex('refs.tex') from Python, where refs.tex is a text file containing all the arXiv numbers (it does not need to be a TeX file or any other specific format; it just has to contain the arXiv numbers to be parsed). If the input file contains Unicode characters, using Python 3 is recommended over Python 2.
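Even on Python 3, the default text encoding of open() depends on the platform locale, so a file with non-ASCII author names may still fail to decode. A minimal sketch, assuming the reference file is UTF-8 encoded (an assumption, not something the script above checks); the function name read_text is hypothetical.

```python
import io

def read_text(path, encoding='utf-8'):
    """Return the content of path decoded with an explicit encoding."""
    # io.open() accepts the encoding argument on both Python 2 and 3.
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()
```

The returned string can be passed directly to get_arxiv_ids().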

The INSPIRE API documentation does not seem to mention limits on the number of requests allowed per time interval. Of course, querying large reference lists simultaneously in a short time should be avoided. While concurrent downloads would significantly improve the speed of the script above, we used a sequential implementation to avoid inadvertently launching a DoS (Denial of Service) attack.
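A fixed pause between requests is one simple way to keep a sequential loop polite. The sketch below is not part of pyinspire: fetch_all, the fetch callable, and the one-second default are hypothetical choices, and INSPIRE may recommend a different rate.

```python
import time

def fetch_all(ids, fetch, delay=1.0):
    """Call fetch(i) for each id in order, sleeping delay seconds
    between consecutive calls, and return the list of results.
    """
    results = []
    for n, i in enumerate(ids):
        if n > 0:
            time.sleep(delay)  # pause only between requests
        results.append(fetch(i))
    return results
```

In the script above, fetch could be a small wrapper around get_text_from_inspire() with resultformat='bibtex'.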

Retrieve bibtex entries with Python