OAI-PMH for Resource Harvesting

Herbert Van de Sompel
Digital Library Research & Prototyping Team
Research Library, Los Alamos National Laboratory

Michael Nelson
Computer Science Department
Old Dominion University

OAI-PMH for Resource Harvesting Tutorial, OAI4, October 20th 2005, CERN, Geneva, Switzerland

Tutorial Outline
- OAI-PMH for Resource Harvesting: problem statement and conceptual solution
- MPEG-21 DIDL: an XML-based Complex Object Format for OAI-PMH-based Resource Harvesting
- Accurately mirroring the collection of the American Physical Society using OAI-PMH-based Resource Harvesting
- mod_oai: an OAI-PMH-based model for Web Resource Harvesting
- OAIResource: a software tool for OAI-PMH-based Resource Harvesting

Resource Harvesting: Use cases
- Discovery: use the content itself in the creation of services
  - search engines that make full text searchable
  - citation indexing systems that extract references from the full-text content
  - browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections
- Preservation:
  - periodically transfer digital content from a data repository to one or more trusted digital repositories
  - trusted digital repositories need a mechanism to automatically synchronize with the originating data repository

Resource Harvesting: Use cases
- Discovery:
  - Institutional Repository & Digital Library Projects: UK JISC, DARE, DINI
  - Web search engines: competition for content (cf. Google Scholar)
- Preservation:
  - Institutional Repository & Digital Library Projects: UK JISC, DARE, DINI
  - Library of Congress: NDIIPP Archive Export/Ingest, e-deposit
OAI-PMH is well established. Can OAI-PMH be used for Resource Harvesting?

Existing OAI-PMH-based approaches
Typical scenario:
1. An OAI-PMH harvester harvests Dublin Core records from the OAI-PMH repository.
2. The harvester analyzes each Dublin Core record, extracting dc.identifier information in order to determine the network location of the described resource.
3. A separate process, out-of-band from the OAI-PMH, collects the described resource from its network location.

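Step 2 of this scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record, its URL and its DOI are hypothetical placeholders, and the helper names `extract_identifiers` and `candidate_locations` are invented for this sketch. The namespace URIs are the standard ones for `oai_dc` and Dublin Core.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record: the URL and DOI are placeholders, not real data.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example record</dc:title>
  <dc:identifier>http://repository.example.org/archive/00000014/</dc:identifier>
  <dc:identifier>doi:10.9999/example</dc:identifier>
</oai_dc:dc>
"""


def extract_identifiers(record_xml):
    """Return the text of every dc:identifier element, in document order."""
    root = ET.fromstring(record_xml)
    tag = "{%s}identifier" % DC_NS
    return [el.text.strip() for el in root.iter(tag) if el.text and el.text.strip()]


def candidate_locations(identifiers):
    """Keep only the identifiers that look like HTTP(S) URLs."""
    return [i for i in identifiers if i.lower().startswith(("http://", "https://"))]


identifiers = extract_identifiers(SAMPLE_RECORD)
locations = candidate_locations(identifiers)
print(identifiers)
print(locations)
```

Note that the filter can only say that a value looks like a URL. It cannot tell whether that URL points at the resource itself or at a splash page.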
Existing OAI-PMH-based approaches: Issue 1
Locating the resource based on information provided in dc.identifier:
- dc.identifier is used to convey a variety of identifiers (simultaneously): URL, DOI, bibliographic citation, ...
  - Not expressive enough to distinguish between identifier and locator
  - Several dereferencing attempts required
- The URI provided in dc.identifier is commonly that of a bibliographic splash page
  - How to know it is a bibliographic splash page, not the resource?
  - If it is a bibliographic splash page, where is the resource?

Existing OAI-PMH-based approaches: Issue 2
Using the OAI-PMH datestamp of the Dublin Core record to trigger incremental harvesting:
the datestamp of the DC record does not necessarily change when the resource changes.
- DC record datestamp unchanged, no resource update (no metadata update): OK
- DC record datestamp unchanged, resource update: missed resource update
- DC record datestamp changed, no resource update (metadata update): unnecessary resource download
- DC record datestamp changed, resource update: OK

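The four combinations on this slide can be written down as a small decision function. This is a minimal sketch; the function name `harvest_outcome` and its boolean parameters are invented for illustration.

```python
def harvest_outcome(datestamp_changed, resource_changed):
    """Classify what a datestamp-driven harvester experiences.

    The harvester re-downloads the resource only when the datestamp of the
    Dublin Core record has changed.
    """
    if datestamp_changed and not resource_changed:
        # Only the metadata changed, yet the resource is fetched again.
        return "unnecessary resource download"
    if resource_changed and not datestamp_changed:
        # The resource changed, but nothing tells the harvester about it.
        return "missed resource update"
    return "OK"


for datestamp_changed in (False, True):
    for resource_changed in (False, True):
        print(datestamp_changed, resource_changed,
              harvest_outcome(datestamp_changed, resource_changed))
```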
Existing OAI-PMH-based approaches: Conventions
Conventions address Issue 1; Issue 2 cannot really be addressed.
- The first dc.identifier is the locator of the resource
  - What if the resource is not digital?
- Use of dc.format and/or dc.relation to convey the locator

Existing OAI-PMH-based approaches: Conventions
<oai_dc:dc>
  <dc:title>A Simple Parallel-Plate Resonator Technique for Microwave Characterization of Thin Resistive Films</dc:title>
  <dc:creator>Vorobiev, A.</dc:creator>
  <dc:subject>ing-inf/01 Elettronica</dc:subject>
  <dc:description>A parallel-plate resonator method is proposed for non-destructive characterisation of resistive films used in microwave integrated circuits. A slot made in one... </dc:description>
  <dc:publisher>Microwave Engineering Europe</dc:publisher>
  <dc:date>2002</dc:date>
  <dc:type>Documento relativo ad una Conferenza o altro Evento</dc:type>
  <dc:type>peerreviewed</dc:type>
  <dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
  <dc:format>pdf http://amsacta.cib.unibo.it/archive/00000014/01/gaas_1_vorobiev.pdf</dc:format>
</oai_dc:dc>
Callouts: dc:identifier is the splash page; dc:format carries the locator of the resource.

Existing OAI-PMH-based approaches: Conventions
<dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
<dc:relation>http://amsacta.cib.unibo.it/archive/00000014/01/gaas_1_vorobiev.pdf</dc:relation>
Callouts: dc:identifier is the splash page; dc:relation carries the locator of the resource.

Existing OAI-PMH-based approaches: Conventions
<dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
<dc:relation>http://resolver.unibo.it/00000014/</dc:relation>
<dc:relation>http://amsacta.cib.unibo.it/archive/00000014/01/gaas_1_vorobiev.pdf</dc:relation>
Callouts: dc:identifier is a splash page; the first dc:relation is also a splash page; the second dc:relation carries the locator of the resource.

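A harvester that relies on these conventions ends up with heuristics along the following lines. This is a sketch under assumptions, not a recommended implementation: the record and its URLs are hypothetical, and `urls_in` and `guess_locators` are names invented here. It looks for URLs in dc:format and dc:relation first and falls back to the first dc:identifier.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical record modelled on the last convention example; URLs are placeholders.
RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://repository.example.org/archive/00000014/</dc:identifier>
  <dc:relation>http://resolver.example.org/00000014/</dc:relation>
  <dc:relation>http://repository.example.org/archive/00000014/01/paper.pdf</dc:relation>
</oai_dc:dc>
"""


def urls_in(text):
    """Return the whitespace-separated tokens of text that look like URLs."""
    return [t for t in (text or "").split() if t.startswith(("http://", "https://"))]


def guess_locators(record_xml):
    """Collect candidate resource locators using the conventions above."""
    root = ET.fromstring(record_xml)
    candidates = []
    # Convention: dc:format and/or dc:relation may carry the locator.
    for name in ("format", "relation"):
        for el in root.iter(DC + name):
            candidates.extend(urls_in(el.text))
    # Convention: otherwise treat the first dc:identifier as the locator.
    if not candidates:
        first = next(root.iter(DC + "identifier"), None)
        if first is not None:
            candidates.extend(urls_in(first.text))
    return candidates


candidates = guess_locators(RECORD)
print(candidates)
```

For this record the heuristic returns two candidates, and nothing in the record says which of them is the resource. The harvester still has to dereference both to find out.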
Existing OAI-PMH-based approaches: Other attempts
- dc.identifier leads to a splash page, and the splash page contains a special-purpose XHTML link to the resource(s)
  - What if there is no splash page?
  - How does a harvester know it is in this situation?
- OA-X: protocol extension
  - OK in a local context
  - Strategic problem to generalize
  - How to consolidate with the OAI-PMH data model?
- Qualified Dublin Core
  - Could bring the expressiveness needed to distinguish between locator and identifier
  - But what about the datestamp issue?

Proposed OAI-PMH-based approach
Use metadata formats that were specifically created for the representation of digital objects:
- Complex Object Formats as OAI-PMH metadata formats
- MPEG-21 DIDL, METS, ...

OAI-PMH data model
[Diagram: a resource; an item, whose OAI-PMH identifier is the entry point to all records pertaining to the resource; and records holding metadata pertaining to the resource, ranging from simple (Dublin Core) through more expressive (MARCXML) to highly expressive metadata formats.]

Complex Object Formats: characteristics
- Representation of a digital object by means of a wrapper XML document
- Represented resource can be:
  - a simple digital object (consisting of a single datastream)
  - a compound digital object (consisting of multiple datastreams)
- Unambiguous approach to convey identifiers of the digital object and its constituent datastreams
- Include datastreams:
  - By-Value: embedding of the base64-encoded datastream
  - By-Reference: embedding the network location of the datastream
  - not mutually exclusive; equivalent
- Include a variety of secondary information, By-Value or By-Reference:
  - descriptive metadata, rights information, technical metadata, ...

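The By-Value and By-Reference options can be illustrated with a deliberately generic wrapper. The element and attribute names below (`component`, `resource`, `mimetype`, `ref`, `encoding`) are assumptions made for this sketch and do not claim to follow the MPEG-21 DIDL or METS schemas; the URL and the byte content are placeholders.

```python
import base64
import xml.etree.ElementTree as ET


def make_component(mimetype, data=None, ref=None):
    """Wrap one datastream either By-Value (data) or By-Reference (ref)."""
    if (data is None) == (ref is None):
        raise ValueError("provide exactly one of data or ref")
    component = ET.Element("component")
    resource = ET.SubElement(component, "resource", {"mimetype": mimetype})
    if ref is not None:
        # By-Reference: only the network location is embedded.
        resource.set("ref", ref)
    else:
        # By-Value: the datastream itself is embedded, base64-encoded.
        resource.set("encoding", "base64")
        resource.text = base64.b64encode(data).decode("ascii")
    return component


def read_component(component):
    """Return ("by-reference", url) or ("by-value", bytes) for a component."""
    resource = component.find("resource")
    ref = resource.get("ref")
    if ref is not None:
        return ("by-reference", ref)
    return ("by-value", base64.b64decode(resource.text))


by_value = make_component("application/pdf", data=b"%PDF-1.4 placeholder bytes")
by_reference = make_component(
    "application/pdf",
    ref="http://repository.example.org/archive/00000014/01/paper.pdf",
)
print(ET.tostring(by_value, encoding="unicode"))
print(ET.tostring(by_reference, encoding="unicode"))
```

Both forms describe the same datastream, so a consumer can treat them as equivalent and differ only in whether a further download is needed.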
<didl:didl>
  <didl:item>
    <didl:descriptor><didl:statement mimetype="text/xml; charset=utf-8">
      <dii:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dii:identifier>
    </didl:statement></didl:descriptor>
    <didl:descriptor><didl:statement mimetype="text/xml; charset=utf-8">
      <oai_dc:dc>
        <dc:title>A Simple Parallel-Plate Resonator Technique for Microwave Characterization of Thin Resistive Films</dc:title>
        <dc:creator>Vorobiev, A.</dc:creator>
        <dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
        <dc:format>application/pdf</dc:format>
      </oai_dc:dc>
    </didl:statement></didl:descriptor>
    <didl:component>
      <didl:resource mimetype="application/pdf" ref="http://amsacta.cib.unibo.it/archive/00000014/01/gaas_1_vorobiev.pdf"/>
    </didl:component>
  </didl:item>
</didl:didl>

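A parser for a wrapper like this one might look as follows. Several things here are assumptions: the slide omits namespace declarations, so the sketch declares the MPEG-21 DIDL and DII namespace URIs itself; element and attribute names follow the slide's lowercase spelling, which may differ from the capitalization used in actual DIDL documents; the URLs are placeholders; and `parse_item` is a name invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the slide does not show any declarations.
NS = {
    "didl": "urn:mpeg:mpeg21:2002:02-DIDL-NS",
    "dii": "urn:mpeg:mpeg21:2002:01-DII-NS",
}

# Hypothetical document using the slide's element spelling and placeholder URLs.
SAMPLE = """\
<didl:didl xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS"
           xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS">
  <didl:item>
    <didl:descriptor><didl:statement mimetype="text/xml; charset=utf-8">
      <dii:identifier>http://repository.example.org/archive/00000014/</dii:identifier>
    </didl:statement></didl:descriptor>
    <didl:component>
      <didl:resource mimetype="application/pdf"
                     ref="http://repository.example.org/archive/00000014/01/paper.pdf"/>
    </didl:component>
  </didl:item>
</didl:didl>
"""


def parse_item(didl_xml):
    """Return (identifier, [(mimetype, ref), ...]) for the first item."""
    root = ET.fromstring(didl_xml)
    item = root.find("didl:item", NS)
    ident = item.find("didl:descriptor/didl:statement/dii:identifier", NS)
    identifier = ident.text.strip() if ident is not None and ident.text else None
    resources = [
        (res.get("mimetype"), res.get("ref"))
        for res in item.findall("didl:component/didl:resource", NS)
    ]
    return identifier, resources


identifier, resources = parse_item(SAMPLE)
print(identifier)
print(resources)
```

The identifier and the locator come from different, dedicated elements, so the parser does not have to guess which URL is which.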
Complex Object Formats & OAI-PMH
- Resource represented via XML wrapper => OAI-PMH <metadata>
- Uniform solution for simple & compound objects
- Unambiguous expression of the locator of a datastream
- Disambiguation between locators & identifiers
- OAI-PMH datestamp changes whenever the resource (datastreams, secondary information) changes
- OAI-PMH semantics apply: about containers, set membership

OAI-PMH-based approach using a Complex Object Format
Typical scenario:
1. An OAI-PMH harvester checks for support of a complex object format using the ListMetadataFormats verb.
2. The harvester harvests the complex object metadata. Semantics of the OAI-PMH datestamp guarantee that new and modified resources are detected.
3. A parser at the end of the harvesting application analyzes each harvested complex object record:
   - The parser extracts the bitstreams that were delivered By-Value.
   - The parser extracts the unambiguous references to the network location of bitstreams delivered By-Reference.
4. A separate process, out-of-band from the OAI-PMH, collects the bitstreams delivered By-Reference from the extracted network locations.

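Steps 1 and 2 of this scenario can be sketched without any network access. The assumptions are as follows: the base URL is a placeholder, the metadataPrefix `didl` is a hypothetical value that a repository might advertise, the response below is abbreviated (a real ListMetadataFormats response also carries schema and namespace elements), and `request_url` and `supported_prefixes` are names invented here. The verbs and the `metadataPrefix` and `from` arguments are those of OAI-PMH 2.0.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://repository.example.org/oai"  # placeholder base URL

# Abbreviated, hypothetical ListMetadataFormats response.
FORMATS_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>didl</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
"""


def request_url(base_url, verb, **args):
    """Build an OAI-PMH request URL; pass from_ for the 'from' argument."""
    params = {"verb": verb}
    for key, value in args.items():
        if value is not None:
            # 'from' is a Python keyword, so callers write from_.
            params[key.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


def supported_prefixes(response_xml):
    """Return the metadataPrefix values advertised by the repository."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iter(OAI + "metadataPrefix")]


# Step 1: check for support of the complex object format.
formats_url = request_url(BASE_URL, "ListMetadataFormats")
prefixes = supported_prefixes(FORMATS_RESPONSE)

# Step 2: incremental harvest of everything changed since the last run.
if "didl" in prefixes:
    harvest_url = request_url(
        BASE_URL, "ListRecords", metadataPrefix="didl", from_="2005-10-01"
    )
    print(harvest_url)
```

Because the datestamp of a complex object record changes whenever any part of the resource changes, the `from` argument is enough to pick up new and modified resources.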
Complex Object Formats & OAI-PMH: existing implementations
- LANL Repository
  - Local storage of terabytes of scholarly assets
  - Assets stored as MPEG-21 DIDL documents
  - DIDL documents made accessible to downstream applications via the OAI-PMH
- Mirroring of the American Physical Society collection at LANL
  - Maps the APS document model to the MPEG-21 DIDL Transfer Profile
  - Exposes MPEG-21 DIDL documents through OAI-PMH infrastructure
  - Includes digests/signatures
- DSpace & Fedora plug-ins
  - Map the DSpace/Fedora document model to the MPEG-21 DIDL Transfer Profile
  - Expose MPEG-21 DIDL documents through OAI-PMH infrastructure
- mod_oai

Complex Object Formats & OAI-PMH: issues
- Which Complex Object Format(s)?
- How to profile Complex Object Format(s) for OAI-PMH harvesting?
- Large records
- Compound objects with multiple datastreams: what if only 1 datastream gets updated?
- Because the resource is represented as <metadata>, can rights pertaining to the resource be expressed according to the rights-for-metadata OAI-rights guideline?
- Tools:
  - Software library to write compliant complex objects
  - Integration of this library with repository systems (Fedora, DSpace, eprints.org, ...)
  - Software to harvest resources based on the OAI-PMH model

Readings
Herbert Van de Sompel, Michael Nelson, Carl Lagoze, Simeon Warner. Resource Harvesting within the OAI-PMH Framework. D-Lib Magazine. December 2004. http://dx.doi.org/10.1045/december2004-vandesompel