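Beginner pitfalls: have you hit these anti-scraping and encoding problems when scraping China University MOOC (中国大学MOOC) with the Requests library?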

Published: 2026/6/3 8:32:11

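Python scraping in practice: seven pitfalls to avoid when collecting data from China University MOOC

When you browse courses on China University MOOC in a browser, have you ever wanted to turn those learning resources into structured data with Python? As a beginner you may already have tried simple fetches with the Requests library, only to find that a real project hides a number of traps. This article walks through seven of the most common practical problems and gives a tested solution for each.

1. Handling dynamic Cookie expiry

Many beginners copy the request headers from the browser and use the Cookie value directly, then find that requests start failing a few minutes later. The reason is that the platform uses dynamic Cookies rather than a static, unchanging authentication identifier.

Typical mistake:

```python
headers = {
    "Cookie": "NTESSTUDYSI=9ddd9641afce4905aa429bf754db5b1b; ..."  # hard-coded Cookie
}
```

Solution: use a Session object to keep the session alive, so that Cookies are updated automatically.

```python
import requests

session = requests.Session()

# The first request obtains valid Cookies
init_url = "https://www.icourse163.org"
session.get(init_url)

# Later requests automatically carry the updated Cookies
api_url = "https://www.icourse163.org/web/j/mocSearchBean.searchCourseCardByChannelAndCategoryId.rpc"
response = session.post(api_url, data=your_data)  # your_data: the form payload (see section 2)
```

Tip: check the response status code regularly, and re-initialize the Session when you receive a 403 or 401.

2. Dealing with request parameters that change

The platform's API endpoints often require seemingly random parameters such as csrfKey. These values are usually hidden in the HTML page or in an initial API response.

Where to find them:

| Parameter type | Usual location | How to extract |
| --- | --- | --- |
| csrfKey | Home page HTML | Regular expression |
| timestamp | API response | JSON parsing |
| Encrypted token | Hidden input | XPath |

```python
import re

def get_csrf_key(session):
    home_page = session.get("https://www.icourse163.org").text
    # The exact pattern depends on the page markup; adjust it to what the page contains
    match = re.search(r'csrfKey:"([a-f0-9]{32})"', home_page)
    return match.group(1) if match else None
```

Building the dynamic parameters:

```python
import json

csrf_key = get_csrf_key(session)
data = {
    "mocCourseQueryVo": json.dumps({
        "categoryId": -1,
        "categoryChannelId": channel_id,  # channel_id is supplied by the caller
        "csrfKey": csrf_key,
        # other required parameters...
    })
}
```

3. Fixing garbled JSON text

Even with the correct response encoding set, JSON data from the API can still come out garbled, especially when it contains Chinese text.

Common cases:

- Unicode escape sequences (such as \u4e2d\u6587)
- Strings with mixed encodings
- Bytes decoded with the wrong encoding

Note that `json.loads` and `response.json()` already decode `\uXXXX` escapes, so the second step below is only needed when the escapes survive parsing as literal text (for example, double-escaped payloads).

A layered approach:

Baseline: set the response encoding.

```python
response.encoding = "utf-8"
```

Deeper fix: decode leftover Unicode escapes.

```python
import codecs

def decode_unicode(text):
    # Only safe for ASCII-only text that contains literal \uXXXX escapes;
    # strings that already hold non-ASCII characters would be corrupted
    return codecs.decode(text, "unicode_escape")
```

Last resort: a custom JSON decoder. The original version re-encoded every string, which also damaged text that was already correct; the version below only repairs strings that are really UTF-8 mis-decoded as Latin-1 and leaves everything else untouched.

```python
import json

class MojibakeDecoder(json.JSONDecoder):
    def decode(self, s):
        obj = super().decode(s)
        return self._decode_obj(obj)

    def _decode_obj(self, obj):
        if isinstance(obj, str):
            try:
                # Repair UTF-8 bytes that were decoded as Latin-1
                return obj.encode("latin-1").decode("utf-8")
            except (UnicodeEncodeError, UnicodeDecodeError):
                return obj  # already correct, keep as-is
        elif isinstance(obj, dict):
            return {k: self._decode_obj(v) for k, v in obj.items()}
        elif isinstance(obj, list):
            return [self._decode_obj(item) for item in obj]
        return obj

# Use the custom decoder
data = json.loads(response.text, cls=MojibakeDecoder)
```

4. Anti-scraping measures and request rate control

The platform's anti-scraping measures are not especially strict, but frequent, unthrottled requests can still get your IP temporarily restricted.

Request strategy:

Basic protection: random delays.

```python
import random
import time

def random_delay(min_seconds=1, max_seconds=3):
    time.sleep(random.uniform(min_seconds, max_seconds))
```

Advanced protection: adaptive rate limiting.

```python
import time
import requests

class AdaptiveRateLimiter:
    def __init__(self, base_delay=1.0):
        self.base_delay = base_delay
        self.error_count = 0

    def wait(self):
        # Each recent error adds 50% of the base delay
        delay = self.base_delay * (1 + self.error_count * 0.5)
        time.sleep(delay)

    def record_error(self):
        self.error_count = min(self.error_count + 1, 5)

    def record_success(self):
        self.error_count = max(self.error_count - 1, 0)

# Usage example
limiter = AdaptiveRateLimiter()
for page in range(1, 100):
    try:
        response = session.get(api_url)
        if response.status_code == 200:
            limiter.record_success()
        else:
            limiter.record_error()
    except requests.RequestException:
        limiter.record_error()
    finally:
        limiter.wait()
```

A better set of request headers:

```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Referer": "https://www.icourse163.org/",
    "X-Requested-With": "XMLHttpRequest",
}
```

5. Pagination and incremental collection

Course data on the platform is usually loaded page by page, so handling pagination correctly is essential for collecting the complete data set.

Basic pagination parameters:

```python
params = {
    "pageIndex": 1,
    "pageSize": 20,
    "orderBy": 0,
}
```

A proper termination condition:

```python
def get_all_pages(base_url, initial_params):
    page_index = initial_params["pageIndex"]
    total_pages = None
    while True:
        params = {**initial_params, "pageIndex": page_index}
        response = session.post(base_url, data=params)
        data = response.json()

        # The first response tells us the total page count
        # ("totlePageCount" is the key name used in the response)
        if total_pages is None:
            total_pages = data["result"]["query"]["totlePageCount"]

        # Hand the current page's items to the caller
        yield data["result"]["list"]

        # Stop after the last page
        page_index += 1
        if page_index > total_pages:
            break

        # Be polite between requests
        random_delay()
```

Resuming an interrupted crawl:

```python
import pickle

class CrawlerState:
    def __init__(self, state_file="crawler_state.pkl"):
        self.state_file = state_file
        self.state = self._load_state()

    def _load_state(self):
        try:
            with open(self.state_file, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return {"last_page": 0, "processed_ids": set()}

    def save_state(self):
        with open(self.state_file, "wb") as f:
            pickle.dump(self.state, f)

    def should_skip(self, item_id):
        return item_id in self.state["processed_ids"]

    def record_processed(self, item_id):
        self.state["processed_ids"].add(item_id)

    def update_page(self, page_num):
        self.state["last_page"] = page_num

# Usage example (get_page_data and process_item are your own functions)
state = CrawlerState()
# Pages are 1-based: resume from the page after the last completed one
for page in range(state.state["last_page"] + 1, total_pages + 1):
    data = get_page_data(page)
    for item in data:
        if not state.should_skip(item["id"]):
            process_item(item)
            state.record_processed(item["id"])
    state.update_page(page)
    state.save_state()
```

6. Optimizing data storage

Collected data needs sensible storage: writes should be efficient, and the result should be easy to analyze later.

CSV writing tips:

Batch writes to reduce I/O operations.

```python
import csv
from itertools import islice

def batch_write_csv(filename, data, batch_size=100):
    rows = iter(data)  # without iter(), islice would re-read the start of a list forever
    with open(filename, "a", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            writer.writerows(batch)
```

Thread-safe writes.

```python
import csv
from threading import Lock

write_lock = Lock()

def thread_safe_write(filename, row):
    with write_lock:
        with open(filename, "a", newline="", encoding="utf-8-sig") as f:
            writer = csv.writer(f)
            writer.writerow(row)
```

A preprocessing pipeline.

```python
import pandas as pd

def process_data(raw_data):
    # Cleaning: strip whitespace from string fields
    cleaned = {k: v.strip() if isinstance(v, str) else v
               for k, v in raw_data.items()}

    # Field conversion: millisecond timestamp to datetime
    if "startTime" in cleaned:
        cleaned["startTime"] = pd.to_datetime(cleaned["startTime"], unit="ms")

    # Missing values: fall back to a placeholder
    for field in ["teacherName", "schoolName"]:
        cleaned[field] = cleaned.get(field) or "Unknown"

    return cleaned
```

7. Error handling and logging

A robust scraper needs thorough error handling and detailed logs so that problems are easy to track down.

An error-handling framework:

```python
import json
import logging
import time
from functools import wraps

import requests

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.FileHandler("mooc_crawler.log"),
        logging.StreamHandler(),
    ],
)

def handle_errors(max_retries=3):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while retries < max_retries:
                try:
                    return func(*args, **kwargs)
                # Checked first: in recent Requests versions the JSON error raised by
                # response.json() is also a RequestException and would otherwise be retried
                except json.JSONDecodeError as e:
                    logging.error(f"JSON parsing failed: {e}")
                    raise
                except requests.exceptions.RequestException as e:
                    logging.warning(f"Request failed: {e}")
                    retries += 1
                    if retries >= max_retries:
                        logging.error(f"Reached the maximum of {max_retries} retries")
                        raise
                    time.sleep(2 ** retries)  # exponential backoff
                except Exception as e:
                    logging.error(f"Unexpected error: {e}", exc_info=True)
                    raise
        return wrapper
    return decorator

# Usage example
@handle_errors(max_retries=2)
def fetch_page(url, params):
    response = session.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()
```

Recording metrics:

```python
class CrawlerMetrics:
    def __init__(self):
        self.start_time = time.time()
        self.items_processed = 0
        self.pages_processed = 0
        self.errors_occurred = 0

    def log_page(self):
        self.pages_processed += 1

    def log_items(self, count):
        self.items_processed += count

    def log_error(self):
        self.errors_occurred += 1

    def report(self):
        duration = time.time() - self.start_time
        return {
            "elapsed": f"{duration:.2f} s",
            "pages processed": self.pages_processed,
            "items processed": self.items_processed,
            "errors": self.errors_occurred,
            "average speed": f"{self.items_processed / max(1, duration):.2f} items/s",
        }

# Usage example (url and params are whatever request you are making)
metrics = CrawlerMetrics()
try:
    data = fetch_page(url, params)
    metrics.log_page()
    metrics.log_items(len(data))
except Exception:
    metrics.log_error()
finally:
    logging.info(metrics.report())
```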
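Section 1 recommends re-initializing the Session when a 403 or 401 comes back, but does not show how. The following is a minimal sketch of one way to do it, not the platform's documented flow. The helper names `new_session` and `post_with_refresh`, the `max_refreshes` parameter, and the 10-second timeout are all hypothetical choices made for illustration; the only assumption about the session object is that it offers `get`, `post`, and `close`, as `requests.Session` does.

```python
INIT_URL = "https://www.icourse163.org"
AUTH_FAILURE_CODES = {401, 403}


def new_session(session_factory, init_url=INIT_URL):
    """Create a session and visit the home page so it picks up fresh Cookies."""
    session = session_factory()
    session.get(init_url, timeout=10)
    return session


def post_with_refresh(session_factory, api_url, data, session=None, max_refreshes=1):
    """POST to api_url, rebuilding the session when the server answers 401 or 403.

    Returns (response, session) so the caller can keep using the session
    that produced the final response.
    """
    if session is None:
        session = new_session(session_factory)

    response = session.post(api_url, data=data, timeout=10)
    refreshes = 0
    while response.status_code in AUTH_FAILURE_CODES and refreshes < max_refreshes:
        # The old Cookies are presumably stale: discard them and start over
        session.close()
        session = new_session(session_factory)
        response = session.post(api_url, data=data, timeout=10)
        refreshes += 1

    return response, session
```

With Requests, pass `requests.Session` as `session_factory`, for example `response, session = post_with_refresh(requests.Session, api_url, data)`. Capping the number of refreshes keeps a permanently rejected request from looping forever; if the final response is still a 401 or 403, the caller decides what to do next.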
